Large language model (LLM) unlearning is becoming vital: regulations such as the GDPR's right to be forgotten, together with the need to remove copyrighted or sensitive content, require removing specific data from trained models, and retraining from scratch is usually impractical. To evaluate unlearning algorithms rigorously, researchers developed LUME (LLM Unlearning with Multitask Evaluations).
LUME is a comprehensive new benchmark that addresses limitations of prior evaluations by spanning three distinct tasks: synthetic creative novels, synthetic biographies containing sensitive personally identifiable information (PII), and real public biographies. This multi-task design, and the PII task in particular, gives broad coverage for assessing how algorithms handle both memorized content and privacy-critical data. Effectiveness is measured with four metrics: regurgitation rate (how much of the original text the model still reproduces), knowledge test accuracy, membership inference attack (MIA) success, and overall model utility on MMLU.
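To make two of these metrics concrete, here is a minimal Python sketch. The function names, the use of ROUGE-L-style recall for regurgitation, and the loss-threshold MIA formulation are illustrative assumptions for this post, not LUME's reference implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def regurgitation_rate(completions, references):
    """Mean ROUGE-L-style recall of model completions against reference text.
    Lower is better on the forget set; higher is better on the retain set."""
    scores = []
    for comp, ref in zip(completions, references):
        c, r = comp.split(), ref.split()
        scores.append(lcs_length(c, r) / len(r) if r else 0.0)
    return sum(scores) / len(scores)

def mia_auc(member_losses, nonmember_losses):
    """AUC of a loss-threshold membership inference attack: the probability
    that a random forget-set (member) example has lower loss than a random
    holdout (non-member) example. A value near 0.5 means the attack cannot
    distinguish members, i.e., low privacy leakage."""
    pairs = [(m < n) + 0.5 * (m == n)
             for m in member_losses for n in nonmember_losses]
    return sum(pairs) / len(pairs)

# Toy usage with made-up values:
print(regurgitation_rate(["the cat sat"], ["the cat sat on the mat"]))  # 0.5
print(mia_auc([0.8, 1.1], [1.5, 2.0]))  # 1.0 -> members clearly identifiable
```

In practice, the completions would come from prompting the unlearned model with prefixes of forget- and retain-set documents, and the losses from scoring held-in versus held-out examples under that same model.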
Experiments on LUME show that current unlearning algorithms struggle to remove forget-set information thoroughly without substantially degrading performance on the retain set and overall model utility; several methods also remain vulnerable to membership inference, indicating privacy leakage. The benchmark, developed by Amazon AGI, UCLA, UIUC, EPFL, and the University of Minnesota, is publicly available and includes fine-tuned 1B- and 7B-parameter models, with larger models planned.
Learn more about LUME at: https://assets.amazon.science/47/cc/602c0d16409aa9c668467388b0a9/lume-llm-unlearning-with-multitask-evaluations.pdf