Authors:
(1) Wanru Zhao, University of Cambridge, Shanghai AI Laboratory with Equal contribution;
(2) Yaxin Du, Shanghai Jiao Tong University with Equal contribution;
(3) Nicholas D. Lane, University of Cambridge and Flower Labs;
(4) Siheng Chen, Shanghai AI Laboratory and Shanghai Jiao Tong University;
(5) Yanfeng Wang, Shanghai AI Laboratory and Shanghai Jiao Tong University.
Table of Links
- Abstract and Introduction
- Motivation and Setup: How low-quality data affects the performance of Collaborative Training
- Proposed Workflow for Data Quality Control
- Experiments
- Conclusion and Future Work, and References
- A. Related Work
- B. Heterogeneity Settings
- C. Experimental Details
- D. Ablation study of Unified Scoring with Anchor Data
- E. Examples for low- and high-quality Data
A RELATED WORK
Large Language Models. The remarkable achievements of large language models (LLMs) have recently reshaped the field of natural language processing. OpenAI’s GPT-4 (OpenAI et al., 2023), for instance, has demonstrated exceptional capabilities in various generative tasks, including question answering. LLaMA (Touvron et al., 2023a;b), an open-source large language model with 7 to 65 billion parameters, offers an alternative platform for research. These developments have sparked interest in adapting LLMs for medical applications. Yet most medical models are obtained by fine-tuning LLaMA on a small medical corpus, resulting in a deficiency of comprehensive medical knowledge integration. There have been recent efforts to train LLMs for medical domains, for example BioBERT (Lee et al., 2020), BioMedGPT (Luo et al., 2023), and PMC-LLaMA (Wu et al., 2023). These domain-specific LLMs have been exclusively trained on medical corpora. However, there has been a lack of collaborative or federated training work in medical LLMs.
Data Valuation and Attribution. The seminal work on data valuation and attribution by Koh & Liang (2017) proposes attribution via approximate influence functions. It identifies the training samples most responsible for a given prediction by estimating the effect of removing or slightly modifying a single training sample. In a related approach, TracIn (Pruthi et al., 2020) estimates the influence of each training sample on a test example by measuring the change in loss on that example caused by the gradient updates of the mini-batches that contain the training sample. Another related line of work uses Shapley values (Lundberg & Lee, 2017) to ascribe value to data, but exact Shapley values generally require time exponential in the number of samples to compute.
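To make the exponential cost of Shapley-based valuation concrete, the following is a minimal sketch of exact data Shapley values on a toy problem. It is an illustration under stated assumptions, not the method of any cited work: the function `exact_shapley`, the utility `toy_utility`, and the choice of which point is noisy are all hypothetical.

```python
from itertools import combinations
from math import factorial


def exact_shapley(n_points, utility):
    """Exact Shapley value of each data point under `utility`.

    `utility` maps a frozenset of point indices to a float. For every point,
    the loops visit every subset of the remaining points, so the cost grows
    as O(2^n) utility evaluations.
    """
    values = [0.0] * n_points
    for i in range(n_points):
        others = [j for j in range(n_points) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight shared by all coalitions of this size.
            weight = (
                factorial(size) * factorial(n_points - size - 1)
                / factorial(n_points)
            )
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of point i to this coalition.
                marginal = utility(coalition | {i}) - utility(coalition)
                values[i] += weight * marginal
    return values


# Hypothetical toy utility (an assumption for illustration): each clean
# point adds 1.0 to the score and each noisy point subtracts 0.5.
NOISY = {2}


def toy_utility(coalition):
    return sum(-0.5 if j in NOISY else 1.0 for j in coalition)


shapley_values = exact_shapley(4, toy_utility)
print(shapley_values)
```

Because this toy utility is additive, each point's Shapley value equals its own contribution, so the noisy point receives a negative value. With a real utility, such as validation accuracy after retraining on each coalition, every evaluation is a full training run, which is why exact computation quickly becomes impractical.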
Federated Learning. Federated learning has garnered significant attention as a distributed machine learning paradigm. It changes the traditional model training process by sharing model parameters instead of raw data. With Federated Averaging (FedAvg) (McMahan et al., 2017), participating clients train models locally on their own private datasets, and the updated model parameters are aggregated on the server. This keeps the underlying data private while letting all participants benefit from the knowledge gained during training (Konečný et al., 2016). Recent work (Zhao et al., 2024) proposes a federated parameter-efficient fine-tuning paradigm for large language models, demonstrating advantages not only in data and parameter efficiency but also in generalization and stability. Despite abundant research on federated fine-tuning of LLMs, data quality control in a federated manner remains under-explored.
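The local-training and server-aggregation loop described above can be sketched in a few lines. This is a simplified illustration assuming a one-dimensional linear model and aggregation weighted by client sample count; the helper names (`local_update`, `fedavg_round`), the learning rate, and the client data are hypothetical and are not taken from the paper's experiments.

```python
def local_update(weights, data, lr=0.1, epochs=5):
    """Run a few epochs of SGD on one client's private (x, y) pairs.

    The model is y_hat = w * x + b with squared-error loss.
    """
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return [w, b]


def fedavg_round(global_weights, client_datasets):
    """One round: each client trains locally, then the server averages
    the returned parameters, weighted by each client's sample count."""
    total = sum(len(d) for d in client_datasets)
    new_weights = [0.0] * len(global_weights)
    for data in client_datasets:
        # Only parameters leave the client; the raw data stays local.
        local = local_update(list(global_weights), data)
        for k, value in enumerate(local):
            new_weights[k] += (len(data) / total) * value
    return new_weights


# Hypothetical clients, each holding private samples of y = 2x + 1.
clients = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (-1.0, -1.0), (0.5, 2.0)],
]

weights = [0.0, 0.0]
for _ in range(100):
    weights = fedavg_round(weights, clients)
print(weights)
```

In this noiseless toy setting the global parameters approach the shared underlying line (slope 2, intercept 1). If one client held low-quality samples, its update would pull the average away from that line, which is the kind of effect data quality control in a federated setting aims to limit.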
This paper is available on arXiv under the CC BY 4.0 DEED license.