Authors:
(1) Wanru Zhao, University of Cambridge and Shanghai AI Laboratory (equal contribution);
(2) Yaxin Du, Shanghai Jiao Tong University (equal contribution);
(3) Nicholas D. Lane, University of Cambridge and Flower Labs;
(4) Siheng Chen, Shanghai AI Laboratory and Shanghai Jiao Tong University;
(5) Yanfeng Wang, Shanghai AI Laboratory and Shanghai Jiao Tong University.
Table of Links
- Abstract and Introduction
- Motivation and Setup: How low-quality data affects the performance of Collaborative Training
- Proposed Workflow for Data Quality Control
- Experiments
- Conclusion and Future Work, and References
- A. Related Work
- B. Heterogeneity Settings
- C. Experimental Details
- D. Ablation study of Unified Scoring with Anchor Data
- E. Examples for low- and high-quality Data
2 MOTIVATION AND SETUP: HOW LOW-QUALITY DATA AFFECTS THE PERFORMANCE OF COLLABORATIVE TRAINING
In this paper, we identify two challenges specific to federated fine-tuning of LLMs with respect to data quality.
1) Real low-quality data. We highlight three prevalent patterns of low-quality data observed in real-world corpora: cut, deletion, and exchange. Cut covers cases where content is truncated because of word-limit constraints; deletion covers cases where critical terminology is missing from the corpus; and exchange covers examples that contain entirely incorrect information. Specific examples of each category are given in Appendix E.
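To make the three patterns concrete, the following is a minimal sketch of how each could be simulated on text samples. It is an illustration under our own assumptions, not the procedure used in the paper: the function names `cut`, `delete_terms`, and `exchange`, the whole-word matching rule, and the rotation used to mismatch answers are all hypothetical choices.

```python
import random
import re


def cut(text: str, max_words: int) -> str:
    """'Cut': truncate the text as if a word limit had been hit."""
    return " ".join(text.split()[:max_words])


def delete_terms(text: str, terms: list[str]) -> str:
    """'Deletion': drop the given key terms from the text."""
    for term in terms:
        # Whole-word, case-insensitive match (an assumption of this sketch).
        text = re.sub(r"\b" + re.escape(term) + r"\b", "", text, flags=re.IGNORECASE)
    # Collapse the extra whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


def exchange(answers: list[str], rng: random.Random) -> list[str]:
    """'Exchange': rotate the answers so no question keeps its own answer."""
    if len(answers) < 2:
        return list(answers)
    shift = rng.randrange(1, len(answers))
    return answers[shift:] + answers[:shift]
```

Rotating by a non-zero offset guarantees that every position receives an answer that originally belonged to a different question, which is one simple way to produce "entirely incorrect" pairings.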
2) Quality heterogeneity. Quality heterogeneity refers to variation in data quality across clients in federated learning. Because federated learning often involves a large number of clients, each with different capabilities in data synthesis, it is impractical to assume uniform data quality among all participants. Some clients may therefore hold a higher proportion of low-quality data than others, and no single standard for sample quality applies across participants. We provide two Non-IID settings in Appendix B.
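As a rough illustration of quality heterogeneity, the sketch below gives each client its own low-quality ratio and flags that fraction of its samples. The client names, the ratios, and the helper `assign_low_quality` are hypothetical and are not the Non-IID settings of Appendix B.

```python
import random


def assign_low_quality(num_samples: int, ratio: float, rng: random.Random) -> list[bool]:
    """Return one flag per sample; True marks the sample as low-quality."""
    num_low = round(num_samples * ratio)
    flags = [True] * num_low + [False] * (num_samples - num_low)
    rng.shuffle(flags)  # spread the low-quality samples across the client's data
    return flags


# Hypothetical per-client low-quality ratios (not taken from the paper).
client_ratios = {"client_0": 0.0, "client_1": 0.2, "client_2": 0.5, "client_3": 0.8}

rng = random.Random(0)
client_flags = {
    name: assign_low_quality(100, ratio, rng)
    for name, ratio in client_ratios.items()
}

for name, flags in client_flags.items():
    print(name, "low-quality samples:", sum(flags))
```

In a setup like this, a server that treats all clients' updates equally would be aggregating over data of very different quality, which is the situation the preliminary experiments examine.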
In our preliminary experiments, we take both factors into account and vary the proportion of low-quality data in a federated training set composed of the PMCLLama (Wu et al., 2023) and Medalpacaflashcards (Han et al., 2023) datasets; the results are shown in Figure 1. Higher scores indicate better performance (see Appendix C.2 for details on the metrics). The key observation is that the quality of the training data has a significant effect on the performance of collaborative training: low-quality data consistently degrades performance on all metrics.
This paper is available on arXiv under the CC BY 4.0 DEED license.