Authors:
(1) Wanru Zhao, University of Cambridge and Shanghai AI Laboratory (equal contribution);
(2) Yaxin Du, Shanghai Jiao Tong University (equal contribution);
(3) Nicholas D. Lane, University of Cambridge and Flower Labs;
(4) Siheng Chen, Shanghai AI Laboratory and Shanghai Jiao Tong University;
(5) Yanfeng Wang, Shanghai AI Laboratory and Shanghai Jiao Tong University.
Table of Links
- Abstract and Introduction
- Motivation and Setup: How low-quality data affects the performance of Collaborative Training
- Proposed Workflow for Data Quality Control
- Experiments
- Conclusion and Future Work, and References
- A. Related Work
- B. Heterogeneity Settings
- C. Experimental Details
- D. Ablation study of Unified Scoring with Anchor Data
- E. Examples for Low- and High-Quality Data
ABSTRACT
In the current landscape of foundation model training, there is a significant reliance on public domain data, which is nearing exhaustion according to recent research. To further scale up, it is crucial to incorporate collaboration among multiple specialized and high-quality private domain data sources. However, training models locally without sharing private data presents numerous obstacles to data quality control. To tackle this issue, we propose a data quality control pipeline for federated fine-tuning of foundation models. This pipeline computes scores reflecting the quality of training data and determines a global threshold as a unified standard, aiming for improved global performance. Our experiments show that the proposed quality control pipeline improves the effectiveness and reliability of model training, leading to better performance.
1 INTRODUCTION
As businesses, products, and services spring up around large language models (LLMs, which we define as having more than one billion parameters), recent work has shown that one can attain better performance by training on more high-quality data (Kaplan et al., 2020; Hoffmann et al., 2022; Zhou et al., 2023). However, Villalobos et al. (2022) estimate that even high-quality English language data will be exhausted by the year 2024. What should we do when we run out of public data?
One solution is to exploit vast private data from various institutions, including enterprises and user devices. To unlock the potential of private data, it is essential to address two significant issues. First, it is imperative to preserve the privacy of all participants to protect their interests (Mondschein & Monda, 2019). To address this, one could adopt federated learning (McMahan et al., 2017), a collaborative machine learning framework that trains a model across multiple clients with their local private data, without exchanging any raw data. Second, enhancing model training requires controlling the quality of the data from each participant. Because private data cannot be accessed directly, quality control for these data poses significant challenges, which are the focus of this work.
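To make the federated setting concrete, the following is a minimal sketch of the weighted parameter averaging step (FedAvg) introduced by McMahan et al. (2017). It is illustrative only: the text above does not state which aggregation rule is used, and the function name `fedavg` and the flat-list representation of model parameters are assumptions made for brevity.

```python
from typing import List, Sequence


def fedavg(client_params: List[Sequence[float]], client_sizes: List[int]) -> List[float]:
    """Average client parameter vectors, weighted by local dataset size.

    Only parameters and sample counts reach the server; raw data stays local.
    """
    if len(client_params) != len(client_sizes) or not client_params:
        raise ValueError("need one size per client and at least one client")
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[i] * n for params, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]


# Two hypothetical clients holding 1 and 3 local samples respectively.
global_params = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

In this toy run the second client contributes three times the weight of the first, so the aggregate lies closer to its parameters.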
Previously, data quality control relied heavily on manual selection processes (Touvron et al., 2023b;a). This approach, while commonly used, presented significant challenges due to the high volume of data, leading to substantial costs. Recent advancements have seen the introduction of automated low-quality data filters (Computer, 2023), such as perplexity filters (Muennighoff et al., 2023) and deduplication filters (Lee et al., 2021). These automated methods are designed to reduce data volume and enhance training efficiency in centralized settings, while their effectiveness in data quality control within collaborative environments remains to be explored.
In this paper, we propose an automated data quality control pipeline for federated fine-tuning of LLMs, showcasing notable performance improvements in mixed-quality data environments. Specifically, we incorporate data valuation algorithms to serve as scoring functions, enabling fine-grained evaluation of individual training sample quality. Furthermore, we establish a unified data quality standard using a minimal set of anchor data, addressing the challenge of heterogeneity in data quality across federated institutions. Adopting this approach, we effectively eliminate low-quality data, thereby enhancing model performance while preserving privacy. Leveraging the collaboration of multiple private domain data sources opens up new possibilities in the face of real-world public data exhaustion.
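The three steps described above (score each sample locally, derive one global threshold from anchor data, filter on each client) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_score` is a hypothetical stand-in for a data valuation algorithm, higher scores are assumed to mean higher quality, and taking the mean anchor score as the threshold is an assumed rule, since the introduction does not specify how the threshold is derived.

```python
from statistics import mean
from typing import Callable, Dict, List

Sample = str
ScoreFn = Callable[[Sample], float]


def toy_score(sample: Sample) -> float:
    """Hypothetical quality score: fraction of distinct words (higher = better)."""
    words = sample.split()
    return len(set(words)) / len(words) if words else 0.0


def global_threshold(anchor_data: List[Sample], score: ScoreFn) -> float:
    """Assumed rule: the unified threshold is the mean score of the anchor set."""
    return mean(score(s) for s in anchor_data)


def filter_client(data: List[Sample], score: ScoreFn, threshold: float) -> List[Sample]:
    """Runs on the client; only the scalar threshold is shared, never the data."""
    return [s for s in data if score(s) >= threshold]


# Small anchor set defining the shared quality standard (made-up examples).
anchor = ["explain gradient descent briefly", "the cat sat on the mat"]

# Made-up private datasets of mixed quality held by two clients.
clients: Dict[str, List[Sample]] = {
    "client_a": ["summarise the attached report", "spam spam spam spam"],
    "client_b": ["buy now buy now buy now", "translate this sentence into French"],
}

tau = global_threshold(anchor, toy_score)
kept = {name: filter_client(data, toy_score, tau) for name, data in clients.items()}
```

Because every client compares its local scores against the same threshold, the quality standard is uniform across clients even when their data distributions differ.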
This paper is available on arXiv under the CC BY 4.0 DEED license.