Language models (LMs) have become fundamental to natural language processing (NLP), powering tasks such as text generation, translation, and sentiment analysis. These models demand vast amounts of training data to perform accurately and efficiently, and the quality and curation of that data are critical to their performance. This field focuses on refining data collection and preparation methods to enhance model effectiveness.
A significant challenge in developing effective language models is improving training datasets. High-quality datasets are essential for training models that generalize well across various tasks, but creating such datasets is complex. It involves filtering out irrelevant or harmful content, removing duplicates, and selecting the most useful data sources.
Existing methods for dataset curation typically rely on heuristic-based filtering, deduplication, and sourcing data from extensive web crawls. While these methods have had some success, they lack standardized benchmarks, leading to inconsistency in how language model performance is evaluated. This variability makes it difficult to determine which data curation strategies are most effective, hindering progress in the field.
Researchers from Apple, the University of Washington, and several other institutions have introduced DataComp for Language Models (DCLM) to address these issues. They have recently open-sourced the DCLM models and datasets on the Hugging Face platform. The release comprises DCLM-7B, DCLM-1B, dclm-7b-it, DCLM-7B-8k, dclm-baseline-1.0, and dclm-baseline-1.0-parquet. This testbed allows controlled experiments with large datasets to improve language models. The DCLM framework includes a corpus of 240 trillion tokens drawn from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. This setup provides a standardized approach to dataset curation, enabling consistent and comparable experiments.
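Since the checkpoints are published on the Hugging Face platform, they can in principle be loaded with the standard transformers workflow. The sketch below is illustrative only: the repository id "apple/DCLM-7B" is an assumption, and depending on the release the model may additionally require the OpenLM package or trust_remote_code, so check the model card before running.

```python
# Minimal sketch of loading the released DCLM-7B checkpoint with Hugging Face
# transformers. The repository id "apple/DCLM-7B" is an assumption; the release
# may also require `pip install open_lm` or trust_remote_code=True per its model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "apple/DCLM-7B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Data curation matters because"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```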
DCLM offers a structured workflow for researchers. Participants can choose scales ranging from 412M to 7B parameters and experiment with data curation strategies such as deduplication, filtering, and data mixing. Researchers can train models on curated datasets using a standardized training recipe and specific hyperparameters. The performance of these models is then evaluated on a suite of downstream tasks, providing a clear measure of dataset quality. This systematic approach helps identify the most effective data curation strategies.
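To make the workflow concrete, here is a minimal sketch of the curate-train-evaluate loop a participant follows. The helper names and their trivial bodies are illustrative placeholders, not the actual DCLM tooling; only the overall shape (deduplicate and filter, train with a fixed recipe at a chosen scale, then score on the downstream suite) is taken from the description above.

```python
# Hypothetical sketch of the DCLM participant workflow; placeholder implementations only.

def deduplicate(docs: list[str]) -> list[str]:
    # Placeholder: exact-match dedup; real pipelines use more sophisticated methods.
    return list(dict.fromkeys(docs))

def quality_filter(docs: list[str], min_len: int = 200) -> list[str]:
    # Placeholder length heuristic; DCLM found model-based filtering far more effective.
    return [d for d in docs if len(d) >= min_len]

def run_experiment(raw_docs: list[str], scale: str = "1B") -> dict:
    dataset = quality_filter(deduplicate(raw_docs))
    # In DCLM, training uses a standardized recipe and fixed hyperparameters for
    # each scale (412M to 7B parameters), and the resulting model is scored on
    # the suite of 53 downstream evaluations.
    model = f"model-{scale}-trained-on-{len(dataset)}-docs"  # stand-in for training
    return {"model": model, "eval_score": None}              # stand-in for evaluation
```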
The introduction of DCLM has led to notable improvements in language model training. For instance, a baseline dataset created using DCLM enabled the training of a 7B parameter language model from scratch. Trained on 2.6 trillion tokens, this model achieved 64% 5-shot accuracy on the MMLU benchmark. This performance represents a 6.6 percentage point improvement over the previous state-of-the-art open-data language model, MAP-Neo, while using 40% less compute. The DCLM baseline model also performed comparably to Mistral-7B-v0.3 and Llama 3 8B, which required significantly more computational resources.
The DCLM framework’s effectiveness is further demonstrated by its scalability. Researchers conducted extensive experiments at different scales, from 400M to over 7B parameters, using DCLM-Pool, a corpus of 240 trillion tokens derived from Common Crawl. These experiments highlighted the critical role of model-based filtering in assembling high-quality training sets. The DCLM baseline dataset, created through this rigorous process, consistently outperformed other open-source datasets like RefinedWeb and RedPajama in various evaluations.
The research team also explored the impact of various data curation techniques. They compared text extraction methods, such as resiliparse and trafilatura, and found that these approaches significantly improved downstream performance compared to Common Crawl’s pre-extracted text. The team investigated several model-based quality filtering strategies, ultimately determining that the fastText OH-2.5 + ELI5 classifier was the most effective, providing a substantial lift in accuracy.
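For intuition on what fastText-based quality filtering looks like in practice, the sketch below trains a small supervised classifier and keeps only documents scored as high quality, in the spirit of the OH-2.5 + ELI5 classifier mentioned above. The file path, label names, and threshold are assumptions for illustration; consult the DCLM release for the actual classifier and settings.

```python
# Minimal sketch of fastText-based quality filtering. The training file, labels,
# and threshold are illustrative assumptions, not the exact DCLM configuration.
import fasttext

# train.txt holds one example per line in fastText supervised format, e.g.:
#   __label__hq  <text drawn from high-quality sources such as instruction data>
#   __label__lq  <text sampled from raw web pages>
clf = fasttext.train_supervised(input="train.txt")

def keep_document(text: str, threshold: float = 0.5) -> bool:
    """Keep a web document if the classifier scores it as high quality."""
    labels, probs = clf.predict(text.replace("\n", " "))  # fastText expects single-line input
    score = probs[0] if labels[0] == "__label__hq" else 1.0 - probs[0]
    return score >= threshold

docs = ["Example web page text ...", "low quality spam ..."]
filtered = [d for d in docs if keep_document(d)]
```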
In conclusion, DCLM provides a standardized and systematic approach to dataset curation that enables researchers to conduct controlled experiments and identify the most effective strategies for improving language models. The framework sets a new benchmark for dataset quality and demonstrates that significant performance improvements are achievable with reduced computational resources.