Dolma 3
Dolma 3 consists of three datasets constructed for the OLMo 3 family of models: Dolma 3 Mix, a diverse 5.9T-token pre-training dataset; Dolma 3 Dolmino Mix, a 100B-token mid-training dataset targeting performance improvements in math, code, QA, instruction, and thinking; and Dolma 3 Longmino Mix, 50B tokens of long-context data. This repository contains the descriptions and code necessary for reconstructing the Dolma 3 datasets.
The S3 paths in the configs refer to internal Ai2 buckets, but datasets/tools/s3_to_hf.py can be used to get the closest final dataset on Hugging Face, if available.
For further details, please refer to the OLMo 3 paper and the OLMo 3 website.