
Dolma 3

Dolma 3 consists of three datasets constructed for the OLMo 3 family of models: Dolma 3 Mix, a diverse 5.9T-token pre-training dataset; Dolma 3 Dolmino Mix, a 100B-token mid-training dataset targeting performance improvements in math, code, QA, instruction following, and thinking; and Dolma 3 Longmino Mix, 50B tokens of long-context data. This repository contains the descriptions and code necessary to reconstruct the Dolma 3 datasets.

S3 paths in the configs refer to internal Ai2 buckets and are not publicly accessible. Where a public copy exists, datasets/tools/s3_to_hf.py can be used to get the closest final dataset on Hugging Face.

For further details, please refer to the OLMo 3 paper and the OLMo 3 website.


