
NVIDIA NeMo Curator

GPU-accelerated data curation for training better AI models, faster. Scale from laptop to multi-node clusters with modular pipelines for text, images, video, and audio.

Part of the NVIDIA NeMo software suite for managing the AI agent lifecycle.

What You Can Do

| Modality | Key Capabilities | Get Started |
| --- | --- | --- |
| Text | Deduplication • Classification • Quality Filtering • Language Detection | Text Guide |
| Image | Aesthetic Filtering • NSFW Detection • Embedding Generation • Deduplication | Image Guide |
| Video | Scene Detection • Clip Extraction • Motion Filtering • Deduplication | Video Guide |
| Audio | ASR Transcription • Quality Assessment • WER Filtering | Audio Guide |

Quick Start

    # Install for your modality
    uv pip install "nemo-curator[text_cuda12]"

    # Run the quickstart example
    python tutorials/quickstart.py

Full setup: Installation Guide • Docker • Tutorials


Features by Modality

Text Curation

Process and curate high-quality text datasets for large language model (LLM) training with multilingual support.

| Category | Features | Documentation |
| --- | --- | --- |
| Data Sources | Common Crawl • Wikipedia • ArXiv • Custom datasets | Load Data |
| Quality Filtering | 30+ heuristic filters • fastText classification • GPU-accelerated classifiers for domain, quality, safety, and content type | Quality Assessment |
| Deduplication | Exact • Fuzzy (MinHash LSH) • Semantic (GPU-accelerated) | Deduplication |
| Processing | Text cleaning • Language identification | Content Processing |
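
To give a feel for what fuzzy (MinHash LSH) deduplication does, here is a minimal CPU-only sketch in plain Python. It is not NeMo Curator's implementation or API: the function names, the 3-word shingle size, the 64 hash functions, and the 16 bands are all illustrative assumptions.

```python
import hashlib
from itertools import combinations


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles for a document (k is an assumed default)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: set[str], num_hashes: int = 64) -> list[int]:
    """Keep the minimum hash value per seeded hash function."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures: dict[str, list[int]], bands: int = 16) -> set[tuple[str, str]]:
    """Documents that agree on every row of at least one band become candidate pairs."""
    if not signatures:
        return set()
    rows = len(next(iter(signatures.values()))) // bands
    buckets: dict[tuple, list[str]] = {}
    for doc_id, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets.setdefault(key, []).append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical toy corpus: doc1 and doc2 differ by a single word.
docs = {
    "doc1": "scale from laptop to multi-node clusters with modular pipelines for text and images",
    "doc2": "scale from laptop to multi-node clusters with modular pipelines for text and video",
    "doc3": "word error rate measures how far a transcript is from its reference",
}
signatures = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
for a, b in sorted(lsh_candidate_pairs(signatures)):
    print(a, b, round(estimated_jaccard(signatures[a], signatures[b]), 2))
```

Banding trades recall for cost: more bands with fewer rows each surface more candidate pairs, at the price of more pairs to verify afterwards.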

Image Curation

Curate large-scale image datasets for vision language models (VLMs) and generative AI training.

| Category | Features | Documentation |
| --- | --- | --- |
| Data Loading | WebDataset format • Large-scale image-text pairs | Load Data |
| Embeddings | CLIP embeddings for semantic analysis | Embeddings |
| Filtering | Aesthetic quality scoring • NSFW detection | Filters |

Video Curation

Process large-scale video corpora with distributed, GPU-accelerated pipelines for world foundation models (WFMs).

| Category | Features | Documentation |
| --- | --- | --- |
| Data Loading | Local paths • S3-compatible storage • HTTP(S) URLs | Load Data |
| Clipping | Fixed-stride splitting • Scene-change detection (TransNetV2) | Clipping |
| Processing | GPU H.264 encoding • Frame extraction • Motion filtering • Aesthetic filtering | Processing |
| Embeddings | Cosmos-Embed1 for clip-level embeddings | Embeddings |
| Deduplication | K-means clustering • Pairwise similarity for near-duplicates | Deduplication |
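
The deduplication row above pairs clustering with pairwise similarity. The sketch below shows only the second half of that idea in plain Python: given clip embeddings and cluster assignments (assumed to come from an earlier k-means step), it compares clips only within their own cluster. The function names, the cosine metric, and the 0.95 threshold are illustrative assumptions, not NeMo Curator's API or defaults.

```python
import math
from collections import defaultdict


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def near_duplicates(
    embeddings: dict[str, list[float]],
    cluster_of: dict[str, int],
    threshold: float = 0.95,
) -> set[str]:
    """Return clip IDs to drop, comparing clips only within the same cluster."""
    by_cluster = defaultdict(list)
    for clip_id in embeddings:
        by_cluster[cluster_of[clip_id]].append(clip_id)

    to_drop = set()
    for members in by_cluster.values():
        kept = []  # clips retained so far in this cluster
        for clip_id in members:
            if any(cosine(embeddings[clip_id], embeddings[k]) >= threshold for k in kept):
                to_drop.add(clip_id)
            else:
                kept.append(clip_id)
    return to_drop


# Hypothetical 2-D embeddings; real clip embeddings have far more dimensions.
embeddings = {"clip_a": [1.0, 0.0], "clip_b": [0.99, 0.05], "clip_c": [0.0, 1.0]}
cluster_of = {"clip_a": 0, "clip_b": 0, "clip_c": 1}
print(near_duplicates(embeddings, cluster_of))  # {'clip_b'}
```

Restricting comparisons to clips in the same cluster keeps the pairwise step tractable, since its cost grows quadratically with the size of each cluster rather than with the size of the whole corpus.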

Audio Curation

Prepare high-quality speech datasets for automatic speech recognition (ASR) and multimodal AI training.

| Category | Features | Documentation |
| --- | --- | --- |
| Data Loading | Local files • Custom manifests • Public datasets (FLEURS) | Load Data |
| ASR Processing | NeMo Framework pretrained models • Automatic transcription | ASR Inference |
| Quality Assessment | Word Error Rate (WER) calculation • Duration analysis • Quality-based filtering | Quality Assessment |
| Integration | Text curation workflow integration for multimodal pipelines | Text Integration |
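
As a rough illustration of WER-based filtering, the sketch below computes word error rate as word-level edit distance divided by reference length, then keeps only the entries under a threshold. It is a standalone sketch, not NeMo Curator's code: the `text` and `pred_text` field names, the sample entries, and the 0.25 threshold are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def filter_by_wer(entries: list[dict], max_wer: float = 0.25) -> list[dict]:
    """Keep entries whose ASR output is close enough to the reference."""
    return [e for e in entries if word_error_rate(e["text"], e["pred_text"]) <= max_wer]


# Hypothetical manifest entries.
manifest = [
    {"audio": "a.wav", "text": "the cat sat on the mat", "pred_text": "the cat sat on a mat"},
    {"audio": "b.wav", "text": "the cat sat on the mat", "pred_text": "cat on mat"},
]
kept = filter_by_wer(manifest)
print([e["audio"] for e in kept])  # ['a.wav']
```

The first entry has one substitution in six words (WER of about 0.17) and is kept; the second has three deletions (WER 0.5) and is dropped.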

Why NeMo Curator?

Performance at Scale

NeMo Curator leverages NVIDIA RAPIDS™ libraries such as cuDF, cuML, and cuGraph along with Ray to scale workloads across multi-node, multi-GPU environments.

Proven Results:

  • 16× faster fuzzy deduplication on 8 TB RedPajama v2 (1.78 trillion tokens)
  • 40% lower total cost of ownership (TCO) compared to CPU-based alternatives
  • Near-linear scaling from one to four H100 80 GB nodes (2.05 hrs → 0.50 hrs)
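
The scaling figure above can be checked with simple arithmetic. The snippet uses only the two run times quoted in the list; the helper name is made up for this example.

```python
def scaling_efficiency(base_hours: float, scaled_hours: float, node_ratio: float) -> tuple[float, float]:
    """Return (speedup, efficiency), where efficiency = speedup / node_ratio."""
    speedup = base_hours / scaled_hours
    return speedup, speedup / node_ratio


# 2.05 hrs on one node vs 0.50 hrs on four nodes, as reported above.
speedup, efficiency = scaling_efficiency(2.05, 0.50, node_ratio=4)
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.1%}")
# speedup: 4.10x, efficiency: 102.5%
```

An efficiency of roughly 100% is what "near-linear" means here; the small excess over 100% is within what rounding of the reported times could explain.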

[Figure: Performance benchmarks showing 16× speed improvement, 40% cost savings, and near-linear scaling]

Quality Improvements

Data curation modules measurably improve model performance. In ablation studies using a 357M-parameter GPT model trained on curated Common Crawl data:

[Figure: Model accuracy improvements across curation pipeline stages]

Results: Zero-shot downstream task performance improved progressively across the text cleaning, deduplication, and quality filtering stages.


Learn More

| Resource | Links |
| --- | --- |
| Documentation | Main Docs • API Reference • Concepts |
| Tutorials | Text • Image • Video • Audio |
| Deployment | Installation • Infrastructure |
| Community | GitHub Discussions • Issues |

Contribute

We welcome community contributions! Please refer to CONTRIBUTING.md for guidelines.
