
Frontier Evals

Code for evals measuring frontier model capabilities.

Usage

Requirements

We manage environments with uv. Install uv once, then run uv sync (or uv pip install -r ...) inside the project of interest to create its virtual environment from the checked-in uv.lock.

Running Evals

Each eval directory documents how to reproduce runs, configure models, and interpret results. Start with the suite README.md, then consult any scripts under scripts/ or runtime_*/ directories for orchestration details. When in doubt:

  1. cd into the eval directory.
  2. uv sync to install dependencies.
  3. Follow the local instructions in the README.md.

Contributing

Layout

.
├── pyproject.toml             # Shared tooling configuration (Ruff, Black, etc.)
└── project/
    ├── common/                # Shared libraries
    ├── evmbench/              # EVMBench eval
    ├── paperbench/            # PaperBench eval
    └── swelancer/             # SWE-Lancer eval

Each eval directory is its own isolated project with a README.md, pyproject.toml and uv.lock.
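The isolation rule above can be checked mechanically. The following is a minimal sketch, not part of the repository: the helper name `missing_project_files` is hypothetical, and it assumes that every directory directly under `project/` should carry all three files. That may not hold for `common/`, which the layout describes as shared libraries rather than an eval.

```python
from pathlib import Path

# Files that each isolated eval project is described as having.
REQUIRED_FILES = ("README.md", "pyproject.toml", "uv.lock")


def missing_project_files(repo_root: Path) -> dict[str, list[str]]:
    """Map each directory under <repo_root>/project to the required files it lacks.

    Hypothetical helper: an empty list means the directory has all three files.
    """
    report: dict[str, list[str]] = {}
    for child in sorted((repo_root / "project").iterdir()):
        if child.is_dir():
            report[child.name] = [
                name for name in REQUIRED_FILES if not (child / name).is_file()
            ]
    return report
```

Calling `missing_project_files(Path("."))` from the repository root would return one entry per directory under `project/`, which makes it easy to spot a project that was added without a lockfile.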

Development Workflow

  • Create or activate the environment for the project you are working on with uv. Example for PaperBench:
    • cd project/paperbench
    • uv sync
    • uv run pytest
  • Code style and linting use Ruff (with autofix profiles in pyproject.toml and project/common/tooling/ruff_autofix_minimal.toml) and Black. Run uv run ruff check --fix or use the provided Poe/make tasks where available.
  • Shared utilities live under project/common; changes there may affect multiple evals. Bump the relevant editable dependencies if you create new shared subpackages.
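The workflow above can be wrapped in a small script. This is a sketch under assumptions rather than tooling shipped with the repository: `run_dev_checks` and `DEV_COMMANDS` are hypothetical names, and the commands are simply the ones listed above (`uv sync`, `uv run pytest`, `uv run ruff check --fix`), run in that order inside one project directory.

```python
import subprocess
from pathlib import Path

# Commands taken from the workflow described above, in the order listed there.
DEV_COMMANDS = [
    ["uv", "sync"],
    ["uv", "run", "pytest"],
    ["uv", "run", "ruff", "check", "--fix"],
]


def run_dev_checks(project_dir: Path, dry_run: bool = False) -> list[str]:
    """Run each command inside project_dir and return the commands handled.

    Hypothetical helper. With dry_run=True nothing is executed; otherwise the
    first failing command raises subprocess.CalledProcessError and stops the run.
    """
    handled: list[str] = []
    for cmd in DEV_COMMANDS:
        if not dry_run:
            subprocess.run(cmd, cwd=project_dir, check=True)
        handled.append(" ".join(cmd))
    return handled
```

For example, `run_dev_checks(Path("project/paperbench"), dry_run=True)` only lists the commands, which is a cheap way to confirm the order before running them for real.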

About

OpenAI Frontier Evals
