Frontier Evals

Code for evals measuring frontier model capabilities.

PaperBench: End-to-end replication of state-of-the-art AI papers. Paper | Blog
SWE-Lancer: Real freelance software engineering tasks with end-to-end tests. Paper | Blog
EVMbench: Smart contract security tasks. Paper | Blog

Usage

Requirements

We manage environments with uv. Install uv once, then run uv sync (or uv pip install -r ...) inside the project of interest to create its virtual environment from the checked-in uv.lock.

Running Evals

Each eval directory documents how to reproduce runs, configure models, and interpret results. Start with the suite README.md, then consult any scripts under scripts/ or runtime_*/ directories for orchestration details. When in doubt:

1. cd into the eval directory.
2. Run uv sync to install dependencies.
3. Follow the local instructions in the README.md.
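
The steps above can be sketched as a small shell helper. This is only an illustration: the function name setup_eval and the DRY_RUN switch are hypothetical and not part of this repository. The only command taken from this README is uv sync.

```shell
# Hypothetical helper (not shipped with the repo): prepare one eval project.
# Set DRY_RUN=1 to print the commands instead of running them.
setup_eval() {
  eval_dir="$1"
  if [ ! -d "$eval_dir" ]; then
    echo "no such eval directory: $eval_dir" >&2
    return 1
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "cd $eval_dir"
    echo "uv sync"
  else
    # Run in a subshell so the caller's working directory is unchanged.
    (cd "$eval_dir" && uv sync) || return 1
  fi
  echo "Next: follow $eval_dir/README.md"
}
```

For example, setup_eval project/paperbench would sync the PaperBench environment and then point you at its README.md.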

Contributing

Layout

.
├── pyproject.toml             # Shared tooling configuration (Ruff, Black, etc.)
└── project/
    ├── common/                # Shared libraries
    ├── evmbench/              # EVMbench eval
    ├── paperbench/            # PaperBench eval
    └── swelancer/             # SWE-Lancer eval

Each eval directory is its own isolated project with a README.md, pyproject.toml and uv.lock.
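
As a rough sanity check, the three files named above can be tested for directly. The function name is_eval_project is hypothetical; the file names come from the sentence above.

```shell
# Hypothetical check (not shipped with the repo): does a directory
# contain the three files an isolated eval project is expected to have?
is_eval_project() {
  dir="$1"
  for f in README.md pyproject.toml uv.lock; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      return 1
    fi
  done
  return 0
}
```

A loop such as `for d in project/*/; do is_eval_project "$d" 2>/dev/null && echo "$d"; done` would then list the directories that pass. Whether project/common passes is not stated in this README.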

Development Workflow

Create or activate the environment for the project you are working on with uv. Example for PaperBench:
- cd project/paperbench
- uv sync
- uv run pytest
Code style and linting use Ruff (with autofix profiles in pyproject.toml and project/common/tooling/ruff_autofix_minimal.toml) and Black. Run uv run ruff check --fix or use the provided Poe/make tasks where available.
Shared utilities live under project/common; changes there may affect multiple evals. Bump the relevant editable dependencies if you create new shared subpackages.
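
A minimal sketch of a lint-and-format pass, assuming Ruff and Black are both installed in the project environment. The wrapper name lint_project, the RUNNER variable, and the choice to run Black after Ruff are assumptions for illustration; prefer the Poe/make tasks where a project provides them.

```shell
# Hypothetical wrapper (not shipped with the repo) around the lint and
# format tools named above. RUNNER defaults to "uv run"; override it
# (for example RUNNER=echo) to preview the commands without running them.
lint_project() {
  runner="${RUNNER:-uv run}"
  # Left unquoted on purpose so "uv run" splits into two words (sh/bash).
  $runner ruff check --fix . || return 1
  $runner black . || return 1
}
```

Run it from inside an eval directory, for example after cd project/paperbench.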
