Star 历史趋势
数据来源: GitHub API · 生成自 Stargazers.cn
README.md

PixelRAG — Visual Retrieval-Augmented Generation

Official codebase for PIXELRAG: Web Screenshots Beat Text for Retrieval-Augmented Generation

Yichuan Wang*, Zhifei Li*, Zirui Wang, Paul Teiletche, Lesheng Jin
Matei Zaharia†, Joseph E. Gonzalez†, Sewon Min

* Equal contribution   † Equal advising
Work done at Berkeley SkyLab & BAIR & Berkeley NLP

Search any document by how it looks, not just the text it contains.

CI Live demo Status Slack License

What it is · Give Claude eyes · How it works · Pipelines


pip install pixelrag

The two core operations — render a page to screenshots, search a visual index:

# Render any page or document to screenshot tiles pixelshot https://en.wikipedia.org/wiki/Python --output ./tiles # Search a hosted index of 8.28M Wikipedia pages — no setup, runs against the live API curl -X POST https://api.pixelrag.ai/search \ -H "Content-Type: application/json" \ -d '{"queries": [{"text": "What is the capital of France?"}], "n_docs": 5}'

Live, hosted endpointhttps://api.pixelrag.ai serves a pre-built index of 8.28M Wikipedia pages. No setup, no API key. It even takes an image as the query (visual search) — see the API reference →.

Or try it in the browser at pixelrag.ai, or run the demo notebook in Colab Open In Colab — it renders a page and searches the hosted index, with the images inline.

What it is

PixelRAG renders documents — web pages, PDFs, images — as screenshots and retrieves over the images directly. Visual structure that HTML parsing throws away — tables, charts, layout, infographics — stays intact, so the reader model can actually answer questions about it. Wikipedia's 8.28M articles ship as a pre-built index; the pipeline itself is general-purpose.

Give Claude eyes

The renderer also ships as a Claude Code plugin — the pixelbrowse skill. Instead of fetching raw HTML, Claude screenshots a page with pixelshot and reads the image, so it sees charts, diagrams, tables, and layout the way a person does.

Install it — no clone needed. Install the pixelshot CLI so it's on your PATH (use uv tool or pipx to keep it isolated yet always available to Claude — a plain pip install into a project venv may leave pixelshot off PATH):

uv tool install pixelrag # pixelshot on PATH (or: pipx install pixelrag) claude plugin marketplace add StarTrail-org/PixelRAG claude plugin install pixelbrowse@pixelrag-plugins

Then just ask Claude to look at a page:

claude -p "screenshot https://news.ycombinator.com and summarize the top stories" claude -p "screenshot https://arxiv.org/abs/2404.12387 and explain the key findings"

Or use the slash command in an interactive session: /screenshot https://example.com. No MCP server, no backend: the skill just calls pixelshot (Playwright/CDP) on your machine.

How it works

Text-based RAG parses to text and loses the table; PixelRAG renders to screenshot tiles and keeps it

Text-based RAG parses the page to text chunks and loses the table — the reader can't find the answer. PixelRAG renders the page to screenshot tiles, retrieves the right tile, and the reader reads the number straight off the image.

Two pieces make this work: (1) rendering documents to images instead of parsing them to text, and (2) a Qwen3-VL-Embedding model, LoRA-fine-tuned on screenshot data, that embeds page images into a space where visual content is retrievable.

Pipelines

Capture is the standalone pixelshot command; the rest of the pipeline runs through the pixelrag umbrella — pixelrag <stage>. Install only the stages you need:

CommandWhat it doesInstall
pixelshotDocument → image tiles (Playwright CDP, PDF)pip install pixelrag
pixelrag chunk · embed · build-indexTiles → vectors → FAISS indexpip install 'pixelrag[embed]'
pixelrag indexOrchestrates the full pipeline: source → ingest → embed → indexpip install 'pixelrag[index]'
pixelrag serveFAISS search API (FastAPI, CPU or GPU)pip install 'pixelrag[serve]'
render ←── index ──→ embed       serve (independent)       train → serve (HTTP)

train is a separate uv project with its own pinned env (torch==2.9.1+cu129, transformers==4.57.1, cuDNN 9.20) — install it from inside train/, not from the root.

Search a pre-built index

pip install 'pixelrag[serve]' # Download a pre-built index from Hugging Face. The dataset repo holds four FAISS indexes # (base/LoRA Wikipedia pixel, Wikipedia text, news pixel); grab just the base one (~217G) here. huggingface-cli download StarTrail-org/pixelrag-faiss-indexes \ --repo-type dataset --include "search_index_normed_v2/*" --local-dir ./index # Serve, then query pixelrag serve --index-dir ./index/search_index_normed_v2 --port 30001 curl -X POST http://localhost:30001/search \ -H "Content-Type: application/json" \ -d '{"queries": [{"text": "What is the capital of France?"}], "n_docs": 5}'

Build an index from your own documents

Works on Linux (CUDA) and macOS (Apple Silicon / MPS)device: auto picks the best backend.

pip install 'pixelrag[index]' # Create pixelrag.yaml cat > pixelrag.yaml << 'EOF' source: type: local path: ./my_docs embed: model: Qwen/Qwen3-VL-Embedding-2B device: auto # cuda on Linux, mps on macOS, cpu as fallback output: ./my_index EOF # Build, then serve pixelrag index build pixelrag serve --index-dir ./my_index --port 30001
Try it: index a PDF and search it locally

No GPU required — runs on macOS (Apple Silicon) or any machine with Python 3.10+.

pip install 'pixelrag[index]' # 1. Grab a sample PDF (or use your own) curl -L -o paper.pdf https://raw.githubusercontent.com/StarTrail-org/PixelRAG/main/assets/pixelrag-paper.pdf # 2. Create config (device: auto picks MPS on Mac, CUDA on Linux) cat > pixelrag.yaml << 'EOF' source: type: local path: ./paper.pdf embed: model: Qwen/Qwen3-VL-Embedding-2B device: auto output: ./paper_index EOF # 3. Build the index (~3 min on Apple M-series, ~1 min on GPU) pixelrag index build # 4. Serve it pixelrag serve --index-dir ./paper_index --port 30001 # 5. Search — should return page 2 (the overview diagram) curl -X POST http://localhost:30001/search \ -H "Content-Type: application/json" \ -d '{"queries": [{"text": "Overview of PixelRAG and the diagram"}], "n_docs": 1}'

Render a page programmatically

from pixelrag_render import render_url # render a single page to tiles — e.g. for an agent to read tiles = render_url("https://en.wikipedia.org/wiki/Python", "./tiles")

The same rendering is available as a CLI — pixelshot ships with pip install pixelrag:

# Web page → tiles (headless Chromium via CDP) pixelshot https://en.wikipedia.org/wiki/Python -o ./tiles # PDF → tiles (requires poppler; install the pdf extra: pip install 'pixelrag[pdf]') curl -sL -o paper.pdf https://arxiv.org/pdf/2503.09516 pixelshot paper.pdf -o ./tiles --dpi 200 # URLs and local files can be mixed freely pixelshot https://github.com/StarTrail-org/PixelRAG paper.pdf -o ./tiles

Chrome on Windows/macOS — the bundled turbo headless_shell auto-installs on linux-x64 only. Elsewhere, pixelshot uses your system Chrome/Chromium (or Playwright's Chromium), auto-detected from the standard install locations. Point it at a specific binary with CHROME_PATH=/path/to/chrome if it isn't found automatically. Each render runs in an isolated, throwaway Chrome profile, so it works even while you have Chrome open.

Embed tools (standalone)

Each stage runs independently, without the orchestrator:

pip install 'pixelrag[embed]' pixelrag chunk --tiles-dir ./tiles pixelrag embed --shard-dir ./tiles --output-dir ./embeddings --gpu-ids 0,1 pixelrag build-index --embeddings-dir ./embeddings --output-dir ./index

Training

Fine-tuning lives in train/ — a separate uv project (wiki-screenshot-training) with its own pinned env. It LoRA-fine-tunes Qwen/Qwen3-VL-Embedding-2B for webpage retrieval; run it from inside train/ (cd train && uv sync). See train/README.md for the full recipe.

You don't need to retrain to use the model — the trained adapters are published at Chrisyichuan/wiki-screenshot-embedding-lora.

We also release the full training set (Chrisyichuan/screenshot-training-natural-filtered-v2), so you can adapt other backbones yourself — a larger Qwen, or any other embedding model. The data curation pipeline (LLM-augmented query generation, filtering, hard-negative mining) is documented in train/docs/synthetic_data_pipeline.md.

Acknowledgments

Thanks to Rulin Shao for support.

Thanks also to Claude Code and OpenAI Codex for supporting open-source contributors with credits and plans, which we earned by working on LEANN.

This work is done by the Berkeley Sky Computing Lab, BAIR, and the Berkeley NLP Group.

License

Apache-2.0

关于 About

The end of web parsing. The beginning of scalable pixel-native search.
agentaimemorymultimodalragsearchsearchenginevisionvlm

语言 Languages

Python74.2%
Markdown16.0%
TypeScript6.8%
Shell1.7%
JavaScript0.7%
CSS0.4%
JSON0.2%
YAML0.0%

提交活跃度 Commit Activity

代码提交热力图
过去 52 周的开发活跃度
48
Total Commits
峰值: 27次/周
Less
More

核心贡献者 Contributors