
CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence



If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.


🔎 Overview

CiteVQA is a document visual question answering benchmark for faithful evidence attribution. Unlike conventional DocVQA datasets that only score the final answer, CiteVQA requires a model to answer a question with evidence grounded in the source document at the element level. The benchmark is designed to evaluate whether a system can not only answer correctly, but also cite the right supporting region in long, real-world PDFs.

The dataset contains 1,897 questions built from 711 PDFs across 7 macro-domains and 30 sub-domains, with an average of 40.6 pages per document. It covers both English and Chinese documents, and includes single-document as well as multi-document settings.

The evaluation covers three dataset types:

  • Single-Doc: Single-document question answering.
  • Multi (1-Gold): Multi-document QA with exactly one gold document.
  • Multi (N-Gold): Multi-document QA with multiple gold documents.
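
The three settings can be told apart from two counts per question. The sketch below is illustrative only: `dataset_type`, `num_docs`, and `num_gold` are hypothetical names and are not taken from the benchmark files.

```python
def dataset_type(num_docs: int, num_gold: int) -> str:
    """Map document counts to one of the three CiteVQA settings.

    num_docs: documents shown to the model (hypothetical name).
    num_gold: documents that contain gold evidence (hypothetical name).
    """
    if num_docs < 1 or not 1 <= num_gold <= num_docs:
        raise ValueError("need 1 <= num_gold <= num_docs")
    if num_docs == 1:
        return "Single-Doc"
    return "Multi (1-Gold)" if num_gold == 1 else "Multi (N-Gold)"
```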


Overview of CiteVQA. Left: a prediction is counted as correct only when the answer is correct and the cited evidence region is both relevant and spatially aligned with the gold evidence under Strictly Attributed Accuracy (SAA). Right top: dataset statistics show that CiteVQA emphasizes long, realistic PDFs. Right bottom: existing MLLMs exhibit a substantial gap between answer accuracy and evidence-grounded accuracy.
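
The per-question rule in the caption can be sketched as a conjunction of three checks. This is a minimal illustration, not the repository's scoring code: the `Box` fields, the IoU-based alignment test, the 0.5 threshold, and the requirement that every gold box be matched are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned evidence region on one page of one document (hypothetical layout)."""
    doc: int
    page: int
    x0: float
    y0: float
    x1: float
    y1: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union; 0.0 when the boxes sit on different pages or documents."""
    if (a.doc, a.page) != (b.doc, b.page):
        return 0.0
    iw = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    ih = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    inter = iw * ih
    union = (a.x1 - a.x0) * (a.y1 - a.y0) + (b.x1 - b.x0) * (b.y1 - b.y0) - inter
    return inter / union if union > 0 else 0.0


def strictly_attributed(answer_ok: bool, evidence_relevant: bool,
                        pred: list[Box], gold: list[Box],
                        iou_threshold: float = 0.5) -> bool:
    """True only when the answer is correct, the evidence is relevant,
    and every gold box is matched by some predicted box (assumed rule)."""
    aligned = bool(gold) and all(
        any(iou(p, g) >= iou_threshold for p in pred) for g in gold
    )
    return answer_ok and evidence_relevant and aligned
```

Under this sketch, a correct answer that cites the wrong page scores zero, which is the failure mode the benchmark is built to expose.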

✨ Highlights

  • Joint answer-and-evidence evaluation: Evaluates both answer correctness and citation faithfulness.
  • Element-level evidence: Structured gold evidence features bounding boxes, page, and document indices.
  • Long-document setting: Focuses on multi-page PDFs with realistic lengths and complex layouts.
  • Cross-domain and bilingual: Spans 7 domains, 30 sub-domains, and two languages (en, zh).
  • Multi-document reasoning: Features cross-document questions that require evidence aggregation.
  • Three evaluation settings: Supports Single-Doc, Multi (1-Gold), and Multi (N-Gold).

⚙️ Setup

Install dependencies:

pip install -r requirements.txt

Optional CJK font configuration for PDF rendering:

apt install fonts-noto-cjk poppler-data

cat > /etc/fonts/conf.d/99-pdf-cjk.conf << 'EOF'
<?xml version="1.0"?>
<!DOCTYPE fontconfig SYSTEM "fonts.dtd">
<fontconfig>
  <alias><family>STSong-Light</family><prefer><family>Noto Serif CJK SC</family></prefer></alias>
  <alias><family>STSong</family><prefer><family>Noto Serif CJK SC</family></prefer></alias>
  <alias><family>SimSun</family><prefer><family>Noto Serif CJK SC</family></prefer></alias>
  <alias><family>FangSong</family><prefer><family>Noto Serif CJK SC</family></prefer></alias>
  <alias><family>KaiTi</family><prefer><family>Noto Serif CJK SC</family></prefer></alias>
  <alias><family>SimHei</family><prefer><family>Noto Sans CJK SC</family></prefer></alias>
  <alias><family>Microsoft YaHei</family><prefer><family>Noto Sans CJK SC</family></prefer></alias>
</fontconfig>
EOF

fc-cache -f

📦 Data

From the repository root, you can fetch the benchmark files from Hugging Face into data/, then download the source PDFs:

pip install -U "huggingface_hub[cli]"
hf download opendatalab/CiteVQA --repo-type dataset --local-dir .
python data/download/download_pdfs.py --workers 16 --out data/pdf --csv data/download/pdf_source.csv

From the repository root, you can also fetch the benchmark files from ModelScope into data/, then download the source PDFs:

pip install -U modelscope
modelscope download --dataset OpenDataLab/CiteVQA --local_dir .
python data/download/download_pdfs.py --workers 16 --out data/pdf --csv data/download/pdf_source.csv

The PDF downloader reads data/download/pdf_source.csv and saves all files to data/pdf/.

If you run into dataset or download issues, jump to the Contact section.

Download Arguments

| Option | Default | Description |
|---|---|---|
| --csv | pdf_source.csv | CSV file containing PDF URLs |
| --out | pdf | Output directory |
| --workers | 16 | Concurrent download workers |
| --timeout | 120 | Timeout per file in seconds |
| --retries | 3 | Retry count |
| --no-skip | - | Re-download existing files |
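
To show how these options interact, here is a minimal sketch of a concurrent downloader with a per-file timeout, retries, and skip-existing behavior. It is not the code in download_pdfs.py: the row keys `filename` and `url`, the helper names, and the reading of `--retries` as "one first attempt plus N retries" are assumptions.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable


def http_fetch(url: str, timeout: float) -> bytes:
    """Fetch one URL with the standard library; network errors raise OSError subclasses."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_all(rows: Iterable[dict], out_dir: Path,
                 fetch: Callable[[str, float], bytes] = http_fetch,
                 workers: int = 16, timeout: float = 120,
                 retries: int = 3, skip_existing: bool = True) -> dict[str, str]:
    """Download every row and return a status string per filename."""
    out_dir.mkdir(parents=True, exist_ok=True)

    def one(row: dict) -> tuple[str, str]:
        name, url = row["filename"], row["url"]  # hypothetical column names
        target = out_dir / name
        if skip_existing and target.exists():
            return name, "skipped"
        for _attempt in range(retries + 1):  # first try plus `retries` retries
            try:
                target.write_bytes(fetch(url, timeout))
                return name, "ok"
            except OSError:
                continue
        return name, "failed"

    # Threads suit this workload because each task mostly waits on the network.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(one, rows))
```

Passing `skip_existing=False` mirrors `--no-skip`: files already on disk are fetched again.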

🚀 Inference and Evaluation

run.sh provides a demo that evaluates GPT-5.4. Edit the API settings in run.sh, then run:

bash run.sh

Reference workflow:

# API config
API_TYPE=openai
API_KEY=YOUR_API_KEY
BASE_URL=YOUR_BASE_URL

# Inference
python infer/run.py \
  --api ${API_TYPE} \
  --model MODEL_NAME \
  --base_url ${BASE_URL} \
  --api_key ${API_KEY} \
  --workers 4 \
  --out outputs/infer/MODEL_NAME.json

# Evaluation
python eval/run.py \
  --judge_api ${API_TYPE} \
  --judge_model JUDGE_MODEL_NAME \
  --judge_api_key ${API_KEY} \
  --base_url ${BASE_URL} \
  --input outputs/infer/MODEL_NAME.json \
  --out outputs/eval/MODEL_NAME.json \
  --workers 24

# Summary
python eval/summarize.py \
  --input outputs/eval/MODEL_NAME.json \
  --out_dir outputs/eval/MODEL_NAME
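
The same three stages can be driven from Python when a batch of models needs to be evaluated. The sketch below only assembles the commands shown above. `pipeline_commands` and `run_pipeline` are hypothetical helper names, and the worker counts are copied from the reference workflow.

```python
import subprocess
import sys


def pipeline_commands(model: str, judge_model: str, api_type: str,
                      api_key: str, base_url: str) -> list[list[str]]:
    """Build argv lists for inference, evaluation, and summary, in that order."""
    infer_out = f"outputs/infer/{model}.json"
    eval_out = f"outputs/eval/{model}.json"
    return [
        [sys.executable, "infer/run.py", "--api", api_type, "--model", model,
         "--base_url", base_url, "--api_key", api_key,
         "--workers", "4", "--out", infer_out],
        [sys.executable, "eval/run.py", "--judge_api", api_type,
         "--judge_model", judge_model, "--judge_api_key", api_key,
         "--base_url", base_url, "--input", infer_out,
         "--out", eval_out, "--workers", "24"],
        [sys.executable, "eval/summarize.py", "--input", eval_out,
         "--out_dir", f"outputs/eval/{model}"],
    ]


def run_pipeline(commands: list[list[str]], runner=subprocess.run) -> None:
    """Run each stage in order; check=True stops at the first failing stage."""
    for argv in commands:
        runner(argv, check=True)
```

Each stage reads the previous stage's output file, so the order matters and a failure should halt the run rather than continue with stale outputs.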

🧭 Inference Arguments

| Option | Required | Description |
|---|---|---|
| --api | Yes | openai, genai, or anthropic |
| --model | Yes | Model name |
| --api_key | Yes | API key |
| --base_url | No | API base URL |
| --workers | No | Number of workers, default 4 |
| --out | No | Output JSON path |
| --benchmark | No | Benchmark path, default data/data_items.json |
| --limit | No | Sample limit, 0 means all |
| --max_pdf_mb | No | Compress PDFs larger than this size in MB |

📏 Evaluation Arguments

| Option | Required | Description |
|---|---|---|
| --input | Yes | Inference output JSON |
| --judge_api | No | Judge API type, default openai |
| --judge_model | No | Judge model name, default gpt-4o |
| --judge_api_key | Yes | Judge API key |
| --base_url | No | API base URL |
| --metrics | No | Metrics list, default recall,rel |
| --workers | No | Number of workers |
| --out | No | Output JSON path |
| --limit | No | Sample limit |

🗂️ Repository Structure

CiteVQA/
├── data/
│   ├── validation/
│   │   └── CiteVQA.json         # Benchmark QA pairs
│   ├── pdf/                     # Downloaded PDFs
│   └── download/
│       ├── pdf_source.csv       # PDF metadata & URLs
│       └── download_pdfs.py     # PDF download script
├── infer/
│   └── run.py                   # Inference script
├── eval/
│   ├── run.py                   # Evaluation script
│   └── summarize.py             # Summary table generator
├── prompts/                     # System & user prompts
├── outputs/                     # Inference & evaluation outputs
├── requirements.txt
└── run.sh                       # Demo script

📊 Evaluation Metrics

| Metric | Meaning |
|---|---|
| Recall | Whether predicted evidence overlaps with crucial ground-truth evidence |
| Relevance (Rel.) | Whether the cited evidence semantically supports the answer |
| Answer Correctness (Ans.) | Whether the answer is correct |
| SAA | Strict Attributed Accuracy: answer and evidence must both be valid |
| Page Recall | Whether the correct page is identified |
| Precision / F1 | Precision and overlap quality of predicted evidence |

SAA is the core metric of CiteVQA.
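
As a rough illustration of how Page Recall and Precision / F1 relate, the sketch below scores predicted evidence at page granularity, treating each citation as a `(doc, page)` pair. This is a simplifying assumption for the example: the repository's matching rules, which work at the element level, may differ, and `page_prf` is a hypothetical name.

```python
def page_prf(pred: set[tuple[int, int]],
             gold: set[tuple[int, int]]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) over cited (doc, page) pairs."""
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# One gold page, two cited pages: the gold page is found, but half the citations are spurious.
p, r, f1 = page_prf(pred={(0, 3), (0, 4)}, gold={(0, 3)})
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Citing extra pages keeps recall at its maximum while lowering precision, which is why the two are reported separately.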

🏆 Evaluation Results

We evaluated 20 state-of-the-art MLLMs on CiteVQA using a unified prompt template. The results show that faithful evidence attribution remains substantially harder than answer-only scoring.

  • Best overall SAA: Gemini-3.1-Pro-Preview reaches 76.0 SAA with 86.1 answer score.
  • Best answer accuracy: GPT-5.4 reaches 87.1 answer score, but its SAA drops to 59.0.
  • Best open-source model: Qwen3-VL-235B-A22B reaches 22.5 SAA with 72.3 answer score.
  • Key finding: a large gap between Ans. and SAA appears across models, highlighting the benchmark's Attribution Hallucination challenge.

Full overall results:

| Model | Category | Rec. | Rel. | Ans. | SAA |
|---|---|---|---|---|---|
| Gemini-3.1-Pro-Preview | Closed-source MLLMs | 66.0 | 83.6 | 86.1 | 76.0 |
| Gemini-3-Flash-Preview | Closed-source MLLMs | 45.4 | 75.7 | 84.5 | 65.4 |
| GPT-5.4 | Closed-source MLLMs | 31.0 | 67.5 | 87.1 | 59.0 |
| Gemini-2.5-Pro | Closed-source MLLMs | 27.4 | 59.8 | 82.2 | 47.0 |
| Seed2.0-Pro | Closed-source MLLMs | 28.5 | 54.9 | 81.3 | 44.1 |
| GPT-5.2 | Closed-source MLLMs | 18.2 | 56.6 | 71.5 | 33.7 |
| Qwen3.6-Plus | Closed-source MLLMs | 7.7 | 25.0 | 85.9 | 17.5 |
| GLM-5V-Turbo | Closed-source MLLMs | 14.9 | 29.2 | 49.6 | 12.8 |
| Qwen3-VL-235B-A22B | Open-source Large MLLMs | 11.3 | 35.3 | 72.3 | 22.5 |
| Gemma-4-31B | Open-source Large MLLMs | 11.6 | 35.0 | 69.8 | 20.2 |
| Kimi-K2.5 | Open-source Large MLLMs | 6.2 | 26.8 | 74.3 | 19.1 |
| Qwen3.5-397B-A17B | Open-source Large MLLMs | 5.4 | 24.6 | 76.5 | 18.3 |
| Qwen3.5-27B | Open-source Large MLLMs | 5.3 | 25.3 | 75.6 | 17.3 |
| Qwen3-VL-32B | Open-source Large MLLMs | 6.6 | 30.5 | 72.3 | 17.3 |
| Qwen3.5-122B-A10B | Open-source Large MLLMs | 3.9 | 19.0 | 73.6 | 14.8 |
| Qwen3.5-9B | Open-source Small MLLMs | 1.6 | 14.7 | 65.0 | 11.1 |
| Qwen3.5-35B-A3B | Open-source Small MLLMs | 1.7 | 13.7 | 76.4 | 10.7 |
| Qwen3-VL-30B-A3B | Open-source Small MLLMs | 3.5 | 14.6 | 62.2 | 8.2 |
| Qwen3-VL-8B | Open-source Small MLLMs | 1.0 | 14.7 | 61.2 | 7.5 |
| Gemma-4-26B-A4B | Open-source Small MLLMs | 3.0 | 17.9 | 48.4 | 6.2 |
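
The gap between Ans. and SAA can be read directly off the table. The short sketch below computes it for four rows, with the scores copied from the table above. The subtraction is only a convenient way to compare models, not a metric defined by the benchmark.

```python
# (Ans., SAA) pairs copied from the results table above.
scores = {
    "Gemini-3.1-Pro-Preview": (86.1, 76.0),
    "GPT-5.4": (87.1, 59.0),
    "Qwen3.6-Plus": (85.9, 17.5),
    "Qwen3-VL-235B-A22B": (72.3, 22.5),
}

# Points of answer accuracy that are lost once evidence must also be valid.
gaps = {model: round(ans - saa, 1) for model, (ans, saa) in scores.items()}

for model, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{model}: {gap:.1f}")
```

Among these four rows, Qwen3.6-Plus answers almost as well as the top two models yet has the widest gap, so answer accuracy alone says little about citation quality.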

📬 Contact

Since the PDF sources are downloaded from external links, issues such as broken links or data accessibility problems may occur during download. If you encounter any download-related problems, please email wzr@stu.pku.edu.cn.

📚 Citation

@article{ma2026citevqa,
  title={CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence},
  author={Ma, Dongsheng and Li, Jiayu and Wang, Zhengren and Wang, Yijie and Kong, Jiahao and Zeng, Weijun and Xiao, Jutao and Yang, Jie and Zhang, Wentao and Wang, Bin and He, Conghui},
  journal={arXiv preprint arXiv:2605.12882},
  year={2026}
}

🙏 Acknowledgements

  • MinerU for document parsing.
  • ViDoRe V3 and other open-source datasets (SPIQA, MedQA, PubMedQA, MaintNorm, PolicyBench) for inspiring our benchmark construction.

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

©️ Copyright Notice

CiteVQA is provided for academic research and non-commercial use only. We fully respect the rights of original copyright holders. If any rights holder believes that the inclusion, indexing, or use of any relevant content in this benchmark is inappropriate, please contact OpenDataLab@pjlab.org.cn. We will verify the request and remove or update the relevant content when appropriate.
