Xberg

One Rust engine — 96 file formats, 306 programming languages, native bindings for 15 languages, dual model runtimes, 6 output formats, OCR from any backend, embeddings, structured LLM extraction, token reduction, and more.

Xberg is the next iteration of Kreuzberg. Same document-intelligence engine, rebuilt and rebranded under a fresh v1 line.

Feed documents → get clean text, tables, metadata, transcripts, code intelligence · Run it as a library, CLI, REST API, or MCP server · No GPU needed · Stream multi-GB files · Cache results.

Documents · Images · Spreadsheets · Email · Archives · Code · Audio · Video

Quick start · What you get · Capabilities · CLI · Docs


Feed any document—get structured text. Extract, batch, stream, or crawl.


What you get

Xberg is a full content-intelligence engine. One Rust core with fast, accurate extraction from 96 file formats and 306 programming languages. Language bindings for Rust, Python, Node.js, Go, Java, C#, Ruby, PHP, Elixir, Dart, Swift, Zig, WASM, Kotlin, and C FFI. Use it as a library, CLI tool, REST API, or MCP server.

| What it does | How |
| --- | --- |
| Extract from 96 formats | PDFs, Office, images, HTML, email, archives, scientific publications, and code. Intelligent MIME detection, streaming for large files. |
| 6 output formats | Plain text, Markdown, Djot, HTML, JSON tree structure, or Structured (JSON with OCR metadata and bounding boxes). |
| Code intelligence | Functions, classes, imports, symbols, docstrings from 306 programming languages. Syntax-aware chunking for RAG pipelines. |
| Crawl & recurse | Follow URLs, extract documents from within documents (nested archives, embedded PDFs). Auto/Document/Crawl modes. |
| OCR on demand | Tesseract, PaddleOCR, Candle, or VLM backends. Fallback chains, extensible via plugins. Confidence scores. Language auto-detection. |
| Transcription | Whisper ONNX for audio/video tracks (MP3, M4A, WAV, WebM, MP4). |
| Embeddings & search | Local (ONNX models) or provider-hosted (OpenAI, Anthropic, Google, 143 providers via liter-llm). Reranking. |
| Structured outputs | LLM-powered extraction, local (Ollama, LM Studio, vLLM) or remote (OpenAI, Anthropic, Google). |
| Enrichment | NER, redaction, summarization, translation, QR code detection, page classification, keyword extraction (YAKE/RAKE), language detection, layout detection, table extraction, token reduction (TOON). |
| Batch & parallel | Process hundreds of documents in parallel. Per-file timeouts. Configurable batch concurrency (max_concurrent_extractions). |
| Caching | Content-hash cache keys: skip re-extraction when the file and config are unchanged. |
| Deployment | Library, CLI (12 commands), REST API (xberg serve), MCP server (9 tools, 3 prompts, 4 resources), Docker. |
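
To make the caching row concrete, here is a minimal sketch of a content-hash cache key. It is not Xberg's implementation: `ConfigFingerprint`, `cache_key`, and `ExtractionCache` are hypothetical names, and the two config fields are placeholders for whatever settings affect the output.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical subset of extraction settings that influence the output.
#[derive(Hash)]
pub struct ConfigFingerprint {
    pub output_format: String,
    pub ocr_enabled: bool,
}

/// Derive a cache key from the file bytes plus the configuration.
/// Changing either input yields a different key (barring hash collisions).
pub fn cache_key(file_bytes: &[u8], config: &ConfigFingerprint) -> u64 {
    let mut hasher = DefaultHasher::new();
    file_bytes.hash(&mut hasher);
    config.hash(&mut hasher);
    hasher.finish()
}

/// In-memory cache keyed by the content hash (illustrative only).
pub struct ExtractionCache {
    entries: HashMap<u64, String>,
}

impl ExtractionCache {
    pub fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Return the cached text, or run `extract` once and store its result.
    pub fn get_or_extract<F>(
        &mut self,
        file_bytes: &[u8],
        config: &ConfigFingerprint,
        extract: F,
    ) -> String
    where
        F: FnOnce(&[u8]) -> String,
    {
        let key = cache_key(file_bytes, config);
        self.entries
            .entry(key)
            .or_insert_with(|| extract(file_bytes))
            .clone()
    }
}
```

`DefaultHasher` keeps the sketch dependency-free, but its output is not guaranteed to be stable across Rust releases. A cache that persists to disk would more plausibly use a stable digest such as SHA-256 or BLAKE3.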

Demos

The CLI: 12 commands for extraction, caching, serving, and MCP.

OCR with confidence scores and bounding boxes. Switch backends without code changes.

Web crawl: fetch a page, follow links, extract all documents recursively.

MCP server: AI agents extract documents, detect formats, warm models, manage cache.

REST API: stream large files, get JSON or Markdown, one endpoint for all formats.


Installation

Language Packages

Python
pip install xberg

See Python README for full documentation.

Node.js / TypeScript
npm install @xberg-io/xberg

See Node.js README for full documentation.

Rust
cargo add xberg

See Rust README for full documentation.

Go
go get github.com/xberg-io/xberg

See Go README for full documentation.

Java

Available on Maven Central as io.xberg:xberg. See Java README for the dependency snippet.

C#
dotnet add package Xberg

See C# README for full documentation.

Ruby
gem install xberg

See Ruby README for full documentation.

PHP
composer require xberg-io/xberg

See PHP README for full documentation.

Elixir

Add {:xberg, "~> 1.0"} to your mix.exs dependencies. See Elixir README for full documentation.

WebAssembly
npm install @xberg-io/xberg-wasm

See WebAssembly README for full documentation.

Kotlin (Android)

Available on Maven Central as io.xberg:xberg-android. See Kotlin README for the dependency snippet.

Swift

Add via Swift Package Manager. See Swift README for full documentation.

Dart / Flutter
dart pub add xberg

See Dart README for full documentation.

Zig

Add via zig fetch. See Zig README for full documentation.

C/C++ (FFI)

Build from source as part of this workspace. See C (FFI) README for full documentation.

CLI & Deployment

CLI Tool
brew install xberg-io/tap/xberg

12 commands: extract, batch, detect, formats, version, cache (stats/clear/manifest/warm), serve, mcp, api, embed, chunk, completions.

See CLI usage guide for detailed documentation.

Docker
docker pull ghcr.io/xberg-io/xberg:latest

Run in API, CLI, or MCP modes. See Docker guide for examples.

REST API Server
xberg serve --host 0.0.0.0 --port 8000

One POST endpoint handles all formats. Returns JSON or Markdown. Stream large files. See API server guide.

MCP Server
xberg mcp --transport stdio

9 tools (extract, extract_batch, detect_mime_type, cache_stats, list_formats, cache_clear, get_version, cache_manifest, cache_warm). 3 prompts (extract_document, extract_with_ocr, semantic_search). 4 resources (formats, models, OCR languages, embedding presets).

Add to Claude Desktop or Cursor:

{
  "mcpServers": {
    "xberg": { "command": "xberg", "args": ["mcp"] }
  }
}

See MCP integration guide.

AI Coding Assistants

Install the Xberg plugin from xberg-io/plugins. Ships extraction APIs, OCR backends, configuration, and language conventions.

Claude Code
/plugin marketplace add xberg-io/plugins
/plugin install xberg@xberg

Codex CLI
/plugins add https://github.com/xberg-io/plugins

Search for xberg and select Install Plugin.

Cursor

Settings → Plugins → Add from URL → https://github.com/xberg-io/plugins, then select xberg.

Gemini CLI
gemini extensions install https://github.com/xberg-io/plugins

Factory Droid
droid plugin marketplace add https://github.com/xberg-io/plugins
droid plugin install xberg@xberg

GitHub Copilot CLI
copilot plugin marketplace add https://github.com/xberg-io/plugins
copilot plugin install xberg@xberg

opencode

Add to opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "plugin": ["@xberg-io/opencode-xberg"]
}

Quick Start

Extract text from a document:

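use xberg::{extract, ExtractInput, ExtractionConfig};

#[tokio::main]
async fn main() -> xberg::Result<()> {
    // Start from the default extraction settings.
    let config = ExtractionConfig::default();

    // Extract from the given URI (here, a file in the working directory).
    let output = extract(
        ExtractInput::from_uri("document.pdf"),
        &config
    ).await?;

    // Print the extracted content of the first result.
    println!("{}", output.results[0].content);
    Ok(())
}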

For common use cases, see the Quick start guide: language-specific examples, OCR, batch processing, and API configuration.


Capabilities

Full feature list

Supported File Formats (96)

96 file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
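
As a rough illustration of what format detection by content involves, the sketch below checks a file's leading "magic" bytes against a few well-known signatures. The function name `sniff_mime` is hypothetical, and this is not a description of how Xberg detects formats internally.

```rust
/// Guess a MIME type from the leading "magic" bytes of a file.
/// Returns `None` when no known signature matches.
pub fn sniff_mime(bytes: &[u8]) -> Option<&'static str> {
    if bytes.starts_with(b"%PDF-") {
        Some("application/pdf")
    } else if bytes.starts_with(b"\x89PNG\r\n\x1a\n") {
        Some("image/png")
    } else if bytes.starts_with(b"\xFF\xD8\xFF") {
        Some("image/jpeg")
    } else if bytes.starts_with(b"GIF8") {
        Some("image/gif")
    } else if bytes.starts_with(b"PK\x03\x04") {
        // ZIP container: also the outer wrapper of .docx, .xlsx, .pptx, .epub.
        Some("application/zip")
    } else {
        None
    }
}
```

Magic bytes alone cannot tell a plain `.zip` from a `.docx` or `.epub`, since those share the ZIP signature. Telling them apart requires looking inside the container, for example at its file listing.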

Office Documents

| Category | Formats | Capabilities |
| --- | --- | --- |
| Word Processing | .docx, .docm, .doc, .dotx, .dotm, .dot, .odt, .pages | Full text, tables, images, metadata, styles |
| Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .xltx, .xlt, .ods, .numbers | Sheet data, formulas, cell metadata, charts |
| Presentations | .pptx, .pptm, .ppt, .ppsx, .potx, .potm, .pot, .key | Slides, speaker notes, images, metadata |
| PDF | .pdf | Text, tables, images, metadata, OCR support |
| eBooks | .epub, .fb2 | Chapters, metadata, embedded resources |
| Database | .dbf | Table data extraction, field type support |
| Hangul | .hwp, .hwpx | Korean document format, text extraction |

Images (OCR-Enabled)

| Category | Formats | Features |
| --- | --- | --- |
| Raster | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif | OCR, table detection, EXIF metadata, dimensions, color space |
| Advanced | .jp2, .jpx, .jpm, .mj2, .jbig2, .jb2, .pnm, .pbm, .pgm, .ppm | OCR via pure-Rust JPEG2000 decoder, JBIG2 support, table detection |
| HEIC family | .heic, .heics, .heif, .avif, .avcs | EXIF metadata, optional pixel decoding |
| Vector | .svg | DOM parsing, embedded text, graphics metadata |

Audio & Video

| Category | Formats | Features |
| --- | --- | --- |
| Audio | .mp3, .mpga, .m4a, .wav, .webm | Whisper transcription |
| Video audio track | .mp4, .mpeg, .webm | Audio-track transcription only |

Web & Data

| Category | Formats | Features |
| --- | --- | --- |
| Markup | .html, .htm, .xhtml, .xml, .svg | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| Structured Data | .json, .yaml, .yml, .toml, .csv, .tsv | Schema detection, nested structures, validation |
| Text & Markdown | .txt, .md, .markdown, .djot, .mdx, .rst, .org, .rtf | CommonMark, GFM, Djot, MDX, reStructuredText, Org Mode |

Email & Archives

| Category | Formats | Features |
| --- | --- | --- |
| Email | .eml, .msg, .pst | Headers, body (HTML/plain), attachments, threading |
| Archives | .zip, .tar, .tgz, .gz, .7z | File listing, nested archives, metadata, recursive extraction |

Academic & Scientific

| Category | Formats | Features |
| --- | --- | --- |
| Citations | .bib, .ris, .nbib, .enw | Structured parsing: RIS, PubMed/MEDLINE, EndNote XML, BibTeX/BibLaTeX |
| Scientific | .tex, .latex, .typ, .typst, .jats, .ipynb | LaTeX, Typst, Jupyter notebooks, PubMed JATS |
| Publishing | .fb2, .docbook, .dbk, .docbook4, .docbook5, .opml | FictionBook, DocBook XML, OPML outlines |

Code Intelligence (306 Languages)

Extract structure from 306 programming languages via tree-sitter:

| Feature | Description |
| --- | --- |
| Structure Extraction | Functions, classes, methods, structs, interfaces, enums |
| Import/Export Analysis | Module dependencies, re-exports, wildcard imports |
| Symbol Extraction | Variables, constants, type aliases, properties |
| Docstring Parsing | Google, NumPy, Sphinx, JSDoc, RustDoc, and 10+ formats |
| Syntax-Aware Chunking | Split code by semantic boundaries for RAG pipelines |
| Diagnostics | Parse errors with line/column positions |

Powered by tree-sitter-language-pack.

Output Formats (6)

| Format | Use case | Example |
| --- | --- | --- |
| Plain | Raw text, no markup | "Chapter 1\nIntroduction" |
| Markdown | Readable, structured, RAG-friendly | "# Chapter 1\n## Introduction" |
| Djot | Modern lightweight markup | Similar to Markdown but stricter |
| HTML | Styled, browser-ready | <h1>Chapter 1</h1> |
| JSON | Machine-readable tree structure | Hierarchical sections with heading levels |
| Structured | OCR metadata, bounding boxes | JSON with elements[] containing {text, bbox, confidence} |
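
One way a consumer might use the Structured format is to drop low-confidence OCR elements before further processing. The sketch below assumes a hypothetical `Element` struct mirroring the `{text, bbox, confidence}` fields named above. The `[x0, y0, x1, y1]` bounding-box layout and the 0.0 to 1.0 confidence range are assumptions, not taken from the Xberg schema.

```rust
/// Hypothetical mirror of one entry in `elements[]`.
/// The `[x0, y0, x1, y1]` bbox layout is an assumption for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct Element {
    pub text: String,
    pub bbox: [f32; 4],
    pub confidence: f32,
}

/// Keep elements at or above `min_confidence` and join their text in order.
pub fn confident_text(elements: &[Element], min_confidence: f32) -> String {
    elements
        .iter()
        .filter(|e| e.confidence >= min_confidence)
        .map(|e| e.text.as_str())
        .collect::<Vec<_>>()
        .join(" ")
}
```

In practice the elements would be deserialized from the JSON output rather than built by hand, and the threshold would be tuned per OCR backend.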

Deployment Modes

| Mode | Command | Transport | Use case |
| --- | --- | --- | --- |
| Library | xberg::extract() | Async functions | Embed in your application |
| CLI | xberg extract document.pdf | 12 commands | Scripts, batch jobs, CI/CD |
| REST API | xberg serve | HTTP POST | Microservice, serverless deployment |
| MCP Server | xberg mcp | stdio or HTTP | Claude, Cursor, IDE agents |
| Docker | docker run ghcr.io/xberg-io/xberg | All modes | Container deployment |

OCR Backends

  • Tesseract — Native C FFI (Linux/macOS/Windows) and WASM (browser)
  • PaddleOCR — ONNX Runtime, mobile-optimized models
  • Candle — Pure Rust, CPU-only, lightweight
  • VLM — GPT-4 Vision, Claude Vision, Gemini Vision, or 143 providers via liter-llm

Fallback chains. Extensible via plugin system.
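
A fallback chain can be pictured as trying backends in order until one produces a good enough result. The sketch below is illustrative only: the `OcrBackend` trait, `OcrResult`, `FixedBackend`, and `run_chain` are hypothetical and are not Xberg's plugin interface.

```rust
/// Result of one OCR attempt (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct OcrResult {
    pub text: String,
    pub confidence: f32,
}

/// Hypothetical backend interface; not Xberg's actual plugin trait.
pub trait OcrBackend {
    fn name(&self) -> &str;
    fn recognize(&self, image: &[u8]) -> Result<OcrResult, String>;
}

/// Stub backend that always returns the same outcome; for illustration only.
pub struct FixedBackend {
    pub label: &'static str,
    pub outcome: Result<OcrResult, String>,
}

impl OcrBackend for FixedBackend {
    fn name(&self) -> &str {
        self.label
    }

    fn recognize(&self, _image: &[u8]) -> Result<OcrResult, String> {
        self.outcome.clone()
    }
}

/// Try each backend in order and return the first result that meets
/// `min_confidence`, together with the backend's name. If no backend
/// qualifies, return one failure message per backend tried.
pub fn run_chain(
    backends: &[Box<dyn OcrBackend>],
    image: &[u8],
    min_confidence: f32,
) -> Result<(String, OcrResult), Vec<String>> {
    let mut failures = Vec::new();
    for backend in backends {
        match backend.recognize(image) {
            Ok(result) if result.confidence >= min_confidence => {
                return Ok((backend.name().to_string(), result));
            }
            Ok(result) => failures.push(format!(
                "{}: confidence {:.2} below threshold",
                backend.name(),
                result.confidence
            )),
            Err(err) => failures.push(format!("{}: {}", backend.name(), err)),
        }
    }
    Err(failures)
}
```

Returning the collected failures, rather than only the last error, makes it easier to see why every backend in the chain was rejected.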

Embeddings

Local (ONNX Runtime):

  • Preset models: fast, balanced (default), quality, multilingual
  • Dimensions: 384, 768, 1024

Provider-hosted:

  • OpenAI, Anthropic, Google, Hugging Face, Mistral, Cohere, and 143 providers total
  • Via liter-llm integration

Reranking:

  • Local ONNX rerankers (cross-encoder models)
  • Provider-hosted: Cohere Rerank, others
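
Once documents and a query are embedded, search typically reduces to ranking by vector similarity. The following is a minimal sketch of cosine-similarity ranking over plain `f32` slices. The function names are hypothetical and nothing here calls Xberg or ONNX Runtime. Real embeddings would have one of the dimensions listed above rather than the toy sizes a quick test might use.

```rust
/// Cosine similarity of two equal-length vectors.
/// Returns 0.0 if either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "embedding dimensions must match");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Return document indices ordered from most to least similar to the query.
pub fn rank_by_similarity(query: &[f32], docs: &[Vec<f32>]) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = docs
        .iter()
        .enumerate()
        .map(|(i, doc)| (i, cosine_similarity(query, doc)))
        .collect();
    // Sort descending by score; `total_cmp` gives a total order over f32.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().map(|(i, _)| i).collect()
}
```

A reranker would usually take the top results from this first pass and rescore them with a cross-encoder, which reads the query and document together instead of comparing precomputed vectors.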

Structured LLM Extraction

Local engines: Ollama, LM Studio, vLLM

Remote: OpenAI, Anthropic, Google, Mistral, Cohere, and 143 providers via liter-llm

Schema validation. Temperature, top-p, frequency penalty tuning.

Enrichment

  • NER — GLiNER or LLM-based entity recognition
  • Redaction — Mask PII (phone, email, SSN, credit card, addresses)
  • Summarization — Document and section summaries via LLM
  • Translation — Multi-language via LLM
  • Page Classification — Tag document pages (cover, toc, content, etc.)
  • QR Code Detection — Extract and decode QR codes from images
  • Keyword Extraction — YAKE or RAKE algorithms
  • Language Detection — Detect document language
  • Layout Detection — RT-DETR + TATR models for document structure
  • Table Extraction — Cell-level structure and content
  • Token Reduction — TOON wire format (~30–50% fewer tokens than JSON)
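
To give a feel for why a tabular wire format can be smaller than JSON, the sketch below serializes the same uniform rows both ways. It follows the general TOON idea of declaring field names once in a header, but it is a simplified approximation, not a conforming TOON encoder: it assumes every value is a string with no commas, quotes, or newlines, and the function names are hypothetical.

```rust
/// Serialize uniform rows as a JSON object holding an array of objects.
/// Assumes values need no escaping (a simplification for this sketch).
pub fn to_json(name: &str, fields: &[&str], rows: &[Vec<String>]) -> String {
    let objects: Vec<String> = rows
        .iter()
        .map(|row| {
            let pairs: Vec<String> = fields
                .iter()
                .zip(row)
                .map(|(field, value)| format!("\"{}\":\"{}\"", field, value))
                .collect();
            format!("{{{}}}", pairs.join(","))
        })
        .collect();
    format!("{{\"{}\":[{}]}}", name, objects.join(","))
}

/// Serialize the same rows in a TOON-like tabular layout: one header line
/// with the row count and field names, then one comma-separated line per row.
pub fn to_toon_like(name: &str, fields: &[&str], rows: &[Vec<String>]) -> String {
    let mut out = format!("{}[{}]{{{}}}:", name, rows.len(), fields.join(","));
    for row in rows {
        out.push_str("\n  ");
        out.push_str(&row.join(","));
    }
    out
}
```

Because field names appear once instead of once per row, the saving grows with the number of rows. Byte length is only a rough proxy for token count, so actual savings depend on the tokenizer and the shape of the data.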

CLI Reference

All 12 commands
| Command | Subcommands | Purpose |
| --- | --- | --- |
| extract | | Extract text from a single document (path, URL, or stdin) |
| batch | | Extract from multiple documents in parallel |
| detect | | Identify MIME type of a file |
| formats | | List all 96 supported formats and MIME types |
| version | | Show Xberg version |
| cache | stats, clear, manifest, warm | Manage extraction cache and models |
| serve | | Start REST API server (default: http://127.0.0.1:8000) |
| mcp | | Start MCP server (stdio or HTTP transport) |
| api | schema | Output OpenAPI 3.1 specification |
| embed | | Generate embeddings for text (local or provider-hosted) |
| chunk | | Split text into chunks (text, markdown, YAML, or semantic) |
| completions | | Generate shell completion scripts |

Run xberg --help or xberg <command> --help for detailed options.
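
The batch command and the max_concurrent_extractions setting both describe bounded parallelism: many documents, but only so many in flight at once. The sketch below shows one common way to get that with standard-library scoped threads. `process_bounded` is a hypothetical helper, not Xberg's scheduler, and it does not model per-file timeouts.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Run `work` over every item using at most `max_concurrent` worker threads.
/// Results are returned in input order.
pub fn process_bounded<T, R, F>(items: &[T], max_concurrent: usize, work: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    // Shared cursor: each worker claims the next unprocessed index.
    let next = AtomicUsize::new(0);
    let results: Mutex<Vec<Option<R>>> =
        Mutex::new((0..items.len()).map(|_| None).collect());
    // Never spawn more workers than items, and always at least one.
    let workers = max_concurrent.max(1).min(items.len().max(1));

    thread::scope(|scope| {
        for _ in 0..workers {
            scope.spawn(|| loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= items.len() {
                    break;
                }
                // Do the work outside the lock; lock only to store the result.
                let output = work(&items[i]);
                results.lock().unwrap()[i] = Some(output);
            });
        }
    });

    results
        .into_inner()
        .unwrap()
        .into_iter()
        .map(|slot| slot.expect("every index is processed exactly once"))
        .collect()
}
```

A shared index lets fast workers pick up more items than slow ones, which suits batches that mix small and large files. An async runtime with a semaphore would be the equivalent pattern for I/O-bound work.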


Documentation

Full guides, API references for every binding, format reference, and configuration docs live at xberg.io.


Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

Join our Discord community for questions and discussion.


Part of Xberg.dev

Xberg is one of six open-source projects from Kreuzberg, Inc.:

  • Xberg: document intelligence, with text, tables, and metadata from 96 formats and optional OCR.
  • Xberg Enterprise — managed extraction API with SDKs, dashboards, and observability.
  • crawlberg — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
  • html-to-markdown — fast, lossless HTML→Markdown engine.
  • liter-llm — universal LLM API client with native bindings for 14 languages and 143 providers.
  • tree-sitter-language-pack — tree-sitter grammars and code-intelligence primitives.
  • alef — the polyglot binding generator that produces every per-language binding across the 5 polyglot repos.

License

MIT License (MIT) — see LICENSE for details.

