
LiteParse

LiteParse is a standalone OSS PDF parsing tool focused exclusively on fast and light parsing. It provides high-quality spatial text parsing with bounding boxes, without proprietary LLM features or cloud dependencies. Everything runs locally on your machine.

Hitting the limits of local parsing? For complex documents (dense tables, multi-column layouts, charts, handwritten text, or scanned PDFs), you'll get significantly better results with LlamaParse, our cloud-based document parser built for production document pipelines. LlamaParse handles the hard stuff so your models see clean, structured data and markdown.

👉 Sign up for LlamaParse free

Overview

  • Fast Text Parsing: Spatial text parsing using PDF.js
  • Flexible OCR System:
    • Built-in: Tesseract.js (zero setup, works out of the box!)
    • HTTP Servers: Plug in any OCR server (EasyOCR, PaddleOCR, custom)
    • Standard API: Simple, well-defined OCR API specification
  • Screenshot Generation: Generate high-quality page screenshots for LLM agents
  • Multiple Output Formats: JSON and Text
  • Bounding Boxes: Precise text positioning information
  • Standalone Binary: No cloud dependencies, runs entirely locally
  • Multi-platform: Linux, macOS (Intel/ARM), Windows

Installation

CLI Tool

Option 1: Global Install (Recommended)

Install globally via npm to use the lit command anywhere:

npm i -g @llamaindex/liteparse

Then use it:

lit parse document.pdf
lit screenshot document.pdf

On macOS and Linux, liteparse can also be installed via Homebrew:

brew tap run-llama/liteparse
brew install llamaindex-liteparse

Option 2: Install from Source

You can clone the repo and install the CLI globally from source:

git clone https://github.com/run-llama/liteparse.git
cd liteparse
# Install dependencies before building
npm install
npm run build
npm pack
npm install -g ./liteparse-*.tgz

Agent Skill

You can use liteparse as an agent skill, downloading it with the skills CLI tool:

npx skills add run-llama/llamaparse-agent-skills --skill liteparse

Alternatively, copy the SKILL.md file into your own skills setup.

Usage

Parse Files

# Basic parsing
lit parse document.pdf

# Parse with a specific format
lit parse document.pdf --format json -o output.json

# Parse specific pages
lit parse document.pdf --target-pages "1-5,10,15-20"

# Parse without OCR
lit parse document.pdf --no-ocr

# Parse a remote PDF
curl -sL https://example.com/report.pdf | lit parse -
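
The --target-pages value mixes single pages and ranges separated by commas. The sketch below shows one way such a string could be expanded into a page list. `expandTargetPages` is a hypothetical helper written for illustration, not LiteParse's internal parser, and it assumes 1-based page numbers with inclusive ranges.

```typescript
// Hypothetical helper: expands a page spec such as "1-5,10,15-20"
// into a sorted list of unique page numbers.
// Assumption: pages are 1-based and ranges are inclusive.
function expandTargetPages(spec: string): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    const token = part.trim();
    if (token === "") continue; // tolerate stray commas
    const match = /^(\d+)(?:-(\d+))?$/.exec(token);
    if (!match) {
      throw new Error(`Invalid page token: "${token}"`);
    }
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${token}"`);
    }
    for (let page = start; page <= end; page++) {
      pages.add(page);
    }
  }
  return Array.from(pages).sort((a, b) => a - b);
}

console.log(JSON.stringify(expandTargetPages("1-3,10,15-16"))); // → [1,2,3,10,15,16]
```

A helper like this is useful for checking a page spec before handing it to the CLI.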

Batch Parsing

You can also parse an entire directory of documents:

lit batch-parse ./input-directory ./output-directory

Generate Screenshots

Screenshots are essential for LLM agents to extract visual information that text alone cannot capture.

# Screenshot all pages
lit screenshot document.pdf -o ./screenshots

# Screenshot specific pages
lit screenshot document.pdf --target-pages "1,3,5" -o ./screenshots

# Custom DPI
lit screenshot document.pdf --dpi 300 -o ./screenshots

# Screenshot page range
lit screenshot document.pdf --target-pages "1-10" -o ./screenshots
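
To estimate how large a screenshot will be at a given --dpi, the usual PDF convention of 72 points per inch can be applied. The sketch below assumes the renderer scales a page by dpi / 72 and rounds to whole pixels. `screenshotSize` is a hypothetical helper, and the real output may differ slightly depending on how LiteParse renders pages.

```typescript
// Default PDF user space uses 72 points per inch.
const POINTS_PER_INCH = 72;

interface PixelSize {
  width: number;
  height: number;
}

// Hypothetical helper: estimates the pixel size of a rendered page.
// Assumption: the renderer scales by dpi / 72 and rounds to whole pixels.
function screenshotSize(widthPt: number, heightPt: number, dpi: number): PixelSize {
  const scale = dpi / POINTS_PER_INCH;
  return {
    width: Math.round(widthPt * scale),
    height: Math.round(heightPt * scale),
  };
}

// US Letter is 612 x 792 points.
const letterAt150 = screenshotSize(612, 792, 150);
console.log(letterAt150.width, letterAt150.height); // → 1275 1650

const letterAt300 = screenshotSize(612, 792, 300);
console.log(letterAt300.width, letterAt300.height); // → 2550 3300
```

Doubling the DPI doubles each dimension, so the pixel count (and roughly the file size) grows fourfold.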

Library Usage

Install as a dependency in your project:

npm install @llamaindex/liteparse
# or
pnpm add @llamaindex/liteparse

import { LiteParse } from '@llamaindex/liteparse';

const parser = new LiteParse({ ocrEnabled: true });
const result = await parser.parse('document.pdf');
console.log(result.text);

Buffer / Uint8Array Input

You can pass raw bytes directly instead of a file path, which is useful for remote files:

import { LiteParse } from '@llamaindex/liteparse';
import { readFile } from 'fs/promises';

const parser = new LiteParse();

// From a file read
const pdfBytes = await readFile('document.pdf');
const result = await parser.parse(pdfBytes);

// From an HTTP response
const response = await fetch('https://example.com/document.pdf');
const buffer = Buffer.from(await response.arrayBuffer());
const result2 = await parser.parse(buffer);

Non-PDF buffers (images, Office documents) are written to a temp directory for format conversion. Screenshots also work with buffer input:

const screenshots = await parser.screenshot(pdfBytes, [1, 2, 3]);
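
When working with raw bytes, it can help to know up front whether a buffer is a PDF or will go through format conversion. A minimal sketch, assuming detection by the standard `%PDF-` header at offset 0: `looksLikePdf` is a hypothetical helper, and how LiteParse itself distinguishes formats is not documented here.

```typescript
// ASCII bytes for "%PDF-", the header that starts a PDF file.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Hypothetical helper: true if the buffer starts with the PDF header.
// Assumption: the header sits at offset 0; files with leading junk
// bytes before the header would be reported as non-PDF.
function looksLikePdf(bytes: Uint8Array): boolean {
  if (bytes.length < PDF_MAGIC.length) return false;
  return PDF_MAGIC.every((byte, i) => bytes[i] === byte);
}

// "%PDF-1" followed by nothing else
console.log(looksLikePdf(new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d, 0x31]))); // → true

// PNG signature
console.log(looksLikePdf(new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]))); // → false
```

Because a Node.js Buffer is a Uint8Array, the same check works on the result of readFile or on a fetched response body.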

CLI Options

Parse Command

$ lit parse --help
Usage: lit parse [options] <file>

Parse a document file (PDF, DOCX, XLSX, PPTX, images, etc.)

Options:
  -o, --output <file>     Output file path
  --format <format>       Output format: json|text (default: "text")
  --ocr-server-url <url>  HTTP OCR server URL (uses Tesseract if not provided)
  --no-ocr                Disable OCR
  --ocr-language <lang>   OCR language(s) (default: "en")
  --num-workers <n>       Number of pages to OCR in parallel (default: CPU cores - 1)
  --max-pages <n>         Max pages to parse (default: "10000")
  --target-pages <pages>  Target pages (e.g., "1-5,10,15-20")
  --dpi <dpi>             DPI for rendering (default: "150")
  --no-precise-bbox       Disable precise bounding boxes
  --preserve-small-text   Preserve very small text
  --password <password>   Password for encrypted/protected documents
  --config <file>         Config file (JSON)
  -q, --quiet             Suppress progress output
  -h, --help              display help for command

Batch Parse Command

$ lit batch-parse --help
Usage: lit batch-parse [options] <input-dir> <output-dir>

Parse multiple documents in batch mode (reuses PDF engine for efficiency)

Options:
  --format <format>       Output format: json|text (default: "text")
  --ocr-server-url <url>  HTTP OCR server URL (uses Tesseract if not provided)
  --no-ocr                Disable OCR
  --ocr-language <lang>   OCR language(s) (default: "en")
  --num-workers <n>       Number of pages to OCR in parallel (default: CPU cores - 1)
  --max-pages <n>         Max pages to parse per file (default: "10000")
  --dpi <dpi>             DPI for rendering (default: "150")
  --no-precise-bbox       Disable precise bounding boxes
  --recursive             Recursively search input directory
  --extension <ext>       Only process files with this extension (e.g., ".pdf")
  --password <password>   Password for encrypted/protected documents (applied to all files)
  --config <file>         Config file (JSON)
  -q, --quiet             Suppress progress output
  -h, --help              display help for command

Screenshot Command

$ lit screenshot --help
Usage: lit screenshot [options] <file>

Generate screenshots of PDF pages

Options:
  -o, --output-dir <dir>  Output directory for screenshots (default: "./screenshots")
  --target-pages <pages>  Page numbers to screenshot (e.g., "1,3,5" or "1-5")
  --dpi <dpi>             DPI for rendering (default: "150")
  --format <format>       Image format: png|jpg (default: "png")
  --password <password>   Password for encrypted/protected documents
  --config <file>         Config file (JSON)
  -q, --quiet             Suppress progress output
  -h, --help              display help for command

OCR Setup

Default: Tesseract.js

# Tesseract is enabled by default
lit parse document.pdf

# Specify language
lit parse document.pdf --ocr-language fra

# Disable OCR
lit parse document.pdf --no-ocr

By default, Tesseract.js downloads language data from the internet on first use. For offline or air-gapped environments, set the TESSDATA_PREFIX environment variable to a directory containing pre-downloaded .traineddata files:

export TESSDATA_PREFIX=/path/to/tessdata
lit parse document.pdf --ocr-language eng

You can also pass tessdataPath in the library config:

const parser = new LiteParse({ tessdataPath: '/path/to/tessdata' });

Optional: HTTP OCR Servers

For higher accuracy or better performance, you can use an HTTP OCR server. We provide ready-to-use example wrappers for popular OCR engines, including EasyOCR and PaddleOCR.

You can integrate any OCR service by implementing the simple LiteParse OCR API specification (see OCR_API_SPEC.md).

The API requires:

  • POST /ocr endpoint
  • Accepts file and language parameters
  • Returns JSON: { results: [{ text, bbox: [x1,y1,x2,y2], confidence }] }
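
The response shape above can be written as a TypeScript type with a small runtime check, which is handy when testing a custom OCR server. This sketch is based only on the three fields listed here. The type and function names are hypothetical, and OCR_API_SPEC.md remains the authoritative definition.

```typescript
// One recognized text span, following the fields listed above.
interface OcrResult {
  text: string;
  bbox: [number, number, number, number]; // [x1, y1, x2, y2]
  confidence: number;
}

interface OcrResponse {
  results: OcrResult[];
}

// Hypothetical validator: narrows parsed JSON to OcrResponse.
function isOcrResponse(value: unknown): value is OcrResponse {
  if (typeof value !== "object" || value === null) return false;
  const results = (value as { results?: unknown }).results;
  if (!Array.isArray(results)) return false;
  return results.every((item: unknown) => {
    if (typeof item !== "object" || item === null) return false;
    const r = item as { text?: unknown; bbox?: unknown; confidence?: unknown };
    return (
      typeof r.text === "string" &&
      typeof r.confidence === "number" &&
      Array.isArray(r.bbox) &&
      r.bbox.length === 4 &&
      r.bbox.every((n: unknown) => typeof n === "number")
    );
  });
}

const sampleResponse: unknown = JSON.parse(
  '{"results":[{"text":"Hello","bbox":[10,20,110,40],"confidence":0.98}]}'
);
console.log(isOcrResponse(sampleResponse)); // → true
```

Running a check like this against your server's output catches malformed bounding boxes before they reach the parser.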

See the example servers in ocr/easyocr/ and ocr/paddleocr/ as templates.


Multi-Format Input Support

LiteParse automatically converts various document formats to PDF before parsing, so it is not limited to PDF input the way PDF-only parsing tools are.

Supported Input Formats

Office Documents (via LibreOffice)

  • Word: .doc, .docx, .docm, .odt, .rtf
  • PowerPoint: .ppt, .pptx, .pptm, .odp
  • Spreadsheets: .xls, .xlsx, .xlsm, .ods, .csv, .tsv

Install LibreOffice, and LiteParse will automatically convert these formats to PDF for parsing:

# macOS
brew install --cask libreoffice

# Ubuntu/Debian
apt-get install libreoffice

# Windows (might require admin permissions)
choco install libreoffice-fresh

On Windows, you might need to add the directory containing the LibreOffice CLI executable (generally C:\Program Files\LibreOffice\program) to the PATH environment variable and restart the machine.

Images (via ImageMagick)

  • Formats: .jpg, .jpeg, .png, .gif, .bmp, .tiff, .webp, .svg

Just install ImageMagick and LiteParse will convert images to PDF for parsing (with OCR):

# macOS
brew install imagemagick

# Ubuntu/Debian
apt-get install imagemagick

# Windows (might require admin permissions)
choco install imagemagick.app
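
The routing described in this section can be pictured as a lookup from file extension to conversion tool. The sketch below uses only the extensions listed above. `converterFor` and the `Converter` labels are hypothetical and do not reflect LiteParse's actual internals, and case-insensitive matching is an assumption.

```typescript
type Converter = "none" | "libreoffice" | "imagemagick" | "unsupported";

// Extensions taken from the "Office Documents" list above.
const OFFICE_EXTENSIONS = new Set<string>([
  ".doc", ".docx", ".docm", ".odt", ".rtf",
  ".ppt", ".pptx", ".pptm", ".odp",
  ".xls", ".xlsx", ".xlsm", ".ods", ".csv", ".tsv",
]);

// Extensions taken from the "Images" list above.
const IMAGE_EXTENSIONS = new Set<string>([
  ".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp", ".svg",
]);

// Hypothetical helper: picks the tool needed to turn a file into a PDF.
function converterFor(fileName: string): Converter {
  const dot = fileName.lastIndexOf(".");
  // Assumption: extensions are matched case-insensitively.
  const ext = dot >= 0 ? fileName.slice(dot).toLowerCase() : "";
  if (ext === ".pdf") return "none"; // already a PDF, parsed directly
  if (OFFICE_EXTENSIONS.has(ext)) return "libreoffice";
  if (IMAGE_EXTENSIONS.has(ext)) return "imagemagick";
  return "unsupported";
}

console.log(converterFor("report.docx")); // → libreoffice
console.log(converterFor("scan.PNG"));    // → imagemagick
console.log(converterFor("paper.pdf"));   // → none
```

A lookup like this also makes it easy to tell which optional dependency a given input directory needs before running a batch.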

Environment Variables

Variable | Description
TESSDATA_PREFIX | Path to a directory containing Tesseract .traineddata files. Used for offline/air-gapped environments where Tesseract.js cannot download language data from the internet.
LITEPARSE_TMPDIR | Override the temp directory used for format conversion and intermediate files. Defaults to the OS temp directory (os.tmpdir()). Useful in containerized or read-only filesystem environments.
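
The fallback described for LITEPARSE_TMPDIR amounts to "use the override if set, otherwise the OS temp directory". A minimal sketch, assuming an empty value is treated as unset (an assumption, not documented behavior): `resolveTmpDir` is a hypothetical helper that takes the environment and the OS default as arguments so it can be tested in isolation.

```typescript
// Hypothetical helper: picks the temp directory for intermediate files.
// Assumption: an empty or whitespace-only override counts as unset.
function resolveTmpDir(
  env: Record<string, string | undefined>,
  osTmpDir: string
): string {
  const override = env["LITEPARSE_TMPDIR"];
  return override !== undefined && override.trim() !== "" ? override : osTmpDir;
}

console.log(resolveTmpDir({ LITEPARSE_TMPDIR: "/work/tmp" }, "/tmp")); // → /work/tmp
console.log(resolveTmpDir({}, "/tmp")); // → /tmp
```

In a Node.js program, the arguments would typically be process.env and the value of os.tmpdir().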

Configuration

You can configure parsing options via CLI flags or a JSON config file. The config file allows you to set sensible defaults and override as needed.

Config File Example

Create a liteparse.config.json file:

{
  "ocrLanguage": "en",
  "ocrEnabled": true,
  "maxPages": 1000,
  "dpi": 150,
  "outputFormat": "json",
  "preciseBoundingBox": true,
  "preserveVerySmallText": false,
  "password": "optional_password"
}

For HTTP OCR servers, just add ocrServerUrl:

{
  "ocrServerUrl": "http://localhost:8828/ocr",
  "ocrLanguage": "en",
  "outputFormat": "json"
}

Use with:

lit parse document.pdf --config liteparse.config.json
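
The "set defaults, then override" behavior can be sketched as a layered merge. The precedence shown here (built-in defaults, then config file, then CLI flags) is an assumption for illustration. Field names come from the config example above, default values follow the CLI help text, and `resolveConfig` is a hypothetical function rather than LiteParse's own loader.

```typescript
interface LiteParseConfig {
  ocrLanguage: string;
  ocrEnabled: boolean;
  ocrServerUrl?: string;
  maxPages: number;
  dpi: number;
  outputFormat: "json" | "text";
  preciseBoundingBox: boolean;
  preserveVerySmallText: boolean;
  password?: string;
}

// Defaults as listed in `lit parse --help`.
const DEFAULTS: LiteParseConfig = {
  ocrLanguage: "en",
  ocrEnabled: true,
  maxPages: 10000,
  dpi: 150,
  outputFormat: "text",
  preciseBoundingBox: true,
  preserveVerySmallText: false,
};

// Hypothetical merge. Assumption: later layers win, so CLI flags
// override the config file, which overrides the built-in defaults.
function resolveConfig(
  fileConfig: Partial<LiteParseConfig>,
  cliFlags: Partial<LiteParseConfig>
): LiteParseConfig {
  return { ...DEFAULTS, ...fileConfig, ...cliFlags };
}

const resolved = resolveConfig(
  { outputFormat: "json", maxPages: 1000 }, // from liteparse.config.json
  { dpi: 300 }                              // from --dpi 300
);
console.log(resolved.dpi, resolved.outputFormat, resolved.maxPages); // → 300 json 1000
```

Only keys that are actually present should be passed in each layer, since a key explicitly set to undefined would overwrite the layer below it.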

Development

We provide a fairly detailed AGENTS.md/CLAUDE.md, which we recommend using when developing with coding agents.

# Install dependencies
npm install

# Build TypeScript (Linux/macOS)
npm run build

# Build TypeScript (Windows)
npm run build:windows

# Watch mode
npm run dev

# Test parsing
npm test

License

Apache 2.0

Credits

Built on top of:
