Star 历史趋势
数据来源: GitHub API · 生成自 Stargazers.cn
README.md
OpenKB (by PageIndex)

OpenKB — Open LLM Knowledge Base

Scale to long documents  •  Reasoning-based retrieval  •  Native multi-modality  •  No Vector DB


📑 What is OpenKB

OpenKB (Open Knowledge Base) is an open-source system (in CLI) that compiles raw documents into a structured, interlinked wiki-style knowledge base using LLMs, powered by PageIndex for vectorless long document retrieval.

The idea is based on a concept described by Andrej Karpathy: LLMs generate summaries, concept pages, and cross-references, all maintained automatically. Knowledge compounds over time instead of being re-derived on every query.

Why not traditional RAG?

Traditional RAG rediscovers knowledge from scratch on every query. Nothing accumulates. OpenKB compiles knowledge once into a persistent wiki, then keeps it current. Cross-references already exist. Contradictions are flagged. Synthesis reflects everything consumed.

Features

  • Broad format support — PDF, Word, Markdown, PowerPoint, HTML, Excel, text, and more via markitdown
  • Scale to long documents — Long and complex documents are handled via PageIndex tree indexing, enabling accurate, vectorless long-context retrieval
  • Native multi-modality — Retrieves and understands figures, tables, and images, not just text
  • Compiled Wiki — LLM manages and compiles your documents into summaries, concept pages, and cross-links, all kept in sync
  • Query — Ask questions (one-off) against your wiki. The LLM navigates your compiled knowledge to answer
  • Interactive Chat — Multi-turn conversations with persisted sessions you can resume across runs
  • Lint — Health checks find contradictions, gaps, orphans, and stale content
  • Watch mode — Drop files into raw/, wiki updates automatically
  • Obsidian compatible — Wiki is plain .md files with [[wikilinks]]. Open in Obsidian for graph view and browsing

🚀 Getting Started

Install

pip install openkb
Other install options
  • Latest from GitHub:

    pip install git+https://github.com/VectifyAI/OpenKB.git
  • Install from source (editable, for development):

    git clone https://github.com/VectifyAI/OpenKB.git cd OpenKB pip install -e .

Quick Start

# 1. Create a directory for your knowledge base mkdir my-kb && cd my-kb # 2. Initialize the knowledge base openkb init # 3. Add documents openkb add paper.pdf openkb add ~/papers/ # Add a whole directory # 4. Ask a question openkb query "What are the main findings?" # 5. Or chat interactively openkb chat

Set up your LLM

OpenKB comes with multi-LLM support (e.g., OpenAI, Claude, Gemini) via LiteLLM (pinned to a safe version).

Set your model during openkb init, or in .openkb/config.yaml, using provider/model LiteLLM format (like anthropic/claude-sonnet-4-6). OpenAI models can omit the prefix (like gpt-5.4).

Create a .env file with your LLM API key:

LLM_API_KEY=your_llm_api_key

🧩 How OpenKB Works

Architecture

raw/                              You drop files here
 │
 ├─ Short docs ──→ markitdown ──→ LLM reads full text
 │                                     │
 ├─ Long PDFs ──→ PageIndex ────→ LLM reads document trees
 │                                     │
 │                                     ▼
 │                         Wiki Compilation (using LLM)
 │                                     │
 ▼                                     ▼
wiki/
 ├── index.md            Knowledge base overview
 ├── log.md              Operations timeline
 ├── AGENTS.md           Wiki schema (LLM instructions)
 ├── sources/            Full-text conversions
 ├── summaries/          Per-document summaries
 ├── concepts/           Cross-document synthesis ← the good stuff
 ├── explorations/       Saved query results
 └── reports/            Lint reports

Short vs. Long Document Handling

Short documentsLong documents (PDF ≥ 20 pages)
Convertmarkitdown → MarkdownPageIndex → tree index + summaries
ImagesExtracted inline (pymupdf)Extracted by PageIndex
LLM readsFull textDocument trees
Resultsummary + conceptssummary + concepts

Short docs are read in full by the LLM. Long PDFs are indexed by PageIndex into a hierarchical tree with summaries. The LLM reads the tree instead of the full text, enabling better retrieval from long documents.

Knowledge Compilation

When you add a document, the LLM:

  1. Generates a summary page
  2. Reads existing concept pages
  3. Creates or updates concepts with cross-document synthesis
  4. Updates the index and log

A single source might touch 10-15 wiki pages. Knowledge accumulates: each document enriches the existing wiki rather than sitting in isolation.

⚙️ Usage

Commands

CommandDescription
openkb initInitialize a new knowledge base (interactive)
openkb add <file_or_dir>Add documents and compile to wiki
openkb query "question"Ask a question over the knowledge base (use --save to save the answer to wiki/explorations/)
openkb chatStart an interactive multi-turn chat (use --resume, --list, --delete to manage sessions)
openkb watchWatch raw/ and auto-compile new files
openkb lintRun structural + knowledge health checks
openkb listList indexed documents and concepts
openkb statusShow knowledge base stats

Interactive Chat

openkb chat opens an interactive chat session over your wiki knowledge base. Unlike the one-shot openkb query, each turn carries the conversation history, so you can dig into a topic without re-typing context.

openkb chat # start a new session openkb chat --resume # resume the most recent session openkb chat --resume 20260411 # resume by id (unique prefix works) openkb chat --list # list all sessions openkb chat --delete <id> # delete a session

Inside a chat, type / to access slash commands (Tab to complete):

  • /help — list available commands
  • /status — show knowledge base status
  • /list — list all documents
  • /add <path> — add a document or directory without leaving the chat
  • /save [name] — export the transcript to wiki/explorations/
  • /clear — start a fresh session (the current one stays on disk)
  • /lint — run knowledge base lint
  • /exit — exit (Ctrl-D also works)

Configuration

Settings are initialized by openkb init, and stored in .openkb/config.yaml:

model: gpt-5.4 # LLM model (any LiteLLM-supported provider) language: en # Wiki output language pageindex_threshold: 20 # PDF pages threshold for PageIndex

Model names use provider/model LiteLLM format (OpenAI models can omit the prefix):

ProviderModel example
OpenAIgpt-5.4
Anthropicanthropic/claude-sonnet-4-6
Geminigemini/gemini-3.1-pro-preview

PageIndex Integration

Long documents are challenging for LLMs due to context limits, context rot, and summarization loss. PageIndex solves this with vectorless, reasoning-based retrieval — building a hierarchical tree index that lets LLMs reason over the index for context-aware retrieval.

PageIndex runs locally by default using the open-source version, with no external dependencies required.

Optional: Cloud Support

For large or complex PDFs, PageIndex Cloud can be used to access additional capabilities, including:

  • OCR support for scanned PDFs (via hosted VLM models)
  • Faster structure generation
  • Scalable indexing for large documents

Set PAGEINDEX_API_KEY in your .env to enable cloud features:

PAGEINDEX_API_KEY=your_pageindex_api_key

AGENTS.md

The wiki/AGENTS.md file defines wiki structure and conventions. It's the LLM's instruction manual for maintaining the wiki. Customize it to change how your wiki is organized.

At runtime, the LLM reads AGENTS.md from disk, so your edits take effect immediately.

Using with Obsidian

OpenKB's wiki is a directory of Markdown files with [[wikilinks]]. Obsidian renders it natively.

  1. Open wiki/ as an Obsidian vault
  2. Browse summaries, concepts, and explorations
  3. Use graph view to see knowledge connections
  4. Use Obsidian Web Clipper to add web articles to raw/

🧭 Learn More

Compared to Karpathy's Approach

Karpathy's workflowOpenKB
Short documentsLLM reads directlymarkitdown → LLM reads
Long documentsContext limits, context rotPageIndex tree index
Supported formatsWeb clipper → .mdPDF, Word, PPT, Excel, HTML, text, CSV, .md
Wiki compilationLLM agentLLM agent (same)
Q&AQuery over wikiWiki + PageIndex retrieval

The Stack

  • PageIndex — Vectorless, reasoning-based document indexing and retrieval
  • markitdown — Universal file-to-markdown conversion
  • OpenAI Agents SDK — Agent framework (supports non-OpenAI models via LiteLLM)
  • LiteLLM — Multi-provider LLM gateway
  • Click — CLI framework
  • watchdog — Filesystem monitoring

Roadmap

  • Extend long document handling to non-PDF formats
  • Scale to large document collections with nested folder support
  • Hierarchical concept (topic) indexing for massive knowledge bases
  • Database-backed storage engine
  • Web UI for browsing and managing wikis

Contributing

Contributions are welcome! Please submit a pull request, or open an issue for bugs or feature requests. For larger changes, consider opening an issue first to discuss the approach.

License

Apache 2.0. See LICENSE.

Support Us

If you find OpenKB useful, please give us a star 🌟 — and check out PageIndex too!

TwitterLinkedInContact Us

关于 About

OpenKB: Open LLM Knowledge Base
agentsaiknowledge-basellmragretrieval

语言 Languages

Python100.0%

提交活跃度 Commit Activity

代码提交热力图
过去 52 周的开发活跃度
90
Total Commits
峰值: 46次/周
Less
More

核心贡献者 Contributors