omnicache-ai

Unified multi-layer caching for AI Agent pipelines.
Drop it in front of any LLM call, embedding, retrieval query, or agent workflow
to eliminate redundant API calls and cut latency and cost.

Why omnicache-ai?

Every AI agent pipeline makes the same expensive calls repeatedly:

Without caching	With omnicache-ai
Every LLM call billed at full token cost	Identical prompts returned instantly, zero tokens
Embeddings re-computed on every request	Vectors stored and reused across sessions
Vector search re-run for same queries	Retrieval results cached by query + top-k
Agent state lost between runs	Session context persisted across turns
Semantically identical questions treated as new	Cosine similarity match returns cached answer

vs. GPTCache / LiteLLM / redis-vl

GPTCache (8k ⭐) was the closest open-source competitor, but it has been effectively unmaintained since 2024. omnicache-ai fills that gap with broader framework coverage, more storage backends, and active maintenance; the table below lists the differences feature by feature.

Feature	omnicache-ai	GPTCache	LiteLLM	redis-vl
13 framework adapters	✅	3	1 (gateway)	❌
Adaptive semantic threshold	✅	❌	❌	❌
Streaming cache	✅	❌	❌	❌
Tiered backend (L1 + L2)	✅	❌	❌	❌
Stampede protection	✅	❌	❌	❌
Provider prompt cache tracking	✅	❌	partial	❌
Qdrant + Weaviate backends	✅	❌	❌	❌
Prometheus + OTEL export	✅	❌	✅	❌
Multi-tenant namespacing	✅	❌	❌	❌
Actively maintained (2026)	✅	❌ dead	✅	✅

LiteLLM is a gateway/proxy (different problem domain). redis-vl is Redis-only with no framework adapters. GPTCache is the direct predecessor — omnicache-ai is what GPTCache should have become.

Key Features

Cache Layers

Layer	Class	What it caches	Serialization
LLM Response	`ResponseCache`	Model output keyed by model + messages + params	pluggable
Embeddings	`EmbeddingCache`	`np.ndarray` vectors keyed by model + text	`np.tobytes()`
Retrieval	`RetrievalCache`	Document lists keyed by query + retriever + top-k	pluggable
Context/Session	`ContextCache`	Conversation turns keyed by session ID + turn index	pluggable
Semantic	`SemanticCache`	Answers reused for semantically similar queries (cosine ≥ threshold)	pluggable
Adaptive Semantic	`AdaptiveSemanticCache`	SemanticCache + auto-tuning threshold + multi-turn guard	pluggable
Streaming	`StreamingResponseCache`	Buffers streamed LLM chunks; replays from cache as generator	pluggable
Prompt Cache	`PromptCacheLayer`	Injects Anthropic `cache_control`; tracks provider cache savings	—

Storage Backends

Backend	Class	Extras	Best For
In-Memory (LRU)	`InMemoryBackend`	— (core)	Dev, testing, single-process
Async In-Memory	`AsyncInMemoryBackend`	— (core)	FastAPI, async frameworks
Disk	`DiskBackend`	— (core)	Persistent, single-machine
Redis	`RedisBackend`	`[redis]`	Shared across processes / services
Tiered (L1 + L2)	`TieredBackend`	— (core)	Memory speed + Redis persistence
FAISS	`FAISSBackend`	`[vector-faiss]`	High-speed in-process vector search
ChromaDB	`ChromaBackend`	`[vector-chroma]`	Persistent vector store + metadata
Qdrant	`QdrantBackend`	`[vector-qdrant]`	Fastest production vector DB (22ms p95)
Weaviate	`WeaviateBackend`	`[vector-weaviate]`	Native hybrid search (vector + BM25)

Framework Adapters

Framework	Class	Hook Point	Async
OpenAI SDK	`OpenAICacheAdapter`	`client.chat.completions.create`	✅ `achat_create`
Anthropic SDK	`AnthropicCacheAdapter`	`client.messages.create`	✅ `amessages_create`
Google ADK	`GoogleADKCacheAdapter`	`Agent.run()` / `run_async()`	✅ `arun`
OpenAI Agents SDK	`OpenAIAgentsCacheAdapter`	`Runner.run()` / `run_sync()`	✅ `arun`
LlamaIndex LLM	`LlamaIndexLLMCacheAdapter`	`complete()` / `chat()` / async variants	✅ `acomplete` / `achat`
LlamaIndex QueryEngine	`LlamaIndexQueryCacheAdapter`	`query()` / `aquery()`	✅ `aquery`
Claude Agent SDK	`ClaudeAgentCacheAdapter`	`claude_code_sdk.query()` async generator	✅ (async generator)
LangChain ≥ 0.2	`LangChainCacheAdapter`	`BaseCache`: `lookup` / `update`	✅ `alookup` / `aupdate`
LangGraph ≥ 0.1 / 1.x	`LangGraphCacheAdapter`	`BaseCheckpointSaver`: `get_tuple` / `put` / `list`	✅ `aget_tuple` / `aput` / `alist`
AutoGen ≥ 0.4	`AutoGenCacheAdapter`	`AssistantAgent.run()` / `arun()`	✅ `arun`
AutoGen 0.2.x	`AutoGenCacheAdapter`	`ConversableAgent.generate_reply()`	—
CrewAI ≥ 0.28	`CrewAICacheAdapter`	`Crew.kickoff()`	✅ `kickoff_async`
Agno ≥ 0.1	`AgnoCacheAdapter`	`Agent.run()` / `arun()`	✅ `arun`
A2A ≥ 0.2	`A2ACacheAdapter`	`process()` / `wrap()` decorator	✅ `aprocess`

Middleware

Class	Wraps	Async
`LLMMiddleware`	Any sync LLM callable	—
`AsyncLLMMiddleware`	Any async LLM callable	✅
`EmbeddingMiddleware`	Any sync/async embed function	✅
`RetrieverMiddleware`	Any sync/async retriever	✅

Core Engine

Component	Class	Description
Orchestrator	`CacheManager`	Central hub — wires backend, key builder, TTL policy, invalidation
Key Builder	`CacheKeyBuilder`	`namespace:type:sha256[:16]` canonical keys
Metrics	`CacheMetrics`	Hit/miss/eviction counters + provider cache hits + cost saved
Serializer	`Serializer`	Pluggable encode/decode — `PickleSerializer` (default), `JsonSerializer`
Compressor	`Compressor`	Optional compression — `GzipCompressor`, `NoopCompressor` (default)
Stampede Shield	`StampedeShield`	Per-key `threading.Lock` prevents concurrent duplicate LLM calls
Request Config	`RequestConfig`	Per-request TTL / threshold / `skip_cache` overrides
Cache Warmer	`CacheWarmer`	Bulk pre-populate from query lists or CSV files
TTL Policy	`TTLPolicy`	Global + per-layer TTL overrides
Eviction	`EvictionPolicy`	LRU / TTL-only strategies, wired into `InMemoryBackend`
Invalidation	`InvalidationEngine`	Tag-based bulk eviction
Multi-Tenant	`CacheManager.for_tenant(id)`	Scoped manager with per-tenant key namespacing, shared backend
Settings	`OmnicacheSettings`	Dataclass + `from_env()` for 12-factor config
Prometheus	`PrometheusExporter`	`/metrics` HTTP endpoint — requires `[observability]`
OpenTelemetry	`OpenTelemetryExporter`	Push metrics to OTEL collector — requires `[observability]`
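
The `namespace:type:sha256[:16]` key format listed for `CacheKeyBuilder` can be illustrated with a small stdlib sketch. This is not the library's implementation: `build_key` is a hypothetical helper, and it assumes that `[:16]` means the first 16 hex characters of the digest and that the payload is canonicalised as sorted JSON.

```python
import hashlib
import json


def build_key(namespace: str, cache_type: str, payload: dict) -> str:
    """Build a `namespace:type:digest` key from a JSON-serialisable payload.

    Hypothetical helper for illustration only.
    """
    # sort_keys makes the encoding independent of dict insertion order,
    # so logically identical requests map to the same key.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{namespace}:{cache_type}:{digest}"


key_a = build_key("myapp", "response", {"model": "gpt-4o", "temperature": 0})
key_b = build_key("myapp", "response", {"temperature": 0, "model": "gpt-4o"})
key_c = build_key("myapp", "response", {"model": "gpt-4o", "temperature": 1})

print(key_a == key_b)  # same payload, different key order
print(key_a == key_c)  # different parameters
```

Truncating the digest keeps keys short for backends such as Redis, at the cost of a slightly higher (still very small) collision probability.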

AI Agent Pipeline Architecture

Where Cache Layers Sit in a Full AI Pipeline

flowchart TD
    User(["👤 User / Application"])

    User -->|query| Adapters

    subgraph Adapters["🔌 Framework Adapters (13)"]
        direction LR
        OAI["OpenAI SDK"]
        ANT["Anthropic SDK"]
        GADK["Google ADK"]
        OAIA["OpenAI Agents SDK"]
        LLI["LlamaIndex"]
        CLA["Claude Agent SDK"]
        LC["LangChain"]
        LG["LangGraph"]
        AG["AutoGen"]
        CR["CrewAI"]
        AN["Agno"]
        A2["A2A"]
    end

    Adapters -->|intercepted call| MW

    subgraph MW["⚙️ Middleware"]
        direction LR
        LLM_MW["LLMMiddleware"]
        EMB_MW["EmbeddingMiddleware"]
        RET_MW["RetrieverMiddleware"]
    end

    MW -->|cache lookup| Layers

    subgraph Layers["🗂️ Cache Layers (omnicache-ai)"]
        direction TB
        ASC["AdaptiveSemanticCache\n(auto-tuning threshold)"]
        SC["SemanticCache\n(cosine similarity)"]
        RC["ResponseCache\n(LLM output, stampede-safe)"]
        SRC["StreamingResponseCache\n(buffered stream replay)"]
        EC["EmbeddingCache\n(np.ndarray)"]
        REC["RetrievalCache\n(documents)"]
        CC["ContextCache\n(session turns)"]
        PC["PromptCacheLayer\n(provider cache_control)"]
    end

    Layers -->|hit → return| User
    Layers -->|miss → forward| Core

    subgraph Core["🧠 Core Engine"]
        direction LR
        CM["CacheManager\n+ for_tenant()"]
        KB["CacheKeyBuilder\nnamespace:type:sha256"]
        MT["CacheMetrics\nhit_rate · cost_saved"]
        IE["InvalidationEngine\ntag-based eviction"]
        TP["TTLPolicy\nper-layer TTLs"]
        SS["StampedeShield\nper-key lock"]
        OBS["Observability\nPrometheus · OTEL"]
    end

    Core <-->|read / write| Backends

    subgraph Backends["💾 Storage Backends (9)"]
        direction LR
        MEM["InMemoryBackend\n(LRU)"]
        AMEM["AsyncInMemoryBackend\n(asyncio)"]
        TIER["TieredBackend\n(L1 + L2)"]
        DISK["DiskBackend\n(diskcache)"]
        REDIS["RedisBackend\n[redis]"]
        FAISS["FAISSBackend\n[vector-faiss]"]
        CHROMA["ChromaBackend\n[vector-chroma]"]
        QDRANT["QdrantBackend\n[vector-qdrant]"]
        WEAV["WeaviateBackend\n[vector-weaviate]"]
    end

    Core -->|miss| LLM_CALL

    subgraph LLM_CALL["🤖 Actual AI Work (on cache miss only)"]
        direction LR
        LLM["LLM API\ngpt-4o / claude / gemini"]
        EMB["Embedder\ntext-embedding-3"]
        VDB["Vector DB\npinecone / weaviate"]
        TOOLS["Tools / APIs"]
    end

    LLM_CALL -->|result| Core
    Core -->|store + return| User

    style Layers fill:#1e3a5f,color:#fff,stroke:#3b82f6
    style Backends fill:#1a3326,color:#fff,stroke:#22c55e
    style Adapters fill:#3b1f5e,color:#fff,stroke:#a855f7
    style MW fill:#3b2a0f,color:#fff,stroke:#f59e0b
    style Core fill:#1e2a3b,color:#fff,stroke:#64748b
    style LLM_CALL fill:#3b1a1a,color:#fff,stroke:#ef4444

Cache Layer Responsibilities in the Pipeline

flowchart LR
    Q(["Query"])

    Q --> S0
    subgraph S0["① Adaptive Semantic Layer"]
        ASC["AdaptiveSemanticCache\nauto-tuning threshold\n+ multi-turn guard"]
    end

    S0 -->|miss| S1
    subgraph S1["② Semantic Layer"]
        SC["SemanticCache\ncosine similarity ≥ threshold\n→ skip everything below"]
    end

    S1 -->|miss| S2
    subgraph S2["③ Response Layer"]
        RC["ResponseCache\nexact model+msgs+params\nhash match · stampede-safe"]
    end

    S2 -->|miss| S2b
    subgraph S2b["④ Streaming Layer"]
        SRC["StreamingResponseCache\nbuffered stream replay\nfor streaming LLMs"]
    end

    S2b -->|miss| S3
    subgraph S3["⑤ Retrieval Layer"]
        REC["RetrievalCache\nquery + retriever + top-k\nhash match"]
    end

    S3 -->|miss| S4
    subgraph S4["⑥ Embedding Layer"]
        EC["EmbeddingCache\nmodel + text hash match\nreturns np.ndarray"]
    end

    S4 -->|miss| S5
    subgraph S5["⑦ Context Layer"]
        CC["ContextCache\nsession_id + turn_index\nreturns message history"]
    end

    S5 -->|all miss| API["🤖 LLM / API Call\n+ PromptCacheLayer\n(provider cache_control)"]

    API -->|result| Store["Store in all\nrelevant layers\n+ CacheMetrics update"]
    Store --> R(["Response"])

    S0 -->|hit ⚡| R
    S1 -->|hit ⚡| R
    S2 -->|hit ⚡| R
    S2b -->|hit ⚡| R
    S3 -->|hit ⚡| R
    S4 -->|hit ⚡| R
    S5 -->|hit ⚡| R

    style S0 fill:#1a3a2a,color:#fff,stroke:#10b981
    style S1 fill:#4c1d95,color:#fff,stroke:#7c3aed
    style S2 fill:#1e3a5f,color:#fff,stroke:#3b82f6
    style S2b fill:#1e2a4a,color:#fff,stroke:#60a5fa
    style S3 fill:#14532d,color:#fff,stroke:#22c55e
    style S4 fill:#713f12,color:#fff,stroke:#f59e0b
    style S5 fill:#7f1d1d,color:#fff,stroke:#ef4444

Backend Selection by Use Case

flowchart TD
    Start(["Which backend?"])

    Start --> Q1{"Multiple processes\nor services?"}
    Q1 -->|Yes| Q1b{"Also need\nfast local reads?"}
    Q1b -->|Yes| TIERED["TieredBackend\nL1=InMemory + L2=Redis\n[core]"]
    Q1b -->|No| REDIS["RedisBackend\n[redis]"]

    Q1 -->|No| Q2{"Need vector\nsimilarity?"}

    Q2 -->|Yes| Q3{"Need hybrid\nsearch (vector+BM25)?"}
    Q3 -->|Yes| WEAV["WeaviateBackend\n[vector-weaviate]"]
    Q3 -->|No| Q3b{"Production scale\n(10M+ vectors)?"}
    Q3b -->|Yes| QDRANT["QdrantBackend\n22ms p95 · [vector-qdrant]"]
    Q3b -->|No| Q3c{"Persist\nto disk?"}
    Q3c -->|Yes| CHROMA["ChromaBackend\n[vector-chroma]"]
    Q3c -->|No| FAISS["FAISSBackend\n[vector-faiss]"]

    Q2 -->|No| Q4{"Async framework\n(FastAPI / LangGraph)?"}
    Q4 -->|Yes| AMEM["AsyncInMemoryBackend\n[core]"]
    Q4 -->|No| Q5{"Survive\nrestarts?"}
    Q5 -->|Yes| DISK["DiskBackend\n[core]"]
    Q5 -->|No| MEM["InMemoryBackend\n[core]"]

    style REDIS fill:#dc2626,color:#fff
    style TIERED fill:#b45309,color:#fff
    style FAISS fill:#2563eb,color:#fff
    style CHROMA fill:#7c3aed,color:#fff
    style QDRANT fill:#0e7490,color:#fff
    style WEAV fill:#0f766e,color:#fff
    style DISK fill:#d97706,color:#fff
    style MEM fill:#059669,color:#fff
    style AMEM fill:#047857,color:#fff

Installation

Requirements

Python ≥ 3.12
Core dependencies: diskcache, numpy (installed automatically)

pip (PyPI)

# Minimal — in-memory + disk backends
pip install omnicache-ai

# ── Framework adapters ──────────────────────────────────────────────
pip install 'omnicache-ai[openai]'          # OpenAI SDK adapter
pip install 'omnicache-ai[anthropic]'       # Anthropic SDK adapter
pip install 'omnicache-ai[google-adk]'      # Google ADK adapter
pip install openai-agents                   # OpenAI Agents SDK adapter
pip install 'omnicache-ai[llamaindex]'      # LlamaIndex LLM + QueryEngine adapters
pip install claude-code-sdk                 # Claude Agent SDK adapter
pip install 'omnicache-ai[langchain]'       # LangChain ≥ 0.2
pip install 'omnicache-ai[langgraph]'       # LangGraph ≥ 0.1 / 1.x
pip install 'omnicache-ai[autogen]'         # AutoGen legacy (pyautogen 0.2.x)
pip install 'autogen-agentchat>=0.4'        # AutoGen new API (separate package)
pip install 'omnicache-ai[crewai]'          # CrewAI ≥ 0.28 / 1.x
pip install 'omnicache-ai[agno]'            # Agno ≥ 0.1 / 2.x
pip install 'a2a-sdk>=0.3' omnicache-ai     # A2A SDK ≥ 0.2

# ── Storage backends ────────────────────────────────────────────────
pip install 'omnicache-ai[redis]'           # Redis
pip install 'omnicache-ai[vector-faiss]'    # FAISS vector search
pip install 'omnicache-ai[vector-chroma]'   # ChromaDB vector store
pip install 'omnicache-ai[vector-qdrant]'   # Qdrant (22ms p95, fastest)
pip install 'omnicache-ai[vector-weaviate]' # Weaviate hybrid search

# ── Observability ────────────────────────────────────────────────────
pip install 'omnicache-ai[observability]'   # Prometheus + OpenTelemetry exporters

# ── Common combos ───────────────────────────────────────────────────
pip install 'omnicache-ai[langchain,redis]'
pip install 'omnicache-ai[langgraph,vector-qdrant]'

# ── Everything ──────────────────────────────────────────────────────
pip install 'omnicache-ai[all]'

uv

uv add omnicache-ai
uv add 'omnicache-ai[langchain,redis]'
uv add 'omnicache-ai[all]'

From source

git clone https://github.com/ashishpatel26/omnicache-ai.git
cd omnicache-ai
uv sync --dev         # installs all dev + core deps
uv run pytest         # verify install

Verify

python -c "import omnicache_ai; print(omnicache_ai.__version__)"
# 0.3.0

Environment variable configuration

Variable	Default	Values
`OMNICACHE_BACKEND`	`memory`	`memory` · `disk` · `redis`
`OMNICACHE_REDIS_URL`	`redis://localhost:6379/0`	Any Redis URL
`OMNICACHE_DISK_PATH`	`/tmp/omnicache`	Any writable path
`OMNICACHE_DEFAULT_TTL`	`3600`	Seconds; `0` = no expiry
`OMNICACHE_NAMESPACE`	`omnicache`	Key prefix string
`OMNICACHE_SEMANTIC_THRESHOLD`	`0.95`	Float 0–1
`OMNICACHE_TTL_EMBEDDING`	`86400`	Per-layer override
`OMNICACHE_TTL_RETRIEVAL`	`3600`	Per-layer override
`OMNICACHE_TTL_CONTEXT`	`1800`	Per-layer override
`OMNICACHE_TTL_RESPONSE`	`600`	Per-layer override

export OMNICACHE_BACKEND=redis
export OMNICACHE_REDIS_URL=redis://localhost:6379/0
export OMNICACHE_DEFAULT_TTL=3600

from omnicache_ai import CacheManager, OmnicacheSettings

manager = CacheManager.from_settings(OmnicacheSettings.from_env())

Quick Start

from omnicache_ai import CacheManager, InMemoryBackend, CacheKeyBuilder

manager = CacheManager(
    backend=InMemoryBackend(),
    key_builder=CacheKeyBuilder(namespace="myapp"),
)

manager.set("my_key", {"result": 42}, ttl=60)
value = manager.get("my_key")  # {"result": 42}

LangChain in 3 lines

from langchain_core.globals import set_llm_cache
from omnicache_ai import CacheManager, InMemoryBackend, CacheKeyBuilder
from omnicache_ai.adapters.langchain_adapter import LangChainCacheAdapter

set_llm_cache(LangChainCacheAdapter(CacheManager(backend=InMemoryBackend(), key_builder=CacheKeyBuilder())))
# Every ChatOpenAI / ChatAnthropic call is now cached automatically

Cache Layers

LLM Response Cache

Cache the string or dict output of any LLM call, keyed by model + messages + params.

from omnicache_ai import CacheManager, InMemoryBackend, CacheKeyBuilder, ResponseCache

manager = CacheManager(backend=InMemoryBackend(), key_builder=CacheKeyBuilder(namespace="myapp"))
cache = ResponseCache(manager)

messages = [{"role": "user", "content": "What is 2+2?"}]

cache.set(messages, "4", model_id="gpt-4o")
answer = cache.get(messages, model_id="gpt-4o")  # "4"

# get_or_generate — calls generator only on cache miss
def call_llm(msgs):
    return openai_client.chat.completions.create(...).choices[0].message.content

answer = cache.get_or_generate(messages, call_llm, model_id="gpt-4o")

Embedding Cache

from omnicache_ai import EmbeddingCache

emb_cache = EmbeddingCache(manager)

vec = emb_cache.get_or_compute(
    text="Hello world",
    compute_fn=lambda t: embed_model.encode(t),
    model_id="text-embedding-3-small",
)
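
The Cache Layers table notes that embeddings are serialized as raw bytes. The sketch below illustrates that round trip with the stdlib `array` module as a stand-in for `np.ndarray`; the helper names are hypothetical and this is not the library's code. Raw bytes carry no dtype or shape, so with NumPy the reader would need to pass the matching dtype to `np.frombuffer`.

```python
from array import array


def vector_to_bytes(vec: list[float]) -> bytes:
    # 'd' is a C double (8 bytes per value), so Python floats round-trip exactly.
    return array("d", vec).tobytes()


def bytes_to_vector(raw: bytes) -> list[float]:
    # The reader must use the same typecode the writer used.
    out = array("d")
    out.frombytes(raw)
    return out.tolist()


original = [0.25, -1.5, 3.0]
blob = vector_to_bytes(original)
restored = bytes_to_vector(blob)

print(len(blob), restored == original)
```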

Retrieval Cache

from omnicache_ai import RetrievalCache

ret_cache = RetrievalCache(manager)

docs = ret_cache.get_or_retrieve(
    query="What is RAG?",
    retrieve_fn=lambda q: vectorstore.similarity_search(q, k=5),
    retriever_id="my-vectorstore",
    top_k=5,
)

Context / Session Cache

from omnicache_ai import ContextCache

ctx_cache = ContextCache(manager)

ctx_cache.set(session_id="user-123", turn_index=0, messages=[...])
history = ctx_cache.get(session_id="user-123", turn_index=0)

ctx_cache.invalidate_session("user-123")  # clear all turns for this session

Semantic Cache

Returns a cached answer for semantically similar queries (cosine ≥ threshold). Requires `pip install 'omnicache-ai[vector-faiss]'`.

from omnicache_ai import SemanticCache
from omnicache_ai.backends.memory_backend import InMemoryBackend
from omnicache_ai.backends.vector_backend import FAISSBackend

sem_cache = SemanticCache(
    exact_backend=InMemoryBackend(),
    vector_backend=FAISSBackend(dim=1536),
    embed_fn=lambda text: embed_model.encode(text),  # returns np.ndarray
    threshold=0.95,
)

sem_cache.set("What is the capital of France?", "Paris")

sem_cache.get("What is the capital of France?")       # "Paris" — exact
sem_cache.get("Which city is the capital of France?") # "Paris" — semantic hit

Middleware (Decorator Pattern)

Wrap any sync/async LLM callable without changing its signature.

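from omnicache_ai import CacheKeyBuilder, CacheManager, InMemoryBackend, LLMMiddleware, ResponseCache

# Build the objects the middleware depends on
key_builder = CacheKeyBuilder(namespace="myapp")
response_cache = ResponseCache(CacheManager(backend=InMemoryBackend(), key_builder=key_builder))
middleware = LLMMiddleware(response_cache, key_builder, model_id="gpt-4o")

@middleware
def call_llm(messages: list[dict]) -> str:
    return openai_client.chat.completions.create(...).choices[0].message.content

# Or wrap an existing, undecorated callable (existing_llm_fn is a placeholder for your own function)
wrapped = middleware(existing_llm_fn)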

from omnicache_ai import AsyncLLMMiddleware

@AsyncLLMMiddleware(response_cache, key_builder, model_id="gpt-4o")
async def call_llm_async(messages):
    return await async_client.chat(messages)

Same pattern: EmbeddingMiddleware, RetrieverMiddleware
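
For readers curious how a caching decorator of this kind can work, here is a minimal stdlib sketch. `cached_llm` and `fake_llm` are hypothetical names, and the dict-backed store stands in for a real cache layer; the library's middleware may be structured differently.

```python
import functools
import hashlib
import json
from collections.abc import Callable


def cached_llm(store: dict[str, str], model_id: str):
    """Decorator factory: cache a `messages -> str` callable in `store`."""

    def decorator(fn: Callable[[list[dict]], str]) -> Callable[[list[dict]], str]:
        @functools.wraps(fn)  # keep the wrapped function's name and docstring
        def wrapper(messages: list[dict]) -> str:
            raw = json.dumps({"model": model_id, "messages": messages}, sort_keys=True)
            key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
            if key not in store:
                store[key] = fn(messages)  # only reached on a cache miss
            return store[key]

        return wrapper

    return decorator


calls: list[list[dict]] = []
store: dict[str, str] = {}


@cached_llm(store, model_id="demo-model")
def fake_llm(messages: list[dict]) -> str:
    calls.append(messages)  # record how often the "model" is really invoked
    return "4"


msgs = [{"role": "user", "content": "What is 2+2?"}]
first = fake_llm(msgs)
second = fake_llm(msgs)
print(first, second, len(calls))
```

Because the wrapper takes the same argument and returns the same type as the original function, callers do not need to change.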

Framework Adapters

LangChain

from langchain_core.globals import set_llm_cache
from langchain_openai import ChatOpenAI
from omnicache_ai.adapters.langchain_adapter import LangChainCacheAdapter

set_llm_cache(LangChainCacheAdapter(manager))

llm = ChatOpenAI(model="gpt-4o")
response = llm.invoke("What is 2+2?")  # cached on second call

LangGraph

Compatible with langgraph ≥ 0.1 and ≥ 1.0 — adapter auto-detects the API version.

from langgraph.graph import StateGraph
from omnicache_ai.adapters.langgraph_adapter import LangGraphCacheAdapter

saver = LangGraphCacheAdapter(manager)
graph = StateGraph(MyState).compile(checkpointer=saver)

result = graph.invoke({"messages": [...]}, config={"configurable": {"thread_id": "t1"}})

AutoGen

# autogen-agentchat 0.4+ (new API)
from autogen_agentchat.agents import AssistantAgent
from omnicache_ai.adapters.autogen_adapter import AutoGenCacheAdapter

agent = AssistantAgent("assistant", model_client=...)
cached = AutoGenCacheAdapter(agent, manager)
result = await cached.arun("What is 2+2?")

# pyautogen 0.2.x (legacy)
from autogen import ConversableAgent
agent = ConversableAgent(name="assistant", llm_config={...})
cached = AutoGenCacheAdapter(agent, manager)
reply = cached.generate_reply(messages=[{"role": "user", "content": "Hi"}])

CrewAI

from crewai import Crew
from omnicache_ai.adapters.crewai_adapter import CrewAICacheAdapter

crew = Crew(agents=[...], tasks=[...])
cached_crew = CrewAICacheAdapter(crew, manager)

result = cached_crew.kickoff(inputs={"topic": "AI trends"})
result = await cached_crew.kickoff_async(inputs={"topic": "AI trends"})

Agno

from agno.agent import Agent
from omnicache_ai.adapters.agno_adapter import AgnoCacheAdapter

agent = Agent(model=..., tools=[...])
cached = AgnoCacheAdapter(agent, manager)

response = cached.run("Summarize the latest AI research")
response = await cached.arun("Summarize the latest AI research")

Google ADK

from google.adk.agents import Agent
from omnicache_ai.adapters.google_adk_adapter import GoogleADKCacheAdapter

agent = Agent(name="research_agent", model="gemini-2.0-flash", instruction="...")
cached = GoogleADKCacheAdapter(agent, manager)

result = cached.run("Summarise the quarterly report")    # live call
result = cached.run("Summarise the quarterly report")    # instant from cache
result = await cached.arun("Async task")

OpenAI Agents SDK

from agents import Agent, Runner
from omnicache_ai.adapters.openai_agents_adapter import OpenAIAgentsCacheAdapter

agent = Agent(name="assistant", instructions="Be concise", model="gpt-4o")
adapter = OpenAIAgentsCacheAdapter(manager)

result = adapter.run(agent, "What is RAG?")
result = await adapter.arun(agent, "What is RAG?")  # async

LlamaIndex

from llama_index.llms.openai import OpenAI
from omnicache_ai.adapters.llamaindex_adapter import (
    LlamaIndexLLMCacheAdapter,
    LlamaIndexQueryCacheAdapter,
)

# LLM cache
cached_llm = LlamaIndexLLMCacheAdapter(OpenAI(model="gpt-4o"), manager)
response = cached_llm.complete("What is vector search?")

# QueryEngine (RAG) cache
engine = index.as_query_engine()
cached_engine = LlamaIndexQueryCacheAdapter(engine, manager)
response = cached_engine.query("What are the key findings?")

Claude Agent SDK

from omnicache_ai.adapters.claude_agent_adapter import ClaudeAgentCacheAdapter

adapter = ClaudeAgentCacheAdapter(manager)

async for msg in adapter.query("Fix the import error in utils.py", options=options):
    print(msg)  # streams on first call, replays from cache on second

OpenAI SDK

import openai
from omnicache_ai.adapters.openai_adapter import OpenAICacheAdapter

client = openai.OpenAI()
adapter = OpenAICacheAdapter(client, manager)

response = adapter.chat_create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
# Second call with identical args returns instantly from cache

# Async
client = openai.AsyncOpenAI()
adapter = OpenAICacheAdapter(client, manager)
response = await adapter.achat_create(model="gpt-4o", messages=[...])

Anthropic SDK

import anthropic
from omnicache_ai.adapters.anthropic_adapter import AnthropicCacheAdapter

client = anthropic.Anthropic()
adapter = AnthropicCacheAdapter(client, manager)

response = adapter.messages_create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}],
)

# Async
client = anthropic.AsyncAnthropic()
adapter = AnthropicCacheAdapter(client, manager)
response = await adapter.amessages_create(model="claude-sonnet-4-6", ...)

A2A (Agent-to-Agent)

from omnicache_ai.adapters.a2a_adapter import A2ACacheAdapter

adapter = A2ACacheAdapter(manager, agent_id="planner")

# Explicit call
result = adapter.process(handler_fn, task_payload)
result = await adapter.aprocess(async_handler, task_payload)

# As a decorator
@adapter.wrap
def handle_task(payload: dict) -> dict:
    return downstream_agent.process(payload)

v0.2.0 Features

Cache Metrics

from omnicache_ai import CacheKeyBuilder, CacheManager, InMemoryBackend

manager = CacheManager(backend=InMemoryBackend(), key_builder=CacheKeyBuilder())

manager.get("key")          # miss
manager.set("key", "val")
manager.get("key")          # hit

print(manager.metrics.snapshot())
# {'hits': 1, 'misses': 1, 'evictions': 0, 'sets': 1, 'hit_rate': 0.5, 'miss_rate': 0.5}

Pluggable Serializer

from omnicache_ai import JsonSerializer, ResponseCache

# Use JSON instead of pickle (safer for Redis with untrusted data)
cache = ResponseCache(manager, serializer=JsonSerializer())

Stampede Protection

`ResponseCache.get_or_generate()` takes a per-key lock automatically: under concurrency, only one thread calls the LLM for a given key, while the others wait and then read the result from cache.
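
A per-key lock with a second cache check inside the lock is one common way to get this behaviour. The sketch below uses hypothetical names (`DemoShield`, `slow_llm`) and only the stdlib; it illustrates the idea and is not the library's `StampedeShield`.

```python
import threading
import time
from collections.abc import Callable


class DemoShield:
    """Toy cache where only one thread per key runs the generator."""

    def __init__(self) -> None:
        self._cache: dict[str, str] = {}
        self._locks: dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, key: str) -> threading.Lock:
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get_or_generate(self, key: str, generate: Callable[[], str]) -> str:
        if key in self._cache:  # fast path, no locking
            return self._cache[key]
        with self._lock_for(key):
            # Re-check: another thread may have filled the cache while we waited.
            if key not in self._cache:
                self._cache[key] = generate()
            return self._cache[key]


calls: list[int] = []


def slow_llm() -> str:
    calls.append(1)
    time.sleep(0.05)  # simulate a slow upstream call
    return "answer"


shield = DemoShield()
results: list[str] = []


def worker() -> None:
    results.append(shield.get_or_generate("prompt-key", slow_llm))


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(calls), len(results))
```

Locks are per key, so slow generation for one prompt does not block unrelated prompts.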

Tiered Backend (L1 + L2)

from omnicache_ai import CacheKeyBuilder, CacheManager, InMemoryBackend, TieredBackend
from omnicache_ai.backends.redis_backend import RedisBackend

backend = TieredBackend(
    l1=InMemoryBackend(max_size=1000),   # fast, local
    l2=RedisBackend(url="redis://..."),  # persistent, shared
    l1_ttl=300,                          # 5-min local copy
)
manager = CacheManager(backend=backend, key_builder=CacheKeyBuilder())

Compression

from omnicache_ai import CacheKeyBuilder, CacheManager, GzipCompressor, InMemoryBackend

manager = CacheManager(
    backend=InMemoryBackend(),
    key_builder=CacheKeyBuilder(),
    compressor=GzipCompressor(level=6),  # compress all stored bytes
)
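
As a rough illustration of what compressing stored bytes buys, the sketch below pickles a repetitive payload and gzips it with the stdlib. The `encode` and `decode` helpers are hypothetical; the serialize-then-compress order is an assumption about how such a pipeline is usually arranged, not a statement about the library's internals.

```python
import gzip
import pickle


def encode(value: object, level: int = 6) -> bytes:
    # Serialize first, then compress the resulting bytes.
    return gzip.compress(pickle.dumps(value), compresslevel=level)


def decode(blob: bytes) -> object:
    # Only unpickle data you wrote yourself; pickle is unsafe on untrusted input.
    return pickle.loads(gzip.decompress(blob))


payload = {"answer": "lorem ipsum " * 200}
raw_size = len(pickle.dumps(payload))
blob = encode(payload)

print(raw_size, len(blob), decode(blob) == payload)
```

Compression pays off mainly for long, repetitive text such as LLM responses; very small values can grow slightly because of the gzip header.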

Streaming Response Cache

from omnicache_ai import StreamingResponseCache

stream_cache = StreamingResponseCache(manager)

def stream_fn(messages):
    return openai_client.chat.completions.create(
        model="gpt-4o", messages=messages, stream=True
    )

# First call: live stream + buffered to cache
# Second call: replays chunks from cache at full speed
for chunk in stream_cache.get_or_stream(messages, stream_fn, model_id="gpt-4o"):
    print(chunk.choices[0].delta.content or "", end="", flush=True)
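
The buffer-then-replay idea can be sketched with a plain generator. `DemoStreamCache` and `fake_stream` are hypothetical names, string chunks stand in for provider chunk objects, and the choice to store only fully consumed streams is an assumption of this sketch.

```python
from collections.abc import Callable, Iterable, Iterator


class DemoStreamCache:
    """Toy streaming cache: pass chunks through live, replay them later."""

    def __init__(self) -> None:
        self._store: dict[str, list[str]] = {}

    def get_or_stream(self, key: str, stream_fn: Callable[[], Iterable[str]]) -> Iterator[str]:
        if key in self._store:
            yield from self._store[key]  # replay from cache
            return
        buffer: list[str] = []
        for chunk in stream_fn():
            buffer.append(chunk)
            yield chunk  # the caller sees each chunk as it arrives
        # Reached only if the consumer drained the stream, so a partial
        # response is never cached.
        self._store[key] = buffer


live_calls: list[int] = []


def fake_stream() -> Iterator[str]:
    live_calls.append(1)
    yield from ["Hel", "lo", "!"]


stream_cache = DemoStreamCache()
first = list(stream_cache.get_or_stream("k", fake_stream))
second = list(stream_cache.get_or_stream("k", fake_stream))
print("".join(first), "".join(second), len(live_calls))
```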

Async Backend

from omnicache_ai import AsyncInMemoryBackend

# Use in async frameworks — does not block the event loop
backend = AsyncInMemoryBackend(max_size=10_000)
value = await backend.get("key")
await backend.set("key", "value", ttl=60)

Tag-Based Invalidation

from omnicache_ai import InvalidationEngine, InMemoryBackend, CacheManager, CacheKeyBuilder

manager = CacheManager(
    backend=InMemoryBackend(),
    key_builder=CacheKeyBuilder(),
    invalidation_engine=InvalidationEngine(InMemoryBackend()),
)

manager.set("key1", "v1", tags=["model:gpt-4o", "env:prod"])
manager.set("key2", "v2", tags=["model:gpt-4o"])

count = manager.invalidate("model:gpt-4o")  # removes both entries

# ResponseCache / ContextCache tag automatically
from omnicache_ai import ResponseCache, ContextCache
rc = ResponseCache(manager)
rc.invalidate_model("gpt-4o")           # remove all gpt-4o responses

ctx = ContextCache(manager)
ctx.invalidate_session("user-123")      # clear all session turns
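
Tag-based invalidation usually relies on a reverse index from tag to keys. The sketch below shows that idea with hypothetical names (`DemoTagIndex`); it is an illustration under that assumption, not the library's `InvalidationEngine`.

```python
from collections import defaultdict
from collections.abc import Iterable


class DemoTagIndex:
    """Toy cache with a tag -> keys index for bulk eviction."""

    def __init__(self) -> None:
        self.data: dict[str, object] = {}
        self._tags: defaultdict[str, set[str]] = defaultdict(set)

    def set(self, key: str, value: object, tags: Iterable[str] = ()) -> None:
        self.data[key] = value
        for tag in tags:
            self._tags[tag].add(key)

    def invalidate(self, tag: str) -> int:
        """Remove every entry carrying `tag`; return how many were removed."""
        removed = 0
        for key in self._tags.pop(tag, set()):
            if key in self.data:  # may already be gone via another tag
                del self.data[key]
                removed += 1
        return removed


index = DemoTagIndex()
index.set("key1", "v1", tags=["model:gpt-4o", "env:prod"])
index.set("key2", "v2", tags=["model:gpt-4o"])

removed_by_model = index.invalidate("model:gpt-4o")
removed_by_env = index.invalidate("env:prod")
print(removed_by_model, removed_by_env, index.data)
```

The second call returns zero because `key1` was already evicted through its other tag, which is why the removal loop checks for the key before deleting.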

Backends

Backend	Extra	Use case
`InMemoryBackend`	—	Dev, testing, single-process
`DiskBackend`	—	Survives restarts, single-machine
`RedisBackend`	`[redis]`	Shared cache across processes/services
`FAISSBackend`	`[vector-faiss]`	Semantic/vector similarity search
`ChromaBackend`	`[vector-chroma]`	Persistent vector store with metadata

from omnicache_ai.backends.redis_backend import RedisBackend
from omnicache_ai.backends.disk_backend import DiskBackend

manager = CacheManager(backend=RedisBackend(url="redis://localhost:6379/0"), ...)
manager = CacheManager(backend=DiskBackend(path="/var/cache/omnicache"), ...)

Custom Backend

Implement the CacheBackend Protocol — no inheritance required (structural typing):

from omnicache_ai.backends.base import CacheBackend
from typing import Any

class MyBackend:
    def get(self, key: str) -> Any | None: ...
    def set(self, key: str, value: Any, ttl: int | None = None) -> None: ...
    def delete(self, key: str) -> None: ...
    def exists(self, key: str) -> bool: ...
    def clear(self) -> None: ...
    def close(self) -> None: ...

assert isinstance(MyBackend(), CacheBackend)  # True

Project Structure

omnicache_ai/
├── __init__.py                   # Public API surface (28 exports)
├── __main__.py                   # CLI: stats | flush | inspect <key>
├── config/
│   └── settings.py               # OmnicacheSettings dataclass + from_env()
├── backends/
│   ├── base.py                   # CacheBackend + VectorBackend Protocols
│   ├── async_base.py             # AsyncCacheBackend Protocol
│   ├── memory_backend.py         # InMemoryBackend (LRU, thread-safe, RLock)
│   ├── async_memory_backend.py   # AsyncInMemoryBackend (asyncio.Lock)
│   ├── disk_backend.py           # DiskBackend (diskcache, process-safe)
│   ├── redis_backend.py          # RedisBackend [optional: redis]
│   ├── tiered_backend.py         # TieredBackend (L1 memory + L2 any backend)
│   └── vector_backend.py         # FAISSBackend + ChromaBackend [optional]
├── core/
│   ├── key_builder.py            # namespace:type:sha256[:16] canonical keys
│   ├── metrics.py                # CacheMetrics (hits/misses/evictions/hit_rate)
│   ├── serializer.py             # Serializer protocol, PickleSerializer, JsonSerializer
│   ├── compressor.py             # Compressor protocol, GzipCompressor, NoopCompressor
│   ├── stampede.py               # StampedeShield (per-key threading.Lock)
│   ├── policies.py               # TTLPolicy, EvictionPolicy (wired into InMemoryBackend)
│   ├── invalidation.py           # Tag-based InvalidationEngine
│   └── cache_manager.py          # Central orchestrator + from_settings()
├── layers/
│   ├── embedding_cache.py        # np.ndarray ↔ bytes serialization
│   ├── retrieval_cache.py        # list[Document], pluggable serializer
│   ├── context_cache.py          # session_id + turn_index keyed
│   ├── response_cache.py         # model + messages + params keyed, stampede-safe
│   ├── semantic_cache.py         # exact → vector two-tier lookup
│   └── streaming_cache.py        # StreamingResponseCache (sync + async generators)
├── middleware/
│   ├── llm_middleware.py         # LLMMiddleware + AsyncLLMMiddleware
│   ├── embedding_middleware.py   # EmbeddingMiddleware
│   └── retriever_middleware.py   # RetrieverMiddleware
└── adapters/
    ├── openai_adapter.py         # OpenAICacheAdapter (chat.completions.create)
    ├── anthropic_adapter.py      # AnthropicCacheAdapter (messages.create)
    ├── langchain_adapter.py      # BaseCache (lookup/update/alookup/aupdate)
    ├── langgraph_adapter.py      # BaseCheckpointSaver (get_tuple/put/list + async)
    ├── autogen_adapter.py        # AssistantAgent 0.4+ + ConversableAgent 0.2.x
    ├── crewai_adapter.py         # Crew.kickoff() + kickoff_async()
    ├── agno_adapter.py           # Agent.run() + arun()
    └── a2a_adapter.py            # process() + aprocess() + @wrap

Development

# Clone and install with dev deps
git clone https://github.com/ashishpatel26/omnicache-ai
cd omnicache-ai
uv sync --dev

# Install pre-commit hooks (runs automatically on every commit)
uv run pre-commit install

# Run all tests
uv run pytest

# With coverage report
uv run pytest --cov=omnicache_ai --cov-report=term-missing

# Lint + format + type check (via pre-commit)
uv run pre-commit run --all-files

# Run specific layer tests
uv run pytest tests/layers/ tests/core/ -v

# Run adapter tests (requires optional deps)
uv run pytest tests/adapters/ -v

Contributing

We welcome contributions of all kinds — bug fixes, new backends, new adapters, documentation, and performance improvements.

File	Purpose
CONTRIBUTING.md	Full dev setup, coding standards, how to add backends/adapters
CODE_OF_CONDUCT.md	Contributor Covenant 2.1 — community standards
.github/ISSUE_TEMPLATE/bug_report.yml	Structured bug report with version, backend, reproduction fields
.github/ISSUE_TEMPLATE/feature_request.yml	Feature request with area dropdown, motivation, solution fields
.github/pull_request_template.md	PR checklist: type, changes, tests, breaking changes

Quick start for contributors:

git clone https://github.com/ashishpatel26/omnicache-ai
cd omnicache-ai
uv sync --dev
uv run pre-commit install
uv run pytest  # all green before you start

Open an issue or Discussion before starting large changes.

License

MIT — see LICENSE

Built with ❤️ for the AI engineering community
