# Integration Guide

You don't need to run the Headroom proxy to use Headroom. It is a compression library that works with **any** LLM client, proxy, or framework.

## Pick Your Path

| You have... | Use this | Setup |
|-------------|----------|-------|
| Any Python app | [`compress()`](#compress-function) | 2 lines |
| LiteLLM | [LiteLLM callback](#litellm) | 1 line |
| A Python proxy (FastAPI, custom) | [ASGI middleware](#asgi-middleware) | 1 line |
| Claude Code / Cursor | [Headroom proxy](#proxy) | 1 env var |
| Agno agents | [Agno integration](#agno) | Wrap model |
| LangChain | [LangChain integration](#langchain) | Wrap model |
| Non-Python app | [Headroom proxy](#proxy) | HTTP |
| TypeScript SDK | [`compress()`](#typescript-sdk) | `npm install headroom-ai` |
| Vercel AI SDK | [`headroomMiddleware()`](#typescript-sdk) | Middleware adapter |
| OpenAI Node SDK | [`withHeadroom()`](#typescript-sdk) | Client wrapper |
| Anthropic TS SDK | [`withHeadroom()`](#typescript-sdk) | Client wrapper |

---

## compress() Function

The simplest integration. Works with any LLM client.

```python
from headroom import compress

# Before sending to your LLM:
result = compress(messages, model="claude-sonnet-4-5-20250929")
response = your_client.create(messages=result.messages)  # Fewer tokens, same answer

print(f"Saved {result.tokens_saved} tokens ({result.compression_ratio:.0%})")
```

### With Anthropic SDK

```python
from anthropic import Anthropic
from headroom import compress

client = Anthropic()
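messages = [
    {"role": "user", "content": "What went wrong?"},
    # Anthropic carries tool calls as content blocks.
    # The id "toolu_01" and the tool name "get_logs" are placeholders.
    {"role": "assistant", "content": [
        {"type": "text", "text": "Let me check."},
        {"type": "tool_use", "id": "toolu_01", "name": "get_logs", "input": {}},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_01", "content": huge_json},
    ]},
]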

compressed = compress(messages, model="claude-sonnet-4-5-20250929")
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    messages=compressed.messages,
    max_tokens=1000,
)
```

### With OpenAI SDK

```python
from openai import OpenAI
from headroom import compress

client = OpenAI()
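messages = [
    {"role": "user", "content": "Analyze these results"},
    # A tool message must answer an assistant tool call with the same id.
    # The function name "run_analysis" is a placeholder.
    {"role": "assistant", "content": None, "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "run_analysis", "arguments": "{}"}},
    ]},
    {"role": "tool", "content": big_json_output, "tool_call_id": "call_1"},
]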

compressed = compress(messages, model="gpt-4o")
response = client.chat.completions.create(
    model="gpt-4o",
    messages=compressed.messages,
)
```

### With LiteLLM (direct)

```python
import litellm
from headroom import compress

messages = [...]
compressed = compress(messages, model="bedrock/claude-sonnet")
response = litellm.completion(model="bedrock/claude-sonnet", messages=compressed.messages)
```

### With any HTTP client

```python
import httpx
from headroom import compress

compressed = compress(messages, model="claude-sonnet-4-5-20250929")
response = httpx.post("https://api.anthropic.com/v1/messages", json={
    "model": "claude-sonnet-4-5-20250929",
    "max_tokens": 1000,  # required by the Anthropic Messages API
    "messages": compressed.messages,
}, headers={"X-Api-Key": api_key, "anthropic-version": "2023-06-01"})
```

### What compress() returns

```python
result = compress(messages, model="gpt-4o")
result.messages           # list[dict] — compressed messages, same format as input
result.tokens_before      # int — original token count
result.tokens_after       # int — compressed token count
result.tokens_saved       # int — tokens removed
result.compression_ratio  # float — 0.0 (no savings) to 1.0 (100% removed)
result.transforms_applied # list[str] — what ran (e.g., ["router:smart_crusher:0.35"])
```
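
The fields above are related by simple arithmetic. The sketch below uses a hypothetical stand-in class (`ResultSketch`, not part of Headroom) and assumes that `tokens_saved` is the difference between the two counts and that `compression_ratio` is the fraction of original tokens removed. That matches the 0.0 to 1.0 range described above, but it has not been checked against Headroom's source.

```python
from dataclasses import dataclass


@dataclass
class ResultSketch:
    """Hypothetical stand-in for the object compress() returns (not Headroom's class)."""

    tokens_before: int
    tokens_after: int

    @property
    def tokens_saved(self) -> int:
        # Assumption: saved tokens are the difference between the two counts.
        return self.tokens_before - self.tokens_after

    @property
    def compression_ratio(self) -> float:
        # Assumption: fraction of the original tokens that were removed.
        if self.tokens_before == 0:
            return 0.0
        return self.tokens_saved / self.tokens_before


sketch = ResultSketch(tokens_before=10_000, tokens_after=6_500)
print(f"Saved {sketch.tokens_saved} tokens ({sketch.compression_ratio:.0%})")
```

Under these assumptions, a ratio of 0.35 means 35% of the input tokens were removed, which is how the `:.0%` format in the first example should be read.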

---

## LiteLLM

If you're already using LiteLLM as your LLM gateway, add Headroom as a callback:

```python
import litellm
from headroom.integrations.litellm_callback import HeadroomCallback

litellm.callbacks = [HeadroomCallback()]

# All calls now compressed automatically
response = litellm.completion(model="gpt-4o", messages=[...])
response = litellm.completion(model="bedrock/claude-sonnet", messages=[...])
response = litellm.completion(model="azure/gpt-4o", messages=[...])
```

The callback compresses messages in LiteLLM's `pre_call_hook` before they're sent to the provider. Works with all 100+ LiteLLM-supported providers.

### With LiteLLM Proxy

If you run LiteLLM as a proxy server, use the ASGI middleware instead:

```python
# In your LiteLLM proxy startup
from litellm.proxy.proxy_server import app
from headroom.integrations.asgi import CompressionMiddleware

app.add_middleware(CompressionMiddleware)
```

Or use the callback in your LiteLLM config:

```yaml
# litellm_config.yaml
litellm_settings:
  callbacks: ["headroom.integrations.litellm_callback.HeadroomCallback"]
```

---

## ASGI Middleware

Drop-in middleware for any ASGI application (FastAPI, Starlette, LiteLLM proxy, custom proxies).

```python
from fastapi import FastAPI
from starlette.applications import Starlette
from headroom.integrations.asgi import CompressionMiddleware

# FastAPI
app = FastAPI()
app.add_middleware(CompressionMiddleware)

# Starlette
app = Starlette(routes=[...])
app.add_middleware(CompressionMiddleware)

# LiteLLM proxy
from litellm.proxy.proxy_server import app
app.add_middleware(CompressionMiddleware)
```

The middleware intercepts POST requests to `/v1/messages`, `/v1/chat/completions`, `/v1/responses`, and `/chat/completions`. All other requests pass through untouched.
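
The routing rule above can be sketched as a small predicate. This is an illustration only: `should_compress` is a hypothetical helper, and exact path matching is an assumption (the real middleware may treat prefixes or trailing slashes differently).

```python
# Paths listed above as intercepted by the middleware.
COMPRESSIBLE_PATHS = frozenset({
    "/v1/messages",
    "/v1/chat/completions",
    "/v1/responses",
    "/chat/completions",
})


def should_compress(method: str, path: str) -> bool:
    """Hypothetical helper: True if a request would be intercepted."""
    # Assumption: only POST requests to an exactly matching path qualify.
    return method.upper() == "POST" and path in COMPRESSIBLE_PATHS


print(should_compress("POST", "/v1/chat/completions"))
print(should_compress("GET", "/v1/chat/completions"))
print(should_compress("POST", "/health"))
```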

Response headers include:
- `x-headroom-compressed: true` — compression was applied
- `x-headroom-tokens-saved: 1234` — tokens removed
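
A client can read these headers to track savings per request. The helper below is a minimal sketch: the function name is hypothetical, and it assumes that a missing `x-headroom-compressed` header means no compression was applied and that the token count is a plain base-10 integer.

```python
from typing import Mapping, Optional


def tokens_saved_from_headers(headers: Mapping[str, str]) -> Optional[int]:
    """Return the tokens saved, or None if compression was not applied."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    if lowered.get("x-headroom-compressed", "").lower() != "true":
        return None
    try:
        return int(lowered.get("x-headroom-tokens-saved", ""))
    except ValueError:
        # Header missing or not an integer: treat as unknown.
        return None


example_headers = {
    "X-Headroom-Compressed": "true",
    "X-Headroom-Tokens-Saved": "1234",
}
print(tokens_saved_from_headers(example_headers))
```

Most HTTP clients expose response headers as a mapping (for example `response.headers` in httpx), so the helper can be called on them directly.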

---

## Proxy

The Headroom proxy is a standalone HTTP server. Best for non-Python apps or tools that only support base URL configuration (Claude Code, Cursor).

```bash
pip install "headroom-ai[all]"
headroom proxy --port 8787
```

```bash
# Claude Code
ANTHROPIC_BASE_URL=http://localhost:8787 claude

# Cursor / Any OpenAI client
OPENAI_BASE_URL=http://localhost:8787/v1 cursor
```
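
Any HTTP client can target the proxy the same way. The sketch below builds, but does not send, an OpenAI-style chat completion request using only the Python standard library. `build_chat_request` is a hypothetical helper, the API key is a placeholder, and it assumes the proxy forwards the `Authorization` header to the upstream provider.

```python
import json
import urllib.request

# Same base URL as the OPENAI_BASE_URL setting above.
PROXY_BASE_URL = "http://localhost:8787/v1"


def build_chat_request(messages: list, model: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{PROXY_BASE_URL}/chat/completions",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: the proxy passes this header through upstream.
            "Authorization": f"Bearer {api_key}",
        },
    )


request = build_chat_request(
    [{"role": "user", "content": "Hello"}], model="gpt-4o", api_key="sk-placeholder"
)
print(request.get_method(), request.full_url)
# With the proxy running, send it with: urllib.request.urlopen(request)
```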

### With Cloud Providers

```bash
# AWS Bedrock
headroom proxy --backend bedrock --region us-east-1

# Google Vertex AI
headroom proxy --backend vertex_ai --region us-central1

# Azure OpenAI
headroom proxy --backend azure

# OpenRouter (400+ models)
OPENROUTER_API_KEY=sk-or-... headroom proxy --backend openrouter
```

See [Proxy Documentation](proxy.md) for all options.

---

## Agno

Full integration with the Agno agent framework.

```python
from agno.agent import Agent
from agno.models.anthropic import Claude
from headroom.integrations.agno import HeadroomAgnoModel

model = HeadroomAgnoModel(Claude(id="claude-sonnet-4-20250514"))
agent = Agent(model=model, tools=[your_tools])
response = agent.run("Investigate the issue")

print(f"Tokens saved: {model.total_tokens_saved}")
```

See [Agno Guide](agno.md) for hooks, multi-provider, and streaming.

---

## LangChain

Full integration with LangChain — chat models, memory, retrievers, tool wrappers, and streaming.

```python
from langchain_openai import ChatOpenAI
from headroom.integrations import HeadroomChatModel

llm = HeadroomChatModel(ChatOpenAI(model="gpt-4o"))
response = llm.invoke("Hello!")
```

See [LangChain Guide](langchain.md) for details and known limitations.

---

## TypeScript SDK

For Node.js, Next.js, and any TypeScript/JavaScript application.

```bash
npm install headroom-ai
```

See the [TypeScript SDK Guide](typescript-sdk.md) for full documentation including Vercel AI SDK middleware, OpenAI SDK wrapper, and Anthropic SDK wrapper.

---

## OpenClaw

Context compression plugin for [OpenClaw](https://github.com/openclaw/openclaw) agents.

```bash
pip install "headroom-ai[proxy]"
openclaw plugins install headroom-openclaw
```

Configure Headroom as the context engine:
```json
{ "plugins": { "slots": { "contextEngine": "headroom" } } }
```

The plugin auto-detects a running Headroom proxy or starts one. Compression happens in `assemble()` — zero changes to the agent's behavior.

See the [OpenClaw plugin documentation](https://github.com/chopratejas/headroom/tree/main/plugins/openclaw) for full setup.

---

## Compression Hooks (Advanced)

Customize compression behavior without modifying Headroom's code:

```python
from headroom import compress, CompressionHooks, CompressContext

class MyHooks(CompressionHooks):
    def pre_compress(self, messages, ctx: CompressContext):
        # Modify messages before compression (dedup, filter, inject)
        return messages

    def compute_biases(self, messages, ctx: CompressContext):
        # Per-message compression aggressiveness
        # >1.0 = keep more, <1.0 = compress more
        return {5: 1.5, 6: 0.5}  # Keep message 5, compress message 6

    def post_compress(self, event):
        # Observe results (logging, analytics, learning)
        print(f"Saved {event.tokens_saved} tokens")

result = compress(messages, model="gpt-4o", hooks=MyHooks())
```
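
One possible bias policy is to protect the most recent messages and compress older ones harder. The function below is a sketch of such a policy, not something Headroom ships: `recency_biases` and its parameters are hypothetical, and it assumes the keys of the returned dict are zero-based message indices.

```python
def recency_biases(messages: list, keep_last: int = 2,
                   keep_bias: float = 1.5, old_bias: float = 0.5) -> dict:
    """Hypothetical policy: protect the newest messages, squeeze the rest."""
    cutoff = max(len(messages) - keep_last, 0)
    biases = {}
    for index in range(len(messages)):
        # >1.0 = keep more, <1.0 = compress more (as described above).
        biases[index] = keep_bias if index >= cutoff else old_bias
    return biases


example = [{"role": "user", "content": f"turn {i}"} for i in range(4)]
print(recency_biases(example))  # {0: 0.5, 1: 0.5, 2: 1.5, 3: 1.5}
```

A `compute_biases` hook could return the result of a function like this directly, which keeps the policy testable on its own.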

See [Architecture](ARCHITECTURE.md) for how hooks integrate with the pipeline.

---

## FAQ

**Q: Does Headroom change the response format?**
No. Your LLM returns the same response format. Headroom only modifies the input messages.

**Q: What if compression removes something the LLM needs?**
Headroom stores originals in CCR (Compress-Cache-Retrieve). The LLM can call `headroom_retrieve` to get full uncompressed content. Compression summaries tell the LLM what's available.

**Q: Does it work with streaming?**
Yes. Compression happens before the request is sent. Streaming responses are unaffected.

**Q: How much latency does it add?**
15-200ms depending on content size and type. Small JSON arrays take ~15ms, large tool outputs take 100-200ms. The token savings typically save far more time on the LLM side than compression adds — a 50% token reduction on a Sonnet call saves seconds of generation time. See [Latency Benchmarks](LATENCY_BENCHMARKS.md) for real numbers.
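
The trade-off can be estimated with simple arithmetic. The sketch below is a simplification: it assumes the provider's processing time scales linearly with input tokens, and the throughput figure is a placeholder you would replace with your own measurement. `net_latency_saved_ms` is a hypothetical helper, not part of Headroom.

```python
def net_latency_saved_ms(tokens_saved: int, compress_ms: float,
                         input_tokens_per_second: float) -> float:
    """Estimated provider time saved minus the compression overhead, in ms."""
    if input_tokens_per_second <= 0:
        raise ValueError("input_tokens_per_second must be positive")
    # Assumption: provider time is linear in the number of input tokens.
    provider_ms_saved = tokens_saved / input_tokens_per_second * 1000.0
    return provider_ms_saved - compress_ms


# Placeholder inputs for illustration only; measure your own workload.
estimate = net_latency_saved_ms(
    tokens_saved=20_000, compress_ms=200.0, input_tokens_per_second=10_000.0
)
print(f"Estimated net saving: {estimate:.0f} ms")
```

A positive result means compression pays for itself on latency alone. A negative result means the payload is too small for the overhead to be recovered in time, although the token savings still apply.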