# DeepSeek V4 Flash (Q2): Capability Limits and Usage Guidance

## TL;DR

The local **DeepSeek V4 Flash Q2** build that ships in DS4 / Deeptide local mode is a **technology-feasibility preview**, not a production coding model. It exists to prove that V4-class architectures (DSA attention, indexed top-k, compressed KV, MoE routing) can run end-to-end on a single Apple Silicon machine, and it accomplishes that goal. It is **not** a faithful approximation of cloud V4's behavior, and using it as one will produce a long string of small disappointments before you give up.

If you want the model to actually help you finish work, use one of:

- **Cloud DeepSeek V4 / V4-Pro** (API, full precision, 1M context) — for general agent work.
- **Local V4 Flash Q4_K mixed-precision** (from [`opensota/deepseek-v4-gguf`](https://huggingface.co/opensota/deepseek-v4-gguf), file `DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf`, ~153 GiB) — the in-architecture upgrade. Needs a host with the memory headroom (M3/M4 Ultra-class 192 GB+; will **not** fit on a 128 GB M3/M4 Max).
- **A local non-V4 Q4-class GGUF** such as Qwen3-Coder-Next UD-Q4_K_XL or GLM-4.7-Flash UD-Q6_K_XL — for offline coding agents on the same 128 GB envelope where the V4 Q4_K build can't load.

Use V4 Flash Q2 when the *thing you are validating* is "does the local stack work at all" — never when the thing you need is "a correct answer."

## Why Q2 specifically

Q2 represents each weight with only **4 quantization levels** (2 bits, so values in {−1.5σ, −0.5σ, +0.5σ, +1.5σ} after the per-block scale). That is a brutal compression ratio. It works surprisingly well on lightly-loaded paths — token embeddings, MLP layers with broad activation distributions — but it falls off a cliff on three specific paths that V4 *relies on heavily*:

| Path | What Q2 breaks |
|------|----------------|
| Attention output projection (`Wo`) | Quantization noise stacks linearly with sequence length. Attention scores become coarse — fine-grained "this token attends to that token" routing drops to "this token attends to roughly this region." Past ~32k tokens, perplexity rises visibly and recall on long-range references collapses. |
| DSA indexer top-k selection | The indexer learns a low-rank projection to score "which past tokens matter for this query." With 4 levels of weight resolution, the score distribution flattens and the top-k cutoff stops being meaningful. The model still returns *some* tokens, just often the wrong ones. |
| RoPE-dependent positional encoding | Position-modulated attention logits need at least 3-bit resolution to keep the sin/cos rotation phase information intact. At Q2 the rotation gets aliased; relative positions past a few thousand tokens become noisy. |

The cumulative effect: V4 Flash Q2 acts like a *much smaller* model on long contexts. It is not "the same model, slightly worse." It is qualitatively different.

By contrast:

- **Q3 (3-bit)** preserves most of the attention path's resolution. Long-context perplexity is acceptable up to roughly 100k tokens.
- **Q4_K_XL** (effectively 4.5–5 bits with unsloth-style mixed-precision groups) is the conventional "no apologies needed" floor for local serving. Outliers in activations are preserved; attention scores stay well-shaped well past 128k. This is the regime the Qwen3-Coder-Next and GLM-4.7-Flash profiles target.
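To make the bit-width comparison concrete, here is a minimal sketch that quantizes synthetic Gaussian weights to 2, 3, and 4 bits and reports the round-trip error. It assumes plain uniform min/max quantization over 32-weight blocks, which is a simplification for illustration and not the exact scheme any particular GGUF quantizer uses; the function names and the block size are hypothetical.

```python
import random

def quantize_block(block, bits):
    """Uniform min/max quantization of one block to 2**bits levels, then dequantize."""
    levels = 2 ** bits
    lo, hi = min(block), max(block)
    # Guard against a constant block, where the range would be zero.
    step = (hi - lo) / (levels - 1) or 1.0
    codes = [round((x - lo) / step) for x in block]
    return [lo + c * step for c in codes]

def rms_error(weights, bits, block_size=32):
    """Root-mean-square round-trip error across all blocks (block_size is an assumption)."""
    total = 0.0
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        restored = quantize_block(block, bits)
        total += sum((a - b) ** 2 for a, b in zip(block, restored))
    return (total / len(weights)) ** 0.5

rng = random.Random(0)  # fixed seed so runs are repeatable
weights = [rng.gauss(0.0, 1.0) for _ in range(4096)]
errors = {bits: rms_error(weights, bits) for bits in (2, 3, 4)}
for bits, err in errors.items():
    print(f"{bits}-bit ({2 ** bits} levels): RMS error {err:.3f}")
```

Each extra bit roughly halves the step between representable values, which is why the gap between 2 and 3 bits matters more in absolute terms than the gap between 3 and 4.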

## What this means for Deeptide local mode

Deeptide's local profile for V4 Flash is intentionally tuned to play *within* Q2's strengths:

1. **64k context cap.** The profile reports `contextWindow = 64_000` in both `LocalAgentPolicy` (the local-mode profile) and `ModelContextWindow.forModel` (the cloud-routing compaction lookup). Both must stay aligned: earlier versions of `ModelContextWindow` used a loose `contains("deepseek-v4")` substring match that incorrectly returned the cloud-V4 1M window for the local Flash build, so the agent loop believed it had headroom that it did not have.

2. **No subagent fanout, serial tool execution.** `InferenceProfile.local(...)` sets `disablesSubagents: true` and `serialToolExecution: true`. Parallel tool calls let Q2 fall behind quickly because tool-result blocks compound state across turns.

3. **Compact prompts, no Tide banners in system messages.** The system-prompt section for the V4 Flash profile is shorter than the qwen / glm sections — every kilotoken of system prompt reduces the budget for the actual task.

The other local profiles do not share these caps because they do not share Q2's quality cliff:

| Profile | Context | Why |
|---------|---------|-----|
| `deepseek-v4-flash` (Q2 dsedge) | **64_000** | Q2 attention quality cliff |
| `deepseek-v4-flash-q4k` (Q4_K mixed-precision dsedge) | **1_000_000** | Q4 experts + F16 indexer/HC/compressor preserve long-range attention |
| `qwen3-coder-next` (Q4_K_XL) | 131_072 | Q4 holds at 128k; this is the native window |
| `glm-4.7-flash` (Q6_K_XL) | 131_072 | Q6 has more headroom than the context needs |
| `qwen3.6-35b-a3b` (Q4_K_M) | 131_072 | Native window |
| Cloud `deepseek-v4*` (API) | 1_000_000 | Server-side; full precision |
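The cost of a long system prompt under each cap is simple arithmetic, sketched below. The window values come from the table above; the `task_budget` helper and the 4,096-token output reserve are hypothetical and are not taken from the Deeptide sources.

```python
# Local context windows, copied from the profile table.
CONTEXT_WINDOWS = {
    "deepseek-v4-flash": 64_000,
    "deepseek-v4-flash-q4k": 1_000_000,
    "qwen3-coder-next": 131_072,
    "glm-4.7-flash": 131_072,
    "qwen3.6-35b-a3b": 131_072,
}

def task_budget(profile, system_prompt_tokens, reserved_output_tokens=4_096):
    """Tokens left for files, tool results, and history after fixed overheads.

    reserved_output_tokens is an assumed reserve for the model's reply.
    """
    window = CONTEXT_WINDOWS[profile]
    return max(0, window - system_prompt_tokens - reserved_output_tokens)

for profile in ("deepseek-v4-flash", "qwen3-coder-next"):
    print(profile, task_budget(profile, system_prompt_tokens=6_000))
```

The same 6,000-token system prompt takes a much larger share of the 64k window than of a 131_072-token one, which is the reason the V4 Flash section is kept short.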

If you are tempted to raise the V4 Flash context window past 64k, re-run the local-agent benchmark suite under `benchmarks/local-agent/` with prompts spanning 32k / 64k / 96k tokens **first**. The 96k bucket is where the regression usually appears, and it is not always visible in a single-shot perplexity check — it shows up as the agent hallucinating file paths, forgetting tool results from 20 turns ago, or producing diffs that no longer match the file it just read.

## The Q4_K mixed-precision local build

The `opensota/deepseek-v4-gguf` repository on Hugging Face publishes a higher-fidelity local V4 build that solves the Q2 quality cliff at the cost of a much larger memory footprint. The file is:

```
DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf
```

The naming encodes a per-tensor-group quantization plan:

| Path | Precision | Why |
|------|-----------|-----|
| MoE experts | **Q4_K** | Experts dominate model size; Q4_K is the standard "no apologies" floor for MoE weights |
| Head-channel projection (`HC`) | **F16** | Tiny in absolute terms; quantization here directly aliases positional encoding, so it stays unquantized at F16 |
| Compressed-KV projection | **F16** | Compressed KV is the only KV path past the sliding window — any precision loss here compounds across the entire long context |
| DSA indexer | **F16** | The top-k routing path. Q2 broke this; F16 makes it boringly correct |
| Attention `Q`/`K`/`V`/`Wo` | **Q8** | Q8 attention preserves head separation; Q8_0 dot-products are well-behaved on Metal |
| Shared expert | **Q8** | Hit on every token; F16 here would balloon size with no measurable quality gain over Q8 |
| Output head | **Q8** | Final projection — keep enough precision that the top-1 logit is stable |
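As an illustration of how the filename maps to the table, the sketch below splits it on hyphens and pairs each precision tag with its tensor-group label. The regular expression and the `parse_plan` helper are assumptions made for this example; the repository does not document a formal naming grammar here.

```python
import re

FILENAME = (
    "DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-"
    "Q8Attn-Q8Shared-Q8Out-chat-v2.gguf"
)

# A precision tag (Q4K, Q8, F16, ...) followed by a capitalized group name.
TOKEN = re.compile(r"^(Q\d+K?|F\d+)([A-Z][A-Za-z]*)$")

def parse_plan(filename):
    """Return {tensor_group: precision} for every token that looks like a plan entry."""
    stem = filename.rsplit(".", 1)[0]  # drop the .gguf extension
    plan = {}
    for token in stem.split("-"):
        match = TOKEN.match(token)
        if match:
            precision, group = match.groups()
            plan[group] = precision
    return plan

for group, precision in parse_plan(FILENAME).items():
    print(f"{group:<12} {precision}")
```

Tokens such as `DeepSeek`, `V4`, `Flash`, `chat`, and `v2` do not start with a precision tag, so they are skipped.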

Resident size is about **153 GiB**. Deeptide's local profile loads this as `deepseek-v4-flash-q4k` (aliases: `flash-q4`, `flash-q4k`, `v4-flash-q4`, `v4-flash-q4k`). The profile's context window is set to **1M tokens**, matching the cloud V4 figure. The Q4_K + F16 indexer combination preserves long-range attention well enough that the architecture's native 1M window is usable in practice, not just aspirational.

### Who should use it

- Hosts with 192 GB+ unified memory (M3 Ultra / M4 Ultra-class). It will not fit on a 128 GB Max — `mmap` may succeed but the working set is too large to keep resident, and you will spend the entire session swapping.
- Workloads that actually need a long context (large-repo onboarding, multi-hour agent sessions, code review across a big diff). On short contexts the Q2 build is faster and the quality gap is smaller; the Q4_K build pays off where Q2's long-range attention falls down.
- Operators who want to keep all V4-specific architecture quirks (DSA, indexer, compressed KV) in their evaluation — switching to Qwen3-Coder-Next or GLM-4.7-Flash gives you Q4 quality but on a different architecture.

### Who should not

- Anyone on 128 GB or less. The Qwen Q4_K_XL or GLM Q6_K_XL profiles are the right local choice — they fit, hold 128k cleanly, and don't require V4-specific server code.
- Anyone where the bottleneck is decode latency. Q4_K experts decode faster than F16 experts but slower than Q2 experts; if you measured Q2 as "fast enough but wrong," Q4_K will be "slower and correct," not "as fast and correct."

### How to enable

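1. Confirm the host has the memory headroom. `vm_stat | head -10` should show free + inactive memory comfortably above 160 GiB before launch. `vm_stat` reports page counts, so multiply by the page size printed in its header line. The weights alone are about 153 GiB, which is roughly 164 GB in decimal units.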
2. Download the GGUF from `opensota/deepseek-v4-gguf` to `~/Zero/models/deepseek-v4-flash-q4k/` (the path the bundled profile expects), or override with `deeptide config set local.model_path /your/path.gguf`.
3. Start the local server with the Q4_K alias: `deeptide local start --model flash-q4 --ctx 1000000`.
4. Run agent sessions normally — they pick up the new profile via `deeptide --local --model flash-q4`.

If you cannot fit the file, do not lower its context window as a workaround. The bottleneck is resident weight bytes, not KV cache; a smaller context will still fail to keep the model in memory. Use a different profile instead.
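The headroom check can be scripted. The sketch below parses `vm_stat` text, converts free + inactive pages to bytes, and compares the result with the model's resident size. The 8 GiB margin, the helper names, and the sample numbers are assumptions for illustration only.

```python
import re

GIB = 1024 ** 3
GB = 1000 ** 3

# Made-up sample in the layout vm_stat prints; real output has more rows.
SAMPLE = """Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                             9000000.
Pages active:                            500000.
Pages inactive:                         2000000.
"""

def reclaimable_bytes(vm_stat_output):
    """Free + inactive memory in bytes, using the page size from the header line."""
    page_size = int(re.search(r"page size of (\d+) bytes", vm_stat_output).group(1))
    rows = re.findall(r"^Pages (free|inactive):\s+(\d+)\.", vm_stat_output, re.MULTILINE)
    pages = {name: int(count) for name, count in rows}
    return (pages["free"] + pages["inactive"]) * page_size

def fits(vm_stat_output, model_gib=153, headroom_gib=8):
    """True if the weights plus an assumed safety margin fit in free + inactive memory."""
    return reclaimable_bytes(vm_stat_output) >= (model_gib + headroom_gib) * GIB

print(f"153 GiB is {153 * GIB / GB:.1f} GB in decimal units")
print("fits:", fits(SAMPLE))
```

On a macOS host, the text could come from `subprocess.run(["vm_stat"], capture_output=True, text=True).stdout`. Context length does not appear in the check, because the constraint is resident weight bytes.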

## When Q2 is actually fine

- Smoke-testing the local serving stack (DSEdge boot, Metal graph compile, KV-disk persistence).
- Walking through the agent loop on a small, well-scoped task (one or two files, a few hundred lines).
- Reproducing a server-side issue locally without consuming cloud credits.
- Demos that prioritize "this runs on my laptop" over "this answer is right."
- Anything where you are going to verify every line of output anyway.

## When Q2 will let you down

- Tasks that need to hold more than ~6–8 files in working context.
- Long-running agent sessions with many tool calls (recall of earlier turns degrades faster on the Q2 build than on the Q4-class builds).
- Anything requiring exact recall of a function signature or interface defined earlier in the session.
- Code review across a large diff.
- Tasks where a small inaccuracy compounds (refactors, migrations, API renames).

For any of the above: switch to cloud V4 (`deeptide --provider deepseek-cloud`), or to one of the Q4-class local profiles (`deeptide local --model qwen3-coder-next` or `--model glm-4.7-flash`).

## Implementation notes for maintainers

- `Sources/AgentLoop/LocalAgentPolicy.swift` is the source of truth for per-model local context windows. The Q2 default (`defaultContextWindow = 64_000`) and the Q4_K profile (`contextWindow: 1_000_000`) are deliberately separate constants — do not collapse them.
- `Sources/AgentLoop/CompactionManager.swift::ModelContextWindow.forModel` checks model ids in this order: `deepseek-v4-flash-q4` (1M) → `deepseek-v4-flash` (64k) → `deepseek-v4` (1M, cloud). The ordering matters because `String.contains` is a substring match, so a more specific id has to appear first. If you add another local V4 variant, slot it in *before* the bare `deepseek-v4-flash` check, not after.
- `Sources/Configuration/ModelAlias.swift` short aliases (`flash`, `fast`, `v4-flash` → Q2 build; `flash-q4`, `flash-q4k`, `v4-flash-q4`, `v4-flash-q4k` → Q4_K build) all currently resolve to the *local* canonical id. If you ever add a cloud Flash SKU, make sure these aliases do not silently route cloud traffic to a 64k-capped context.
- The pricing entry in `Sources/Configuration/CostTracker.swift` for `deepseek-v4-flash` is independent of the context cap and reflects the actual local cost (zero); leave it alone. The Q4_K id has no pricing entry, which is fine — local runs do not charge.
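The ordering rule for `ModelContextWindow.forModel` can be illustrated outside Swift. The Python sketch below is not the actual implementation: it mirrors the described check order with a list of `(substring, window)` pairs and adds a hypothetical `check_order` guard that fails if a general id is placed ahead of a more specific one. Returning `None` for unknown ids is also an assumption.

```python
# Most specific id first; the list order carries the logic.
CONTEXT_RULES = [
    ("deepseek-v4-flash-q4", 1_000_000),  # local Q4_K build
    ("deepseek-v4-flash", 64_000),        # local Q2 build
    ("deepseek-v4", 1_000_000),           # cloud V4
]

def context_window_for(model_id, rules=CONTEXT_RULES):
    """Return the window of the first rule whose substring occurs in model_id."""
    lowered = model_id.lower()
    for needle, window in rules:
        if needle in lowered:
            return window
    return None  # unknown id; the fallback behavior is not specified here

def check_order(rules):
    """Raise if an earlier substring would shadow a later, more specific one."""
    for i, (earlier, _) in enumerate(rules):
        for later, _ in rules[i + 1:]:
            if earlier in later:
                raise ValueError(f"{earlier!r} shadows {later!r}; move it later")

check_order(CONTEXT_RULES)
for model_id in ("deepseek-v4-flash-q4k", "deepseek-v4-flash", "deepseek-v4-pro"):
    print(model_id, context_window_for(model_id))
```

A guard of this kind, run in a unit test, would catch a new local V4 variant slotted in after the bare `deepseek-v4-flash` check.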