
dotLLM

High-performance LLM inference engine written natively in C#/.NET

CI · License: GPL v3 · .NET · Website

Website · Documentation · Roadmap · Discussions


About

dotLLM is a ground-up LLM inference engine for .NET — not a wrapper around llama.cpp or Python libraries. All orchestration, model loading, tokenization, sampling, and CPU compute are implemented in pure C#, with CUDA GPU acceleration via PTX kernels loaded through the CUDA Driver API (no native shared library). It targets transformer-based models (Llama, Mistral, Phi, Qwen, DeepSeek) with SIMD-optimized CPU and CUDA GPU backends.

Status: Phase 6 complete — speculative decoding, paged KV-cache, Native AOT (experimental), and startup warm-up on top of the OpenAI-compatible API server, built-in chat UI, constrained decoding (JSON/schema/regex/grammar), tool calling, and prompt caching. CUDA GPU backend with CPU/GPU hybrid offloading and KV-cache quantization. SIMD-optimized CPU inference with Q4_K_M, chat templates, streaming, multi-threading, NUMA pinning. Supports Llama, Mistral, Phi, Qwen. Phase 7 (diagnostics & interpretability) in progress — logprobs landed. See Roadmap.

Key Features

Performance

  • Zero-GC inference — unmanaged memory (NativeMemory.AlignedAlloc, 64-byte aligned) for all tensor data; no managed heap allocations on the hot path
  • SIMD vectorization — TensorPrimitives + hand-tuned System.Runtime.Intrinsics for quantized matmul, RMSNorm, RoPE, softmax
  • Memory-mapped model loading — GGUF files loaded via MemoryMappedFile; OS demand-paging means multi-GB models load in milliseconds
  • Quantized inference — FP16, Q8_0, Q4_K_M and other GGUF quantization formats; fused scale×int dot-product kernels operating directly on quantized blocks
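
The demand-paging claim is easy to see in miniature: mapping a model file and decoding its fixed-size header touches only the first page, no matter how large the file is. The sketch below follows the GGUF header layout (magic `GGUF`, uint32 version, uint64 tensor count, uint64 metadata KV count); it is an illustration of the technique, not dotLLM's loader.

```python
import mmap
import struct

def read_gguf_header(path):
    """Map a GGUF file and read only its 24-byte fixed header.

    The mmap call does no bulk I/O; the OS faults in just the first
    page, so even a multi-GB model "opens" in milliseconds.
    """
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        magic, version = struct.unpack_from("<4sI", mm, 0)
        if magic != b"GGUF":
            raise ValueError("not a GGUF file")
        tensor_count, metadata_kv_count = struct.unpack_from("<QQ", mm, 8)
        mm.close()
        return version, tensor_count, metadata_kv_count
```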

Architecture Support

  • Transformer models — Llama, Mistral, Phi, Qwen, DeepSeek via parameterized TransformerBlock and ModelConfig
  • Attention mechanisms — MHA, MQA, GQA via parameterized ModelConfig, with IAttentionStrategy for kernel selection
  • Position encoding — RoPE, ALiBi, absolute, none — pluggable via IPositionEncoding
  • Composable sampling — ISamplerStep chain: repetition penalty → temperature → top-k → top-p → min-p → categorical sample
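
As a rough illustration of how such a chain composes, here is the same pipeline in plain Python. Function and parameter names are invented for the sketch — this is not the ISamplerStep API, which operates on spans without allocations:

```python
import math
import random

def sample_chain(logits, recent_ids, temp=0.7, top_k=40, top_p=0.95,
                 min_p=0.0, repeat_penalty=1.1, rng=random):
    """Illustrative sampling chain in dotLLM's documented order:
    repetition penalty -> temperature -> top-k -> top-p -> min-p -> sample."""
    logits = list(logits)
    # Repetition penalty: dampen tokens generated recently.
    for t in set(recent_ids):
        logits[t] = logits[t] / repeat_penalty if logits[t] > 0 else logits[t] * repeat_penalty
    # Temperature 0 means greedy argmax (the CLI default).
    if temp == 0:
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [l / temp for l in logits]
    # Softmax, then candidates sorted by descending probability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = sorted(((e / z, i) for i, e in enumerate(exps)), reverse=True)
    # Top-k: keep the k most likely tokens (0 = disabled).
    if top_k > 0:
        probs = probs[:top_k]
    # Top-p (nucleus): smallest prefix whose cumulative mass reaches p.
    cum, nucleus = 0.0, []
    for p, i in probs:
        nucleus.append((p, i))
        cum += p
        if cum >= top_p:
            break
    probs = nucleus
    # Min-p: drop tokens below min_p * best probability (0 = disabled).
    if min_p > 0:
        cutoff = min_p * probs[0][0]
        probs = [(p, i) for p, i in probs if p >= cutoff]
    # Renormalize over survivors and draw a categorical sample.
    z = sum(p for p, _ in probs)
    r = rng.random() * z
    for p, i in probs:
        r -= p
        if r <= 0:
            return i
    return probs[-1][1]
```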

Serving

  • OpenAI-compatible API — /v1/chat/completions, /v1/completions, tool calling, streaming via ASP.NET
  • Paged KV-cache — PagedAttention with block-level allocation, prefix caching, and copy-on-write
  • Speculative decoding — draft-verify-accept with KV-cache rollback (greedy mode today; non-greedy planned — see issue #121)
  • Structured output — FSM/PDA-based constrained decoding guaranteeing valid JSON, JSON Schema, regex, and grammar
  • (Planned) Continuous batching — iteration-level scheduling with preemption and priority queuing — Phase 9, see Roadmap
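
The speculative-decoding loop above fits in a few lines. `target_greedy` and `draft_greedy` are hypothetical stand-ins for a batched target verification pass and single-token drafting (not dotLLM's API), and only the shipped greedy mode is sketched:

```python
def speculative_step(target_greedy, draft_greedy, prompt, k=5):
    """Greedy draft-verify-accept.

    draft_greedy(tokens) -> one next token.
    target_greedy(tokens) -> the target's argmax next token for every
    position from the end of the prompt onward (one batched pass).
    """
    # 1. The small draft model proposes k tokens autoregressively.
    drafted = []
    ctx = list(prompt)
    for _ in range(k):
        t = draft_greedy(ctx)
        drafted.append(t)
        ctx.append(t)
    # 2. One target forward pass scores prompt + all k drafts:
    #    preds[i] is the target's choice after prompt + drafted[:i].
    preds = target_greedy(list(prompt) + drafted)
    # 3. Accept drafts while they match the target's own choice; on the
    #    first mismatch, emit the target token and discard the rest of
    #    the draft (KV-cache rollback in the real engine).
    accepted = []
    for i, t in enumerate(drafted):
        if t == preds[i]:
            accepted.append(t)
        else:
            accepted.append(preds[i])
            break
    else:
        accepted.append(preds[k])  # bonus token: all k drafts accepted
    return accepted
```

Each step therefore emits between 1 and k+1 tokens for a single target forward pass, which is where the speedup comes from when the draft agrees often.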

Extensibility

  • Pluggable backends — IBackend interface with separate packages per backend (CPU, CUDA, ROCm)
  • Diagnostic hooks — zero-cost IInferenceHook points for activation capture, logit lens, SAE integration
  • (Planned) LoRA adapters — runtime loading, no weight merging, concurrent multi-adapter serving — Phase 7, see Roadmap
  • (Planned) OpenTelemetry observability — System.Diagnostics.Metrics + Activity for throughput, latency, and per-request tracing — Phase 7 / Phase 9, see Roadmap

Architecture Overview

dotLLM is organized as a layered architecture where each layer depends only on the layers below it:

┌─────────────────────────────────────────┐
│            DotLLM.Server                │  ASP.NET OpenAI-compatible API
├─────────────────────────────────────────┤
│            DotLLM.Engine                │  KV-cache, scheduler, samplers,
│                                         │  constraints, speculative decoding
├──────────┬──────────┬───────────────────┤
│ DotLLM.  │ DotLLM.  │ DotLLM.Cpu/Cuda   │  GGUF/SafeTensors, BPE/SPM,
│ Models   │Tokenizers│ (backends)        │  SIMD kernels / CUDA kernels
├──────────┴──────────┴───────────────────┤
│            DotLLM.Core                  │  Interfaces, tensor types, config
└─────────────────────────────────────────┘

Each project ships as a separate NuGet package, so users pull in only what they need. DotLLM.Core defines all abstractions (ITensor, IBackend, IModel, ISamplerStep, etc.) while concrete implementations live in their respective projects.

Getting Started

There are two paths: grab a pre-built release and run it, or clone the repo and build from source.

Use a pre-built release

Pick one of three install options.

Option A — install as a global .NET tool (requires .NET 10 runtime):

dotnet tool install -g DotLLM.Cli --prerelease

# Download a model once, then use it anywhere
dotllm model pull QuantFactory/SmolLM-135M-GGUF
dotllm run QuantFactory/SmolLM-135M-GGUF -p "The capital of France is" -n 64
dotllm serve QuantFactory/SmolLM-135M-GGUF   # OpenAI-compatible API + chat UI

Option B — download a self-contained binary (no .NET install needed — the runtime is bundled):

Grab the archive for your platform from the latest release:

  • Windows x64: dotllm-<version>-win-x64.zip
  • Linux x64: dotllm-<version>-linux-x64.tar.gz
  • macOS (Apple Silicon): dotllm-<version>-osx-arm64.tar.gz

Unpack and run:

# Linux / macOS
tar -xzf dotllm-<version>-linux-x64.tar.gz
cd dotllm-<version>-linux-x64
./dotllm model pull QuantFactory/SmolLM-135M-GGUF
./dotllm run QuantFactory/SmolLM-135M-GGUF -p "The capital of France is" -n 64
./dotllm serve QuantFactory/SmolLM-135M-GGUF   # OpenAI-compatible API + chat UI

# Windows
Expand-Archive dotllm-<version>-win-x64.zip -DestinationPath .
cd dotllm-<version>-win-x64
.\dotllm.exe model pull QuantFactory/SmolLM-135M-GGUF
.\dotllm.exe run QuantFactory/SmolLM-135M-GGUF -p "The capital of France is" -n 64

Experimental: Native AOT builds for Linux and Windows are also attached to each release (dotllm-<version>-aot-<rid>.{zip,tar.gz}) — smaller and faster to start, but please file an issue if you hit a crash.

Option C — reference the libraries from your .NET app — see NuGet Packages below.

Build from source

Clone the repository and build with the .NET 10 SDK.

Prerequisites: the .NET 10 SDK.

git clone https://github.com/kkokosa/dotLLM.git
cd dotLLM
dotnet build -c Release

When built from source, replace dotllm <subcommand> in the Usage examples below with dotnet run --project src/DotLLM.Cli -c Release -- <subcommand>.

Usage

dotLLM ships a single CLI tool with four command groups:

  • dotllm model — download, list, search, and inspect GGUF models
  • dotllm run — single-shot text generation with a performance summary
  • dotllm chat — interactive multi-turn REPL with chat template formatting
  • dotllm serve — OpenAI-compatible HTTP API with a built-in web chat UI

Models are identified by a local .gguf path or a HuggingFace repo ID (e.g., QuantFactory/SmolLM-135M-GGUF). Models must be downloaded explicitly with dotllm model pull before they can be used — run, chat, and serve read from ~/.dotllm/models/ and do not auto-fetch.

Manage models

# Search HuggingFace for GGUF repos
dotllm model search llama --limit 5
# Download a repo (streams the .gguf files + tokenizer metadata into ~/.dotllm/models/)
dotllm model pull QuantFactory/SmolLM-135M-GGUF
# List everything cached locally
dotllm model list
# Show architecture, quantizations, and tokenizer info for a cached repo
dotllm model info QuantFactory/SmolLM-135M-GGUF
# Remove a cached repo
dotllm model delete QuantFactory/SmolLM-135M-GGUF

Run — single-shot generation

Encodes a prompt, streams tokens to stdout, and prints a performance + memory summary.

# Greedy generation (default: temperature=0, max-tokens=128)
dotllm run QuantFactory/SmolLM-135M-GGUF -p "The capital of France is" -n 64
# Sampled generation
dotllm run QuantFactory/SmolLM-135M-GGUF -p "Once upon a time" -n 128 -t 0.7 --top-k 40 --top-p 0.95
# JSON output (for scripting / piping)
dotllm run QuantFactory/SmolLM-135M-GGUF -p "Hello" --json
# Select a specific quantization when a repo has multiple .gguf files
dotllm run QuantFactory/SmolLM-135M-GGUF -p "Test" -q Q8_0
# GPU inference (requires NVIDIA GPU + CUDA Toolkit)
dotllm run QuantFactory/SmolLM-135M-GGUF -p "The capital of France is" --device gpu
# NUMA / P-core aware CPU threading
dotllm run QuantFactory/SmolLM-135M-GGUF -p "Test" --threads 8 --decode-threads 4 --numa-pin
# KV-cache quantization (Q8_0 / Q4_0) to fit longer contexts in memory
dotllm run QuantFactory/SmolLM-135M-GGUF -p "..." --cache-type-k q8_0 --cache-type-v q8_0
# Constrained JSON output
dotllm run QuantFactory/SmolLM-135M-GGUF -p "List 3 colors as JSON." --response-format json_object
# Speculative decoding — target + draft must share the same vocabulary.
# Example pair (validated by scripts/test_models_speculative.py):
#   target: Llama-3.2-3B-Instruct Q8_0   draft: Llama-3.2-1B-Instruct Q4_K_M
dotllm model pull bartowski/Llama-3.2-3B-Instruct-GGUF
dotllm model pull bartowski/Llama-3.2-1B-Instruct-GGUF
dotllm run bartowski/Llama-3.2-3B-Instruct-GGUF -q Q8_0 \
  -p "Explain what a CPU cache is in one sentence." -n 48 \
  --speculative-model bartowski/Llama-3.2-1B-Instruct-GGUF --speculative-k 5

Sample output:

── dotllm | Llama 30L/576H | Q8_0 | 16 threads | greedy ──────────────────
The capital of France is Paris. Paris is a city of romance and culture,

╭──────────────────────────────────────────────────────────────────────────╮
│                                                                          │
│  Generation Complete                                      163.27 tok/s   │
│                                                                          │
│  Performance                                                             │
│  Prefill            12.3 ms       6 tokens       487.80 tok/s            │
│  Decode             91.8 ms      15 tokens       163.40 tok/s            │
│  Sampling            0.1 ms      15 tokens                               │
│  ──────────────────────────────────────────────────────────              │
│  Total             104.2 ms      21 tokens       201.54 tok/s            │
│  Load              456.7 ms                                              │
│                                                                          │
│  Memory                                                                  │
│  Weights         136.73 MiB      (memory-mapped)                         │
│  Compute           2.25 MiB                                              │
│  KV Cache        158.20 MiB      (192 slots)                             │
│  ──────────────────────────────────────────────────────────              │
│  Total           297.18 MiB                                              │
│                                                                          │
│  length | 6 prompt, 15 generated                                         │
╰──────────────────────────────────────────────────────────────────────────╯

Chat — interactive REPL

Multi-turn chat with persistent history, using the model's built-in chat template (falling back to ChatML). Prompt caching reuses KV-cache state across turns so subsequent turns skip redundant prefill.
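
The prefix-match step can be sketched as below. `cached_prefix_len` is an invented name, not dotLLM's API; the final-token handling is a common prefix-cache subtlety assumed here, not something the README specifies.

```python
def cached_prefix_len(cached_tokens, new_tokens):
    """Element-wise prefix match between a cached session's token ids
    and the incoming prompt; the matching prefix skips prefill and
    only the new suffix is processed."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    # Keep at least the final prompt token out of the cache hit so the
    # model still computes fresh logits for it (an assumption here,
    # common to prefix-cache implementations).
    return min(n, len(new_tokens) - 1)
```

In a multi-turn chat the new prompt is the old one plus a few messages, so nearly every token matches and TTFT on later turns collapses.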

# Basic chat
dotllm chat QuantFactory/SmolLM-135M-GGUF
# With a system prompt and sampling
dotllm chat QuantFactory/SmolLM-135M-GGUF --system "You are a helpful assistant." -t 0.8 --top-p 0.95
# GPU + KV-cache quantization for long contexts
dotllm chat bartowski/Llama-3.2-3B-Instruct-GGUF --device gpu --cache-type-k q8_0 --cache-type-v q8_0

In-session commands: /exit or /quit to leave, /clear to reset history (keeps the system prompt), /system <text> to change the system prompt.

Sample session:

── dotllm chat | Llama 30L/576H | Q8_0 | 16 threads | greedy ─────────────
Type /exit to quit, /clear to reset history, /system <text> to set system prompt.

>>> Hello, how are you?
I'm doing well, thank you for asking! How can I help you today?
[42 prompt tokens, 18 generated tokens, 28 ms TTFT, 487.8 prefill tok/s, 163.4 decode tok/s]

>>> What is 2+2?
2 + 2 = 4.
[78 prompt tokens, 12 generated tokens, 45 ms TTFT, 312.5 prefill tok/s, 155.2 decode tok/s]

>>> /clear
History cleared.
>>> /exit

Serve — OpenAI-compatible API + chat UI

Starts a local HTTP server exposing an OpenAI-compatible API (/v1/chat/completions, /v1/completions, /v1/models, /v1/tokenize, streaming SSE, tool calling) plus a built-in single-page web chat UI. Paged KV-cache, prompt caching, and startup warm-up are on by default. The browser opens automatically unless --no-browser is set.

# Start the server with a loaded model and open the chat UI
dotllm serve QuantFactory/SmolLM-135M-GGUF
# Bind a public interface, custom port, API-only, no auto-browser
dotllm serve QuantFactory/SmolLM-135M-GGUF --host 0.0.0.0 --port 9000 --no-ui --no-browser
# Start without a model — pick one from the chat UI
dotllm serve
# GPU with partial hybrid offload and more warm-up iterations
dotllm serve bartowski/Llama-3.2-3B-Instruct-GGUF --device gpu --gpu-layers 24 --warmup-iterations 5
# Speculative decoding — draft must share the target's vocabulary
dotllm serve bartowski/Llama-3.2-3B-Instruct-GGUF -q Q8_0 \
  --speculative-model bartowski/Llama-3.2-1B-Instruct-GGUF --speculative-k 5

Any OpenAI-compatible client works against the running server:

curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "SmolLM-135M",
    "messages": [{"role":"user","content":"Say hi in one word."}],
    "stream": true
  }'

To embed the same endpoints inside your own ASP.NET Core app, see Host the OpenAI API in your ASP.NET app below.

CLI option reference

Common options (shared by run, chat, and serve):

| Option | Short | Default | Description |
|---|---|---|---|
| --device | -d | cpu | Compute device: cpu, gpu, gpu:0, gpu:1 |
| --gpu-layers | | (all if gpu, 0 if cpu) | Transformer layers on GPU (hybrid offload) |
| --threads | | 0 (auto) | CPU threads for inference |
| --decode-threads | | 0 (auto) | Decode threads (capped at memory channels) |
| --numa-pin | | false | Pin workers to NUMA-local cores (multi-socket) |
| --pcore-only | | false | Pin workers to P-cores only (Intel hybrid) |
| --quant | -q | (auto) | Quant filter when a repo has multiple .gguf files (e.g., Q4_K_M) |
| --cache-type-k | | f32 | KV-cache key quant: f32, q8_0, q4_0 |
| --cache-type-v | | f32 | KV-cache value quant: f32, q8_0, q4_0 |
| --speculative-model | | (none) | Draft model for speculative decoding (must share vocab) |
| --speculative-k | | 5 | Draft tokens per speculative step |

Sampling & constraints (shared by run and chat):

| Option | Short | Default (run) | Default (chat) | Description |
|---|---|---|---|---|
| --max-tokens | -n | 128 | 512 | Max tokens per generation |
| --temp | -t | 0 (greedy) | 0 (greedy) | Sampling temperature |
| --top-k | | 0 (off) | 0 (off) | Top-K sampling |
| --top-p | | 1.0 | 1.0 | Nucleus threshold |
| --min-p | | 0 (off) | 0 (off) | Min-P threshold |
| --repeat-penalty | | 1.0 | 1.0 | Repetition penalty |
| --repeat-last-n | | 0 (full) | 0 (full) | Penalty lookback window |
| --seed | -s (run only) | (random) | (random) | Random seed for reproducibility |
| --cache-window | | 0 | 0 | Full-precision tail window for KV quant |
| --paged | | off | off | Use paged (block-based) KV-cache |
| --response-format | | text | text | text, json_object, json_schema, regex, grammar |
| --schema | | | | JSON Schema (or @file.json) for json_schema |
| --pattern | | | | Regex pattern for regex |
| --grammar | | | | GBNF grammar (or @file.gbnf) for grammar |
| --tools | | | | Tool definitions JSON (or @file.json) |

run-only:

  • --prompt / -p — input prompt (required)
  • --json — emit a single JSON result object (suppresses formatted output)

chat-only:

  • --system / -s — system prompt
  • --tool-choice — auto (default), none, required, or a function name
  • --no-prompt-cache — disable KV-cache reuse across turns
  • --prompt-cache-size — max cached sessions (default: 1)
  • --verbose / -v — debug output (finish reason, raw text, tool-call details)

serve-only:

| Option | Short | Default | Description |
|---|---|---|---|
| --host | | localhost | Bind address |
| --port | -p | 8080 | Port to listen on |
| --no-ui | | false | Disable the built-in chat UI (API only) |
| --no-browser | | false | Don't auto-open the browser |
| --no-paged | | false | Disable paged KV-cache (paged is on by default for serve) |
| --no-prompt-cache | | false | Disable KV-cache reuse across requests |
| --prompt-cache-size | | 4 | Max cached sessions |
| --no-warmup | | false | Disable startup warm-up passes |
| --warmup-iterations | | 3 | Warm-up iteration count |

Short-flag gotcha: -p is prompt under run but port under serve. -s is seed under run but system under chat. When in doubt, use the long form.

Development

Debug build

Building in Debug configuration (-c Debug) enables a debug command group with diagnostic tools for inspecting GGUF files and model internals. These commands are excluded from Release builds via #if DEBUG.

# Build in Debug mode
dotnet build src/DotLLM.Cli -c Debug
# Inspect GGUF file structure
dotnet run --project src/DotLLM.Cli -c Debug -- debug gguf-header model.gguf
dotnet run --project src/DotLLM.Cli -c Debug -- debug gguf-metadata model.gguf
dotnet run --project src/DotLLM.Cli -c Debug -- debug gguf-tensors model.gguf
dotnet run --project src/DotLLM.Cli -c Debug -- debug gguf-config model.gguf
# Tokenizer round-trip verification
dotnet run --project src/DotLLM.Cli -c Debug -- debug tokenize model.gguf --text "Hello world"
# Single forward pass with top-10 logit diagnostics
dotnet run --project src/DotLLM.Cli -c Debug -- debug forward-pass model.gguf --prompt "Hello"
# Inspect embedding vector for a token ID
dotnet run --project src/DotLLM.Cli -c Debug -- debug embed-lookup model.gguf --token-id 1
| Command | Description |
|---|---|
| debug gguf-header | GGUF header structure (magic, version, tensor/metadata counts) |
| debug gguf-metadata | All metadata key-value pairs |
| debug gguf-tensors | Tensor descriptors (name, shape, quantization type, offset) |
| debug gguf-config | Extracted ModelConfig (architecture, layers, dims, RoPE params) |
| debug tokenize | Encode text → token IDs → decode, verify round-trip fidelity |
| debug forward-pass | Single forward pass, top-10 predicted tokens with softmax probabilities |
| debug embed-lookup | Raw embedding vector for a given token ID |

Debug builds are significantly slower than Release (~2-10x) because JIT optimizations, inlining, and SIMD vectorization are reduced. Always use -c Release for performance measurements.

Tests

Unit and integration tests:

dotnet test

Integration tests automatically download several GGUF models (~4.5 GB total) from HuggingFace to ~/.dotllm/test-cache/ on first run. The first dotnet test will take a while; subsequent runs use the cache. To run only unit tests (no downloads): dotnet test tests/DotLLM.Tests.Unit.

GPU tests (tagged Category=GPU) require an NVIDIA GPU and run full model inference — they can take 20-30 minutes. They are skipped automatically on machines without CUDA. To exclude them explicitly: dotnet test tests/DotLLM.Tests.Unit/ --filter "Category!=GPU"

Model correctness smoke tests (scripts/test_models.py) run dotLLM CLI with greedy decoding across architectures (Llama, Mistral, Phi, Qwen) and verify expected output:

# Build CLI first
dotnet build src/DotLLM.Cli -c Release
# List available test cases and which models are cached
python scripts/test_models.py --list
# Run tests for all cached models
python scripts/test_models.py
# Download missing models and run all tests
python scripts/test_models.py --download
# Run only specific architectures
python scripts/test_models.py --filter phi,qwen

Models are downloaded from HuggingFace to ~/.dotllm/models/ on first use and cached for subsequent runs.

Sample output:

Test                                Arch       Result      Time  Details
=====================================================================================================
SmolLM-135M                         Llama      PASS        2.1s  Paris  (163.3 tok/s)
Llama-3.2-1B-Instruct-Q4            Llama      PASS        5.7s  Paris  (31.0 tok/s)
Qwen2.5-0.5B-Instruct               Qwen       PASS        3.2s  Paris  (78.5 tok/s)
Phi-3-mini-4k-instruct              Phi        PASS       12.4s  Paris  (14.2 tok/s)
=====================================================================================================

4/4 passed, 0 failed, 0 skipped

Benchmarks

Three scripts in scripts/ provide benchmarking at different levels:

bench_compare.py -- Single-point benchmark. Runs dotLLM (via BenchmarkDotNet) and optionally llama.cpp on one or more models, reports best-of-N throughput with CV (coefficient of variation):

# Benchmark dotLLM on SmolLM-135M (auto-downloads from HuggingFace)
python scripts/bench_compare.py --model QuantFactory/SmolLM-135M-GGUF --quant Q8_0
# Benchmark multiple models and quantizations
python scripts/bench_compare.py \
  --model QuantFactory/SmolLM-135M-GGUF,bartowski/Llama-3.2-1B-Instruct-GGUF \
  --quant Q4_K_M,Q8_0
# Compare dotLLM vs llama.cpp side-by-side
python scripts/bench_compare.py --model QuantFactory/SmolLM-135M-GGUF --dotllm --llamacpp
# Export results to JSON for later comparison
python scripts/bench_compare.py --model QuantFactory/SmolLM-135M-GGUF \
  --export-json benchmarks/results/baseline.json --label baseline

Sample output:

=== dotLLM Benchmark Results ===

  Model                  Prefill tok/s   Decode tok/s   Decode ms/tok   Total tok/s     CV
  SmolLM-135M.Q8_0             229.2          182.7           5.47         175.3      14.7%
  SmolLM-135M.Q4_K_M           165.0          230.1           4.35         198.2      20.5%

All values are best-of-N (max tok/s, min ms). CV is the coefficient of variation
across N iterations -- lower means more stable measurements.

bench_trend.py -- Interactive comparison of exported JSON results. Displays color-coded delta tables with noise-aware highlighting:

# Interactive mode: select runs and models to compare
python scripts/bench_trend.py
# Compare two specific result files
python scripts/bench_trend.py benchmarks/results/baseline.json benchmarks/results/optimized.json
# Show all results as a trend table
python scripts/bench_trend.py --all

Sample output (trend across three labeled runs):

                             Benchmark Trend
 Label      Date        Model               Prefill tok/s   Decode tok/s   CV
 baseline   2026-03-11  SmolLM-135M.Q4_K_M          127.6          109.5    -
 step29     2026-03-13  SmolLM-135M.Q4_K_M          142.2          127.0    -
 step30     2026-03-13  SmolLM-135M.Q4_K_M          146.0           98.6    -

bench_history.py -- Benchmark across git commits. Creates worktrees for each commit, runs bench_compare in each, and displays trend tables with per-commit deltas:

# Benchmark last 5 commits on main
python scripts/bench_history.py myrun --last 5
# Benchmark from a specific commit to HEAD
python scripts/bench_history.py myrun --from f3d3bf8
# Show results from a previous run (no benchmarking)
python scripts/bench_history.py myrun --show
# Interactively select which commits to benchmark
python scripts/bench_history.py myrun --last 10 --select

Sample output:

                     Benchmark History -- Llama-3.2-3B-Instruct-Q8_0
 Label                   Date        Prefill tok/s  %chg pf  Decode tok/s  %chg dc     CV
 test_run_0 (f3d3bf8)    2026-03-11          21.2                     7.4              3.8%
 test_run_1 (cdb5234)    2026-03-12          24.9    +17.5%           8.0    +8.1%     2.1%
 test_run_2 (a062743)    2026-03-13          24.5    ~-1.6%           7.8   ~-2.5%     4.5%
 test_run_3 (6c06fbf)    2026-03-14          24.6    ~+0.4%           7.8   ~+0.0%     3.2%
 test_run_4 (d1978d2)    2026-03-15          25.4     +3.3%           7.0   -10.3%     5.1%
 test_run_5 (572179d)    2026-03-16          25.5    ~+0.4%           7.8   +11.4%     4.2%

%chg columns show commit-to-commit deltas. ~ prefix means the change is within noise (CV threshold). CV requires multiple BenchmarkDotNet iterations (controlled by --runs in bench_compare).

Why best-of-N instead of median? On a non-isolated machine, run-to-run noise is typically 6--30%. The median includes runs degraded by OS scheduling jitter, thermal throttling, and background I/O. Best-of-N (maximum throughput) represents what the hardware can achieve and is more stable across sessions. CV is reported alongside so you can judge measurement quality -- if CV is high, the environment was noisy and even the best-of-N value should be taken with a grain of salt.
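
The two reported numbers reduce to a few lines of arithmetic. This is an illustration of the math described above, not the benchmark scripts' actual code:

```python
import statistics

def summarize_runs(tok_per_s):
    """Best-of-N throughput plus coefficient of variation (percent)
    over N >= 2 benchmark iterations."""
    best = max(tok_per_s)                              # best-of-N: peak throughput
    mean = statistics.mean(tok_per_s)
    cv = statistics.stdev(tok_per_s) / mean * 100.0    # high CV = noisy environment
    return best, cv
```

For example, runs of 100, 90, and 110 tok/s report best-of-N 110 tok/s with a 10% CV; the median (100) would instead reflect whichever runs happened to be disturbed.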

llama.cpp setup

To run comparison benchmarks against llama.cpp:

  1. Get llama.cpp -- either download a prebuilt release or build from source:

    git clone https://github.com/ggerganov/llama.cpp.git
    cd llama.cpp && cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build --config Release
  2. Point bench_compare to the binary -- either:

    • Set LLAMACPP_BIN environment variable to the path of llama-cli
    • Or pass --llamacpp-bin /path/to/llama-cli on each invocation
  3. Run comparison:

    python scripts/bench_compare.py --model QuantFactory/SmolLM-135M-GGUF --dotllm --llamacpp

llama.cpp is optional. All dotLLM benchmarks work without it. The --llamacpp flag simply adds a side-by-side comparison column.

NuGet Packages

dotLLM ships as a set of NuGet packages so you can reference only what you need from your own .NET app:

| Package | Description |
|---|---|
| DotLLM.Core | Core abstractions — tensor types, backend interfaces, model config, sampling, attention strategies, diagnostics hooks |
| DotLLM.Cpu | CPU backend — SIMD-optimized quantized matmul, RMSNorm, RoPE, softmax, attention |
| DotLLM.Cuda | CUDA GPU backend — PTX kernels via CUDA Driver API, cuBLAS prefill, CPU/GPU hybrid offload |
| DotLLM.Models | Memory-mapped GGUF/SafeTensors loaders, parameterized TransformerBlock (Llama/Mistral/Phi/Qwen/DeepSeek) |
| DotLLM.Tokenizers | BPE, SentencePiece, HuggingFace tokenizer.json, Jinja2-subset chat templates |
| DotLLM.Engine | Inference engine — KV-cache, scheduler, samplers, constrained decoding, speculative decoding |
| DotLLM.Server | OpenAI-compatible HTTP server, tool calling, built-in chat UI |
| DotLLM.HuggingFace | HuggingFace Hub search and GGUF download/caching |
| DotLLM.Diagnostics | Interpretability hooks — activation capture, logit lens, logprobs |
| DotLLM.Telemetry | Placeholder package; System.Diagnostics.Metrics counters and Activity-based tracing are planned (Phase 7 / Phase 9) |
| DotLLM.Cli | dotnet tool — the dotllm command (run / chat / serve / model management) |

Install the engine plus CPU backend for a minimal setup:

dotnet add package DotLLM.Engine
dotnet add package DotLLM.Cpu
dotnet add package DotLLM.Models
dotnet add package DotLLM.Tokenizers

Or install the CLI as a global tool:

dotnet tool install -g DotLLM.Cli

All packages track the same version and are published together on each release.

Host the OpenAI API in your ASP.NET app

DotLLM.Server is a library — reference it from your own ASP.NET Core host to expose dotLLM's OpenAI-compatible routes. Two patterns are supported.

Mode 1 — run a dedicated dotLLM WebApplication inside your process. Simplest; you hand off model loading and routing to dotLLM.

using DotLLM.Server;

var options = new ServerOptions
{
    Model = "llama-3.2-3b.Q4_K_M.gguf",
    Device = "gpu",
    GpuLayers = 32,
    Port = 8080,
    PromptCacheEnabled = true,
    UsePaged = true,
};

using var state = ServerStartup.LoadModel(options.Model, options);
var app = ServerStartup.BuildApp(state, args: [], serveUi: true);
await app.RunAsync($"http://localhost:{options.Port}");

Mode 2 — attach dotLLM routes to your own WebApplication. Use this when you want dotLLM's endpoints alongside your own routes, middleware, and services.

using DotLLM.Server;

var builder = WebApplication.CreateBuilder(args);

// Load the model up front so the state can be registered in DI
// before builder.Build() — dotLLM endpoints resolve ServerState from DI.
var modelPath = "llama-3.2-3b.Q4_K_M.gguf";
var state = ServerStartup.LoadModel(modelPath, new ServerOptions
{
    Model = modelPath,
    Device = "cpu",
    PromptCacheEnabled = true,
    ModelId = "llama-3.2-3b",
});

// Your own services...
builder.Services.AddAuthentication();

// Register dotLLM's state + JSON context in the same IServiceCollection:
builder.Services.AddDotLLM(state);

var app = builder.Build();

// Your own middleware and routes...
app.UseAuthentication();
app.MapGet("/hello", () => "hi");

// Mount dotLLM's OpenAI-compatible endpoints:
app.MapDotLLMEndpoints(serveUi: false);

app.Run();

Both modes transparently reuse the embedded chat UI assets if serveUi: true. The ServerState is IDisposable — dispose it on shutdown to release the model and KV-cache.

News

  • 2026-04 — First public release (v0.1.0-preview.1) — dotLLM goes public. NuGet packages for all 10 libraries + DotLLM.Cli as a global dotnet tool. Self-contained single-file downloads for Windows / Linux / macOS (Apple Silicon) and experimental Native AOT builds for Linux / Windows attached to every GitHub Release. Companion website at dotllm.dev (#119)
  • 2026-04 — Wave 7: CPU performance cleanup pass — TopKSampler replaces full Array.Sort with a hand-rolled size-K min-heap (O(N log K), stack-resident scratch); JsonSchemaConstraint adds first-char bucketing to skip the ~160 MB of struct clones per mask build when the tracker rejects most leading characters, plus LRU eviction instead of the previous full-flush cache overflow; Dequantize.Q5_0 gains an AVX2 path matching Q8_0's throughput (reuses MatMulQ5_0.ExtractQ5HighBits / vpshufb bit-extraction); BpeTokenizer pre-splits special tokens via the existing Trie.TryMatchLongest instead of the O(n × m) linear scan; ComputeThreadPool now pins the caller (inference) thread to the first candidate P-core on first Dispatch, eliminating the hybrid-CPU stall where pinned P-core workers idled at the barrier waiting for an E-core caller. New BenchmarkDotNet suites for TopK sampling, schema mask build, and special-token encode (#109)
  • 2026-04 — Phase 7 begins: Logprobs — OpenAI-compatible logprobs: true + top_logprobs: N (0-20) on /v1/chat/completions and /v1/completions. Per-token log-softmax captured before sampling, returned in both streaming SSE chunks and non-streaming responses. Chat UI gains opt-in logprobs visualization: color-coded token confidence (green/lime/yellow/orange/red), hover tooltips with top-K alternatives and probabilities, diagnostic cues for low confidence, ambiguity, and sampling effect. DotLLM.Sample.Logprobs console sample with ANSI-colored output (#101)
  • 2026-04 — Phase 6 complete: Speculative decoding — draft-verify-accept loop with modified rejection sampling. A small draft model proposes K candidate tokens; the target model verifies in one batched forward pass. Greedy fast-path for temperature=0. IKvCache.Rollback() for KV-cache truncation on rejection. IDecodingConstraint.Clone() for constraint rollback. Serve UI gains draft model selector and K slider. --speculative-model and --speculative-k CLI options. Speculative acceptance rate in response timings (#98)
  • 2026-04 — Paged KV-cache — block-based KV-cache memory management (the allocation half of PagedAttention): shared block pool, block tables, ref counting, copy-on-write. Foundation for advanced prefix sharing (step 37, hard requirement) and speculative decoding (step 43, cheap rollback/fork). Attention kernels still operate on contiguous buffers via staging-buffer gather — true paged attention kernels are a future step. --paged (opt-in for CLI), on by default for serve. --no-ui flag for API-only hosting. See docs/KV_CACHE.md (#96)
  • 2026-04 — Native AOT (experimental) — opt-in dotnet publish -p:PublishAot=true produces a single-file dotllm binary with ~50ms startup (vs ~500ms JIT). Source-generated JSON serialization across all projects, CreateSlimBuilder for ASP.NET, TrimmerRoots.xml for Spectre.Console.Cli type preservation. JIT remains default for best throughput (Dynamic PGO). See docs/AOT.md (#94)
  • 2026-04 — Phase 6 begins: Warm-up — configurable dummy inference passes at server startup trigger .NET Tier-1 JIT promotion (Dynamic PGO) and exercise CUDA/cuBLAS pipelines. /ready probe gates on warm-up completion. --no-warmup to disable, --warmup-iterations N to configure (#92)
  • 2026-04 — Phase 5 complete: simple prompt caching completes the constrained decoding & API phase — all 7 steps done (JSON mode, JSON Schema, regex/CFG, tool calling, API server, chat UI, prompt caching)
  • 2026-04 — Simple prompt caching — PrefixCache keeps live KV-cache instances across generation calls. On each turn, element-wise prefix match finds cached tokens and skips redundant prefill, processing only new suffix tokens. LRU eviction with configurable max sessions. Enabled by default in chat and serve commands (--no-prompt-cache to disable). Cached token count reported in CLI stats, API timings.cached_tokens, and Chat UI stats bar. Near-100% cache hit rate in multi-turn chat, dramatically reducing TTFT on subsequent turns (#90)
  • 2026-04 — Built-in web chat UI — dotllm serve model.gguf starts the API server and opens a browser to a bundled single-page chat UI (vanilla JS + TailwindCSS, embedded as resources in the DLL). Per-message inference stats (prefill/decode tok/s, TTFT), live sampling parameter control, model hot-swap from the UI, system prompt, verbose mode. New endpoints: /props, /v1/config, /v1/models/available, /v1/models/load. Streaming SSE now includes usage + timings in the final chunk (#86)
  • 2026-04 — ASP.NET OpenAI-compatible API server — DotLLM.Server with /v1/chat/completions (streaming SSE + non-streaming), /v1/completions, /v1/models, /v1/tokenize, /v1/detokenize, health/ready probes. Chat template formatting, tool calling with IToolCallParser detection, response_format constrained decoding. Model loading at startup via --model/--device CLI args. Sequential request processing via semaphore (#84)
  • 2026-04 — Tool calling — IToolCallParser implementations for Llama 3.1+, Hermes/Qwen, Mistral, and generic fallback. Auto-detection factory selects parser from model architecture and chat template content. ToolCallSchemaBuilder generates JSON Schema from tool definitions for constrained decoding (tool_choice=required). ToolCallDetector for post-generation detection, StreamingToolCallAccumulator for streaming. --tools and --tool-choice CLI options with multi-turn tool use in chat REPL. Parallel tool calls supported (#82)
  • 2026-04 — Regex + CFG constrained decoding — RegexConstraint compiles patterns to minimized DFA (Thompson NFA → subset construction → Hopcroft minimization) with equivalence-class compression. GrammarConstraint parses GBNF grammars into PDA with InlineArray-based call stack. Both use zero-alloc struct simulators and dictionary-cached token masks. --response-format regex --pattern <pattern> and --response-format grammar --grammar <gbnf|@file> CLI support (#80)
  • 2026-03 — JSON Schema constrained decoding — JsonSchemaConstraint layers schema tracking on JsonCharParser to enforce type constraints, required properties, enum values, nested structures. Schema compiled into flat node array with property-name tries. Zero-alloc Clone() via struct-copy. --response-format json_schema --schema <json|@file> CLI support (#78)
  • 2026-03 — Phase 5 begins: JSON mode constrained decoding — JsonConstraint FSM guarantees syntactically valid JSON output via per-token vocabulary masking. Stack-based PDA (RFC 8259), AVX2-vectorized logit masking, state-keyed mask cache. --response-format json_object CLI flag (#76)
  • 2026-03 — KV-cache quantization: Q8_0 and Q4_0 KV-cache compression on CPU and GPU (3.7–7.1× memory reduction). Separate --cache-type-k/--cache-type-v options, mixed-precision window --cache-window N keeps recent tokens in full precision. Dual-region storage with quantize-on-evict, per-tile dequantization in tiled attention (#74)
  • 2026-03 — CPU/GPU hybrid layer offloading: --gpu-layers N to run first N layers on GPU, remainder on CPU. Automatic FP16→FP32 hidden state transfer at boundary. Split KV-cache (GPU FP16 + CPU FP32). Partial VRAM usage proportional to offloaded layers (#72)
  • 2026-03 — CUDA GPU backend: PTX kernels via CUDA Driver API P/Invoke (no native shared library), cuBLAS HGEMM for prefill, custom quantized GEMV for decode (Q8_0, Q4_K, Q6_K), FP16 activation pipeline, on-the-fly weight dequantization, GPU KV-cache, --device gpu CLI flag, --device both benchmarking (#70)
  • 2026-03 — NUMA-aware threading: adaptive spin-wait dispatch (generation counter with event fallback), NUMA topology detection (Windows/Linux), P-core/E-core awareness, CPU affinity pinning, auto-reduced decode thread count (#57)
  • 2026-03 — Operator fusion: fused RMSNorm+quantize (decode-only, eliminates normOut intermediate buffer) and tiled SwiGLU (1KB L1-resident sigmoid buffer) reduce DRAM roundtrips on the decode hot path (#56)
  • 2026-03 — Fast approximate exp/softmax: Schraudolph IEEE-754 bit-manipulation trick replaces polynomial exp (~3 SIMD ops vs ~12) in attention softmax. AVX2/AVX-512 fused shift+exp+sum pass eliminates 3 separate TensorPrimitives calls. Sampling softmax keeps full precision (#55)
  • 2026-03 — Tiled attention with online softmax: O(N) memory flash-attention-style algorithm replaces O(N²) score matrix materialization, eliminates 64 MB/head allocations at ctx 4096, uses ~1 KB stack per head (#54)
  • 2026-03 — Row-interleaved weight repacking: R4 layout stores 4 consecutive rows' blocks contiguously at model load time, improving cache/TLB locality for all quantized GEMV kernels (#52)
  • 2026-03 — Q8_1 input quantization: precomputed block sums for Q5_0 kernels, 2-block loop unrolling, eliminates ~4 SIMD ops/block from Q5_0 vec_dot hot path (#51)
  • 2026-03 — Fused decode dispatch: Q/K/V (3→1) and Gate/Up (2→1) projection fusion saves ~72 dispatches/layer, ~4% decode throughput improvement (#50)
  • 2026-03 — Phase 2 complete: additional model architectures (Mistral, Phi, Qwen), sliding window attention, fused QKV support, IModel interface, ModelLoader helper (#34)
  • 2026-03 — Streaming token generation: IAsyncEnumerable<GenerationToken> API with UTF-8-safe incremental text, CancellationToken support, and per-token finish reason/timings (#31)
  • 2026-03 — Chat template engine: Jinja2-subset interpreter (lexer→parser→evaluator), IChatTemplate implementation, GgufChatTemplateFactory, dotllm chat REPL command (#30)
  • 2026-03 — Mixed quantization + Q8_K: Q8_K input quantization (float32 scale, 256-element blocks, precomputed bsums), true 4-row fused K-quant kernels, re-enabled Q4_K×Q8_K/Q5_K×Q8_K/Q6_K×Q8_K fused GEMV/GEMM (#29)
  • 2026-03 — Q4_K_M dequantization and vec_dot kernels: Q4_K, Q5_K, Q6_K scalar + AVX2 dequant and fused matmul kernels with full model-level dispatch (#28)
  • 2026-03 — BDN inference benchmarks: end-to-end benchmarks with custom tok/s columns, auto model download, llama.cpp comparison script (#42)
  • 2026-03 — Engine inference timings: InferenceTimings on InferenceResponse, onTokenGenerated callback, CLI refactored to use TextGenerator (#41)
  • 2026-03 — Multi-threaded CPU inference: zero-alloc ComputeThreadPool with delegate* dispatch, parallel GEMV/GEMM and head-parallel attention (#36)
  • 2026-03 — SIMD kernel tuning: FMA float accumulation, 4-row batched GEMV, AVX-512 paths, SIMD quantization (#26)
  • 2026-03 — Phase 1 complete: sampling pipeline + stop conditions — first coherent multi-token generation (#24)
  • 2026-03 — KV-cache: eval drops from 1091 ms/token to 227 ms/token (~4.8× speedup)
  • 2026-03 — Llama forward pass: first token generation from embedding to logits
  • 2026-02 — BPE Tokenizer with SentencePiece and tiktoken support (#16)
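The draft-verify-accept loop from the speculative decoding entry above is easiest to see in its greedy (temperature=0) fast path. The following is an illustrative Python sketch, not the C# implementation — the function name and array shapes are hypothetical; the real engine verifies all K draft positions in one batched forward pass and rolls the KV-cache back past the first rejection:

```python
import numpy as np

def greedy_verify(draft_tokens, target_logits):
    """Greedy fast path of speculative decoding: accept each draft
    token while it matches the target model's argmax at that position;
    on the first mismatch, emit the target's own token and stop.
    target_logits has one row of vocabulary logits per draft position,
    produced by a single batched forward pass over the K proposals."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        target_tok = int(np.argmax(target_logits[i]))
        if tok == target_tok:
            accepted.append(tok)          # draft agreed with target
        else:
            accepted.append(target_tok)   # substitute target's token and stop;
            break                         # KV-cache is truncated past position i
    return accepted
```

Every accepted draft token is one target-model forward pass saved; the acceptance rate reported in response timings measures exactly this.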
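The element-wise prefix match behind the simple prompt cache entry can be sketched in a few lines (an illustrative Python sketch with a hypothetical function name, not the C# PrefixCache):

```python
def cached_prefix_len(cached_tokens, prompt_tokens):
    """Element-wise prefix match used by a simple prompt cache:
    returns how many leading tokens of the new prompt already have
    KV-cache entries, so prefill can skip them and process only
    the remaining suffix tokens."""
    n = 0
    for cached, new in zip(cached_tokens, prompt_tokens):
        if cached != new:
            break
        n += 1
    return n
```

In multi-turn chat each new prompt extends the previous one verbatim, so the match covers nearly the whole conversation — the source of the near-100% hit rate and the TTFT reduction noted above.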
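The Schraudolph trick mentioned in the fast exp/softmax entry exploits the IEEE-754 double layout: writing a scaled-and-biased integer into the high 32 bits makes the exponent field compute 2^(x/ln 2) almost for free. A minimal Python sketch of the idea (the engine's version is a SIMD C# kernel; constants follow Schraudolph's 1999 note):

```python
import struct

def fast_exp(x):
    """Approximate exp(x) via IEEE-754 bit manipulation:
    i = a*x + b lands in the high 32 bits of a double, where
    a = 2^20 / ln(2) maps x into the exponent field and
    b = 1023*2^20 - 60801 supplies the bias minus an error-centering
    term. Relative error is a few percent over a moderate range."""
    a = 1048576 / 0.6931471805599453   # 2^20 / ln(2)
    b = 1072693248 - 60801             # exponent bias term minus correction
    hi = int(a * x + b)
    # low 32 bits zeroed; reinterpret the pair as a little-endian double
    return struct.unpack("<d", struct.pack("<ii", 0, hi))[0]
```

A few percent of error is acceptable inside attention softmax (the scores are renormalized anyway), which is why the sampling softmax keeps full precision.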
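The online-softmax accumulation behind the tiled attention entry replaces the O(N²) score matrix with a running max, denominator, and output accumulator that are rescaled as each tile arrives. An illustrative Python sketch of the algorithm (names and tiling are hypothetical; the engine does this per attention head on the stack):

```python
import numpy as np

def online_softmax_weighted_sum(scores, values, tile=4):
    """Flash-attention-style online softmax: process attention scores
    in tiles, tracking running max m and running denominator d, and
    rescaling the accumulator whenever m grows. The full score row is
    never materialized, so memory stays O(tile) instead of O(N)."""
    m = -np.inf                                   # running max of scores
    d = 0.0                                       # running softmax denominator
    acc = np.zeros_like(values[0], dtype=np.float64)
    for start in range(0, len(scores), tile):
        s = scores[start:start + tile]
        v = values[start:start + tile]
        m_new = max(m, float(np.max(s)))
        scale = np.exp(m - m_new)                 # rescale previous partial sums
        d = d * scale + float(np.sum(np.exp(s - m_new)))
        acc = acc * scale + np.exp(s - m_new) @ v
        m = m_new
    return acc / d
```

The result is bit-for-bit a softmax-weighted sum of the value rows, which is why the rewrite could drop the 64 MB/head score buffers without changing outputs beyond rounding.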

Roadmap

| Phase | Description | Status |
| --- | --- | --- |
| 1 — End-to-End Generation | GGUF loading, dequantization, CPU ops, tokenizer, attention, forward pass, KV-cache, sampling | Done (9/9) |
| 2 — Practical Local Inference | Engine metrics, benchmarks, Q4_K_M, chat templates, streaming, multi-threading, more architectures | Done (10/10) |
| 3 — CPU Performance | Decode dispatch, Q8_1 input, weight repacking, outer-product GEMM, tiled attention, fast exp, fusion, NUMA | In Progress (7/8) |
| 4 — GPU Acceleration | CUDA backend, CPU/GPU hybrid, KV-cache quantization | Done (3/3) |
| 5 — Constrained Decoding & API | JSON mode, JSON Schema, regex/CFG, tool calling, OpenAI API server, chat UI, prompt caching | Done (7/7) |
| 6 — Improved Serving | Warm-up, Native AOT, paged KV-cache, speculative decoding | Done (4/4) |
| 7 — Diagnostics & Interpretability | Logprobs, hook system, logit lens, SAE integration, LoRA adapters | In Progress (1/5) |
| 8 — Model Expansion | MLA attention, ALiBi, SmolLM3, Gemma 4, Mixture of Experts | Planned (0/5) |
| 9 — Production Serving | Continuous batching, prefix sharing, advanced scheduling, rate limiting, metrics & tracing | Planned (0/5) |

See docs/ROADMAP.md for detailed steps, dependencies, and milestones.

Documentation

Contributing

Contributions are welcome! dotLLM uses an issue-driven workflow — every change starts with a GitHub issue describing the work. Pick an existing issue or open a new one, then submit a PR targeting main.

Contact

Questions, ideas, or feedback? Open a thread in GitHub Discussions.

Author

Built by Konrad Kokosa — .NET MVP, author of Pro .NET Memory Management (2nd ed.), and AI/agents engineer at Nethermind. Over 20 years of .NET performance work.

License

dotLLM is licensed under the GNU General Public License v3.0.

Acknowledgments

  • llama.cpp — reference for GGUF format, quantization kernels, and CUDA implementations
  • Hugging Face — model ecosystem, transformers reference implementations, tokenizer specs
  • .NET team — TensorPrimitives, System.Runtime.Intrinsics, MemoryMappedFile, and the runtime that makes this possible
