Shimmy Logo

The Lightweight OpenAI API Server

🔒 Local Inference Without Dependencies 🚀

License: MIT Security Crates.io Downloads Rust GitHub Stars

💝 Sponsor this project

Shimmy will be free forever. No asterisks. No "free for now." No pivot to paid.

💝 Support Shimmy's Growth

🚀 If Shimmy helps you, consider sponsoring — 100% of support goes to keeping it free forever.

  • $5/month: Coffee tier ☕ - Eternal gratitude + sponsor badge
  • $25/month: Bug prioritizer 🐛 - Priority support + name in SPONSORS.md
  • $100/month: Corporate backer 🏢 - Logo placement + monthly office hours
  • $500/month: Infrastructure partner 🚀 - Direct support + roadmap input

🎯 Become a Sponsor | See our amazing sponsors 🙏


Drop-in OpenAI API Replacement for Local LLMs

Shimmy is a single binary that provides 100% OpenAI-compatible endpoints for GGUF models. Point your existing AI tools to Shimmy and they just work — locally, privately, and free.

🎉 NEW in v1.9.0: One download, all GPU backends included! No compilation, no backend confusion - just download and run.

Developer Tools

Whether you're forking Shimmy or integrating it as a service, we provide complete documentation and integration templates.

Try it in 30 seconds

```bash
# 1) Download pre-built binary (includes all GPU backends)
# Windows:
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe
./shimmy.exe serve &

# Linux:
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy && chmod +x shimmy
./shimmy serve &

# macOS (Apple Silicon):
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy && chmod +x shimmy
./shimmy serve &

# 2) See models and pick one
./shimmy list

# 3) Smoke test the OpenAI API
curl -s http://127.0.0.1:11435/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"REPLACE_WITH_MODEL_FROM_list",
    "messages":[{"role":"user","content":"Say hi in 5 words."}],
    "max_tokens":32
  }' | jq -r '.choices[0].message.content'
```

🚀 Compatible with OpenAI SDKs and Tools

No code changes needed - just change the API endpoint:

  • Any OpenAI client: Python, Node.js, curl, etc.
  • Development applications: Compatible with standard SDKs
  • VSCode Extensions: Point to http://localhost:11435
  • Cursor Editor: Built-in OpenAI compatibility
  • Continue.dev: Drop-in model provider

Use with OpenAI SDKs

  • Node.js (openai v4)
```javascript
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "http://127.0.0.1:11435/v1",
  apiKey: "sk-local", // placeholder, Shimmy ignores it
});

const resp = await openai.chat.completions.create({
  model: "REPLACE_WITH_MODEL",
  messages: [{ role: "user", content: "Say hi in 5 words." }],
  max_tokens: 32,
});
console.log(resp.choices[0].message?.content);
```
  • Python (openai>=1.0.0)
```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:11435/v1", api_key="sk-local")
resp = client.chat.completions.create(
    model="REPLACE_WITH_MODEL",
    messages=[{"role": "user", "content": "Say hi in 5 words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```

⚡ Zero Configuration Required

  • Automatically finds models from Hugging Face cache, Ollama, local dirs
  • Auto-allocates ports to avoid conflicts
  • Auto-detects LoRA adapters for specialized models
  • Just works - no config files, no setup wizards

🧠 Advanced MOE (Mixture of Experts) Support

Run 70B+ models on consumer hardware with intelligent CPU/GPU hybrid processing:

  • 🔄 CPU MOE Offloading: Automatically distribute model layers across CPU and GPU
  • 🧮 Intelligent Layer Placement: Optimizes which layers run where for maximum performance
  • 💾 Memory Efficiency: Fit larger models in limited VRAM by using system RAM strategically
  • ⚡ Hybrid Acceleration: Get GPU speed where it matters most, CPU reliability everywhere else
  • 🎛️ Configurable: --cpu-moe and --n-cpu-moe flags for fine control
```bash
# Enable MOE CPU offloading during installation
cargo install shimmy --features moe

# Run with MOE hybrid processing
shimmy serve --cpu-moe --n-cpu-moe 8
# Automatically balances: GPU layers (fast) + CPU layers (memory-efficient)
```

Perfect for: Large models (70B+), limited VRAM systems, cost-effective inference

🎯 Perfect for Local Development

  • Privacy: Your code never leaves your machine
  • Cost: No API keys, no per-token billing
  • Speed: Local inference, sub-second responses
  • Reliability: No rate limits, no downtime

Quick Start (30 seconds)

Installation

✨ v1.9.0 NEW: Download pre-built binaries with ALL GPU backends included!

📥 Pre-Built Binaries (Recommended - Zero Dependencies)

Pick your platform and download - no compilation needed:

```bash
# Windows x64 (includes CUDA + Vulkan + OpenCL)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe

# Linux x86_64 (includes CUDA + Vulkan + OpenCL)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy && chmod +x shimmy

# macOS ARM64 (includes MLX for Apple Silicon)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy && chmod +x shimmy

# macOS Intel (CPU-only)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-intel -o shimmy && chmod +x shimmy

# Linux ARM64 (CPU-only)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-aarch64 -o shimmy && chmod +x shimmy
```

That's it! Your GPU will be detected automatically at runtime.

🛠️ Build from Source (Advanced)

Want to customize or contribute?

```bash
# Basic installation (CPU only)
cargo install shimmy --features huggingface

# Kitchen Sink builds (what pre-built binaries use):
# Windows/Linux x64:
cargo install shimmy --features huggingface,llama,llama-cuda,llama-vulkan,llama-opencl,vision

# macOS ARM64:
cargo install shimmy --features huggingface,llama,mlx,vision

# CPU-only (any platform):
cargo install shimmy --features huggingface,llama,vision
```

⚠️ Build Notes:

  • Windows: Install LLVM first for libclang.dll
  • Recommended: Use pre-built binaries to avoid dependency issues
  • Advanced users only: Building from source requires C++ compiler + CUDA/Vulkan SDKs

GPU Acceleration

✨ NEW in v1.9.0: One binary per platform with automatic GPU detection!

⚠️ IMPORTANT - Vision Feature Performance:
CPU-based vision inference (MiniCPM-V) is 5-10x slower than GPU acceleration.
CPU: 15-45 seconds per image | GPU (CUDA/Vulkan): 2-8 seconds per image
For production vision workloads, GPU acceleration is strongly recommended.

📥 Download Pre-Built Binaries (Recommended)

No compilation needed! Each binary includes ALL GPU backends for your platform:

| Platform | Download | GPU Support | Auto-Detects |
|----------|----------|-------------|--------------|
| Windows x64 | `shimmy-windows-x86_64.exe` | CUDA + Vulkan + OpenCL | ✅ |
| Linux x86_64 | `shimmy-linux-x86_64` | CUDA + Vulkan + OpenCL | ✅ |
| macOS ARM64 | `shimmy-macos-arm64` | MLX (Apple Silicon) | ✅ |
| macOS Intel | `shimmy-macos-intel` | CPU only | N/A |
| Linux ARM64 | `shimmy-linux-aarch64` | CPU only | N/A |

How it works: Download one file, run it. Shimmy automatically detects and uses your GPU!

```bash
# Windows example
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe
./shimmy.exe serve --gpu-backend auto   # Auto-detects CUDA/Vulkan/OpenCL

# Linux example
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy
chmod +x shimmy
./shimmy serve --gpu-backend auto       # Auto-detects CUDA/Vulkan/OpenCL

# macOS ARM64 example
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy
chmod +x shimmy
./shimmy serve                          # Auto-detects MLX on Apple Silicon
```

🎯 GPU Auto-Detection

Shimmy uses intelligent GPU detection with this priority order:

  1. CUDA (NVIDIA GPUs via nvidia-smi)
  2. Vulkan (Cross-platform GPUs via vulkaninfo)
  3. OpenCL (AMD/Intel GPUs via clinfo)
  4. MLX (Apple Silicon via system detection)
  5. CPU (Fallback if no GPU detected)

No manual configuration needed! Just run with --gpu-backend auto (default).

🔧 Manual Backend Override

Want to force a specific backend? Use the --gpu-backend flag:

```bash
# Auto-detect (default - recommended)
shimmy serve --gpu-backend auto

# Force CPU (for testing or compatibility)
shimmy serve --gpu-backend cpu

# Force CUDA (NVIDIA GPUs only)
shimmy serve --gpu-backend cuda

# Force Vulkan (AMD/Intel/Cross-platform)
shimmy serve --gpu-backend vulkan

# Force OpenCL (AMD/Intel alternative)
shimmy serve --gpu-backend opencl
```

🛡️ Error Handling & Robustness: If you force an unavailable backend (e.g., --gpu-backend cuda on AMD GPU), Shimmy will:

  1. ✅ Display clear error message explaining the issue
  2. ✅ Automatically fallback to next available backend in priority order
  3. ✅ Log which backend was actually used (check with --verbose)
  4. ✅ Continue serving requests (graceful degradation, no crashes)
  5. ✅ Support environment variable override: SHIMMY_GPU_BACKEND=cuda

Common scenarios:

  • --gpu-backend cuda on non-NVIDIA → Falls back to Vulkan or OpenCL
  • --gpu-backend vulkan without drivers → Falls back to OpenCL or CPU
  • --gpu-backend invalid → Clear error + fallback to auto-detection
  • No GPU detected → Runs on CPU with performance warning
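The fallback behavior described above can be sketched as a simple priority walk. This is illustrative Python, not Shimmy's actual Rust implementation; the backend names follow the priority list in this README:

```python
# Priority order from the GPU auto-detection section of this README.
PRIORITY = ["cuda", "vulkan", "opencl", "mlx", "cpu"]

def select_backend(requested: str, available: set) -> str:
    """Pick a backend the way the README describes the fallback:
    honor an explicit request if that backend is available, otherwise
    walk the priority order and take the first available one.
    (Illustrative sketch only, not Shimmy's own code.)"""
    if requested != "auto" and requested in available:
        return requested
    for backend in PRIORITY:
        if backend in available:
            return backend
    return "cpu"  # final fallback: CPU always works
```

For example, `select_backend("cuda", {"vulkan", "cpu"})` returns `"vulkan"`, matching the "`--gpu-backend cuda` on non-NVIDIA" scenario above.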

Environment Variable: Set SHIMMY_GPU_BACKEND=cuda to override default without CLI flags.

🔍 Check GPU Support

```bash
# Show detected GPU backends
shimmy gpu-info

# Check which backend is being used
shimmy serve --gpu-backend auto --verbose
```

⚡ Binary Sizes

  • GPU-enabled binaries (Windows/Linux x64, macOS ARM64): ~40-50MB
  • CPU-only binaries (macOS Intel, Linux ARM64): ~20-30MB

Trade-off: Slightly larger binaries for zero compilation and automatic GPU detection.

🛠️ Build from Source (Advanced)

Want to customize or contribute? Build from source:

  • Multiple backends can be compiled in, best one selected automatically
  • Use --gpu-backend <backend> to force specific backend

Get Models

Shimmy auto-discovers models from:

  • Hugging Face cache: ~/.cache/huggingface/hub/
  • Ollama models: ~/.ollama/models/
  • Local directory: ./models/
  • Environment: SHIMMY_BASE_GGUF=path/to/model.gguf
```bash
# Download models that work out of the box
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf --local-dir ./models/
huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF --local-dir ./models/
```
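The search order above can be sketched roughly like this (illustrative Python; Shimmy's actual discovery logic lives in the Rust binary, and whether `SHIMMY_BASE_GGUF` takes precedence over the scanned directories is an assumption here):

```python
import os
from pathlib import Path

def discovery_sources() -> list:
    """Return the model locations listed above, with an explicit
    SHIMMY_BASE_GGUF path placed first when the variable is set.
    (Sketch of the documented behavior, not Shimmy's own code.)"""
    sources = []
    override = os.environ.get("SHIMMY_BASE_GGUF")
    if override:
        sources.append(Path(override))
    sources += [
        Path.home() / ".cache" / "huggingface" / "hub",  # Hugging Face cache
        Path.home() / ".ollama" / "models",              # Ollama models
        Path("models"),                                  # local ./models/ dir
    ]
    return sources
```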

Start Server

```bash
# Auto-allocates port to avoid conflicts
shimmy serve

# Or use manual port
shimmy serve --bind 127.0.0.1:11435
```

Point your development tools to the displayed port — VSCode Copilot, Cursor, Continue.dev all work instantly.

📦 Download & Install

Package Managers

Direct Downloads

  • GitHub Releases: Latest binaries
  • Docker: docker pull shimmy/shimmy:latest (coming soon)

🍎 macOS Support

Full compatibility confirmed! Shimmy works flawlessly on macOS with Metal GPU acceleration.

```bash
# Install dependencies
brew install cmake rust

# Install shimmy
cargo install shimmy
```

✅ Verified working:

  • Intel and Apple Silicon Macs
  • Metal GPU acceleration (automatic)
  • MLX native acceleration for Apple Silicon
  • Xcode 17+ compatibility
  • All LoRA adapter features

Integration Examples

VSCode Copilot

```json
{
  "github.copilot.advanced": {
    "serverUrl": "http://localhost:11435"
  }
}
```

Continue.dev

```json
{
  "models": [{
    "title": "Local Shimmy",
    "provider": "openai",
    "model": "your-model-name",
    "apiBase": "http://localhost:11435/v1"
  }]
}
```

Cursor IDE

Works out of the box - just point to http://localhost:11435/v1

Why Shimmy Will Always Be Free

I built Shimmy to keep privacy-first control over my own AI development and to keep things local and lean.

This is my commitment: Shimmy stays MIT licensed, forever. If you want to support development, sponsor it. If you don't, just build something cool with it.

💡 Shimmy saves you time and money. If it's useful, consider sponsoring for $5/month — less than your Netflix subscription, infinitely more useful for developers.

API Reference

Endpoints

  • GET /health - Health check
  • POST /v1/chat/completions - OpenAI-compatible chat
  • GET /v1/models - List available models
  • POST /api/generate - Shimmy native API
  • GET /ws/generate - WebSocket streaming
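As a quick sanity check, the `/health` endpoint can be polled from Python with nothing but the standard library (a sketch; it assumes the default port used elsewhere in this README, which may differ if Shimmy auto-allocated another one):

```python
from urllib.request import urlopen
from urllib.error import URLError

def shimmy_healthy(base="http://127.0.0.1:11435", timeout=2.0) -> bool:
    """Return True if a Shimmy server answers GET /health with HTTP 200."""
    try:
        with urlopen(f"{base}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        # Connection refused or timed out: no server at this address.
        return False
```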

CLI Commands

```bash
shimmy serve                           # Start server (auto port allocation)
shimmy serve --bind 127.0.0.1:8080     # Manual port binding
shimmy serve --cpu-moe --n-cpu-moe 8   # Enable MOE CPU offloading
shimmy list                            # Show available models (LLM-filtered)
shimmy discover                        # Refresh model discovery
shimmy generate --name X --prompt "Hi" # Test generation
shimmy probe model-name                # Verify model loads
shimmy gpu-info                        # Show GPU backend status
```

Technical Architecture

  • Rust + Tokio: Memory-safe, async performance
  • llama.cpp backend: Industry-standard GGUF inference
  • OpenAI API compatibility: Drop-in replacement
  • Dynamic port management: Zero conflicts, auto-allocation
  • Zero-config auto-discovery: Just works™

🚀 Advanced Features

  • 🧠 MOE CPU Offloading: Hybrid GPU/CPU processing for large models (70B+)
  • 🎯 Smart Model Filtering: Automatically excludes non-language models (Stable Diffusion, Whisper, CLIP)
  • 🛡️ 6-Gate Release Validation: Constitutional quality limits ensure reliability
  • ⚡ Smart Model Preloading: Background loading with usage tracking for instant model switching
  • 💾 Response Caching: LRU + TTL cache delivering 20-40% performance gains on repeat queries
  • 🚀 Integration Templates: One-command deployment for Docker, Kubernetes, Railway, Fly.io, FastAPI, Express
  • 🔄 Request Routing: Multi-instance support with health checking and load balancing
  • 📊 Advanced Observability: Real-time metrics with self-optimization and Prometheus integration
  • 🔗 RustChain Integration: Universal workflow transpilation with workflow orchestration

Community & Support

Star History

Star History Chart

🚀 Momentum Snapshot

  • 📦 Sub-5MB single binary (142x smaller than Ollama)
  • 🌟 GitHub stars climbing fast
  • ⚡ <1s startup
  • 🦀 100% Rust, no Python

📰 As Featured On

🔥 Hacker News: Front Page (Again) · IPE Newsletter

Companies: Need invoicing? Email michaelallenkuykendall@gmail.com

⚡ Performance Comparison

| Tool | Binary Size | Startup Time | Memory Usage | OpenAI API |
|------|-------------|--------------|--------------|------------|
| Shimmy | 4.8MB | <100ms | 50MB | 100% |
| Ollama | 680MB | 5-10s | 200MB+ | Partial |
| llama.cpp | 89MB | 1-2s | 100MB | Via llama-server |

Quality & Reliability

Shimmy maintains high code quality through comprehensive testing:

  • Comprehensive test suite with property-based testing
  • Automated CI/CD pipeline with quality gates
  • Runtime invariant checking for critical operations
  • Cross-platform compatibility testing

Development Testing

Run the complete test suite:

```bash
# Using cargo aliases
cargo test-quick   # Quick development tests

# Using Makefile
make test          # Full test suite
make test-quick    # Quick development tests
```

See our testing approach for technical details.


License & Philosophy

MIT License - forever and always.

Philosophy: Infrastructure should be invisible. Shimmy is infrastructure.

Testing Philosophy: Reliability through comprehensive validation and property-based testing.


Forever maintainer: Michael A. Kuykendall
Promise: This will never become a paid product
Mission: Making local model inference simple and reliable
