LLM APIs with permanent free tiers for text inference.

Provider APIs

APIs run by the companies that train or fine-tune the models themselves.

Cohere 🇨🇦

Free "Trial" API key, no credit card. 1,000 API calls/month. Non-commercial use only.

Base URL: https://api.cohere.com/v2

Model Name	Context	Max Output	Modality	Rate Limit
Command A (111B)	256K	4K	Text	20 RPM
Command R+	128K	4K	Text	20 RPM
Command R	128K	4K	Text	20 RPM
Command R7B	128K	4K	Text	20 RPM
Embed 4	—	—	Embeddings (Text + Image)	2,000 inputs/min
Rerank 3.5	—	—	Reranking	10 RPM

Google Gemini 🇺🇸

Free tier unavailable in EU/UK/Switzerland. Free-tier prompts may be used by Google to improve products. ¹

Base URL: https://generativelanguage.googleapis.com/v1beta

Model Name	Context	Max Output	Modality	Rate Limit
Gemini 2.5 Flash	1M	65K	Text + Image + Audio + Video	10 RPM, 250 RPD
Gemini 2.5 Flash-Lite	1M	65K	Text + Image + Audio + Video	15 RPM, 1,000 RPD

Mistral AI 🇫🇷

Free "Experiment" plan, no credit card. ~1B tokens/month.

Base URL: https://api.mistral.ai/v1

Model Name	Context	Max Output	Modality	Rate Limit
Mistral Small 4	256K	256K	Text + Image + Code	~1 RPS, 500K TPM
Mistral Medium 3	128K	128K	Text	~1 RPS, 500K TPM
Mistral Large 3	256K	256K	Text	~1 RPS, 500K TPM
Mistral Nemo (12B)	128K	128K	Text	~1 RPS, 500K TPM
Codestral	256K	256K	Code	~1 RPS, 500K TPM
Pixtral Large	128K	128K	Text + Image	~1 RPS, 500K TPM

Z AI (Zhipu AI) 🇨🇳

Permanent free models, no credit card required.

Base URL: https://open.bigmodel.cn/api/paas/v4

Model Name	Context	Max Output	Modality	Rate Limit
GLM-4.7-Flash	200K	128K	Text	1 concurrent request
GLM-4.5-Flash	128K	~8K	Text	1 concurrent request
GLM-4.6V-Flash	128K	~4K	Text + Image	1 concurrent request

Inference providers

Third-party platforms that host open-weight models from various sources.

Cerebras 🇺🇸

Free tier, no credit card. Ultra-fast inference (~2,600 tok/s). 1M tokens/day cap.

Base URL: https://api.cerebras.ai/v1

Model Name	Context	Max Output	Modality	Rate Limit
llama3.1-8b	128K (8K on free)	8K	Text	30 RPM, 14,400 RPD, 1M TPD
gpt-oss-120b	128K (8K on free)	8K	Text	30 RPM, 14,400 RPD, 1M TPD
qwen-3-235b-a22b-instruct-2507	131K (8K on free)	8K	Text	30 RPM, 14,400 RPD, 1M TPD
zai-glm-4.7	128K (8K on free)	8K	Text	10 RPM, 100 RPD, 1M TPD

Cloudflare Workers AI 🇺🇸

10,000 Neurons/day free. 50+ models available on free tier.

Base URL: https://api.cloudflare.com/client/v4/accounts/{account_id}/ai/run

Model Name	Context	Max Output	Modality	Rate Limit
@cf/meta/llama-3.3-70b-instruct-fp8-fast	131K	Shared w/ context	Text	10K neurons/day (shared)
@cf/meta/llama-3.1-8b-instruct-fp8-fast	131K	Shared w/ context	Text	10K neurons/day (shared)
@cf/meta/llama-3.2-11b-vision-instruct	131K	Shared w/ context	Text + Vision	10K neurons/day (shared)
@cf/meta/llama-4-scout-17b-16e-instruct	Up to 10M	Shared w/ context	Multimodal	10K neurons/day (shared)
@cf/mistralai/mistral-small-3.1-24b-instruct	128K	Shared w/ context	Text	10K neurons/day (shared)
@cf/google/gemma-4-26b-a4b-it	256K	Shared w/ context	Text	10K neurons/day (shared)
@cf/qwen/qwq-32b	32K	Shared w/ context	Text	10K neurons/day (shared)
@cf/deepseek-ai/deepseek-r1-distill-qwen-32b	32K	Shared w/ context	Text	10K neurons/day (shared)
+ 42 more models	Varies	Varies	Text, Image, Audio, Embeddings	10K neurons/day (shared)

GitHub Models 🇺🇸

Free prototyping for all GitHub users. 45+ models. Per-request limits (8K in / 4K out).

Base URL: https://models.inference.ai.azure.com

Model Name	Context	Max Output	Modality	Rate Limit
gpt-4.1	1M	32K	Text	10 RPM, 50 RPD
gpt-4.1-mini	1M	32K	Text	15 RPM, 150 RPD
gpt-4o	128K	16K	Text + Vision	10 RPM, 50 RPD
o3-mini	200K	100K	Text (reasoning)	10 RPM, 50 RPD
o4-mini	200K	100K	Text (reasoning)	10 RPM, 50 RPD
Llama-4-Scout-17B-16E	512K	~4K	Text + Vision	15 RPM, 150 RPD
Llama-4-Maverick-17B-128E	256K	~4K	Text + Vision	10 RPM, 50 RPD
Meta-Llama-3.3-70B	131K	~4K	Text	15 RPM, 150 RPD
DeepSeek-R1	64K	8K	Text (reasoning)	15 RPM, 150 RPD
Mistral-Small-3.1	128K	~4K	Text + Vision	15 RPM, 150 RPD
+ 35 more models	Varies	Varies	Text / Image	Varies by tier

Groq 🇺🇸

Free tier, no credit card. Ultra-fast LPU inference. ²

Base URL: https://api.groq.com/openai/v1

Model Name	Context	Max Output	Modality	Rate Limit
llama-3.3-70b-versatile	131K	32K	Text	30 RPM, 14,400 RPD
llama-3.1-8b-instant	131K	131K	Text	30 RPM, 14,400 RPD
llama-4-scout-17b-16e-instruct	131K	8K	Text + Vision	30 RPM, 14,400 RPD
llama-4-maverick-17b-128e-instruct	131K	8K	Text + Vision	15 RPM, 500 RPD
qwen3-32b	131K	131K	Text	30 RPM, 14,400 RPD
gpt-oss-120b	131K	32K	Text	30 RPM, 14,400 RPD
kimi-k2-instruct	262K	262K	Text	30 RPM, 14,400 RPD
deepseek-r1-distill-70b	131K	8K	Text	30 RPM, 14,400 RPD
whisper-large-v3	—	—	Audio → Text	20 RPM, 2,000 RPD
whisper-large-v3-turbo	—	—	Audio → Text	20 RPM, 2,000 RPD

Hugging Face 🇺🇸

Free Serverless Inference API

Base URL: https://api-inference.huggingface.co/models

Model Name	Context	Max Output	Modality	Rate Limit
Meta-Llama-3.1-8B-Instruct	128K	~4K	Text	~1,000 RPD
Mistral-7B-Instruct-v0.3	32K	~4K	Text	~1,000 RPD
Mixtral-8x7B-Instruct-v0.1	32K	~4K	Text	~1,000 RPD
Phi-3.5-mini-instruct	128K	~4K	Text	~1,000 RPD
Qwen2.5-7B-Instruct	131K	~4K	Text	~1,000 RPD

Kilo Code 🇺🇸

Free models with no credit card required. kilo-auto/free auto-router routes to minimax/minimax-m2.5:free (80%) and stepfun/step-3.5-flash:free (20%). ³

Base URL: https://api.kilo.ai/api/gateway

Model Name	Context	Max Output	Modality	Rate Limit
bytedance-seed/dola-seed-2.0-pro:free	—	—	Text	~200 req/hr
x-ai/grok-code-fast-1:optimized:free	—	—	Text (code)	~200 req/hr
nvidia/nemotron-3-super-120b-a12b:free	262K	32K	Text	~200 req/hr
arcee-ai/trinity-large-thinking:free	—	—	Text (reasoning)	~200 req/hr
openrouter/free	Varies	Varies	Text	~200 req/hr

LLM7.io 🇬🇧

Zero-friction API gateway. No registration needed for basic access. 30+ models.

Base URL: https://api.llm7.io/v1

Model Name	Context	Max Output	Modality	Rate Limit
deepseek-r1-0528	—	—	Text (reasoning)	30 RPM (120 with token)
deepseek-v3-0324	—	—	Text	30 RPM (120 with token)
gemini-2.5-flash-lite	—	—	Text + Vision	30 RPM (120 with token)
gpt-4o-mini	—	—	Text + Vision	30 RPM (120 with token)
mistral-small-3.1-24b	32K	—	Text	30 RPM (120 with token)
qwen2.5-coder-32b	—	—	Text (code)	30 RPM (120 with token)
+ ~24 more models	Varies	Varies	Text	30 RPM (120 with token)

NVIDIA NIM 🇺🇸

Free with NVIDIA Developer Program membership. 100+ models. No daily token cap.

Base URL: https://integrate.api.nvidia.com/v1

Model Name	Context	Max Output	Modality	Rate Limit
deepseek-ai/deepseek-r1	128K	~163K	Text (reasoning)	~40 RPM
nvidia/llama-3.1-nemotron-ultra-253b-v1	128K	4K	Text	~40 RPM
nvidia/nemotron-3-super-120b-a12b	262K	262K	Text	~40 RPM
nvidia/nemotron-3-nano-30b-a3b	128K	32K	Text	~40 RPM
meta/llama-3.1-405b-instruct	128K	4K	Text	~40 RPM
qwen/qwen2.5-72b-instruct	128K	8K	Text	~40 RPM
google/gemma-4-31b	128K	8K	Text	~40 RPM
mistralai/mistral-large-2-instruct	128K	4K	Text	~40 RPM
nvidia/nemotron-nano-2-vl	128K	8K	Vision + Text + Video	~40 RPM
minimax/minimax-m2.7	128K	8K	Text	~40 RPM
+ 90 more models	Varies	Varies	Text, Image, Video, Speech, Embeddings	~40 RPM

Ollama Cloud 🇺🇸

Free tier with qualitative usage limits. 400+ models from Ollama library. Not OpenAI SDK-compatible; uses Ollama API. ⁴

Base URL: https://api.ollama.com

Model Name	Context	Max Output	Modality	Rate Limit
llama3.1:cloud	128K	Model-dependent	Text	Session/weekly limits (unpublished)
deepseek-r1:cloud	128K	Model-dependent	Text (reasoning)	Session/weekly limits (unpublished)
qwen2.5:cloud	128K	Model-dependent	Text	Session/weekly limits (unpublished)
gemma2:cloud	8K	Model-dependent	Text	Session/weekly limits (unpublished)
mistral:cloud	32K	Model-dependent	Text	Session/weekly limits (unpublished)
+ 400 more models	Varies	Varies	Text	Session/weekly limits (unpublished)

OpenRouter 🇺🇸

35+ free models (marked with :free suffix). OpenAI SDK-compatible. ⁵

Base URL: https://openrouter.ai/api/v1

Model Name	Context	Max Output	Modality	Rate Limit
deepseek/deepseek-r1-0528:free	163K	~163K	Text (reasoning)	20 RPM, 200 RPD
deepseek/deepseek-chat-v3-0324:free	163K	163K	Text	20 RPM, 200 RPD
qwen/qwen3.6-plus:free	1M	65K	Text	20 RPM, 200 RPD
qwen/qwen3-coder-480b-a35b:free	262K	~32K	Text	20 RPM, 200 RPD
meta-llama/llama-4-scout:free	10M	16K	Multimodal	20 RPM, 200 RPD
meta-llama/llama-4-maverick:free	1M	16K	Multimodal	20 RPM, 200 RPD
meta-llama/llama-3.3-70b-instruct:free	65K	~16K	Text	20 RPM, 200 RPD
google/gemma-4-31b-it:free	256K	~8K	Multimodal	20 RPM, 200 RPD
nvidia/nemotron-3-super-120b-a12b:free	1M	~32K	Text	20 RPM, 200 RPD
openai/gpt-oss-120b:free	131K	131K	Text	20 RPM, 200 RPD
minimax/minimax-m2.5:free	196K	8K	Text	20 RPM, 200 RPD
mistralai/devstral-2512:free	256K	~32K	Text	20 RPM, 200 RPD
+ ~23 more free models	Varies	Varies	Text / Image	20 RPM, 200 RPD

SiliconFlow 🇨🇳

Free tier with 14 CNY signup credits. Permanently free models available.

Base URL: https://api.siliconflow.cn/v1

Model Name	Context	Max Output	Modality	Rate Limit
Qwen/Qwen3-8B	131K	131K	Text	1,000 RPM, 50K TPM
deepseek-ai/DeepSeek-R1-0528-Qwen3-8B	~33K	16K	Text (reasoning)	1,000 RPM, 50K TPM
deepseek-ai/DeepSeek-R1-Distill-Qwen-7B	131K	Configurable	Text (reasoning)	1,000 RPM, 50K TPM
THUDM/glm-4-9b-chat	32K	32K	Text	1,000 RPM, 50K TPM
THUDM/GLM-4.1V-9B-Thinking	66K	66K	Vision + Text	1,000 RPM, 50K TPM
deepseek-ai/DeepSeek-OCR	—	8K	Vision (OCR)	1,000 RPM, 50K TPM
+ embedding/speech models	Varies	Varies	Embeddings, Speech	1,000 RPM, 50K TPM

Contributing

Know a free tier that's missing? Open a PR. Include the provider, endpoint, rate limits (link to their docs), and a few notable models. Trial credits and time-limited promos don't count.

Glossary

Abbreviation	Meaning
RPM	Requests per minute
RPD	Requests per day
TPM	Tokens per minute
TPD	Tokens per day
RPS	Requests per second

Notes

All endpoints are OpenAI SDK-compatible unless noted.
Each link points to the provider's API key page.

Free tier not available in the EU, UK, or Switzerland (available regions). ↩
Groq rate limits vary by model. Llama 4 Maverick is limited to 500 RPD. Most other models get 14,400 RPD (rate limits). ↩
Kilo Code free model list may change over time. nvidia/nemotron-3-super-120b-a12b:free is for trial use only — prompts are logged by NVIDIA. Auto-router kilo-auto/free routes to minimax/minimax-m2.5:free (80%) and stepfun/step-3.5-flash:free (20%). ↩
Ollama Cloud measures usage by GPU time, not tokens or requests. Free tier described as "light usage" with session limits resetting every 5 hours and weekly limits every 7 days. Pro (50x more) and Max (250x more) plans available. Not OpenAI SDK-compatible; uses Ollama API. ↩
Free models default to 200 RPD. A one-time purchase of $10+ in credits unlocks 1,000 RPD for free models. OpenRouter also offers a Free Models Router (openrouter/free) and model fallbacks for chaining models in priority order. ↩

Contents

Provider APIs

Cohere 🇨🇦

Google Gemini 🇺🇸

Mistral AI 🇫🇷

Z AI (Zhipu AI) 🇨🇳

Inference providers

Cerebras 🇺🇸

Cloudflare Workers AI 🇺🇸

GitHub Models 🇺🇸

Groq 🇺🇸

Hugging Face 🇺🇸

Kilo Code 🇺🇸

LLM7.io 🇬🇧

NVIDIA NIM 🇺🇸

Ollama Cloud 🇺🇸

OpenRouter 🇺🇸

SiliconFlow 🇨🇳

Contributing

Glossary

Notes

关于 About

语言 Languages

提交活跃度 Commit Activity

核心贡献者 Contributors

Contents

Provider APIs

Cohere 🇨🇦

Google Gemini 🇺🇸

Mistral AI 🇫🇷

Z AI (Zhipu AI) 🇨🇳

Inference providers

Cerebras 🇺🇸

Cloudflare Workers AI 🇺🇸

GitHub Models 🇺🇸

Groq 🇺🇸

Hugging Face 🇺🇸

Kilo Code 🇺🇸

LLM7.io 🇬🇧

NVIDIA NIM 🇺🇸

Ollama Cloud 🇺🇸

OpenRouter 🇺🇸

SiliconFlow 🇨🇳

Contributing

Glossary

Notes

Footnotes

关于 About

语言 Languages

提交活跃度 Commit Activity

核心贡献者 Contributors