---
Arcee AI Trinity Large 400B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/arcee-ai-trinity-large-400b.webp"
  date: '2026-01-27'
  summary: Arcee's flagship blends several efficiency tricks into a DeepSeek-like coarse MoE design.
  scale: 400B total, 13B active (3.3% active)
  context_tokens: '512,000'
  license_name: Apache License 2.0
  license_url: https://huggingface.co/arcee-ai/Trinity-Large-Base/blob/main/README.md
  decoder_type: Sparse MoE
  attention: GQA with gated attention and 3:1 sliding-window/global attention
  layer_mix: '45 sliding-window + 15 global'
  kv_cache_per_token_bf16: '240 KiB'
  highlight: Combines QK-Norm, RoPE+NoPE, sandwich norm, and a coarse-grained MoE.
  aa_intelligence_index: 'N/A'
  aai_profile: 'N/A'
  aa_url:
  config:
    repo: arcee-ai/Trinity-Large-Base
    label: config.json
    url: https://huggingface.co/arcee-ai/Trinity-Large-Base/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2602.17004
DeepSeek R1:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/deepseek-v3-r1-671-billion.webp"
  date: '2025-01-20'
  summary: Reasoning-tuned DeepSeek model built on the V3 architecture rather than a new base design.
  scale: 671B total, 37B active (5.5% active)
  context_tokens: '128,000'
  license_name: MIT License
  license_url: https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/LICENSE
  decoder_type: Sparse MoE
  attention: MLA
  layer_mix: '61 MLA'
  kv_cache_per_token_bf16: '68.6 KiB'
  highlight: Architecture matches DeepSeek V3; the main change is the reasoning-oriented training recipe.
  aa_intelligence_index: '18.8'
  aai_profile: 'General 33.1 · Scientific 22.5 · Coding 15.9 · Agents 3.8'
  aa_url: https://artificialanalysis.ai/models/deepseek-r1-0120
  config:
    repo: deepseek-ai/DeepSeek-R1
    label: config.json
    url: https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2501.12948
DeepSeek V3:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/deepseek-v3-r1-671-billion.webp"
  date: '2024-12-26'
  summary: DeepSeek's flagship template kicked off the recent wave of large open MoE models.
  scale: 671B total, 37B active (5.5% active)
  context_tokens: '128,000'
  license_name: DeepSeek License Agreement v1.0
  license_url: https://huggingface.co/deepseek-ai/DeepSeek-V3/blame/main/LICENSE-MODEL
  decoder_type: Sparse MoE
  attention: MLA
  layer_mix: '61 MLA'
  kv_cache_per_token_bf16: '68.6 KiB'
  highlight: Uses a dense prefix plus a shared expert to keep a very large model practical at inference.
  aa_intelligence_index: '16.5'
  aai_profile: 'General 24.9 · Scientific 15.7 · Coding 16.4 · Agents 8.8'
  aa_url: https://artificialanalysis.ai/models/deepseek-v3
  config:
    repo: deepseek-ai/DeepSeek-V3
    label: config.json
    url: https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2412.19437
DeepSeek V3.2:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/deepseek-v3-2-671b.webp"
  date: '2025-12-01'
  summary: DeepSeek's successor keeps the V3 template but adds sparse attention to cut long-context costs.
  scale: 671B total, 37B active (5.5% active)
  context_tokens: '128,000'
  license_name: MIT License
  license_url: https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/LICENSE
  decoder_type: Sparse MoE
  attention: MLA with DeepSeek Sparse Attention
  layer_mix: '61 MLA'
  kv_cache_per_token_bf16: '68.6 KiB'
  highlight: An evolutionary update focused on efficiency rather than a new base layout.
  aa_intelligence_index: '32.1'
  aai_profile: 'General 29.7 · Scientific 24.2 · Coding 34.6 · Agents 39.8'
  aa_url: https://artificialanalysis.ai/models/deepseek-v3-2
  config:
    repo: deepseek-ai/DeepSeek-V3.2
    label: config.json
    url: https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2512.02556
GLM-4.5 355B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/glm-4-5-355b.webp"
  date: '2025-07-28'
  summary: Agent-oriented instruction/reasoning hybrid that borrows DeepSeek's dense-prefix MoE layout.
  scale: 355B total, 32B active (9% active)
  context_tokens: '128,000'
  license_name: MIT License
  license_url: https://huggingface.co/zai-org/GLM-4.5/blob/main/README.md
  decoder_type: Sparse MoE
  attention: GQA with QK-Norm
  layer_mix: '92 GQA'
  kv_cache_per_token_bf16: '368 KiB'
  highlight: Starts with three dense layers before MoE routing and keeps a shared expert.
  aa_intelligence_index: '26.4'
  aai_profile: 'General 37.5 · Scientific 25.6 · Coding 26.3 · Agents 16.2'
  aa_url: https://artificialanalysis.ai/models/glm-4.5
  config:
    repo: zai-org/GLM-4.5
    label: config.json
    url: https://huggingface.co/zai-org/GLM-4.5/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2508.06471
GLM-4.7 355B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/glm-4-7-355b.webp"
  date: '2025-12-22'
  summary: Immediate predecessor of GLM-5 that stays close to the older GLM-4.5 style, before the family's shift to MLA.
  scale: 355B total, 32B active (9% active)
  context_tokens: '202,752'
  license_name: MIT License
  license_url: https://huggingface.co/zai-org/GLM-4.7/blob/main/README.md
  decoder_type: Sparse MoE
  attention: GQA with QK-Norm
  layer_mix: '92 GQA'
  kv_cache_per_token_bf16: '368 KiB'
  highlight: Serves as the pre-MLA, pre-sparse-attention baseline with the same 32B active path as GLM-4.5.
  aa_intelligence_index: '34.2'
  aai_profile: 'General 30.6 · Scientific 19.7 · Coding 32.0 · Agents 54.3'
  aa_url: https://artificialanalysis.ai/models/glm-4-7-non-reasoning
  config:
    repo: zai-org/GLM-4.7
    label: config.json
    url: https://huggingface.co/zai-org/GLM-4.7/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2508.06471
GLM-5 744B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/glm-5-744b.webp"
  date: '2026-02-11'
  summary: Huge GLM refresh that adopts both MLA and DeepSeek Sparse Attention for flagship-scale inference.
  scale: 744B total, 40B active (5.4% active)
  context_tokens: '202,752'
  license_name: MIT License
  license_url: https://huggingface.co/zai-org/GLM-5/blob/main/README.md
  decoder_type: Sparse MoE
  attention: MLA with DeepSeek Sparse Attention
  layer_mix: '78 MLA'
  kv_cache_per_token_bf16: '87.8 KiB'
  highlight: Bigger than GLM-4.7, with more experts and fewer layers.
  aa_intelligence_index: '40.6'
  aai_profile: 'General 42.8 · Scientific 20.2 · Coding 39.0 · Agents 60.3'
  aa_url: https://artificialanalysis.ai/models/glm-5-non-reasoning
  config:
    repo: zai-org/GLM-5
    label: config.json
    url: https://huggingface.co/zai-org/GLM-5/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2602.15763
GPT-2 XL 1.5B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/gpt-2-xl.webp"
  date: '2019-11-05'
  summary: Late-2019 dense baseline included here as a reference point for how much decoder stacks have changed since GPT-2.
  scale: 1.5B parameters
  context_tokens: '1,024'
  license_name: OpenAI "Modified MIT" license
  license_url: https://github.com/openai/gpt-2
  decoder_type: Dense
  attention: MHA with learned absolute positional embeddings
  layer_mix: '48 MHA'
  kv_cache_per_token_bf16: '300 KiB'
  highlight: Classic GPT-2 recipe with dropout, GELU, LayerNorm, and full multi-head attention.
  aa_intelligence_index: 'N/A'
  aai_profile: 'N/A'
  aa_url:
  config:
    repo: openai-community/gpt2-xl
    label: config.json
    url: https://huggingface.co/openai-community/gpt2-xl/blob/main/config.json
  tech_report:
    url: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
GPT-OSS 120B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/gpt-oss-120b.webp"
  date: '2025-08-04'
  summary: Larger gpt-oss variant keeps the same alternating-attention recipe as the 20B model.
  scale: 117B total, 5.1B active (4.4% active)
  context_tokens: '128,000'
  license_name: Apache License 2.0
  license_url: https://huggingface.co/openai/gpt-oss-120b/blob/main/LICENSE
  decoder_type: Sparse MoE
  attention: GQA with alternating sliding-window and global layers
  layer_mix: '18 sliding-window + 18 global'
  kv_cache_per_token_bf16: '72 KiB'
  highlight: Shared architectural template scaled up for OpenAI's flagship open-weight release.
  aa_intelligence_index: '33.3'
  aai_profile: 'General 37.5 · Scientific 29.1 · Coding 28.6 · Agents 37.9'
  aa_url: https://artificialanalysis.ai/models/gpt-oss-120b
  config:
    repo: openai/gpt-oss-120b
    label: config.json
    url: https://huggingface.co/openai/gpt-oss-120b/blob/main/config.json
  tech_report:
    url: https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf
GPT-OSS 20B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/gpt-oss-20b.webp"
  date: '2025-08-04'
  summary: OpenAI's smaller open-weight MoE model favors width and alternating local/global attention.
  scale: 21B total, 3.6B active (17.1% active)
  context_tokens: '128,000'
  license_name: Apache License 2.0
  license_url: https://huggingface.co/openai/gpt-oss-20b/blob/main/LICENSE
  decoder_type: Sparse MoE
  attention: GQA with alternating sliding-window and global layers
  layer_mix: '12 sliding-window + 12 global'
  kv_cache_per_token_bf16: '48 KiB'
  highlight: Wider and shallower than Qwen3, with attention bias and sink mechanisms.
  aa_intelligence_index: '24.5'
  aai_profile: 'General 29.3 · Scientific 22.5 · Coding 18.5 · Agents 27.6'
  aa_url: https://artificialanalysis.ai/models/gpt-oss-20b
  config:
    repo: openai/gpt-oss-20b
    label: config.json
    url: https://huggingface.co/openai/gpt-oss-20b/blob/main/config.json
  tech_report:
    url: https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf
Gemma 3 270M:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/gemma-3-270m.webp"
  date: '2025-08-14'
  summary: Tiny Gemma 3 variant that preserves the family's local-global attention recipe at a toy scale.
  scale: 270M parameters
  context_tokens: '128,000'
  vocab_size: '262,144 (~262k)'
  license_name: Gemma Terms of Use + Gemma Prohibited Use Policy
  license_url: https://ai.google.dev/gemma/prohibited_use_policy
  decoder_type: Dense
  attention: Multi-query attention with QK-Norm and 5:1 sliding-window/global attention
  layer_mix: '15 sliding-window + 3 global'
  kv_cache_per_token_bf16: '18 KiB'
  highlight: Keeps the Gemma 3 stack shape while shrinking down to 4 attention heads, a single KV head, and the same 262k vocabulary.
  aa_intelligence_index: '7.7'
  aai_profile: 'General 20.1 · Scientific 7.7 · Coding 0.0 · Agents 3.0'
  aa_url: https://artificialanalysis.ai/models/gemma-3-270m
  config:
    repo: google/gemma-3-270m
    label: config.json
    url: https://huggingface.co/google/gemma-3-270m/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2503.19786
  from_scratch:
    url: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/12_gemma3
Gemma 3 27B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/gemma-3-27b.webp"
  date: '2025-03-11'
  summary: Gemma's flagship text stack leans on local attention more aggressively than Gemma 2.
  scale: 27B parameters
  context_tokens: '128,000'
  vocab_size: '262,144 (~262k)'
  license_name: Gemma Terms of Use + Gemma Prohibited Use Policy
  license_url: https://ai.google.dev/gemma/prohibited_use_policy
  decoder_type: Dense
  attention: GQA with QK-Norm and 5:1 sliding-window/global attention
  layer_mix: '52 sliding-window + 10 global'
  kv_cache_per_token_bf16: '496 KiB'
  highlight: Built around a 27B sweet spot with heavier local attention and a large 262k multilingual vocabulary.
  aa_intelligence_index: '10.3'
  aai_profile: 'General 15.1 · Scientific 13.0 · Coding 9.6 · Agents 3.5'
  aa_url: https://artificialanalysis.ai/models/gemma-3-27b
  config:
    repo: google/gemma-3-27b-it
    label: config.json
    url: https://huggingface.co/google/gemma-3-27b-it/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2503.19786
  from_scratch:
    url: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/12_gemma3
Gemma 4 26B-A4B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/gemma-4-26b-a4b.webp"
  date: '2026-04-02'
  summary: Sparse Gemma 4 variant that keeps the local:global attention backbone while swapping dense FFNs for MoE layers.
  scale: 25.2B total, 3.8B active (15.1% active)
  context_tokens: '256,000'
  vocab_size: '262,144 (~262k)'
  license_name: Apache License 2.0
  license_url: https://ai.google.dev/gemma/docs/core/model_card_4
  decoder_type: Sparse MoE
  attention: GQA with QK-Norm, unified K/V on global layers, p-RoPE on global layers, and 5:1 sliding-window/global attention
  layer_mix: '25 sliding-window + 5 global'
  kv_cache_per_token_bf16: '210 KiB'
  highlight: Uses 128 total experts with only 8 routed plus 1 shared expert active per token.
  aa_intelligence_index: 'N/A'
  aai_profile: 'N/A'
  aa_url:
  config:
    repo: google/gemma-4-26B-A4B-it
    label: config.json
    url: https://huggingface.co/google/gemma-4-26B-A4B-it/blob/main/config.json
  tech_report:
    url: https://ai.google.dev/gemma/docs/core/model_card_4
Gemma 4 31B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/gemma-4-31b.webp"
  date: '2026-04-02'
  summary: Dense Gemma 4 scales the family to a 256K-context multimodal checkpoint without changing the core local-global recipe much.
  scale: 30.7B parameters
  context_tokens: '256,000'
  vocab_size: '262,144 (~262k)'
  license_name: Apache License 2.0
  license_url: https://ai.google.dev/gemma/docs/core/model_card_4
  decoder_type: Dense
  attention: GQA with QK-Norm, unified K/V on global layers, p-RoPE on global layers, and 5:1 sliding-window/global attention
  layer_mix: '50 sliding-window + 10 global'
  kv_cache_per_token_bf16: '840 KiB'
  highlight: Carries Gemma's unusual pre/post-norm stack into a larger 31B dense model with 256K context.
  aa_intelligence_index: 'N/A'
  aai_profile: 'N/A'
  aa_url:
  config:
    repo: google/gemma-4-31B-it
    label: config.json
    url: https://huggingface.co/google/gemma-4-31B-it/blob/main/config.json
  tech_report:
    url: https://ai.google.dev/gemma/docs/core/model_card_4
Grok 2.5 270B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/grok-2-5-270b.webp"
  date: '2025-08-22'
  summary: Rare production-model release that shows an older MoE style with fewer, larger experts.
  scale: 270B parameters
  context_tokens: '131,072'
  license_name: Grok 2 Community License Agreement
  license_url: https://huggingface.co/xai-org/grok-2/blob/main/README.md
  decoder_type: Sparse MoE
  attention: GQA
  layer_mix: '64 GQA'
  kv_cache_per_token_bf16: '256 KiB'
  highlight: Adds an always-on SwiGLU path that effectively behaves like a shared expert.
  aa_intelligence_index: 'N/A'
  aai_profile: 'N/A'
  aa_url:
  config:
    repo: xai-org/grok-2
    label: config.json
    url: https://huggingface.co/xai-org/grok-2/blob/main/config.json
Kimi K2:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/kimi-k2-1-trillion.webp"
  date: '2025-07-10'
  summary: Trillion-parameter Moonshot model that essentially scales the DeepSeek V3 recipe upward.
  scale: 1T total, 32B active (3.2% active)
  context_tokens: '128,000'
  license_name: Modified MIT License
  license_url: https://huggingface.co/moonshotai/Kimi-K2-Base/blame/main/LICENSE
  decoder_type: Sparse MoE
  attention: MLA
  layer_mix: '61 MLA'
  kv_cache_per_token_bf16: '68.6 KiB'
  highlight: More experts and fewer MLA heads than DeepSeek V3.
  aa_intelligence_index: '26.3'
  aai_profile: 'General 36.3 · Scientific 22.6 · Coding 22.1 · Agents 24.3'
  aa_url: https://artificialanalysis.ai/models/kimi-k2
  config:
    repo: moonshotai/Kimi-K2-Base
    label: config.json
    url: https://huggingface.co/moonshotai/Kimi-K2-Base/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2507.20534
Kimi K2.5:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/kimi-k2-5.webp"
  date: '2026-01-27'
  summary: Native-multimodal Moonshot flagship that keeps the K2/DeepSeek-style MoE layout and pushes native context to 256k.
  scale: 1T total, 32B active (3.2% active)
  context_tokens: '256,000'
  license_name: Modified MIT License
  license_url: https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE
  decoder_type: Sparse MoE
  attention: MLA
  layer_mix: '61 MLA'
  kv_cache_per_token_bf16: '68.6 KiB'
  highlight: Keeps the 384-expert K2 backbone, but adds multimodal capabilities (not shown) and doubles the native context length.
  aa_intelligence_index: '37.3'
  aai_profile: 'General 44.4 · Scientific 26.0 · Coding 25.8 · Agents 52.8'
  aa_url: https://artificialanalysis.ai/models/kimi-k2-5-non-reasoning
  config:
    repo: moonshotai/Kimi-K2.5
    label: config.json
    url: https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2602.02276
Kimi Linear 48B-A3B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/kimi-linear-48b-a3b.webp"
  date: '2025-10-30'
  summary: Linear-attention hybrid that keeps a transformer backbone but replaces most full-attention layers.
  scale: 48B total, 3B active (6.3% active)
  context_tokens: '1,000,000'
  license_name: MIT License
  license_url: https://github.com/MoonshotAI/Kimi-Linear
  decoder_type: Sparse hybrid
  attention: 3:1 Kimi Delta Attention and MLA
  layer_mix: '7 MLA + 20 Kimi Delta Attention'
  kv_cache_per_token_bf16: '7.9 KiB'
  highlight: Uses NoPE in MLA layers and channel-wise gating for long-context efficiency.
  aa_intelligence_index: '14.4'
  aai_profile: 'General N/A · Scientific N/A · Coding 14.2 · Agents N/A'
  aa_url: https://artificialanalysis.ai/models/kimi-linear-48b-a3b-instruct
  config:
    repo: moonshotai/Kimi-Linear-48B-A3B-Base
    label: config.json
    url: https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Base/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2510.26692
Ling 2.5 1T:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/ling-2-5-1t.webp"
  date: '2026-02-15'
  summary: Trillion-parameter long-context model that swaps DeltaNet for Lightning Attention.
  scale: 1T total, 63B active (6.3% active)
  context_tokens: '256,000'
  license_name: MIT License
  license_url: https://huggingface.co/inclusionAI/Ling-2.5-1T/blob/main/README.md
  decoder_type: Sparse hybrid
  attention: Lightning Attention plus MLA
  layer_mix: '10 MLA + 70 Lightning Attention'
  kv_cache_per_token_bf16: '11.2 KiB'
  highlight: Uses a 7:1 linear-attention/MLA ratio and a much larger 63B active path.
  aa_intelligence_index: 'N/A'
  aai_profile: 'N/A'
  aa_url:
  config:
    repo: inclusionAI/Ling-2.5-1T
    label: config.json
    url: https://huggingface.co/inclusionAI/Ling-2.5-1T/blob/main/config.json
Llama 3.2 1B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/llama-3-2-1b.webp"
  date: '2024-09-25'
  summary: Small dense Llama baseline in the Qwen comparison, with fewer layers but more width.
  scale: 1B parameters
  context_tokens: '128,000'
  license_name: Llama Community License Agreement (variant-specific)
  license_url: https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/LICENSE.txt
  decoder_type: Dense
  attention: GQA
  layer_mix: '16 GQA'
  kv_cache_per_token_bf16: '32 KiB'
  highlight: Wider architecture with more heads than Qwen3 0.6B.
  aa_intelligence_index: '6.3'
  aai_profile: 'General 17.0 · Scientific 7.6 · Coding 0.6 · Agents 0.0'
  aa_url: https://artificialanalysis.ai/models/llama-3-2-instruct-1b
  from_scratch:
    url: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/07_gpt_to_llama
Llama 3 8B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/llama-3-8b.webp"
  date: '2024-04-18'
  summary: Reference dense Llama stack used to contrast OLMo 2's normalization and attention choices.
  scale: 8B parameters
  context_tokens: '8,192'
  license_name: Llama 3 Community License Agreement
  license_url: https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE
  decoder_type: Dense
  attention: GQA with RoPE
  layer_mix: '32 GQA'
  kv_cache_per_token_bf16: '128 KiB'
  highlight: Pre-norm baseline; wider than OLMo 2 at a similar scale.
  aa_intelligence_index: 'N/A'
  aai_profile: 'N/A'
  aa_url:
  config:
    repo: meta-llama/Meta-Llama-3-8B
    label: config.json
    url: https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2407.21783
  from_scratch:
    url: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/07_gpt_to_llama
Llama 3.2 3B:
  summary: Small Llama baseline used to show how closely Nanbeige follows the mainstream compact decoder recipe.
  scale: 3B parameters
  context_tokens: '128,000'
  license_name: Llama 3.2 Community License Agreement
  license_url: https://huggingface.co/meta-llama/Llama-3.2-3B/blob/main/LICENSE.txt
  decoder_type: Dense
  attention: GQA
  layer_mix: '28 GQA'
  kv_cache_per_token_bf16: '112 KiB'
  highlight: Reference small-model Llama architecture with tied embeddings.
  aa_intelligence_index: '9.7'
  aai_profile: 'N/A'
  aa_url: https://artificialanalysis.ai/models/llama-3-2-instruct-3b
Llama 4 Maverick:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/llama-4-maverick-400b.webp"
  date: '2025-04-05'
  summary: Meta's large MoE follows the DeepSeek V3 playbook but with a more conventional attention stack.
  scale: 400B total, 17B active (4.3% active)
  context_tokens: '1,000,000'
  license_name: Llama 4 Community License Agreement
  license_url: https://huggingface.co/meta-llama/Llama-4-Maverick
  decoder_type: Sparse MoE
  attention: GQA
  layer_mix: '36 chunked + 12 full GQA'
  kv_cache_per_token_bf16: '192 KiB'
  highlight: Alternates dense and MoE blocks and uses fewer, larger experts than DeepSeek V3.
  aa_intelligence_index: 'N/A'
  aai_profile: 'N/A'
  aa_url:
  config:
    repo: meta-llama/Llama-4-Maverick-17B-128E-Instruct
    label: config.json
    url: https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct/blob/main/config.json
  tech_report:
    url: https://ai.meta.com/blog/llama-4-multimodal-intelligence/
MiniMax M2 230B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/minimax-m2-230b.webp"
  date: '2025-10-23'
  summary: MiniMax's flagship returns to full attention and looks like a leaner, sparser cousin of Qwen3.
  scale: 230B total, 10B active (4.3% active)
  context_tokens: '196,608'
  license_name: Modified MIT License
  license_url: https://huggingface.co/MiniMaxAI/MiniMax-M2/blob/main/README.md
  decoder_type: Sparse MoE
  attention: GQA with QK-Norm and partial RoPE
  layer_mix: '62 GQA'
  kv_cache_per_token_bf16: '248 KiB'
  highlight: Uses per-layer QK-Norm and much sparser MoE routing than Qwen3.
  aa_intelligence_index: 'N/A'
  aai_profile: 'N/A'
  aa_url:
  config:
    repo: MiniMaxAI/MiniMax-M2
    label: config.json
    url: https://huggingface.co/MiniMaxAI/MiniMax-M2/blob/main/config.json
MiniMax M2.5 230B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/minimax-m2-5-230b.webp"
  date: '2026-02-12'
  summary: Popular 230B coder that opts for a classic architecture instead of the newer hybrid-attention ideas.
  scale: 230B total, 10B active (4.3% active)
  context_tokens: '196,608'
  license_name: Modified MIT License
  license_url: https://huggingface.co/MiniMaxAI/MiniMax-M2.5/blob/main/README.md
  decoder_type: Sparse MoE
  attention: GQA with QK-Norm
  layer_mix: '62 GQA'
  kv_cache_per_token_bf16: '248 KiB'
  highlight: Deliberately avoids sliding-window or linear-attention hybrids while keeping a 10B active path.
  aa_intelligence_index: 'N/A'
  aai_profile: 'N/A'
  aa_url:
  config:
    repo: MiniMaxAI/MiniMax-M2.5
    label: config.json
    url: https://huggingface.co/MiniMaxAI/MiniMax-M2.5/blob/main/config.json
Mistral Large 3:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/mistral-3-large-673-billion.webp"
  date: '2025-12-02'
  summary: Mistral's new flagship effectively adopts the DeepSeek architecture and retunes the expert sizes.
  scale: 673B total, 41B active (6.1% active)
  context_tokens: '262,144'
  license_name: Apache License 2.0
  license_url: https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512/blob/main/README.md
  decoder_type: Sparse MoE
  attention: MLA
  layer_mix: '61 MLA'
  kv_cache_per_token_bf16: '68.6 KiB'
  highlight: Near-clone of DeepSeek V3 with larger experts, fewer routed experts, and multimodal support.
  aa_intelligence_index: '22.8'
  aai_profile: 'General 27.8 · Scientific 19.1 · Coding 22.7 · Agents 21.7'
  aa_url: https://artificialanalysis.ai/models/mistral-large-3
  config:
    repo: mistralai/Mistral-Large-3-675B-Instruct-2512
    label: params.json
    url: https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512/blob/main/params.json
Mistral Small 3.1 24B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/mistral-3-1-small-24b.webp"
  date: '2025-03-18'
  summary: Fast dense 24B model that drops the sliding-window setup used in older Mistral releases.
  scale: 24B parameters
  context_tokens: '128,000'
  license_name: Apache License 2.0
  license_url: https://mistral.ai/news/mistral-small-3-1
  decoder_type: Dense
  attention: Standard GQA
  layer_mix: '40 GQA'
  kv_cache_per_token_bf16: '160 KiB'
  highlight: Latency-focused design with a smaller KV cache and fewer layers than Gemma 3 27B.
  aa_intelligence_index: '14.5'
  aai_profile: 'General 21.9 · Scientific 13.8 · Coding 13.9 · Agents 8.4'
  aa_url: https://artificialanalysis.ai/models/mistral-small-3-1
  config:
    repo: mistralai/Mistral-Small-3.1-24B-Base-2503
    label: config.json
    url: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503/blob/main/config.json
  tech_report:
    url: https://mistral.ai/news/mistral-small-3-1
Mistral Small 4:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/mistral-small-4.webp"
  date: '2026-03-16'
  summary: Multimodal Mistral Small refresh that jumps from the older dense 24B stack to an MLA-based sparse MoE design.
  scale: 119B total, 6.63B active (5.6% active)
  context_tokens: '256,000'
  license_name: Apache License 2.0
  license_url: https://mistral.ai/news/mistral-small-4
  decoder_type: Sparse MoE
  attention: MLA
  layer_mix: '36 MLA'
  kv_cache_per_token_bf16: '22.5 KiB'
  highlight: Uses 128 experts with 4 routed plus 1 shared expert active per token while unifying instruct, reasoning, and vision.
  aa_intelligence_index: '26.9'
  aai_profile: 'General 37.1 · Scientific 24.1 · Coding 24.3 · Agents 22.4'
  aa_url: https://artificialanalysis.ai/models/mistral-small-4
  config:
    repo: mistralai/Mistral-Small-4-119B-2603
    label: config.json
    url: https://huggingface.co/mistralai/Mistral-Small-4-119B-2603/blob/main/config.json
  hide_article_link: true
  tech_report:
    url: https://mistral.ai/news/mistral-small-4
Nanbeige 4.1 3B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/nanbeige-4-1-3b.webp"
  date: '2026-02-10'
  summary: Small on-device oriented model that stays close to Llama 3.2 while nudging the scaling choices.
  scale: 3B parameters
  context_tokens: '262,144'
  license_name: Apache License 2.0
  license_url: https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/README.md
  decoder_type: Dense
  attention: GQA
  layer_mix: '32 GQA'
  kv_cache_per_token_bf16: '64 KiB'
  highlight: Llama-like stack without tying input embeddings to the output layer.
  aa_intelligence_index: '16.1'
  aai_profile: 'General 22.0 · Scientific 26.2 · Coding 8.9 · Agents 7.2'
  aa_url: https://artificialanalysis.ai/models/nanbeige4-1-3b
  config:
    repo: Nanbeige/Nanbeige4.1-3B
    label: config.json
    url: https://huggingface.co/Nanbeige/Nanbeige4.1-3B/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2602.13367
Nemotron 3 Nano 30B-A3B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/nemotron-3-nano-30b-a3b.webp"
  date: '2025-12-04'
  summary: NVIDIA's Nano model is the most extreme transformer-state-space hybrid in the gallery.
  scale: 30B total, 3B active (10% active)
  context_tokens: '1,000,000'
  license_name: NVIDIA Nemotron Open Model License
  license_url: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/
  decoder_type: Hybrid MoE
  attention: Mostly Mamba-2 with a few GQA layers
  layer_mix: '6 GQA + 23 Mamba-2 + 23 MoE'
  kv_cache_per_token_bf16: '6 KiB'
  highlight: Interleaves Mamba-2 and MoE blocks, using attention only sparingly.
  aa_intelligence_index: '13.2'
  aai_profile: 'General 16.2 · Scientific 12.3 · Coding 15.8 · Agents 8.5'
  aa_url: https://artificialanalysis.ai/models/nvidia-nemotron-3-nano-30b-a3b
  config:
    repo: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
    label: config.json
    url: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/blob/main/config.json
  tech_report:
    url: https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf
Nemotron 3 Nano 4B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/nemotron-3-nano-4b.webp"
  date: '2026-03-16'
  summary: Compact on-device hybrid that compresses Nemotron Nano 9B v2 into a mostly Mamba-2 stack with only four attention layers.
  scale: 4B parameters
  context_tokens: '262,144'
  license_name: NVIDIA Nemotron Open Model License
  license_url: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/
  decoder_type: Dense hybrid
  attention: GQA with only 4 attention layers
  layer_mix: '4 GQA + 21 Mamba-2 + 17 FFN'
  kv_cache_per_token_bf16: '16 KiB'
  highlight: Uses a 42-layer stack with 21 Mamba-2 blocks, 17 ReLU² FFNs, and just 4 GQA layers.
  aa_intelligence_index: '14.7'
  aai_profile: 'General 23.7 · Scientific 15.2 · Coding 10.0 · Agents 9.8'
  aa_url: https://artificialanalysis.ai/models/nvidia-nemotron-3-nano-4b
  config:
    repo: nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16
    label: config.json
    url: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16/blob/main/config.json
  hide_article_link: true
  tech_report:
    url: https://huggingface.co/blog/nvidia/nemotron-3-nano-4b
Nemotron 3 Super 120B-A12B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/nemotron-3-super-120b-a12b.webp"
  date: '2026-03-11'
  summary: The Super variant scales up Nano and adds both latent experts and native speculative decoding support.
  scale: 120B total, 12B active (10% active)
  context_tokens: '1,000,000'
  license_name: NVIDIA Nemotron Open Model License
  license_url: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-nemotron-open-model-license/
  decoder_type: Hybrid MoE
  attention: Mostly Mamba-2 with a few GQA layers
  layer_mix: '8 GQA + 40 Mamba-2 + 40 MoE'
  kv_cache_per_token_bf16: '8 KiB'
  highlight: Adds latent-space MoE and shared-weight MTP for fast inference.
  aa_intelligence_index: '36.0'
  aai_profile: 'General 42.1 · Scientific 30.4 · Coding 31.2 · Agents 40.2'
  aa_url: https://artificialanalysis.ai/models/nvidia-nemotron-3-super-120b-a12b
  config:
    repo: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
    label: config.json
    url: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/blob/main/config.json
  tech_report:
    url: https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf
OLMo 2 7B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/olmo-2-7b.webp"
  date: '2024-11-25'
  summary: Transparent dense model that keeps classic MHA and pushes normalization changes for training stability.
  scale: 7B parameters
  context_tokens: '4,096'
  license_name: Apache License 2.0
  license_url: https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct/blob/main/README.md
  decoder_type: Dense
  attention: MHA with QK-Norm
  layer_mix: '32 MHA'
  kv_cache_per_token_bf16: '512 KiB'
  highlight: Uses inside-residual post-norm instead of the usual pre-norm layout.
  aa_intelligence_index: 'N/A'
  aai_profile: 'N/A'
  aa_url:
  config:
    repo: allenai/OLMo-2-1124-7B-Instruct
    label: config.json
    url: https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2501.00656
OLMo 3 32B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/olmo-3-32b.webp"
  date: '2025-11-20'
  summary: Scaled-up OLMo 3 keeps the same block design but moves to grouped-query attention.
  scale: 32B parameters
  context_tokens: '65,536'
  license_name: Apache License 2.0
  license_url: https://huggingface.co/allenai/Olmo-3-1125-32B/blob/main/README.md
  decoder_type: Dense
  attention: GQA with QK-Norm and 3:1 sliding-window/global attention
  layer_mix: '48 sliding-window + 16 global'
  kv_cache_per_token_bf16: '256 KiB'
  highlight: Keeps post-norm while scaling width and applying YaRN only on global layers.
  aa_intelligence_index: 'N/A'
  aai_profile: 'N/A'
  aa_url:
  config:
    repo: allenai/Olmo-3-32B-Think
    label: config.json
    url: https://huggingface.co/allenai/Olmo-3-32B-Think/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2512.13961
  from_scratch:
    url: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/13_olmo3
OLMo 3 7B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/olmo-3-7b.webp"
  date: '2025-11-20'
  summary: New transparent Allen AI model that keeps OLMo's post-norm flavor while modernizing context handling.
  scale: 7B parameters
  context_tokens: '65,536'
  license_name: Apache License 2.0
  license_url: https://huggingface.co/allenai/Olmo-3-1025-7B/blob/main/README.md
  decoder_type: Dense
  attention: MHA with QK-Norm and 3:1 sliding-window/global attention
  layer_mix: '24 sliding-window + 8 global'
  kv_cache_per_token_bf16: '512 KiB'
  highlight: Retains post-norm, keeps MHA, and applies YaRN only on global layers.
  aa_intelligence_index: '8.2'
  aai_profile: 'General 12.1 · Scientific 12.9 · Coding 3.4 · Agents 4.2'
  aa_url: https://artificialanalysis.ai/models/olmo-3-7b-instruct
  config:
    repo: allenai/Olmo-3-1025-7B
    label: config.json
    url: https://huggingface.co/allenai/Olmo-3-1025-7B/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2512.13961
  from_scratch:
    url: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/13_olmo3
Phi-4:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/phi-4.webp"
  date: '2024-12-12'
  summary: Microsoft's 14B dense Phi refresh stays close to Phi-3-medium but swaps its sliding-window attention for full-context GQA and a larger tokenizer.
  scale: 14B parameters
  context_tokens: '16,384'
  license_name: MIT License
  license_url: https://huggingface.co/microsoft/phi-4/blob/main/LICENSE
  decoder_type: Dense
  attention: GQA with RoPE
  layer_mix: '40 GQA'
  kv_cache_per_token_bf16: '200 KiB'
  highlight: Classic pre-norm RMSNorm stack with GQA, 40 heads, 10 KV heads, and a 100,352-token vocabulary.
  related_concepts:
    - rmsnorm
    - gqa
  aa_intelligence_index: '10.4'
  aai_profile: 'General 14.0 · Scientific 16.4 · Coding 11.2 · Agents 0.0'
  aa_url: https://artificialanalysis.ai/models/phi-4
  config:
    repo: microsoft/phi-4
    label: config.json
    url: https://huggingface.co/microsoft/phi-4/blob/main/config.json
  hide_article_link: true
  tech_report:
    url: https://arxiv.org/pdf/2412.08905
Qwen3 0.6B:
  date: '2025-04-28'
  summary: Tiny current-generation Qwen model that trades width for more depth and a low memory footprint.
  scale: 0.6B parameters
  context_tokens: '32,768'
  license_name: Apache License 2.0
  license_url: https://huggingface.co/Qwen/Qwen3-0.6B-Base/blob/main/LICENSE
  decoder_type: Dense
  attention: GQA
  layer_mix: '28 GQA'
  kv_cache_per_token_bf16: '112 KiB'
  highlight: Deeper than Llama 3.2 1B and unusually practical for local experiments and teaching.
  aa_intelligence_index: '5.7'
  aai_profile: 'General 8.1 · Scientific 8.4 · Coding 1.4 · Agents 4.9'
  aa_url: https://artificialanalysis.ai/models/qwen3-0.6b-instruct
  config:
    repo: Qwen/Qwen3-0.6B-Base
    label: config.json
    url: https://huggingface.co/Qwen/Qwen3-0.6B-Base/blob/main/config.json
Qwen3 235B-A22B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/qwen3-235b-a22b.webp"
  date: '2025-04-28'
  summary: Large sparse Qwen variant that stays very close to DeepSeek V3 while removing the shared expert.
  scale: 235B total, 22B active (9.4% active)
  context_tokens: '128,000'
  license_name: Apache License 2.0
  license_url: https://huggingface.co/Qwen/Qwen3-235B-A22B/blob/main/LICENSE
  decoder_type: Sparse MoE
  attention: GQA with QK-Norm
  layer_mix: '94 GQA'
  kv_cache_per_token_bf16: '188 KiB'
  highlight: High-capacity MoE design optimized for serving efficiency without a shared expert.
  aa_intelligence_index: '17.0'
  aai_profile: 'General 16.9 · Scientific 17.7 · Coding 14.0 · Agents 19.2'
  aa_url: https://artificialanalysis.ai/models/qwen3-235b-a22b-instruct
  config:
    repo: Qwen/Qwen3-235B-A22B
    label: config.json
    url: https://huggingface.co/Qwen/Qwen3-235B-A22B/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2505.09388
  from_scratch:
    url: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/11_qwen3
Qwen3 30B-A3B:
  date: '2025-04-28'
  summary: Small sparse Qwen3 model that sits close to GPT-OSS in active size but uses a deeper stack.
  scale: 30B total, 3B active (10% active)
  context_tokens: '128,000'
  license_name: Apache License 2.0
  license_url: https://huggingface.co/Qwen/Qwen3-30B-A3B/blob/main/LICENSE
  decoder_type: Sparse MoE
  attention: GQA
  layer_mix: '48 GQA'
  kv_cache_per_token_bf16: '96 KiB'
  highlight: Deeper, narrower MoE alternative to GPT-OSS without a shared expert.
  aa_intelligence_index: '12.5'
  aai_profile: 'General 14.2 · Scientific 15.2 · Coding 13.3 · Agents 7.4'
  aa_url: https://artificialanalysis.ai/models/qwen3-30b-a3b-instruct
  config:
    repo: Qwen/Qwen3-30B-A3B
    label: config.json
    url: https://huggingface.co/Qwen/Qwen3-30B-A3B/blob/main/config.json
Qwen3 32B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/qwen3-32b.webp"
  date: '2025-04-28'
  summary: Large dense Qwen3 model that serves as the clearest like-for-like comparison for OLMo 3 32B.
  scale: 32B parameters
  context_tokens: '128,000'
  license_name: Apache License 2.0
  license_url: https://huggingface.co/Qwen/Qwen3-32B/blob/main/LICENSE
  decoder_type: Dense
  attention: GQA with QK-Norm
  layer_mix: '64 GQA'
  kv_cache_per_token_bf16: '256 KiB'
  highlight: Reference dense Qwen stack with QK-Norm and 8 KV heads.
  aa_intelligence_index: '14.5'
  aai_profile: 'N/A'
  aa_url: https://artificialanalysis.ai/models/qwen3-32b-instruct
  config:
    repo: Qwen/Qwen3-32B
    label: config.json
    url: https://huggingface.co/Qwen/Qwen3-32B/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2505.09388
  from_scratch:
    url: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/11_qwen3
Qwen3 4B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/qwen3-4b.webp"
  date: '2025-04-28'
  summary: Mid-size dense Qwen3 model used here as a clean baseline against SmolLM3 and Tiny Aya.
  scale: 4B parameters
  context_tokens: '32,768'
  license_name: Apache License 2.0
  license_url: https://huggingface.co/Qwen/Qwen3-4B/blob/main/LICENSE
  decoder_type: Dense
  attention: GQA with QK-Norm
  layer_mix: '36 GQA'
  kv_cache_per_token_bf16: '144 KiB'
  highlight: Compact Qwen3 dense stack with QK-Norm and a 151k vocabulary.
  aa_intelligence_index: '12.5'
  aai_profile: 'N/A'
  aa_url: https://artificialanalysis.ai/models/qwen3-4b-instruct
  config:
    repo: Qwen/Qwen3-4B
    label: config.json
    url: https://huggingface.co/Qwen/Qwen3-4B/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2505.09388
  from_scratch:
    url: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/11_qwen3
Qwen3 8B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/qwen3-8b.webp"
  date: '2025-04-28'
  summary: Dense Qwen3 baseline used here to show how little OLMo 3 changed the overall decoder recipe.
  scale: 8B parameters
  context_tokens: '128,000'
  license_name: Apache License 2.0
  license_url: https://huggingface.co/Qwen/Qwen3-8B/blob/main/LICENSE
  decoder_type: Dense
  attention: GQA with QK-Norm
  layer_mix: '36 GQA'
  kv_cache_per_token_bf16: '144 KiB'
  highlight: Reference Qwen3 dense stack with QK-Norm and 8 KV heads.
  aa_intelligence_index: '10.6'
  aai_profile: 'General 11.2 · Scientific 12.7 · Coding 7.1 · Agents 11.6'
  aa_url: https://artificialanalysis.ai/models/qwen3-8b-instruct
  config:
    repo: Qwen/Qwen3-8B
    label: config.json
    url: https://huggingface.co/Qwen/Qwen3-8B/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2505.09388
  from_scratch:
    url: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/11_qwen3
Qwen3 Next 80B-A3B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/qwen3-next-80b-a3b.webp"
  date: '2025-09-09'
  summary: Efficiency-focused Qwen refresh that swaps standard attention for a DeltaNet-attention hybrid.
  scale: 80B total, 3B active (3.8% active)
  context_tokens: '262,144'
  license_name: Apache License 2.0
  license_url: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct/blob/main/LICENSE
  decoder_type: Sparse hybrid
  attention: 3:1 Gated DeltaNet and Gated Attention
  layer_mix: '12 gated attention + 36 DeltaNet'
  kv_cache_per_token_bf16: '24 KiB'
  highlight: Adds many more experts, a shared expert, and a native 262k context.
  aa_intelligence_index: '20.1'
  aai_profile: 'General 28.9 · Scientific 22.1 · Coding 15.3 · Agents 14.2'
  aa_url: https://artificialanalysis.ai/models/qwen3-next-80b-a3b-instruct
  config:
    repo: Qwen/Qwen3-Next-80B-A3B-Instruct
    label: config.json
    url: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct/blob/main/config.json
Qwen3 Coder Flash 30B-A3B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/qwen3-coder-flash-30b-a3b-mixture-of-experts.webp"
  date: '2025-07-31'
  summary: Coding-tuned Qwen model that keeps a straightforward grouped-query MoE stack instead of the newer hybrid-attention variants.
  scale: 30B total, 3.3B active (11% active)
  context_tokens: '256,000'
  license_name: Apache License 2.0
  license_url: https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct/blob/main/LICENSE
  decoder_type: Sparse MoE
  attention: GQA
  layer_mix: '48 GQA'
  kv_cache_per_token_bf16: '96 KiB'
  highlight: Uses 128 experts with 8 active per token and a native 256k context window for coding workloads.
  aa_intelligence_index: '20.0'
  aai_profile: 'General 24.6 · Scientific 14.9 · Coding 19.4 · Agents 21.1'
  aa_url: https://artificialanalysis.ai/models/qwen3-coder-30b-a3b-instruct
  config:
    repo: Qwen/Qwen3-Coder-30B-A3B-Instruct
    label: config.json
    url: https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct/blob/main/config.json
  from_scratch:
    url: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/11_qwen3
Qwen3.5 397B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/qwen3-5-397b.webp"
  date: '2026-02-16'
  summary: Mainline Qwen refresh that brings the Next-style hybrid attention into the flagship series.
  scale: 397B total, 17B active (4.3% active)
  context_tokens: '262,144'
  license_name: Apache License 2.0
  license_url: https://huggingface.co/Qwen/Qwen3.5-397B-A17B/blob/main/LICENSE
  decoder_type: Sparse hybrid
  attention: 3:1 Gated DeltaNet and Gated Attention
  layer_mix: '15 gated attention + 45 DeltaNet'
  kv_cache_per_token_bf16: '30 KiB'
  highlight: Turns the former Qwen3-Next side branch into the new core design with 512 experts and 17B active parameters.
  aa_intelligence_index: '40.1'
  aai_profile: 'General 38.5 · Scientific 31.1 · Coding 37.4 · Agents 53.3'
  aa_url: https://artificialanalysis.ai/models/qwen3-5-397b-a17b-non-reasoning
  config:
    repo: Qwen/Qwen3.5-397B-A17B
    label: config.json
    url: https://huggingface.co/Qwen/Qwen3.5-397B-A17B/blob/main/config.json
  from_scratch:
    url: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/16_qwen3.5
Sarvam 105B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/sarvam-105b.webp"
  date: '2026-03-03'
  summary: Larger Sarvam variant keeps the sparse MoE layout but switches from GQA to MLA.
  scale: 105B total, 10.3B active (9.8% active)
  context_tokens: '131,072'
  license_name: Apache License 2.0
  license_url: https://huggingface.co/sarvamai/sarvam-105b/blob/main/README.md
  decoder_type: Sparse MoE
  attention: MLA with KV LayerNorm and NoPE + RoPE
  layer_mix: '32 MLA'
  kv_cache_per_token_bf16: '36 KiB'
  highlight: Large vocabulary and strong Indic language support carried into the larger MLA-based sparse MoE variant.
  aa_intelligence_index: '18.2'
  aai_profile: 'General 14.6 · Scientific 23.5 · Coding 9.8 · Agents 24.7'
  aa_url: https://artificialanalysis.ai/models/sarvam-105b
  config:
    repo: sarvamai/sarvam-105b
    label: config.json
    url: https://huggingface.co/sarvamai/sarvam-105b/blob/main/config.json
  tech_report:
    url: https://www.sarvam.ai/blogs/sarvam-30b-105b
Sarvam 30B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/sarvam-30b.webp"
  date: '2026-03-03'
  summary: Reasoning-oriented Indian-language sparse MoE that keeps GQA at the smaller size.
  scale: 30B total, 2.4B active (8% active)
  context_tokens: '131,072'
  license_name: Apache License 2.0
  license_url: https://huggingface.co/sarvamai/sarvam-30b/blob/main/README.md
  decoder_type: Sparse MoE
  attention: GQA with QK-Norm
  layer_mix: '19 GQA'
  kv_cache_per_token_bf16: '19 KiB'
  highlight: Large vocabulary and strong Indic language support paired with a reasoning-focused sparse MoE design.
  aa_intelligence_index: '12.3'
  aai_profile: 'General 10.5 · Scientific 19.4 · Coding 7.9 · Agents 11.5'
  aa_url: https://artificialanalysis.ai/models/sarvam-30b
  config:
    repo: sarvamai/sarvam-30b
    label: config.json
    url: https://huggingface.co/sarvamai/sarvam-30b/blob/main/config.json
  tech_report:
    url: https://www.sarvam.ai/blogs/sarvam-30b-105b
SmolLM3 3B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/smollm3-3b.webp"
  date: '2025-06-19'
  summary: Compact dense model that experiments with leaving out positional encodings in selected layers.
  scale: 3B parameters
  context_tokens: '131,072'
  license_name: Apache License 2.0
  license_url: https://huggingface.co/HuggingFaceTB/SmolLM3-3B-Base/blob/main/README.md
  decoder_type: Dense
  attention: GQA with periodic NoPE layers
  layer_mix: '36 GQA'
  kv_cache_per_token_bf16: '72 KiB'
  highlight: Every fourth layer omits RoPE to test a NoPE-style cadence.
  aa_intelligence_index: 'N/A'
  aai_profile: 'N/A'
  aa_url:
  config:
    repo: HuggingFaceTB/SmolLM3-3B-Base
    label: config.json
    url: https://huggingface.co/HuggingFaceTB/SmolLM3-3B-Base/blob/main/config.json
  tech_report:
    url: https://huggingface.co/blog/smollm3
Step 3.5 Flash 196B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/step-3-5-flash-196b.webp"
  date: '2026-02-01'
  summary: Throughput-oriented MoE model that stays competitive with much larger DeepSeek-style systems.
  scale: 196B total, 11B active (5.6% active)
  context_tokens: '262,144'
  license_name: Apache License 2.0
  license_url: https://huggingface.co/stepfun-ai/Step-3.5-Flash/blob/main/README.md
  decoder_type: Sparse MoE
  attention: GQA with 3:1 sliding-window attention
  layer_mix: '36 sliding-window + 12 global'
  kv_cache_per_token_bf16: '192 KiB'
  highlight: Uses MTP-3 during both training and inference for unusually high throughput.
  aa_intelligence_index: '37.8'
  aai_profile: 'General 36.6 · Scientific 30.9 · Coding 31.6 · Agents 52.0'
  aa_url: https://artificialanalysis.ai/models/step-3-5-flash
  config:
    repo: stepfun-ai/Step-3.5-Flash
    label: config.json
    url: https://huggingface.co/stepfun-ai/Step-3.5-Flash/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2602.10604
Tiny Aya 3.35B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/tiny-aya-3-35b.webp"
  date: '2026-02-13'
  summary: Compact multilingual model from Cohere with a rare parallel transformer block.
  scale: 3.35B parameters
  context_tokens: '8,192'
  license_name: Creative Commons Attribution-NonCommercial 4.0
  license_url: https://huggingface.co/CohereLabs/tiny-aya-base/blob/main/README.md
  decoder_type: Dense
  attention: GQA with 3:1 sliding-window attention
  layer_mix: '27 sliding-window + 9 global'
  kv_cache_per_token_bf16: '72 KiB'
  highlight: Runs attention and the MLP in parallel while mixing RoPE with NoPE.
  aa_intelligence_index: 'N/A'
  aai_profile: 'N/A'
  aa_url:
  config:
    repo: CohereLabs/tiny-aya-base
    label: config.json
    url: https://huggingface.co/CohereLabs/tiny-aya-base/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2603.11510
  from_scratch:
    url: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/15_tiny-aya
Xiaomi MiMo-V2-Flash 309B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/xiaomi-mimo-v2-flash-309b.webp"
  date: '2025-12-16'
  summary: Large MoE model that pushes sliding-window attention harder than most contemporaries.
  scale: 309B total, 15B active (4.9% active)
  context_tokens: '262,144'
  license_name: MIT License
  license_url: https://huggingface.co/XiaomiMiMo/MiMo-V2-Flash/blob/main/README.md
  decoder_type: Sparse MoE
  attention: 5:1 sliding-window/global attention
  layer_mix: '40 sliding-window + 8 global'
  kv_cache_per_token_bf16: '144 KiB'
  highlight: Uses an unusually small 128-token local window plus multi-token prediction.
  aa_intelligence_index: '30.4'
  aai_profile: 'General 27.8 · Scientific 20.4 · Coding 25.8 · Agents 47.3'
  aa_url: https://artificialanalysis.ai/models/mimo-v2-flash
  config:
    repo: XiaomiMiMo/MiMo-V2-Flash
    label: config.json
    url: https://huggingface.co/XiaomiMiMo/MiMo-V2-Flash/blob/main/config.json
  tech_report:
    url: https://arxiv.org/pdf/2601.02780
xLSTM 7B:
  image: "https://sebastianraschka.com/llm-architecture-gallery/images/architectures/xlstm-7b.webp"
  date: '2025-03-17'
  summary: Recurrent 7B language model that replaces self-attention with xLSTM blocks built around matrix memory.
  scale: 7B parameters
  context_tokens: 'No explicit limit'
  license_name: NXAI Community License Agreement
  license_url: https://huggingface.co/NX-AI/xLSTM-7b/blob/main/LICENSE
  decoder_type: Recurrent
  attention: No self-attention; mLSTM recurrent layers with matrix memory
  layer_mix: '32 mLSTM'
  kv_cache_per_token_bf16: '0 B'
  highlight: Stateful recurrent architecture aimed at fast long-context inference without an explicit context window.
  aa_intelligence_index: 'N/A'
  aai_profile: 'N/A'
  aa_url:
  config:
    repo: NX-AI/xLSTM-7b
    label: config.json
    url: https://huggingface.co/NX-AI/xLSTM-7b/blob/main/config.json
  hide_article_link: true
  tech_report:
    url: https://arxiv.org/abs/2503.13427
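# Note on kv_cache_per_token_bf16: the values above can be reproduced from each
# model's config.json. Below is a minimal Python sketch of the arithmetic, kept
# in comments so this file stays valid YAML; the helper name is hypothetical and
# not part of the gallery build. It assumes standard K/V caching at bf16
# (2 bytes per value):
#
#   def kv_cache_per_token_kib(n_layers: int, n_kv_heads: int, head_dim: int) -> float:
#       # Two cached tensors per layer (K and V), each holding
#       # n_kv_heads * head_dim values at 2 bytes each; divide by 1024 for KiB.
#       return 2 * n_layers * n_kv_heads * head_dim * 2 / 1024
#
#   kv_cache_per_token_kib(48, 25, 64)   # GPT-2 XL (MHA): 300.0 KiB
#   kv_cache_per_token_kib(32, 8, 128)   # Llama 3 8B (GQA): 128.0 KiB
#
# MLA entries cache a compressed latent plus decoupled RoPE key dims instead,
# e.g. DeepSeek V3: 61 layers * (512 + 64) dims * 2 bytes ≈ 68.6 KiB per token.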