
YiRage - Yield Revolutionary AGile Engine

Multi-Backend LLM Inference Optimization



🎯 About YiRage


YiRage (Yield Revolutionary AGile Engine) provides comprehensive multi-backend support for LLM inference optimization across diverse hardware platforms.

Multi-Backend Optimization Focus

  • Unified optimization workflow across CUDA, ROCm, CPU, MPS, Ascend, MACA, TPU, XPU, FPGA, Triton, NKI, and MLIR backends
  • Hardware-aware search, profiling, and kernel generation for deployment-focused LLM inference
  • Extensible backend architecture for adding new hardware targets and compiler integrations

🏗️ Architecture

Five-Layer Backend Architecture

┌─────────────────────────────────────────────────────────────────────────────────────────┐
│                              YiRage Backend Architecture                                │
├─────────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                         │
│  ┌─────────────────────────────────────────────────────────────────────────────────┐    │
│  │                         Layer 1: Python API                                     │    │
│  │  yirage.new_kernel_graph() → UnifiedCompiler → CoreBridge → superoptimize()     │    │
│  │  HardwareRegistry.instance() → ChipArchitecture → detect_current_chip()         │    │
│  └─────────────────────────────────────────────────────────────────────────────────┘    │
│                                        │                                                │
│  ┌─────────────────────────────────────▼───────────────────────────────────────────┐    │
│  │                         Layer 2: Backend Manager (C++)                          │    │
│  │  BackendRegistry (thread-safe) ← BackendFactory ← StrategyFactory               │    │
│  └─────────────────────────────────────────────────────────────────────────────────┘    │
│                                        │                                                │
│  ┌─────────────────────────────────────▼───────────────────────────────────────────┐    │
│  │                         Layer 3: Search & Strategy                              │    │
│  │  Hardware-aware Search │ Fingerprint Verification │ Performance Profiling       │    │
│  └─────────────────────────────────────────────────────────────────────────────────┘    │
│                                        │                                                │
│  ┌─────────────────────────────────────▼───────────────────────────────────────────┐    │
│  │                         Layer 4: Threadblock Operations                         │    │
│  │  MatMul │ Attention │ RMSNorm │ SwiGLU │ Softmax │ Reduce │ Elementwise         │    │
│  └─────────────────────────────────────────────────────────────────────────────────┘    │
│                                        │                                                │
│  ┌─────────────────────────────────────▼───────────────────────────────────────────┐    │
│  │                         Layer 5: Persistent Kernel Runtime                      │    │
│  │  Memory Management │ Kernel Launch │ Synchronization │ JIT Compilation          │    │
│  └─────────────────────────────────────────────────────────────────────────────────┘    │
│                                        │                                                │
│  ┌─────────────────────────────────────▼───────────────────────────────────────────┐    │
│  │                              Hardware Layer                                     │    │
│  │ ┌──────┐ ┌─────┐ ┌─────┐ ┌───────┐ ┌──────┐ ┌──────┐ ┌─────┐ ┌──────┐ ┌───────┐ │    │
│  │ │ CUDA │ │ROCm │ │ MPS │ │Ascend │ │ MACA │ │ TPU  │ │ XPU │ │ FPGA │ │  CPU  │ │    │
│  │ │NVIDIA│ │ AMD │ │Apple│ │Huawei │ │MetaX │ │Google│ │Intel│ │Xilinx│ │x86/ARM│ │    │
│  │ └──────┘ └─────┘ └─────┘ └───────┘ └──────┘ └──────┘ └─────┘ └──────┘ └───────┘ │    │
│  │  ┌───────┐ ┌─────┐ ┌──────┐                                                     │    │
│  │  │Triton │ │ NKI │ │ MLIR │  ← Compiler Backends                                │    │
│  │  │OpenAI │ │ AWS │ │ LLVM │                                                     │    │
│  │  └───────┘ └─────┘ └──────┘                                                     │    │
│  └─────────────────────────────────────────────────────────────────────────────────┘    │
│                                                                                         │
└─────────────────────────────────────────────────────────────────────────────────────────┘

Backend Support Matrix (12 Backends × 5 Layers)

| Backend | Hardware | Backend API | Strategy | Kernel | Threadblock | PK Runtime |
|---------|----------|-------------|----------|--------|-------------|------------|
| CUDA | NVIDIA GPU | ✅ | ✅ | ✅ | ✅ | ✅ |
| ROCm | AMD GPU | ✅ | ✅ | ✅ | ✅ | ✅ |
| CPU | x86/ARM | ✅ | ✅ | ✅ | ✅ | ✅ |
| MPS | Apple Silicon | ✅ | ✅ | ✅ | ✅ | ✅ |
| Ascend | Huawei NPU | ✅ | ✅ | ✅ | ✅ | ✅ |
| MACA | MetaX GPU | ✅ | ✅ | ✅ | ✅ | ✅ |
| TPU | Google Cloud | ✅ | ✅ | ✅ | ✅ | ✅ |
| XPU | Intel GPU | ✅ | ✅ | ✅ | ✅ | ✅ |
| FPGA | Intel/Xilinx | ✅ | ✅ | ✅ | ✅ | ✅ |
| Triton | Compiler | ✅ | ✅ | ✅ | ✅ | ✅ |
| NKI | AWS Neuron | ✅ | ✅ | 🚧 | 🚧 | 🚧 |
| MLIR | Multi-target | ✅ | ✅ | ✅ | ✅ | ✅ |

Status note: ✅ means the YiRage interface and modeling path exist; 🚧 means the backend is limited to modeling/code-generation paths and is not yet available through the runtime execution API. Some vendor-specific threadblock and persistent-kernel implementations remain experimental and are excluded from the default CMake build until their vendor toolchains and interfaces are complete.
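These support levels can be checked at runtime before committing to a backend. A minimal sketch using the query helpers shown later in Basic Usage (get_available_backends, is_backend_available); the preference order here is illustrative:

import yirage as yr

# Prefer a GPU backend whose runtime path is available; fall back to CPU.
preferred = ["cuda", "rocm", "mps"]
backend = next((b for b in preferred if yr.is_backend_available(b)), "cpu")
print(f"Available: {yr.get_available_backends()}; using: {backend}")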

Five-Layer Design

Layer 1: Python API

  • Backend query and selection (get_available_backends())
  • Hardware Device Registry (HardwareRegistry — register/query chip architectures at runtime)
  • Unified compiler interface (UnifiedCompiler)
  • Core bridge to C++ (CoreBridge)
  • Hardware-specific optimizers

Layer 2: Backend Manager (C++)

  • BackendRegistry (singleton, thread-safe)
  • Factory patterns for backends and strategies
  • Automatic initialization on import

Layer 3: Search & Strategy

  • Hardware-aware kernel search
  • Fingerprint-based verification
  • Performance profiling and modeling

Layer 4: Threadblock Operations

  • Optimized LLM operators (MatMul, Attention, RMSNorm, SwiGLU)
  • Hardware-specific implementations
  • Code generation for Triton/NKI/MLIR

Layer 5: Persistent Kernel Runtime

  • Device memory management
  • Kernel launch and synchronization
  • JIT compilation support
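Only Layer 1 is visible to user code: a single superoptimize() call drives backend dispatch, hardware-aware search, threadblock codegen, and the persistent-kernel runtime (Layers 2-5). A minimal sketch of that flow, mirroring the backend examples below:

import yirage as yr

# Layer 1: describe the computation as a kernel graph.
graph = yr.new_kernel_graph()
X = graph.new_input(dims=(8, 4096), dtype=yr.float16)
W = graph.new_input(dims=(4096, 4096), dtype=yr.float16)
graph.mark_output(graph.matmul(X, W))

# Layers 2-5 run behind this call: backend dispatch, hardware-aware
# search, threadblock codegen, and persistent-kernel execution.
optimized = graph.superoptimize(backend="cuda")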

✨ Key Features

🚀 12 Backend Targets (Core + Experimental)

| Backend | Hardware | Key Features | Architecture |
|---------|----------|--------------|--------------|
| CUDA | NVIDIA GPU | Tensor Core, 32-thread Warp, cuBLAS | SM, Shared Memory |
| ROCm | AMD GPU | Matrix Core, 64-thread Wavefront, rocBLAS | GCN/CDNA, LDS |
| CPU | x86/ARM | AVX512/NEON SIMD, Cache Blocking, OpenMP | Multi-core, L1/L2/L3 |
| MPS | Apple Silicon | Metal, Threadgroup, Unified Memory | M1/M2/M3/M4 |
| Ascend | Huawei NPU | Cube Unit 16×16, AI Core, L1 Buffer | Ascend 910/310 |
| MACA | MetaX GPU | 64-thread Warp, CUDA-compat, Tensor Core | C500 Series |
| TPU | Google Cloud | MXU 128×128, BF16 Native, PJRT | TPU v2/v3/v4/v5 |
| XPU | Intel GPU | XMX 8×8, SYCL/oneAPI, SLM | Arc/Max/Gaudi |
| FPGA | Intel/Xilinx | DSP Blocks, Pipeline, BRAM/HBM | OpenCL Kernel |
| Triton | Compiler | Auto-tuning, Tile Fusion, MMA | PTX/HSACO |
| NKI | AWS Neuron | Tensor Engine 128×128, SBUF 24MB | Trainium/Inferentia |
| MLIR | Multi-target | JIT, Linalg, Pass Pipeline | LLVM/NVVM/SPIRV |

🔧 Hardware Architecture Differences

┌────────────────────────────────────────────────────────────────────────────────────┐
│                        Hardware Architecture Comparison                            │
├────────────┬─────────────────┬─────────────────┬───────────────────────────────────┤
│ Backend    │ Thread Model    │ Matrix Unit     │ Memory Hierarchy                  │
├────────────┼─────────────────┼─────────────────┼───────────────────────────────────┤
│ CUDA       │ 32-thread Warp  │ Tensor Core     │ Registers → Shared → L2 → HBM     │
│ ROCm       │ 64-thread Wave  │ Matrix Core     │ VGPR → LDS → L2 → HBM             │
│ MPS        │ SIMD Group      │ Apple GPU       │ Threadgroup → Device → Unified    │
│ Ascend     │ AI Core         │ Cube 16×16      │ L0 → L1 → L2 → HBM                │
│ MACA       │ 64-thread Warp  │ Tensor Core     │ Shared → L2 → HBM                 │
│ TPU        │ MXU Systolic    │ MXU 128×128     │ VMEM → HBM                        │
│ XPU        │ Xe Subgroup     │ XMX 8×8         │ SLM → L3 → HBM                    │
│ FPGA       │ Pipeline        │ DSP Block       │ BRAM/URAM → DDR/HBM               │
└────────────┴─────────────────┴─────────────────┴───────────────────────────────────┘
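These per-vendor differences are also queryable through the hardware registry described below. A short sketch (the nvidia_h100 chip ID appears in the registry examples later; amd_mi300x is an assumed ID used for illustration):

from yirage.hardware import HardwareRegistry

reg = HardwareRegistry.instance()
# Compare execution width across vendors.
for chip_id in ("nvidia_h100", "amd_mi300x"):
    chip = reg.get(chip_id)
    if chip:
        print(chip.chip_name, "warp/wavefront size:", chip.compute.warp_size)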

🎯 Hardware-Aware Kernel Optimizers

  • 60+ Optimization Methods across all 12 backends
  • Automatic Configuration based on hardware capabilities
  • Performance Modeling for each backend
  • Code Generation for Triton/NKI/MLIR

Example: CUDA Optimizer

from yirage.backends.cuda.config import CUDAArch, get_cuda_search_config

config = get_cuda_search_config(CUDAArch.AMPERE)
print(config["arch"], config["warp_size"], config["has_tensor_cores"])
# Auto-configured: Tensor Core, warps, shared memory, search space

Example: MPS Optimizer (Apple Silicon)

from yirage.backends.mps.config import AppleChipFamily, get_mps_search_config

config = get_mps_search_config(AppleChipFamily.M3_MAX)
print(config["chip_family"], config["gpu_cores"], config["max_threads_per_threadgroup"])
# Auto-configures: M-series family, GPU cores, threadgroup size

Example: Ascend Optimizer (Huawei NPU)

import yirage as yr

# Create and optimize for Ascend NPU
graph = yr.new_kernel_graph()
X = graph.new_input(dims=(8, 4096), dtype=yr.float16)
W = graph.new_input(dims=(4096, 4096), dtype=yr.float16)
O = graph.matmul(X, W)
graph.mark_output(O)

# Optimize using the Ascend backend (via BiSheng + Triton)
optimized = graph.superoptimize(backend='ascend')
# Auto-configures: AI Core blocks, Cube unit tiles, L1 buffer

Example: MACA Optimizer (MetaX GPU)

import yirage as yr

# Create and optimize for MetaX MACA GPU
graph = yr.new_kernel_graph()
X = graph.new_input(dims=(8, 4096), dtype=yr.float16)
W = graph.new_input(dims=(4096, 4096), dtype=yr.float16)
O = graph.matmul(X, W)
graph.mark_output(O)

# Optimize using the MACA backend (64-thread warps!)
optimized = graph.superoptimize(backend='maca')
# Auto-configures: 64-thread warps, tile sizes, shared memory
# Environment: export MACA_HOME=/opt/maca

Example: ROCm Optimizer (AMD GPU) 🆕

import yirage as yr

# Create and optimize for AMD GPU
graph = yr.new_kernel_graph()
X = graph.new_input(dims=(8, 4096), dtype=yr.float16)
W = graph.new_input(dims=(4096, 4096), dtype=yr.float16)
O = graph.matmul(X, W)
graph.mark_output(O)

# Optimize using the ROCm backend
optimized = graph.superoptimize(backend='rocm')
# Auto-configures: 64-thread wavefronts, LDS, Matrix Cores (MI200/MI300)
# Environment: export ROCM_PATH=/opt/rocm

Example: TPU Optimizer (Google Cloud) 🆕

import yirage as yr

# Create and optimize for Google TPU
graph = yr.new_kernel_graph()
X = graph.new_input(dims=(8, 4096), dtype=yr.bfloat16)
W = graph.new_input(dims=(4096, 4096), dtype=yr.bfloat16)
O = graph.matmul(X, W)
graph.mark_output(O)

# Optimize using the TPU backend
optimized = graph.superoptimize(backend='tpu')
# Auto-configures: 128x128 MXU, BF16 native, VMEM tiling

Example: MLIR JIT Compiler 🆕

import yirage as yr
from yirage.pk import MLIRPKBackend
from yirage.threadblock.mlir_ops import MLIRCodeGenerator, MLIRTileConfig

# Generate MLIR for MatMul
config = MLIRTileConfig(tile_sizes=[32, 32, 32], vectorize=True)
mlir_code = MLIRCodeGenerator.generate_matmul(1024, 1024, 1024, dtype=yr.float16, config=config)

# JIT compile and execute
backend = MLIRPKBackend(target=MLIRPKBackend.JIT_TARGET_CPU)
backend.initialize()
backend.jit_compile(mlir_code)
backend.execute("matmul", [A_ptr, B_ptr, C_ptr], 3)

🔍 Backend-Specific Search Strategies

┌──────────────────────────────────────────────────────────────────────────────────────┐
│                              Search & Optimization Flow                              │
├──────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                      │
│  ┌──────────────┐     ┌──────────────────────────────────────────────────────────┐   │
│  │ Kernel Graph │────▶│                  Search Engine                           │   │
│  └──────────────┘     │  ┌────────────────┐  ┌───────────────┐  ┌─────────────┐  │   │
│                       │  │ Candidate Gen  │──│ Fingerprint   │──│ Performance │  │   │
│                       │  │ (µGraph Space) │  │ Verification  │  │  Profiler   │  │   │
│                       │  └────────────────┘  └───────────────┘  └─────────────┘  │   │
│                       └─────────────────────────────┬────────────────────────────┘   │
│                                                     │                                │
│  ┌──────────────────────────────────────────────────▼───────────────────────────┐    │
│  │                       Backend-Specific Strategies                            │    │
│  ├────────────┬────────────┬────────────┬────────────┬────────────┬─────────────┤    │
│  │   CUDA     │   ROCm     │   MPS      │  Ascend    │   MACA     │    TPU      │    │
│  │ TensorCore │ MatrixCore │ ThreadGrp  │  CubeUnit  │  64-Warp   │    MXU      │    │
│  │  32-Warp   │  64-Wave   │   SIMD     │  AI Core   │ TensorCore │  128×128    │    │
│  ├────────────┼────────────┼────────────┼────────────┼────────────┼─────────────┤    │
│  │    XPU     │   FPGA     │  Triton    │    NKI     │   MLIR     │    CPU      │    │
│  │    XMX     │  Pipeline  │ AutoTune   │ TensorEng  │ LinalgOpt  │  SIMD/OMP   │    │
│  │   SYCL     │    DSP     │ TileFuse   │   SBUF     │ JIT/AOT    │ CacheBlock  │    │
│  └────────────┴────────────┴────────────┴────────────┴────────────┴─────────────┘    │
│                                       │                                              │
│                         ┌─────────────▼─────────────┐                                │
│                         │    Optimized Kernel       │                                │
│                         │  (Best Configuration)     │                                │
│                         └───────────────────────────┘                                │
└──────────────────────────────────────────────────────────────────────────────────────┘
  • 12 Independent Search Strategies with hardware-specific optimization
  • 20+ Candidate Generation Dimensions
  • 15 Performance Evaluation Metrics
  • Auto-tuning and performance modeling
  • Code generation for compiler backends (Triton, NKI, MLIR)

🔌 Hardware Device Management (New!)

YiRage provides a unified hardware registry that allows new chip architectures to be registered at runtime — no code changes required. This is the highest level of hardware adaptation: any new chip can be plugged into the system by describing its architecture once.

Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│                     Hardware Device Management                           │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────────────┐     ┌──────────────────────────────────────┐    │
│  │  HardwareRegistry   │     │  ChipArchitecture Dataclass          │    │
│  │  (Thread-safe       │────▶│  ┌──────────┐ ┌───────────┐          │    │
│  │   Singleton)        │     │  │MemorySpec│ │ComputeSpec│          │    │
│  │                     │     │  └──────────┘ └───────────┘          │    │
│  │  • register()       │     │  ┌────────────┐ ┌────────┐           │    │
│  │  • get()            │     │  │FeatureFlags│ │Metadata│           │    │
│  │  • list_by_vendor() │     │  └────────────┘ └────────┘           │    │
│  │  • list_by_backend()│     └──────────────────────────────────────┘    │
│  │  • import_json()    │                                                 │
│  │  • export_json()    │     ┌──────────────────────────────────────┐    │
│  │  • on_register()    │     │  Built-in: 20+ Chips Pre-registered  │    │
│  └─────────────────────┘     │  NVIDIA V100→B200 │ AMD MI250X/MI300X│    │
│                              │  Ascend 910/910B  │ MetaX C500       │    │
│  ┌─────────────────────┐     │  Apple M2–M4      │ Google TPU v4/v5e│    │
│  │  Auto-Detection     │     │  Intel PVC │ Xilinx Alveo │ AWS Trn2 │    │ 
│  │  nvidia-smi/rocm-smi│     └──────────────────────────────────────┘    │
│  │  npu-smi/mx-smi/MPS │                                                 │
│  └─────────────────────┘                                                 │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘

20+ Built-in Chip Architectures

| Vendor | Chips | Backend | Category |
|--------|-------|---------|----------|
| NVIDIA | V100, T4, A100, RTX 3090, RTX 4090, H100, B200 | cuda | GPU |
| AMD | MI250X, MI300X | rocm | GPU |
| Intel | Data Center GPU Max 1550 (PVC) | xpu | GPU |
| Huawei | Ascend 910, 910B, 310P | ascend | NPU |
| MetaX | C500, C500 Pro | maca | GPU |
| Apple | M2 Ultra, M3 Max, M4 Max | mps | GPU |
| Google | TPU v4, TPU v5e | tpu | TPU |
| Xilinx | Alveo U250 | fpga | FPGA |
| AWS | Trainium2 | nki | DSA |

Quick Start — Query Built-in Chips

from yirage.hardware import HardwareRegistry

reg = HardwareRegistry.instance()

# Look up a chip
h100 = reg.get("nvidia_h100")
print(h100.summary())
# NVIDIA H100 SXM5 | 132 CUs | 80GB HBM3 | 989 TFLOPS FP16

# List chips by vendor / backend / category
nvidia_chips = reg.list_by_vendor("nvidia")  # 7 chips
cuda_chips = reg.list_by_backend("cuda")     # all CUDA-mapped chips
gpu_chips = reg.list_by_category("gpu")      # all GPUs across vendors

Register a New Chip at Runtime

from yirage.hardware import (
    HardwareRegistry, ChipArchitecture, ChipVendor, ChipCategory,
    ComputeSpec, MemorySpec, MemoryType, FeatureFlags,
)

reg = HardwareRegistry.instance()

# Define the new chip
new_chip = ChipArchitecture(
    chip_id="myvendor_x1",
    chip_name="MyVendor X1 Accelerator",
    vendor=ChipVendor.OTHER,
    category=ChipCategory.DSA,
    arch_name="X1",
    arch_code="x1_v1",
    backend="cuda",  # maps to a YiRage backend
    memory=MemorySpec(
        capacity_gb=128,
        bandwidth_gbps=6000,
        memory_type=MemoryType.HBM3E,
    ),
    compute=ComputeSpec(
        warp_size=32,
        num_compute_units=256,
        peak_tflops_fp16=2000,
    ),
    features=FeatureFlags(
        tensor_cores=True,
        fp8=True,
        bf16=True,
    ),
)

reg.register(new_chip)
print(f"Registry now has {reg.size} chips")

Bulk Import / Export (JSON)

# Export the entire registry to a file
reg.export_json("/path/to/chips.json")

# Import chips from a JSON file (e.g. from a partner's chip catalog)
count = reg.import_json("/path/to/new_chips.json")
print(f"Imported {count} new chips")

Auto-detect Current Hardware

from yirage.hardware import detect_current_chip

chip = detect_current_chip()
if chip:
    print(f"Detected: {chip.summary()}")
    print(f"Backend: {chip.backend}")
    print(f"Memory: {chip.memory.capacity_gb} GB {chip.memory.memory_type.value}")
    print(f"FP16: {chip.compute.peak_tflops_fp16} TFLOPS")

React to New Registrations (Callback)

from yirage.hardware import ChipArchitecture, HardwareRegistry

reg = HardwareRegistry.instance()

def on_new_chip(chip):
    print(f"🆕 New chip registered: {chip.chip_name} ({chip.chip_id})")

another_chip = ChipArchitecture(
    chip_id="callback_demo",
    chip_name="Callback Demo",
    backend="cpu",
)

reg.on_register(on_new_chip)
reg.register(another_chip, overwrite=True)  # triggers the callback

Module Structure

python/yirage/hardware/
├── __init__.py          # Public API — auto-populates built-in chips on import
├── chip_arch.py         # ChipArchitecture, MemorySpec, ComputeSpec, FeatureFlags
├── registry.py          # HardwareRegistry (thread-safe singleton)
├── builtin_chips.py     # 20+ pre-registered chip definitions
└── detector.py          # Runtime auto-detection (nvidia-smi, npu-smi, etc.)

🚀 Quick Start

Installation

Native runtime: import yirage requires yirage.core (Cython) linked against libyirage_runtime. Build from source with pip install -e . (see AGENTS.md) or use a wheel that includes the extension. Optional PyPI extras only add Python dependencies; they do not remove the need for native code.

Quick Install (Auto-detect Hardware)

git clone https://github.com/chenxingqiang/YiRage.git
cd YiRage
pip install -e .   # Auto-detects CUDA/MPS/CPU

Specify Backend

# Using the YIRAGE_BACKEND environment variable
YIRAGE_BACKEND=cuda pip install -e .     # NVIDIA GPU
YIRAGE_BACKEND=rocm pip install -e .     # AMD GPU
YIRAGE_BACKEND=mps pip install -e .      # Apple Silicon
YIRAGE_BACKEND=ascend pip install -e .   # Huawei NPU
YIRAGE_BACKEND=maca pip install -e .     # MetaX GPU
YIRAGE_BACKEND=cpu pip install -e .      # CPU backend (still a full native build)

# Multiple backends
YIRAGE_BACKEND=cuda,cpu pip install -e .

Huawei Ascend NPU

📖 Full Ascend Guide

# Load the environment
source /usr/local/Ascend/ascend-toolkit/set_env.sh
pip install torch_npu

# Install
YIRAGE_BACKEND=ascend pip install -e .

📖 Full Installation Guide - All backends and options

Basic Usage

import yirage as yr

# Query available backends
backends = yr.get_available_backends()
print(f"Available backends: {backends}")
# Output: ['cuda', 'cpu', 'mps']  # depends on your hardware

# Check a specific backend
if yr.is_backend_available('mps'):
    print("Apple Silicon GPU ready!")

# Create a kernel with backend selection
mpk = yr.PersistentKernel(
    mode="decode",
    backend="mps",              # Specify backend
    fallback_backends=["cpu"],  # Auto fallback
    world_size=1,
    mpi_rank=0,
    # ... other parameters
)

Using Hardware-Specific Optimizers

# CUDA optimization
from yirage.backends.cuda.config import CUDAArch, get_cuda_search_config
cuda_config = get_cuda_search_config(CUDAArch.AMPERE)
print(f"CUDA tile candidates: {cuda_config['block_dims_to_explore'][:3]}")

# CPU optimization
from yirage.backends.cpu.config import get_cpu_search_config
cpu_config = get_cpu_search_config()
print(f"CPU SIMD: {cpu_config['simd_type']} across {cpu_config['num_cores']} cores")
# Auto-detects: SIMD type, CPU cores, cache-aware search space

# MPS optimization (Apple Silicon)
from yirage.backends.mps.config import AppleChipFamily, get_mps_search_config
mps_config = get_mps_search_config(AppleChipFamily.M3_MAX)
print(f"MPS chip: {mps_config['chip_family']} with {mps_config['gpu_cores']} GPU cores")
# Auto-configures: GPU family, cores, memory/search limits

📊 Performance

M3 Mac Benchmarks

| Benchmark | MPS (ms) | CPU (ms) |
|-----------|----------|----------|
| gated_mlp | 0.677 | 1.268 |
| rms_norm | 0.463 | 0.115 |
| lora | 0.637 | 0.590 |
| gqa | 0.554 | - |
| norm_transformer | 1.195 | - |

All benchmarks support CUDA, MPS, and CPU backends


🤖 RL-Guided Kernel Search

YiRage now supports RL-guided kernel search using Ray/RLlib, enabling intelligent exploration of the kernel configuration space.

Hierarchical Closed-Loop Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                  RL-YiRage Hierarchical Closed Loop                 │
│                                                                     │
│  Level 1: Config Policy                                             │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │ HardwareConfig (grid_dim, block_dim, forloop) ────────────┐ │    │
│  └─────────────────────────────────────────────────────────┐ │ │    │
│                                                            │ │ │    │
│  Level 2: Graph Policy (constrained by Level 1)            │ │ │    │
│  ┌─────────────────────────────────────────────────────┐   │ │ │    │
│  │ µGraph actions ─▶ C++ Search ─▶ GPU Verify ─▶ reward│◀──┘ │ │    │
│  └─────────────────────────────────────────────────────┘     │ │    │
│         ▲                                                    │ │    │
│         └──── µGraph features (from C++) ◀───────────────────┘ │    │
│                                                                │    │
│                      policy update (RLlib)  ◀──────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘

Quick Start

# Run integration tests (no GPU required)
python scripts/test_rl_integration.py

# Test locally (no GPU required)
python scripts/train_rl_kernel_search.py --mode local --test-episodes 10

# Train with Ray/RLlib (requires GPU for verification)
python scripts/train_rl_kernel_search.py --mode train \
    --algorithm PPO \
    --num-workers 8 \
    --max-iterations 1000

# Search with a trained policy
python scripts/train_rl_kernel_search.py --mode search \
    --checkpoint /path/to/checkpoint \
    --target-graph examples/matmul.json

Python API

from yirage.rl import YiRageSearchEnv, EnvConfig, train_rl_search

# Create the environment
env_config = EnvConfig(
    target_graph_json=target_graph,
    backend="cuda",
    num_gpus=4,
)

# Option 1: Use as a Gymnasium environment
env = YiRageSearchEnv(vars(env_config))
obs, info = env.reset()
action = env.action_space.sample()
obs, reward, done, truncated, info = env.step(action)

# Option 2: Train with RLlib
from yirage.rl import TrainingConfig
config = TrainingConfig(
    algorithm="PPO",
    num_workers=8,
    max_iterations=500,
)
results = train_rl_search(config)

Hierarchical Search

from yirage.rl.search import (
    HardwareConfig, SearchSpaceConstraints,
    ConstrainedGraphActionSpace, HierarchicalSearchEnv
)

# Level 1: Configure hardware parameters
config = HardwareConfig(
    grid_dim_x=4, grid_dim_y=2, grid_dim_z=1,
    block_dim_x=128, block_dim_y=1, block_dim_z=1,
    forloop_range=16,
    shared_memory_size=49152
)

# Level 2: Get constraints for graph search
constraints = SearchSpaceConstraints(config)
print(f"Valid imaps: {len(constraints.valid_imaps)}")
print(f"Max operators: {constraints.max_operators}")

# Create a constrained graph action space
graph_space = ConstrainedGraphActionSpace(constraints)

µGraph Feature Extraction

from yirage.rl.features import MuGraphFeature, FeatureProcessor

# Features extracted from the C++ layer (or simulated JSON)
features = MuGraphFeature.from_json(features_json)
print(f"Operators: {len(features.operators)}")
print(f"Graph depth: {features.graph_depth}")

# Process for neural-network input
processor = FeatureProcessor()
processed = processor.process(features)
# node_features: (num_nodes, 16)
# edge_index: (2, num_edges)
# global_features: (48,)

Key Features

  • Hierarchical Search: Level 1 (config) constrains Level 2 (µGraph)
  • Complete Closed Loop: RL decisions → C++ search → GPU verification → reward
  • AccelForge Pre-screening: virtual-hardware latency/energy/area/power modeling before physical profiling
  • µGraph Feature Extraction: Rich features from C++ layer for RL model input
  • Multi-objective Reward: Balances validity, performance, efficiency, exploration
  • Ray Integration: Distributed CPU workers + GPU verification
  • Action Masking: Prevents invalid actions based on search state
  • Model Persistence: Save/load trained policies, export to ONNX

Hardware-Aware Training

from yirage.rl.hardware import detect_hardware, get_optimal_config
from yirage.rl.training import GRPOConfig, GRPOTrainer

# Auto-detect hardware
hardware = detect_hardware()
print(f"Detected: {hardware.backend} - {hardware.device_name}")
print(f"Peak FP16: {hardware.peak_tflops_fp16} TFLOPS")

# Get the optimal config for a workload
config = get_optimal_config(hardware, workload)

# Train with GRPO (supports LoRA fine-tuning)
grpo_config = GRPOConfig(
    group_size=8,
    learning_rate=1e-4,
    use_lora=True,
    lora_rank=16,
)

AccelForge Hardware Co-design

YiRage can use AccelForge as a virtual hardware oracle for accelerator design-space exploration and kernel candidate pre-screening:

pip install "yirage[accelforge]"

See YiRage × AccelForge Quick Start for availability diagnostics, µGraph workload conversion, multi-objective metrics, pre-screening, and Pareto-front examples.
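Because the extra only adds Python dependencies, a quick availability check before enabling pre-screening can be as simple as the stdlib-only sketch below (the accelforge module name is an assumption mirroring the extra's name):

import importlib.util

# Hypothetical check: the "accelforge" module name mirrors the extra.
if importlib.util.find_spec("accelforge") is None:
    print("AccelForge not installed; run: pip install 'yirage[accelforge]'")
else:
    print("AccelForge available: virtual-hardware pre-screening enabled")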

LLM Fine-tuning with TRL

from yirage.rl.training import FineTuningConfig, MuGraphPolicyTrainer

# Configure fine-tuning with TRL
config = FineTuningConfig(
    strategy="dpo",  # sft, dpo, grpo, ppo
    model_name_or_path="meta-llama/Llama-2-7b-hf",
    use_lora=True,
    use_4bit=True,   # QLoRA
    lora_r=16,
)

# Train the policy model
trainer = MuGraphPolicyTrainer(config)
trainer.train(train_data)

# Generate optimal configs
configs = trainer.generate_config(target_graph, hardware)

Universal Compute Optimization

Optimize any compute task on any hardware at any cluster scale with a single function call:

from yirage.rl.cluster import optimize_any_task

# Optimize with one call
result = optimize_any_task(
    {"type": "attention", "batch": 32, "seq_len": 2048, "num_heads": 32},
    cluster_spec={"type": "multi_node", "num_nodes": 4, "gpus_per_node": 8}
)
print(f"Strategy: {result.result.parallelism_strategy}")  # e.g., "tensor_parallel_8"
print(f"Latency: {result.result.estimated_latency_ms:.2f} ms")
print(f"Throughput: {result.result.estimated_throughput_tps:.1f} samples/sec")

# Get kernel configs for YiRage search
for op_id, config in result.kernel_configs.items():
    print(f"{op_id}: {config}")

Device Registry (25+ Pre-defined Devices):

from yirage.rl.cluster import (
    ClusterTopology, DeviceRegistry,
    get_device_spec, register_custom_device
)

# Create a heterogeneous cluster from the registry
cluster = ClusterTopology.create_from_registry([
    "H100_SXM:4",    # 4x NVIDIA H100
    "MI300X:2",      # 2x AMD MI300X
    "TPUv4:2",       # 2x Google TPU v4
    "Ascend910B:2",  # 2x Huawei Ascend
])

# Register custom hardware
register_custom_device("MyAccelerator", {
    "device_type": "custom",
    "compute_units": 128,
    "peak_tflops_fp16": 500.0,
    "memory_gb": 64.0,
    "memory_bandwidth_gbps": 2000.0,
})

Supported Device Types:

| Category | Devices |
|----------|---------|
| NVIDIA GPU | H100, A100, V100, RTX 4090, RTX 3090 |
| AMD GPU | MI300X, MI250X |
| Intel | Max 1550 (XPU) |
| Google | TPU v4, TPU v5e |
| Huawei | Ascend 910B, Ascend 910, Ascend 310 |
| AWS | Trainium2, Inferentia2 |
| Apple | M2 Ultra, M3 Max (MPS) |
| MetaX | C500 (MACA) |
| CPU | EPYC 9654, Xeon 8480 |
| FPGA | Alveo U280 |
| Custom | User-defined devices |

Key features:

  • Any Task: MatMul, Attention, MLP, Transformer, or custom graphs
  • Any Hardware: CPU, GPU, NPU, TPU, FPGA, or custom accelerators
  • Any Scale: Single device to multi-node clusters
  • Simulation-based: Accurate communication modeling without a real cluster
  • µGraph Integration: Generates search space for YiRage kernel optimization
  • Device Registry: 25+ pre-defined devices with full specs

Design Documents


🔥 COMET: Compound Operations with Explicit Collectives

YiRage integrates the COMET framework for modeling and optimizing compound operation dataflows with explicit collective communication, based on the research paper:

"COMET: A Framework for Modeling Compound Operation Dataflows with Explicit Collectives" (Negi et al.)

Key Features

  • Compound Operations: Fused execution of GEMM-Softmax, GEMM-LayerNorm, Self-Attention, Gated MLP
  • Explicit Collectives: AllReduce, AllGather, ReduceScatter, Broadcast with accurate cost modeling
  • Data Staging Model: Ramp-up/steady-state/ramp-down phases for memory hierarchy
  • Scheduling Strategies: Sequential, Pipelined, Parallel execution modes
  • Energy & Latency Estimation: Detailed breakdown for optimization decisions

Compound Operations API

import yirage as yr

# Create a kernel graph
graph = yr.new_kernel_graph()

# GEMM-Softmax fusion (reduces DRAM traffic by keeping the intermediate on-chip)
A = graph.new_input(dims=(1024, 512), dtype=yr.float16)
B = graph.new_input(dims=(512, 1024), dtype=yr.float16)
result = graph.gemm_softmax(A, B, dim=-1)

# GEMM-LayerNorm fusion
result_ln = graph.gemm_layernorm(A, B, normalized_shape=(1024,))

# Self-Attention (FlashAttention-style fusion)
Q = graph.new_input(dims=(8, 1024, 64), dtype=yr.float16)   # [H, S, D]
K = graph.new_input(dims=(8, 64, 1024), dtype=yr.float16)   # [H, D, S] (transposed)
V = graph.new_input(dims=(8, 1024, 64), dtype=yr.float16)   # [H, S, D]
attn_out = graph.self_attention(Q, K, V)

# Gated MLP (LLM-style with SiLU activation)
X = graph.new_input(dims=(8, 1024, 4096), dtype=yr.float16)
W_gate = graph.new_input(dims=(4096, 11008), dtype=yr.float16)
W_up = graph.new_input(dims=(4096, 11008), dtype=yr.float16)
W_down = graph.new_input(dims=(11008, 4096), dtype=yr.float16)
mlp_out = graph.gated_mlp(X, W_gate, W_up, W_down, activation="silu")

# RMSNorm + Linear (common in attention QKV projection)
norm_out = graph.rms_norm_linear(X, W_gate, normalized_shape=(4096,))

graph.mark_output(result)
optimized = graph.superoptimize(backend="cuda")

COMET Cost Model

from yirage.rl.cluster.simulator import (
    COMETCostModel, COMETHardwareConfig,
    SchedulingStrategy, MemoryLevel, CommunicationType
)

# Create a cost model with a hardware config
hw_config = COMETHardwareConfig(
    dram_bandwidth_gbps=900.0,            # HBM2e
    global_buffer_bandwidth_gbps=3000.0,  # On-chip L2
    num_compute_units=108,                # SMs on A100
    peak_tflops_fp16=312.0,
)
cost_model = COMETCostModel(hw_config=hw_config)

# Estimate compound-operation latency and energy
latency, energy = cost_model.estimate_compound_operation(
    op_name="gemm_softmax",
    input_shapes=[(2048, 1024), (1024, 2048)],
    dtype_bytes=2,  # FP16
    num_devices=4,
    strategy=SchedulingStrategy.PIPELINED,
)
print(f"Total latency: {latency.total_latency_ms:.3f} ms")
print(f" - Compute: {latency.compute_latency_ms:.3f} ms")
print(f" - Memory: {latency.total_memory_latency_ms:.3f} ms")
print(f" - Collective: {latency.collective_latency_ms:.3f} ms")
print(f"Total energy: {energy.total_energy_mj:.3f} mJ")

# Compare distributed variants (local vs distributed execution)
results = cost_model.compare_distributed_variants(
    op_name="gemm_softmax",
    input_shapes=[(4096, 2048), (2048, 4096)],
    num_devices=8,
)
print(f"Speedup with distribution: {results['speedup']:.2f}x")

Collective Communication Cost Model

from yirage.rl.cluster.simulator import CommunicationModel, CommunicationType

comm_model = CommunicationModel()

# Ring AllReduce latency (Eq. 3-4 from the COMET paper)
latency_ms = comm_model.all_reduce_time_ms(
    size_bytes=100 * 1024 * 1024,  # 100 MB
    num_devices=8,
    bandwidth_gbps=200.0,          # NVLink
    latency_us=1.0,
    algorithm="ring",
)
print(f"AllReduce latency: {latency_ms:.3f} ms")

# AllGather and ReduceScatter
gather_time = comm_model.all_gather_time_ms(
    size_bytes=50 * 1024 * 1024,
    num_devices=8,
    bandwidth_gbps=200.0,
    latency_us=1.0,
)
print(f"AllGather latency: {gather_time:.3f} ms")

Latency Breakdown (COMET Equations)

The cost model implements the COMET paper equations:

| Equation | Description | Formula |
|----------|-------------|---------|
| Eq. 1 | Memory Transaction | MemLat(T) = DV / BW |
| Eq. 2 | Data Staging | TotalMem = RampUp + Steady + RampDown |
| Eq. 3-4 | Ring Collective | CollLat = 2(n-1)/n × size / bw |
| Eq. 5-7 | Scheduling | Stall = CS + OS + CF |

Where:

  • DV: Data Volume, BW: Bandwidth
  • CS: Compulsory Stall (data dependency)
  • OS: Optional Stall (resource blocking)
  • CF: Conflict Stall (resource contention)
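As a worked example of Eq. 1 and Eq. 3-4, the sketch below moves 100 MB over a 900 GB/s DRAM link and ring-all-reduces the same volume across 8 devices on 200 GB/s links; the numbers are illustrative and independent of the YiRage API:

# Eq. 1: MemLat = DV / BW
dv = 100 * 1024**2            # 100 MB data volume, in bytes
bw = 900e9                    # 900 GB/s DRAM bandwidth
mem_lat_ms = dv / bw * 1e3
print(f"Memory transaction: {mem_lat_ms:.3f} ms")  # ~0.117 ms

# Eq. 3-4: ring collective, CollLat = 2(n-1)/n * size / bw
n = 8                          # devices in the ring
link_bw = 200e9                # 200 GB/s per link (NVLink-class)
coll_lat_ms = 2 * (n - 1) / n * dv / link_bw * 1e3
print(f"Ring AllReduce: {coll_lat_ms:.3f} ms")     # ~0.918 ms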

COMET Search Strategy

YiRage provides a complete search strategy for COMET compound operations:

from yirage.search import (
    COMETSearchStrategy,
    get_backend_config,
    detect_compound_patterns,
    optimize_compound_graph,
)

# Auto-detect compound patterns in a graph
op_types = ["matmul", "exp", "reduction", "div", "matmul"]  # Self-attention
patterns = detect_compound_patterns(op_types)
print(f"Found {len(patterns)} compound patterns: {[p.op_type.name for p in patterns]}")

# Get a backend-specific configuration (15 hardware profiles)
config = get_backend_config("cuda", "a100")   # NVIDIA A100
# Or: get_backend_config("rocm", "mi300x")    # AMD MI300X
# Or: get_backend_config("tpu", "v5e")        # Google TPU v5e
# Or: get_backend_config("ascend", "910b")    # Huawei Ascend

# Run the COMET search to find the optimal configuration
strategy = COMETSearchStrategy(config)
result = strategy.search(
    op_types=op_types,
    problem_dims={"M": 4096, "K": 4096, "N": 4096}
)
print(f"Best tile config: M={result.tile_config.tile_m}, N={result.tile_config.tile_n}")
print(f"Scheduling: {result.scheduling.name}")
print(f"Estimated latency: {result.latency_ns:.2f} ns")

Backend Hardware Profiles

| Backend | Variant | DRAM BW (GB/s) | Peak TFLOPS | Tile Sizes |
|---------|---------|----------------|-------------|------------|
| CUDA | H100 | 3350 | 989 | 64, 128, 256 |
| CUDA | A100 | 2039 | 312 | 64, 128, 256 |
| CUDA | V100 | 900 | 125 | 32, 64, 128 |
| ROCm | MI300X | 5300 | 1307 | 64, 128, 256 |
| ROCm | MI250X | 3200 | 383 | 64, 128, 256 |
| XPU | Ponte Vecchio | 3200 | 420 | 32, 64, 128, 256 |
| Ascend | 910B | 1600 | 320 | 64, 128, 256 |
| TPU | v5e | 1600 | 197 | 128, 256, 512 |
| TPU | v4 | 1200 | 275 | 128, 256, 512 |
| MACA | MXC500 | 2000 | 256 | 64, 128, 256 |
| MPS | M3 Max | 400 | 14.2 | 32, 64, 128 |
| MPS | M2 Ultra | 800 | 27.2 | 32, 64, 128 |
| CPU | Xeon | 200 | 4.0 | 32, 64, 128, 256 |
| CPU | EPYC | 460 | 5.0 | 32, 64, 128, 256 |
| FPGA | Alveo | 77 | 4.0 | 16, 32, 64, 128 |

🚀 Deep Ray Integration

YiRage provides production-grade distributed optimization with deep Ray integration:

Features

| Feature | Description |
|---------|-------------|
| C++ Binding | Direct search_partition() API via Cython for native performance |
| Object Store | ray.put() for efficient sharing of large graph data |
| Placement Groups | GPU affinity with PACK/SPREAD strategies for NVLink |
| Fault Tolerance | Exponential backoff retry + checkpoint/restore |
| Collective Ops | Efficient all-reduce for gradient synchronization |

Quick Start

from yirage.distributed import (
    RayDeepIntegration,
    DeepIntegrationConfig,
    GPUPlacementConfig,
    RetryConfig,
    RetryStrategy,
)

# Configure distributed optimization
config = DeepIntegrationConfig(
    num_workers=8,
    gpu_placement=GPUPlacementConfig(
        gpus_per_worker=1,
        strategy="PACK",  # NVLink locality
    ),
    retry=RetryConfig(
        strategy=RetryStrategy.EXPONENTIAL,
        max_retries=5,
    ),
    use_object_store=True,
)

# Create the engine and optimize
engine = RayDeepIntegration(config)
result = engine.optimize(
    graph={"type": "matmul", "input_shapes": [[1024, 2048], [2048, 4096]]},
    search_space={
        "grid_dims": [(1, 1, 1), (2, 1, 1), (4, 1, 1)],
        "block_dims": [(128, 1, 1), (256, 1, 1)],
    },
)
print(f"Best latency: {result['best_latency_ms']:.3f} ms")
print(f"Workers used: {result['num_workers']}")

All-Reduce for Gradients

# Distributed gradient synchronization
gradients = [{"layer1": 0.1}, {"layer1": 0.3}, {"layer1": 0.2}, {"layer1": 0.4}]
reduced = engine.all_reduce_gradients(gradients, reduce_op="mean")
# reduced["layer1"] = 0.25

Run Demo

python examples/cluster/deep_ray_integration_demo.py

📚 Documentation

Hardware Device Management

| Module | Description |
|--------|-------------|
| yirage.hardware | Hardware Registry — register, query, and auto-detect chip architectures at runtime |
| yirage.hardware.ChipArchitecture | Unified dataclass for chip specs (memory, compute, features) |
| yirage.hardware.HardwareRegistry | Thread-safe singleton with register/query/import/export |
| yirage.hardware.detect_current_chip() | Auto-detect NVIDIA/AMD/Ascend/MetaX/Apple hardware |

Hardware-Specific Guides

| Platform | Guide | Description |
|----------|-------|-------------|
| Huawei Ascend NPU | Installation Guide | Complete setup, build, and test instructions |
| Huawei Ascend NPU | Quick Start | Quick API usage examples |
| MetaX MACA GPU | Quick Start | MetaX GPU integration 🆕 |

🎓 Examples

Run Benchmarks

# MPS backend (Apple Silicon)
python benchmark/baselines/pytorch/gated_mlp.py -b 8 --backend mps

# CUDA backend (NVIDIA GPU)
python benchmark/baselines/pytorch/gated_mlp.py -b 8 --backend cuda

# CPU backend
python benchmark/baselines/pytorch/gated_mlp.py -b 8 --backend cpu

# Ascend backend (Huawei NPU) - requires CANN + torch_npu
python benchmark/baselines/pytorch/gated_mlp.py -b 8 --backend ascend

# MACA backend (MetaX GPU) - requires MACA SDK
python benchmark/baselines/pytorch/gated_mlp.py -b 8 --backend maca

Backend Selection

import yirage as yr

# Method 1: Direct specification
mpk = yr.PersistentKernel(
    backend="cpu",
    mode="decode",
    world_size=1,
    mpi_rank=0,
)

# Method 2: With fallback
mpk = yr.PersistentKernel(
    backend="cuda",
    fallback_backends=["mps", "cpu"],  # Auto fallback
    mode="decode",
    world_size=1,
    mpi_rank=0,
)

# Method 3: Query and select
backends = yr.get_available_backends()
best_backend = backends[0]  # Use the first available

🏆 Submit Your First Kernel

YiRage lets you validate optimization results quickly and then submit kernels to the gpu-mode leaderboard via popcorn-cli for community benchmarking on real hardware (A100, H100, etc.).

Quick Validation Workflow

# 1. Validate your kernel locally (no GPU required)
python examples/submission.py --validate

# Expected output:
# ✅ NumPy kernel: shape=(256, 256), dtype=uint8
# ⏱ NumPy throughput: 0.312 ms/frame (0.21 Gpix/s)
# ✅ Torch kernel: shape=(4, 1, 256, 256), device=cpu
# ⏱ Torch throughput: 1.234 ms/batch
# ✅ All validation steps completed.

Submit to Leaderboard (4 steps)

1. Install popcorn-cli

curl -fsSL https://raw.githubusercontent.com/gpu-mode/popcorn-cli/main/install.sh | bash

2. Register your account

popcorn-cli register discord

3. Set up your project

# Configure your project with a working example and optional agent skills
popcorn-cli setup

4. Submit your kernel

# Submit the included grayscale example to the grayscale_v2 leaderboard on an A100
popcorn-cli submit --gpu A100 --leaderboard grayscale_v2 --mode leaderboard examples/submission.py

Tip: Replace examples/submission.py with any file that exports a solution(input_tensor) function.
See the popcorn-cli repo for the full list of available leaderboards and GPUs.

Writing Your Own Submission

A valid submission.py must export a solution function with the leaderboard's expected signature:

import torch

def solution(input_tensor: torch.Tensor) -> torch.Tensor:
    """Your optimized kernel implementation."""
    # Example: YiRage-optimized grayscale conversion
    coeffs = torch.tensor([0.299, 0.587, 0.114], device=input_tensor.device)
    return (input_tensor * coeffs[None, :, None, None]).sum(dim=1, keepdim=True)

For a complete example with local validation, benchmarking, and YiRage superoptimizer integration, see examples/submission.py.


🤝 Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Adding a New Backend

  1. Implement BackendInterface
  2. Create {Backend}KernelConfig
  3. Implement {Backend}Optimizer
  4. Create {Backend}SearchStrategy (optional)
  5. Update CMake configuration

See Ascend Implementation Guide for a complete example.
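A minimal Python-side sketch of steps 2 and 3 above (all names here are illustrative, not the actual BackendInterface contract; see the guide for the real signatures):

from dataclasses import dataclass, field

# Hypothetical sketch of steps 2-3; the real interface lives in C++/Cython.
@dataclass
class MyBackendKernelConfig:          # step 2: per-backend kernel config
    warp_size: int = 32
    tile_sizes: list = field(default_factory=lambda: [64, 128, 256])
    shared_memory_kb: int = 48

class MyBackendOptimizer:             # step 3: hardware-aware optimizer
    def __init__(self, config: MyBackendKernelConfig):
        self.config = config

    def search_space(self):
        # Enumerate candidate tilings for the search layer (Layer 3).
        return [(t, self.config.warp_size) for t in self.config.tile_sizes]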


📄 License

YiRage is licensed under the Apache License 2.0.

Copyright:

  • YiRage Multi-Backend Extensions: Copyright 2025 Chen Xingqiang
  • Original Mirage: Copyright 2023-2024 Carnegie Mellon University

See LICENSE and NOTICE for details.


📚 Citation

@software{yirage2025,
  title  = {YiRage: Yield Revolutionary AGile Engine for Multi-Backend LLM Inference},
  author = {Chen, Xingqiang},
  year   = {2025},
  note   = {Multi-backend extension for LLM inference optimization},
  url    = {https://github.com/chenxingqiang/YiRage}
}

🙏 Acknowledgments

YiRage acknowledges CMU Mirage and the broader open-source systems and compiler communities whose work makes multi-backend optimization possible. Comprehensive third-party attribution details are maintained in NOTICE.


📞 Contact


YiRage - Yielding Maximum Performance Across All Hardware 🚀

Copyright 2025 Chen Xingqiang | Apache License 2.0
