GPU: NVIDIA GB200
Models: Qwen3.5 family (397B, 122B, 35B, 27B, 9B, 4B, 2B, 0.8B), d=128
Config: Warmup=10, Repeats=100, Backend=cudagraph
Library Versions: torch: 2.9.1 | fla: 0.5.0 | flashinfer: 0.6.13 | tilelang: 0.1.9
==============================================================================================================

>>> FORWARD BENCHMARKS
Model Config   Seqlens      h_qk  h_v  flash_qla [fwd]  FI [fwd]  FLA [fwd]  vs FLA  vs FI
------------------------------------------------------------------------------------------------------------
397B/122B TP8  1x32768      2     8    0.278ms          1.166ms   1.023ms    3.68x   4.20x
397B/122B TP8  1x16384      2     8    0.185ms          0.591ms   0.521ms    2.82x   3.20x
397B/122B TP8  1x8192       2     8    0.246ms          0.304ms   0.266ms    1.08x   1.23x
397B/122B TP8  1x4096       2     8    0.129ms          0.159ms   0.147ms    1.14x   1.23x
397B/122B TP8  1x2048       2     8    0.070ms          0.087ms   0.089ms    1.27x   1.24x
397B/122B TP8  28672+4096   2     8    0.275ms          1.024ms   0.929ms    3.38x   3.73x
397B/122B TP8  24576+8192   2     8    0.269ms          0.879ms   0.834ms    3.10x   3.26x
397B/122B TP8  16384+16384  2     8    0.259ms          0.589ms   0.649ms    2.51x   2.27x
397B/122B TP8  8192+24576   2     8    0.269ms          0.878ms   0.835ms    3.10x   3.26x
397B/122B TP8  4096+28672   2     8    0.275ms          1.023ms   0.929ms    3.38x   3.73x
397B/122B TP8  12288+4096   2     8    0.177ms          0.447ms   0.427ms    2.42x   2.53x
397B/122B TP8  6144+2048    2     8    0.191ms          0.232ms   0.219ms    1.15x   1.21x
397B/122B TP8  4096+4096    2     8    0.135ms          0.159ms   0.172ms    1.28x   1.18x
397B/122B TP8  2048+6144    2     8    0.191ms          0.231ms   0.219ms    1.15x   1.21x
397B/122B TP8  1024+7168    2     8    0.218ms          0.267ms   0.242ms    1.11x   1.22x
397B/122B TP8  8192x4       2     8    0.281ms          0.305ms   0.461ms    1.64x   1.09x
397B/122B TP8  4096x8       2     8    0.172ms          0.162ms   0.464ms    2.70x   0.94x
397B/122B TP8  2048x4       2     8    0.080ms          0.088ms   0.125ms    1.57x   1.10x
397B/122B TP8  1024x8       2     8    0.052ms          0.053ms   0.130ms    2.50x   1.01x
397B/122B TP4  1x32768      4     16   0.473ms          1.162ms   1.270ms    2.68x   2.46x
397B/122B TP4  1x16384      4     16   0.260ms          0.590ms   0.651ms    2.50x   2.27x
397B/122B TP4  1x8192       4     16   0.259ms          0.303ms   0.335ms    1.29x   1.17x
397B/122B TP4  1x4096       4     16   0.136ms          0.159ms   0.173ms    1.27x   1.17x
397B/122B TP4  1x2048       4     16   0.074ms          0.087ms   0.101ms    1.36x   1.17x
397B/122B TP4  28672+4096   4     16   0.467ms          1.020ms   1.174ms    2.52x   2.19x
397B/122B TP4  24576+8192   4     16   0.463ms          0.877ms   1.080ms    2.33x   1.89x
397B/122B TP4  16384+16384  4     16   0.450ms          0.592ms   0.892ms    1.98x   1.31x
397B/122B TP4  8192+24576   4     16   0.461ms          0.881ms   1.080ms    2.34x   1.91x
397B/122B TP4  4096+28672   4     16   0.467ms          1.025ms   1.173ms    2.51x   2.19x
397B/122B TP4  12288+4096   4     16   0.255ms          0.447ms   0.556ms    2.17x   1.75x
397B/122B TP4  6144+2048    4     16   0.203ms          0.231ms   0.287ms    1.41x   1.14x
397B/122B TP4  4096+4096    4     16   0.149ms          0.160ms   0.240ms    1.61x   1.08x
397B/122B TP4  2048+6144    4     16   0.204ms          0.232ms   0.287ms    1.41x   1.14x
397B/122B TP4  1024+7168    4     16   0.231ms          0.268ms   0.311ms    1.34x   1.16x
397B/122B TP4  8192x4       4     16   0.332ms          0.306ms   0.895ms    2.69x   0.92x
397B/122B TP4  4096x8       4     16   0.266ms          0.166ms   0.898ms    3.37x   0.62x
397B/122B TP4  2048x4       4     16   0.094ms          0.089ms   0.244ms    2.60x   0.95x
397B/122B TP4  1024x8       4     16   0.080ms          0.055ms   0.251ms    3.15x   0.69x
397B/122B TP2  1x32768      8     32   0.760ms          1.168ms   1.779ms    2.34x   1.54x
397B/122B TP2  1x16384      8     32   0.452ms          0.593ms   0.899ms    1.99x   1.31x
397B/122B TP2  1x8192       8     32   0.285ms          0.305ms   0.466ms    1.64x   1.07x
397B/122B TP2  1x4096       8     32   0.150ms          0.160ms   0.243ms    1.62x   1.07x
397B/122B TP2  1x2048       8     32   0.082ms          0.088ms   0.128ms    1.57x   1.08x
397B/122B TP2  28672+4096   8     32   0.749ms          1.026ms   1.684ms    2.25x   1.37x
397B/122B TP2  24576+8192   8     32   0.746ms          0.882ms   1.765ms    2.37x   1.18x
397B/122B TP2  16384+16384  8     32   0.654ms          0.596ms   1.777ms    2.71x   0.91x
397B/122B TP2  8192+24576   8     32   0.748ms          0.887ms   1.784ms    2.38x   1.19x
397B/122B TP2  4096+28672   8     32   0.749ms          1.030ms   1.785ms    2.38x   1.38x
397B/122B TP2  12288+4096   8     32   0.444ms          0.450ms   0.894ms    2.01x   1.01x
397B/122B TP2  6144+2048    8     32   0.230ms          0.233ms   0.467ms    2.03x   1.01x
397B/122B TP2  4096+4096    8     32   0.175ms          0.162ms   0.470ms    2.69x   0.93x
397B/122B TP2  2048+6144    8     32   0.230ms          0.234ms   0.472ms    2.05x   1.02x
397B/122B TP2  1024+7168    8     32   0.258ms          0.270ms   0.473ms    1.83x   1.05x
397B/122B TP2  8192x4       8     32   0.517ms          0.313ms   1.777ms    3.44x   0.61x
397B/122B TP2  4096x8       8     32   0.524ms          0.330ms   1.700ms    3.25x   0.63x
397B/122B TP2  2048x4       8     32   0.143ms          0.092ms   0.475ms    3.33x   0.65x
397B/122B TP2  1024x8       8     32   0.152ms          0.109ms   0.466ms    3.05x   0.71x
397B/122B TP1  1x32768      16    64   1.296ms          1.179ms   3.021ms    2.33x   0.91x
397B/122B TP1  1x16384      16    64   0.657ms          0.598ms   1.512ms    2.30x   0.91x
397B/122B TP1  1x8192       16    64   0.336ms          0.307ms   0.771ms    2.29x   0.91x
397B/122B TP1  1x4096       16    64   0.176ms          0.162ms   0.403ms    2.28x   0.92x
397B/122B TP1  1x2048       16    64   0.097ms          0.089ms   0.213ms    2.19x   0.92x
397B/122B TP1  28672+4096   16    64   1.461ms          1.035ms   2.901ms    1.99x   0.71x
397B/122B TP1  24576+8192   16    64   1.312ms          0.892ms   3.019ms    2.30x   0.68x
397B/122B TP1  16384+16384  16    64   1.008ms          0.607ms   3.026ms    3.00x   0.60x
397B/122B TP1  8192+24576   16    64   1.309ms          0.898ms   3.031ms    2.32x   0.69x
397B/122B TP1  4096+28672   16    64   1.459ms          1.042ms   3.033ms    2.08x   0.71x
397B/122B TP1  12288+4096   16    64   0.667ms          0.455ms   1.516ms    2.27x   0.68x
397B/122B TP1  6144+2048    16    64   0.344ms          0.237ms   0.776ms    2.26x   0.69x
397B/122B TP1  4096+4096    16    64   0.269ms          0.166ms   0.778ms    2.90x   0.62x
397B/122B TP1  2048+6144    16    64   0.343ms          0.238ms   0.779ms    2.27x   0.69x
397B/122B TP1  1024+7168    16    64   0.381ms          0.275ms   0.779ms    2.05x   0.72x
397B/122B TP1  8192x4       16    64   1.012ms          0.628ms   3.041ms    3.00x   0.62x
397B/122B TP1  4096x8       16    64   1.022ms          0.650ms   2.935ms    2.87x   0.64x
397B/122B TP1  2048x4       16    64   0.278ms          0.182ms   0.789ms    2.84x   0.66x
397B/122B TP1  1024x8       16    64   0.294ms          0.214ms   0.777ms    2.65x   0.73x
35B/9B/4B TP1  1x32768      16    32   0.757ms          1.170ms   1.808ms    2.39x   1.55x
35B/9B/4B TP1  1x16384      16    32   0.455ms          0.594ms   0.918ms    2.02x   1.31x
35B/9B/4B TP1  1x8192       16    32   0.285ms          0.306ms   0.476ms    1.67x   1.07x
35B/9B/4B TP1  1x4096       16    32   0.150ms          0.161ms   0.248ms    1.65x   1.07x
35B/9B/4B TP1  1x2048       16    32   0.082ms          0.088ms   0.130ms    1.59x   1.07x
35B/9B/4B TP1  28672+4096   16    32   0.747ms          1.028ms   1.713ms    2.29x   1.38x
35B/9B/4B TP1  24576+8192   16    32   0.744ms          0.884ms   1.794ms    2.41x   1.19x
35B/9B/4B TP1  16384+16384  16    32   0.656ms          0.600ms   1.807ms    2.75x   0.91x
35B/9B/4B TP1  8192+24576   16    32   0.747ms          0.888ms   1.813ms    2.43x   1.19x
35B/9B/4B TP1  4096+28672   16    32   0.747ms          1.032ms   1.813ms    2.43x   1.38x
35B/9B/4B TP1  12288+4096   16    32   0.444ms          0.451ms   0.913ms    2.05x   1.01x
35B/9B/4B TP1  6144+2048    16    32   0.230ms          0.235ms   0.475ms    2.06x   1.02x
35B/9B/4B TP1  4096+4096    16    32   0.175ms          0.163ms   0.478ms    2.73x   0.93x
35B/9B/4B TP1  2048+6144    16    32   0.230ms          0.234ms   0.480ms    2.09x   1.02x
35B/9B/4B TP1  1024+7168    16    32   0.258ms          0.270ms   0.481ms    1.87x   1.05x
35B/9B/4B TP1  8192x4       16    32   0.505ms          0.313ms   1.805ms    3.57x   0.62x
35B/9B/4B TP1  4096x8       16    32   0.520ms          0.330ms   1.728ms    3.32x   0.63x
35B/9B/4B TP1  2048x4       16    32   0.141ms          0.092ms   0.484ms    3.43x   0.65x
35B/9B/4B TP1  1024x8       16    32   0.153ms          0.108ms   0.474ms    3.09x   0.71x
27B TP2        1x32768      8     24   0.592ms          1.170ms   1.527ms    2.58x   1.98x
27B TP2        1x16384      8     24   0.353ms          0.594ms   0.779ms    2.21x   1.68x
27B TP2        1x8192       8     24   0.272ms          0.306ms   0.402ms    1.48x   1.12x
27B TP2        1x4096       8     24   0.144ms          0.160ms   0.218ms    1.52x   1.11x
27B TP2        1x2048       8     24   0.079ms          0.087ms   0.116ms    1.47x   1.11x
27B TP2        28672+4096   8     24   0.584ms          1.026ms   1.436ms    2.46x   1.76x
27B TP2        24576+8192   8     24   0.580ms          0.882ms   1.341ms    2.31x   1.52x
27B TP2        16384+16384  8     24   0.582ms          0.594ms   1.519ms    2.61x   1.02x
27B TP2        8192+24576   8     24   0.583ms          0.879ms   1.527ms    2.62x   1.51x
27B TP2        4096+28672   8     24   0.583ms          1.023ms   1.530ms    2.62x   1.75x
27B TP2        12288+4096   8     24   0.338ms          0.450ms   0.687ms    2.03x   1.33x
27B TP2        6144+2048    8     24   0.216ms          0.234ms   0.356ms    1.65x   1.08x
27B TP2        4096+4096    8     24   0.162ms          0.162ms   0.405ms    2.50x   1.00x
27B TP2        2048+6144    8     24   0.217ms          0.232ms   0.407ms    1.88x   1.07x
27B TP2        1024+7168    8     24   0.244ms          0.268ms   0.407ms    1.67x   1.10x
27B TP2        8192x4       8     24   0.604ms          0.311ms   1.340ms    2.22x   0.52x
27B TP2        4096x8       8     24   0.460ms          0.325ms   1.339ms    2.91x   0.71x
27B TP2        2048x4       8     24   0.167ms          0.091ms   0.363ms    2.18x   0.55x
27B TP2        1024x8       8     24   0.136ms          0.106ms   0.371ms    2.72x   0.78x
27B TP1        1x32768      16    48   1.105ms          1.171ms   2.527ms    2.29x   1.06x
27B TP1        1x16384      16    48   0.584ms          0.594ms   1.276ms    2.18x   1.02x
27B TP1        1x8192       16    48   0.309ms          0.306ms   0.657ms    2.12x   0.99x
27B TP1        1x4096       16    48   0.163ms          0.162ms   0.342ms    2.09x   0.99x
27B TP1        1x2048       16    48   0.090ms          0.088ms   0.190ms    2.10x   0.98x
27B TP1        28672+4096   16    48   1.092ms          1.028ms   2.407ms    2.20x   0.94x
27B TP1        24576+8192   16    48   0.973ms          0.887ms   2.284ms    2.35x   0.91x
27B TP1        16384+16384  16    48   1.194ms          0.604ms   2.530ms    2.12x   0.51x
27B TP1        8192+24576   16    48   1.200ms          0.891ms   2.532ms    2.11x   0.74x
27B TP1        4096+28672   16    48   1.095ms          1.035ms   2.532ms    2.31x   0.95x
27B TP1        12288+4096   16    48   0.496ms          0.452ms   1.153ms    2.33x   0.91x
27B TP1        6144+2048    16    48   0.257ms          0.235ms   0.597ms    2.32x   0.91x
27B TP1        4096+4096    16    48   0.318ms          0.165ms   0.662ms    2.08x   0.52x
27B TP1        2048+6144    16    48   0.322ms          0.236ms   0.662ms    2.06x   0.73x
27B TP1        1024+7168    16    48   0.319ms          0.272ms   0.662ms    2.08x   0.85x
27B TP1        8192x4       16    48   0.899ms          0.619ms   2.292ms    2.55x   0.69x
27B TP1        4096x8       16    48   0.768ms          0.488ms   2.306ms    3.00x   0.64x
27B TP1        2048x4       16    48   0.248ms          0.180ms   0.608ms    2.45x   0.73x
27B TP1        1024x8       16    48   0.223ms          0.161ms   0.623ms    2.80x   0.72x
2B/0.8B TP1    1x32768      16    16   0.473ms          1.182ms   1.298ms    2.74x   2.50x
2B/0.8B TP1    1x16384      16    16   0.265ms          0.600ms   0.667ms    2.51x   2.26x
2B/0.8B TP1    1x8192       16    16   0.257ms          0.308ms   0.343ms    1.33x   1.20x
2B/0.8B TP1    1x4096       16    16   0.136ms          0.159ms   0.178ms    1.31x   1.17x
2B/0.8B TP1    1x2048       16    16   0.074ms          0.087ms   0.102ms    1.37x   1.18x
2B/0.8B TP1    28672+4096   16    16   0.468ms          1.038ms   1.203ms    2.57x   2.22x
2B/0.8B TP1    24576+8192   16    16   0.463ms          0.892ms   1.109ms    2.40x   1.93x
2B/0.8B TP1    16384+16384  16    16   0.452ms          0.602ms   0.923ms    2.04x   1.33x
2B/0.8B TP1    8192+24576   16    16   0.461ms          0.895ms   1.110ms    2.41x   1.94x
2B/0.8B TP1    4096+28672   16    16   0.467ms          1.042ms   1.203ms    2.58x   2.23x
2B/0.8B TP1    12288+4096   16    16   0.260ms          0.453ms   0.573ms    2.20x   1.74x
2B/0.8B TP1    6144+2048    16    16   0.203ms          0.235ms   0.296ms    1.46x   1.16x
2B/0.8B TP1    4096+4096    16    16   0.148ms          0.163ms   0.250ms    1.68x   1.10x
2B/0.8B TP1    2048+6144    16    16   0.203ms          0.235ms   0.296ms    1.46x   1.16x
2B/0.8B TP1    1024+7168    16    16   0.230ms          0.272ms   0.319ms    1.39x   1.18x
2B/0.8B TP1    8192x4       16    16   0.332ms          0.311ms   0.798ms    2.40x   0.94x
2B/0.8B TP1    4096x8       16    16   0.261ms          0.167ms   0.799ms    3.07x   0.64x
2B/0.8B TP1    2048x4       16    16   0.094ms          0.091ms   0.221ms    2.34x   0.97x
2B/0.8B TP1    1024x8       16    16   0.080ms          0.056ms   0.224ms    2.81x   0.70x
Sym h32        1x32768      32    32   0.760ms          1.187ms   1.835ms    2.41x   1.56x
Sym h32        1x16384      32    32   0.456ms          0.602ms   0.929ms    2.04x   1.32x
Sym h32        1x8192       32    32   0.283ms          0.309ms   0.483ms    1.70x   1.09x
Sym h32        1x4096       32    32   0.150ms          0.163ms   0.252ms    1.68x   1.09x
Sym h32        1x2048       32    32   0.082ms          0.089ms   0.133ms    1.62x   1.08x
Sym h32        28672+4096   32    32   0.748ms          1.041ms   1.744ms    2.33x   1.39x
Sym h32        24576+8192   32    32   0.748ms          0.894ms   1.820ms    2.43x   1.20x
Sym h32        16384+16384  32    32   0.653ms          0.604ms   1.832ms    2.81x   0.92x
Sym h32        8192+24576   32    32   0.749ms          0.897ms   1.837ms    2.45x   1.20x
Sym h32        4096+28672   32    32   0.750ms          1.042ms   1.838ms    2.45x   1.39x
Sym h32        12288+4096   32    32   0.443ms          0.456ms   0.923ms    2.08x   1.03x
Sym h32        6144+2048    32    32   0.230ms          0.237ms   0.482ms    2.10x   1.03x
Sym h32        4096+4096    32    32   0.175ms          0.164ms   0.485ms    2.77x   0.94x
Sym h32        2048+6144    32    32   0.230ms          0.237ms   0.488ms    2.12x   1.03x
Sym h32        1024+7168    32    32   0.257ms          0.273ms   0.489ms    1.90x   1.06x
Sym h32        8192x4       32    32   0.505ms          0.316ms   1.833ms    3.63x   0.63x
Sym h32        4096x8       32    32   0.516ms          0.332ms   1.755ms    3.40x   0.64x
Sym h32        2048x4       32    32   0.142ms          0.093ms   0.492ms    3.46x   0.66x
Sym h32        1024x8       32    32   0.154ms          0.109ms   0.483ms    3.14x   0.71x

Benchmark Finished.

==============================================================================================================

>>> BACKWARD BENCHMARKS
Model Config  SeqLen    h_qk  h_v  #seq  flash_qla [bwd]  FLA [bwd]  Speedup
------------------------------------------------------------------------------------
hk16_hv16     32k_1seq  16    16   1     1.521ms          N/A        N/A
hk16_hv16     32k_2seq  16    16   2     1.522ms          N/A        N/A
hk16_hv16     32k_4seq  16    16   4     1.454ms          N/A        N/A
hk16_hv16     32k_5seq  16    16   5     1.455ms          N/A        N/A
hk16_hv16     32k_6seq  16    16   6     1.446ms          N/A        N/A
hk16_hv16     32k_8seq  16    16   8     1.436ms          N/A        N/A
hk4_hv16      32k_1seq  4     16   1     1.555ms          N/A        N/A
hk4_hv16      32k_2seq  4     16   2     1.561ms          N/A        N/A
hk4_hv16      32k_4seq  4     16   4     1.495ms          N/A        N/A
hk4_hv16      32k_5seq  4     16   5     1.478ms          N/A        N/A
hk4_hv16      32k_6seq  4     16   6     1.473ms          N/A        N/A
hk4_hv16      32k_8seq  4     16   8     1.469ms          N/A        N/A
hk12_hv12     32k_1seq  12    12   1     1.204ms          N/A        N/A
hk12_hv12     32k_2seq  12    12   2     1.209ms          N/A        N/A
hk12_hv12     32k_4seq  12    12   4     1.128ms          N/A        N/A
hk12_hv12     32k_5seq  12    12   5     1.139ms          N/A        N/A
hk12_hv12     32k_6seq  12    12   6     1.137ms          N/A        N/A
hk12_hv12     32k_8seq  12    12   8     1.142ms          N/A        N/A
hk4_hv12      32k_1seq  4     12   1     1.226ms          N/A        N/A
hk4_hv12      32k_2seq  4     12   2     1.242ms          N/A        N/A
hk4_hv12      32k_4seq  4     12   4     1.167ms          N/A        N/A
hk4_hv12      32k_5seq  4     12   5     1.177ms          N/A        N/A
hk4_hv12      32k_6seq  4     12   6     1.163ms          N/A        N/A
hk4_hv12      32k_8seq  4     12   8     1.166ms          N/A        N/A
hk8_hv8       32k_1seq  8     8    1     0.874ms          N/A        N/A
hk8_hv8       32k_2seq  8     8    2     0.883ms          N/A        N/A
hk8_hv8       32k_4seq  8     8    4     0.823ms          N/A        N/A
hk8_hv8       32k_5seq  8     8    5     0.840ms          N/A        N/A
hk8_hv8       32k_6seq  8     8    6     0.823ms          N/A        N/A
hk8_hv8       32k_8seq  8     8    8     0.823ms          N/A        N/A
hk4_hv8       32k_1seq  4     8    1     0.891ms          N/A        N/A
hk4_hv8       32k_2seq  4     8    2     0.909ms          N/A        N/A
hk4_hv8       32k_4seq  4     8    4     0.845ms          N/A        N/A
hk4_hv8       32k_5seq  4     8    5     0.856ms          N/A        N/A
hk4_hv8       32k_6seq  4     8    6     0.849ms          N/A        N/A
hk4_hv8       32k_8seq  4     8    8     0.853ms          N/A        N/A
hk4_hv4       32k_1seq  4     4    1     0.683ms          N/A        N/A
hk4_hv4       32k_2seq  4     4    2     0.664ms          N/A        N/A
hk4_hv4       32k_4seq  4     4    4     0.579ms          N/A        N/A
hk4_hv4       32k_5seq  4     4    5     0.590ms          N/A        N/A
hk4_hv4       32k_6seq  4     4    6     0.542ms          N/A        N/A
hk4_hv4       32k_8seq  4     4    8     0.541ms          N/A        N/A

Benchmark Finished.