GPU: NVIDIA H200
Models: Qwen3.5 family (397B, 122B, 35B, 27B, 9B, 4B, 2B, 0.8B), d=128
Config: Warmup=10, Repeats=100, Backend=cudagraph
Library Versions: torch: 2.8.0+cu128 | fla: 0.5.0 | flashinfer: 0.6.13 | tilelang: 0.1.9
==============================================================================================================

>>> FORWARD BENCHMARKS
Model Config    Seqlens      h_qk  h_v  flash_qla [fwd]  FI [fwd]  FLA [fwd]  vs FLA  vs FI
------------------------------------------------------------------------------------------------------------
397B/122B TP8   1x32768      2     8    0.322ms          1.655ms   1.021ms    3.17x   5.13x
397B/122B TP8   1x16384      2     8    0.186ms          0.830ms   0.515ms    2.77x   4.47x
397B/122B TP8   1x8192       2     8    0.120ms          0.417ms   0.265ms    2.20x   3.46x
397B/122B TP8   1x4096       2     8    0.087ms          0.210ms   0.148ms    1.69x   2.41x
397B/122B TP8   1x2048       2     8    0.061ms          0.109ms   0.088ms    1.44x   1.78x
397B/122B TP8   28672+4096   2     8    0.317ms          1.449ms   0.933ms    2.94x   4.56x
397B/122B TP8   24576+8192   2     8    0.312ms          1.242ms   0.847ms    2.71x   3.98x
397B/122B TP8   16384+16384  2     8    0.302ms          0.825ms   0.676ms    2.24x   2.74x
397B/122B TP8   8192+24576   2     8    0.312ms          1.239ms   0.848ms    2.72x   3.97x
397B/122B TP8   4096+28672   2     8    0.316ms          1.444ms   0.933ms    2.95x   4.57x
397B/122B TP8   12288+4096   2     8    0.180ms          0.623ms   0.428ms    2.37x   3.45x
397B/122B TP8   6144+2048    2     8    0.116ms          0.313ms   0.221ms    1.91x   2.71x
397B/122B TP8   4096+4096    2     8    0.111ms          0.211ms   0.178ms    1.61x   1.90x
397B/122B TP8   2048+6144    2     8    0.116ms          0.313ms   0.221ms    1.91x   2.71x
397B/122B TP8   1024+7168    2     8    0.118ms          0.365ms   0.242ms    2.05x   3.09x
397B/122B TP8   8192x4       2     8    0.283ms          0.392ms   0.512ms    1.81x   1.39x
397B/122B TP8   4096x8       2     8    0.204ms          0.200ms   0.517ms    2.54x   0.98x
397B/122B TP8   2048x4       2     8    0.106ms          0.105ms   0.137ms    1.29x   0.99x
397B/122B TP8   1024x8       2     8    0.062ms          0.057ms   0.141ms    2.29x   0.93x
397B/122B TP4   1x32768      4     16   0.500ms          1.645ms   1.354ms    2.71x   3.29x
397B/122B TP4   1x16384      4     16   0.304ms          0.823ms   0.670ms    2.20x   2.71x
397B/122B TP4   1x8192       4     16   0.176ms          0.414ms   0.344ms    1.96x   2.35x
397B/122B TP4   1x4096       4     16   0.111ms          0.209ms   0.180ms    1.62x   1.88x
397B/122B TP4   1x2048       4     16   0.083ms          0.108ms   0.104ms    1.26x   1.31x
397B/122B TP4   28672+4096   4     16   0.490ms          1.429ms   1.271ms    2.59x   2.91x
397B/122B TP4   24576+8192   4     16   0.486ms          1.219ms   1.187ms    2.44x   2.51x
397B/122B TP4   16384+16384  4     16   0.476ms          0.778ms   1.020ms    2.14x   1.63x
397B/122B TP4   8192+24576   4     16   0.483ms          1.178ms   1.188ms    2.46x   2.44x
397B/122B TP4   4096+28672   4     16   0.486ms          1.380ms   1.271ms    2.61x   2.84x
397B/122B TP4   12288+4096   4     16   0.291ms          0.607ms   0.587ms    2.02x   2.09x
397B/122B TP4   6144+2048    4     16   0.174ms          0.306ms   0.303ms    1.74x   1.76x
397B/122B TP4   4096+4096    4     16   0.172ms          0.200ms   0.261ms    1.52x   1.16x
397B/122B TP4   2048+6144    4     16   0.175ms          0.300ms   0.303ms    1.74x   1.72x
397B/122B TP4   1024+7168    4     16   0.175ms          0.352ms   0.324ms    1.85x   2.01x
397B/122B TP4   8192x4       4     16   0.398ms          0.393ms   1.025ms    2.58x   0.99x
397B/122B TP4   4096x8       4     16   0.312ms          0.206ms   1.034ms    3.31x   0.66x
397B/122B TP4   2048x4       4     16   0.112ms          0.105ms   0.266ms    2.36x   0.93x
397B/122B TP4   1024x8       4     16   0.096ms          0.061ms   0.273ms    2.86x   0.64x
397B/122B TP2   1x32768      8     32   0.886ms          1.550ms   2.046ms    2.31x   1.75x
397B/122B TP2   1x16384      8     32   0.480ms          0.778ms   0.993ms    2.07x   1.62x
397B/122B TP2   1x8192       8     32   0.282ms          0.393ms   0.506ms    1.79x   1.39x
397B/122B TP2   1x4096       8     32   0.173ms          0.201ms   0.263ms    1.51x   1.16x
397B/122B TP2   1x2048       8     32   0.109ms          0.105ms   0.140ms    1.29x   0.97x
397B/122B TP2   28672+4096   8     32   0.882ms          1.359ms   2.047ms    2.32x   1.54x
397B/122B TP2   24576+8192   8     32   0.884ms          1.165ms   2.048ms    2.32x   1.32x
397B/122B TP2   16384+16384  8     32   0.771ms          0.783ms   2.051ms    2.66x   1.02x
397B/122B TP2   8192+24576   8     32   0.877ms          1.177ms   2.050ms    2.34x   1.34x
397B/122B TP2   4096+28672   8     32   0.880ms          1.372ms   2.051ms    2.33x   1.56x
397B/122B TP2   12288+4096   8     32   0.480ms          0.587ms   0.996ms    2.07x   1.22x
397B/122B TP2   6144+2048    8     32   0.274ms          0.297ms   0.509ms    1.86x   1.08x
397B/122B TP2   4096+4096    8     32   0.207ms          0.200ms   0.510ms    2.46x   0.96x
397B/122B TP2   2048+6144    8     32   0.274ms          0.297ms   0.510ms    1.86x   1.08x
397B/122B TP2   1024+7168    8     32   0.283ms          0.346ms   0.511ms    1.80x   1.22x
397B/122B TP2   8192x4       8     32   0.606ms          0.406ms   2.052ms    3.39x   0.67x
397B/122B TP2   4096x8       8     32   0.617ms          0.413ms   2.071ms    3.36x   0.67x
397B/122B TP2   2048x4       8     32   0.168ms          0.109ms   0.517ms    3.07x   0.65x
397B/122B TP2   1024x8       8     32   0.179ms          0.117ms   0.530ms    2.96x   0.66x
397B/122B TP1   1x32768      16    64   1.528ms          1.558ms   3.543ms    2.32x   1.02x
397B/122B TP1   1x16384      16    64   0.773ms          0.785ms   1.702ms    2.20x   1.02x
397B/122B TP1   1x8192       16    64   0.397ms          0.393ms   0.859ms    2.17x   0.99x
397B/122B TP1   1x4096       16    64   0.209ms          0.200ms   0.442ms    2.11x   0.95x
397B/122B TP1   1x2048       16    64   0.114ms          0.105ms   0.233ms    2.04x   0.92x
397B/122B TP1   28672+4096   16    64   1.703ms          1.365ms   3.555ms    2.09x   0.80x
397B/122B TP1   24576+8192   16    64   1.523ms          1.172ms   3.547ms    2.33x   0.77x
397B/122B TP1   16384+16384  16    64   1.200ms          0.811ms   3.545ms    2.95x   0.68x
397B/122B TP1   8192+24576   16    64   1.524ms          1.223ms   3.549ms    2.33x   0.80x
397B/122B TP1   4096+28672   16    64   1.703ms          1.451ms   3.551ms    2.09x   0.85x
397B/122B TP1   12288+4096   16    64   0.777ms          0.590ms   1.702ms    2.19x   0.76x
397B/122B TP1   6144+2048    16    64   0.402ms          0.299ms   0.864ms    2.15x   0.74x
397B/122B TP1   4096+4096    16    64   0.314ms          0.208ms   0.868ms    2.76x   0.66x
397B/122B TP1   2048+6144    16    64   0.402ms          0.311ms   0.865ms    2.15x   0.77x
397B/122B TP1   1024+7168    16    64   0.446ms          0.364ms   0.867ms    1.94x   0.82x
397B/122B TP1   8192x4       16    64   1.210ms          0.815ms   3.561ms    2.94x   0.67x
397B/122B TP1   4096x8       16    64   1.238ms          0.828ms   3.595ms    2.90x   0.67x
397B/122B TP1   2048x4       16    64   0.324ms          0.215ms   0.879ms    2.71x   0.67x
397B/122B TP1   1024x8       16    64   0.345ms          0.228ms   0.903ms    2.62x   0.66x
35B/9B/4B TP1   1x32768      16    32   0.882ms          1.550ms   2.065ms    2.34x   1.76x
35B/9B/4B TP1   1x16384      16    32   0.476ms          0.779ms   1.013ms    2.13x   1.64x
35B/9B/4B TP1   1x8192       16    32   0.283ms          0.397ms   0.514ms    1.82x   1.40x
35B/9B/4B TP1   1x4096       16    32   0.174ms          0.201ms   0.268ms    1.54x   1.15x
35B/9B/4B TP1   1x2048       16    32   0.110ms          0.105ms   0.143ms    1.30x   0.96x
35B/9B/4B TP1   28672+4096   16    32   0.881ms          1.369ms   2.064ms    2.34x   1.55x
35B/9B/4B TP1   24576+8192   16    32   0.882ms          1.173ms   2.062ms    2.34x   1.33x
35B/9B/4B TP1   16384+16384  16    32   0.778ms          0.780ms   2.062ms    2.65x   1.00x
35B/9B/4B TP1   8192+24576   16    32   0.882ms          1.177ms   2.070ms    2.35x   1.34x
35B/9B/4B TP1   4096+28672   16    32   0.882ms          1.373ms   2.070ms    2.35x   1.56x
35B/9B/4B TP1   12288+4096   16    32   0.479ms          0.592ms   1.018ms    2.12x   1.24x
35B/9B/4B TP1   6144+2048    16    32   0.273ms          0.299ms   0.519ms    1.90x   1.09x
35B/9B/4B TP1   4096+4096    16    32   0.207ms          0.200ms   0.517ms    2.50x   0.97x
35B/9B/4B TP1   2048+6144    16    32   0.275ms          0.299ms   0.519ms    1.89x   1.09x
35B/9B/4B TP1   1024+7168    16    32   0.284ms          0.348ms   0.520ms    1.83x   1.22x
35B/9B/4B TP1   8192x4       16    32   0.612ms          0.415ms   2.076ms    3.39x   0.68x
35B/9B/4B TP1   4096x8       16    32   0.620ms          0.420ms   2.090ms    3.37x   0.68x
35B/9B/4B TP1   2048x4       16    32   0.169ms          0.111ms   0.527ms    3.11x   0.65x
35B/9B/4B TP1   1024x8       16    32   0.181ms          0.119ms   0.541ms    2.99x   0.66x
27B TP2         1x32768      8     24   0.676ms          1.629ms   1.719ms    2.54x   2.41x
27B TP2         1x16384      8     24   0.414ms          0.816ms   0.844ms    2.04x   1.97x
27B TP2         1x8192       8     24   0.268ms          0.412ms   0.430ms    1.60x   1.53x
27B TP2         1x4096       8     24   0.175ms          0.208ms   0.233ms    1.33x   1.19x
27B TP2         1x2048       8     24   0.104ms          0.108ms   0.124ms    1.20x   1.05x
27B TP2         28672+4096   8     24   0.673ms          1.420ms   1.638ms    2.43x   2.11x
27B TP2         24576+8192   8     24   0.671ms          1.207ms   1.559ms    2.32x   1.80x
27B TP2         16384+16384  8     24   0.673ms          0.786ms   1.729ms    2.57x   1.17x
27B TP2         8192+24576   8     24   0.674ms          1.180ms   1.726ms    2.56x   1.75x
27B TP2         4096+28672   8     24   0.674ms          1.377ms   1.727ms    2.56x   2.04x
27B TP2         12288+4096   8     24   0.408ms          0.606ms   0.766ms    1.88x   1.49x
27B TP2         6144+2048    8     24   0.264ms          0.304ms   0.395ms    1.49x   1.15x
27B TP2         4096+4096    8     24   0.192ms          0.200ms   0.435ms    2.27x   1.04x
27B TP2         2048+6144    8     24   0.267ms          0.298ms   0.435ms    1.63x   1.11x
27B TP2         1024+7168    8     24   0.267ms          0.348ms   0.435ms    1.63x   1.30x
27B TP2         8192x4       8     24   0.540ms          0.394ms   1.562ms    2.89x   0.73x
27B TP2         4096x8       8     24   0.550ms          0.402ms   1.573ms    2.86x   0.73x
27B TP2         2048x4       8     24   0.155ms          0.107ms   0.397ms    2.56x   0.69x
27B TP2         1024x8       8     24   0.169ms          0.114ms   0.409ms    2.41x   0.67x
27B TP1         1x32768      16    48   1.276ms          1.568ms   2.927ms    2.29x   1.23x
27B TP1         1x16384      16    48   0.681ms          0.787ms   1.416ms    2.08x   1.16x
27B TP1         1x8192       16    48   0.418ms          0.396ms   0.716ms    1.71x   0.95x
27B TP1         1x4096       16    48   0.195ms          0.200ms   0.369ms    1.89x   1.03x
27B TP1         1x2048       16    48   0.107ms          0.105ms   0.203ms    1.90x   0.98x
27B TP1         28672+4096   16    48   1.260ms          1.370ms   2.817ms    2.24x   1.09x
27B TP1         24576+8192   16    48   1.420ms          1.173ms   2.722ms    1.92x   0.83x
27B TP1         16384+16384  16    48   1.065ms          0.780ms   2.956ms    2.78x   0.73x
27B TP1         8192+24576   16    48   1.429ms          1.182ms   2.954ms    2.07x   0.83x
27B TP1         4096+28672   16    48   1.259ms          1.377ms   2.951ms    2.34x   1.09x
27B TP1         12288+4096   16    48   0.729ms          0.589ms   1.313ms    1.80x   0.81x
27B TP1         6144+2048    16    48   0.379ms          0.300ms   0.674ms    1.78x   0.79x
27B TP1         4096+4096    16    48   0.289ms          0.201ms   0.730ms    2.52x   0.69x
27B TP1         2048+6144    16    48   0.378ms          0.300ms   0.727ms    1.92x   0.79x
27B TP1         1024+7168    16    48   0.424ms          0.347ms   0.728ms    1.72x   0.82x
27B TP1         8192x4       16    48   1.069ms          0.799ms   2.722ms    2.55x   0.75x
27B TP1         4096x8       16    48   0.926ms          0.621ms   2.753ms    2.97x   0.67x
27B TP1         2048x4       16    48   0.302ms          0.212ms   0.676ms    2.24x   0.70x
27B TP1         1024x8       16    48   0.274ms          0.174ms   0.697ms    2.54x   0.63x
2B/0.8B TP1     1x32768      16    16   0.502ms          1.653ms   1.349ms    2.69x   3.29x
2B/0.8B TP1     1x16384      16    16   0.309ms          0.824ms   0.676ms    2.19x   2.67x
2B/0.8B TP1     1x8192       16    16   0.179ms          0.414ms   0.347ms    1.94x   2.31x
2B/0.8B TP1     1x4096       16    16   0.115ms          0.209ms   0.181ms    1.58x   1.83x
2B/0.8B TP1     1x2048       16    16   0.086ms          0.108ms   0.106ms    1.24x   1.26x
2B/0.8B TP1     28672+4096   16    16   0.493ms          1.430ms   1.269ms    2.57x   2.90x
2B/0.8B TP1     24576+8192   16    16   0.490ms          1.211ms   1.190ms    2.43x   2.47x
2B/0.8B TP1     16384+16384  16    16   0.482ms          0.779ms   1.035ms    2.15x   1.62x
2B/0.8B TP1     8192+24576   16    16   0.486ms          1.178ms   1.190ms    2.45x   2.42x
2B/0.8B TP1     4096+28672   16    16   0.489ms          1.386ms   1.270ms    2.60x   2.83x
2B/0.8B TP1     12288+4096   16    16   0.295ms          0.612ms   0.601ms    2.04x   2.07x
2B/0.8B TP1     6144+2048    16    16   0.178ms          0.308ms   0.310ms    1.74x   1.73x
2B/0.8B TP1     4096+4096    16    16   0.175ms          0.202ms   0.271ms    1.55x   1.15x
2B/0.8B TP1     2048+6144    16    16   0.178ms          0.303ms   0.310ms    1.74x   1.70x
2B/0.8B TP1     1024+7168    16    16   0.179ms          0.354ms   0.329ms    1.84x   1.98x
2B/0.8B TP1     8192x4       16    16   0.400ms          0.396ms   1.039ms    2.60x   0.99x
2B/0.8B TP1     4096x8       16    16   0.324ms          0.216ms   1.050ms    3.24x   0.67x
2B/0.8B TP1     2048x4       16    16   0.113ms          0.106ms   0.275ms    2.44x   0.94x
2B/0.8B TP1     1024x8       16    16   0.097ms          0.063ms   0.283ms    2.91x   0.65x
Sym h32         1x32768      32    32   0.892ms          1.552ms   2.077ms    2.33x   1.74x
Sym h32         1x16384      32    32   0.482ms          0.779ms   1.032ms    2.14x   1.62x
Sym h32         1x8192       32    32   0.285ms          0.397ms   0.526ms    1.84x   1.39x
Sym h32         1x4096       32    32   0.176ms          0.201ms   0.275ms    1.56x   1.14x
Sym h32         1x2048       32    32   0.113ms          0.105ms   0.147ms    1.30x   0.94x
Sym h32         28672+4096   32    32   0.887ms          1.371ms   2.078ms    2.34x   1.55x
Sym h32         24576+8192   32    32   0.894ms          1.175ms   2.077ms    2.32x   1.31x
Sym h32         16384+16384  32    32   0.786ms          0.782ms   2.079ms    2.65x   0.99x
Sym h32         8192+24576   32    32   0.893ms          1.177ms   2.081ms    2.33x   1.32x
Sym h32         4096+28672   32    32   0.892ms          1.372ms   2.086ms    2.34x   1.54x
Sym h32         12288+4096   32    32   0.481ms          0.593ms   1.036ms    2.15x   1.23x
Sym h32         6144+2048    32    32   0.275ms          0.298ms   0.529ms    1.93x   1.08x
Sym h32         4096+4096    32    32   0.208ms          0.201ms   0.530ms    2.55x   0.97x
Sym h32         2048+6144    32    32   0.275ms          0.300ms   0.531ms    1.93x   1.09x
Sym h32         1024+7168    32    32   0.286ms          0.349ms   0.531ms    1.86x   1.22x
Sym h32         8192x4       32    32   0.633ms          0.425ms   2.088ms    3.30x   0.67x
Sym h32         4096x8       32    32   0.642ms          0.433ms   2.107ms    3.28x   0.68x
Sym h32         2048x4       32    32   0.172ms          0.114ms   0.539ms    3.14x   0.66x
Sym h32         1024x8       32    32   0.182ms          0.122ms   0.555ms    3.06x   0.67x
==============================================================================================================

>>> BACKWARD BENCHMARKS
Model Config  SeqLen     h_qk  h_v  #seq  flash_qla [bwd]  FLA [bwd]  Speedup
------------------------------------------------------------------------------------
hk16_hv16     32k_1seq   16    16   1     1.259ms          3.857ms    3.06x
hk16_hv16     32k_2seq   16    16   2     1.293ms          3.475ms    2.69x
hk16_hv16     32k_4seq   16    16   4     1.441ms          3.241ms    2.25x
hk16_hv16     32k_5seq   16    16   5     1.437ms          3.071ms    2.14x
hk16_hv16     32k_6seq   16    16   6     1.434ms          3.082ms    2.15x
hk16_hv16     32k_8seq   16    16   8     1.413ms          3.119ms    2.21x
hk4_hv16      32k_1seq   4     16   1     1.326ms          N/A        N/A
hk4_hv16      32k_2seq   4     16   2     1.355ms          N/A        N/A
hk4_hv16      32k_4seq   4     16   4     1.512ms          N/A        N/A
hk4_hv16      32k_5seq   4     16   5     1.505ms          N/A        N/A
hk4_hv16      32k_6seq   4     16   6     1.504ms          N/A        N/A
hk4_hv16      32k_8seq   4     16   8     1.485ms          N/A        N/A
hk12_hv12     32k_1seq   12    12   1     1.135ms          3.358ms    2.96x
hk12_hv12     32k_2seq   12    12   2     1.149ms          2.955ms    2.57x
hk12_hv12     32k_4seq   12    12   4     1.103ms          2.710ms    2.46x
hk12_hv12     32k_5seq   12    12   5     1.107ms          2.542ms    2.30x
hk12_hv12     32k_6seq   12    12   6     1.102ms          2.514ms    2.28x
hk12_hv12     32k_8seq   12    12   8     1.098ms          2.481ms    2.26x
hk4_hv12      32k_1seq   4     12   1     1.175ms          N/A        N/A
hk4_hv12      32k_2seq   4     12   2     1.187ms          N/A        N/A
hk4_hv12      32k_4seq   4     12   4     1.147ms          N/A        N/A
hk4_hv12      32k_5seq   4     12   5     1.157ms          N/A        N/A
hk4_hv12      32k_6seq   4     12   6     1.143ms          N/A        N/A
hk4_hv12      32k_8seq   4     12   8     1.156ms          N/A        N/A
hk8_hv8       32k_1seq   8     8    1     0.831ms          2.831ms    3.41x
hk8_hv8       32k_2seq   8     8    2     0.837ms          2.445ms    2.92x
hk8_hv8       32k_4seq   8     8    4     0.794ms          1.798ms    2.26x
hk8_hv8       32k_5seq   8     8    5     0.793ms          1.805ms    2.28x
hk8_hv8       32k_6seq   8     8    6     0.780ms          1.774ms    2.27x
hk8_hv8       32k_8seq   8     8    8     0.793ms          1.769ms    2.23x
hk4_hv8       32k_1seq   4     8    1     0.873ms          N/A        N/A
hk4_hv8       32k_2seq   4     8    2     0.877ms          N/A        N/A
hk4_hv8       32k_4seq   4     8    4     0.836ms          N/A        N/A
hk4_hv8       32k_5seq   4     8    5     0.835ms          N/A        N/A
hk4_hv8       32k_6seq   4     8    6     0.824ms          N/A        N/A
hk4_hv8       32k_8seq   4     8    8     0.832ms          N/A        N/A
hk4_hv4       32k_1seq   4     4    1     0.517ms          2.350ms    4.54x
hk4_hv4       32k_2seq   4     4    2     0.511ms          1.936ms    3.79x
hk4_hv4       32k_4seq   4     4    4     0.498ms          1.279ms    2.57x
hk4_hv4       32k_5seq   4     4    5     0.496ms          1.279ms    2.58x
hk4_hv4       32k_6seq   4     4    6     0.480ms          1.075ms    2.24x
hk4_hv4       32k_8seq   4     4    8     0.510ms          1.074ms    2.10x

Benchmark Finished.