Kernel Index

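cannbot-skills: CANNBot is a family of agents for CANN development that aim to improve developer efficiency; this repository provides reusable Skills modules for them. Project address: https://gitcode.com/cann/cannbot-skills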

Use this file to filter down to ≤3 candidate kernels before opening `kernel-catalog.md`. Fastest path for agents:

- `conda run -n torch210npu python agent/scripts/select_kernel_example.py --query "<formula or task>" --topology '<topology>' --limit 3 --catalog`
- Use this markdown table when you want a manual filter or the tool query is still too vague.
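
When an agent calls the selector from Python rather than from a shell, building the argument list explicitly avoids quoting mistakes in the formula and topology strings. The following is a minimal sketch: the helper name `build_select_command` is hypothetical, only the flags shown in the command above are used, and the script's output format is not described in this file, so nothing is assumed about it.

```python
import shlex


def build_select_command(query, topology, limit=3, env="torch210npu"):
    """Return the argv list for the selector command shown above.

    `build_select_command` is an illustrative helper, not part of the repo.
    """
    return [
        "conda", "run", "-n", env,
        "python", "agent/scripts/select_kernel_example.py",
        "--query", query,
        "--topology", topology,
        "--limit", str(limit),
        "--catalog",
    ]


argv = build_select_command("z = abs(x @ y.t()) + 1.0", "cube -> vec")
# Shell-quoted form of the same command, suitable for logging or pasting.
print(shlex.join(argv))
# To execute it: subprocess.run(argv, check=True, capture_output=True, text=True)
```

Passing a list to `subprocess.run` keeps the `->` in the topology string from being interpreted as a shell redirection.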

Each row gives device, topology, path, and a one-line formula hint. For `study_for` and `do_not_copy_when`, read the matching entry in `kernel-catalog.md`. For machine-readable use, see `agent/index/kernels.json`.

How to use

1. Pick the rows whose device matches your target (a2 or a5).
2. Narrow by topology: cube-only / cube -> vec / vec -> cube / vec -> cube -> vec / vec -> cube -> vec -> cube -> vec / cube -> vec -> cube / cube -> vec -> cube -> vec / vec-only / micro-only.
3. Narrow by formula shape: pure matmul vs with postprocess, with reduction, with softmax, with online-accumulation, quantized, causal, etc.
4. For each remaining candidate path, jump straight into `kernel-catalog.md` with Grep on the filename (e.g. `^### .*kernels/a2/flash_attn_full\.py.*`); do not scroll. Read only that one entry, and stop after `study_for` / `do_not_copy_when` unless you still need deeper notes.
5. Open the source file only after the catalog `study_for` / `do_not_copy_when` confirms the candidate.
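
The device and topology filters, followed by the targeted catalog lookup, can be sketched in a few lines of Python. This is an illustration under assumptions: the helper names `candidates` and `catalog_entry` are hypothetical, the rows are a hand-copied subset of the tables in this file, and the catalog is assumed to hold one `###` heading per kernel path.

```python
import re

# A few rows copied from the tables in this file: (device, topology, path).
ROWS = [
    ("a5", "cube-only", "agent/example/kernels/a5/matmul_float_mmad.py"),
    ("a5", "cube -> vec", "agent/example/kernels/a5/basic_cube_vec_mix.py"),
    ("a2", "cube -> vec -> cube -> vec (triple GM bridge, final vec divide)",
     "agent/example/kernels/a2/flash_attn_full.py"),
]


def candidates(rows, device, topology, limit=3):
    """Keep rows whose device and base topology match, capped at `limit`."""
    hits = [
        r for r in rows
        # Drop any parenthetical qualifier before comparing topologies.
        if r[0] == device and r[1].split(" (")[0] == topology
    ]
    return hits[:limit]


def catalog_entry(catalog_text, path):
    """Return only the catalog section whose '###' heading names `path`."""
    name = re.escape(path.rsplit("/", 1)[-1])
    heading = re.compile(rf"^### .*{name}.*$", re.MULTILINE)
    m = heading.search(catalog_text)
    if m is None:
        return None
    # The entry ends where the next '###' heading starts (or at end of file).
    nxt = re.search(r"^### ", catalog_text[m.end():], re.MULTILINE)
    end = m.end() + nxt.start() if nxt else len(catalog_text)
    return catalog_text[m.start():end].strip()
```

Escaping the filename keeps `flash_attn_full.py` from also matching variants such as `flash_attn_full_pj_hif8.py`, which is the point of grepping instead of scrolling.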

Vec-only and micro references

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a2 | vec-only | `agent/example/kernels/a2/to_hif8_torch.py` | `to_hif8_torch(x)`: emulated hif8 round, saturation sentinels |
| a2 | vec-only | `agent/example/kernels/a2/sort_rows.py` | per-row `torch.sort(x, dim=-1)` for `[ROWS=40, COLS=4096]` |
| a5 | vec-only | `agent/example/kernels/a5/chunk_row_cumsum.py` | chunked row-recursive cumsum |
| a5 | vec-only | `agent/example/kernels/a5/recurrent_state_attn_vec.py` | recurrent attention-state update, `D=128` |
| a5 | vec-only | `agent/example/kernels/a5/vec_unaligned_gm_to_ub_pad.py` | `exp(x) + 2` on padded unaligned GM width |
| a5 | micro-only | `agent/example/kernels/a5/micro_cast_fp8_pack4_dual.py` | `src.to(float8_e5m2)` via `micro` |

Cube-only

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a5 | cube-only | `agent/example/kernels/a5/matmul_float_mmad.py` | `z = x @ y.t()`, shortest cube baseline |
| a5 | cube-only | `agent/example/kernels/a5/matmul_e5m2_shortcut.py` | `z = x.float() @ y.float().t()` with fp8 inputs |
| a5 | cube-only | `agent/example/kernels/a5/matmul_kmkn_fp32_out.py` | `z = x.float().t() @ y.float()` (KM @ KN -> MN) |
| a5 | cube-only | `agent/example/kernels/a5/matmul_mknk_2dgrid_splitn.py` | `z = x @ y.t()` with `splitn` and 2D core grid |
| a5 | cube-only | `agent/example/kernels/a5/matmul_mknk_2dgrid_splitk.py` | `z = x @ y.t()` with `splitk` for large-K |
| a2 | cube-only | `agent/example/kernels/a2/qk_matmul_batched.py` | `qk = q.float() @ k.float().t()` with batched BH flatten |
| a2 | cube-only | `agent/example/kernels/a2/attn_backward_dense_stage1_tail_dbuf.py` | `qk = q.float() @ k.float().t()`, DBuff tail variant |
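
For readers unsure what `splitk` changes relative to the baseline `z = x @ y.t()`, a host-side reference may help. The sketch below is plain Python, not kernel code: it only shows that accumulating partial products over K slices reproduces the full product, including a tail slice. How the listed kernels map slices to cores is not described in this index, and the function names are illustrative.

```python
def matmul_nt(x, y):
    """z = x @ y.t() for x: [M][K] and y: [N][K], as nested lists."""
    return [[sum(a * b for a, b in zip(xr, yr)) for yr in y] for xr in x]


def matmul_nt_splitk(x, y, k_chunk):
    """Same result, accumulated from partial products over K slices."""
    m, n, k = len(x), len(y), len(x[0])
    z = [[0.0] * n for _ in range(m)]
    for k0 in range(0, k, k_chunk):
        # Partial product over columns k0 .. k0 + k_chunk (last slice may be short).
        part = matmul_nt(
            [row[k0:k0 + k_chunk] for row in x],
            [row[k0:k0 + k_chunk] for row in y],
        )
        for i in range(m):
            for j in range(n):
                z[i][j] += part[i][j]  # accumulate the partial tile
    return z
```

A reference like this is useful mainly as a correctness oracle when comparing a split-K kernel against the unsplit baseline.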

Cube -> vec (postprocess on a5)

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a5 | cube -> vec | `agent/example/kernels/a5/basic_cube_vec_mix.py` | `z = abs(x @ y.t()) + 1.0`, smallest mixed baseline |
| a5 | cube -> vec | `agent/example/kernels/a5/matmul_half_splitn_bias10p2_vf.py` | `((x @ y) + 10.2).half()`, bias + half output via `@vf` |
| a5 | cube -> vec | `agent/example/kernels/a5/matmul_rowwise_norm.py` | `z = (x @ y.t()) / row_sum(x @ y.t())` |
| a5 | cube -> vec | `agent/example/kernels/a5/matmul_rowwise_norm_large_nk.py` | same as rowwise_norm, larger N/K |
| a5 | cube -> vec | `agent/example/kernels/a5/matmul_rowwise_l2_norm.py` | L2-normalized matmul output |
| a5 | cube -> vec | `agent/example/kernels/a5/matmul_chunk_absmax_norm128.py` | per-row absmax normalize over 128-column chunks |
| a5 | cube -> vec | `agent/example/kernels/a5/matmul_kmkn_blockwise_quant128.py` | `x.float().t() @ y.float()` with blockwise-128 quant |
| a5 | cube -> vec | `agent/example/kernels/a5/matmul_mknk_2dgrid_splitk_add1.py` | `x @ y.t() + 1.0` with `splitk` |
| a5 | cube -> vec (dual-output atomic) | `agent/example/kernels/a5/cube_vec_atomic_add_two_outputs.py` | `out_cube += x @ y.t()` with atomics, two sinks |
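
The row-wise normalization hint `z = (x @ y.t()) / row_sum(x @ y.t())` shows the cube -> vec split: a matmul followed by a per-row reduction and divide. A plain-Python reference of that formula, written only from the hint and not from the kernel source, might look like this (function names are illustrative, and rows whose sum is zero are not handled):

```python
def matmul_nt(x, y):
    """Matmul stage: z = x @ y.t() for x: [M][K] and y: [N][K]."""
    return [[sum(a * b for a, b in zip(xr, yr)) for yr in y] for xr in x]


def matmul_rowwise_norm(x, y):
    """Postprocess stage: divide each output row by its own sum."""
    z = matmul_nt(x, y)
    out = []
    for row in z:
        total = sum(row)  # per-row reduction over the N axis
        out.append([v / total for v in row])
    return out


x = [[1.0, 2.0], [3.0, 4.0]]
y = [[1.0, 0.0], [0.0, 1.0]]  # identity, so x @ y.t() == x
print(matmul_rowwise_norm(x, y))
```

Each output row sums to 1 whenever the row sum of the matmul result is non-zero, which makes a quick sanity check for a candidate kernel's output.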

Vec -> cube (preprocess on a5)

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a5 | vec -> cube | `agent/example/kernels/a5/vec_cube_abs_sqrt_matmul.py` | `z = abs(x).sqrt() @ y.t()` |
| a5 | vec -> cube | `agent/example/kernels/a5/vec_cube_abs_sqrt_matmul_nz.py` | same as above, NZ-published |
| a5 | vec -> cube | `agent/example/kernels/a5/recompute_wu_cube_vec.py` | `k_cumdecay = attn @ (k_beta * decay_exp)` |

Vec -> cube -> vec fusion (a5)

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a5 | vec -> cube -> vec | `agent/example/kernels/a5/vec_cube_vec_scale2_abs_add1_matmul.py` | `abs((x*2).half() @ y.t()) + 1.0` |

Vec -> cube -> vec -> cube -> vec state bridge (a5)

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a5 | vec -> cube -> vec -> cube -> vec | `agent/example/kernels/a5/delta_h_state_bridge_v1_c8.py` | aligned `delta_h` baseline with persistent state snapshots and delayed state update |
| a5 | vec -> cube -> vec -> cube -> vec | `agent/example/kernels/a5/delta_h_psudo_state_bridge_c8.py` | pseudo-reference comparison on the same stable state-bridge schedule |

Cube -> vec -> cube -> vec lookahead (a5, MLA / MHA style)

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a5 | cube -> vec -> cube -> vec | `agent/example/kernels/a5/test_mla_entire.py` | streamed MLA: score, softmax, delayed `p @ k_nope`, final normalize |
| a5 | cube -> vec -> cube -> vec | `agent/example/kernels/a5/mha_ifa.py` | streamed single-row `softmax(q @ k.t()) @ v` |
| a5 | cube -> vec -> cube -> vec | `agent/example/kernels/a5/mha_ifa_256.py` | same, `BASES=256` |
| a5 | cube -> vec -> cube -> vec | `agent/example/kernels/a5/mha_ifa_fp8_scale_256.py` | fp8 q/k/v, fp8-scaled p tiles, `BASES=256` |
| a5 | cube -> vec -> cube -> vec | `agent/example/kernels/a5/flash_attn_full_fp8_causal.py` | multi-row causal full attention, fp8 q/k/v + fp8 `p` tiles, tail-safe `S1/S2` |
| a5 | cube -> vec -> cube -> vec | `agent/example/kernels/a5/mha_ifa_nz.py` | same, NZ-published probability tiles |
| a5 | cube -> vec -> cube -> vec | `agent/example/kernels/a5/mha_ifa_nz_256.py` | same, `BASES=256` + NZ |
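
The "streamed" and "final normalize" hints in this table refer to computing `softmax(q @ k.t()) @ v` one key/value tile at a time, carrying a running row maximum and running sum and dividing once at the end. The sketch below is a plain-Python reference of that standard online-softmax recurrence for a single query row; it is written from the formula hints only, does not reflect how any listed kernel schedules its cube and vec stages, and its function names are illustrative.

```python
import math


def attn_row_naive(q, k, v):
    """softmax(q @ k.t()) @ v for one query row, computed in one pass."""
    scores = [sum(a * b for a, b in zip(q, kr)) for kr in k]
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    total = sum(p)
    return [sum(p[j] * v[j][c] for j in range(len(v))) / total
            for c in range(len(v[0]))]


def attn_row_streamed(q, k, v, tile):
    """Same result, visiting keys/values `tile` rows at a time."""
    m = -math.inf              # running row max
    total = 0.0                # running softmax denominator
    acc = [0.0] * len(v[0])    # running unnormalized output
    for j0 in range(0, len(k), tile):
        scores = [sum(a * b for a, b in zip(q, kr)) for kr in k[j0:j0 + tile]]
        m_new = max(m, max(scores))
        alpha = math.exp(m - m_new)  # rescales earlier tiles; 0.0 on the first tile
        p = [math.exp(s - m_new) for s in scores]
        total = total * alpha + sum(p)
        acc = [a * alpha + sum(p[t] * v[j0 + t][c] for t in range(len(p)))
               for c, a in enumerate(acc)]
        m = m_new
    return [a / total for a in acc]  # final divide
```

Comparing `attn_row_streamed` against `attn_row_naive` with a tile size that does not divide the key count exercises the tail path as well.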

a2 mixed-pipeline (GM workspace bridges)

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a2 | cube -> vec (single GM bridge) | `agent/example/kernels/a2/attn_backward_dense_stage12_tail.py` | `qk = q.float() @ k.float().t()` stage-1+2 with tail |
| a2 | cube -> vec (single GM bridge) | `agent/example/kernels/a2/flash_attn_score.py` | `exp(Q @ K^T / sqrt(D) - row_max)` cast to half |
| a2 | cube -> vec (single GM bridge, running max) | `agent/example/kernels/a2/flash_attn_score_iter.py` | same, with cross-tile running row_max |
| a2 | cube -> vec -> cube | `agent/example/kernels/a2/attn_backward_dense_total_tail.py` | dense attn-backward with tail |
| a2 | cube -> vec -> cube | `agent/example/kernels/a2/attn_backward_dense_total_tail_causal.py` | same, causal masking |
| a2 | cube -> vec -> cube | `agent/example/kernels/a2/attn_backward_dense_total_tail_causal_hif8.py` | same, hif8 probability path |
| a2 | cube -> vec -> cube (double GM bridge, one-tile lookahead) | `agent/example/kernels/a2/flash_attn_score_pv.py` | `score_j = q @ k_j.t() * scale` with delayed `p @ v` |
| a2 | cube -> vec -> cube -> vec (triple GM bridge, one-tile lookahead) | `agent/example/kernels/a2/flash_attn_unnorm.py` | unnormalized flash-attn numerator |
| a2 | cube -> vec -> cube -> vec (triple GM bridge, final vec divide) | `agent/example/kernels/a2/flash_attn_full.py` | full flash-attn with running sum and final divide |
| a2 | cube -> vec -> cube -> vec (triple GM bridge, hif8 stage-1 vec path) | `agent/example/kernels/a2/flash_attn_full_pj_hif8.py` | same math as `flash_attn_full.py`, hif8 probability |
| a2 | cube -> vec -> cube -> vec (hif8 + diagonal causal mask, shared slot buffer) | `agent/example/kernels/a2/flash_attn_full_pj_hif8_causal.py` | same as hif8 variant, causal + future-tile skip |
| a2 | cube -> vec -> cube -> vec (half probability, block-32 diagonal causal) | `agent/example/kernels/a2/flash_attn_full_pj_half_block32_causal.py` | same math, half `p`, block-32 causal |
| a2 | cube -> vec -> cube -> vec (shared vec-side slot buffer for score and pv tiles) | `agent/example/kernels/a2/flash_attn_full_pj_hif8_commonub.py` | same as hif8 variant with shared UB slot |

Going deeper

- For `study_for` / `do_not_copy_when` detail on any single entry: open `agent/references/examples/kernel-catalog.md` at the matching `###` heading.
- For programmatic filtering: `agent/index/kernels.json`.
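
Programmatic filtering over `agent/index/kernels.json` could look like the sketch below. The schema is an assumption: this file does not document the JSON layout, so the example assumes a top-level list of objects whose `device`, `topology`, and `path` keys mirror the table columns. Check the real file before relying on these field names; `filter_kernels` is a hypothetical helper.

```python
import json

# ASSUMED schema: a list of objects keyed like the table columns above.
SAMPLE = """[
  {"device": "a5", "topology": "cube-only",
   "path": "agent/example/kernels/a5/matmul_float_mmad.py"},
  {"device": "a5", "topology": "cube -> vec",
   "path": "agent/example/kernels/a5/basic_cube_vec_mix.py"},
  {"device": "a2", "topology": "vec-only",
   "path": "agent/example/kernels/a2/sort_rows.py"}
]"""


def filter_kernels(entries, device=None, topology=None, limit=3):
    """Return at most `limit` entries matching the given device/topology."""
    hits = [
        e for e in entries
        if (device is None or e.get("device") == device)
        and (topology is None or e.get("topology") == topology)
    ]
    return hits[:limit]


# With the real index, load the file instead of the inline sample:
#   with open("agent/index/kernels.json") as fh: entries = json.load(fh)
entries = json.loads(SAMPLE)
for entry in filter_kernels(entries, device="a5", topology="cube -> vec"):
    print(entry["path"])
```

The default `limit=3` mirrors the "≤3 candidate kernels" goal stated at the top of this file.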
