news 2026/5/9 20:26:30

CANN/cannbot-skills Kernel Index

张小明

Front-end Development Engineer

Kernel Index

[Free download link] cannbot-skills: CANNBot is a family of agents for improving CANN development efficiency; this repository provides their reusable Skills modules. Project URL: https://gitcode.com/cann/cannbot-skills

Use this file to filter down to ≤3 candidate kernels before opening `kernel-catalog.md`. Fastest path for agents:

  • conda run -n torch210npu python agent/scripts/select_kernel_example.py --query "<formula or task>" --topology '<topology>' --limit 3 --catalog
  • use this markdown table when you want a manual filter or the tool query is still too vague
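For scripted use, the selector invocation above can be assembled as an argument list. This is a minimal sketch that assumes the script accepts exactly the flags shown in the command line above; `build_selector_command` is a hypothetical helper name, and nothing here is taken from the script's source.

```python
import shlex


def build_selector_command(query, topology, limit=3, env="torch210npu"):
    """Build the argv list for the selector command quoted in this index.

    Hypothetical helper: it only reuses the flags that appear above.
    """
    return [
        "conda", "run", "-n", env,
        "python", "agent/scripts/select_kernel_example.py",
        "--query", query,
        "--topology", topology,
        "--limit", str(limit),
        "--catalog",
    ]


cmd = build_selector_command("z = x @ y.t()", "cube-only")
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` from the repository root would execute it; building a list rather than a string avoids shell-quoting problems with topologies such as `cube -> vec`.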

Each row gives device, topology, path, and a one-line formula hint. For `study_for` and `do_not_copy_when`, read the matching entry in `kernel-catalog.md`. For machine-readable use, see `agent/index/kernels.json`.

How to use

  1. Pick the rows whose `device` matches your target (a2 or a5).
  2. Narrow by `topology`: cube-only / cube -> vec / vec -> cube / vec -> cube -> vec / vec -> cube -> vec -> cube -> vec / cube -> vec -> cube / cube -> vec -> cube -> vec / vec-only / micro-only.
  3. Narrow by formula shape: pure matmul vs with postprocess, with reduction, with softmax, with online-accumulation, quantized, causal, etc.
  4. For each remaining candidate path, jump straight into `kernel-catalog.md` with Grep on the filename (e.g. `^### .*kernels/a2/flash_attn_full\.py.*`); do not scroll. Read only that one entry, and stop after `study_for` / `do_not_copy_when` unless you still need deeper notes.
  5. Open the source file only after the catalog `study_for` / `do_not_copy_when` confirms the candidate.
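Steps 1 to 3 amount to a simple row filter. The sketch below applies them to a handful of rows copied from the tables in this index; the `KernelRow` type and `filter_rows` function are hypothetical names introduced for illustration and do not reflect the layout of `agent/index/kernels.json`, which is not shown here.

```python
from typing import NamedTuple


class KernelRow(NamedTuple):
    device: str
    topology: str
    path: str
    hint: str


# A few rows copied from the tables in this index.
ROWS = [
    KernelRow("a5", "cube-only",
              "agent/example/kernels/a5/matmul_float_mmad.py",
              "z = x @ y.t()"),
    KernelRow("a5", "cube -> vec",
              "agent/example/kernels/a5/matmul_rowwise_norm.py",
              "z = (x @ y.t()) / row_sum(x @ y.t())"),
    KernelRow("a2", "cube-only",
              "agent/example/kernels/a2/qk_matmul_batched.py",
              "qk = q.float() @ k.float().t() with batched BH flatten"),
    KernelRow("a2", "cube -> vec -> cube -> vec (triple GM bridge, final vec divide)",
              "agent/example/kernels/a2/flash_attn_full.py",
              "full flash-attn with running sum and final divide"),
]


def filter_rows(rows, device, topology, keyword="", limit=3):
    """Match device, then base topology, then a formula keyword (steps 1-3)."""
    picked = []
    for row in rows:
        # Drop a parenthetical qualifier such as "(single GM bridge)".
        base_topology = row.topology.split(" (")[0]
        if (row.device == device
                and base_topology == topology
                and keyword.lower() in row.hint.lower()):
            picked.append(row)
    return picked[:limit]


candidates = filter_rows(ROWS, "a2", "cube -> vec -> cube -> vec", keyword="flash-attn")
for row in candidates:
    print(row.path)
```

Comparing on the base topology keeps `cube -> vec` from also matching the longer `cube -> vec -> cube` pipelines, and the `limit` default mirrors the ≤3 candidate target.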

Vec-only and micro references

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a2 | vec-only | `agent/example/kernels/a2/to_hif8_torch.py` | `to_hif8_torch(x)`: emulated hif8 round, saturation sentinels |
| a2 | vec-only | `agent/example/kernels/a2/sort_rows.py` | per-row `torch.sort(x, dim=-1)` for `[ROWS=40, COLS=4096]` |
| a5 | vec-only | `agent/example/kernels/a5/chunk_row_cumsum.py` | chunked row-recursive cumsum |
| a5 | vec-only | `agent/example/kernels/a5/recurrent_state_attn_vec.py` | recurrent attention-state update, `D=128` |
| a5 | vec-only | `agent/example/kernels/a5/vec_unaligned_gm_to_ub_pad.py` | `exp(x) + 2` on padded unaligned GM width |
| a5 | micro-only | `agent/example/kernels/a5/micro_cast_fp8_pack4_dual.py` | `src.to(float8_e5m2)` via `micro` |

Cube-only

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a5 | cube-only | `agent/example/kernels/a5/matmul_float_mmad.py` | `z = x @ y.t()`: shortest cube baseline |
| a5 | cube-only | `agent/example/kernels/a5/matmul_e5m2_shortcut.py` | `z = x.float() @ y.float().t()` with fp8 inputs |
| a5 | cube-only | `agent/example/kernels/a5/matmul_kmkn_fp32_out.py` | `z = x.float().t() @ y.float()` (KM @ KN -> MN) |
| a5 | cube-only | `agent/example/kernels/a5/matmul_mknk_2dgrid_splitn.py` | `z = x @ y.t()` with `splitn` and 2D core grid |
| a5 | cube-only | `agent/example/kernels/a5/matmul_mknk_2dgrid_splitk.py` | `z = x @ y.t()` with `splitk` for large-K |
| a2 | cube-only | `agent/example/kernels/a2/qk_matmul_batched.py` | `qk = q.float() @ k.float().t()` with batched BH flatten |
| a2 | cube-only | `agent/example/kernels/a2/attn_backward_dense_stage1_tail_dbuf.py` | `qk = q.float() @ k.float().t()`: DBuff tail variant |
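
The `splitk` hint in the table above refers to partitioning the K (reduction) dimension. As a plain-Python illustration of the arithmetic only, the sketch below sums partial products over K slices and checks them against the direct product. It does not model the kernel's tiling, core grid, or accumulation strategy, which this index does not describe; `chunk` is an arbitrary illustrative value.

```python
def matmul_nt(x, y):
    """Reference z = x @ y.t() for nested lists: x is M x K, y is N x K."""
    return [[sum(a * b for a, b in zip(xr, yr)) for yr in y] for xr in x]


def matmul_nt_splitk(x, y, chunk):
    """Same product, accumulated from partial sums over K slices of width `chunk`."""
    m, n, k = len(x), len(y), len(x[0])
    z = [[0.0] * n for _ in range(m)]
    for k0 in range(0, k, chunk):
        # Slice the same K range out of both operands.
        xs = [row[k0:k0 + chunk] for row in x]
        ys = [row[k0:k0 + chunk] for row in y]
        part = matmul_nt(xs, ys)
        for i in range(m):
            for j in range(n):
                z[i][j] += part[i][j]
    return z


x = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]]
y = [[1.0, 0.0, 1.0, 0.0], [2.0, 1.0, 0.0, -1.0], [0.0, 0.0, 0.0, 1.0]]
z_split = matmul_nt_splitk(x, y, chunk=2)
assert z_split == matmul_nt(x, y)
```

The inputs are chosen so every partial sum is exactly representable; with general floating-point data the two results can differ in the last bits because the summation order changes.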

Cube -> vec (postprocess on a5)

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a5 | cube -> vec | `agent/example/kernels/a5/basic_cube_vec_mix.py` | `z = abs(x @ y.t()) + 1.0`: smallest mixed baseline |
| a5 | cube -> vec | `agent/example/kernels/a5/matmul_half_splitn_bias10p2_vf.py` | `((x @ y) + 10.2).half()`: bias + half output via `@vf` |
| a5 | cube -> vec | `agent/example/kernels/a5/matmul_rowwise_norm.py` | `z = (x @ y.t()) / row_sum(x @ y.t())` |
| a5 | cube -> vec | `agent/example/kernels/a5/matmul_rowwise_norm_large_nk.py` | same as rowwise_norm, larger N/K |
| a5 | cube -> vec | `agent/example/kernels/a5/matmul_rowwise_l2_norm.py` | L2-normalized matmul output |
| a5 | cube -> vec | `agent/example/kernels/a5/matmul_chunk_absmax_norm128.py` | per-row absmax normalize over 128-column chunks |
| a5 | cube -> vec | `agent/example/kernels/a5/matmul_kmkn_blockwise_quant128.py` | `x.float().t() @ y.float()` with blockwise-128 quant |
| a5 | cube -> vec | `agent/example/kernels/a5/matmul_mknk_2dgrid_splitk_add1.py` | `x @ y.t() + 1.0` with `splitk` |
| a5 | cube -> vec (dual-output atomic) | `agent/example/kernels/a5/cube_vec_atomic_add_two_outputs.py` | `out_cube += x @ y.t()` with atomics, two sinks |

Vec -> cube (preprocess on a5)

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a5 | vec -> cube | `agent/example/kernels/a5/vec_cube_abs_sqrt_matmul.py` | `z = abs(x).sqrt() @ y.t()` |
| a5 | vec -> cube | `agent/example/kernels/a5/vec_cube_abs_sqrt_matmul_nz.py` | same as above, NZ-published |
| a5 | vec -> cube | `agent/example/kernels/a5/recompute_wu_cube_vec.py` | `k_cumdecay = attn @ (k_beta * decay_exp)` |

Vec -> cube -> vec fusion (a5)

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a5 | vec -> cube -> vec | `agent/example/kernels/a5/vec_cube_vec_scale2_abs_add1_matmul.py` | `abs((x*2).half() @ y.t()) + 1.0` |

Vec -> cube -> vec -> cube -> vec state bridge (a5)

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a5 | vec -> cube -> vec -> cube -> vec | `agent/example/kernels/a5/delta_h_state_bridge_v1_c8.py` | aligned `delta_h` baseline with persistent state snapshots and delayed state update |
| a5 | vec -> cube -> vec -> cube -> vec | `agent/example/kernels/a5/delta_h_psudo_state_bridge_c8.py` | pseudo-reference comparison on the same stable state-bridge schedule |

Cube -> vec -> cube -> vec lookahead (a5, MLA / MHA style)

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a5 | cube -> vec -> cube -> vec | `agent/example/kernels/a5/test_mla_entire.py` | streamed MLA: score, softmax, delayed `p @ k_nope`, final normalize |
| a5 | cube -> vec -> cube -> vec | `agent/example/kernels/a5/mha_ifa.py` | streamed single-row `softmax(q @ k.t()) @ v` |
| a5 | cube -> vec -> cube -> vec | `agent/example/kernels/a5/mha_ifa_256.py` | same, `BASES=256` |
| a5 | cube -> vec -> cube -> vec | `agent/example/kernels/a5/mha_ifa_fp8_scale_256.py` | fp8 q/k/v, fp8-scaled p tiles, `BASES=256` |
| a5 | cube -> vec -> cube -> vec | `agent/example/kernels/a5/flash_attn_full_fp8_causal.py` | multi-row causal full attention, fp8 q/k/v + fp8 `p` tiles, tail-safe `S1`/`S2` |
| a5 | cube -> vec -> cube -> vec | `agent/example/kernels/a5/mha_ifa_nz.py` | same, NZ-published probability tiles |
| a5 | cube -> vec -> cube -> vec | `agent/example/kernels/a5/mha_ifa_nz_256.py` | same, `BASES=256` + NZ |

a2 mixed-pipeline (GM workspace bridges)

| Device | Topology | Path | Formula hint |
| --- | --- | --- | --- |
| a2 | cube -> vec (single GM bridge) | `agent/example/kernels/a2/attn_backward_dense_stage12_tail.py` | `qk = q.float() @ k.float().t()` stage-1+2 with tail |
| a2 | cube -> vec (single GM bridge) | `agent/example/kernels/a2/flash_attn_score.py` | `exp(Q @ K^T / sqrt(D) - row_max)` cast to half |
| a2 | cube -> vec (single GM bridge, running max) | `agent/example/kernels/a2/flash_attn_score_iter.py` | same, with cross-tile running row_max |
| a2 | cube -> vec -> cube | `agent/example/kernels/a2/attn_backward_dense_total_tail.py` | dense attn-backward with tail |
| a2 | cube -> vec -> cube | `agent/example/kernels/a2/attn_backward_dense_total_tail_causal.py` | same, causal masking |
| a2 | cube -> vec -> cube | `agent/example/kernels/a2/attn_backward_dense_total_tail_causal_hif8.py` | same, hif8 probability path |
| a2 | cube -> vec -> cube (double GM bridge, one-tile lookahead) | `agent/example/kernels/a2/flash_attn_score_pv.py` | `score_j = q @ k_j.t() * scale` with delayed `p @ v` |
| a2 | cube -> vec -> cube -> vec (triple GM bridge, one-tile lookahead) | `agent/example/kernels/a2/flash_attn_unnorm.py` | unnormalized flash-attn numerator |
| a2 | cube -> vec -> cube -> vec (triple GM bridge, final vec divide) | `agent/example/kernels/a2/flash_attn_full.py` | full flash-attn with running sum and final divide |
| a2 | cube -> vec -> cube -> vec (triple GM bridge, hif8 stage-1 vec path) | `agent/example/kernels/a2/flash_attn_full_pj_hif8.py` | same math as `flash_attn_full.py`, hif8 probability |
| a2 | cube -> vec -> cube -> vec (hif8 + diagonal causal mask, shared slot buffer) | `agent/example/kernels/a2/flash_attn_full_pj_hif8_causal.py` | same as hif8 variant, causal + future-tile skip |
| a2 | cube -> vec -> cube -> vec (half probability, block-32 diagonal causal) | `agent/example/kernels/a2/flash_attn_full_pj_half_block32_causal.py` | same math, half `p`, block-32 causal |
| a2 | cube -> vec -> cube -> vec (shared vec-side slot buffer for score and pv tiles) | `agent/example/kernels/a2/flash_attn_full_pj_hif8_commonub.py` | same as hif8 variant with shared UB slot |
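
Several hints above mention a cross-tile running `row_max`, a running sum, and a final divide. The following plain-Python sketch shows that arithmetic for a single query row, assuming the standard online-softmax recurrence. It is an illustration of the math only, not the kernels' implementation: GM bridges, lookahead, hif8/half probability tiles, and causal masking are not modeled, and all function names are hypothetical.

```python
import math


def attention_row_reference(q, keys, values, scale):
    """Direct softmax(q @ K^T * scale) @ V for one query row."""
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_row_streamed(q, keys, values, scale, tile):
    """Same result, visiting keys and values one tile at a time."""
    dim = len(values[0])
    row_max = -math.inf
    row_sum = 0.0
    acc = [0.0] * dim  # unnormalized numerator
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(row_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on the first tile).
        rescale = math.exp(row_max - new_max)
        p = [math.exp(s - new_max) for s in scores]
        row_sum = row_sum * rescale + sum(p)
        acc = [a * rescale + sum(pi * v[d] for pi, v in zip(p, v_tile))
               for d, a in enumerate(acc)]
        row_max = new_max
    return [a / row_sum for a in acc]  # final divide


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
scale = 1.0 / math.sqrt(len(q))

expected = attention_row_reference(q, keys, values, scale)
streamed = attention_row_streamed(q, keys, values, scale, tile=2)
assert all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(expected, streamed))
```

Skipping the final divide leaves `acc` as the unnormalized numerator, which is the quantity the `flash_attn_unnorm.py` hint names.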

Going deeper

  • For `study_for` / `do_not_copy_when` detail on any single entry: open `agent/references/examples/kernel-catalog.md` at the matching `###` heading.
  • For programmatic filtering: `agent/index/kernels.json`.
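
Reading only one catalog entry can also be done in code. The sketch below assumes each entry starts at a `###` heading that contains the kernel path, as the guidance above implies. The sample text is a hypothetical stand-in, since the catalog's real layout is not reproduced in this index, and `read_catalog_entry` is an illustrative name.

```python
import re


def read_catalog_entry(catalog_text, filename):
    """Return the text of the single `###` entry whose heading mentions `filename`."""
    heading = re.compile(r"^### .*" + re.escape(filename) + r".*$", re.MULTILINE)
    match = heading.search(catalog_text)
    if match is None:
        return None
    # The entry ends at the next level 1-3 markdown heading, or at end of text.
    next_heading = re.compile(r"^#{1,3} ", re.MULTILINE).search(catalog_text, match.end())
    end = next_heading.start() if next_heading else len(catalog_text)
    return catalog_text[match.start():end].strip()


# Hypothetical sample; placeholders stand in for the real notes.
SAMPLE = """### `agent/example/kernels/a2/flash_attn_full.py`
study_for: (placeholder)
do_not_copy_when: (placeholder)

### `agent/example/kernels/a2/flash_attn_full_pj_hif8.py`
study_for: (placeholder)
"""

entry = read_catalog_entry(SAMPLE, "kernels/a2/flash_attn_full.py")
print(entry)
```

Escaping the filename with `re.escape` keeps the `.` in `.py` literal, so `flash_attn_full.py` does not also match `flash_attn_full_pj_hif8.py`.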

Authoring note: parts of this article were generated with AI assistance (AIGC) and are for reference only.
