news 2026/5/9 13:45:35

CANN/cannbot-skills: Online Softmax Tail Handling on A2 Triple-Bridge Kernels

张小明

Front-end Development Engineer


Online Softmax Tail Handling on A2 Triple-Bridge Kernels

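cannbot-skills: CANNBot is a series of agents for CANN development that aim to improve development efficiency; this repository provides reusable Skills modules for them. Project address: https://gitcode.com/cann/cannbot-skills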

Read this file when debugging or extending an a2 (easyasc.a2, device b3) normalized online softmax kernel with delayed p / pv stages and a non-aligned S2 tail or S1 tail.

Typical targets:

  • agent/example/kernels/a2/flash_attn_full.py
  • agent/example/kernels/a2/flash_attn_full_pj_hif8.py
  • agent/example/kernels/a2/flash_attn_full_pj_hif8_causal.py

Do not use this file as the first reference for generic tail bugs. For generic GM-boundary tail rules, read agent/references/constraints/tail-safety.md first. This file only covers the extra rule that appears once the kernel has:

  • running row_max
  • running row_sum
  • delayed expdiff
  • delayed p @ v

Goal

Handle a non-aligned S2 tail or S1 tail without breaking the normalized online softmax math.

The two axes are not symmetric:

  • S2 tail means invalid columns inside otherwise valid score rows
  • S1 tail means invalid rows inside an otherwise full local score tile

The stable rules are therefore different:

  • S2 still uses valid_n at GM boundaries, but also needs score-domain -inf masking before rowmax
  • S1 uses valid_m at GM boundaries, then masks only the local invalid rows after score - rowmax and before exp

1. Why GM-boundary slicing alone is not enough

The generic tail rule still applies:

  • local tensors stay full-tile sized
  • only GM loads/stores use valid_n

That prevents out-of-bounds reads, but it is not enough for online softmax.

If the last k/v tile is loaded with valid_n < TILE_N, the padded columns look like zeros in the staged full tile. That creates a second problem:

  • rowmax(score_j) can see the padded columns
  • curr_m = maximum(prev_m, rowmax(score_j)) can become too large
  • expdiff_j = exp(prev_m - curr_m) then rescales previous accumulated state incorrectly
  • row_sum and out are both corrupted even if later p_j is masked to zero

So for normalized online softmax:

  • padded tail columns must behave like -inf before rowmax
  • the same padded columns then naturally become 0 after exp
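The effect can be reproduced on a single score row in plain Python. This is only a sketch under stated assumptions: the tile width of 4, the score values, and the helper name online_state are made up for illustration and are not taken from the kernels. It streams the running row_max / row_sum update over two tiles and compares a valid-columns-only reference against a zero-padded tail and a -inf-masked tail.

```python
import math

NEG_INF = float("-inf")

def online_state(tiles):
    """Return (row_max, row_sum) after streaming one score row tile by tile."""
    row_max, row_sum = NEG_INF, 0.0
    for tile in tiles:
        curr_m = max(row_max, max(tile))
        expdiff = math.exp(row_max - curr_m)  # exp(-inf) == 0.0 on the first tile
        row_sum = row_sum * expdiff + sum(math.exp(s - curr_m) for s in tile)
        row_max = curr_m
    return row_max, row_sum

# Toy row: S2 = 6 with a toy TILE_N = 4, so the last tile has valid_n = 2.
full_tile = [-3.0, -1.0, -2.0, -4.0]
tail_valid = [-5.0, -2.5]

# Reference: only the valid columns exist.
reference = online_state([full_tile, tail_valid])
# Padded columns staged as zeros and seen by rowmax.
zero_padded = online_state([full_tile, tail_valid + [0.0, 0.0]])
# Padded columns forced to -inf before rowmax.
neg_inf_masked = online_state([full_tile, tail_valid + [NEG_INF, NEG_INF]])

print("reference     ", reference)
print("zero padded   ", zero_padded)
print("-inf masked   ", neg_inf_masked)
```

With all valid scores negative, the zero padding wins the row maximum and adds spurious mass to row_sum, while the -inf variant reproduces the reference state.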

2. Do not start from a p-domain-only fix

A p-domain-only tail mask is insufficient for normalized online softmax.

It can fix:

  • delayed p @ v
  • any later use of p_j

It cannot fix:

  • rowmax(score_j)
  • curr_m
  • delayed expdiff_j
  • row_sum

If the kernel has running row_max / row_sum, fix the score tile first.
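A one-step comparison makes the split between the two lists concrete. The sketch below is an assumption-laden toy (the function tile_update, the four-column tile, and the values are hypothetical): it applies the tail mask either in the score domain or only in the p domain and returns curr_m, expdiff, and p for one row.

```python
import math

def tile_update(prev_m, score, valid_n, mask_domain):
    """One online-softmax step for one row; mask_domain is 'score' or 'p'."""
    if mask_domain == "score":
        # Invalid columns look like -inf before the row maximum.
        score = [s if c < valid_n else float("-inf") for c, s in enumerate(score)]
    curr_m = max(prev_m, max(score))
    expdiff = math.exp(prev_m - curr_m)
    p = [math.exp(s - curr_m) for s in score]
    if mask_domain == "p":
        # Invalid columns are zeroed only after exp.
        p = [x if c < valid_n else 0.0 for c, x in enumerate(p)]
    return curr_m, expdiff, p

prev_m = -2.0
score = [-3.0, -4.0, 0.0, 0.0]   # last two columns are zero padding
valid_n = 2

m_s, e_s, p_s = tile_update(prev_m, score, valid_n, "score")
m_p, e_p, p_p = tile_update(prev_m, score, valid_n, "p")

print("score-domain mask:", m_s, e_s, p_s)
print("p-domain mask:    ", m_p, e_p, p_p)
```

Both variants end with zeros in the padded p columns, but only the score-domain variant keeps curr_m at the previous maximum and expdiff at 1.0, which is what the reference recurrence expects here.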

3. Stable semantic rule for invalid tail columns

For the last S2 tile:

  • before cmax: invalid columns must look like -inf
  • after exp: invalid columns must behave like 0

This rule preserves the exact reference update:

  • curr_m = maximum(prev_m, rowmax(score_j_valid_only))
  • p_j = exp(score_j_valid_only - curr_m)
  • row_sum = row_sum * expdiff_j + p_j.sum(-1)

You do not need a separate p-domain tail mask if the score tile already uses this -inf rule and the delayed v load also uses valid_n.

4. Stable a2 implementation shape

For the current validated flash-attention kernels:

  • TILE_N = 128
  • score is processed in vec as [HALF_M, TILE_N]
  • the practical split is two [HALF_M, 64] halves

That gives a stable rule:

  • left half handles columns [0:64)
  • right half handles columns [64:128)

Tail cases:

  • valid_n == 128
    • both halves fully valid
  • 64 < valid_n < 128
    • left half fully valid
    • right half needs a suffix invalid mask
  • valid_n == 64
    • left half fully valid
    • right half fully invalid
  • 0 < valid_n < 64
    • left half needs a suffix invalid mask
    • right half fully invalid
  • valid_n == 0
    • both halves fully invalid
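The case table above reduces to two clamps. As a minimal sketch (the helper names split_valid_cols and half_state are hypothetical, not kernel functions), the per-half valid column counts and the resulting handling can be derived like this:

```python
HALF_N = 64  # columns per score half; TILE_N = 128 in the validated kernels

def split_valid_cols(valid_n):
    """Return (left_valid_cols, right_valid_cols) for one 128-column score tile."""
    left = min(max(valid_n, 0), HALF_N)
    right = min(max(valid_n - HALF_N, 0), HALF_N)
    return left, right

def half_state(valid_cols):
    """Name the handling one 64-column half needs."""
    if valid_cols == HALF_N:
        return "fully valid"
    if valid_cols == 0:
        return "fully invalid"
    return "suffix invalid mask"

for valid_n in (128, 100, 64, 10, 0):
    left, right = split_valid_cols(valid_n)
    print(valid_n, (left, half_state(left)), (right, half_state(right)))
```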

5. Why vec mask + finite negative sentinel is the simplest score-domain fix

For float vec ops on a2:

  • the active mask prefix length is 64
  • the same 64-lane mask prefix is reused for each repeat

That matches a [HALF_M, 64] score half perfectly:

  • one row uses one repeat
  • each row wants the same tail-column mask

So the stable suffix invalidation pattern is:

  1. compute a 64-bit suffix-invalid mask
  2. set_mask(0, low_mask)
  3. dup(score_half, neg_large)
  4. reset_mask()

This is usually simpler than materializing a [HALF_M, 64] flag tensor and then doing select(...) on the score half. The intent is still -inf behavior, but the concrete fill should stay finite on hardware paths.
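The four-step pattern can be modeled in plain Python to check the intended result. This is a behavioral sketch only: masked_dup is a hypothetical stand-in for the set_mask / dup / reset_mask sequence as described above, not the easyasc API, and the sentinel value NEG_LARGE is an assumption chosen for illustration.

```python
NEG_LARGE = -1.0e30  # hypothetical finite stand-in for -inf

def masked_dup(score_half, low_mask, fill=NEG_LARGE):
    """Model of set_mask(0, low_mask); dup(score_half, fill); reset_mask()."""
    for row in score_half:          # one repeat per row
        for lane in range(64):      # the same 64-lane mask is reused per repeat
            if (low_mask >> lane) & 1:
                row[lane] = fill    # only lanes with a set mask bit are written
    return score_half

valid_cols = 61
# Bits [valid_cols:64) set, bits [0:valid_cols) clear.
low_mask = ((1 << 64) - 1) & ~((1 << valid_cols) - 1)

# Toy [2, 64] score half where each entry equals its column index.
half = [[float(c) for c in range(64)] for _ in range(2)]
masked_dup(half, low_mask)

print(half[0][58:64])
```

Every row receives the same suffix overwrite, which is why a single scalar mask is enough for the whole half.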

Read next for exact vec mask semantics:

  • agent/references/constraints/mask.md

6. Bit order and mask meaning

Instruction semantics:

  • low writes mask[0:64]
  • bit 0 maps to the lowest logical lane in that prefix
  • bit 63 maps to the highest logical lane in that prefix

Stub call note:

  • the current a2 stub is called as set_mask(mask_high, mask_low)
  • so a low-only score-half mask is written with set_mask(0, low_mask)

So for a suffix invalid mask on one 64-column score half:

  • columns [0:valid_cols) should be 0
  • columns [valid_cols:64) should be 1

Examples:

  • valid_cols = 64 -> no invalid bits
  • valid_cols = 63 -> only bit 63 is 1
  • valid_cols = 10 -> bits [10:64) are 1
  • valid_cols = 0 -> all bits are 1
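The examples above can be checked with ordinary Python integers, which are arbitrary precision and therefore sidestep any signed/unsigned question. A minimal sketch, with suffix_invalid_mask as a hypothetical helper name:

```python
def suffix_invalid_mask(valid_cols):
    """64-bit mask with bits [valid_cols:64) set and bits [0:valid_cols) clear."""
    if not 0 <= valid_cols <= 64:
        raise ValueError("valid_cols must be in [0, 64]")
    all_ones = (1 << 64) - 1
    return all_ones & ~((1 << valid_cols) - 1)

for valid_cols in (64, 63, 10, 0):
    print(valid_cols, hex(suffix_invalid_mask(valid_cols)))
```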

Validated repository tests:

  • testcases/simulator/micro/test_simulator_v2_muladddst_mask.py
  • testcases/simulator/micro/test_simulator_v2_vec_ops_extended.py

7. Stable scalar-mask construction trick

The obvious unsigned construction:

  • build a huge uint64 value like 18446744073709550592 (2^64 - 2^10, the valid_cols = 10 mask)

can trip the simulator's scalar cast path because the current runtime first creates a Python/Torch signed integer before converting to uint64.

The stable workaround is:

  • start from signed -1
  • left-shift it valid_cols times
  • then assign the signed result into a uint64 Var

For one 64-lane score half this builds the same suffix-invalid bit pattern:

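    @func()
    def build_suffix_invalid_mask(valid_cols: Var, out_mask: Var):
        # Here `<<=` is used as the DSL's assign-into-Var operator, not a bit shift.
        signed_mask = Var(-1, DT.int64)  # all 64 bits set
        two_i64 = Var(2, DT.int64)
        # Doubling valid_cols times has the same effect as one left shift by valid_cols.
        for _ in range(0, valid_cols):
            signed_mask <<= signed_mask * two_i64
        # Assigning into the uint64 Var keeps the bit pattern.
        out_mask <<= signed_mask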

Why this works:

  • -1 << valid_cols equals the desired suffix-invalid mask in two's-complement
  • the intermediate signed values stay representable in int64
  • the final uint64 assignment preserves the bit pattern
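The three claims above can be checked outside the DSL by emulating int64 arithmetic with Python integers. This is a sketch under an assumption: wrap_int64 models two's-complement wraparound, which is only needed for the valid_cols = 64 corner; whether the real runtime wraps there is not stated in this document. All helper names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def wrap_int64(x):
    """Reduce a Python int to the value a two's-complement int64 would hold."""
    x &= MASK64
    return x - (1 << 64) if x >= (1 << 63) else x

def build_mask_by_doubling(valid_cols):
    """Emulate: start from signed -1, double valid_cols times, reinterpret as uint64."""
    signed_mask = -1
    for _ in range(valid_cols):
        signed_mask = wrap_int64(signed_mask * 2)
    return signed_mask & MASK64

def direct_mask(valid_cols):
    """Bits [valid_cols:64) set, computed directly on unsigned values."""
    return MASK64 & ~((1 << valid_cols) - 1)

# For valid_cols in [0, 63] the intermediate value is -(2**k), which fits in int64.
fits_without_wrap = all(wrap_int64(-(1 << k)) == -(1 << k) for k in range(64))
matches_direct = all(build_mask_by_doubling(k) == direct_mask(k) for k in range(65))
print(fits_without_wrap, matches_direct)
```

The signed route never has to spell out a literal above 2^63 - 1, which is the property the workaround relies on.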

8. Minimal integration recipe

For a normalized online softmax stage-1 score tile:

  1. load k with valid_n
  2. stage the full [HALF_M, TILE_N] score tile
  3. apply score-tail masking only when valid_n < TILE_N
  4. only then run:
    • vmax(...)
    • cmax(...)
    • delayed expdiff
    • exp(...)
    • cadd(...)
  5. stage delayed p
  6. later load v with the recomputed previous-tile valid_n

The score-tail masking point should be:

  • after scale is applied
  • before any rowmax / cmax
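The ordering in the recipe can be exercised end to end on one query row. The sketch below is a plain-Python model under several assumptions: a toy TILE_N of 4, scalar values per key, a hypothetical finite sentinel NEG_LARGE, and no delayed p @ v staging (p is consumed in the same iteration). It only illustrates where the tail mask sits relative to scale, rowmax, and exp.

```python
import math

TILE_N = 4           # toy tile width; the validated kernels use 128
NEG_LARGE = -1.0e30  # hypothetical finite stand-in for -inf

def attention_row(q_dot_k, v, scale):
    """Stream one query row over S2 in TILE_N-wide tiles with score-tail masking."""
    s2 = len(q_dot_k)
    row_max, row_sum, out = NEG_LARGE, 0.0, 0.0
    for start in range(0, s2, TILE_N):
        valid_n = min(TILE_N, s2 - start)
        pad = [0.0] * (TILE_N - valid_n)
        score = q_dot_k[start:start + valid_n] + pad   # padded columns read as zero
        v_tile = v[start:start + valid_n] + pad        # v load also uses valid_n
        score = [s * scale for s in score]             # scale first
        if valid_n < TILE_N:                           # then mask the tail columns
            score = [s if c < valid_n else NEG_LARGE for c, s in enumerate(score)]
        curr_m = max(row_max, max(score))              # rowmax and running max
        expdiff = math.exp(row_max - curr_m)
        p = [math.exp(s - curr_m) for s in score]
        row_sum = row_sum * expdiff + sum(p)
        out = out * expdiff + sum(pi * vi for pi, vi in zip(p, v_tile))
        row_max = curr_m
    return out / row_sum

def reference_row(q_dot_k, v, scale):
    """Plain softmax-weighted average over the valid keys only."""
    s = [x * scale for x in q_dot_k]
    m = max(s)
    w = [math.exp(x - m) for x in s]
    return sum(wi * vi for wi, vi in zip(w, v)) / sum(w)

scores = [-1.0, -2.0, -3.0, -4.0, -5.0, -6.0]   # S2 = 6, so the last tile has valid_n = 2
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(attention_row(scores, values, 0.5), reference_row(scores, values, 0.5))
```

Negative scores are used on purpose: they are the case where unmasked zero padding would win the row maximum.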

9. Minimal validation set

Do not validate only aligned cases.

For TILE_N = 128, keep at least:

  • one aligned baseline: S2 % 128 == 0
  • one small left-half tail: S2 % 128 == 10
  • one first-right-half case: S2 % 128 == 65
  • one mid-right-half case: S2 % 128 == 96
  • one last-column case: S2 % 128 == 127

For flash_attn_full_pj_hif8.py, the validated runnable regression lives in the kernel self-check:

  • agent/example/kernels/a2/flash_attn_full_pj_hif8.py

10. Why S1 tail is a different problem

Do not try to solve S1 tail by reusing the S2 column-tail mental model.

For S1 tail:

  • the invalid region is a suffix of rows, not columns
  • q must use valid_m at the GM boundary
  • final out must also use valid_m at the GM boundary
  • the vec side still sees a fixed physical [HALF_M, TILE_N] score tile

Current validated a2 flash-attention shape:

  • the two vec subblocks read fixed physical row ranges
    • subblock 0 reads rows [0:64)
    • subblock 1 reads rows [64:128)
  • this is not the a5-style CeilDiv(valid_m, 2) compact half split

So the stable local quantity is:

  • local_valid_m = clamp(valid_m - sb_row, 0, HALF_M)

where:

  • valid_m is the tile-level valid query-row count
  • sb_row is the fixed physical subblock row origin (0 or 64)
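A few worked values make the clamp easy to sanity-check. The helper below is a hypothetical plain-Python rendering of the formula, evaluated for both fixed subblock origins:

```python
HALF_M = 64  # rows per vec subblock; TILE_M = 128 in the validated kernels

def local_valid_m(valid_m, sb_row):
    """clamp(valid_m - sb_row, 0, HALF_M) for a fixed physical subblock origin."""
    return min(max(valid_m - sb_row, 0), HALF_M)

for valid_m in (128, 127, 65, 64, 63, 1):
    print(valid_m, local_valid_m(valid_m, 0), local_valid_m(valid_m, 64))
```

For valid_m = 65, subblock 0 keeps all 64 rows and subblock 1 keeps exactly one, which is the first case where the second subblock has any work.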

11. Stable S1 implementation rule

For a normalized online softmax stage-1 score tile with S1 tail:

  1. load q with valid_m
  2. rely on the current gm_to_l1_nd2nz zero-fill behavior for the local tail rows
  3. run the normal score tile, rowmax, curr_m, and expdiff flow on the full local score tile
  4. after score_j - curr_m, but before exp(score_j - curr_m), overwrite the local invalid row suffix with a sufficiently negative finite sentinel
  5. keep the delayed p / pv path full-tile sized
  6. write back only local_valid_m rows to GM

Why this point is stable:

  • masking invalid rows before cmax can make rowmax itself a sentinel on those rows, which leads to unstable invalid-row subtraction behavior analogous to -inf - (-inf)
  • masking them after subtracting curr_m preserves the valid-row online softmax math
  • the invalid local rows then become 0 after exp, so they contribute nothing to delayed p @ v

Current repository tolerance:

  • invalid S1 tail rows may still become NaN after the final out / row_sum on local UB rows
  • this is acceptable because those rows are not written back to GM
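The difference between the two masking points shows up even on a two-row toy tile. The sketch below is an illustration under assumptions (single-tile row maximum instead of a running one, a hypothetical sentinel NEG_LARGE, and a hypothetical helper p_tile); it is not the kernel code path.

```python
import math

NEG_LARGE = -1.0e30  # hypothetical finite sentinel

def p_tile(score, local_valid_m, mask_point):
    """Compute p rows with the invalid-row mask applied at one of two points."""
    rows = [list(r) for r in score]
    if mask_point == "before_max":
        for r in range(local_valid_m, len(rows)):
            rows[r] = [NEG_LARGE] * len(rows[r])
    result = []
    for r, row in enumerate(rows):
        curr_m = max(row)
        shifted = [s - curr_m for s in row]
        if mask_point == "after_sub" and r >= local_valid_m:
            shifted = [NEG_LARGE] * len(row)   # overwrite after the subtraction
        result.append([math.exp(s) for s in shifted])
    return result

# Row 0 is valid; row 1 is a zero-filled tail row.
score = [[1.0, 2.0], [0.0, 0.0]]

after_sub = p_tile(score, 1, "after_sub")
before_max = p_tile(score, 1, "before_max")
print("after_sub :", after_sub)
print("before_max:", before_max)
```

With a finite sentinel, masking before the maximum turns the invalid row into sentinel - sentinel = 0, so exp yields 1 instead of 0; masking after the subtraction yields exact zeros on that row and leaves the valid row untouched.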

12. Minimal S1 validation set

Do not validate only one row-tail case.

For TILE_M = 128, keep at least:

  • one aligned baseline: S1 % 128 == 0
  • one one-row tail: S1 % 128 == 1
  • one last-row-in-first-half case: S1 % 128 == 63
  • one exact half case: S1 % 128 == 64
  • one first-row-in-second-half case: S1 % 128 == 65
  • one last-row case: S1 % 128 == 127
  • one larger shape beyond two tiles, for example S1 == 257
  • one multi-head shape

Keep S2 aligned while validating the new S1 path first, so failures are easier to attribute.

13. Causal diagonal tiles on a2

Read this when extending the same normalized online-softmax pipeline from plain tail handling to top-left ("left-up") causal masking (k_pos <= q_pos).

The stable tile classification is:

  • nt < lmt: the tile is fully valid
  • nt == lmt: the tile is diagonal and contains mixed valid/invalid columns
  • nt > lmt: the tile is fully invalid and should be skipped

For the current validated causal kernel, the stable scheduling rule is:

  • clamp the stage-1/stage-2 loop to active_tiles_n = Min(tiles_n, lmt + 1)
  • still keep the outer n_loops + 1 style drain shape by iterating to active_tiles_n + 1
  • this preserves the delayed p @ v flush while removing future fully-invalid tiles
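The classification and the clamped loop bound can be written down directly. In this sketch lmt is read as the index of the diagonal N tile for the current query tile, which is what the three cases above imply; the function names are hypothetical.

```python
def classify_tile(nt, lmt):
    """Classify N tile index nt against the diagonal tile index lmt."""
    if nt < lmt:
        return "full"
    if nt == lmt:
        return "diagonal"
    return "skip"

def active_tiles_n(tiles_n, lmt):
    """Number of N tiles that still need stage-1/stage-2 work."""
    return min(tiles_n, lmt + 1)

def loop_iterations(tiles_n, lmt):
    """Outer loop range including the extra drain iteration for delayed p @ v."""
    return range(active_tiles_n(tiles_n, lmt) + 1)

tiles_n, lmt = 8, 2
print([classify_tile(nt, lmt) for nt in range(tiles_n)])
print(list(loop_iterations(tiles_n, lmt)))
```

With tiles_n = 8 and lmt = 2, three tiles stay active and the outer loop runs four times, the last one only to flush the delayed stage.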

The diagonal tile is not a plain valid_n tail:

  • invalid columns vary by row
  • the stable local quantity is valid_cols = sb_row + row + 1
  • sb_row is the fixed subblock row origin (0 or 64)

Stable implementation rule for the diagonal tile:

  1. load and scale the full [HALF_M, TILE_N] score tile
  2. prebuild reusable packed-bit masks once per subblock before the main tile loop:
    • causal_mask_left: Tensor(DT.uint8, [HALF_M, HALF_N // 8], Position.UB)
    • causal_mask_right: Tensor(DT.uint8, [HALF_M, HALF_N // 8], Position.UB)
  3. initialize one reusable integer column-index row for [0, 1, ..., 63]; the current validated kernel writes two int32 entries at a time through an int64 reinterpret to keep the SetValueTo(...) count low
  4. use a Python-unrolled row loop (py_range(HALF_M)) only for the per-row threshold, and synthesize packed mask bytes with compare_scalar(...):
    • if sb_row == 0, build only causal_mask_left[row] with threshold row + 1
    • if sb_row == 64, fill causal_mask_left to all ones and build only causal_mask_right[row] with threshold row + 1
  5. apply the packed masks with select(..., SelectMode.TENSOR_SCALAR) before cmax / rowmax
  6. if the same tile is also the final S2 tail tile, apply the diagonal causal mask first and valid_n tail masking second
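The per-row thresholds can be checked against the causal definition in plain Python. The sketch below makes several assumptions: it uses boolean keep flags instead of packed uint8 bits (bit packing order is not specified here), it assumes TILE_M == TILE_N == 128, and it assumes the diagonal tile's query and key origins coincide so that k_pos <= q_pos reduces to col <= sb_row + row. The helper name causal_row_masks is hypothetical.

```python
HALF_M = 64
HALF_N = 64

def causal_row_masks(sb_row):
    """Per-row keep flags (True = valid) for the left and right 64-column halves."""
    left, right = [], []
    for row in range(HALF_M):
        valid_cols = sb_row + row + 1
        left_thr = min(valid_cols, HALF_N)        # threshold inside columns [0:64)
        right_thr = max(valid_cols - HALF_N, 0)   # threshold inside columns [64:128)
        left.append([c < left_thr for c in range(HALF_N)])
        right.append([c < right_thr for c in range(HALF_N)])
    return left, right

def matches_causal(sb_row):
    """Compare the half masks against col <= sb_row + row on the full 128 columns."""
    left, right = causal_row_masks(sb_row)
    for row in range(HALF_M):
        for col in range(2 * HALF_N):
            keep = left[row][col] if col < HALF_N else right[row][col - HALF_N]
            if keep != (col <= sb_row + row):
                return False
    return True

print(matches_causal(0), matches_causal(64))
```

For sb_row == 0 the right half comes out empty for every row, and for sb_row == 64 the left half comes out all valid, so each subblock only needs one row-dependent mask.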

Why this is the stable path:

  • it matches the current hardware / simulator rule that compare_scalar(...) and select(...) use packed-bit uint8 control
  • it keeps the control path cheap by building the causal masks once per subblock instead of reconstructing them inside every diagonal-tile visit
  • it avoids the large simulator overhead of byte-by-byte SetValueTo(...) loops for mask construction
  • it avoids trying to repair causal semantics later in the p or pv path
  • it preserves the exact online-softmax updates for row_max, expdiff, and row_sum

Minimal causal validation set:

  • one S1 == S2 aligned case
  • one S1 == S2 unaligned case
  • one S1 < S2 case
  • one S1 > S2 case
  • one multi-head case

Validated runnable example:

  • agent/example/kernels/a2/flash_attn_full_pj_hif8_causal.py

14. Files to study

  • agent/example/kernels/a2/flash_attn_full_pj_hif8.py
  • agent/example/kernels/a2/flash_attn_full_pj_hif8_causal.py
  • testcases/simulator/micro/test_simulator_v2_muladddst_mask.py
  • testcases/simulator/micro/test_simulator_v2_vec_ops_extended.py
  • agent/references/constraints/tail-safety.md
  • agent/references/constraints/mask.md
  • agent/references/patterns/a2-cube-vec-cube-vec-softmax.md

Authoring note: parts of this article were generated with AI assistance (AIGC) and are for reference only.
