Transformer中的Cross-Attention实战：从机器翻译到图像字幕生成的代码实现-编程实验室

Transformer中的Cross-Attention实战：从机器翻译到图像字幕生成的代码实现

在深度学习领域，Transformer架构已经成为处理序列数据的黄金标准。而其中最具创新性的组件之一——Cross-Attention（跨注意力机制），更是让模型能够实现不同序列之间的信息融合。本文将带您深入探索Cross-Attention的实际应用，通过具体代码示例展示其在机器翻译和图像字幕生成任务中的强大表现。

1. Cross-Attention机制的核心原理

Cross-Attention与传统Self-Attention（自注意力）最大的区别在于其处理的是两个不同的输入序列。让我们先理解其数学表达：

# Cross-Attention的简化数学表达 def cross_attention(Q, K, V): """ Q: 查询矩阵 (来自序列A) K: 键矩阵 (来自序列B) V: 值矩阵 (来自序列B) """ attention_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k) attention_weights = F.softmax(attention_scores, dim=-1) output = torch.matmul(attention_weights, V) return output

关键区别在于：

Self-Attention：Q、K、V都来自同一序列
Cross-Attention：Q来自序列A，K和V来自序列B

这种设计使得模型能够建立两个序列元素之间的直接关联，例如：

机器翻译中源语言和目标语言单词的对应关系
图像字幕生成中图像区域与描述词汇的匹配

2. 机器翻译实战：构建双语对齐模型

让我们用PyTorch实现一个简化的机器翻译模型，重点展示Cross-Attention层的实现。

2.1 模型架构设计

import torch import torch.nn as nn import math class CrossAttentionLayer(nn.Module): def __init__(self, d_model, nhead, dropout=0.1): super().__init__() self.multihead_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout) self.norm = nn.LayerNorm(d_model) self.dropout = nn.Dropout(dropout) def forward(self, query, key_value, key_padding_mask=None): """ query: 目标语言序列 (L_tgt, N, E) key_value: 源语言序列 (L_src, N, E) """ attn_output, _ = self.multihead_attn( query, key_value, key_value, key_padding_mask=key_padding_mask ) output = self.norm(query + self.dropout(attn_output)) return output class TranslationTransformer(nn.Module): def __init__(self, src_vocab_size, tgt_vocab_size, d_model=512, nhead=8, num_layers=6): super().__init__() self.src_embedding = nn.Embedding(src_vocab_size, d_model) self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model) self.cross_attn_layers = nn.ModuleList([ CrossAttentionLayer(d_model, nhead) for _ in range(num_layers) ]) self.fc_out = nn.Linear(d_model, tgt_vocab_size) def forward(self, src, tgt, src_mask=None): src_emb = self.src_embedding(src) tgt_emb = self.tgt_embedding(tgt) for layer in self.cross_attn_layers: tgt_emb = layer(tgt_emb, src_emb, src_mask) return self.fc_out(tgt_emb)

2.2 注意力可视化分析

训练完成后，我们可以提取注意力权重，观察模型如何建立双语对齐：

def visualize_attention(model, src_sentence, tgt_sentence, src_vocab, tgt_vocab): src_tokens = [src_vocab[word] for word in src_sentence.split()] tgt_tokens = [tgt_vocab[word] for word in tgt_sentence.split()] src = torch.LongTensor(src_tokens).unsqueeze(1) # (L_src, 1) tgt = torch.LongTensor(tgt_tokens).unsqueeze(1) # (L_tgt, 1) with torch.no_grad(): src_emb = model.src_embedding(src) tgt_emb = model.tgt_embedding(tgt) # 获取最后一层的注意力权重 _, attn_weights = model.cross_attn_layers[-1].multihead_attn( tgt_emb, src_emb, src_emb ) # 绘制热力图 plt.figure(figsize=(10, 8)) sns.heatmap(attn_weights.squeeze().numpy(), xticklabels=src_sentence.split(), yticklabels=tgt_sentence.split()) plt.xlabel("Source Language") plt.ylabel("Target Language") plt.title("Cross-Attention Alignment")

典型输出会显示目标语言每个词最关注的源语言词汇，这种对齐关系正是机器翻译质量的关键。

3. 图像字幕生成：跨模态信息融合

Cross-Attention同样擅长处理不同模态数据间的关联。下面我们实现一个图像字幕生成模型，其中视觉特征与文本特征通过Cross-Attention交互。

3.1 模型架构

class ImageCaptioningModel(nn.Module): def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6): super().__init__() # 图像特征提取 (使用预训练的CNN) self.cnn = torchvision.models.resnet50(pretrained=True) self.cnn.fc = nn.Linear(self.cnn.fc.in_features, d_model) # 文本处理 self.embedding = nn.Embedding(vocab_size, d_model) self.pos_encoder = PositionalEncoding(d_model) # Cross-Attention层 self.cross_attn_layers = nn.ModuleList([ CrossAttentionLayer(d_model, nhead) for _ in range(num_layers) ]) self.fc_out = nn.Linear(d_model, vocab_size) def forward(self, image, caption): # 提取图像特征 (L_img=1, N, E) img_feat = self.cnn(image).unsqueeze(0) # 文本嵌入 (L_txt, N, E) txt_emb = self.pos_encoder(self.embedding(caption)) # 多层Cross-Attention for layer in self.cross_attn_layers: txt_emb = layer(txt_emb, img_feat) return self.fc_out(txt_emb) class PositionalEncoding(nn.Module): def __init__(self, d_model, max_len=5000): super().__init__() position = torch.arange(max_len).unsqueeze(1) div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)) pe = torch.zeros(max_len, 1, d_model) pe[:, 0, 0::2] = torch.sin(position * div_term) pe[:, 0, 1::2] = torch.cos(position * div_term) self.register_buffer('pe', pe) def forward(self, x): return x + self.pe[:x.size(0)]

3.2 训练技巧与可视化

训练图像字幕模型时，有几个关键技巧：

注意力热力图：可视化模型关注图像的哪些区域来生成特定词汇
课程学习：先训练模型预测短字幕，再逐步增加长度
束搜索：生成时使用束搜索提高结果质量

def generate_caption(model, image, vocab, max_len=20, beam_size=3): model.eval() img_feat = model.cnn(image).unsqueeze(0) # 束搜索初始化 sequences = [[[vocab['<start>']], 0.0]] for _ in range(max_len): all_candidates = [] for seq, score in sequences: # 转换为张量 seq_tensor = torch.LongTensor(seq).unsqueeze(1) # 获取预测 with torch.no_grad(): output = model.fc_out(model.cross_attn_layers( model.pos_encoder(model.embedding(seq_tensor)), img_feat )) log_probs = F.log_softmax(output[-1, :], dim=0) # 保留top k候选 top_k_probs, top_k_ids = log_probs.topk(beam_size) for i in range(beam_size): candidate = [seq + [top_k_ids[i].item()], score + top_k_probs[i].item()] all_candidates.append(candidate) # 选择总概率最高的k个序列 ordered = sorted(all_candidates, key=lambda x: x[1], reverse=True) sequences = ordered[:beam_size] # 选择最佳序列 best_seq = sequences[0][0] return ' '.join([vocab.idx2word[idx] for idx in best_seq if idx not in [vocab['<start>'], vocab['<end>']]])

4. 高级应用与优化策略

4.1 多模态融合的进阶技巧

在实际应用中，我们可以进一步优化Cross-Attention的表现：

多头注意力扩展：

class MultiModalCrossAttention(nn.Module): def __init__(self, d_model, nhead, dropout=0.1): super().__init__() assert d_model % nhead == 0 self.d_k = d_model // nhead self.nhead = nhead self.w_q = nn.Linear(d_model, d_model) self.w_k = nn.Linear(d_model, d_model) self.w_v = nn.Linear(d_model, d_model) self.fc = nn.Linear(d_model, d_model) self.dropout = nn.Dropout(dropout) def forward(self, query, key_value, mask=None): batch_size = query.size(1) # 线性变换并分头 Q = self.w_q(query).view(-1, batch_size, self.nhead, self.d_k).transpose(1, 2) K = self.w_k(key_value).view(-1, batch_size, self.nhead, self.d_k).transpose(1, 2) V = self.w_v(key_value).view(-1, batch_size, self.nhead, self.d_k).transpose(1, 2) # 计算注意力 scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k) if mask is not None: scores = scores.masked_fill(mask == 0, -1e9) attn = F.softmax(scores, dim=-1) attn = self.dropout(attn) # 合并多头 context = torch.matmul(attn, V).transpose(1, 2).contiguous() context = context.view(-1, batch_size, self.nhead * self.d_k) return self.fc(context)

跨模态预训练策略：

对比学习：使用InfoNCE损失函数对齐视觉和语言表示
掩码语言建模：随机掩码部分文本，让模型根据图像预测
图像-文本匹配：二分类任务判断图像和文本是否匹配

4.2 计算效率优化

处理高分辨率图像或长文本时，标准Cross-Attention的计算复杂度可能成为瓶颈。以下是几种优化方案：

优化方法	原理	适用场景	实现复杂度
局部注意力	限制注意力范围	空间/时间局部性强的数据	★★☆
稀疏注意力	预设注意力模式	结构化数据	★★★
线性注意力	近似注意力矩阵	通用场景	★★☆
内存压缩	降低KV序列长度	长序列处理	★☆☆

线性注意力示例：

class LinearCrossAttention(nn.Module): def __init__(self, d_model, feature_dim=256): super().__init__() self.el_proj = nn.Linear(d_model, feature_dim) self.k_proj = nn.Linear(d_model, feature_dim) self.v_proj = nn.Linear(d_model, d_model) def forward(self, query, key_value): Q = F.elu(self.el_proj(query)) # (L_tgt, N, E') K = F.elu(self.el_proj(key_value)) # (L_src, N, E') V = self.v_proj(key_value) # (L_src, N, E) KV = torch.einsum('nse,nsd->ned', K, V) # (N, E', E) Z = 1 / (torch.einsum('nse,ne->ns', Q, K.sum(dim=0)) + 1e-6) # (N, L_tgt) return torch.einsum('nse,ned,ns->nsd', Q, KV, Z)