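Implementing Skip-gram from Scratch in Python: Understanding Word Vector Generation Through Visualization
When first encountering word vectors, many people are put off by the math in the Skip-gram model. The core idea, however, fits in a small amount of code. This article builds a Skip-gram model from scratch in Python and uses visualization to make the otherwise abstract process of learning word vectors tangible.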
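1. Preparation and environment setup
Before writing any code, set up a clean Python environment. Using Anaconda to create an isolated environment avoids dependency conflicts. scikit-learn is included in the install because the later sections use its t-SNE and cosine-similarity helpers:

    conda create -n skipgram python=3.8
    conda activate skipgram
    pip install numpy matplotlib seaborn scikit-learn

Next, prepare a simple text dataset. For demonstration purposes we use a hand-built miniature corpus:

    corpus = [
        "the quick brown fox jumps over the lazy dog",
        "i love natural language processing",
        "deep learning is changing the world",
    ]

2. Text preprocessing and vocabulary construction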
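The first step in Skip-gram is converting text into a numeric form the model can process. This involves the following key steps:
- Tokenization and low-frequency filtering: split each sentence into a list of words and drop words that occur too rarely
- Vocabulary construction: assign a unique index to each distinct word
- Training-sample generation: slide a window over each sentence to create (center word, context word) pairs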
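To make the third step concrete, here is a minimal sketch of window-based pair generation on a single sentence. The helper name `window_pairs` and the choice of `window_size=1` are illustrative assumptions, not part of the article's pipeline, which builds index pairs in section 3.2.

```python
def window_pairs(tokens, window_size=1):
    """Return (center, context) word pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the window at the sentence boundaries
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is never its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "i love natural language processing".split()
pairs = window_pairs(tokens, window_size=1)
for center, context in pairs:
    print(center, "->", context)
```

Words at the edges of a sentence get fewer context words than words in the middle, which is why the number of pairs is not simply the sentence length times twice the window size.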

    from collections import defaultdict
    import numpy as np

    def build_vocab(corpus, min_count=1):
        # Count how often each word occurs across the corpus
        word_counts = defaultdict(int)
        for sentence in corpus:
            for word in sentence.split():
                word_counts[word] += 1

        # Filter first, then enumerate, so that indices stay contiguous
        # (0 .. V-1) even when min_count > 1 removes some words
        kept_words = [word for word, count in word_counts.items() if count >= min_count]
        vocab = {word: idx for idx, word in enumerate(kept_words)}
        idx_to_word = {idx: word for word, idx in vocab.items()}
        return vocab, idx_to_word

    vocab, idx_to_word = build_vocab(corpus)
    vocab_size = len(vocab)
    print(f"Vocabulary size: {vocab_size}")
    print(f"Vocabulary sample: {list(vocab.items())[:5]}")

3. Core implementation of the Skip-gram model
The core of Skip-gram is using a center word to predict the context words around it. We implement each component of this process step by step.
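As a reference point for the code in this section, the prediction step can be written as a softmax over the vocabulary. The symbols c (center-word index), o (context-word index), and V (vocabulary size) are introduced here for illustration only; W and W' are the two parameter matrices described in section 3.1.

```latex
h = W_{c,:}, \qquad
u = (W')^{\top} h, \qquad
P(o \mid c) = \frac{\exp(u_o)}{\sum_{j=1}^{V} \exp(u_j)}, \qquad
\mathcal{L} = -\log P(o \mid c)
```

Training lowers the loss by pulling the center vector h toward the column of W' that belongs to the observed context word and pushing it away from the other columns.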
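3.1 Initializing the word-vector matrices
The word-vector matrices are the model's core parameters. There are two of them:
- Center-word matrix W (vocab_size × embedding_dim)
- Context-word matrix W' (embedding_dim × vocab_size)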
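The two shapes can be checked with a small sketch. It assumes toy sizes (5 words, 2 dimensions) and random values chosen only for illustration, and shows that multiplying a one-hot vector by W is the same as reading one row of W, which is why the training code can index the matrix directly.

```python
import numpy as np

vocab_size, embedding_dim = 5, 2  # toy sizes, chosen for illustration
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, embedding_dim))        # center-word matrix
W_prime = rng.normal(size=(embedding_dim, vocab_size))  # context-word matrix

center_idx = 3
one_hot = np.zeros(vocab_size)
one_hot[center_idx] = 1.0

h_matmul = one_hot @ W        # one-hot input multiplied by W
h_lookup = W[center_idx]      # the same vector, obtained by a row lookup
scores = h_lookup @ W_prime   # one unnormalized score per vocabulary word

print(np.allclose(h_matmul, h_lookup))
print(scores.shape)
```

The row lookup avoids a full matrix multiplication, which matters once the vocabulary has more than a handful of words.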
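
    def initialize_weights(vocab_size, embedding_dim=2):
        np.random.seed(42)  # fixed seed so the plots are reproducible
        # Small random values keep the initial softmax close to uniform
        W = np.random.randn(vocab_size, embedding_dim) * 0.01
        W_prime = np.random.randn(embedding_dim, vocab_size) * 0.01
        return W, W_prime

    embedding_dim = 2  # two dimensions so the vectors can be plotted directly
    W, W_prime = initialize_weights(vocab_size, embedding_dim)

3.2 Generating training samples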
Skip-gram is trained on (center word, context word) pairs. The following function generates them with a sliding window:

    def generate_training_data(corpus, vocab, window_size=2):
        training_data = []
        for sentence in corpus:
            words = sentence.split()
            for center_pos, center_word in enumerate(words):
                if center_word not in vocab:
                    continue
                center_idx = vocab[center_word]

                # Determine the context-window boundaries
                start = max(0, center_pos - window_size)
                end = min(len(words), center_pos + window_size + 1)

                for context_pos in range(start, end):
                    if context_pos == center_pos:
                        continue  # skip the center word itself
                    context_word = words[context_pos]
                    if context_word in vocab:
                        context_idx = vocab[context_word]
                        training_data.append((center_idx, context_idx))
        return np.array(training_data)

    train_data = generate_training_data(corpus, vocab)
    print(f"Number of training samples: {len(train_data)}")

4. Model training and visualization
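We can now train the model. To make the training process easy to follow, we implement:
- A forward pass that computes the loss
- A backward pass that updates the parameters
- Live visualization of how the word vectors change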
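Before trusting a hand-written backward pass, it can help to compare it against finite differences. The sketch below is an independent check under assumed toy sizes and random values, not part of the article's training code. It uses the standard result that the gradient of softmax cross-entropy with respect to the logits is the predicted distribution minus the one-hot target. The helper names `softmax` and `loss_fn` are hypothetical.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - np.max(u))  # subtract the max for numerical stability
    return e / e.sum()

def loss_fn(h, W_prime, context_idx):
    # Cross-entropy loss for one (center, context) pair
    return -np.log(softmax(W_prime.T @ h)[context_idx])

rng = np.random.default_rng(0)
embedding_dim, vocab_size, context_idx = 3, 6, 2  # toy sizes
h = rng.normal(size=embedding_dim)
W_prime = rng.normal(size=(embedding_dim, vocab_size))

# Analytic gradients: d(loss)/d(logits) = y_pred - one_hot
grad_u = softmax(W_prime.T @ h)
grad_u[context_idx] -= 1.0
grad_h = W_prime @ grad_u            # gradient w.r.t. the center vector
grad_W_prime = np.outer(h, grad_u)   # gradient w.r.t. the context matrix

# Numerical gradients by central differences
eps = 1e-6
num_grad_h = np.zeros_like(h)
for i in range(embedding_dim):
    step = np.zeros_like(h)
    step[i] = eps
    num_grad_h[i] = (loss_fn(h + step, W_prime, context_idx)
                     - loss_fn(h - step, W_prime, context_idx)) / (2 * eps)

num_grad_W_prime = np.zeros_like(W_prime)
for i in range(embedding_dim):
    for j in range(vocab_size):
        step = np.zeros_like(W_prime)
        step[i, j] = eps
        num_grad_W_prime[i, j] = (loss_fn(h, W_prime + step, context_idx)
                                  - loss_fn(h, W_prime - step, context_idx)) / (2 * eps)

err_h = np.max(np.abs(grad_h - num_grad_h))
err_W_prime = np.max(np.abs(grad_W_prime - num_grad_W_prime))
print(err_h, err_W_prime)
```

Both errors should be tiny compared with the gradient values themselves. A large discrepancy usually points to a transposed matrix or a sign error in the backward pass.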
4.1 Implementing the training loop

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE  # used in section 6.2

    def train_skipgram(train_data, W, W_prime, learning_rate=0.01, epochs=100):
        losses = []
        for epoch in range(epochs):
            total_loss = 0
            np.random.shuffle(train_data)  # shuffles the sample order in place
            for center_idx, context_idx in train_data:
                # Forward pass
                h = W[center_idx]                # center-word vector
                u = np.dot(W_prime.T, h)         # unnormalized logits, one per word
                exp_u = np.exp(u - np.max(u))    # subtract the max for stability
                y_pred = exp_u / np.sum(exp_u)   # softmax

                # Cross-entropy loss for the true context word
                loss = -np.log(y_pred[context_idx])
                total_loss += loss

                # Backward pass: gradient w.r.t. the logits is y_pred - one_hot
                grad = y_pred.copy()
                grad[context_idx] -= 1

                # Compute both gradients before updating either matrix
                dW_prime = np.outer(h, grad)
                dW = np.dot(W_prime, grad)
                W_prime -= learning_rate * dW_prime
                W[center_idx] -= learning_rate * dW

            losses.append(total_loss)

            # Plot the word vectors every 10 epochs and after the final epoch
            if epoch % 10 == 0 or epoch == epochs - 1:
                visualize_embeddings(W, idx_to_word, epoch)
        return W, W_prime, losses

    def visualize_embeddings(W, idx_to_word, epoch):
        plt.figure(figsize=(10, 8))
        for i in range(len(W)):
            plt.scatter(W[i, 0], W[i, 1], alpha=0.5)
            plt.text(W[i, 0], W[i, 1], idx_to_word[i], fontsize=9)
        plt.title(f"Word vector space (epoch {epoch})")
        plt.xlabel("Dimension 1")
        plt.ylabel("Dimension 2")
        plt.show()

4.2 Starting the training run

    W_trained, W_prime_trained, losses = train_skipgram(train_data, W, W_prime)

    # Plot the loss curve
    plt.plot(losses)
    plt.title("Training loss")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.show()

5. Model optimization techniques
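The basic Skip-gram implementation is intuitive but inefficient in practice, because every update computes a softmax over the entire vocabulary. Three commonly used optimizations follow:
5.1 Negative sampling
Negative sampling speeds up training considerably by updating parameters for only the observed context word and a small number of sampled negative words, rather than for the whole vocabulary: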
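The loss function in this section draws negative words uniformly. The original word2vec work instead draws them from the unigram distribution raised to the 3/4 power, which flattens the gap between frequent and rare words. A minimal sketch of such a sampler follows. The function name `make_negative_sampler`, the toy counts, and the decision to sample without replacement (to mirror the function in this section) are assumptions made for illustration.

```python
import numpy as np

def make_negative_sampler(word_counts, vocab, power=0.75, seed=0):
    """Build a sampler that draws negatives with probability ~ count ** power."""
    freqs = np.zeros(len(vocab))
    for word, idx in vocab.items():
        freqs[idx] = word_counts[word]
    probs = freqs ** power
    probs /= probs.sum()
    rng = np.random.default_rng(seed)

    def sample(k, exclude_idx):
        # Zero out the true context word, then renormalize
        p = probs.copy()
        p[exclude_idx] = 0.0
        p /= p.sum()
        return rng.choice(len(p), size=k, replace=False, p=p)

    return sample, probs


toy_counts = {"the": 3, "fox": 1, "dog": 1, "lazy": 1}  # hypothetical counts
toy_vocab = {word: idx for idx, word in enumerate(toy_counts)}
sample, probs = make_negative_sampler(toy_counts, toy_vocab)
negatives = sample(k=2, exclude_idx=toy_vocab["fox"])
print(probs)
print(negatives)
```

In this toy example "the" accounts for half of all tokens, but its sampling probability after the 3/4 power is below one half, so rarer words are drawn as negatives somewhat more often than their raw frequency would suggest.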

    def negative_sampling_loss(center_idx, context_idx, W, W_prime, k=5):
        h = W[center_idx]

        # Positive pair: -log(sigmoid(u_pos)), computed as log(1 + exp(-u_pos))
        u_pos = np.dot(W_prime[:, context_idx], h)
        loss = np.logaddexp(0, -u_pos)

        # Negative pairs: draw k words uniformly, excluding the true context word
        candidates = [i for i in range(len(W)) if i != context_idx]
        neg_indices = np.random.choice(candidates, size=k, replace=False)
        for neg_idx in neg_indices:
            u_neg = np.dot(W_prime[:, neg_idx], h)
            # -log(sigmoid(-u_neg)), computed as log(1 + exp(u_neg))
            loss += np.logaddexp(0, u_neg)
        return loss

5.2 Subsampling frequent words
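Frequent words are randomly discarded with a probability that grows with their frequency, which balances the influence of word frequency: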
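Written out, the discard probability used by the function in this section is the following, where f(w) is the relative frequency of word w and t is the threshold. The symbol names are chosen here for illustration.

```latex
P_{\text{discard}}(w) = 1 - \sqrt{\frac{t}{f(w)}}
```

As a worked example, with t = 10^{-5} a word that makes up 1% of all tokens has f(w) = 10^{-2}, so the discard probability is 1 - \sqrt{10^{-3}} \approx 0.968. For a word with f(w) < t the expression is negative, which is read as "never discard". On a corpus as small as the one in this article every word has a large relative frequency, so a much larger threshold would be needed for the formula to keep a useful number of tokens.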
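
    def subsample_frequent_words(word_counts, threshold=1e-5):
        total_count = sum(word_counts.values())
        # Relative frequency f(w) of each word
        word_probs = {word: count / total_count for word, count in word_counts.items()}
        # Discard probability 1 - sqrt(t / f(w)), clamped at 0 so that words
        # rarer than the threshold are never discarded
        discard_probs = {
            word: max(0.0, 1 - np.sqrt(threshold / word_probs[word]))
            for word in word_counts
        }
        return discard_probs

5.3 Phrase detection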
Word pairs that frequently co-occur are treated as a single token:
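What "treated as a single token" means in practice can be sketched as a merge step applied to each tokenized sentence once a set of phrase pairs is known. The helper `merge_phrases`, the underscore joiner, and the hand-picked phrase set are hypothetical and used only for illustration. The scoring function in this section is what would normally produce the pairs.

```python
def merge_phrases(tokens, phrases, joiner="_"):
    """Join adjacent tokens that form a known phrase into one token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + joiner + tokens[i + 1])
            i += 2  # both words were consumed by the phrase
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = "i love natural language processing".split()
merged = merge_phrases(tokens, {("natural", "language")})
print(merged)
```

After merging, the vocabulary and the training pairs are built from the merged tokens, so a phrase gets its own vector instead of being represented by its parts.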

    from collections import defaultdict

    def detect_phrases(corpus, threshold=10):
        word_counts = defaultdict(int)
        pair_counts = defaultdict(int)

        # Count single words and adjacent word pairs
        for sentence in corpus:
            words = sentence.split()
            for word in words:
                word_counts[word] += 1
            for i in range(len(words) - 1):
                pair = (words[i], words[i + 1])
                pair_counts[pair] += 1

        # Score each pair. threshold acts as a discount on the pair count,
        # so only pairs seen more than threshold times get a positive score
        phrase_scores = {}
        for pair, count in pair_counts.items():
            word1, word2 = pair
            score = (count - threshold) / (word_counts[word1] * word_counts[word2])
            if score > 0:
                phrase_scores[pair] = score
        return phrase_scores

6. Practical applications and extensions
Trained word vectors can be used in many NLP tasks. Some typical applications follow:
6.1 Word similarity

    from sklearn.metrics.pairwise import cosine_similarity

    def most_similar(word, W, vocab, idx_to_word, topn=5):
        if word not in vocab:
            return []
        word_idx = vocab[word]
        word_vec = W[word_idx].reshape(1, -1)
        similarities = cosine_similarity(word_vec, W)[0]
        # Rank by similarity and drop the query word itself by index
        ranked = [idx for idx in np.argsort(-similarities) if idx != word_idx]
        return [(idx_to_word[idx], similarities[idx]) for idx in ranked[:topn]]

    similar_words = most_similar("language", W_trained, vocab, idx_to_word)
    print(f"Words most similar to 'language': {similar_words}")

6.2 Enhanced word-vector visualization
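Use t-SNE to project high-dimensional word vectors down to two dimensions for plotting. With the 2-dimensional embeddings trained above the projection is not needed; it becomes useful once embedding_dim is larger than 2:

    from sklearn.manifold import TSNE

    def visualize_with_tsne(W, idx_to_word):
        # perplexity must be smaller than the number of points, and the
        # default of 30 is too large for this tiny vocabulary
        perplexity = min(30, len(W) - 1)
        tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
        W_tsne = tsne.fit_transform(W)

        plt.figure(figsize=(12, 10))
        for i in range(len(W_tsne)):
            plt.scatter(W_tsne[i, 0], W_tsne[i, 1], alpha=0.5)
            plt.text(W_tsne[i, 0], W_tsne[i, 1], idx_to_word[i], fontsize=9)
        plt.title("Word vector space after t-SNE")
        plt.xlabel("t-SNE dimension 1")
        plt.ylabel("t-SNE dimension 2")
        plt.show()

    visualize_with_tsne(W_trained, idx_to_word)

6.3 Word analogy

    def word_analogy(word_a, word_b, word_c, W, vocab, idx_to_word, topn=5):
        # Solve a : b :: c : ? by searching near vec_b - vec_a + vec_c
        query = (word_a, word_b, word_c)
        if any(word not in vocab for word in query):
            return []  # every query word must be in the vocabulary
        vec_a, vec_b, vec_c = (W[vocab[word]] for word in query)
        target_vec = (vec_b - vec_a + vec_c).reshape(1, -1)
        similarities = cosine_similarity(target_vec, W)[0]
        # Skip the three query words, which otherwise tend to rank first
        query_indices = {vocab[word] for word in query}
        ranked = [idx for idx in np.argsort(-similarities) if idx not in query_indices]
        return [(idx_to_word[idx], similarities[idx]) for idx in ranked[:topn]]

    # "king", "man" and "queen" do not occur in the toy corpus above, so this
    # call returns an empty list unless the model is trained on a corpus that
    # contains them
    analogy_result = word_analogy("king", "man", "queen", W_trained, vocab, idx_to_word)
    print(f"king:man :: queen:? result: {analogy_result}")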