The Word2Vec Skip-Gram Model

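What is Word2Vec?

Word2Vec is a neural network model that maps each word to a fixed-length dense vector (a word embedding).

It learns from large unlabeled corpora, capturing semantic and syntactic information so that semantically similar words end up close together in the vector space (for example, "king" and "queen").

The core idea: a word's meaning is reflected by the contexts in which it appears.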

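Word2Vec has two main architectures:

CBOW (Continuous Bag-of-Words): predicts the center word from its context words.
Skip-gram: predicts the context words from the center word (the focus of this article).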
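
To make "predict the context words from the center word" concrete, here is a minimal sketch of how (center, context) training pairs could be generated. The function name `skipgram_pairs`, the `window` parameter, and the three-word token list are illustrative assumptions, not part of the original text.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center, context) pairs for every position in `tokens`.

    `window` is the number of words considered on each side of the center.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical toy sentence built from the article's vocabulary.
tokens = ["the", "cat", "dog"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
# [('the', 'cat'), ('cat', 'the'), ('cat', 'dog'), ('dog', 'cat')]
```

Each pair is one training example: the first element is fed in as the center word and the second is the target the model should assign high probability to.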

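Skip-gram model structure

Skip-gram is a three-layer neural network:

Input layer: the center word (one-hot encoded)
Hidden layer: the word embedding (a low-dimensional dense vector)
Output layer: a softmax classifier over the vocabulary that predicts the context word

Training objective: maximize the probability of the actual context words given the center word.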
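
The objective can be written as a formula. This is a sketch in notation introduced here rather than taken from the text: $w_I$ is the center (input) word, $w_O$ is one observed context word, $u_j$ is the output-layer score for vocabulary word $j$, and $V$ is the vocabulary size. For a single (center, context) pair, maximizing the softmax probability is equivalent to minimizing its negative logarithm:

$$
p(w_O \mid w_I) = \frac{\exp(u_{w_O})}{\sum_{j=1}^{V} \exp(u_j)},
\qquad
\mathcal{L} = -\log p(w_O \mid w_I)
$$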

A concrete example

Setup

Vocabulary size $V = 5$ (vocabulary: [the, cat, dog, mouse, bird])
Embedding dimension $N = 3$
Center word = "cat" (index 1)
Context word = "dog" (index 2)

Input layer → hidden layer

Input vector: the one-hot representation of "cat" (length 5, with a 1 at index 1 and 0 everywhere else):
$$
x = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}
$$

The weight matrix $W$ is a $V \times N$ matrix, randomly initialized (for example):
$$
W = \begin{bmatrix} 0.2 & 0.1 & 0.5 \\ 0.3 & 0.6 & 0.8 \\ 0.9 & 0.4 & 0.2 \\ 0.7 & 0.1 & 0.3 \\ 0.5 & 0.2 & 0.4 \end{bmatrix}
$$

Hidden layer $h$: computed as $h = W^T \cdot x$. Because $x$ is one-hot, the result is simply row 1 of $W$ (indices start at 0):
$$
W^T = \begin{bmatrix}
0.2 & 0.3 & 0.9 & 0.7 & 0.5 \\
0.1 & 0.6 & 0.4 & 0.1 & 0.2 \\
0.5 & 0.8 & 0.2 & 0.3 & 0.4
\end{bmatrix}
$$
$$
h = W^T \cdot x = \begin{bmatrix}
0.2 \times 0 + 0.3 \times 1 + 0.9 \times 0 + 0.7 \times 0 + 0.5 \times 0 \\
0.1 \times 0 + 0.6 \times 1 + 0.4 \times 0 + 0.1 \times 0 + 0.2 \times 0 \\
0.5 \times 0 + 0.8 \times 1 + 0.2 \times 0 + 0.3 \times 0 + 0.4 \times 0
\end{bmatrix}
$$

Written as a row vector for the next step, this gives
$$
h = \begin{bmatrix}
0.3 & 0.6 & 0.8
\end{bmatrix}
$$
This row is the initial word vector for "cat".
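
The lookup can be checked with a few lines of plain Python. This is a minimal sketch: the helper names `one_hot` and `hidden_layer` are hypothetical, and the matrix values are the example values from this article.

```python
def one_hot(index, size):
    """Return a length-`size` list with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def hidden_layer(W, x):
    """Compute h = W^T x for a V x N matrix W and a length-V vector x."""
    n_dims = len(W[0])
    return [sum(W[i][k] * x[i] for i in range(len(W))) for k in range(n_dims)]


vocab = ["the", "cat", "dog", "mouse", "bird"]
W = [
    [0.2, 0.1, 0.5],
    [0.3, 0.6, 0.8],
    [0.9, 0.4, 0.2],
    [0.7, 0.1, 0.3],
    [0.5, 0.2, 0.4],
]

x = one_hot(vocab.index("cat"), len(vocab))
h = hidden_layer(W, x)
print(h)          # [0.3, 0.6, 0.8]
print(h == W[1])  # True: the product just selects row 1 of W
```

Because the product only selects one row, practical implementations typically skip the multiplication and index into the embedding matrix directly.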

Hidden layer → output layer

The output weight matrix $W'$ is an $N \times V$ matrix (also randomly initialized):
$$
W' = \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 & 0.5 \\ 0.6 & 0.7 & 0.8 & 0.9 & 1.0 \\ 0.2 & 0.3 & 0.4 & 0.5 & 0.6 \end{bmatrix}
$$

Output scores $u = h \cdot W'$ (shape $1 \times V$):
$$
\begin{aligned}
u &= \begin{bmatrix} 0.3 & 0.6 & 0.8 \end{bmatrix} \cdot \begin{bmatrix} 0.1 & 0.2 & 0.3 & 0.4 & 0.5 \\ 0.6 & 0.7 & 0.8 & 0.9 & 1.0 \\ 0.2 & 0.3 & 0.4 & 0.5 & 0.6 \end{bmatrix} \\
&= \begin{bmatrix} 0.3 \times 0.1 + 0.6 \times 0.6 + 0.8 \times 0.2 \\ 0.3 \times 0.2 + 0.6 \times 0.7 + 0.8 \times 0.3 \\ 0.3 \times 0.3 + 0.6 \times 0.8 + 0.8 \times 0.4 \\ 0.3 \times 0.4 + 0.6 \times 0.9 + 0.8 \times 0.5 \\ 0.3 \times 0.5 + 0.6 \times 1.0 + 0.8 \times 0.6 \end{bmatrix}^T \\
&= \begin{bmatrix} 0.55 & 0.72 & 0.89 & 1.06 & 1.23 \end{bmatrix}
\end{aligned}
$$
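
The same scores can be reproduced in plain Python. In this minimal sketch the names `scores` and `W_out` (standing in for $W'$) are hypothetical, and the values are rounded to two decimals only to hide floating-point noise.

```python
def scores(h, W_out):
    """Compute u = h . W' for a length-N vector h and an N x V matrix W'."""
    vocab_size = len(W_out[0])
    return [
        sum(h[k] * W_out[k][j] for k in range(len(h)))
        for j in range(vocab_size)
    ]


h = [0.3, 0.6, 0.8]  # hidden vector for "cat"
W_out = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [0.6, 0.7, 0.8, 0.9, 1.0],
    [0.2, 0.3, 0.4, 0.5, 0.6],
]

u = scores(h, W_out)
print([round(s, 2) for s in u])  # [0.55, 0.72, 0.89, 1.06, 1.23]
```

Each score is the dot product of the center word's hidden vector with one column of $W'$, so there is one score per vocabulary word.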

Softmax normalization turns the scores into a probability distribution. First compute the exponentials:
$$
e^{0.55} = 1.733, \quad e^{0.72} = 2.054, \quad e^{0.89} = 2.435, \quad e^{1.06} = 2.886, \quad e^{1.23} = 3.421
$$

Sum = 1.733 + 2.054 + 2.435 + 2.886 + 3.421 = 12.529

Probabilities:
$$
\begin{aligned}
p(\text{the}) &= 1.733 / 12.529 = 0.138 \\
p(\text{cat}) &= 2.054 / 12.529 = 0.164 \\
p(\text{dog}) &= 2.435 / 12.529 = 0.194 \\
p(\text{mouse}) &= 2.886 / 12.529 = 0.230 \\
p(\text{bird}) &= 3.421 / 12.529 = 0.273
\end{aligned}
$$
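
The softmax step can be sketched with Python's standard `math` module. Two things here are additions rather than part of the walkthrough above: subtracting the maximum score before exponentiating is a common numerical-stability step that leaves the result unchanged, and the final `loss` line assumes the usual negative log-likelihood of the true context word "dog".

```python
import math


def softmax(u):
    """Convert a list of scores into probabilities that sum to 1."""
    m = max(u)  # shift by the max for numerical stability (result unchanged)
    exps = [math.exp(s - m) for s in u]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["the", "cat", "dog", "mouse", "bird"]
u = [0.55, 0.72, 0.89, 1.06, 1.23]

p = softmax(u)
for word, prob in zip(vocab, p):
    print(f"{word}: {prob:.3f}")
# the: 0.138
# cat: 0.164
# dog: 0.194
# mouse: 0.230
# bird: 0.273

# Assumed loss for this training pair: -log p(dog | cat).
loss = -math.log(p[vocab.index("dog")])
```

With these random initial weights the true context word "dog" gets only about 19% of the probability mass; training would adjust $W$ and $W'$ to raise it.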