The attention computation is essentially a lookup in a hash table: given a key, find the corresponding value. The difference is that there is no hard match here; instead, the result is a weighted sum of the values, where the weights come from the relevance between the query and every key. The formula is:
$$
\text {attention}(Q, K, V) = \text {softmax} \left ( \frac{Q\cdot K^T}{\sqrt{d_{k}}} \right ) \cdot V
$$
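To make the formula concrete, here is a minimal numpy sketch of it, with random matrices purely for illustration (scipy.special.softmax is used for the row-wise softmax):

import numpy as np
from scipy.special import softmax

d_k = 16
Q = np.random.normal(size=(5, d_k))  # 5 queries
K = np.random.normal(size=(5, d_k))  # 5 keys
V = np.random.normal(size=(5, d_k))  # 5 values

scores = Q @ K.T / np.sqrt(d_k)      # (5, 5) scaled relevance scores
weights = softmax(scores, axis=-1)   # each row sums to 1
output = weights @ V                 # (5, d_k): a weighted sum of the values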
Let's walk through the computation step by step. Take the sentence "the cat sat on mat", which has 5 words. The relevance of the word cat to the other words can be written as $w_{cat,the}, w_{cat,sat}, w_{cat,on}, w_{cat,mat}$, where each weight lies between 0 and 1 and, together with the word's relevance to itself, they sum to 1. If every word has its own value, we can compute
$$\text{Info}_{cat} = w_{cat,the}v_{the} + w_{cat,cat}v_{cat} + w_{cat,sat}v_{sat} + w_{cat,on}v_{on} + w_{cat,mat}v_{mat}$$
The value can be thought of as the information a word carries when it sits in this particular sentence; this information is learned. Note that a word also has a relevance to itself, $w_{cat,cat}$.
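To make this concrete, here is a toy numpy sketch; the weights and the 2-dimensional values are made-up numbers purely for illustration:

import numpy as np

# hypothetical relevance weights of "cat" to every word (they sum to 1)
w_cat = np.array([0.1, 0.5, 0.2, 0.1, 0.1])   # the, cat, sat, on, mat
# hypothetical 2-dimensional value for every word, one row per word
values = np.array([[0.0, 1.0],
                   [2.0, 0.5],
                   [1.0, 1.0],
                   [0.5, 0.0],
                   [0.3, 0.2]])
info_cat = w_cat @ values   # the weighted sum, shape (2,)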
In practice every word is embedded into a vector. If the embedding dimension is $d_{model}=16$, "the cat sat on mat" is mapped to a matrix $\text{inputs} \in R^{5\times 16}$, and three linear transformations of this matrix give $Q, K, V$: $Q = \text{inputs} \cdot W_Q^{T}$, $K = \text{inputs} \cdot W_K^{T}$, $V = \text{inputs} \cdot W_V^{T}$,
where $W_{Q}\in R^{d_{model} \times d_{model}}, W_{K}\in R^{d_{model} \times d_{model}}, W_{V}\in R^{d_{model} \times d_{model}}$. In the PyTorch implementation, $W_Q, W_K, W_V$ are packed into a single parameter called in_proj_weight of shape $R^{3 d_{model} \times d_{model}}$, since one big matrix multiplication is more efficient; splitting the result afterwards gives the $Q, K, V$ we need, i.e. query, key and value. Each row of $Q$, $K$ and $V$ is a linear transformation of the corresponding row of inputs, so each row still represents one word. Take the first row of $Q$, $q \in R^{1 \times 16}$, and compute its inner product with every row of $K \in R^{5 \times 16}$; the result is a vector $a \in R^{1 \times 5}$ that represents the relevance of the first word to every word, and softmax normalizes these relevances to values between 0 and 1. Repeating this for every word gives 5 such vectors; the whole procedure is simply $Q \cdot K^{T} = A\in R^{5 \times 5}$.
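A minimal sketch of the packed projection (random weights; the variable names follow the explanation above rather than PyTorch's code):

import numpy as np

d_model = 16
inputs = np.random.normal(size=(5, d_model))              # "the cat sat on mat"
in_proj_weight = np.random.normal(size=(3 * d_model, d_model))

qkv = inputs @ in_proj_weight.T                           # one matmul, shape (5, 48)
Q, K, V = np.split(qkv, 3, axis=-1)                       # each (5, 16)

# equivalent to three separate projections
W_Q, W_K, W_V = np.split(in_proj_weight, 3, axis=0)
assert np.allclose(Q, inputs @ W_Q.T)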
If the embedding dimension $d_{model}$ is very large, the inner product of two vectors can produce very large or very small values, which push softmax into its saturated region and hurt gradient backpropagation, so a scaling factor $\sqrt{d_k}$ is added. Let's try to understand this factor. Take two normally distributed random matrices $A\in R^{10000 \times 64}, B\in R^{10000 \times 64}$; "normally distributed" here means the individual elements of $A$ follow a normal distribution, not that $A$ is a 64-dimensional multivariate normal, so flattening the matrix into a 640000-dimensional vector still gives a normal distribution. For independent random variables, $\text{var}(A + B) = \text{var}(A) + \text{var}(B)$. For example:
a = np.random.normal(size=10000)
b = np.random.normal(size=10000)
np.mean(a), np.var(a)
>>> 0.012120657907471737, 1.0050670288852406
np.mean(b), np.var(b)
>>> -0.0022273866402510775, 1.0149182079551393
np.mean(a + b), np.var(a + b)
>>> 0.009893271267220663, 1.9960134883163734
The example above shows the variance of the sum of two random variables; for matrix multiplication, look at the following:
A = np.random.normal(size=(10000, 64))
B = np.random.normal(size=(10000, 64))
np.mean(np.matmul(A, B.T)), np.var(np.matmul(A, B.T))
>>> 6.668697207603404e-05, 64.29069222735822
Here the variance is about 64, because each element of $A\cdot B^T$ is the inner product of a 64-dimensional row of $A$ with a 64-dimensional row of $B$:
$a_{1,1}b_{1,1} + a_{1,2}b_{1,2} + \dots + a_{1,64}b_{1,64}$
Each term is a product of two independent variables, and for independent zero-mean variables $\text{var}(a\cdot b)=\text{var}(a) \cdot \text{var}(b)$. Treating $ab$ as a new random variable and summing 64 of them, the variance becomes 64. So we apply a scaling factor $\sqrt{64}$, which brings the variance back to 1:
np.var(np.matmul(A, B.T) / np.sqrt(64))
>>> 1.0045420660524722
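As a quick check of the $\text{var}(a\cdot b)$ step itself, continuing the session above with the same a and b (the exact value varies from run to run, but it should be close to 1):

np.var(a * b)   # close to 1.0, since var(a) * var(b) is about 1 for independent zero-mean variables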
The point of this example is that as the embedding dimension grows, the variance of the dot products grows with it, i.e. extreme values become more common. The scaling factor prevents especially large (or small) values, which would otherwise land in the flat regions of softmax where gradients vanish and cannot be backpropagated. That is roughly the idea.
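Here is a small sketch of that saturation effect, again using scipy.special.softmax; the scores are random, so the exact numbers will vary, but the unscaled weights are typically close to one-hot while the scaled ones are much smoother:

import numpy as np
from scipy.special import softmax

d_k = 512
q = np.random.normal(size=(1, d_k))
K = np.random.normal(size=(5, d_k))

scores = q @ K.T                                  # variance ~ d_k, so values can be very large
print(softmax(scores, axis=-1))                   # usually close to one-hot: softmax is saturated
print(softmax(scores / np.sqrt(d_k), axis=-1))    # a much smoother distribution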
Since there is an in_proj_weight there is naturally an output projection as well (out_proj in PyTorch): after multi-head attention is computed, the output goes through one more linear transformation to get the final output. Let's first implement self attention with a single head in numpy:
import numpy as np
from scipy.special import softmax

def attention(inputs, in_proj_weights, out_proj_weights):
    # inputs shape: (seq_len, batch_size, embed_dim)
    # in_proj_weights shape: (embed_dim * 3, embed_dim)
    embed_dim = inputs.shape[-1]
    # The Q, K and V matrices are derived from the inputs through linear transformations.
    qkv = np.matmul(inputs, in_proj_weights.T)
    Q = qkv[:, :, :embed_dim]
    K = qkv[:, :, embed_dim:embed_dim * 2]
    V = qkv[:, :, embed_dim * 2:]
    # To simplify the calculation, reshape Q, K and V to (batch_size, seq_len, embed_dim).
    Q = np.swapaxes(Q, 0, 1)
    K = np.swapaxes(K, 0, 1)
    V = np.swapaxes(V, 0, 1)
    # The attention weights are computed from the dot product of Q and K;
    # K is transposed to (batch_size, embed_dim, seq_len) for the matrix product.
    atten_scores = np.matmul(Q, np.swapaxes(K, -2, -1))
    scaled_scores = atten_scores / np.sqrt(embed_dim)  # scaled dot product
    atten_weights = softmax(scaled_scores, axis=-1)
    # Each query's output is the weighted sum of the rows of V, using atten_weights.
    atten_output = np.matmul(atten_weights, V)
    output = np.matmul(atten_output, out_proj_weights.T)
    return np.swapaxes(output, 0, 1)
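A hedged sanity check against PyTorch (it assumes torch is available and uses nn.MultiheadAttention with a single head and bias disabled, so its weight layout matches the two plain weight matrices used above):

import torch
import torch.nn as nn

embed_dim, seq_len, batch_size = 16, 5, 1
mha = nn.MultiheadAttention(embed_dim, num_heads=1, bias=False)
x = torch.randn(seq_len, batch_size, embed_dim)

ref, _ = mha(x, x, x)
ours = attention(x.numpy(),
                 mha.in_proj_weight.detach().numpy(),
                 mha.out_proj.weight.detach().numpy())
print(np.allclose(ref.detach().numpy(), ours, atol=1e-5))   # should print True (up to float tolerance)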
It may be easier to understand multi-head attention with a drawing. First, a single head:

One sample $\text{inputs} \in R^{6 \times 16}$ is linearly transformed by $W_q, W_k, W_v \in R^{16 \times 16}$ into $Q, K, V \in R^{6 \times 16}$. From the way matrix multiplication works, every row of $Q$, $K$ and $V$ is a transformation of the corresponding word in the original input.
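A tiny numpy check of that row-wise property (random weights, purely illustrative):

import numpy as np

W_q = np.random.normal(size=(16, 16))
inputs = np.random.normal(size=(6, 16))
Q = inputs @ W_q.T
print(np.allclose(Q[0], inputs[0] @ W_q.T))   # True: row 0 of Q depends only on word 0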

The multi-head computation:

The multi-head case is slightly different: after obtaining $Q, K, V$, each of them is split into several sub-matrices along the column axis. From the matrix multiplication diagram above you can see that the first column of $Q$ is the inner product of every input row with the first column of $W_q^{T}$. Put plainly, the earlier linear transformation was
linear_q = nn.Linear(16, 16)
Q = linear_q(inputs)   # inputs: (6, 16)
Q.shape
>>> torch.Size([6, 16])
and now it becomes
linear_q_h1 = nn.Linear(16, 4)
linear_q_h2 = nn.Linear(16, 4)
linear_q_h3 = nn.Linear(16, 4)
linear_q_h4 = nn.Linear(16, 4)
Q1 = linear_q_h1(inputs)
Q2 = linear_q_h2(inputs)
Q3 = linear_q_h3(inputs)
Q4 = linear_q_h4(inputs)
Q = torch.cat([Q1, Q2, Q3, Q4], dim=-1)   # back to shape (6, 16)

The previous $W_q \in R^{16 \times 16}$ becomes four matrices $W_q\in R^{16 \times 4}$, giving four $Q \in R^{6 \times 4}$; generating four $K \in R^{6 \times 4}$ and four $V \in R^{6 \times 4}$ in the same way yields four attention outputs $A \in R^{6 \times 4}$, and concatenating the four $A$ gives $A \in R^{6 \times 16}$ again. The claim (made in the original paper) is that learning four $W_q \in R^{16 \times 4}$ works better than learning a single $W_q \in R^{16 \times 16}$ directly.
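In other words, the four small projections stacked together are exactly one big projection whose weight rows are the four head weights. A small numpy sketch of this equivalence, using the nn.Linear weight convention of shape (out_features, in_features) and random weights:

import numpy as np

head_weights = [np.random.normal(size=(4, 16)) for _ in range(4)]   # four Linear(16, 4) weights
W_q_full = np.concatenate(head_weights, axis=0)                     # one (16, 16) weight
inputs = np.random.normal(size=(6, 16))

Q_full = inputs @ W_q_full.T                                        # (6, 16)
Q_heads = np.concatenate([inputs @ W.T for W in head_weights], axis=-1)
print(np.allclose(Q_full, Q_heads))   # True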
The code is below; it follows the PyTorch implementation fairly closely, with quite a few transpose operations along the way:
def multihead_atten(inputs, num_heads, in_proj_weights, out_proj_weights):
    # inputs shape: (seq_len, batch_size, embed_dim)
    (seq_len, batch_size, embed_dim) = inputs.shape
    assert embed_dim % num_heads == 0, 'embed_dim must be divisible by num_heads'
    head_dim = embed_dim // num_heads
    qkv = np.matmul(inputs, in_proj_weights.T)
    Q = qkv[:, :, :embed_dim]
    K = qkv[:, :, embed_dim:embed_dim * 2]
    V = qkv[:, :, embed_dim * 2:]
    # Reshape Q, K, V to (seq_len, batch_size * num_heads, head_dim).
    # This partitions each Q matrix into num_heads submatrices [Q1, Q2, Q3, Q4, ...]
    # along the column axis, where each Qi has shape (seq_len, head_dim).
    # For example, with batch_size = 1 and 2 heads, a 2x8 matrix is split into two 2x4 matrices:
    # [1, 2, 3, 4, 5, 6, 7, 8]  ->  [1, 2, 3, 4]  [5, 6, 7, 8]
    # [9, 8, 7, 6, 5, 4, 3, 2]      [9, 8, 7, 6]  [5, 4, 3, 2]
    # and the reshaped (2, 2, 4) array holds the rows in this order:
    # [1, 2, 3, 4]
    # [5, 6, 7, 8]
    # [9, 8, 7, 6]
    # [5, 4, 3, 2]
    Q = Q.reshape(seq_len, batch_size * num_heads, head_dim)
    K = K.reshape(seq_len, batch_size * num_heads, head_dim)
    V = V.reshape(seq_len, batch_size * num_heads, head_dim)
    # Now the shape of Q, K and V is (seq_len, batch_size * num_heads, head_dim).
    # Permute the dimensions to (batch_size * num_heads, seq_len, head_dim).
    Q = np.swapaxes(Q, 0, 1)
    K = np.swapaxes(K, 0, 1)
    V = np.swapaxes(V, 0, 1)
    # Scaled dot-product attention, computed for all heads at once.
    atten_scores = np.matmul(Q, np.swapaxes(K, -2, -1))
    scaled_scores = atten_scores / np.sqrt(K.shape[-1])
    atten_weights = softmax(scaled_scores, axis=-1)
    atten_output = np.matmul(atten_weights, V)
    # Permute back and merge the heads before the output linear transformation.
    atten_output = np.swapaxes(atten_output, 0, 1)
    atten_output = atten_output.reshape(seq_len, batch_size, -1)
    output = np.matmul(atten_output, out_proj_weights.T)
    return output
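As with the single-head version, here is a hedged check against PyTorch's nn.MultiheadAttention (bias disabled so the weight layout matches the two plain weight matrices used here):

import numpy as np
import torch
import torch.nn as nn

embed_dim, num_heads, seq_len, batch_size = 16, 4, 6, 2
mha = nn.MultiheadAttention(embed_dim, num_heads, bias=False)
x = torch.randn(seq_len, batch_size, embed_dim)

ref, _ = mha(x, x, x)
ours = multihead_atten(x.numpy(), num_heads,
                       mha.in_proj_weight.detach().numpy(),
                       mha.out_proj.weight.detach().numpy())
print(np.allclose(ref.detach().numpy(), ours, atol=1e-5))   # should print True (up to float tolerance)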