The attention computation is essentially a lookup in a hash table: given a key, find the corresponding value. The difference is that there is no hard match here; instead, the result is a weighted sum of the values, where the weights come from the relevance between the query and every key. The formula is:
$$
\text {attention}(Q, K, V) = \text {softmax} \left ( \frac{Q\cdot K^T}{\sqrt{d_{k}}} \right ) \cdot V
$$
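To make the formula concrete, here is a minimal numpy sketch of it, with random matrices purely for illustration (scipy.special.softmax is used for the row-wise softmax):

import numpy as np
from scipy.special import softmax

d_k = 16
Q = np.random.normal(size=(5, d_k))  # 5 queries
K = np.random.normal(size=(5, d_k))  # 5 keys
V = np.random.normal(size=(5, d_k))  # 5 values

scores = Q @ K.T / np.sqrt(d_k)      # (5, 5) scaled relevance scores
weights = softmax(scores, axis=-1)   # each row sums to 1
output = weights @ V                 # (5, d_k): a weighted sum of the values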
Let's walk through the computation step by step. Take the sentence "the cat sat on mat", which has 5 words. The relevance of the word cat to the other words can be written as $w_{cat,the}, w_{cat,sat}, w_{cat,on}, w_{cat,mat}$, where each weight lies between 0 and 1 and, together with the word's relevance to itself, they sum to 1. If every word has its own value, we can compute
$$\text{Info}_{cat} = w_{cat,the}v_{the} + w_{cat,cat}v_{cat} + w_{cat,sat}v_{sat} + w_{cat,on}v_{on} + w_{cat,mat}v_{mat}$$
The value can be thought of as the information a word carries when it sits in this particular sentence; this information is learned. Note that a word also has a relevance to itself, $w_{cat,cat}$.
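To make this concrete, here is a toy numpy sketch; the weights and the 2-dimensional values are made-up numbers purely for illustration:

import numpy as np

# hypothetical relevance weights of "cat" to every word (they sum to 1)
w_cat = np.array([0.1, 0.5, 0.2, 0.1, 0.1])   # the, cat, sat, on, mat
# hypothetical 2-dimensional value for every word, one row per word
values = np.array([[0.0, 1.0],
                   [2.0, 0.5],
                   [1.0, 1.0],
                   [0.5, 0.0],
                   [0.3, 0.2]])
info_cat = w_cat @ values   # the weighted sum, shape (2,)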
In practice every word is embedded into a vector. If the embedding dimension is $d_{model}=16$, "the cat sat on mat" is mapped to a matrix $\text{inputs} \in R^{5\times 16}$, and three linear transformations of this matrix give $Q, K, V$: $Q = \text{inputs} \cdot W_Q^{T}$, $K = \text{inputs} \cdot W_K^{T}$, $V = \text{inputs} \cdot W_V^{T}$,
where $W_{Q}\in R^{d_{model} \times d_{model}}, W_{K}\in R^{d_{model} \times d_{model}}, W_{V}\in R^{d_{model} \times d_{model}}$. In the PyTorch implementation, $W_Q, W_K, W_V$ are packed into a single parameter called in_proj_weight of shape $R^{3 d_{model} \times d_{model}}$, since one big matrix multiplication is more efficient; splitting the result afterwards gives the $Q, K, V$ we need, i.e. query, key and value. Each row of $Q$, $K$ and $V$ is a linear transformation of the corresponding row of inputs, so each row still represents one word. Take the first row of $Q$, $q \in R^{1 \times 16}$, and compute its inner product with every row of $K \in R^{5 \times 16}$; the result is a vector $a \in R^{1 \times 5}$ that represents the relevance of the first word to every word, and softmax normalizes these relevances to values between 0 and 1. Repeating this for every word gives 5 such vectors; the whole procedure is simply $Q \cdot K^{T} = A\in R^{5 \times 5}$.
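A minimal sketch of the packed projection (random weights; the variable names follow the explanation above rather than PyTorch's code):

import numpy as np

d_model = 16
inputs = np.random.normal(size=(5, d_model))              # "the cat sat on mat"
in_proj_weight = np.random.normal(size=(3 * d_model, d_model))

qkv = inputs @ in_proj_weight.T                           # one matmul, shape (5, 48)
Q, K, V = np.split(qkv, 3, axis=-1)                       # each (5, 16)

# equivalent to three separate projections
W_Q, W_K, W_V = np.split(in_proj_weight, 3, axis=0)
assert np.allclose(Q, inputs @ W_Q.T)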
If the embedding dimension $d_{model}$ is very large, the inner product of two vectors can produce very large or very small values, which push softmax into its saturated region and hurt gradient backpropagation, so a scaling factor $\sqrt{d_k}$ is added. Let's try to understand this factor. Take two normally distributed random matrices $A\in R^{10000 \times 64}, B\in R^{10000 \times 64}$; "normally distributed" here means the individual elements of $A$ follow a normal distribution, not that $A$ is a 64-dimensional multivariate normal, so flattening the matrix into a 640000-dimensional vector still gives a normal distribution. For independent random variables, $\text{var}(A + B) = \text{var}(A) + \text{var}(B)$. For example:
a = np.random.normal(size=10000)
b = np.random.normal(size=10000)
np.mean(a), np.var(a)
>>> 0.012120657907471737, 1.0050670288852406
np.mean(b), np.var(b)
>>> -0.0022273866402510775, 1.0149182079551393
np.mean(a + b), np.var(a + b)
>>> 0.009893271267220663, 1.9960134883163734
The example above shows the variance of the sum of two random variables; for matrix multiplication, look at the following:
A = np.random.normal(size=(10000, 64))
B = np.random.normal(size=(10000, 64))
np.mean(np.matmul(A, B.T)), np.var(np.matmul(A, B.T))
>>> 6.668697207603404e-05, 64.29069222735822
Here the variance is about 64, because each element of $A\cdot B^T$ is the inner product of a 64-dimensional row of $A$ with a 64-dimensional row of $B$:
$a_{1,1}b_{1,1} + a_{1,2}b_{1,2} + \dots + a_{1,64}b_{1,64}$
Each term is a product of two independent variables, and for independent zero-mean variables $\text{var}(a\cdot b)=\text{var}(a) \cdot \text{var}(b)$. Treating $ab$ as a new random variable and summing 64 of them, the variance becomes 64. So we apply a scaling factor $\sqrt{64}$, which brings the variance back to 1:
np.var(np.matmul(A, B.T) / np.sqrt(64))
>>> 1.0045420660524722
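As a quick check of the $\text{var}(a\cdot b)$ step itself, continuing the session above with the same a and b (the exact value varies from run to run, but it should be close to 1):

np.var(a * b)   # close to 1.0, since var(a) * var(b) is about 1 for independent zero-mean variables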
The point of this example is that as the embedding dimension grows, the variance of the dot products grows with it, i.e. extreme values become more common. The scaling factor prevents especially large (or small) values, which would otherwise land in the flat regions of softmax where gradients vanish and cannot be backpropagated. That is roughly the idea.
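Here is a small sketch of that saturation effect, again using scipy.special.softmax; the scores are random, so the exact numbers will vary, but the unscaled weights are typically close to one-hot while the scaled ones are much smoother:

import numpy as np
from scipy.special import softmax

d_k = 512
q = np.random.normal(size=(1, d_k))
K = np.random.normal(size=(5, d_k))

scores = q @ K.T                                  # variance ~ d_k, so values can be very large
print(softmax(scores, axis=-1))                   # usually close to one-hot: softmax is saturated
print(softmax(scores / np.sqrt(d_k), axis=-1))    # a much smoother distribution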
Since there is an in_proj_weight there is naturally an output projection as well (out_proj in PyTorch): after multi-head attention is computed, the output goes through one more linear transformation to get the final output. Let's first implement self attention with a single head in numpy:
import numpy as np
from scipy.special import softmax

def attention(inputs, in_proj_weights, out_proj_weights):
    # inputs shape: (seq_len, batch_size, embed_dim)
    # in_proj_weights shape: (embed_dim * 3, embed_dim)
    embed_dim = inputs.shape[-1]
    # The Q, K and V matrices are derived from the inputs through linear transformations.
    qkv = np.matmul(inputs, in_proj_weights.T)
    Q = qkv[:, :, :embed_dim]
    K = qkv[:, :, embed_dim:embed_dim * 2]
    V = qkv[:, :, embed_dim * 2:]
    # To simplify the calculation, reshape Q, K and V to (batch_size, seq_len, embed_dim).
    Q = np.swapaxes(Q, 0, 1)
    K = np.swapaxes(K, 0, 1)
    V = np.swapaxes(V, 0, 1)
    # The attention weights are computed from the dot product of Q and K;
    # K is transposed to (batch_size, embed_dim, seq_len) for the matrix product.
    atten_scores = np.matmul(Q, np.swapaxes(K, -2, -1))
    scaled_scores = atten_scores / np.sqrt(embed_dim)  # scaled dot product
    atten_weights = softmax(scaled_scores, axis=-1)
    # Each query's output is the weighted sum of the rows of V, using atten_weights.
    atten_output = np.matmul(atten_weights, V)
    output = np.matmul(atten_output, out_proj_weights.T)
    return np.swapaxes(output, 0, 1)
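A hedged sanity check against PyTorch (it assumes torch is available and uses nn.MultiheadAttention with a single head and bias disabled, so its weight layout matches the two plain weight matrices used above):

import torch
import torch.nn as nn

embed_dim, seq_len, batch_size = 16, 5, 1
mha = nn.MultiheadAttention(embed_dim, num_heads=1, bias=False)
x = torch.randn(seq_len, batch_size, embed_dim)

ref, _ = mha(x, x, x)
ours = attention(x.numpy(),
                 mha.in_proj_weight.detach().numpy(),
                 mha.out_proj.weight.detach().numpy())
print(np.allclose(ref.detach().numpy(), ours, atol=1e-5))   # should print True (up to float tolerance)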
It may be easier to understand multi-head attention with a drawing. First, a single head:

One sample $\text{inputs} \in R^{6 \times 16}$ is linearly transformed by $W_q, W_k, W_v \in R^{16 \times 16}$ into $Q, K, V \in R^{6 \times 16}$. From the way matrix multiplication works, every row of $Q$, $K$ and $V$ is a transformation of the corresponding word in the original input.
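A tiny numpy check of that row-wise property (random weights, purely illustrative):

import numpy as np

W_q = np.random.normal(size=(16, 16))
inputs = np.random.normal(size=(6, 16))
Q = inputs @ W_q.T
print(np.allclose(Q[0], inputs[0] @ W_q.T))   # True: row 0 of Q depends only on word 0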

The multi-head computation:

The multi-head case is slightly different: after obtaining $Q, K, V$, each of them is split into several sub-matrices along the column axis. From the matrix multiplication diagram above you can see that the first column of $Q$ is the inner product of every input row with the first column of $W_q^{T}$. Put plainly, the earlier linear transformation was
linear_q = nn.Linear(16, 16)
Q = linear_q(inputs)   # inputs: (6, 16)
Q.shape
>>> torch.Size([6, 16])
and now it becomes
linear_q_h1 = nn.Linear(16, 4)
linear_q_h2 = nn.Linear(16, 4)
linear_q_h3 = nn.Linear(16, 4)
linear_q_h4 = nn.Linear(16, 4)
Q1 = linear_q_h1(inputs)
Q2 = linear_q_h2(inputs)
Q3 = linear_q_h3(inputs)
Q4 = linear_q_h4(inputs)
Q = torch.cat([Q1, Q2, Q3, Q4], dim=-1)   # back to shape (6, 16)

The previous $W_q \in R^{16 \times 16}$ becomes four matrices $W_q\in R^{16 \times 4}$, giving four $Q \in R^{6 \times 4}$; generating four $K \in R^{6 \times 4}$ and four $V \in R^{6 \times 4}$ in the same way yields four attention outputs $A \in R^{6 \times 4}$, and concatenating the four $A$ gives $A \in R^{6 \times 16}$ again. The claim (made in the original paper) is that learning four $W_q \in R^{16 \times 4}$ works better than learning a single $W_q \in R^{16 \times 16}$ directly.
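In other words, the four small projections stacked together are exactly one big projection whose weight rows are the four head weights. A small numpy sketch of this equivalence, using the nn.Linear weight convention of shape (out_features, in_features) and random weights:

import numpy as np

head_weights = [np.random.normal(size=(4, 16)) for _ in range(4)]   # four Linear(16, 4) weights
W_q_full = np.concatenate(head_weights, axis=0)                     # one (16, 16) weight
inputs = np.random.normal(size=(6, 16))

Q_full = inputs @ W_q_full.T                                        # (6, 16)
Q_heads = np.concatenate([inputs @ W.T for W in head_weights], axis=-1)
print(np.allclose(Q_full, Q_heads))   # True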
The code is below; it follows the PyTorch implementation fairly closely, with quite a few transpose operations along the way:
def multihead_atten(inputs, num_heads, in_proj_weights, out_proj_weights):
    # inputs shape: (seq_len, batch_size, embed_dim)
    (seq_len, batch_size, embed_dim) = inputs.shape
    assert embed_dim % num_heads == 0, 'embed_dim must be divisible by num_heads'
    head_dim = embed_dim // num_heads
    qkv = np.matmul(inputs, in_proj_weights.T)
    Q = qkv[:, :, :embed_dim]
    K = qkv[:, :, embed_dim:embed_dim * 2]
    V = qkv[:, :, embed_dim * 2:]
    # Reshape Q, K, V to (seq_len, batch_size * num_heads, head_dim).
    # This partitions each Q matrix into num_heads submatrices [Q1, Q2, Q3, Q4, ...]
    # along the column axis, where each Qi has shape (seq_len, head_dim).
    # For example, with batch_size = 1 and 2 heads, a 2x8 matrix is split into two 2x4 matrices:
    # [1, 2, 3, 4, 5, 6, 7, 8]  ->  [1, 2, 3, 4]  [5, 6, 7, 8]
    # [9, 8, 7, 6, 5, 4, 3, 2]      [9, 8, 7, 6]  [5, 4, 3, 2]
    # and the reshaped (2, 2, 4) array holds the rows in this order:
    # [1, 2, 3, 4]
    # [5, 6, 7, 8]
    # [9, 8, 7, 6]
    # [5, 4, 3, 2]
    Q = Q.reshape(seq_len, batch_size * num_heads, head_dim)
    K = K.reshape(seq_len, batch_size * num_heads, head_dim)
    V = V.reshape(seq_len, batch_size * num_heads, head_dim)
    # Now the shape of Q, K and V is (seq_len, batch_size * num_heads, head_dim).
    # Permute the dimensions to (batch_size * num_heads, seq_len, head_dim).
    Q = np.swapaxes(Q, 0, 1)
    K = np.swapaxes(K, 0, 1)
    V = np.swapaxes(V, 0, 1)
    # Scaled dot-product attention, computed for all heads at once.
    atten_scores = np.matmul(Q, np.swapaxes(K, -2, -1))
    scaled_scores = atten_scores / np.sqrt(K.shape[-1])
    atten_weights = softmax(scaled_scores, axis=-1)
    atten_output = np.matmul(atten_weights, V)
    # Permute back and merge the heads before the output linear transformation.
    atten_output = np.swapaxes(atten_output, 0, 1)
    atten_output = atten_output.reshape(seq_len, batch_size, -1)
    output = np.matmul(atten_output, out_proj_weights.T)
    return output
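As with the single-head version, here is a hedged check against PyTorch's nn.MultiheadAttention (bias disabled so the weight layout matches the two plain weight matrices used here):

import numpy as np
import torch
import torch.nn as nn

embed_dim, num_heads, seq_len, batch_size = 16, 4, 6, 2
mha = nn.MultiheadAttention(embed_dim, num_heads, bias=False)
x = torch.randn(seq_len, batch_size, embed_dim)

ref, _ = mha(x, x, x)
ours = multihead_atten(x.numpy(), num_heads,
                       mha.in_proj_weight.detach().numpy(),
                       mha.out_proj.weight.detach().numpy())
print(np.allclose(ref.detach().numpy(), ours, atol=1e-5))   # should print True (up to float tolerance)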