
Self-attention
import torch
from torch.nn.functional import linear, softmax

# In-projections: _w and _b stand for the projection weight and bias used
# for each of q, k, v (as in PyTorch's multi_head_attention_forward)
q = linear(query, _w, _b)
k = linear(key, _w, _b)
v = linear(value, _w, _b)

# Scale the queries by 1/sqrt(head_dim) so the dot products stay in a
# numerically stable range for the soft-max
head_dim = embed_dim // num_heads
scaling = float(head_dim) ** -0.5
q = q * scaling

# Similarity scores, soft-max over the key dimension, then a weighted sum of the values
attn_output_weights = torch.bmm(q, k.transpose(1, 2))
attn_output_weights = softmax(attn_output_weights, dim=-1)
attn_output = torch.bmm(attn_output_weights, v)
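To see the steps end to end, they can be packaged as a minimal single-head version; the function name, the random inputs, and the separate per-projection weight matrices below are illustrative assumptions, not part of the PyTorch source.

import torch
import torch.nn.functional as F

def single_head_self_attention(x, w_q, w_k, w_v):
    # x: (batch, seq, embed); w_q, w_k, w_v: (embed, embed)
    q = F.linear(x, w_q)                        # (batch, seq, embed)
    k = F.linear(x, w_k)
    v = F.linear(x, w_v)
    q = q * (q.size(-1) ** -0.5)                # scale by 1/sqrt(d)
    weights = torch.bmm(q, k.transpose(1, 2))   # (batch, seq, seq)
    weights = F.softmax(weights, dim=-1)        # each row sums to 1
    return torch.bmm(weights, v)                # (batch, seq, embed)

batch, seq, embed = 2, 5, 16
x = torch.randn(batch, seq, embed)
w_q, w_k, w_v = (torch.randn(embed, embed) for _ in range(3))
out = single_head_self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([2, 5, 16])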
[Figure: dot-product similarity between q and k, followed by a soft-max. A pair of perpendicular vectors has a dot product of 0, meaning low relevance; the larger the dot product, the more closely the two vectors are related.]
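A tiny numeric check of this intuition; the vectors below are made up for illustration. A key perpendicular to the query scores 0, a key aligned with it scores higher, and the soft-max turns those scores into attention weights.

import torch

q  = torch.tensor([1.0, 0.0])
k1 = torch.tensor([0.0, 1.0])   # perpendicular to q -> dot product 0, low relevance
k2 = torch.tensor([2.0, 0.0])   # same direction as q -> large dot product, high relevance
scores = torch.stack([q @ k1, q @ k2])   # tensor([0., 2.])
weights = torch.softmax(scores, dim=0)   # approx. tensor([0.1192, 0.8808])
print(scores, weights)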