Self-Attention Mechanism
Self-attention is a core component of transformer architectures: when processing each element of the input sequence, it lets the model weigh how relevant every other element is. The standard form is scaled dot-product attention:
\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
\]
Where:
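- Q (Query), K (Key), and V (Value) are obtained by applying learned linear projections to the input; in self-attention all three come from the same sequence
- \(d_k\) is the dimension of the key vectors
- Dividing by \(\sqrt{d_k}\) keeps the dot products from growing with \(d_k\), which would otherwise push the softmax into regions with very small gradients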
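The formula above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the helper names `softmax` and `attention` and the toy matrices are assumptions made for the example, and Q, K, and V are supplied directly instead of being computed from learned projections.

```python
import math

def softmax(row):
    # Subtract the row maximum for numerical stability; the result sums to 1.
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on list-of-lists matrices.

    Q is n x d_k, K is m x d_k, V is m x d_v. Returns (output, weights).
    """
    scale = math.sqrt(len(K[0]))
    # scores[i][j] = (q_i . k_j) / sqrt(d_k)
    scores = [
        [sum(q * k for q, k in zip(q_row, k_row)) / scale for k_row in K]
        for q_row in Q
    ]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted average of the rows of V.
    d_v = len(V[0])
    output = [
        [sum(w * v_row[c] for w, v_row in zip(w_row, V)) for c in range(d_v)]
        for w_row in weights
    ]
    return output, weights

# Toy input (hypothetical values): 3 tokens, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = attention(Q, K, V)
for row in attn:
    print([round(w, 3) for w in row])
```

Each row of `attn` is a probability distribution over the input positions, and each row of `out` is the corresponding weighted average of the value vectors.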
Encoder vs Decoder Attention
In the encoder:
- Uses bidirectional self-attention where each position can attend to all other positions
- The attention mask is all ones (ignoring padding), allowing full visibility
- Mathematically represented as: \(M_{enc} = \mathbf{1}_{n \times n}\)
In the decoder:
- Uses masked self-attention to prevent positions from attending to subsequent positions
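- The attention mask is lower triangular: \(M_{dec}[i,j] = \begin{cases} 0 & \text{if } j > i \\ 1 & \text{otherwise} \end{cases}\); scores at positions where the mask is 0 are set to \(-\infty\) before the softmax, so they receive zero weight
- Also applies cross-attention, with queries \(Q_d\) from the decoder and keys \(K_e\) and values \(V_e\) from the encoder outputs: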
\[
\text{CrossAttention}(Q_d, K_e, V_e) = \text{softmax}\left(\frac{Q_dK_e^T}{\sqrt{d_k}}\right)V_e
\]
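To make the difference between the encoder and decoder masks concrete, here is a small sketch of how a 0/1 mask might be applied to a matrix of attention scores. The function names (`causal_mask`, `full_mask`, `masked_softmax`) are hypothetical, and the uniform scores are chosen only so the effect of the mask is easy to see.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    # Lower-triangular decoder mask: M[i][j] = 1 if j <= i, else 0.
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def full_mask(n, m):
    # All-ones mask: every query position may attend to every key position.
    return [[1] * m for _ in range(n)]

def masked_softmax(scores, mask):
    # Masked-out scores become -inf, so exp() maps them to exactly 0.
    # Assumes every row keeps at least one visible position.
    weights = []
    for s_row, m_row in zip(scores, mask):
        masked = [s if m else NEG_INF for s, m in zip(s_row, m_row)]
        top = max(masked)
        exps = [math.exp(s - top) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

n = 3
uniform = [[0.0] * n for _ in range(n)]

enc_weights = masked_softmax(uniform, full_mask(n, n))   # bidirectional
dec_weights = masked_softmax(uniform, causal_mask(n))    # causal

# Cross-attention: 2 decoder queries over 3 encoder positions, no causal mask.
cross_weights = masked_softmax([[0.0] * 3 for _ in range(2)], full_mask(2, 3))

for row in dec_weights:
    print([round(w, 3) for w in row])
```

With uniform scores, row i of `dec_weights` spreads its weight evenly over positions 0 through i and assigns exactly zero to later positions, while every row of `enc_weights` covers the whole sequence. The cross-attention weights have shape decoder length by encoder length, since the queries and keys come from different sequences.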
Parameter-Efficient Fine-Tuning (PEFT) with LoRA
Low-Rank Adaptation (LoRA) is a PEFT technique that adapts the attention projection matrices by learning a low-rank update, while the pretrained weights themselves stay frozen:
\[
W = W_0 + BA
\]
Where:
- \(W_0\) is the frozen pretrained weight matrix
- \(B \in \mathbb{R}^{d\times r}\) and \(A \in \mathbb{R}^{r\times d}\) are trainable low-rank matrices, so the update \(BA\) has rank at most \(r\)
- \(r\) is the rank, with \(r \ll d\) (typically 8 or 16)
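A quick count shows why the low-rank form is parameter-efficient. The sketch below compares the number of entries in a full \(d \times d\) update with the number in \(B\) and \(A\) combined; the function name `lora_param_counts` and the choice \(d = 4096\) are illustrative assumptions, not values taken from a specific model.

```python
def lora_param_counts(d, r):
    # A full update of a d x d matrix trains d * d values.
    full = d * d
    # LoRA trains B (d x r) and A (r x d) instead.
    lora = d * r + r * d
    return full, lora

# Illustrative sizes: d = 4096 is an assumption, r = 8 is one of the typical ranks.
full, lora = lora_param_counts(d=4096, r=8)
print(full, lora, lora / full)
```

For these sizes the adapter trains 65,536 values against 16,777,216 for the full matrix, which is 1/256 of the original count.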
Implementation Details
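# Original attention projection
y = W @ x
# LoRA equivalent: frozen W_0 plus the low-rank update BA
y = W_0 @ x + B @ (A @ x)
# Efficient implementation, assuming W_0, A, B, and x are arrays that support
# the @ operator (e.g. NumPy arrays). alpha scales the update: alpha = 1
# recovers W = W_0 + BA, and the LoRA paper uses a factor of alpha / r.
def lora_layer(x, W_0, A, B, alpha):
    # Apply A first, then B, so the d x d product BA is never materialized
    return W_0 @ x + alpha * (B @ (A @ x))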
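As a sanity check on the idea above, the following plain-Python sketch verifies that the adapter path gives the same result as folding the scaled update into a single merged matrix. The helpers `matvec`, `matmul`, `lora_forward`, and `merge` are hypothetical names, the matrices are toy values, and `alpha` multiplies the update directly, as in the snippet above.

```python
def matvec(M, x):
    # Matrix-vector product for a list-of-lists matrix.
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def matmul(P, Q):
    # Matrix-matrix product for list-of-lists matrices.
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]

def lora_forward(x, W_0, A, B, alpha):
    # Adapter path: compute A x (size r) first, then B (A x) (size d).
    base = matvec(W_0, x)
    update = matvec(B, matvec(A, x))
    return [b + alpha * u for b, u in zip(base, update)]

def merge(W_0, A, B, alpha):
    # Fold the scaled update into one matrix: W = W_0 + alpha * B A.
    BA = matmul(B, A)
    return [
        [w + alpha * u for w, u in zip(w_row, u_row)]
        for w_row, u_row in zip(W_0, BA)
    ]

# Toy sizes: d = 3, r = 1.
W_0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[1.0, 2.0, 3.0]]        # r x d
B = [[0.5], [0.0], [-1.0]]   # d x r
x = [1.0, 1.0, 1.0]
alpha = 2.0

y_adapter = lora_forward(x, W_0, A, B, alpha)
y_merged = matvec(merge(W_0, A, B, alpha), x)
print(y_adapter, y_merged)

# With B set to zero the layer reproduces the frozen model exactly.
B_zero = [[0.0], [0.0], [0.0]]
y_start = lora_forward(x, W_0, A, B_zero, alpha)
```

Keeping the adapter separate during training means only \(A\) and \(B\) need gradients. Merging afterwards removes the extra matrix products at inference time. The zero-\(B\) case mirrors the initialization described in the LoRA paper, where training starts from the unmodified pretrained behavior.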