Ahan M R - Blog

Self-Attention Mechanism

Self-attention is a fundamental component of transformer architectures that allows models to weigh the importance of different parts of the input sequence when processing each element. The basic self-attention formula is:

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]

Where:

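Q (Query), K (Key), and V (Value) are linear projections of the input, each computed with its own learned weight matrix
\(d_k\) is the dimension of the key vectors
Dividing by \(\sqrt{d_k}\) keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions where gradients are very small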
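
As a rough illustration of the formula, here is a minimal single-head sketch in NumPy. The names `softmax`, `attention`, `W_q`, `W_k`, `W_v` and all the sizes are made up for this example; batching, multiple heads, and masking are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8                      # illustrative sizes
X = rng.normal(size=(n, d_model))              # one sequence of n token vectors
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)      # (4, 8)
print(weights.shape)  # (4, 4): one row of attention weights per position
```

Each row of `weights` sums to 1, so every output vector is a weighted average of the value vectors.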

Encoder vs Decoder Attention

In the encoder:

Uses bidirectional self-attention where each position can attend to all other positions
The attention mask is all ones, allowing full visibility
Mathematically represented as: \(M_{enc} = \mathbf{1}_{n \times n}\)

In the decoder:

Uses masked self-attention to prevent positions from attending to subsequent positions
The attention mask is lower triangular: \(M_{dec}[i,j] = \begin{cases} 0 & \text{if } j > i \\ 1 & \text{otherwise} \end{cases}\)
Adds cross-attention, where the queries come from the decoder and the keys and values come from the encoder outputs:

\[ \text{CrossAttention}(Q_d, K_e, V_e) = \text{softmax}\left(\frac{Q_dK_e^T}{\sqrt{d_k}}\right)V_e \]
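
The two masks and the cross-attention shapes can be sketched as follows. This is a minimal NumPy example under assumed names (`masked_attention`, `M_enc`, `M_dec`) and made-up sizes; it applies the mask by setting blocked scores to negative infinity before the softmax, which is one common way to do it.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V, mask=None):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if mask is not None:
        # Positions where mask == 0 get -inf, so their softmax weight is exactly 0
        scores = np.where(mask == 1, scores, -np.inf)
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, m, d_k = 4, 6, 8                  # decoder length, encoder length, key dim (illustrative)
X = rng.normal(size=(n, d_k))

M_enc = np.ones((n, n))              # encoder: full visibility
M_dec = np.tril(np.ones((n, n)))     # decoder: position i sees only j <= i

_, w_enc = masked_attention(X, X, X, M_enc)
_, w_dec = masked_attention(X, X, X, M_dec)
print(np.round(w_dec, 2))            # zeros above the diagonal

# Cross-attention: n decoder queries attend over m encoder positions
Q_d = rng.normal(size=(n, d_k))
K_e = rng.normal(size=(m, d_k))
V_e = rng.normal(size=(m, d_k))
out_cross, w_cross = masked_attention(Q_d, K_e, V_e)
print(w_cross.shape)                 # (4, 6)
```

Note that the cross-attention weight matrix is not square: it has one row per decoder position and one column per encoder position.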

Parameter-Efficient Fine-Tuning (PEFT) with LoRA

Low-Rank Adaptation (LoRA) is a PEFT technique that keeps the pretrained weights frozen and learns only a low-rank update to them, commonly applied to the attention projection matrices:

\[ W = W_0 + BA \]

Where:

\(W_0 \in \mathbb{R}^{d\times d}\) is the frozen pretrained weight matrix
\(B \in \mathbb{R}^{d\times r}\) and \(A \in \mathbb{R}^{r\times d}\) are the trainable low-rank matrices
\(r\) is the rank, with \(r \ll d\) (values such as 8 or 16 are common)
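
To get a feel for the savings, the sketch below counts trainable parameters for one projection matrix and builds \(W = W_0 + BA\). The sizes (`d = 4096`, `r = 8`, `d_small = 16`) are illustrative and not taken from any particular model; initializing \(B\) to zero follows the LoRA paper, so the adapted layer starts out identical to the pretrained one.

```python
import numpy as np

d, r = 4096, 8                       # illustrative sizes

full_params = d * d                  # trainable parameters in a dense d x d update
lora_params = d * r + r * d          # parameters in B (d x r) plus A (r x d)
print(full_params)                   # 16777216
print(lora_params)                   # 65536
print(lora_params / full_params)     # 0.00390625

# Small-scale check of W = W_0 + BA
rng = np.random.default_rng(0)
d_small = 16
W_0 = rng.normal(size=(d_small, d_small))   # stands in for frozen pretrained weights
A = rng.normal(size=(r, d_small)) * 0.01    # small random init
B = np.zeros((d_small, r))                  # zero init: BA = 0 at the start of training

W = W_0 + B @ A
print(np.array_equal(W, W_0))        # True
```

With these numbers the adapter trains about 0.4% of the parameters a full update of the same matrix would need.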

Implementation Details


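# Original attention projection:
#   y = W x

# LoRA equivalent:
#   y = W_0 x + B A x

# Efficient implementation: apply A, then B, so the d x d product BA is never formed.
# W_0, A, and B are callables (for example torch.nn.Linear modules); W_0 stays frozen.
# alpha scales the low-rank update (the LoRA paper uses alpha / r as the scale).
def lora_layer(x, W_0, A, B, alpha):
    return W_0(x) + alpha * B(A(x))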
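
One consequence of this form is that the low-rank update can be folded into the base weights once training is done, so inference needs no extra matrix multiplications. The sketch below checks that numerically with plain NumPy arrays instead of callables; `lora_forward`, `merge`, and the sizes are hypothetical names and values chosen for the example.

```python
import numpy as np

def lora_forward(x, W_0, A, B, alpha):
    # Rows of x are inputs, so y = W x becomes x @ W.T.
    # The update is applied as two skinny matmuls: (x A^T) B^T.
    return x @ W_0.T + alpha * ((x @ A.T) @ B.T)

def merge(W_0, A, B, alpha):
    # Fold the adapter into a single d x d matrix
    return W_0 + alpha * (B @ A)

rng = np.random.default_rng(0)
d, r, alpha = 16, 4, 0.5             # illustrative sizes and scale
x = rng.normal(size=(3, d))          # batch of 3 input vectors
W_0 = rng.normal(size=(d, d))
A = rng.normal(size=(r, d))
B = rng.normal(size=(d, r))

y_adapter = lora_forward(x, W_0, A, B, alpha)
y_merged = x @ merge(W_0, A, B, alpha).T
print(np.allclose(y_adapter, y_merged))   # True
```

Keeping the adapter separate is useful during training and when swapping adapters; merging is an option when a single fixed model is served.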

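import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoderWithKVCache(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        # Single-head projections; hidden_dim is an assumed constructor argument
        self.q_proj = nn.Linear(hidden_dim, hidden_dim)
        self.k_proj = nn.Linear(hidden_dim, hidden_dim)
        self.v_proj = nn.Linear(hidden_dim, hidden_dim)
        self.head_dim = hidden_dim
        self.kv_cache = {}

    def forward(self, x, cache_position):
        # Generate key and value for the new token
        new_k = self.k_proj(x)  # [B, 1, H]
        new_v = self.v_proj(x)  # [B, 1, H]

        # Update cache: append to the entry stored under cache_position
        if cache_position in self.kv_cache:
            cached_k, cached_v = self.kv_cache[cache_position]
            k = torch.cat([cached_k, new_k], dim=1)  # [B, T, H]
            v = torch.cat([cached_v, new_v], dim=1)  # [B, T, H]
        else:
            k, v = new_k, new_v
        self.kv_cache[cache_position] = (k, v)

        # Compute attention using cached keys and values
        q = self.q_proj(x)  # [B, 1, H]
        attn = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)  # [B, 1, T]
        attn = F.softmax(attn, dim=-1)
        return torch.matmul(attn, v)  # [B, 1, H]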
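
A quick way to convince yourself that caching is safe is to compare it against full causal attention. The sketch below does that in NumPy for a single head and a single sequence; the weights, sizes, and helper names are invented for the example and are independent of the class above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    # Reference: recompute everything with a lower-triangular mask
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    mask = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d = 5, 8                                    # illustrative sizes
X = rng.normal(size=(n, d))                    # token representations, one row per step
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

full = causal_attention(X @ W_q, X @ W_k, X @ W_v)

# Incremental decoding: project only the new token, reuse cached keys and values
k_cache = np.empty((0, d))
v_cache = np.empty((0, d))
steps = []
for t in range(n):
    x_t = X[t:t + 1]                           # [1, d]
    k_cache = np.concatenate([k_cache, x_t @ W_k], axis=0)
    v_cache = np.concatenate([v_cache, x_t @ W_v], axis=0)
    q_t = x_t @ W_q
    scores = q_t @ k_cache.T / np.sqrt(d)      # [1, t + 1]
    steps.append(softmax(scores) @ v_cache)

incremental = np.concatenate(steps, axis=0)
print(np.allclose(full, incremental))          # True
```

No mask is needed in the incremental loop because the cache only ever contains the current and earlier positions, which is exactly what the causal mask allows.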

Understanding Self-Attention, Encoder-Decoder Architecture, and PEFT

KV Cache: Accelerating Transformer Inference

Understanding KV Cache

Why KV Cache Matters

Implementation in Decoder Architecture

Memory Efficiency

Performance Impact