Look backward, never forward. The causal constraint is what makes language models autoregressive and what makes generation possible. — Day 24 Principle
I. Efficient Causal Attention
Raschka’s implementation emphasizes efficiency: a pre-computed causal mask stored as a registered buffer, separate Q/K/V projections with an optional bias flag, and dropout applied to the attention weights.
import torch
import torch.nn as nn

class CausalAttention(nn.Module):
    def __init__(self, d_in, d_out, ctx_len, dropout, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        # Buffer moves with .to(device) but is not a trainable parameter
        self.register_buffer('mask', torch.triu(torch.ones(ctx_len, ctx_len), diagonal=1))

    def forward(self, x):
        b, num_tokens, _ = x.shape
        q, k, v = self.W_query(x), self.W_key(x), self.W_value(x)
        scores = q @ k.transpose(1, 2)
        # Slice the precomputed mask to the actual sequence length
        scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        weights = torch.softmax(scores / k.shape[-1] ** 0.5, dim=-1)
        return self.dropout(weights) @ v
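To see why the pre-computed `triu` mask enforces causality, here is a standalone sketch (sizes are illustrative) of the mask-then-softmax step in isolation:

```python
import torch

# Illustrative sketch: the upper-triangular mask blocks future positions
# before softmax, so each row attends only to itself and earlier tokens.
ctx_len = 4
mask = torch.triu(torch.ones(ctx_len, ctx_len), diagonal=1)  # 1s above diagonal
scores = torch.randn(ctx_len, ctx_len)
scores = scores.masked_fill(mask.bool(), float('-inf'))      # hide the future
weights = torch.softmax(scores, dim=-1)
# Row i has exactly zero weight on every position j > i; each row sums to 1.
```

Because `exp(-inf)` is exactly 0, the masked positions contribute nothing to the softmax normalization, so no probability mass ever leaks forward in time.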
V. Deliverables
- Efficient causal attention
- Register buffer for mask
- QKV projection options
- Dropout on attention
- Memory analysis
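One way to frame the memory-analysis deliverable: the attention-score matrix alone has shape (batch, ctx_len, ctx_len), so activation memory grows quadratically with context length. A rough back-of-the-envelope sketch (the batch size and context lengths below are illustrative assumptions, not measurements):

```python
# Back-of-the-envelope memory for the attention-score matrix in float32.
# batch=8 and the ctx_len values are illustrative assumptions.
def attn_score_bytes(batch: int, ctx_len: int, dtype_bytes: int = 4) -> int:
    return batch * ctx_len * ctx_len * dtype_bytes

for n in (1024, 2048, 4096):
    mib = attn_score_bytes(batch=8, ctx_len=n) / 2**20
    print(f"ctx_len={n:5d}: {mib:6.0f} MiB")  # -> 32, 128, 512 MiB
```

Doubling the context length quadruples this term, which is why long-context models lean on fused attention kernels that avoid materializing the full score matrix.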
Causal attention is the engine of autoregressive generation. Tomorrow: the full GPT-2 architecture.— Day 24 Closing