Look backward, never forward. The causal constraint is what makes language models autoregressive and what makes generation possible. — Day 24 Principle
I. Efficient Causal Attention
Raschka’s implementation emphasizes efficiency: a pre-computed causal mask stored as a registered buffer, separate Q/K/V projections with an optional bias flag, and dropout applied to the attention weights.
import torch
import torch.nn as nn

class CausalAttention(nn.Module):
    def __init__(self, d_in, d_out, ctx_len, dropout, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        # Buffer moves with .to(device) but is not a trainable parameter
        self.register_buffer('mask', torch.triu(torch.ones(ctx_len, ctx_len), diagonal=1))

    def forward(self, x):
        b, num_tokens, _ = x.shape
        q, k, v = self.W_query(x), self.W_key(x), self.W_value(x)
        scores = q @ k.transpose(1, 2)
        # Slice the precomputed mask to the actual sequence length
        scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        weights = torch.softmax(scores / k.shape[-1] ** 0.5, dim=-1)
        return self.dropout(weights) @ v
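To see why the pre-computed `triu` mask enforces causality, here is a standalone sketch (sizes are illustrative) of the mask-then-softmax step in isolation:

```python
import torch

# Illustrative sketch: the upper-triangular mask blocks future positions
# before softmax, so each row attends only to itself and earlier tokens.
ctx_len = 4
mask = torch.triu(torch.ones(ctx_len, ctx_len), diagonal=1)  # 1s above diagonal
scores = torch.randn(ctx_len, ctx_len)
scores = scores.masked_fill(mask.bool(), float('-inf'))      # hide the future
weights = torch.softmax(scores, dim=-1)
# Row i has exactly zero weight on every position j > i; each row sums to 1.
```

Because `exp(-inf)` is exactly 0, the masked positions contribute nothing to the softmax normalization, so no probability mass ever leaks forward in time.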
V. Deliverables
- Efficient causal attention
- Register buffer for mask
- QKV projection options
- Dropout on attention
- Memory analysis
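One way to frame the memory-analysis deliverable: the attention-score matrix alone has shape (batch, ctx_len, ctx_len), so activation memory grows quadratically with context length. A rough back-of-the-envelope sketch (the batch size and context lengths below are illustrative assumptions, not measurements):

```python
# Back-of-the-envelope memory for the attention-score matrix in float32.
# batch=8 and the ctx_len values are illustrative assumptions.
def attn_score_bytes(batch: int, ctx_len: int, dtype_bytes: int = 4) -> int:
    return batch * ctx_len * ctx_len * dtype_bytes

for n in (1024, 2048, 4096):
    mib = attn_score_bytes(batch=8, ctx_len=n) / 2**20
    print(f"ctx_len={n:5d}: {mib:6.0f} MiB")  # -> 32, 128, 512 MiB
```

Doubling the context length quadruples this term, which is why long-context models lean on fused attention kernels that avoid materializing the full score matrix.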
Causal attention is the engine of autoregressive generation. Tomorrow: the full GPT-2 architecture.— Day 24 Closing