PHASE 3 LLM Architecture · Day 24 of 80 · Raschka LLMs From Scratch

Causal Self-Attention — Masked & Scaled

Raschka’s systematic treatment: efficient implementation, memory layout, and numerical stability.

Look backward, never forward. The causal constraint is what makes language models autoregressive and what makes generation possible. — Day 24 Principle
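The "look backward, never forward" constraint can be shown in a few lines: an upper-triangular mask marks future positions, which are set to negative infinity before the softmax so they receive exactly zero attention weight. A minimal sketch (shapes and values are illustrative, not from the book):

```python
import torch

torch.manual_seed(0)
T, d_k = 4, 8                      # sequence length, key dimension
q = torch.randn(T, d_k)
k = torch.randn(T, d_k)

scores = q @ k.T / d_k ** 0.5      # scaled dot-product scores
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()  # True = future position
weights = torch.softmax(scores.masked_fill(mask, -torch.inf), dim=-1)
# Each row sums to 1; every entry above the diagonal is exactly 0,
# so position t attends only to positions 0..t.
```

Masking with `-inf` before the softmax (rather than zeroing weights after) keeps each row a proper probability distribution over the visible positions.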

I. Efficient Causal Attention

Raschka’s implementation emphasizes efficiency and numerical stability: the causal mask is pre-computed once and registered as a buffer (so it moves to the GPU with the model), future positions are filled with negative infinity before the softmax, and attention scores are scaled by the square root of the key dimension.

```python
import torch
import torch.nn as nn

class CausalAttention(nn.Module):
    def __init__(self, d_in, d_out, ctx_len, dropout, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        # Pre-compute the causal mask once; ones above the diagonal mark future positions.
        self.register_buffer(
            'mask', torch.triu(torch.ones(ctx_len, ctx_len), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        queries = self.W_query(x)
        keys = self.W_key(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2)
        # Mask future positions in place, then scale and normalize.
        attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf
        )
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1] ** 0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights)
        return attn_weights @ values
```
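The `register_buffer` call deserves a note: buffers live in the module's `state_dict` and follow the model across devices via `.to(device)`, but they are not trainable parameters, which is exactly right for a fixed mask. A small sketch isolating just this behavior (the `MaskHolder` class is illustrative, not from the book):

```python
import torch
import torch.nn as nn

class MaskHolder(nn.Module):
    def __init__(self, ctx_len):
        super().__init__()
        # Buffers are saved/loaded with the model and moved by .to(device),
        # but the optimizer never sees them.
        self.register_buffer(
            'mask', torch.triu(torch.ones(ctx_len, ctx_len), diagonal=1)
        )

m = MaskHolder(4)
print('mask' in m.state_dict())        # the buffer is part of the state dict
print(len(list(m.parameters())))       # but contributes no trainable parameters
```

Storing the mask as a plain Python attribute instead would silently leave it on the CPU after `model.to('cuda')`, a common source of device-mismatch errors.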

V. Deliverables

Causal attention is the engine of autoregressive generation. Tomorrow: the full GPT-2 architecture. — Day 24 Closing