I. Tensor Fundamentals — Shape, Stride, View
A tensor is a multi-dimensional array backed by a flat block of memory (contiguous when freshly allocated). Its shape gives the size of each dimension (e.g., [3, 4] means 3 rows, 4 columns). Its stride tells you how many elements to skip in the underlying storage to move one step along each dimension. A view creates a new tensor that shares the same underlying data but interprets it with a different shape or stride, so no copying occurs.
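A minimal sketch of shape, stride, and view sharing (the specific values here are just an illustration):

```python
import torch

# A 2x3 tensor laid out row-major in one contiguous buffer.
t = torch.arange(6).reshape(2, 3)
print(t.shape)     # torch.Size([2, 3])
print(t.stride())  # (3, 1): skip 3 elements per row, 1 per column

# A view reinterprets the same buffer with a new shape.
v = t.view(3, 2)
v[0, 0] = 99
print(t[0, 0])     # tensor(99): the view shares the underlying data
```

Mutating through the view changes the original, which is the quickest way to confirm that no copy was made.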
II. Broadcasting Rules — The Three-Step Check
Broadcasting lets you perform arithmetic between tensors of different shapes without explicit copying. PyTorch follows NumPy’s broadcasting rules: (1) align shapes from the right; (2) treat missing leading dimensions as size 1; (3) at each dimension, the sizes must be equal or one of them must be 1, and the size-1 dimension is stretched to match.
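The rules can be verified with a few shape combinations (these particular shapes are chosen for illustration):

```python
import torch

a = torch.ones(3, 4)   # shape [3, 4]
b = torch.ones(4)      # shape [4]   -> aligned right as [1, 4]
c = torch.ones(3, 1)   # shape [3, 1]

print((a + b).shape)   # [3, 4]: b is stretched along dim 0
print((a + c).shape)   # [3, 4]: c is stretched along dim 1
print((b + c).shape)   # [3, 4]: both are stretched

# Incompatible: [3, 4] vs [3] aligns as [3, 4] vs [_, 3]; 4 != 3 -> RuntimeError
```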
The keepdim Trap
If you write P.sum(dim=1) without keepdim=True, the result has shape [2] instead of [2, 1]. Division then broadcasts that vector against the columns rather than the rows — silently producing wrong results. This is the #1 tensor bug. Always pass keepdim=True when the result needs to broadcast back against the original tensor.
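A small demonstration of the trap, using a made-up 2x2 matrix whose row sums differ (so the bug is visible):

```python
import torch

P = torch.tensor([[1., 3.], [2., 6.]])  # row sums: 4 and 8

# Bug: P.sum(dim=1) has shape [2], which aligns against the LAST
# dimension, so each COLUMN gets divided instead of each row.
bad = P / P.sum(dim=1)

# Fix: keepdim=True keeps shape [2, 1], which broadcasts down each row.
good = P / P.sum(dim=1, keepdim=True)

print(bad.sum(dim=1))   # not all ones
print(good.sum(dim=1))  # tensor([1., 1.])
```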
III. One-Hot Encoding — Integers to Tensors
Neural networks operate on continuous numbers, not discrete integers. One-hot encoding converts a character index (e.g., 5) into a length-27 vector that is all zeros except for a single 1 at position 5. This is the bridge between discrete tokens and the continuous world of matrix multiplication.
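A quick sketch with F.one_hot (the vocabulary size of 27 follows the character-level setup above; the indices are arbitrary examples):

```python
import torch
import torch.nn.functional as F

idx = torch.tensor([5, 0, 13])                  # example character indices
xenc = F.one_hot(idx, num_classes=27).float()   # one_hot returns int64; cast for matmul
print(xenc.shape)         # torch.Size([3, 27])
print(xenc[0].nonzero())  # the single 1 sits at position 5
```

The .float() cast matters: F.one_hot returns integer tensors, and matrix multiply against float weights requires matching dtypes.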
One-Hot @ W = Table Lookup
Multiplying a one-hot vector by a weight matrix is equivalent to selecting a row from the matrix. So one_hot(5) @ W just returns W[5]. This is why embedding layers exist — they skip the one-hot multiplication and directly index into the weight matrix. Same result, zero wasted computation.
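The equivalence is easy to check directly (random W and index 5 are illustrative choices):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
W = torch.randn(27, 27)

i = 5
row_via_matmul = F.one_hot(torch.tensor(i), num_classes=27).float() @ W
row_via_index = W[i]
print(torch.allclose(row_via_matmul, row_via_index))  # True
```

The one-hot vector zeroes out every row of W except row i, so the matmul reduces to plain indexing — which is exactly what an embedding layer does.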
IV. The Matrix — What Matters Today
DO FIRST
Practice broadcasting: create tensors of shapes [3,4], [4], [3,1] and verify all arithmetic combinations. Break one on purpose.
DO IF TIME
Explore .view(), .reshape(), .unsqueeze(), .squeeze(). Understand which share memory (views) vs. copy.
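A quick sketch of the memory-sharing check (the transpose trick below is one assumed way to force a non-contiguous input):

```python
import torch

t = torch.arange(6).reshape(2, 3)

v = t.view(6)       # view: shares memory, requires a contiguous input
u = t.unsqueeze(0)  # shape [1, 2, 3]; also a view
s = u.squeeze(0)    # drops the size-1 dim back to [2, 3]; also a view

v[0] = 42
print(t[0, 0])      # tensor(42): all of the above share t's storage

# reshape() returns a view when it can, but silently copies when the
# input is non-contiguous (e.g., after a transpose):
r = t.t().reshape(6)
r[0] = -1
print(t[0, 0])      # still 42: r does not share t's storage
```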
DO CAREFULLY
Re-implement the Day 6 bigram model using one-hot encoding + matrix multiply. Verify you get the same NLL as the counting approach.
AVOID TODAY
GPU tensors, CUDA operations, or distributed tensor ops. CPU tensors are sufficient for all Phase 1 work.
V. Today’s Deliverables
- Shape manipulation: Create, reshape, view, squeeze, unsqueeze tensors fluently
- Broadcasting: Verify 4+ broadcasting examples by hand, then in code
- One-hot encoding: Convert character indices to one-hot vectors with F.one_hot()
- Matrix multiply: xenc @ W to produce logits from one-hot inputs
- keepdim: Demonstrate the bug without keepdim=True and the fix with it
- Equivalence: Show that one_hot(i) @ W == W[i] (embedding = table lookup)