PHASE 1 Foundations · Day 7 of 80 · Neural Networks & Backprop

Tensors, Broadcasting & torch.Tensor Deep Dive

Master the data structure at the heart of deep learning. Understand shapes, strides, views, and the broadcasting rules that make vectorized code possible.

A spreadsheet is a 2D grid of numbers. A tensor is an N-dimensional grid of numbers. Just as portfolio analytics requires fluency with spreadsheets, neural network work requires fluency with tensors. Broadcasting — the automatic expansion of shapes during arithmetic — is the single most important concept to internalize. Get it wrong, and silent shape mismatches corrupt your results. Get it right, and you write code that runs 100× faster than loops. — Day 7 Principle, adapted from the Marks framework

I. Tensor Fundamentals — Shape, Stride, View

A tensor is a multi-dimensional array stored as a contiguous block of memory. Its shape tells you the dimensions (e.g., [3, 4] means 3 rows, 4 columns). Its stride tells you how many elements to skip to reach the next position along each dimension. A view creates a new tensor that shares the same underlying data but interprets it with a different shape.
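A quick illustration of shape, stride, and view in PyTorch (the variable names are illustrative):

```python
import torch

t = torch.arange(12)   # 12 contiguous elements in memory
m = t.view(3, 4)       # same data, seen as 3 rows x 4 columns
print(m.shape)         # torch.Size([3, 4])
print(m.stride())      # (4, 1): skip 4 elements to the next row, 1 to the next column

# A view shares memory: a write through one is visible through the other
m[0, 0] = 100
print(t[0].item())     # 100
```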

Exhibit A — Tensor Shapes: Scalar → Vector → Matrix → 3D Tensor
[Diagram: 42 → shape [] (0-D scalar); [1 2 3] → shape [3] (1-D vector); a 3×3 grid → shape [3,3] (2-D matrix); a 3×3×3 cube → shape [3,3,3] (3-D tensor). N-D: same idea, more dimensions.]

II. Broadcasting Rules — The Three-Step Check

Broadcasting lets you perform arithmetic between tensors of different shapes without explicit copying. PyTorch follows NumPy’s broadcasting rules. Align shapes from the right. At each dimension, sizes must be equal or one of them must be 1.

✓ Compatible shapes:
  [3,4]   + [4]    → [3,4]
  [3,4]   + [3,1]  → [3,4]
  [3,1]   + [1,4]  → [3,4]
  [2,3,4] + [4]    → [2,3,4]

✗ Incompatible shapes:
  [3,4] + [3]   → ERROR (trailing sizes 4 vs 3: must be equal or 1)
  [3,4] + [2,4] → ERROR (leading sizes 3 vs 2: neither is 1)
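The incompatible cases do not fail silently at this point; PyTorch raises a RuntimeError. A minimal check, with unsqueeze shown as one way to repair the shape:

```python
import torch

a = torch.ones(3, 4)
b = torch.ones(3)

try:
    a + b   # trailing dims 4 vs 3: neither equal nor 1
except RuntimeError as e:
    print("broadcast failed:", e)

# Fix: give b a trailing dim of size 1 so it aligns as [3, 1]
c = a + b.unsqueeze(1)   # [3, 4] + [3, 1] -> [3, 4]
print(c.shape)           # torch.Size([3, 4])
```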
import torch

# Broadcasting in action
a = torch.tensor([[1, 2, 3], [4, 5, 6]])   # shape [2, 3]
b = torch.tensor([10, 20, 30])             # shape [3]
c = a + b                                  # b broadcasts to [2, 3]
# tensor([[11, 22, 33],
#         [14, 25, 36]])

# Row-wise normalization (sum to 1 per row)
P = a.float()
P = P / P.sum(dim=1, keepdim=True)   # keepdim=True is crucial!
# P.sum(dim=1, keepdim=True) shape: [2, 1] — broadcasts against [2, 3]

The keepdim Trap

If you write P.sum(dim=1) without keepdim=True, the result has shape [2] instead of [2, 1]. For a non-square tensor the subsequent division fails loudly with a shape error; for a square one (like a 27×27 bigram count matrix) it broadcasts along the wrong axis and silently produces wrong results. This is the #1 tensor bug. Always use keepdim=True when you need the result to broadcast back against the original tensor.
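A small demonstration of the trap on a square matrix, where the bug runs without any error and normalizes the wrong axis (the tensor values are illustrative):

```python
import torch

P = torch.tensor([[1., 1.],
                  [1., 3.]])

good = P / P.sum(dim=1, keepdim=True)   # [2,2] / [2,1]: divides each ROW by its sum
print(good.sum(dim=1))                  # tensor([1., 1.]): rows normalized

bad = P / P.sum(dim=1)                  # [2,2] / [2] aligns as [1,2]: divides COLUMNS
print(bad.sum(dim=1))                   # rows no longer sum to 1, and no error is raised
```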

III. One-Hot Encoding — Integers to Tensors

Neural networks operate on continuous numbers, not discrete integers. One-hot encoding converts a character index (e.g., 5) into a vector of 27 zeros with a single 1 at position 5. This is the bridge between discrete tokens and the continuous world of matrix multiplication.

import torch
import torch.nn.functional as F

# One-hot encode character indices
xenc = F.one_hot(torch.tensor([5, 13, 1]), num_classes=27).float()
# shape: [3, 27] — three chars, each a 27-dim vector

# Multiply by weight matrix: this IS the neural net’s first layer
W = torch.randn((27, 27), requires_grad=True)
logits = xenc @ W   # [3, 27] @ [27, 27] = [3, 27]

One-Hot @ W = Table Lookup

Multiplying a one-hot vector by a weight matrix is equivalent to selecting a row from the matrix. So one_hot(5) @ W just returns W[5]. This is why embedding layers exist — they skip the one-hot multiplication and directly index into the weight matrix. Same result, zero wasted computation.
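The equivalence can be verified directly; this sketch also checks torch.nn.Embedding against the same row (assigning the weights by hand here is just for the demo):

```python
import torch
import torch.nn.functional as F

W = torch.randn(27, 27)

# One-hot times W selects a row of W
onehot = F.one_hot(torch.tensor(5), num_classes=27).float()   # shape [27]
print(torch.allclose(onehot @ W, W[5]))   # True: the product is exactly row 5

# An embedding layer skips the multiply and indexes directly
emb = torch.nn.Embedding(27, 27)
emb.weight.data = W
print(torch.allclose(emb(torch.tensor(5)), W[5]))   # True: same row, no matmul
```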

IV. The Matrix — What Matters Today

The matrix sorts today's tasks along two axes: Builds Deep Intuition vs. Surface-Level Only, and Quick to Do vs. Slow but Worth It.

🎯 DO FIRST (quick, builds deep intuition)

Practice broadcasting: create tensors of shapes [3,4], [4], [3,1] and verify all arithmetic combinations. Break one on purpose.

⏭️ DO IF TIME (quick, surface-level)

Explore .view(), .reshape(), .unsqueeze(), .squeeze(). Understand which share memory (views) vs. copy.

🖐 DO CAREFULLY (slow but worth it, builds deep intuition)

Re-implement the Day 6 bigram model using one-hot encoding + matrix multiply. Verify you get the same NLL as the counting approach.

🚫 AVOID TODAY (slow, surface-level)

GPU tensors, CUDA operations, or distributed tensor ops. CPU tensors are sufficient for all Phase 1 work.
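The memory-sharing question raised by .view() vs. .reshape() can be answered empirically with data_ptr(); a sketch, assuming a contiguous starting tensor:

```python
import torch

t = torch.arange(12)   # contiguous, shape [12]
v = t.view(3, 4)       # view: shares memory with t
u = t.unsqueeze(0)     # shape [1, 12]: adds a size-1 dim, still a view
s = u.squeeze(0)       # shape [12]: removes size-1 dims, still a view

# All of these share storage with t
print(v.data_ptr() == t.data_ptr())   # True
t[0] = 99
print(v[0, 0].item())                 # 99: the view sees the write

# A non-contiguous tensor forces reshape to copy
nt = v.t()             # transpose: a non-contiguous view
c = nt.reshape(12)     # must copy; nt.view(12) would raise here
print(c.data_ptr() == t.data_ptr())   # False: fresh storage
```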

V. Today’s Deliverables

Tensors are the language of neural networks, and broadcasting is its grammar. Every forward pass, every loss computation, every gradient update is a tensor operation. Fluency here is not optional — it is the difference between writing code that works and code that silently corrupts. Tomorrow, you put it all together in a training loop. — Day 7 Closing Principle
Day 7 Notebook — Tensors, Broadcasting & torch.Tensor (runnable Python)

PyTorch tensor deep dive: shapes, broadcasting rules, one-hot encoding, matrix multiply for neural nets, and softmax.
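As a closing sketch, the softmax mentioned above is just exponentiation plus the row-wise keepdim normalization from Section II, checked here against PyTorch's built-in (the logits are random placeholders):

```python
import torch

logits = torch.randn(3, 27)   # e.g. one row of scores per input character
counts = logits.exp()         # unnormalized positive "counts"
probs = counts / counts.sum(dim=1, keepdim=True)   # row-wise softmax via broadcasting

print(probs.shape)        # torch.Size([3, 27])
print(probs.sum(dim=1))   # each row sums to 1

# Sanity check against the built-in (which also subtracts the max for stability)
print(torch.allclose(probs, torch.softmax(logits, dim=1), atol=1e-6))   # True
```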