PHASE 1 Foundations · Day 7 of 80 · Neural Networks & Backprop

Tensors, Broadcasting & torch.Tensor Deep Dive

Master the data structure at the heart of deep learning. Understand shapes, strides, views, and the broadcasting rules that make vectorized code possible.

A spreadsheet is a 2D grid of numbers. A tensor is an N-dimensional grid of numbers. Just as portfolio analytics requires fluency with spreadsheets, neural network work requires fluency with tensors. Broadcasting — the automatic expansion of shapes during arithmetic — is the single most important concept to internalize. Get it wrong, and silent shape mismatches corrupt your results. Get it right, and you write code that runs 100× faster than loops. — Day 7 Principle, adapted from the Marks framework

I. Tensor Fundamentals — Shape, Stride, View

A tensor is a multi-dimensional array stored as a contiguous block of memory. Its shape tells you the dimensions (e.g., [3, 4] means 3 rows, 4 columns). Its stride tells you how many elements to skip to reach the next position along each dimension. A view creates a new tensor that shares the same underlying data but interprets it with a different shape.
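A quick illustration of shape, stride, and view in PyTorch (the variable names are illustrative):

```python
import torch

t = torch.arange(12)   # 12 contiguous elements in memory
m = t.view(3, 4)       # same data, seen as 3 rows x 4 columns
print(m.shape)         # torch.Size([3, 4])
print(m.stride())      # (4, 1): skip 4 elements to the next row, 1 to the next column

# A view shares memory: a write through one is visible through the other
m[0, 0] = 100
print(t[0].item())     # 100
```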

Exhibit A — Tensor Shapes: Scalar → Vector → Matrix → 3D Tensor
[Diagram: 42 → shape [] (0-D scalar); [1 2 3] → shape [3] (1-D vector); a 3×3 grid → shape [3,3] (2-D matrix); a 3×3×3 cube → shape [3,3,3] (3-D tensor). N-D: same idea, more dimensions.]

II. Broadcasting Rules — The Three-Step Check

Broadcasting lets you perform arithmetic between tensors of different shapes without explicit copying. PyTorch follows NumPy’s broadcasting rules. Align shapes from the right. At each dimension, sizes must be equal or one of them must be 1.

✓ Compatible shapes:
  [3,4]   + [4]    → [3,4]
  [3,4]   + [3,1]  → [3,4]
  [3,1]   + [1,4]  → [3,4]
  [2,3,4] + [4]    → [2,3,4]

✗ Incompatible shapes:
  [3,4] + [3]   → ERROR (trailing sizes 4 vs 3: must be equal or 1)
  [3,4] + [2,4] → ERROR (leading sizes 3 vs 2: neither is 1)
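The incompatible cases do not fail silently at this point; PyTorch raises a RuntimeError. A minimal check, with unsqueeze shown as one way to repair the shape:

```python
import torch

a = torch.ones(3, 4)
b = torch.ones(3)

try:
    a + b   # trailing dims 4 vs 3: neither equal nor 1
except RuntimeError as e:
    print("broadcast failed:", e)

# Fix: give b a trailing dim of size 1 so it aligns as [3, 1]
c = a + b.unsqueeze(1)   # [3, 4] + [3, 1] -> [3, 4]
print(c.shape)           # torch.Size([3, 4])
```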
import torch

# Broadcasting in action
a = torch.tensor([[1, 2, 3], [4, 5, 6]])   # shape [2, 3]
b = torch.tensor([10, 20, 30])             # shape [3]
c = a + b                                  # b broadcasts to [2, 3]
# tensor([[11, 22, 33],
#         [14, 25, 36]])

# Row-wise normalization (sum to 1 per row)
P = a.float()
P = P / P.sum(dim=1, keepdim=True)   # keepdim=True is crucial!
# P.sum(dim=1, keepdim=True) shape: [2, 1] — broadcasts against [2, 3]

The keepdim Trap

If you write P.sum(dim=1) without keepdim=True, the result has shape [2] instead of [2, 1]. For a non-square tensor the subsequent division fails loudly with a shape error; for a square one (like a 27×27 bigram count matrix) it broadcasts along the wrong axis and silently produces wrong results. This is the #1 tensor bug. Always use keepdim=True when you need the result to broadcast back against the original tensor.
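A small demonstration of the trap on a square matrix, where the bug runs without any error and normalizes the wrong axis (the tensor values are illustrative):

```python
import torch

P = torch.tensor([[1., 1.],
                  [1., 3.]])

good = P / P.sum(dim=1, keepdim=True)   # [2,2] / [2,1]: divides each ROW by its sum
print(good.sum(dim=1))                  # tensor([1., 1.]): rows normalized

bad = P / P.sum(dim=1)                  # [2,2] / [2] aligns as [1,2]: divides COLUMNS
print(bad.sum(dim=1))                   # rows no longer sum to 1, and no error is raised
```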

III. One-Hot Encoding — Integers to Tensors

Neural networks operate on continuous numbers, not discrete integers. One-hot encoding converts a character index (e.g., 5) into a vector of 27 zeros with a single 1 at position 5. This is the bridge between discrete tokens and the continuous world of matrix multiplication.

import torch
import torch.nn.functional as F

# One-hot encode character indices
xenc = F.one_hot(torch.tensor([5, 13, 1]), num_classes=27).float()
# shape: [3, 27] — three chars, each a 27-dim vector

# Multiply by weight matrix: this IS the neural net’s first layer
W = torch.randn((27, 27), requires_grad=True)
logits = xenc @ W   # [3, 27] @ [27, 27] = [3, 27]

One-Hot @ W = Table Lookup

Multiplying a one-hot vector by a weight matrix is equivalent to selecting a row from the matrix. So one_hot(5) @ W just returns W[5]. This is why embedding layers exist — they skip the one-hot multiplication and directly index into the weight matrix. Same result, zero wasted computation.
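The equivalence can be verified directly; this sketch also checks torch.nn.Embedding against the same row (assigning the weights by hand here is just for the demo):

```python
import torch
import torch.nn.functional as F

W = torch.randn(27, 27)

# One-hot times W selects a row of W
onehot = F.one_hot(torch.tensor(5), num_classes=27).float()   # shape [27]
print(torch.allclose(onehot @ W, W[5]))   # True: the product is exactly row 5

# An embedding layer skips the multiply and indexes directly
emb = torch.nn.Embedding(27, 27)
emb.weight.data = W
print(torch.allclose(emb(torch.tensor(5)), W[5]))   # True: same row, no matmul
```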

IV. The Matrix — What Matters Today

The matrix sorts today's tasks along two axes: Builds Deep Intuition vs. Surface-Level Only, and Quick to Do vs. Slow but Worth It.

🎯 DO FIRST (quick, builds deep intuition)

Practice broadcasting: create tensors of shapes [3,4], [4], [3,1] and verify all arithmetic combinations. Break one on purpose.

⏭️ DO IF TIME (quick, surface-level)

Explore .view(), .reshape(), .unsqueeze(), .squeeze(). Understand which share memory (views) vs. copy.

🖐 DO CAREFULLY (slow but worth it, builds deep intuition)

Re-implement the Day 6 bigram model using one-hot encoding + matrix multiply. Verify you get the same NLL as the counting approach.

🚫 AVOID TODAY (slow, surface-level)

GPU tensors, CUDA operations, or distributed tensor ops. CPU tensors are sufficient for all Phase 1 work.
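The memory-sharing question raised by .view() vs. .reshape() can be answered empirically with data_ptr(); a sketch, assuming a contiguous starting tensor:

```python
import torch

t = torch.arange(12)   # contiguous, shape [12]
v = t.view(3, 4)       # view: shares memory with t
u = t.unsqueeze(0)     # shape [1, 12]: adds a size-1 dim, still a view
s = u.squeeze(0)       # shape [12]: removes size-1 dims, still a view

# All of these share storage with t
print(v.data_ptr() == t.data_ptr())   # True
t[0] = 99
print(v[0, 0].item())                 # 99: the view sees the write

# A non-contiguous tensor forces reshape to copy
nt = v.t()             # transpose: a non-contiguous view
c = nt.reshape(12)     # must copy; nt.view(12) would raise here
print(c.data_ptr() == t.data_ptr())   # False: fresh storage
```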

V. Today’s Deliverables

Tensors are the language of neural networks, and broadcasting is its grammar. Every forward pass, every loss computation, every gradient update is a tensor operation. Fluency here is not optional — it is the difference between writing code that works and code that silently corrupts. Tomorrow, you put it all together in a training loop. — Day 7 Closing Principle
Day 7 Notebook — Tensors, Broadcasting & torch.Tensor (runnable Python)

PyTorch tensor deep dive: shapes, broadcasting rules, one-hot encoding, matrix multiply for neural nets, and softmax.
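As a closing sketch, the softmax mentioned above is just exponentiation plus the row-wise keepdim normalization from Section II, checked here against PyTorch's built-in (the logits are random placeholders):

```python
import torch

logits = torch.randn(3, 27)   # e.g. one row of scores per input character
counts = logits.exp()         # unnormalized positive "counts"
probs = counts / counts.sum(dim=1, keepdim=True)   # row-wise softmax via broadcasting

print(probs.shape)        # torch.Size([3, 27])
print(probs.sum(dim=1))   # each row sums to 1

# Sanity check against the built-in (which also subtracts the max for stability)
print(torch.allclose(probs, torch.softmax(logits, dim=1), atol=1e-6))   # True
```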