COMPONENT I
Tensor Library
torch.Tensor · NumPy-like API
Multi-dimensional array containers for all data and parameters. GPU-accelerated. The fundamental unit of every PyTorch computation — scalars, vectors, matrices, and beyond.
Like NumPy, but your arrays can live on a GPU and know how to differentiate themselves.
COMPONENT II
Autograd Engine
torch.autograd · Dynamic computation graphs
Automatically computes gradients of any tensor expression. Builds a computation graph on every forward pass and runs backpropagation via .backward() — no calculus by hand.
The engine that turns forward passes into backward passes without you writing a single derivative.
COMPONENT III
Deep Learning Utilities
torch.nn · Layers, losses, optimizers
Modular building blocks: nn.Module, nn.Linear, nn.Sequential, loss functions, plus the companion modules for optimizers (torch.optim: SGD, Adam) and data loading (torch.utils.data: Dataset, DataLoader).
A Lego kit for neural networks — mix, match, subclass, and extend.
RANK 0 · 0D
Scalar
torch.tensor(42)
A single loss value, a learning rate
RANK 1 · 1D
Vector
torch.tensor([1, 2, 3, 4])
A bias term, a 1D feature vector
RANK 2 · 2D
Matrix
torch.tensor([[1,2],[3,4]])
A weight matrix, a batch of embeddings
RANK 3+ · nD
Tensor
shape: [batch, seq, dim]
A batch of token embeddings for an LLM
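The four ranks above can be checked directly via .ndim and .shape; a minimal sketch (the 3D shape here is illustrative, not from the original):

```python
import torch

scalar = torch.tensor(42)                # rank 0: a single number
vector = torch.tensor([1, 2, 3, 4])      # rank 1: four components
matrix = torch.tensor([[1, 2], [3, 4]])  # rank 2: 2x2 matrix
batch = torch.zeros(8, 16, 32)           # rank 3: [batch, seq, dim]

for t in (scalar, vector, matrix, batch):
    print(t.ndim, tuple(t.shape))
```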
| dtype | bits | created from | use case | convert with |
|---|---|---|---|---|
| torch.float32 | 32 | Python float | Default for training — GPU-optimized, sufficient precision | .to(torch.float32) |
| torch.float16 / bfloat16 | 16 | explicit cast | Mixed-precision training, LLM inference, saves memory | .half() / .to(torch.bfloat16) |
| torch.int64 | 64 | Python int | Class labels, token IDs, indices | .long() |
| torch.bool | 8 | comparison ops | Attention masks, padding masks | .bool() |
⚠ Key operations: .shape · .view()/.reshape() · .T (transpose) · @ (matmul) · .to(device). All operations preserve the computation graph when requires_grad=True.
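A quick sketch of those operations on a toy tensor (the values are chosen here purely for illustration):

```python
import torch

t = torch.arange(6.0)   # tensor([0., 1., 2., 3., 4., 5.])
m = t.view(2, 3)        # reshape to 2x3; shares memory with t
mt = m.T                # transpose -> 3x2
prod = m @ mt           # (2x3) @ (3x2) -> 2x2 matrix product

print(m.shape, mt.shape, prod)
```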
Forward Pass — building the graph
x₁,w₁
Input & Parameters — feature tensor + weight with requires_grad=True
tensor([1.1])
tensor([2.2], requires_grad=True)
z
Net Input — linear combination: z = x₁·w₁ + b
z = x1 * w1 + b
a
Activation — nonlinearity: a = σ(z)
torch.sigmoid(z)
L
Loss — compare prediction to true label y
F.binary_cross_entropy(a,y)
Backward Pass — computing gradients
∂L
Trigger Backprop — PyTorch traverses the graph right-to-left
loss.backward()
∂L/∂a
Gradient at Activation — derivative of BCE w.r.t. the sigmoid output
auto-computed
∂L/∂z
Gradient at Net Input — chain rule through the sigmoid
auto-computed
∂L/∂w₁
Parameter Gradient — how much does the loss change w.r.t. w₁?
w1.grad → −0.0898
Chain Rule
∂L/∂w₁ = (∂L/∂a) · (∂a/∂z) · (∂z/∂w₁)
PyTorch does this automatically. You never compute derivatives by hand.
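The whole forward/backward walk above fits in a few lines. The bias b and label y are not stated in the diagram; taking b = 0 and y = 1 as assumptions reproduces the −0.0898 shown for w1.grad, and the hand-applied chain rule matches autograd's result:

```python
import torch
import torch.nn.functional as F

# Assumed values: b = 0 and y = 1 are not given in the diagram.
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)
y = torch.tensor([1.0])

z = x1 * w1 + b                 # net input
a = torch.sigmoid(z)            # activation
loss = F.binary_cross_entropy(a, y)
loss.backward()                 # fills w1.grad and b.grad

# Chain rule by hand: dL/dw1 = (dL/da) * (da/dz) * (dz/dw1)
with torch.no_grad():
    dL_da = (a - y) / (a * (1 - a))   # derivative of BCE w.r.t. a
    da_dz = a * (1 - a)               # derivative of sigmoid
    dz_dw1 = x1                       # derivative of x1*w1 + b
    manual = dL_da * da_dz * dz_dw1   # simplifies to (a - y) * x1

print(w1.grad, manual)   # both ≈ tensor([-0.0898])
```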
1 · define
Dataset
Subclass torch.utils.data.Dataset. Implement three methods: __init__ (store data), __getitem__ (return one example by index), __len__ (total size).
⚠ Class labels must start at 0. Largest label = num_outputs − 1.
class MyDataset(Dataset):
    def __init__(self, X, y):
        self.X, self.y = X, y
    def __getitem__(self, i):
        return self.X[i], self.y[i]
    def __len__(self):
        return len(self.y)
2 · load
DataLoader
Wraps Dataset to handle batching, shuffling, and parallelism. num_workers>0 loads next batch in background while GPU trains on current batch.
⚠ Use drop_last=True to avoid a tiny last batch. Use shuffle=True only for train, not test.
DataLoader(
    dataset=train_ds,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    drop_last=True,
)
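Steps 1 and 2 wired together, as a sketch on synthetic data (the 100-example, 50-feature shapes are assumptions; num_workers is left at 0 so it runs anywhere):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, X, y):
        self.X, self.y = X, y
    def __getitem__(self, i):
        return self.X[i], self.y[i]
    def __len__(self):
        return len(self.y)

# 100 synthetic examples, 50 features, 3 classes (labels start at 0)
X = torch.randn(100, 50)
y = torch.randint(0, 3, (100,))
loader = DataLoader(MyDataset(X, y), batch_size=32,
                    shuffle=True, drop_last=True)

shapes = [features.shape for features, labels in loader]
print(len(shapes), shapes[0])   # drop_last: 100 // 32 = 3 full batches
```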
3 · build
Model — nn.Module
Subclass nn.Module. Define layers in __init__, connect them in forward(). Return raw logits — PyTorch loss functions apply softmax/sigmoid internally.
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(50, 30), nn.ReLU(),
            nn.Linear(30, 3),
        )
    def forward(self, x):
        return self.layers(x)
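A quick sanity check of this architecture: a dummy batch of 8 examples should produce logits of shape [8, 3], and the parameter count works out to 50·30+30 + 30·3+3 = 1623 (the batch size of 8 is an arbitrary choice here):

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(50, 30), nn.ReLU(),
            nn.Linear(30, 3),
        )
    def forward(self, x):
        return self.layers(x)

model = Net()
logits = model(torch.randn(8, 50))   # dummy batch: 8 examples, 50 features
n_params = sum(p.numel() for p in model.parameters())
print(logits.shape, n_params)        # -> torch.Size([8, 3]) 1623
```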
4 · train
Training Loop
Iterate epochs → batches. Five operations in strict order each batch:
Per-batch update — repeat for every epoch
① Forward pass — compute logits from features
② Compute loss — F.cross_entropy(logits, labels)
③ Zero gradients — optimizer.zero_grad() — prevents accumulation!
④ Backward pass — loss.backward() — fills .grad attributes
⑤ Update params — optimizer.step() — w ← w − lr·∂L/∂w
model.train()
for features, labels in loader:
    logits = model(features)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
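The loop end to end, as a self-contained sketch that overfits one toy batch (the data, net size, SGD optimizer, and lr=0.1 are assumptions chosen so the loss visibly falls):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(64, 50)              # toy features
y = torch.randint(0, 3, (64,))       # toy labels in {0, 1, 2}

model = nn.Sequential(nn.Linear(50, 30), nn.ReLU(), nn.Linear(30, 3))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

model.train()
losses = []
for epoch in range(20):                   # full-batch "epochs" on toy data
    logits = model(X)                     # 1 forward pass
    loss = F.cross_entropy(logits, y)     # 2 compute loss
    optimizer.zero_grad()                 # 3 zero gradients
    loss.backward()                       # 4 backward pass
    optimizer.step()                      # 5 update params
    losses.append(loss.item())

print(f"{losses[0]:.3f} -> {losses[-1]:.3f}")   # loss should decrease
```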
5 · eval
Inference & Save
model.eval() disables dropout/batchnorm training behavior. Wrap in torch.no_grad() to skip building the computation graph — saves memory and compute during inference.
⚠ Always call model.eval() before any prediction. Forgetting this with dropout gives different results each run.
model.eval()
with torch.no_grad():
    probs = torch.softmax(model(X), dim=1)
    preds = torch.argmax(probs, dim=1)
torch.save(model.state_dict(), "model.pth")
model.load_state_dict(
    torch.load("model.pth", weights_only=True))
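A round-trip check of the save/load pattern, sketched with a trivial nn.Linear so it stays self-contained (the temp-file path and layer sizes are arbitrary); the restored model should give identical outputs:

```python
import os
import tempfile
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
X = torch.randn(5, 4)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "model.pth")
    torch.save(model.state_dict(), path)

    restored = nn.Linear(4, 2)   # same architecture, fresh random weights
    restored.load_state_dict(torch.load(path, weights_only=True))

model.eval()
restored.eval()
with torch.no_grad():
    same = torch.allclose(model(X), restored(X))
print(same)   # -> True
```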
CPU (Default)
# All tensors start here
t = torch.tensor([1., 2., 3.])
# t.device → cpu
Every tensor and model parameter starts on CPU. Operations between tensors must be on the same device — mixing CPU and GPU raises RuntimeError.
→
.to(device)
GPU (cuda / mps)
# Best practice device setup
device = torch.device(
    "cuda" if torch.cuda.is_available()
    else "cpu"
)
model.to(device)
X, y = X.to(device), y.to(device)
Apple Silicon: replace "cuda" with "mps" — torch.backends.mps.is_available()
💡
Only 3 lines change to go from CPU to single-GPU training: device = ... · model.to(device) · features.to(device), labels.to(device) inside the loop. GPU won’t speed up tiny datasets — transfer overhead dominates.
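The CUDA check above, extended with the Apple-Silicon MPS fallback mentioned earlier, can be wrapped in a small helper (the function name pick_device is a hypothetical convenience, not a PyTorch API):

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA, then Apple-Silicon MPS, then fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    # getattr guards against very old torch builds without the mps backend
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
print(device)
```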
How DDP Works — One Process Per GPU, Synchronized Gradients
Training Data
BATCH A
examples 1, 3, 5, 7
DistributedSampler
ensures no overlap
BATCH B
examples 2, 4, 6, 8
different subset
per GPU process
BATCH C …
scales linearly
with N GPUs
Model Copy
(identical)
GPU 0 · rank=0
full model copy
forward → loss
→ backward → ∇w
GPU 1 · rank=1
full model copy
forward → loss
→ backward → ∇w
↔ All-Reduce: average gradients across all GPUs (NCCL) ↔
Synchronized
Weight Update
GPU 0 · updated
optimizer.step()
same weights as GPU 1
GPU 1 · updated
optimizer.step()
same weights as GPU 0
The key insight: each GPU sees a different data subset per iteration, but their gradients are averaged before each weight update — so all model copies stay identical. With 2 GPUs you process 2× more data per wall-clock second. With 8 GPUs, ~8×. Overhead: one all-reduce communication step per iteration.
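Running real DDP needs torchrun and multiple processes, but the key insight can be simulated on CPU in one process: averaging the gradients from two disjoint data shards (what all-reduce does) gives exactly the single-process full-batch gradient. The data shapes and the linear "model" below are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(8, 10)
y = torch.randint(0, 3, (8,))
w = torch.randn(10, 3, requires_grad=True)

# "GPU 0" and "GPU 1": disjoint halves, as a DistributedSampler would assign
grads = []
for shard in (slice(0, 4), slice(4, 8)):
    loss = F.cross_entropy(X[shard] @ w, y[shard])
    g, = torch.autograd.grad(loss, w)
    grads.append(g)
allreduced = (grads[0] + grads[1]) / 2   # what NCCL all-reduce computes

# Reference: single-process gradient over the full batch
full_loss = F.cross_entropy(X @ w, y)
full_grad, = torch.autograd.grad(full_loss, w)

print(torch.allclose(allreduced, full_grad, atol=1e-5))   # -> True
```

Equal-size shards matter here: cross_entropy averages over the batch, so the mean over 8 examples equals the average of two means over 4 each.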
$ torchrun --nproc_per_node=2 train.py # 2 GPUs
$ torchrun --nproc_per_node=$(nvidia-smi -L | wc -l) train.py # all GPUs
Tensor Operations
t.shape — dimensions, e.g. torch.Size([2, 3])
t.view(3, 2) — reshape (shares memory)
t.T — transpose (flip along diagonal)
A @ B — matrix multiply (= A.matmul(B))
t.to(device) — move to CPU / GPU / MPS
t.item() — extract Python scalar from 0D tensor
Model & Training
model.train() — enable dropout / batchnorm training mode
model.eval() — disable dropout, freeze batchnorm stats
torch.no_grad() — skip graph construction (inference)
optimizer.zero_grad() — clear gradients before each backward
loss.backward() — compute all gradients via chain rule
optimizer.step() — update parameters using computed gradients
Model Persistence
model.state_dict() — dict of all parameter tensors
torch.save(sd, "f.pth") — save to disk
torch.load("f.pth", weights_only=True) — load dict
model.load_state_dict(sd) — restore weights (architecture must match)
sum(p.numel() for p in model.parameters()) — total params
Common Gotchas
zero_grad() — must be called before each backward() — gradients accumulate by default
logits (not softmax) — return raw logits from model — loss fns apply softmax internally
same device — model and data must be on identical device or RuntimeError
drop_last=True — prevents a tiny last batch destabilizing gradient updates
shuffle=False — test/val loaders must not shuffle — use DistributedSampler for DDP
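The first gotcha is easy to see directly: without zero_grad(), a second backward() adds to the existing .grad instead of replacing it (a minimal sketch with hand-picked values):

```python
import torch

x = torch.tensor([3.0])
w = torch.tensor([2.0], requires_grad=True)

# First backward: d(w*x)/dw = x = 3
(w * x).sum().backward()
first = w.grad.clone()

# Second backward WITHOUT zeroing: gradients accumulate
(w * x).sum().backward()
second = w.grad.clone()

# Zeroing first (what optimizer.zero_grad() does per tensor) resets it
w.grad.zero_()
(w * x).sum().backward()
after_zero = w.grad.clone()

print(first.item(), second.item(), after_zero.item())   # 3.0 6.0 3.0
```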