Visual Reference · Sebastian Raschka, PhD

PyTorch in One Hour

From tensors to multi-GPU training — every essential concept mapped, annotated, and connected.
PyTorch 2.4.1 · 9 Sections · Tensors · Autograd · Training · GPU · DDP Multi-GPU
§ 01 · The Three Core Components of PyTorch
COMPONENT I
Tensor Library
torch.Tensor · NumPy-like API
Multi-dimensional array containers for all data and parameters. GPU-accelerated. The fundamental unit of every PyTorch computation — scalars, vectors, matrices, and beyond.
Like NumPy, but your arrays can live on a GPU and know how to differentiate themselves.
COMPONENT II
Autograd Engine
torch.autograd · Dynamic computation graphs
Automatically computes gradients of any tensor expression. Builds a computation graph on every forward pass and runs backpropagation via .backward() — no calculus by hand.
The engine that turns forward passes into backward passes without you writing a single derivative.
COMPONENT III
Deep Learning Utilities
torch.nn · Layers, losses, optimizers
Modular building blocks: nn.Module, nn.Linear, nn.Sequential, loss functions, optimizers (SGD, Adam), data loading (Dataset, DataLoader).
A Lego kit for neural networks — mix, match, subclass, and extend.
§ 02 · Tensors — The Universal Data Container
RANK 0 · 0D
Scalar
torch.tensor(42)
A single loss value, a learning rate
RANK 1 · 1D
Vector
torch.tensor([1, 2, 3, 4])
A bias term, a 1D feature vector
RANK 2 · 2D
Matrix
torch.tensor([[1,2],[3,4]])
A weight matrix, a batch of embeddings
RANK 3+ · nD
Tensor
shape: [batch, seq, dim]
A batch of token embeddings for an LLM
| dtype | bits | created from | use case | convert with |
| --- | --- | --- | --- | --- |
| torch.float32 | 32 | Python float | Default for training — GPU-optimized, sufficient precision | .to(torch.float32) |
| torch.float16 / bfloat16 | 16 | explicit cast | Mixed-precision training, LLM inference, saves memory | .half() / .to(torch.bfloat16) |
| torch.int64 | 64 | Python int | Class labels, token IDs, indices | .long() |
| torch.bool | 8 | comparison ops | Attention masks, padding masks | .bool() |
Key operations: .shape · .view()/.reshape() · .T (transpose) · @ (matmul) · .to(device). All operations preserve the computation graph when requires_grad=True.
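The key operations above fit in a few lines; the tensor values here are illustrative:

```python
import torch

t = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])   # rank 2, float32 by default
print(t.shape)                     # torch.Size([2, 3])

v = t.view(3, 2)                   # reshape; shares memory with t
m = t @ t.T                        # (2,3) @ (3,2) -> (2,2) matmul
print(m)                           # tensor([[14., 32.], [32., 77.]])

labels = torch.tensor([0, 1, 2])   # Python ints -> torch.int64
mask = labels > 0                  # comparison op -> torch.bool
```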
§ 03 – 04 · Computation Graphs & Automatic Differentiation

Forward Pass — building the graph

x₁,w₁
Input & Parameters: feature tensor + weight with requires_grad=True
tensor([1.1])
tensor([2.2], requires_grad=True)
z
Net Input: linear combination z = x₁·w₁ + b
z = x1 * w1 + b
a
Activation: nonlinearity a = σ(z)
torch.sigmoid(z)
L
Loss: compare prediction to true label y
F.binary_cross_entropy(a,y)

Backward Pass — computing gradients

∂L
Trigger Backprop: PyTorch traverses the graph right-to-left
loss.backward()
∂L/∂a
Gradient at Activation: derivative of BCE w.r.t. the sigmoid output
auto-computed
∂L/∂z
Gradient at Net Input: chain rule through the sigmoid
auto-computed
∂L/∂w₁
Parameter Gradient: how much does the loss change w.r.t. w₁?
w1.grad → −0.0898
Chain Rule
∂L/∂w₁ = (∂L/∂a) · (∂a/∂z) · (∂z/∂w₁)
PyTorch does this automatically. You never compute derivatives by hand.
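The whole graph above fits in a few lines. The bias b = 0.0 and label y = 1.0 are assumed values (the diagram doesn't state them); with those choices, w1.grad comes out to the −0.0898 shown:

```python
import torch
import torch.nn.functional as F

x1 = torch.tensor([1.1])                      # input feature
w1 = torch.tensor([2.2], requires_grad=True)  # trainable weight
b = torch.tensor([0.0], requires_grad=True)   # assumed bias
y = torch.tensor([1.0])                       # assumed true label

z = x1 * w1 + b                # net input
a = torch.sigmoid(z)           # activation
loss = F.binary_cross_entropy(a, y)

loss.backward()                # traverse the graph right-to-left
print(w1.grad)                 # tensor([-0.0898])
```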
§ 05 – 07 · The Full Training Pipeline
1 · define
Dataset
Subclass torch.utils.data.Dataset. Implement three methods: __init__ (store data), __getitem__ (return one example by index), __len__ (total size).
⚠ Class labels must start at 0. Largest label = num_outputs − 1.
class MyDataset(Dataset):
  def __init__(self, X, y):
    self.X, self.y = X, y
  def __getitem__(self, i):
    return self.X[i], self.y[i]
  def __len__(self):
    return len(self.y)
2 · load
DataLoader
Wraps Dataset to handle batching, shuffling, and parallelism. num_workers>0 loads next batch in background while GPU trains on current batch.
⚠ Use drop_last=True to avoid a tiny last batch. Use shuffle=True only for train, not test.
DataLoader(
  dataset=train_ds,
  batch_size=32,
  shuffle=True,
  num_workers=4,
  drop_last=True
)
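Putting Dataset and DataLoader together — a runnable sketch on random synthetic data (shapes are illustrative; num_workers is left at its default here for portability):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, X, y):
        self.X, self.y = X, y
    def __getitem__(self, i):
        return self.X[i], self.y[i]
    def __len__(self):
        return len(self.y)

X = torch.randn(100, 50)            # 100 examples, 50 features
y = torch.randint(0, 3, (100,))     # labels 0..2 (must start at 0!)

loader = DataLoader(MyDataset(X, y), batch_size=32,
                    shuffle=True, drop_last=True)

features, labels = next(iter(loader))
print(features.shape, labels.shape)  # torch.Size([32, 50]) torch.Size([32])
```

With drop_last=True the 100 examples yield 3 full batches of 32; the leftover 4 examples are skipped.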
3 · build
Model — nn.Module
Subclass nn.Module. Define layers in __init__, connect them in forward(). Return raw logits — loss functions apply log-softmax (F.cross_entropy) or sigmoid (F.binary_cross_entropy_with_logits) internally.
class Net(nn.Module):
  def __init__(self):
    super().__init__()
    self.layers = nn.Sequential(
      nn.Linear(50,30), nn.ReLU(),
      nn.Linear(30,3), # logits
    )
  def forward(self, x):
    return self.layers(x)
4 · train
Training Loop
Iterate epochs → batches. Five operations in strict order each batch:
Per-batch update — repeat for every epoch
Forward pass — compute logits from features
Compute loss — F.cross_entropy(logits, labels)
Zero gradients — optimizer.zero_grad() — prevents accumulation!
Backward pass — loss.backward() — fills .grad attributes
Update params — optimizer.step() — w ← w − lr·∂L/∂w
model.train()
for features, labels in loader:
  logits = model(features)                # ①
  loss = F.cross_entropy(logits, labels)  # ②
  optimizer.zero_grad()  # ③
  loss.backward()        # ④
  optimizer.step()       # ⑤
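The five steps combined into a runnable mini-run on synthetic data — a sketch using the 50→30→3 model from above; the optimizer choice (Adam), learning rate, and epoch count are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X = torch.randn(100, 50)
y = torch.randint(0, 3, (100,))
loader = DataLoader(TensorDataset(X, y), batch_size=32,
                    shuffle=True, drop_last=True)

model = nn.Sequential(nn.Linear(50, 30), nn.ReLU(), nn.Linear(30, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model.train()
for epoch in range(5):
    for features, labels in loader:
        logits = model(features)                # ① forward
        loss = F.cross_entropy(logits, labels)  # ② loss
        optimizer.zero_grad()                   # ③ clear old grads
        loss.backward()                         # ④ backprop
        optimizer.step()                        # ⑤ update weights
```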
5 · eval
Inference & Save
model.eval() disables dropout/batchnorm training behavior. Wrap in torch.no_grad() to skip building the computation graph — saves memory and compute during inference.
⚠ Always call model.eval() before any prediction. Forgetting this with dropout gives different results each run.
# Inference
model.eval()
with torch.no_grad():
  probs = torch.softmax(model(X), dim=1)
  preds = torch.argmax(probs, dim=1)

# Save / Load
torch.save(model.state_dict(), "model.pth")
model.load_state_dict(
  torch.load("model.pth", weights_only=True))
§ 09.1 – 09.2 · GPU Computing — Device Model
CPU (Default)
# All tensors start here
t = torch.tensor([1., 2., 3.])
# t.device → cpu
Every tensor and model parameter starts on CPU. Operations between tensors must be on the same device — mixing CPU and GPU raises RuntimeError.
.to(device)
GPU (cuda / mps)
# Best practice device setup
device = torch.device(
  "cuda" if torch.cuda.is_available()
  else "cpu"
)
model.to(device)
X, y = X.to(device), y.to(device)
Apple Silicon: replace "cuda" with "mps" — torch.backends.mps.is_available()
💡 Only 3 lines change to go from CPU to single-GPU training: device = ... · model.to(device) · features.to(device), labels.to(device) inside the loop. GPU won’t speed up tiny datasets — transfer overhead dominates.
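The device setup, including the Apple-Silicon MPS fallback, can be wrapped in one small helper — a minimal sketch:

```python
import torch

def get_device() -> torch.device:
    # Prefer CUDA, then Apple-Silicon MPS, then fall back to CPU
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = get_device()
batch = torch.randn(4, 50).to(device)  # same .to(device) pattern as above
```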
§ 09.3 · Distributed Training — DistributedDataParallel (DDP)
How DDP Works — One Process Per GPU, Synchronized Gradients
Training Data
BATCH A
examples 1, 3, 5, 7
DistributedSampler
ensures no overlap
BATCH B
examples 2, 4, 6, 8
different subset
per GPU process
BATCH C …
scales linearly
with N GPUs
Model Copy
(identical)
GPU 0 · rank=0
full model copy
forward → loss
→ backward → ∇w
GPU 1 · rank=1
full model copy
forward → loss
→ backward → ∇w
GPU N …
↔   All-Reduce: average gradients across all GPUs (NCCL)   ↔
Synchronized
Weight Update
GPU 0 · updated
optimizer.step()
same weights as GPU 1
GPU 1 · updated
optimizer.step()
same weights as GPU 0
GPU N …
The key insight: each GPU sees a different data subset per iteration, but their gradients are averaged before each weight update — so all model copies stay identical. With 2 GPUs you process 2× more data per wall-clock second. With 8 GPUs, ~8×. Overhead: one all-reduce communication step per iteration.
$ torchrun --nproc_per_node=2 train.py    # 2 GPUs
$ torchrun --nproc_per_node=$(nvidia-smi -L | wc -l) train.py    # all GPUs
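The train.py those commands launch needs only a handful of DDP-specific lines. A minimal skeleton, assuming torchrun has set the RANK / LOCAL_RANK / WORLD_SIZE environment variables; the synthetic dataset and tiny model stand in for your own:

```python
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun starts one process per GPU and sets the env vars
    torch.distributed.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    ds = TensorDataset(torch.randn(100, 50), torch.randint(0, 3, (100,)))
    sampler = DistributedSampler(ds)       # disjoint subset per rank
    loader = DataLoader(ds, batch_size=32, sampler=sampler)

    model = nn.Linear(50, 3).to(local_rank)
    model = DDP(model, device_ids=[local_rank])  # all-reduce on backward
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(5):
        sampler.set_epoch(epoch)           # reshuffle split across ranks
        for features, labels in loader:
            features = features.to(local_rank)
            labels = labels.to(local_rank)
            logits = model(features)
            loss = F.cross_entropy(logits, labels)
            optimizer.zero_grad()
            loss.backward()                # gradients averaged via NCCL here
            optimizer.step()

    torch.distributed.destroy_process_group()

if __name__ == "__main__":
    main()
```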
REF · Quick Reference Card
Tensor Operations
t.shape · dimensions, e.g. torch.Size([2, 3])
t.view(3, 2) · reshape (shares memory)
t.T · transpose (flip along diagonal)
A @ B · matrix multiply (= A.matmul(B))
t.to(device) · move to CPU / GPU / MPS
t.item() · extract Python scalar from 0D tensor
Model & Training
model.train() · enable dropout / batchnorm training mode
model.eval() · disable dropout, freeze batchnorm stats
torch.no_grad() · skip graph construction (inference)
optimizer.zero_grad() · clear gradients before each backward
loss.backward() · compute all gradients via chain rule
optimizer.step() · update parameters using computed gradients
Model Persistence
model.state_dict() · dict of all parameter tensors
torch.save(sd, "f.pth") · save to disk
torch.load("f.pth", weights_only=True) · load dict
model.load_state_dict(sd) · restore weights (architecture must match)
sum(p.numel() for p in model.parameters()) · total params
Common Gotchas
zero_grad() · must be called before backward() — gradients accumulate by default
logits (not softmax) · return raw logits from the model — loss fns apply softmax internally
same device · model and data must be on the same device or you get a RuntimeError
drop_last=True · prevents a tiny last batch from destabilizing gradient updates
shuffle=False · test/val loaders must not shuffle — use DistributedSampler for DDP
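The first gotcha is easy to see directly — without zero_grad(), each backward() adds into .grad:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
for _ in range(3):
    loss = (2 * w).sum()
    loss.backward()            # no zero_grad(): grads add up
accumulated = w.grad.clone()
print(accumulated)             # tensor([6.]) = 2 + 2 + 2

w.grad.zero_()                 # what optimizer.zero_grad() does per parameter
print(w.grad)                  # tensor([0.])
```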