COMPONENT I
Tensor Library
torch.Tensor · NumPy-like API
Multi-dimensional array containers for all data and parameters. GPU-accelerated. The fundamental unit of every PyTorch computation — scalars, vectors, matrices, and beyond.
Like NumPy, but your arrays can live on a GPU and know how to differentiate themselves.
COMPONENT II
Autograd Engine
torch.autograd · Dynamic computation graphs
Automatically computes gradients of any tensor expression. Builds a computation graph on every forward pass and runs backpropagation via .backward() — no calculus by hand.
The engine that turns forward passes into backward passes without you writing a single derivative.
COMPONENT III
Deep Learning Utilities
torch.nn · Layers, losses, optimizers
Modular building blocks: nn.Module, nn.Linear, nn.Sequential, loss functions, plus the companion modules for optimizers (torch.optim: SGD, Adam) and data loading (torch.utils.data: Dataset, DataLoader).
A Lego kit for neural networks — mix, match, subclass, and extend.
RANK 0 · 0D
Scalar
torch.tensor(42)
A single loss value, a learning rate
RANK 1 · 1D
Vector
torch.tensor([1, 2, 3, 4])
A bias term, a 1D feature vector
RANK 2 · 2D
Matrix
torch.tensor([[1,2],[3,4]])
A weight matrix, a batch of embeddings
RANK 3+ · nD
Tensor
shape: [batch, seq, dim]
A batch of token embeddings for an LLM
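The four ranks above can be checked directly via .ndim and .shape; a minimal sketch (the 3D shape here is illustrative, not from the original):

```python
import torch

scalar = torch.tensor(42)                # rank 0: a single number
vector = torch.tensor([1, 2, 3, 4])      # rank 1: four components
matrix = torch.tensor([[1, 2], [3, 4]])  # rank 2: 2x2 matrix
batch = torch.zeros(8, 16, 32)           # rank 3: [batch, seq, dim]

for t in (scalar, vector, matrix, batch):
    print(t.ndim, tuple(t.shape))
```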
| dtype | bits | created from | use case | convert with |
|---|---|---|---|---|
| torch.float32 | 32 | Python float | Default for training — GPU-optimized, sufficient precision | .to(torch.float32) |
| torch.float16 / bfloat16 | 16 | explicit cast | Mixed-precision training, LLM inference, saves memory | .half() / .to(torch.bfloat16) |
| torch.int64 | 64 | Python int | Class labels, token IDs, indices | .long() |
| torch.bool | 8 | comparison ops | Attention masks, padding masks | .bool() |
⚠ Key operations: .shape · .view()/.reshape() · .T (transpose) · @ (matmul) · .to(device). All operations preserve the computation graph when requires_grad=True.
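A quick sketch of those operations on a toy tensor (the values are chosen here purely for illustration):

```python
import torch

t = torch.arange(6.0)   # tensor([0., 1., 2., 3., 4., 5.])
m = t.view(2, 3)        # reshape to 2x3; shares memory with t
mt = m.T                # transpose -> 3x2
prod = m @ mt           # (2x3) @ (3x2) -> 2x2 matrix product

print(m.shape, mt.shape, prod)
```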
Forward Pass — building the graph
x₁,w₁
Input & Parameters — feature tensor + weight with requires_grad=True
tensor([1.1])
tensor([2.2], requires_grad=True)
z
Net Input — linear combination: z = x₁·w₁ + b
z = x1 * w1 + b
a
Activation — nonlinearity: a = σ(z)
torch.sigmoid(z)
L
Loss — compare prediction to true label y
F.binary_cross_entropy(a,y)
Backward Pass — computing gradients
∂L
Trigger Backprop — PyTorch traverses the graph right-to-left
loss.backward()
∂L/∂a
Gradient at Activation — derivative of BCE w.r.t. the sigmoid output
auto-computed
∂L/∂z
Gradient at Net Input — chain rule through the sigmoid
auto-computed
∂L/∂w₁
Parameter Gradient — how much does the loss change w.r.t. w₁?
w1.grad → −0.0898
Chain Rule
∂L/∂w₁ = (∂L/∂a) · (∂a/∂z) · (∂z/∂w₁)
PyTorch does this automatically. You never compute derivatives by hand.
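The whole forward/backward walk above fits in a few lines. The bias b and label y are not stated in the diagram; taking b = 0 and y = 1 as assumptions reproduces the −0.0898 shown for w1.grad, and the hand-applied chain rule matches autograd's result:

```python
import torch
import torch.nn.functional as F

# Assumed values: b = 0 and y = 1 are not given in the diagram.
x1 = torch.tensor([1.1])
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)
y = torch.tensor([1.0])

z = x1 * w1 + b                 # net input
a = torch.sigmoid(z)            # activation
loss = F.binary_cross_entropy(a, y)
loss.backward()                 # fills w1.grad and b.grad

# Chain rule by hand: dL/dw1 = (dL/da) * (da/dz) * (dz/dw1)
with torch.no_grad():
    dL_da = (a - y) / (a * (1 - a))   # derivative of BCE w.r.t. a
    da_dz = a * (1 - a)               # derivative of sigmoid
    dz_dw1 = x1                       # derivative of x1*w1 + b
    manual = dL_da * da_dz * dz_dw1   # simplifies to (a - y) * x1

print(w1.grad, manual)   # both ≈ tensor([-0.0898])
```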
1 · define
Dataset
Subclass torch.utils.data.Dataset. Implement three methods: __init__ (store data), __getitem__ (return one example by index), __len__ (total size).
⚠ Class labels must start at 0. Largest label = num_outputs − 1.
class MyDataset(Dataset):
    def __init__(self, X, y):
        self.X, self.y = X, y
    def __getitem__(self, i):
        return self.X[i], self.y[i]
    def __len__(self):
        return len(self.y)
2 · load
DataLoader
Wraps Dataset to handle batching, shuffling, and parallelism. num_workers>0 loads next batch in background while GPU trains on current batch.
⚠ Use drop_last=True to avoid a tiny last batch. Use shuffle=True only for train, not test.
DataLoader(
    dataset=train_ds,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    drop_last=True,
)
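Steps 1 and 2 wired together, as a sketch on synthetic data (the 100-example, 50-feature shapes are assumptions; num_workers is left at 0 so it runs anywhere):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, X, y):
        self.X, self.y = X, y
    def __getitem__(self, i):
        return self.X[i], self.y[i]
    def __len__(self):
        return len(self.y)

# 100 synthetic examples, 50 features, 3 classes (labels start at 0)
X = torch.randn(100, 50)
y = torch.randint(0, 3, (100,))
loader = DataLoader(MyDataset(X, y), batch_size=32,
                    shuffle=True, drop_last=True)

shapes = [features.shape for features, labels in loader]
print(len(shapes), shapes[0])   # drop_last: 100 // 32 = 3 full batches
```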
3 · build
Model — nn.Module
Subclass nn.Module. Define layers in __init__, connect them in forward(). Return raw logits — PyTorch loss functions apply softmax/sigmoid internally.
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(50, 30), nn.ReLU(),
            nn.Linear(30, 3),
        )
    def forward(self, x):
        return self.layers(x)
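A quick sanity check of this architecture: a dummy batch of 8 examples should produce logits of shape [8, 3], and the parameter count works out to 50·30+30 + 30·3+3 = 1623 (the batch size of 8 is an arbitrary choice here):

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(50, 30), nn.ReLU(),
            nn.Linear(30, 3),
        )
    def forward(self, x):
        return self.layers(x)

model = Net()
logits = model(torch.randn(8, 50))   # dummy batch: 8 examples, 50 features
n_params = sum(p.numel() for p in model.parameters())
print(logits.shape, n_params)        # -> torch.Size([8, 3]) 1623
```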
4 · train
Training Loop
Iterate epochs → batches. Five operations in strict order each batch:
Per-batch update — repeat for every epoch
① Forward pass — compute logits from features
② Compute loss — F.cross_entropy(logits, labels)
③ Zero gradients — optimizer.zero_grad() — prevents accumulation!
④ Backward pass — loss.backward() — fills .grad attributes
⑤ Update params — optimizer.step() — w ← w − lr·∂L/∂w
model.train()
for features, labels in loader:
    logits = model(features)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
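The loop end to end, as a self-contained sketch that overfits one toy batch (the data, net size, SGD optimizer, and lr=0.1 are assumptions chosen so the loss visibly falls):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(64, 50)              # toy features
y = torch.randint(0, 3, (64,))       # toy labels in {0, 1, 2}

model = nn.Sequential(nn.Linear(50, 30), nn.ReLU(), nn.Linear(30, 3))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

model.train()
losses = []
for epoch in range(20):                   # full-batch "epochs" on toy data
    logits = model(X)                     # 1 forward pass
    loss = F.cross_entropy(logits, y)     # 2 compute loss
    optimizer.zero_grad()                 # 3 zero gradients
    loss.backward()                       # 4 backward pass
    optimizer.step()                      # 5 update params
    losses.append(loss.item())

print(f"{losses[0]:.3f} -> {losses[-1]:.3f}")   # loss should decrease
```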
5 · eval
Inference & Save
model.eval() disables dropout/batchnorm training behavior. Wrap in torch.no_grad() to skip building the computation graph — saves memory and compute during inference.
⚠ Always call model.eval() before any prediction. Forgetting this with dropout gives different results each run.
model.eval()
with torch.no_grad():
    probs = torch.softmax(model(X), dim=1)
    preds = torch.argmax(probs, dim=1)
torch.save(model.state_dict(), "model.pth")
model.load_state_dict(
    torch.load("model.pth", weights_only=True))
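A round-trip check of the save/load pattern, sketched with a trivial nn.Linear so it stays self-contained (the temp-file path and layer sizes are arbitrary); the restored model should give identical outputs:

```python
import os
import tempfile
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
X = torch.randn(5, 4)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "model.pth")
    torch.save(model.state_dict(), path)

    restored = nn.Linear(4, 2)   # same architecture, fresh random weights
    restored.load_state_dict(torch.load(path, weights_only=True))

model.eval()
restored.eval()
with torch.no_grad():
    same = torch.allclose(model(X), restored(X))
print(same)   # -> True
```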
CPU (Default)
# All tensors start here
t = torch.tensor([1., 2., 3.])
# t.device → cpu
Every tensor and model parameter starts on CPU. Operations between tensors must be on the same device — mixing CPU and GPU raises RuntimeError.
→
.to(device)
GPU (cuda / mps)
# Best practice device setup
device = torch.device(
    "cuda" if torch.cuda.is_available()
    else "cpu"
)
model.to(device)
X, y = X.to(device), y.to(device)
Apple Silicon: replace "cuda" with "mps" — torch.backends.mps.is_available()
💡
Only 3 lines change to go from CPU to single-GPU training: device = ... · model.to(device) · features.to(device), labels.to(device) inside the loop. GPU won’t speed up tiny datasets — transfer overhead dominates.
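The CUDA check above, extended with the Apple-Silicon MPS fallback mentioned earlier, can be wrapped in a small helper (the function name pick_device is a hypothetical convenience, not a PyTorch API):

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA, then Apple-Silicon MPS, then fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    # getattr guards against very old torch builds without the mps backend
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
print(device)
```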
How DDP Works — One Process Per GPU, Synchronized Gradients
Training Data
BATCH A
examples 1, 3, 5, 7
DistributedSampler
ensures no overlap
BATCH B
examples 2, 4, 6, 8
different subset
per GPU process
BATCH C …
scales linearly
with N GPUs
Model Copy
(identical)
GPU 0 · rank=0
full model copy
forward → loss
→ backward → ∇w
GPU 1 · rank=1
full model copy
forward → loss
→ backward → ∇w
↔ All-Reduce: average gradients across all GPUs (NCCL) ↔
Synchronized
Weight Update
GPU 0 · updated
optimizer.step()
same weights as GPU 1
GPU 1 · updated
optimizer.step()
same weights as GPU 0
The key insight: each GPU sees a different data subset per iteration, but their gradients are averaged before each weight update — so all model copies stay identical. With 2 GPUs you process 2× more data per wall-clock second. With 8 GPUs, ~8×. Overhead: one all-reduce communication step per iteration.
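Running real DDP needs torchrun and multiple processes, but the key insight can be simulated on CPU in one process: averaging the gradients from two disjoint data shards (what all-reduce does) gives exactly the single-process full-batch gradient. The data shapes and the linear "model" below are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(8, 10)
y = torch.randint(0, 3, (8,))
w = torch.randn(10, 3, requires_grad=True)

# "GPU 0" and "GPU 1": disjoint halves, as a DistributedSampler would assign
grads = []
for shard in (slice(0, 4), slice(4, 8)):
    loss = F.cross_entropy(X[shard] @ w, y[shard])
    g, = torch.autograd.grad(loss, w)
    grads.append(g)
allreduced = (grads[0] + grads[1]) / 2   # what NCCL all-reduce computes

# Reference: single-process gradient over the full batch
full_loss = F.cross_entropy(X @ w, y)
full_grad, = torch.autograd.grad(full_loss, w)

print(torch.allclose(allreduced, full_grad, atol=1e-5))   # -> True
```

Equal-size shards matter here: cross_entropy averages over the batch, so the mean over 8 examples equals the average of two means over 4 each.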
$ torchrun --nproc_per_node=2 train.py # 2 GPUs
$ torchrun --nproc_per_node=$(nvidia-smi -L | wc -l) train.py # all GPUs
Tensor Operations
t.shape — dimensions, e.g. torch.Size([2, 3])
t.view(3, 2) — reshape (shares memory)
t.T — transpose (flip along diagonal)
A @ B — matrix multiply (= A.matmul(B))
t.to(device) — move to CPU / GPU / MPS
t.item() — extract Python scalar from 0D tensor
Model & Training
model.train() — enable dropout / batchnorm training mode
model.eval() — disable dropout, freeze batchnorm stats
torch.no_grad() — skip graph construction (inference)
optimizer.zero_grad() — clear gradients before each backward
loss.backward() — compute all gradients via chain rule
optimizer.step() — update parameters using computed gradients
Model Persistence
model.state_dict() — dict of all parameter tensors
torch.save(sd, "f.pth") — save to disk
torch.load("f.pth", weights_only=True) — load dict
model.load_state_dict(sd) — restore weights (architecture must match)
sum(p.numel() for p in model.parameters()) — total params
Common Gotchas
zero_grad() — must be called before each backward() — gradients accumulate by default
logits (not softmax) — return raw logits from model — loss fns apply softmax internally
same device — model and data must be on identical device or RuntimeError
drop_last=True — prevents a tiny last batch destabilizing gradient updates
shuffle=False — test/val loaders must not shuffle — use DistributedSampler for DDP
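The first gotcha is easy to see directly: without zero_grad(), a second backward() adds to the existing .grad instead of replacing it (a minimal sketch with hand-picked values):

```python
import torch

x = torch.tensor([3.0])
w = torch.tensor([2.0], requires_grad=True)

# First backward: d(w*x)/dw = x = 3
(w * x).sum().backward()
first = w.grad.clone()

# Second backward WITHOUT zeroing: gradients accumulate
(w * x).sum().backward()
second = w.grad.clone()

# Zeroing first (what optimizer.zero_grad() does per tensor) resets it
w.grad.zero_()
(w * x).sum().backward()
after_zero = w.grad.clone()

print(first.item(), second.item(), after_zero.item())   # 3.0 6.0 3.0
```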