PHASE 3 LLM Architecture · Day 22 of 80 · Raschka LLMs From Scratch

Data Preparation & DataLoaders for LLMs

Build the data pipeline: text corpus to batches of token sequences ready for training.

Garbage in, garbage out. The quality and structure of training data determines model quality more than architecture or compute.— Day 22 Principle

I. From Text to Training Batches

The pipeline: raw text → tokenize into token IDs → chunk into sequences of length block_size → create (input, target) pairs, where the target is the input shifted right by one token → shuffle → batch.
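Concretely, the shift-by-one pairing looks like this on a toy sequence (the token IDs below are made up; a real pipeline would get them from a tokenizer):

```python
# Toy illustration of the shift-by-one (input, target) pairing.
tokens = [5, 17, 42, 9, 31, 2]   # pretend token IDs from a tokenizer
block_size = 5

x = tokens[0:block_size]        # model input
y = tokens[1:block_size + 1]    # next-token targets, shifted right by one

# At position t, the model sees x[:t+1] and must predict y[t]:
for t in range(block_size):
    print(f"context {x[:t + 1]} -> target {y[t]}")
```

One chunk of length block_size thus yields block_size prediction problems, which is why next-token training is so sample-efficient.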

```python
import torch

# Encode the full corpus as one long 1-D tensor of token IDs
data = torch.tensor(encode(text), dtype=torch.long)

# Hold out the tail of the corpus for validation
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

def get_batch(split):
    d = train_data if split == 'train' else val_data
    # Sample batch_size random starting offsets (this is the shuffle)
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])          # inputs
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])  # targets, shifted by one
    return x, y
```
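The same sliding-window sampling can also be wrapped in PyTorch's standard Dataset/DataLoader abstractions, which handle shuffling and batching for you. A minimal sketch, where the class name `TokenDataset` and the toy `torch.arange` corpus are illustrative stand-ins:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TokenDataset(Dataset):
    """Sliding-window (input, target) pairs over a 1-D tensor of token IDs."""
    def __init__(self, token_ids, block_size):
        self.data = token_ids
        self.block_size = block_size

    def __len__(self):
        # Every valid starting offset is one training example
        return len(self.data) - self.block_size

    def __getitem__(self, i):
        x = self.data[i : i + self.block_size]
        y = self.data[i + 1 : i + self.block_size + 1]  # shifted by one
        return x, y

token_ids = torch.arange(100)  # stand-in for torch.tensor(encode(text))
ds = TokenDataset(token_ids, block_size=8)
dl = DataLoader(ds, batch_size=4, shuffle=True, drop_last=True)
x, y = next(iter(dl))
print(x.shape, y.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```

`shuffle=True` randomizes example order each epoch, and `drop_last=True` discards a ragged final batch so every step sees the same shape.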

V. Deliverables

Clean data pipelines are invisible but essential. Tomorrow: embeddings deep dive.— Day 22 Closing