Garbage in, garbage out. The quality and structure of training data determine model quality more than architecture or compute.
— Day 22 Principle
I. From Text to Training Batches
The pipeline: raw text → tokenize → chunk into sequences of length block_size → create (input, target) pairs, where the target is the input shifted by one token → sample random batches (randomized sampling provides the shuffling).
import torch

block_size, batch_size = 8, 4

# Encode the corpus as a 1-D tensor of token ids, then split 90/10 into train/val.
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

def get_batch(split):
    d = train_data if split == 'train' else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))  # random start offsets
    x = torch.stack([d[i:i+block_size] for i in ix])        # inputs
    y = torch.stack([d[i+1:i+block_size+1] for i in ix])    # targets: shift by 1
    return x, y
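A quick sanity check of the shift-by-1 convention, using a toy corpus and a minimal character-level encode (the toy text and the 20-token split point are assumptions for this demo, not part of the pipeline above):

```python
import torch

text = "hello world, hello tokens"          # toy corpus (assumption for the demo)
stoi = {ch: i for i, ch in enumerate(sorted(set(text)))}
encode = lambda s: [stoi[c] for c in s]

block_size, batch_size = 8, 4
data = torch.tensor(encode(text), dtype=torch.long)
train_data, val_data = data[:20], data[20:]

def get_batch(split):
    d = train_data if split == 'train' else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i+block_size] for i in ix])
    y = torch.stack([d[i+1:i+block_size+1] for i in ix])
    return x, y

x, y = get_batch('train')
# Every target row is its input row shifted left by one position.
assert torch.equal(x[:, 1:], y[:, :-1])
assert x.shape == (batch_size, block_size)
```

The assertion is the key invariant: position t of the target is the token the model should predict after seeing positions 0..t of the input.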
V. Deliverables
- Tokenize full corpus
- Train/val split
- Batch sampler
- Input/target pairs with shift-by-1
- DataLoader with shuffling
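One way to package the deliverables above with torch.utils.data; the class name LMDataset and the toy torch.arange corpus are illustrative, and the sliding-window Dataset here is a sketch, not the only valid design:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class LMDataset(Dataset):
    """Sliding-window (input, target) pairs with the shift-by-1 convention."""
    def __init__(self, data, block_size):
        self.data, self.block_size = data, block_size

    def __len__(self):
        # Each valid start index needs block_size + 1 tokens of room.
        return len(self.data) - self.block_size

    def __getitem__(self, i):
        x = self.data[i : i + self.block_size]
        y = self.data[i + 1 : i + self.block_size + 1]
        return x, y

data = torch.arange(100)                        # stand-in for a tokenized corpus
train_ds = LMDataset(data[:90], block_size=8)   # 90/10 train/val split
loader = DataLoader(train_ds, batch_size=4, shuffle=True)
xb, yb = next(iter(loader))
assert xb.shape == (4, 8) and torch.equal(xb[:, 1:], yb[:, :-1])
```

shuffle=True makes the DataLoader draw window offsets in random order each epoch, which replaces the manual torch.randint sampling in get_batch.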
Clean data pipelines are invisible but essential. Tomorrow: embeddings deep dive.
— Day 22 Closing