Garbage in, garbage out. The quality and structure of training data determine model quality more than architecture or compute.
— Day 22 Principle
I. From Text to Training Batches
The pipeline: raw text → tokenize → chunk into sequences of length block_size → create (input, target) pairs, where the target is the input shifted by one token → sample random batches (randomized sampling provides the shuffling).
import torch

block_size, batch_size = 8, 4

# Encode the corpus as a 1-D tensor of token ids, then split 90/10 into train/val.
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

def get_batch(split):
    d = train_data if split == 'train' else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))  # random start offsets
    x = torch.stack([d[i:i+block_size] for i in ix])        # inputs
    y = torch.stack([d[i+1:i+block_size+1] for i in ix])    # targets: shift by 1
    return x, y
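A quick sanity check of the shift-by-1 convention, using a toy corpus and a minimal character-level encode (the toy text and the 20-token split point are assumptions for this demo, not part of the pipeline above):

```python
import torch

text = "hello world, hello tokens"          # toy corpus (assumption for the demo)
stoi = {ch: i for i, ch in enumerate(sorted(set(text)))}
encode = lambda s: [stoi[c] for c in s]

block_size, batch_size = 8, 4
data = torch.tensor(encode(text), dtype=torch.long)
train_data, val_data = data[:20], data[20:]

def get_batch(split):
    d = train_data if split == 'train' else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i+block_size] for i in ix])
    y = torch.stack([d[i+1:i+block_size+1] for i in ix])
    return x, y

x, y = get_batch('train')
# Every target row is its input row shifted left by one position.
assert torch.equal(x[:, 1:], y[:, :-1])
assert x.shape == (batch_size, block_size)
```

The assertion is the key invariant: position t of the target is the token the model should predict after seeing positions 0..t of the input.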
V. Deliverables
- Tokenize full corpus
- Train/val split
- Batch sampler
- Input/target pairs with shift-by-1
- DataLoader with shuffling
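One way to package the deliverables above with torch.utils.data; the class name LMDataset and the toy torch.arange corpus are illustrative, and the sliding-window Dataset here is a sketch, not the only valid design:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class LMDataset(Dataset):
    """Sliding-window (input, target) pairs with the shift-by-1 convention."""
    def __init__(self, data, block_size):
        self.data, self.block_size = data, block_size

    def __len__(self):
        # Each valid start index needs block_size + 1 tokens of room.
        return len(self.data) - self.block_size

    def __getitem__(self, i):
        x = self.data[i : i + self.block_size]
        y = self.data[i + 1 : i + self.block_size + 1]
        return x, y

data = torch.arange(100)                        # stand-in for a tokenized corpus
train_ds = LMDataset(data[:90], block_size=8)   # 90/10 train/val split
loader = DataLoader(train_ds, batch_size=4, shuffle=True)
xb, yb = next(iter(loader))
assert xb.shape == (4, 8) and torch.equal(xb[:, 1:], yb[:, :-1])
```

shuffle=True makes the DataLoader draw window offsets in random order each epoch, which replaces the manual torch.randint sampling in get_batch.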
Clean data pipelines are invisible but essential. Tomorrow: embeddings deep dive.
— Day 22 Closing