PHASE 3 LLM Architecture · Day 30 of 80 · Raschka LLMs From Scratch

Instruction Finetuning — Dataset Format & Preparation

Prepare instruction-following datasets: system/user/assistant format, chat templates, and data quality.

The difference between a base model and an assistant is instruction tuning. Raw capability becomes usable skill through carefully formatted examples. — Day 30 Principle

I. Instruction Format

{"instruction": "Summarize the following text", "input": "The Federal Reserve announced...", "output": "The Fed raised rates by 25bp..."}

Datasets like Alpaca, LIMA, and OpenAssistant follow this pattern: an instruction, an optional input providing context, and a target output. Quality matters more than quantity — the LIMA paper showed that roughly 1,000 carefully curated examples can match models tuned on tens of thousands of noisier ones.
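Before training, each record is rendered into a single prompt string. The sketch below uses the Alpaca-style template (the `### Instruction:` / `### Input:` / `### Response:` headers popularized by that dataset); the helper name `format_alpaca` is my own, and the exact header text varies between projects.

```python
# Sketch: render one instruction record into an Alpaca-style training prompt.
# Field names ("instruction", "input", "output") match the record shown above;
# the function name and template wording are illustrative, not canonical.

def format_alpaca(example: dict) -> str:
    """Turn an instruction/input/output record into a single prompt string."""
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
    )
    if example.get("input"):  # the input field is optional
        prompt += f"### Input:\n{example['input']}\n\n"
    prompt += f"### Response:\n{example['output']}"
    return prompt

record = {
    "instruction": "Summarize the following text",
    "input": "The Federal Reserve announced...",
    "output": "The Fed raised rates by 25bp...",
}
print(format_alpaca(record))
```

Records without an `input` field simply omit the `### Input:` section, which is why the template branches on it. Chat-oriented models use the same idea but render system/user/assistant turns through a tokenizer-specific chat template instead of fixed headers.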

V. Deliverables

Good data beats more data. Tomorrow: the SFT training loop. — Day 30 Closing