A minimal, educational GPT-style transformer trained character-by-character on children's stories.
No pre-trained weights. No fine-tuning. Just raw PyTorch, from init to generation.
Trains a tiny transformer on raw text, character by character, so it learns which characters tend to follow which. Same architecture as GPT, just scaled way down.
Cross-entropy loss: the model minimizes surprise on unseen text. Trained with the Adam optimizer, a learning-rate schedule, and dropout for regularization.
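A minimal sketch of that training step, assuming a `model` object has already been built; the cosine schedule and the variable names here are illustrative, not necessarily the exact choices in transformer.py:

```python
import torch
import torch.nn.functional as F

# Illustrative setup: Adam plus a cosine schedule stands in for the
# "learning rate schedule" described above.
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5000)

def train_step(xb, yb):
    # xb, yb: (batch, block_size) tensors of character ids; yb is xb shifted by one
    logits = model(xb)                                        # (batch, block, vocab)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    scheduler.step()
    return loss.item()
```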
GPT-style decoder-only transformer. Identical conceptual structure to GPT-2, scaled down to run on commodity hardware.
The key mechanisms that make transformers powerful. Understanding these is understanding every modern language model.
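For illustration, here is a single causal self-attention head, the core mechanism referred to above. The class and argument names are placeholders, not necessarily those used in transformer.py:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttentionHead(nn.Module):
    def __init__(self, n_embd, head_size, block_size, dropout=0.1):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.dropout = nn.Dropout(dropout)
        # Lower-triangular mask: each position may attend only to earlier positions
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):                                   # x: (batch, T, n_embd)
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        att = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))   # (B, T, T) scores
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        att = self.dropout(F.softmax(att, dim=-1))
        return att @ v                                      # weighted sum of values
```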
Self-attention scales as O(L²) in sequence length. A character sequence is roughly 5× longer than its word-level equivalent (one sentence ≈ 250 chars vs. ~50 words), so attention has about 25× more pairs to compute.
No vocabulary limits. The model can spell any word and handle typos or invented words; out-of-vocabulary (OOV) problems simply don't exist (see the tokenizer sketch after this list).
Harder to learn long-range semantic dependencies. The context window fills up faster than it would at the word level.
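A sketch of why OOV problems disappear at the character level: the vocabulary is just every distinct character in the training file, so any string can be encoded and decoded. Filename and variable names follow common convention, not necessarily the script's:

```python
# Build a character-level "tokenizer" from the training text.
with open("data.txt", encoding="utf-8") as f:
    text = f.read()

chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> char

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

print(len(chars))                              # vocabulary size, typically < 100
print(decode(encode("Once upon a time")))      # round-trips any string in the charset
```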
Three runs across three hardware setups. All converged well — none overfit, none underfit catastrophically.
No overfitting detected. Train/val gap is 0.0057 — the model generalizes to unseen text.
All three runs were still improving at the final checkpoint: validation loss was still falling, so more training steps or more data would help every run.
The Chinchilla (2022) scaling law: ~20 tokens of training data per parameter is optimal. Here's how our runs align with the frontier.
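A back-of-the-envelope check you can run yourself; the parameter and dataset counts below are placeholders, not the figures from these runs:

```python
# Chinchilla rule of thumb: ~20 training tokens per parameter.
n_params = 1_000_000          # model size (parameters) -- placeholder
n_chars  = 2_000_000          # dataset size (characters == tokens here) -- placeholder

optimal_tokens = 20 * n_params
coverage = n_chars / optimal_tokens
print(f"Chinchilla-optimal tokens: {optimal_tokens:,}")
print(f"Dataset covers {coverage:.0%} of that budget")
```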
Once training finishes, best_model.pt contains the frozen weights. Generation is a simple loop: predict, sample, feed back.
Same weights + different random seed = different output. Add torch.manual_seed(42) for reproducible results.
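The loop itself looks roughly like the sketch below, assuming the model returns per-position logits; names are illustrative. Call torch.manual_seed(42) beforehand if you want identical text on every run.

```python
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size):
    # idx: (batch, seq_len) of character ids. Predict, sample, feed back.
    for _ in range(max_new_tokens):
        logits = model(idx[:, -block_size:])                # crop to the context window
        probs = torch.softmax(logits[:, -1, :], dim=-1)     # distribution over next char
        next_id = torch.multinomial(probs, num_samples=1)   # stochastic pick, not argmax
        idx = torch.cat([idx, next_id], dim=1)              # feed the sample back in
    return idx
```

Because the next character is sampled from the predicted distribution rather than taken greedily, two runs with the same weights diverge unless the seed is fixed.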
Drag the sliders to design your own model. Parameter count, memory, Chinchilla coverage, and training time update instantly.
One script. One dependency. No cloud account, no credentials, no pipeline.
git clone https://github.com/Eamon2009/Transformer-language-model
pip install torch

# Place any UTF-8 text file at data.txt
# (or edit the filename in transformer.py)

python transformer.py

[    0/5000] train=4.6207 val=4.6202 << best!
[ 200/5000] train=2.2058 val=2.1986 << best!
[ 400/5000] train=1.6111 val=1.6039 << best!
...
[DONE] Training finished in 367.0s | Best val loss: 0.9250
transformer.py  ← Everything. One file.
best_model.pt   ← Saved weights (after first run)
data.txt        ← Your text (any UTF-8 file)
No config files. Edit hyperparameters directly in the script.
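The hyperparameters typically sit near the top of the file as plain constants, along the lines of the sketch below; the names and values are illustrative, not the script's exact settings:

```python
# Edit these constants directly, then rerun the script.
block_size    = 256      # context window in characters
n_embd        = 384      # embedding width
n_head        = 6        # attention heads per block
n_layer       = 6        # transformer blocks
dropout       = 0.1
learning_rate = 3e-4
max_iters     = 5000     # matches the 0..5000 steps in the log above
```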
The script loads best_model.pt and generates text indefinitely. Press Ctrl+C to stop. Output differs each run because sampling is random.
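Conceptually, generation mode amounts to something like this sketch, assuming the checkpoint stores a plain state_dict and reusing the generate and decode helpers sketched above:

```python
import torch

checkpoint = torch.load("best_model.pt", map_location="cpu")
model.load_state_dict(checkpoint)             # assumes the file holds a state_dict
model.eval()

idx = torch.zeros((1, 1), dtype=torch.long)   # seed context: a single token
try:
    while True:
        new = generate(model, idx, max_new_tokens=200, block_size=256)
        print(decode(new[0, idx.size(1):].tolist()), end="", flush=True)
        idx = new[:, -256:]                   # keep only the most recent context
except KeyboardInterrupt:
    pass                                      # Ctrl+C ends generation cleanly
```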