Educational GPT-Style Transformer — PyTorch from scratch

Quadtrix

A minimal, educational GPT-style transformer trained character-by-character on children's stories.

No pre-trained weights. No fine-tuning. Just raw PyTorch, from init to generation.

model_output.txt — generated text
// sampling at temperature=1.0


  • Min Parameters: 0.82M (CPU run)
  • Best Val Loss: 0.7176 (10.82M model)
  • Fastest Training: 6.1 min (Tesla T4 GPU)
  • Max Parameters: 10.82M (Colab GPU)

What Is Quadtrix?

Quadtrix trains a tiny transformer on text — character by character — and learns which characters tend to follow others. It is the same architecture as GPT, just scaled way down.

Same Architecture as GPT
Identical transformer structure to GPT — just 1M–11M params instead of 175B. Scale, not magic.

📖 Character-Level Learning
Operates on individual characters. No vocabulary limits, no OOV (out-of-vocabulary) problems.

🔬 Fully Transparent
Every line of code is readable. Trace the loss, inspect the weights, understand every operation.

🚀 Train in Minutes
6 minutes on a Tesla T4. Under an hour on CPU. No cloud account or API keys needed.

// The Pipeline

  1. Load children's stories from disk
  2. Encode each character as an integer (vocab: 28–110)
  3. Split into train/val chunks (80/20); steps 1–3 are sketched below
  4. Build the GPT-style transformer
  5. Train for N steps with loss tracking
  6. Save the best weights on each val-loss improvement
  7. Load the best model and generate text
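
A minimal sketch of steps 1–3 (load, encode, split), assuming a UTF-8 text file at data.txt; variable names here are illustrative and may differ from transformer.py:

```python
import torch

# Step 1: load the story corpus from disk.
with open("data.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Step 2: build the character vocabulary and encode each character as an integer.
chars = sorted(set(text))
vocab_size = len(chars)                       # typically a few dozen characters
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> int
itos = {i: ch for ch, i in stoi.items()}      # int -> char
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

# Step 3: 80/20 train/validation split.
n = int(0.8 * len(data))
train_data, val_data = data[:n], data[n:]
```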

// Each Forward Pass

  • Input: a sequence of character indices, e.g., "Once upon..."
  • Embed: each character becomes a dense vector of floats
  • Attend: transformer layers learn which characters to attend to
  • Project: output logits are scores over every possible next character
  • Sample: softmax turns logits into probabilities; the next character is sampled at random
  • Loop: the sampled character is fed back in and the process repeats indefinitely
Loss function

Cross-entropy — the model minimizes surprise on unseen text. Adam optimizer with learning rate schedule and dropout for regularization.
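
A sketch of a single training step, reusing the tensors from the data-loading sketch above. The model here is a stand-in bigram lookup table (transformer.py builds the full transformer), and get_batch is an illustrative helper, not necessarily the script's own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

block_size, batch_size = 128, 64

def get_batch(split_data):
    # Sample random (input, target) windows; targets are inputs shifted by one character.
    ix = torch.randint(len(split_data) - block_size, (batch_size,))
    x = torch.stack([split_data[i:i + block_size] for i in ix])
    y = torch.stack([split_data[i + 1:i + block_size + 1] for i in ix])
    return x, y

model = nn.Embedding(vocab_size, vocab_size)             # stand-in for the transformer
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

xb, yb = get_batch(train_data)
logits = model(xb)                                       # (batch, block, vocab)
loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))

optimizer.zero_grad(set_to_none=True)
loss.backward()
optimizer.step()
```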

Model Architecture

GPT-style decoder-only transformer. Identical conceptual structure to GPT-2, scaled down to run on commodity hardware.

// Transformer Block Components

hyperparameters.py
batch_size      16–64
block_size      128–256
n_embd          128–384
n_head          4–6
n_layer         4–6
dropout         0.2
learning_rate   3e-4
max_iters       3000–5000
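
The block itself follows the standard GPT recipe: masked multi-head self-attention plus a position-wise MLP, each wrapped in a pre-norm residual connection. A minimal sketch (the layout of transformer.py may differ in details):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One GPT-style decoder block: causal self-attention + MLP, pre-norm residuals."""
    def __init__(self, n_embd=256, n_head=4, block_size=128, dropout=0.2):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout,
                                          batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(                 # position-wise feed-forward
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )
        # Causal mask: position i may only attend to positions <= i.
        mask = torch.triu(torch.ones(block_size, block_size, dtype=torch.bool), 1)
        self.register_buffer("mask", mask)

    def forward(self, x):
        T = x.size(1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=self.mask[:T, :T], need_weights=False)
        x = x + a                                 # residual around attention
        x = x + self.mlp(self.ln2(x))             # residual around the MLP
        return x
```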

How Transformers Really Work

The key mechanisms that make transformers powerful. Understanding these is understanding every modern language model.

Character-Level Complexity

Complexity
Self-attention scales as O(L²) in sequence length. A sentence is roughly 250 characters versus roughly 50 word-level tokens, so character-level attention computes about 25× more attention pairs (250² / 50² = 25).

Advantage
No vocabulary limits. The model can spell any word and handle typos and creative linguistics; OOV problems simply don't exist.

Trade-off
Long-range semantic dependencies are harder to learn, and the context window fills up faster than at word level.

Training Results

Three runs across three hardware setups. All converged well — none overfit, none underfit catastrophically.

loss_curve.svg: train and validation loss curves

No overfitting detected. Train/val gap is 0.0057 — the model generalizes to unseen text.

Run 3 — Tesla T4 output

head_to_head.md

| Metric        | T4 ⭐    | Colab    | CPU      |
|---------------|---------|----------|----------|
| Parameters    | 1.99M   | 10.82M   | 0.82M    |
| Best Val Loss | 0.9250  | 0.7176   | 1.3145   |
| Training Time | 6.1 min | 61.3 min | 39.4 min |
Key Observation

All three runs were still improving at the final checkpoint: validation loss had not yet plateaued. More training steps or more data would help every run.

Where Quadtrix Sits

The Chinchilla (2022) scaling law says ~20 tokens of training data per parameter is optimal. Here's how our runs align with the frontier (the arithmetic behind the Coverage column is sketched after the table).

// Chinchilla Coverage

| Model         | Params | Data  | Coverage (of 20 tok/param) |
|---------------|--------|-------|----------------------------|
| Run 1 — CPU   | 0.82M  | 200K  | 1.2%                       |
| Run 3 — T4 ⭐  | 1.99M  | 28.3M | 71.1%                      |
| Run 2 — Colab | 10.82M | 79.6M | 36.8%                      |
| GPT-2 Small   | 117M   | 40B   | 1700%                      |
| GPT-3         | 175B   | 600B  | 17%                        |
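
The Coverage column is simply each run's training data divided by the Chinchilla-optimal budget of ~20 tokens per parameter:

```python
def chinchilla_coverage(params: float, tokens: float) -> float:
    # Fraction of the ~20 tokens-per-parameter budget actually covered.
    return tokens / (20 * params)

print(f"{chinchilla_coverage(1.99e6, 28.3e6):.1%}")   # Run 3 (T4):    ~71.1%
print(f"{chinchilla_coverage(10.82e6, 79.6e6):.1%}")  # Run 2 (Colab): ~36.8%
print(f"{chinchilla_coverage(175e9, 600e9):.1%}")     # GPT-3:         ~17%
```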

// Scaling Law Insights

  1. The Chinchilla Scaling Law
     DeepMind's 2022 paper established the optimal ratio: ~20 training tokens per parameter. Undertrained models waste compute; overtrained models waste data.
  2. Run 3 Is the Sweet Spot
     At 71% of Chinchilla-optimal coverage, Run 3 has the best parameter-to-data ratio and learns the most per unit of compute of the three runs.
  3. GPT-3 Violates Chinchilla
     GPT-3 (175B params) was trained on only 600B tokens — ~17% of what Chinchilla recommends. This is why smaller Chinchilla-trained models often match GPT-3.
  4. What to Do Next
     All runs would benefit from more training steps first. Only after data coverage exceeds ~50% should you scale up model size.

How Generation Works

Once training finishes, best_model.pt contains frozen weights. Generation is a simple loop — predict, sample, feed back.

generate.py
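
A minimal sketch of that loop, assuming a trained model whose forward pass returns next-character logits, plus the stoi/itos tables and block_size from the earlier sketches (the actual script may structure this differently):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, prompt="Once upon", max_new_tokens=500, temperature=1.0):
    idx = torch.tensor([[stoi[c] for c in prompt]], dtype=torch.long)
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]           # crop to the context window
        logits = model(idx_cond)                  # (1, T, vocab_size)
        logits = logits[:, -1, :] / temperature   # keep last position, scale by temperature
        probs = F.softmax(logits, dim=-1)         # logits -> probabilities
        next_idx = torch.multinomial(probs, 1)    # sample the next character
        idx = torch.cat([idx, next_idx], dim=1)   # feed it back in
    return "".join(itos[int(i)] for i in idx[0])

print(generate(model))
```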

// Temperature Effect

output — temperature=1.0

Reproducible generation

Same weights + different random seed = different output. Add torch.manual_seed(42) for reproducible results.
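
With the generate sketch above, temperature rescales the logits before softmax (lower values sharpen the distribution, higher values flatten it), and seeding makes sampling repeatable:

```python
torch.manual_seed(42)                       # same seed + same weights -> same output
print(generate(model, temperature=0.8))     # sharper, more conservative sampling
print(generate(model, temperature=1.2))     # flatter, more adventurous sampling
```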

// Known Limitations

Character-level trade-off
Learns characters, not words. Can't reliably spell or track meaning across paragraphs.

📉 Output coherence
Sentences drift logically, names disappear, tense breaks — expected at this scale.

All models undertrained
Val loss was still improving at the final checkpoint. More steps would help all three.

📦 Limited data
Run 2 is only at 37% of optimal data coverage. A larger story corpus would help significantly.

🔭 No long-range memory
Fixed context window (128–256 tokens). Cannot reference events from earlier in the story.

🔤 No subword tokenization
'fox' = 3 character tokens vs. 1 word token. Context fills up faster, and semantics are harder to learn.

Hyperparameter Playground

Drag the sliders to design your own model. Parameter count, memory, Chinchilla coverage, and training time update instantly.

// hyperparameters.py

n_embd 200        Embedding dimension: width of every vector in the model (range 64–512)
n_head 4          Number of parallel attention heads; must divide n_embd (range 1–8)
n_layer 4         Number of stacked transformer blocks, i.e. depth (range 1–12)
block_size 128    Context window: characters seen per forward pass (range 64–512)
batch_size 64     Sequences processed in parallel per gradient step (range 8–128)
max_iters 5000    Total gradient update steps during training (range 500–20k)
Parameters           1.99M        ~1.99M trainable weights
Model Memory         7.6 MB       float32 weights only
Training Memory      ~380 MB      weights + grads + activations
Total Tokens Seen    40.9M        batch × block × iters
Chinchilla Coverage  71%          of the 20 tokens/param ideal
FLOPs (est.)         490 TFLOPs   6 × params × tokens
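
These readouts follow standard back-of-the-envelope rules. Here is a sketch of the arithmetic for the current slider values; the vocab size and corpus size are assumptions (a typical char-level vocab, the Run 3 corpus), and the exact parameter count depends on how transformer.py lays out its layers:

```python
# Rough estimates behind the playground readouts (not exact).
vocab_size = 65                         # assumption: typical character-level vocab
dataset_tokens = 28.3e6                 # assumption: size of the Run 3 story corpus
n_embd, n_head, n_layer = 200, 4, 4
block_size, batch_size, max_iters = 128, 64, 5000

params = (vocab_size * n_embd           # token embedding table
          + block_size * n_embd         # positional embedding table
          + n_layer * 12 * n_embd**2    # per block: attention ~4·d² + MLP ~8·d² weights
          + n_embd * vocab_size)        # output projection (if untied)

weight_mem_mb = params * 4 / 1e6            # float32: 4 bytes per weight
tokens_seen = batch_size * block_size * max_iters
coverage = dataset_tokens / (20 * params)   # coverage counts unique data, not repeats
flops = 6 * params * tokens_seen            # ~6 FLOPs per parameter per token

print(f"params ≈ {params / 1e6:.2f}M, weights ≈ {weight_mem_mb:.1f} MB")
print(f"tokens seen = {tokens_seen / 1e6:.1f}M, coverage ≈ {coverage:.0%}")
print(f"FLOPs ≈ {flops / 1e12:.0f} TFLOPs")
```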

// Estimated Training Time

⚡ Tesla T4      6.1 min   (~8 TFLOP/s)
🔬 Colab GPU     7.2 min   (~6.5 TFLOP/s)
💻 CPU (Ryzen)   39.4 min  (~100 GFLOP/s)
// vs. Actual Runs (parameter count)

Your model ✦     1.99M
Run 1 — CPU      0.82M
Run 3 — T4 ⭐     1.99M
Run 2 — Colab    10.82M
your_model_config.py
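
A hypothetical sketch of what the exported file would contain for the slider values above (dropout and learning_rate taken from the defaults in hyperparameters.py; the real export may differ):

```python
# your_model_config.py (hypothetical export of the playground settings)
n_embd        = 200
n_head        = 4
n_layer       = 4
block_size    = 128
batch_size    = 64
max_iters     = 5000
dropout       = 0.2
learning_rate = 3e-4
```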


Get Running in Minutes

One script. One dependency. No cloud account, no credentials, no pipeline.

// Installation Steps

01  Clone or download
    git clone https://github.com/Eamon2009/Transformer-language-model

02  Install PyTorch
    pip install torch

03  Add your data
    # Place any UTF-8 text file at data.txt
    # (or edit the filename in transformer.py)

04  Run training
    python transformer.py

05  Watch it learn
    [   0/5000] train=4.6207 val=4.6202 << best!
    [ 200/5000] train=2.2058 val=2.1986 << best!
    [ 400/5000] train=1.6111 val=1.6039 << best!
    ...
    [DONE] Training finished in 367.0s | Best val loss: 0.9250

// Choose Your Config

GPU — Tesla T4
⭐ Recommended · 6.1 min · 1.99M params

project structure
transformer.py    ← Everything. One file.
best_model.pt     ← Saved weights (after first run)
data.txt          ← Your text (any UTF-8 file)

No config files. Edit hyperparameters directly in the script.

After training finishes...

The script loads best_model.pt and generates text indefinitely. Press Ctrl+C to stop. Output differs each run because sampling is random.