Educational GPT-Style Transformer — PyTorch from scratch

Quadtrix

A minimal, educational GPT-style transformer trained character-by-character on children's stories.

No pre-trained weights. No fine-tuning. Just raw PyTorch, from init to generation.

model_output.txt — generated text
// sampling at temperature=1.0


  • Min Parameters: 0.82M (CPU run)
  • Best Val Loss: 0.7176 (10.82M model)
  • Fastest Training: 6.1 min (Tesla T4 GPU)
  • Max Parameters: 10.82M (Colab GPU)

What Is Quadtrix?

Quadtrix trains a tiny transformer on text — character by character — and learns which characters tend to follow others. It is the same architecture as GPT, just scaled way down.

Same Architecture as GPT
Identical transformer structure to GPT — just 1M–11M params instead of 175B. Scale, not magic.

📖 Character-Level Learning
Operates on individual characters. No vocabulary limits, no OOV (out-of-vocabulary) problems.

🔬 Fully Transparent
Every line of code is readable. Trace the loss, inspect the weights, understand every operation.

🚀 Train in Minutes
6 minutes on a Tesla T4. Under an hour on CPU. No cloud account or API keys needed.

// The Pipeline

  1. Load children's stories from disk
  2. Encode each character as an integer (vocab: 28–110)
  3. Split into train/val chunks (80/20); steps 1–3 are sketched below
  4. Build the GPT-style transformer
  5. Train for N steps with loss tracking
  6. Save the best weights on each val-loss improvement
  7. Load the best model and generate text
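
A minimal sketch of steps 1–3 (load, encode, split), assuming a UTF-8 text file at data.txt; variable names here are illustrative and may differ from transformer.py:

```python
import torch

# Step 1: load the story corpus from disk.
with open("data.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Step 2: build the character vocabulary and encode each character as an integer.
chars = sorted(set(text))
vocab_size = len(chars)                       # typically a few dozen characters
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> int
itos = {i: ch for ch, i in stoi.items()}      # int -> char
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

# Step 3: 80/20 train/validation split.
n = int(0.8 * len(data))
train_data, val_data = data[:n], data[n:]
```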

// Each Forward Pass

  • Input: a sequence of character indices, e.g., "Once upon..."
  • Embed: each character becomes a dense vector of floats
  • Attend: transformer layers learn which characters to attend to
  • Project: output logits are scores over every possible next character
  • Sample: softmax turns logits into probabilities; the next character is sampled at random
  • Loop: the sampled character is fed back in and the process repeats indefinitely
Loss function

Cross-entropy — the model minimizes surprise on unseen text. Adam optimizer with learning rate schedule and dropout for regularization.
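
A sketch of a single training step, reusing the tensors from the data-loading sketch above. The model here is a stand-in bigram lookup table (transformer.py builds the full transformer), and get_batch is an illustrative helper, not necessarily the script's own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

block_size, batch_size = 128, 64

def get_batch(split_data):
    # Sample random (input, target) windows; targets are inputs shifted by one character.
    ix = torch.randint(len(split_data) - block_size, (batch_size,))
    x = torch.stack([split_data[i:i + block_size] for i in ix])
    y = torch.stack([split_data[i + 1:i + block_size + 1] for i in ix])
    return x, y

model = nn.Embedding(vocab_size, vocab_size)             # stand-in for the transformer
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

xb, yb = get_batch(train_data)
logits = model(xb)                                       # (batch, block, vocab)
loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))

optimizer.zero_grad(set_to_none=True)
loss.backward()
optimizer.step()
```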

Model Architecture

GPT-style decoder-only transformer. Identical conceptual structure to GPT-2, scaled down to run on commodity hardware.

// Transformer Block Components

hyperparameters.py
batch_size      16–64
block_size      128–256
n_embd          128–384
n_head          4–6
n_layer         4–6
dropout         0.2
learning_rate   3e-4
max_iters       3000–5000
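
The block itself follows the standard GPT recipe: masked multi-head self-attention plus a position-wise MLP, each wrapped in a pre-norm residual connection. A minimal sketch (the layout of transformer.py may differ in details):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One GPT-style decoder block: causal self-attention + MLP, pre-norm residuals."""
    def __init__(self, n_embd=256, n_head=4, block_size=128, dropout=0.2):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout,
                                          batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(                 # position-wise feed-forward
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )
        # Causal mask: position i may only attend to positions <= i.
        mask = torch.triu(torch.ones(block_size, block_size, dtype=torch.bool), 1)
        self.register_buffer("mask", mask)

    def forward(self, x):
        T = x.size(1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=self.mask[:T, :T], need_weights=False)
        x = x + a                                 # residual around attention
        x = x + self.mlp(self.ln2(x))             # residual around the MLP
        return x
```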

How Transformers Really Work

The key mechanisms that make transformers powerful. Understanding these is understanding every modern language model.

Character-Level Complexity

Complexity
Self-attention scales as O(L²) in sequence length. A sentence is roughly 250 characters versus roughly 50 word-level tokens, so character-level attention computes about 25× more attention pairs (250² / 50² = 25).

Advantage
No vocabulary limits. The model can spell any word and handle typos and creative linguistics; OOV problems simply don't exist.

Trade-off
Long-range semantic dependencies are harder to learn, and the context window fills up faster than at word level.

Training Results

Three runs across three hardware setups. All converged well — none overfit, none underfit catastrophically.

loss_curve.svg: train and validation loss curves

No overfitting detected. Train/val gap is 0.0057 — the model generalizes to unseen text.

Run 3 — Tesla T4 output

head_to_head.md

| Metric        | T4 ⭐    | Colab    | CPU      |
|---------------|---------|----------|----------|
| Parameters    | 1.99M   | 10.82M   | 0.82M    |
| Best Val Loss | 0.9250  | 0.7176   | 1.3145   |
| Training Time | 6.1 min | 61.3 min | 39.4 min |
Key Observation

All three runs were still improving at the final checkpoint: validation loss had not yet plateaued. More training steps or more data would help every run.

Where Quadtrix Sits

The Chinchilla (2022) scaling law says ~20 tokens of training data per parameter is optimal. Here's how our runs align with the frontier (the arithmetic behind the Coverage column is sketched after the table).

// Chinchilla Coverage

| Model         | Params | Data  | Coverage (of 20 tok/param) |
|---------------|--------|-------|----------------------------|
| Run 1 — CPU   | 0.82M  | 200K  | 1.2%                       |
| Run 3 — T4 ⭐  | 1.99M  | 28.3M | 71.1%                      |
| Run 2 — Colab | 10.82M | 79.6M | 36.8%                      |
| GPT-2 Small   | 117M   | 40B   | 1700%                      |
| GPT-3         | 175B   | 600B  | 17%                        |
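
The Coverage column is simply each run's training data divided by the Chinchilla-optimal budget of ~20 tokens per parameter:

```python
def chinchilla_coverage(params: float, tokens: float) -> float:
    # Fraction of the ~20 tokens-per-parameter budget actually covered.
    return tokens / (20 * params)

print(f"{chinchilla_coverage(1.99e6, 28.3e6):.1%}")   # Run 3 (T4):    ~71.1%
print(f"{chinchilla_coverage(10.82e6, 79.6e6):.1%}")  # Run 2 (Colab): ~36.8%
print(f"{chinchilla_coverage(175e9, 600e9):.1%}")     # GPT-3:         ~17%
```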

// Scaling Law Insights

  1. The Chinchilla Scaling Law
     DeepMind's 2022 paper established the optimal ratio: ~20 training tokens per parameter. Undertrained models waste compute; overtrained models waste data.
  2. Run 3 Is the Sweet Spot
     At 71% of Chinchilla-optimal coverage, Run 3 has the best parameter-to-data ratio and learns the most per unit of compute of the three runs.
  3. GPT-3 Violates Chinchilla
     GPT-3 (175B params) was trained on only 600B tokens — ~17% of what Chinchilla recommends. This is why smaller Chinchilla-trained models often match GPT-3.
  4. What to Do Next
     All runs would benefit from more training steps first. Only after data coverage exceeds ~50% should you scale up model size.

How Generation Works

Once training finishes, best_model.pt contains frozen weights. Generation is a simple loop — predict, sample, feed back.

generate.py
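
A minimal sketch of that loop, assuming a trained model whose forward pass returns next-character logits, plus the stoi/itos tables and block_size from the earlier sketches (the actual script may structure this differently):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, prompt="Once upon", max_new_tokens=500, temperature=1.0):
    idx = torch.tensor([[stoi[c] for c in prompt]], dtype=torch.long)
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]           # crop to the context window
        logits = model(idx_cond)                  # (1, T, vocab_size)
        logits = logits[:, -1, :] / temperature   # keep last position, scale by temperature
        probs = F.softmax(logits, dim=-1)         # logits -> probabilities
        next_idx = torch.multinomial(probs, 1)    # sample the next character
        idx = torch.cat([idx, next_idx], dim=1)   # feed it back in
    return "".join(itos[int(i)] for i in idx[0])

print(generate(model))
```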

// Temperature Effect

output — temperature=1.0

Reproducible generation

Same weights + different random seed = different output. Add torch.manual_seed(42) for reproducible results.
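
With the generate sketch above, temperature rescales the logits before softmax (lower values sharpen the distribution, higher values flatten it), and seeding makes sampling repeatable:

```python
torch.manual_seed(42)                       # same seed + same weights -> same output
print(generate(model, temperature=0.8))     # sharper, more conservative sampling
print(generate(model, temperature=1.2))     # flatter, more adventurous sampling
```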

// Known Limitations

Character-level trade-off
Learns characters, not words. Can't reliably spell or track meaning across paragraphs.

📉 Output coherence
Sentences drift logically, names disappear, tense breaks — expected at this scale.

All models undertrained
Val loss was still improving at the final checkpoint. More steps would help all three.

📦 Limited data
Run 2 is only at 37% of optimal data coverage. A larger story corpus would help significantly.

🔭 No long-range memory
Fixed context window (128–256 tokens). Cannot reference events from earlier in the story.

🔤 No subword tokenization
'fox' = 3 character tokens vs. 1 word token. Context fills up faster, and semantics are harder to learn.

Hyperparameter Playground

Drag the sliders to design your own model. Parameter count, memory, Chinchilla coverage, and training time update instantly.

// hyperparameters.py

n_embd 200        Embedding dimension: width of every vector in the model (range 64–512)
n_head 4          Number of parallel attention heads; must divide n_embd (range 1–8)
n_layer 4         Number of stacked transformer blocks, i.e. depth (range 1–12)
block_size 128    Context window: characters seen per forward pass (range 64–512)
batch_size 64     Sequences processed in parallel per gradient step (range 8–128)
max_iters 5000    Total gradient update steps during training (range 500–20k)
Parameters           1.99M        ~1.99M trainable weights
Model Memory         7.6 MB       float32 weights only
Training Memory      ~380 MB      weights + grads + activations
Total Tokens Seen    40.9M        batch × block × iters
Chinchilla Coverage  71%          of the 20 tokens/param ideal
FLOPs (est.)         490 TFLOPs   6 × params × tokens
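
These readouts follow standard back-of-the-envelope rules. Here is a sketch of the arithmetic for the current slider values; the vocab size and corpus size are assumptions (a typical char-level vocab, the Run 3 corpus), and the exact parameter count depends on how transformer.py lays out its layers:

```python
# Rough estimates behind the playground readouts (not exact).
vocab_size = 65                         # assumption: typical character-level vocab
dataset_tokens = 28.3e6                 # assumption: size of the Run 3 story corpus
n_embd, n_head, n_layer = 200, 4, 4
block_size, batch_size, max_iters = 128, 64, 5000

params = (vocab_size * n_embd           # token embedding table
          + block_size * n_embd         # positional embedding table
          + n_layer * 12 * n_embd**2    # per block: attention ~4·d² + MLP ~8·d² weights
          + n_embd * vocab_size)        # output projection (if untied)

weight_mem_mb = params * 4 / 1e6            # float32: 4 bytes per weight
tokens_seen = batch_size * block_size * max_iters
coverage = dataset_tokens / (20 * params)   # coverage counts unique data, not repeats
flops = 6 * params * tokens_seen            # ~6 FLOPs per parameter per token

print(f"params ≈ {params / 1e6:.2f}M, weights ≈ {weight_mem_mb:.1f} MB")
print(f"tokens seen = {tokens_seen / 1e6:.1f}M, coverage ≈ {coverage:.0%}")
print(f"FLOPs ≈ {flops / 1e12:.0f} TFLOPs")
```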

// Estimated Training Time

⚡ Tesla T4      6.1 min   (~8 TFLOP/s)
🔬 Colab GPU     7.2 min   (~6.5 TFLOP/s)
💻 CPU (Ryzen)   39.4 min  (~100 GFLOP/s)
// vs. Actual Runs (parameter count)

Your model ✦     1.99M
Run 1 — CPU      0.82M
Run 3 — T4 ⭐     1.99M
Run 2 — Colab    10.82M
your_model_config.py
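
A hypothetical sketch of what the exported file would contain for the slider values above (dropout and learning_rate taken from the defaults in hyperparameters.py; the real export may differ):

```python
# your_model_config.py (hypothetical export of the playground settings)
n_embd        = 200
n_head        = 4
n_layer       = 4
block_size    = 128
batch_size    = 64
max_iters     = 5000
dropout       = 0.2
learning_rate = 3e-4
```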


Get Running in Minutes

One script. One dependency. No cloud account, no credentials, no pipeline.

// Installation Steps

01  Clone or download
    git clone https://github.com/Eamon2009/Transformer-language-model

02  Install PyTorch
    pip install torch

03  Add your data
    # Place any UTF-8 text file at data.txt
    # (or edit the filename in transformer.py)

04  Run training
    python transformer.py

05  Watch it learn
    [   0/5000] train=4.6207 val=4.6202 << best!
    [ 200/5000] train=2.2058 val=2.1986 << best!
    [ 400/5000] train=1.6111 val=1.6039 << best!
    ...
    [DONE] Training finished in 367.0s | Best val loss: 0.9250

// Choose Your Config

GPU — Tesla T4
⭐ Recommended · 6.1 min · 1.99M params

project structure
transformer.py    ← Everything. One file.
best_model.pt     ← Saved weights (after first run)
data.txt          ← Your text (any UTF-8 file)

No config files. Edit hyperparameters directly in the script.

After training finishes...

The script loads best_model.pt and generates text indefinitely. Press Ctrl+C to stop. Output differs each run because sampling is random.