You use ChatGPT every day. You have a rough idea that it involves AI and neural networks. But if someone asked you to explain — from first principles — how a large language model is actually designed, trained, and why the transformer architecture matters, could you give a coherent answer?
This is that explanation. Written for an engineer who has a degree, has coded before, but has never worked in machine learning. No jargon without definition. No hand-waving.
Part 1: The Neural Network — What It Actually Is
Before transformers, we need to understand the thing transformers replaced: the plain neural network, also called a feedforward neural network.
Imagine you have a spreadsheet of data about apartments: area in square feet, number of bedrooms, distance to the nearest MRT, age of the building. You want to predict monthly rent. This is a classic regression problem: input numbers in, output a number out.
A neural network solves this by stacking layers of simple computational units called neurons. Each neuron takes in a set of numbers, multiplies each by a weight, sums the results, adds a constant called a bias, and passes the total through an activation function: a simple non-linear function (like ReLU: if negative, output 0; if positive, output the number itself) that allows the network to learn non-linear relationships.
A network with one hidden layer looks like this:
Input layer: your 4 apartment features → passed to hidden layer neurons.
Hidden layer: each neuron applies weights + activation to the inputs → passes results to output layer.
Output layer: combines the hidden layer signals → produces a predicted rent number.
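The three layers above can be sketched in a few lines of plain Python. This is a minimal illustration, not a trained model: the weights are random, the hidden layer size of 3 is arbitrary, and the apartment feature values are hypothetical numbers assumed to be already scaled to a similar range.

```python
import random

def relu(x):
    # ReLU activation: negative values become 0, positive values pass through.
    return x if x > 0 else 0.0

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs plus a bias term.
    return sum(x * w for x, w in zip(inputs, weights)) + bias

def forward(features, hidden_weights, hidden_biases, out_weights, out_bias):
    # Hidden layer: every neuron sees all input features, then applies ReLU.
    hidden = [relu(neuron(features, w, b))
              for w, b in zip(hidden_weights, hidden_biases)]
    # Output layer: a single neuron with no activation (regression output).
    return neuron(hidden, out_weights, out_bias)

random.seed(0)
n_inputs, n_hidden = 4, 3  # 4 apartment features, 3 hidden neurons (arbitrary)
hidden_weights = [[random.uniform(-1, 1) for _ in range(n_inputs)]
                  for _ in range(n_hidden)]
hidden_biases = [0.0] * n_hidden
out_weights = [random.uniform(-1, 1) for _ in range(n_hidden)]
out_bias = 0.0

# Hypothetical, pre-scaled features: area, bedrooms, distance to MRT, age.
apartment = [0.85, 2.0, 0.4, 1.2]
predicted_rent = forward(apartment, hidden_weights, hidden_biases,
                         out_weights, out_bias)
```

With random weights the prediction is meaningless; it only becomes useful after training adjusts the weights.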
The magic is in the weights. At initialization, the weights are random. You feed the network real data (say, a thousand apartments with known rents) and compute how wrong each prediction is using a loss function (for regression, typically mean squared error: the average of the squared differences between predicted and actual rent). Then you propagate that error backwards through the network to work out how much each weight contributed to it. This is called backpropagation. An optimizer, typically some form of gradient descent, then adjusts every weight slightly in the direction that reduces the error. You repeat this thousands of times (forward pass, compute loss, backpropagation, weight update) and the network gradually learns to predict rent accurately.
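The training loop can be illustrated with the smallest possible network: one neuron with one weight and one bias, and no activation. The data below is synthetic and hypothetical (rent = 3 × area + 1 in arbitrary units), and the gradients are written out by hand instead of being derived by a framework. It is a sketch of the loop's shape, not of how real networks are trained at scale.

```python
# Synthetic data: (area, rent) pairs following rent = 3 * area + 1.
data = [(x, 3.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]

w, b = 0.0, 0.0       # start from uninformed weights
learning_rate = 0.1   # how large each weight adjustment is

def mse(w, b):
    # Mean squared error between predictions and actual rents.
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

initial_loss = mse(w, b)

for step in range(2000):
    # "Backward pass": gradient of the MSE with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    # Weight update: move a small step against the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = mse(w, b)
```

After the loop, w and b end up very close to 3 and 1, the values used to generate the data, even though the code never states that rule explicitly.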
The key insight: the network never “knows” what rent is. It discovers the relationships between features and rent purely through adjusting numbers. It finds patterns in data without being explicitly told the rules.
Part 2: Why a Plain Neural Network Can’t Handle Language
Language has a problem that rent prediction doesn’t: sequence and context matter.
For rent, the order of your four features doesn’t matter. For language, “the cat sat on the mat” vs “the mat sat on the cat” — the same words, completely different meaning. Order is everything.
A plain neural network also can’t handle variable-length inputs gracefully, because its input layer has a fixed size. “The cat” is 2 tokens. “The cat sat on the mat and stared at the bird outside the window” is 14 tokens (treating each word as one token). The network architecture needs to handle sequences of arbitrary length.
There were two precursor architectures before the transformer solved these problems:
Recurrent Neural Networks (RNNs) — the idea: process the sentence one word at a time, maintaining a “hidden state” that accumulates context as it reads. Read “The”, update state. Read “cat”, update state. Read “sat”, update state — now the state contains some representation of “The cat sat.” RNNs can handle sequences of any length in theory.
The problem: RNNs are sequential. To process token n, you must first have processed tokens 1 through n-1. This makes training slow, because you can’t parallelize across the sequence. RNNs also suffer from vanishing gradients: as you backpropagate through many time steps, the gradient (the signal telling weights how to adjust) becomes extremely small, making it nearly impossible for the network to learn long-range dependencies. In a 50-word sentence, by the time you’re at word 50, word 1’s signal has effectively vanished. RNNs couldn’t reliably remember things from the beginning of a long sentence.
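The recurrent update can be sketched as a loop that carries a hidden state from one token to the next. Everything below is a toy assumption: the 2-dimensional embeddings, the weight matrices `w_hh` and `w_xh`, and the use of tanh as the activation are illustrative choices, not values from any trained model.

```python
import math

def rnn_step(hidden, token_vec, w_hh, w_xh):
    # New state mixes the previous state with the current token, then squashes.
    new_hidden = []
    for i in range(len(hidden)):
        from_state = sum(w_hh[i][j] * hidden[j] for j in range(len(hidden)))
        from_token = sum(w_xh[i][k] * token_vec[k] for k in range(len(token_vec)))
        new_hidden.append(math.tanh(from_state + from_token))
    return new_hidden

# Hypothetical 2-dimensional embeddings and fixed (untrained) weights.
embeddings = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "sat": [1.0, 1.0]}
w_hh = [[0.5, -0.2], [0.1, 0.3]]   # state -> state
w_xh = [[0.4, 0.1], [-0.3, 0.8]]   # token -> state

def encode(tokens):
    hidden = [0.0, 0.0]
    # Each step depends on the previous one, so this loop cannot be parallelized.
    for token in tokens:
        hidden = rnn_step(hidden, embeddings[token], w_hh, w_xh)
    return hidden

state_a = encode(["the", "cat", "sat"])
state_b = encode(["sat", "cat", "the"])  # same words, different order
```

The two final states differ, which shows that the hidden state is sensitive to word order. The loop also makes the bottleneck visible: step 3 cannot start until step 2 has finished.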
Long Short-Term Memory (LSTM) — an improvement on RNNs that introduced a “memory cell” with gates (input gate, forget gate, output gate) that let the network decide what to remember and what to forget. LSTMs partially solved the vanishing gradient problem and could handle somewhat longer sequences. They were the dominant NLP architecture from roughly 2014 to 2017.
But LSTMs still couldn’t truly handle long-range dependencies well, and they were still sequential. A 1,000-word document required processing word by word in order. Training was slow. Parallelization was limited. Researchers were stuck.
Part 3: The Transformer — The Architecture That Changed Everything
In 2017, a paper called “Attention Is All You Need” (Vaswani et al.), written by researchers at Google Brain, Google Research, and the University of Toronto, described the transformer architecture. It wasn’t a minor improvement over RNNs. It was a fundamentally different computational approach that addressed both major limitations at once: sequential processing and weak long-range dependencies.
The key innovation: self-attention. Instead of processing a sentence word by word, the transformer looks at the entire sentence all at once and measures how much each word relates to every other word.
The Attention Mechanism Explained
Attention sounds complicated but the core idea is intuitive. Consider the sentence: “The bank by the river has good fish and chips.”
When processing the word “bank”, the model needs to know whether this refers to a financial institution or the side of a river. The word “river” tells you it is the latter. Words like “fish” are weaker clues and could mislead on their own. The attention mechanism lets every word “query” every other word to find the relevant context.
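Technically, each word in the input gets three representations: a Query (Q), a Key (K), and a Value (V). Each is a vector computed from the word’s embedding using a learned weight matrix. The attention score between two words answers the question: how much does word A’s Query match word B’s Key? The scores for A against every word are normalized with a softmax so they sum to 1, and each word’s Value is weighted by its normalized score. The weighted sum becomes A’s new representation, so a high score means A “attends to” B’s information.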
This is done in parallel for all words. Every word simultaneously computes its relationship with every other word. The entire sentence is processed in one step — no sequential dependency, full parallelization on GPU hardware.
The formula is:
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
Where QK^T measures query-key similarity, √d_k (the square root of the key dimension) is a scaling factor that keeps the dot products from growing large, which would otherwise push the softmax into a region where its gradients are extremely small, and the softmax converts match scores into a probability distribution (how much to weight each value).
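The formula can be written out directly in plain Python to make each step visible. This is a minimal sketch for small lists of vectors: it skips the learned projections that produce Q, K, and V, and a real implementation would use batched matrix operations on a GPU. The example vectors are arbitrary.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability, then normalize to sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors (one per token)."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Query-key similarity, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        all_weights.append(weights)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs, all_weights

# Three tokens with arbitrary 2-dimensional vectors, used as Q, K, and V.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(X, X, X)
```

Each row of `weights` is one token's distribution over all tokens, and no row depends on another, which is why the whole computation parallelizes.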
Modern transformers use Multi-Head Attention: instead of computing one set of Q/K/V relationships, they compute multiple sets in parallel (the original transformer used 8 “heads”; GPT-3 uses 96). Each head can learn a different type of relationship: one head might learn subject-verb agreement, another might learn word proximity, another might learn semantic similarity. The outputs of all heads are concatenated and projected back to the original dimension.
The Feed-Forward Network (FFN)
Each transformer layer also contains a position-wise feed-forward network: a two-layer fully connected network applied independently to each token’s embedding. This is where the actual “reasoning” about features happens — the attention layer figures out what information to retrieve; the FFN transforms it.
Residual Connections and Layer Normalization
Two architectural details make transformers stable during training:
Residual connections (skip connections): the output of a layer is added to its input before passing to the next layer. This allows gradients to flow directly backward through the network during backpropagation, which mitigates the vanishing gradient problem even in very deep networks (dozens of layers). The original transformer has 6 layers in its encoder and 6 in its decoder; GPT-3 has 96 layers.
Layer normalization: rescales each token’s activation vector to have mean 0 and variance 1 across its features, then applies a learned scale and shift. This stabilizes training by preventing any layer’s activations from becoming too large or too small.
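Both details fit in a few lines. The sketch below follows the ordering used in the original transformer paper, LayerNorm(x + Sublayer(x)), and leaves out the learned scale and shift parameters for brevity. The `double` sublayer is a hypothetical stand-in for an attention or feed-forward block.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize one token's vector to mean 0 and (approximately) variance 1.
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + eps) for v in x]

def residual_block(x, sublayer):
    # Residual connection: add the sublayer's output to its own input,
    # then normalize the sum.
    return layer_norm([xi + si for xi, si in zip(x, sublayer(x))])

def double(vec):
    # Hypothetical stand-in for an attention or feed-forward sublayer.
    return [2.0 * v for v in vec]

token = [1.0, 2.0, 3.0, 4.0]
normalized = layer_norm(token)
block_output = residual_block(token, double)
```

Because the input is added back in, the block can fall back to passing its input through largely unchanged, which is part of why very deep stacks remain trainable.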
Positional Encoding
Self-attention treats the input as a bag of words: it has no inherent sense of word order. Since “the cat bit the dog” and “the dog bit the cat” contain identical words, the model needs position to matter. The transformer injects positional information by adding a positional encoding to each word embedding: a vector that encodes the word’s position in the sequence. The original paper used fixed sine and cosine functions of different frequencies, which are computed rather than learned. Many later models, including BERT and GPT-2, instead learn a position embedding during training. In both cases the encodings are added to the word embeddings before the first transformer layer.
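The sine and cosine scheme from the original transformer paper can be computed directly: even dimensions use sine, odd dimensions use cosine, and the frequency falls as the dimension index rises. The sketch below uses a 4-dimensional encoding purely to keep the output small; the constant 10000 is the one used in the paper.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position, as a list of d_model numbers."""
    encoding = []
    for i in range(0, d_model, 2):
        # Lower dimensions oscillate quickly, higher dimensions slowly.
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        if i + 1 < d_model:
            encoding.append(math.cos(angle))
    return encoding

def add_position(embedding, position):
    # The encoding is simply added to the token embedding, element by element.
    pe = positional_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]

pe_first = positional_encoding(0, 4)   # [0.0, 1.0, 0.0, 1.0]
pe_second = positional_encoding(1, 4)

# The same (hypothetical) embedding at two positions yields different inputs.
cat_at_1 = add_position([0.2, -0.1, 0.7, 0.3], 1)
cat_at_4 = add_position([0.2, -0.1, 0.7, 0.3], 4)
```

The same word at two different positions now enters the first layer as two different vectors, which is what lets attention tell “the cat bit the dog” apart from “the dog bit the cat”.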
Part 4: From Words to Numbers — Tokenization
A transformer doesn’t read text directly. Text is first converted to numbers through tokenization.
The simplest approach: assign each word in the vocabulary a unique integer. But English has ~50,000+ common words, and every rare word, proper noun, or compound form would be “unknown.” This is the out-of-vocabulary (OOV) problem.
Modern LLMs use Byte Pair Encoding (BPE) or its variants. BPE starts from individual characters (or bytes) and iteratively merges the most frequent adjacent pair of symbols in a corpus into a new symbol. The result: the vocabulary consists of both whole words (common ones) and sub-word fragments. “Transformer” might be tokenized as one token if common, or as [“transform”, “er”] if less common. “Unbelievably” might be [“un”, “believ”, “ably”]. GPT-4 uses a BPE vocabulary called cl100k_base, implemented in OpenAI’s open-source tiktoken library.
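The merge loop at the heart of BPE can be sketched on a toy corpus. The word frequencies below are hypothetical, and the sketch works on characters within words; production tokenizers operate on bytes, handle whitespace and special tokens, and run tens of thousands of merges.

```python
from collections import Counter

def most_frequent_pair(words):
    # Count every adjacent symbol pair, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    # Replace each occurrence of the pair with a single merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical corpus: each word split into characters, with its frequency.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

learned_symbols = {s for word in corpus for s in word}
```

After a few merges the frequent suffix "est" has become a single symbol, while rarer character sequences remain split. That is the mechanism behind common words becoming one token and rare words becoming several.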
The practical implication: different models have different vocabulary sizes, so the same text produces different token counts. For English, a common rule of thumb is that one token is about 0.75 words, so the token count is roughly 1.3x the word count. This matters because LLMs are priced and rate-limited by token count.
After tokenization, each token is mapped to a token embedding — a learned vector of fixed dimension (GPT-3: 12,288 dimensions; BERT-base: 768 dimensions). These embeddings are learned during training.
Part 5: How LLMs Are Actually Trained
Pre-Training: Language Modeling
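The core pre-training objective is deceptively simple: next-token prediction. Given a sequence of tokens, hide the token that comes next, ask the model to predict it, compute the loss, and update weights via backpropagation. In practice this is done at every position of a sequence at once: a causal mask stops each position from seeing the tokens after it, so one training sequence yields one prediction per token.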
Concretely: give the model “The capital of France is” and ask what comes next. The model outputs a probability distribution over its entire vocabulary. The correct answer is “Paris” (more precisely, the token ID for “Paris”). Compute cross-entropy loss between the prediction and the correct answer. Backpropagate. Update weights. Repeat across hundreds of billions to trillions of tokens.
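The loss computation for a single prediction is short enough to write out. The four-word vocabulary and the raw scores (logits) below are hypothetical; a real model scores tens of thousands of tokens at once.

```python
import math

def softmax(logits):
    # Convert raw scores into probabilities that sum to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    # Negative log of the probability the model assigned to the correct token.
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical tiny vocabulary and model scores for "The capital of France is".
vocab = ["Paris", "London", "banana", "the"]
logits = [4.0, 2.0, -1.0, 0.5]

loss_if_paris = cross_entropy(logits, vocab.index("Paris"))
loss_if_banana = cross_entropy(logits, vocab.index("banana"))
```

The loss is small when the model puts high probability on the correct token and large when it does not. Training pushes the weights toward the first case.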
This is why LLMs are often called “autoregressive”: they generate text by repeatedly predicting the next token, appending it to the input, and predicting again. Each new token conditions on all previous tokens.
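The generation loop itself is simple. In the sketch below, `toy_model` is a hypothetical stand-in that looks only at the last token and reads from a hand-written probability table; a real LLM conditions on the whole sequence and usually samples from the distribution instead of always taking the most likely token.

```python
# Hand-written next-token probabilities (purely illustrative).
next_token_probs = {
    "<start>": {"the": 0.9, "cat": 0.1},
    "the": {"cat": 0.7, "mat": 0.3},
    "cat": {"sat": 0.8, "<end>": 0.2},
    "sat": {"<end>": 1.0},
    "mat": {"<end>": 1.0},
}

def toy_model(tokens):
    # Stand-in for a transformer: returns a distribution over the next token.
    return next_token_probs[tokens[-1]]

def generate(prompt, max_new_tokens=10):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = toy_model(tokens)
        next_token = max(probs, key=probs.get)  # greedy: pick the most likely
        if next_token == "<end>":
            break
        tokens.append(next_token)  # the output becomes part of the next input
    return tokens

generated = generate(["<start>"])  # ['<start>', 'the', 'cat', 'sat']
```

Each pass through the loop runs the model once, which is why generating a long response takes noticeably longer than reading a long prompt.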
The training data is vast. GPT-3 was trained on roughly 300 billion tokens drawn from filtered web crawl data (Common Crawl), curated web text, books, and English Wikipedia. The model sees the equivalent of millions of books, Wikipedia articles, code repositories, scientific papers, and web forums. The diversity and scale of this data are what give the model its broad capabilities.
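Training GPT-3 on this data is estimated to have cost roughly $4-12 million in compute. These are third-party estimates; the 2020 paper reports the compute used, not a dollar figure. The cost is dominated by hardware time: GPT-3 was trained on a large cluster of NVIDIA V100 GPUs, and models of this scale tie up thousands of GPUs for weeks to months.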
Post-Training: Making Models Useful
A raw, pre-trained transformer generates text by statistically predicting what comes next. It is coherent and grammatically correct. It is also not aligned with what a human wants. It will answer questions, but it might lie confidently, produce harmful content, or simply not follow instructions well.
Post-training fixes this through two main techniques:
Supervised Fine-Tuning (SFT): Human contractors write example conversations — instruction/response pairs. The model is fine-tuned on this curated dataset to learn the format and style of helpful responses. This is relatively straightforward supervised learning: gradient descent on human-written examples.
Reinforcement Learning from Human Feedback (RLHF): This is the more complex step. First, multiple model responses to the same prompt are ranked by human labelers. This preference data trains a reward model: a neural network that takes a (prompt, response) pair and outputs a scalar score reflecting how much a human would prefer that response. Then, the fine-tuned language model from the SFT step is trained further using reinforcement learning (commonly the PPO algorithm) against the reward model, maximizing the reward signal rather than just predicting the next token on human-written text.
RLHF is what transforms a raw statistical language model into a helpful assistant. It is also expensive, complex, and imperfect — reward hacking (where the model finds ways to score highly on the reward model without genuinely being helpful) is a real problem.
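The reward-model training step can be made concrete with the loss applied to one ranked pair. The text above does not specify a loss; the pairwise formulation below, -log(sigmoid(r_chosen - r_rejected)), is a commonly used choice and is an assumption here. The reward values are hypothetical scalars standing in for the reward model's outputs.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(reward_chosen - reward_rejected)), computed stably."""
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    # Equivalent form that avoids overflow when diff is very negative.
    return -diff + math.log1p(math.exp(diff))

# Reward model agrees with the human ranking: low loss.
loss_agrees = pairwise_preference_loss(2.0, -1.0)
# Reward model scores the rejected response higher: high loss.
loss_disagrees = pairwise_preference_loss(-1.0, 2.0)
```

Minimizing this loss over many ranked pairs pushes the reward model to score preferred responses higher than rejected ones. It learns relative preference only, which is one reason reward hacking is possible.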
Part 6: Scaling Laws — Why Bigger Is Better
One of the most important empirical discoveries in LLM development is scaling laws. In 2020, OpenAI published a paper (Kaplan et al.) showing that model performance improves predictably and smoothly with three factors: number of parameters, amount of training data, and compute budget — all following power-law relationships.
More parameters (weights) allow the model to store more factual knowledge, learn more complex reasoning patterns, and represent more nuanced relationships in language. More training data amplifies the benefit of more parameters. More compute (GPU-hours) allows training on more data.
The Chinchilla scaling laws (Hoffmann et al., 2022) refined this: for a given compute budget, the optimal strategy is to scale parameters and training tokens proportionally. Chinchilla (70B parameters) outperformed GPT-3 (175B parameters) despite being smaller, because it was trained on far more tokens (about 1.4 trillion versus about 300 billion). The resulting rule of thumb is roughly 20 training tokens per parameter, so a compute-optimal model in the 100B-200B parameter range calls for roughly 2-4 trillion training tokens.
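The arithmetic behind these comparisons is easy to reproduce. The sketch below relies on two widely used approximations that are assumptions here, not exact laws: training compute is about 6 × parameters × tokens (in floating-point operations), and the Chinchilla-optimal token count is about 20 tokens per parameter.

```python
def training_flops(n_params, n_tokens):
    # Common approximation: ~6 floating-point operations per parameter per token.
    return 6 * n_params * n_tokens

def compute_optimal_tokens(n_params, tokens_per_param=20):
    # Chinchilla-style rule of thumb: ~20 training tokens per parameter.
    return tokens_per_param * n_params

gpt3_flops = training_flops(175e9, 300e9)        # ~3.15e23
chinchilla_flops = training_flops(70e9, 1.4e12)  # ~5.88e23

# Tokens a 175B-parameter model would want under the 20x rule: ~3.5e12.
tokens_for_175b = compute_optimal_tokens(175e9)
```

By this estimate, a compute-optimal 175B-parameter model would want around 3.5 trillion tokens, more than ten times what GPT-3 was trained on. That gap is what “undertrained” means in the Chinchilla sense.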
This is why building a frontier LLM is extraordinarily expensive. GPT-4’s training is estimated to have cost $100+ million. This creates a significant barrier to entry and is why only a handful of organizations globally can train truly frontier models.
Part 7: The Modern LLM Architecture — GPT-4 as the Reference Design
Modern frontier models like GPT-4 are built on the transformer architecture with significant engineering on top:
Mixture of Experts (MoE): Rather than activating all parameters for every token, MoE models have many “expert” feed-forward networks (dozens or hundreds), and a lightweight router that selects a small number of experts (often 1 or 2) to process each token. The result: a model can have trillions of total parameters while activating only tens of billions per token, dramatically reducing inference cost while maintaining model capacity. GPT-4 is rumored to use an MoE architecture.
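The routing idea can be sketched with four toy experts and top-2 selection. Everything here is illustrative: the “experts” just scale their input, and the router scores are passed in directly, whereas a real router computes them from the token with a small learned layer.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, k=2):
    """Pick the k highest-scoring experts and normalize their gate weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    gates = softmax([router_logits[i] for i in top])
    return list(zip(top, gates))

def moe_layer(token, experts, router_logits, k=2):
    # Only the selected experts run; the rest cost nothing for this token.
    output = [0.0] * len(token)
    for index, gate in route(router_logits, k):
        expert_out = experts[index](token)
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output

def make_expert(scale):
    # Hypothetical expert: a stand-in for a full feed-forward network.
    return lambda vec: [scale * x for x in vec]

experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 3.0, -1.0, 2.0]  # hypothetical router scores for one token

chosen = route(router_logits)
mixed = moe_layer([1.0, 1.0], experts, router_logits)
```

Here experts 1 and 3 are selected and the other two are never called. The same pattern at scale is what keeps per-token compute low even when total parameter count is very large.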
Context Length and Attention Efficiency: Standard self-attention scales quadratically with context length (O(n²) in sequence length). A 128K context window requires the model to relate each of 128,000 tokens to every other token, so compute and, in a naive implementation, memory grow as the square of the sequence length. Modern models use techniques such as FlashAttention, which computes exact attention without storing the full score matrix, and grouped query attention (GQA), which shares keys and values across groups of heads to shrink the memory needed at inference time, at little cost in quality.
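A quick calculation shows why the quadratic term matters. The sketch assumes a naive implementation that stores the full score matrix at 2 bytes per score (16-bit), and it counts a single attention head in a single layer.

```python
def attention_matrix_bytes(seq_len, bytes_per_score=2):
    # One score for every (query token, key token) pair.
    return seq_len * seq_len * bytes_per_score

short_context = attention_matrix_bytes(4_000)    # 32,000,000 bytes (~32 MB)
long_context = attention_matrix_bytes(128_000)   # 32,768,000,000 bytes (~32.8 GB)

# Doubling the context length quadruples the matrix size.
growth = attention_matrix_bytes(8_000) / attention_matrix_bytes(4_000)  # 4.0
```

Multiplied across every head and every layer, storing these matrices quickly becomes impractical, which is the motivation for methods that avoid materializing the full matrix.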
Quantization: At 16-bit precision, the weights of a 100B-parameter model take about 200GB; at 32-bit, about 400GB, and larger models need proportionally more. That is beyond what any single consumer GPU can hold. Quantization reduces weights to 8-bit or even 4-bit integers, dramatically reducing memory requirements, often with only a small quality loss, enabling deployment on smaller hardware.
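The memory arithmetic and the basic idea of quantization can both be sketched briefly. The scheme below is simple symmetric 8-bit quantization with one scale for the whole list, chosen for clarity; it is an assumption for illustration, and production methods typically use per-group scales and more careful rounding. The weight values are hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight):
    # Bits -> bytes -> gigabytes (1 GB = 1e9 bytes).
    return n_params * bits_per_weight / 8 / 1e9

def quantize_int8(weights):
    # Map the largest absolute weight to 127 and round everything else.
    scale = (max(abs(w) for w in weights) / 127) or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    # Recover approximate floating-point weights.
    return [q * scale for q in quantized]

memory_16bit = weight_memory_gb(100e9, 16)  # 200 GB
memory_4bit = weight_memory_gb(100e9, 4)    # 50 GB

weights = [0.12, -0.5, 0.33, 0.0]           # hypothetical weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The restored weights differ from the originals by at most half a quantization step, which is the “small quality loss” traded for a large reduction in memory.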
Part 8: What All of This Means Practically
Understanding how LLMs work at this level changes how you think about what they can and cannot do.
LLMs predict the statistically most likely next token. This sounds reductive, but from simple token prediction, remarkably complex behaviors emerge: reasoning chains, code generation, translation, summarization, emotional tone detection. The hypothesis is that language itself — and the reasoning patterns embedded in language — is sufficiently structured that predicting the next token forces the model to build an internal representation of facts, logic, and world models. The model doesn’t “know” what “Paris” means the way you do — but it has absorbed enough statistical patterns about the word that it behaves as if it does.
The training objective is compression. One common framing: a language model is essentially a lossy compressor of human knowledge. The model’s weights are a compressed representation of the patterns in its training data. When you query the model, you are decompressing that knowledge in the form of generated text.
The model doesn’t know what it doesn’t know. Because the model is trained to produce statistically likely text — not to fact-check against ground truth — it generates text that sounds confident and well-formed even when factually incorrect. This is not a bug that can be fully patched. It is a consequence of the training objective. The RLHF process partially addresses it by making the model more cautious, but the fundamental issue remains.
Emergent capabilities are real and not fully understood. Small language models cannot reliably do multi-step arithmetic or answer complex inference questions. Above a certain parameter and training scale, these capabilities appear suddenly — the model begins reliably performing tasks it previously failed at. Why this happens and at what scale remain active research questions. The leading hypothesis: more parameters and more training data allow the model to store and compose more reasoning primitives, eventually reaching a threshold where complex tasks become statistically predictable.
The transformer architecture, introduced in a 2017 research paper with no particular fanfare, became the foundation of the most consequential AI systems of our time. Every ChatGPT conversation, every Copilot-assisted coding session, every Claude response — all of it runs on variations of the same core idea: let every word attend to every other word, in parallel, repeatedly, at massive scale. The simplicity of the core insight makes the sophistication of the resulting behavior all the more remarkable.
