Pre-training: An Overview

Pre-training is the process that turns a freshly initialized neural network into a base model that can produce plausible continuations of text. Everything else built on top of an LLM (instruction tuning, RLHF, deployment) assumes pre-training has already happened.

At altitude, the pipeline looks like this. Text comes in. It gets chopped into tokens. Tokens become vectors. The vectors flow through a stack of transformer blocks that let each position look at the positions before it. The top-layer vectors produce a probability distribution over the next token. That distribution gets compared to the actual next token in the corpus, producing a loss. The loss flows backward through the network, nudging billions of parameters in the direction that reduces it. Repeat for trillions of tokens.

The same picture as a diagram:

flowchart TB
    text["Raw text"]
    tokens["Token IDs"]
    embeddings["Embedding vectors<br/>(+ position info)"]
    block["Transformer block × N<br/>(self-attention + MLP)"]
    distribution["Next-token<br/>distribution"]
    loss["Cross-entropy loss"]
    backprop["Backprop + Optimizer"]

    text --> tokens
    tokens --> embeddings
    embeddings --> block
    block --> distribution
    distribution --> loss
    loss --> backprop
    backprop -.->|"update parameters"| block

The forward pass runs top to bottom. The dashed arrow back from the optimizer to the transformer stack is the training loop: every batch, the gradients computed from the loss are used to nudge the parameters, and the whole forward pass runs again. Each box maps to a section below, and each section has, or will have, its own deep-dive post (or set of posts) further down in the series.

Text becomes tokens

The model cannot read characters; it can only read integers. Tokenization is the function that turns a string into a sequence of integer IDs drawn from a fixed vocabulary, set once at training time.
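To make the string-to-integers step concrete, here is a minimal sketch. It assumes a toy word-level vocabulary and a whitespace split, both invented for illustration; production tokenizers learn a subword vocabulary instead, which the deep dive covers.

```python
# Toy vocabulary, fixed up front (hypothetical; real vocabularies are learned subword sets).
VOCAB = ["<unk>", "dog", "bites", "man", "the"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def encode(text: str) -> list[int]:
    """Map a string to integer IDs; unseen words fall back to <unk> (ID 0)."""
    return [TOKEN_TO_ID.get(word, 0) for word in text.lower().split()]


def decode(ids: list[int]) -> str:
    """Map integer IDs back to a space-joined string."""
    return " ".join(VOCAB[i] for i in ids)


ids = encode("The dog bites the man")
print(ids)          # [4, 1, 2, 4, 3]
print(decode(ids))  # the dog bites the man
```

The vocabulary never changes after it is built, so any word outside it collapses to the same `<unk>` ID in this sketch.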

Deep dive: Tokens and Tokenization.

Tokens become embeddings

Each token ID indexes into the embedding matrix, a $V \times d$ lookup table where $V$ is the vocabulary size and $d$ is the model’s hidden dimension. The row at index $i$ is the embedding vector for token $i$: a $d$-dimensional point in a learned semantic space. The integer sequence becomes a sequence of vectors.
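The lookup itself is just row indexing. A small sketch, assuming toy values for $V$ and $d$ and a randomly initialized matrix (in a real model the rows are learned during training):

```python
import random

V, d = 5, 4  # toy vocabulary size and hidden dimension (hypothetical values)
rng = random.Random(0)

# Embedding matrix: V rows, each a d-dimensional vector. Random here; learned in training.
embedding_matrix = [[rng.gauss(0.0, 0.02) for _ in range(d)] for _ in range(V)]


def embed(token_ids):
    """Look up one row of the embedding matrix per token ID."""
    return [embedding_matrix[i] for i in token_ids]


vectors = embed([4, 1, 2, 4, 3])
print(len(vectors), len(vectors[0]))  # 5 4
```

Note that the two occurrences of token 4 get the same vector: at this stage nothing distinguishes one position from another.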

Deep dive: coming soon.

Positions

Attention treats its input as a set, not a sequence. Without explicit position information, “dog bites man” and “man bites dog” look identical. Positional encodings inject information about where each token sits in the sequence, either added to the embeddings or baked into attention’s geometry.
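The set-like behavior can be checked directly. The sketch below is a stripped-down, unmasked self-attention with no learned weights (queries, keys, and values are all the raw input vectors, an assumption made purely to keep it short), and the three word vectors are made up. Reordering the input only reorders the output.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Unmasked self-attention with queries = keys = values = x (no learned weights)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


# Made-up vectors standing in for token embeddings, with no position information.
dog, bites, man = [1.0, 0.0, 0.5], [0.0, 1.0, 0.2], [0.3, 0.3, 1.0]

a = self_attention([dog, bites, man])  # "dog bites man"
b = self_attention([man, bites, dog])  # "man bites dog"

# The vector computed for "dog" is the same in both sentences, wherever it sits.
print(all(math.isclose(p, q) for p, q in zip(a[0], b[2])))  # True
```

Adding a position-dependent vector to each embedding before attention breaks this symmetry, which is the job positional encodings do.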

Deep dive: coming soon.

The transformer stack

The vectors flow through a stack of architecturally identical transformer blocks, each with its own weights. Each block runs self-attention (in a next-token model, every position looks at itself and the positions before it) followed by an MLP (applied to each position independently). Layer norms and residual connections hold the math stable across many layers. By the time the sequence reaches the top, each vector is a context-aware representation of its position given everything that came before it.
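A rough sketch of one block, under several simplifying assumptions that are not from this post: a single attention head, a causal mask (each position attends only to itself and earlier positions, as next-token prediction requires), layer norm applied before each sublayer, ReLU in the MLP, no learned norm parameters, and toy sizes. For brevity the same weights are reused for both layers of the stack; a real model gives each block its own.

```python
import math
import random

rng = random.Random(0)
d, d_ff = 4, 8  # toy hidden and MLP widths (hypothetical)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(v, W):
    """Row vector v (length n) times matrix W (n x m) -> vector of length m."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]


def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


Wq, Wk, Wv, Wo = (rand_matrix(d, d) for _ in range(4))
W1, W2 = rand_matrix(d, d_ff), rand_matrix(d_ff, d)


def causal_attention(x):
    q = [matvec(v, Wq) for v in x]
    k = [matvec(v, Wk) for v in x]
    val = [matvec(v, Wv) for v in x]
    out = []
    for t in range(len(x)):
        # Position t only sees positions 0..t.
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d) for s in range(t + 1)]
        w = softmax(scores)
        mixed = [sum(w[s] * val[s][j] for s in range(t + 1)) for j in range(d)]
        out.append(matvec(mixed, Wo))
    return out


def mlp(v):
    hidden = [max(0.0, h) for h in matvec(v, W1)]  # ReLU, chosen for simplicity
    return matvec(hidden, W2)


def block(x):
    attn = causal_attention([layer_norm(v) for v in x])
    x = [[a + b for a, b in zip(v, o)] for v, o in zip(x, attn)]     # residual around attention
    x = [[a + b for a, b in zip(v, mlp(layer_norm(v)))] for v in x]  # residual around MLP
    return x


x = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
for _ in range(2):  # N = 2 layers, sharing weights only to keep the sketch short
    x = block(x)
print(len(x), len(x[0]))  # 3 4
```

The shape going in equals the shape coming out, which is what lets blocks stack: three positions, each still a $d$-dimensional vector.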

Deep dive: coming soon.

From vector to next-token distribution

The final layer takes each vector through the output matrix (also $V \times d$, often weight-tied to the embedding matrix) and produces a $V$-dimensional vector of logits, raw scores over the vocabulary. A softmax turns those scores into a probability distribution. That is the model’s prediction for the next token at each position.
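In code, the step from one top-layer vector to a distribution is a matrix product followed by a softmax. The numbers below are made up ($V = 3$, $d = 2$) so the logits can be checked by hand.

```python
import math


def logits_from_hidden(h, output_matrix):
    """One logit per vocabulary entry: dot product of h with each row of the V x d matrix."""
    return [sum(hi * wi for hi, wi in zip(h, row)) for row in output_matrix]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy numbers (hypothetical): V = 3, d = 2.
output_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = [2.0, 1.0]

logits = logits_from_hidden(h, output_matrix)
probs = softmax(logits)
print(logits)                # [2.0, 1.0, 3.0]
print(round(sum(probs), 6))  # 1.0
```

The largest logit (token 2 here) gets the largest probability, but every token keeps some nonzero mass.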

Deep dive: coming soon.

The loss

Pre-training optimizes one objective: cross-entropy between the model’s predicted distribution and the actual next token in the corpus. For each position in the training data, the loss is $-\log p(\text{actual next token})$. Summed over the corpus (in practice, averaged over the positions in each batch), this is what the optimizer minimizes.
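A worked example of the per-position loss, using made-up distributions over a 3-token vocabulary. Taking the mean over positions is an assumption about how the per-position values are combined; it is the common choice within a batch.

```python
import math


def cross_entropy(probs, target_id):
    """Loss at one position: negative log-probability of the actual next token."""
    return -math.log(probs[target_id])


# Hypothetical predicted distributions at two positions over a 3-token vocabulary.
predictions = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
targets = [0, 2]  # the actual next tokens in the corpus

losses = [cross_entropy(p, t) for p, t in zip(predictions, targets)]
mean_loss = sum(losses) / len(losses)
print([round(l, 3) for l in losses])  # [0.357, 0.223]
```

The more probability the model puts on the token that actually came next, the closer the loss gets to zero.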

Deep dive: coming soon.

Backprop and the optimizer

The loss flows backward through every layer of the network, computing gradients with respect to every parameter via backpropagation. An optimizer (almost always AdamW for modern LLMs) uses those gradients to update parameters. One step of this loop processes a batch; pre-training runs hundreds of thousands to millions of steps.
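To show what "uses those gradients to update parameters" means, here is a sketch of the AdamW update rule on a one-parameter toy problem. The gradient is written by hand rather than computed by backpropagation, and the hyperparameter values are illustrative assumptions, not recommendations.

```python
import math


def adamw_step(params, grads, state, lr=0.1, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update on a flat list of parameters (modified in place)."""
    state["t"] += 1
    t = state["t"]
    for i, g in enumerate(grads):
        # Decoupled weight decay: shrink the parameter directly, not via the gradient.
        params[i] -= lr * weight_decay * params[i]
        # Running averages of the gradient and the squared gradient.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Bias correction for the zero-initialized averages.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


# Toy "loss": (w - 3)^2, whose gradient is 2 * (w - 3).
params = [0.0]
state = {"t": 0, "m": [0.0], "v": [0.0]}
for step in range(200):
    grads = [2 * (params[0] - 3.0)]
    adamw_step(params, grads, state)
print(params[0])  # expected to land near 3.0
```

In real pre-training the loop has the same shape: compute the loss on a batch, get gradients, take one optimizer step, repeat.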

Deep dive: coming soon.

Scale

Pre-training a modern LLM is a feat of scale: trillions of training tokens, thousands of GPUs, months of wall-clock time. The math does not change with scale; the engineering does. That engineering covers data pipelines that keep GPUs fed, parallelism schemes that split a model across thousands of devices, mixed-precision math, and checkpointing.

Deep dive: coming soon.

The rest of this phase of the series unpacks each stage in depth. Future deep-dive posts can refer back here when they need to gesture at “the model” or “the loss” or “the embedding matrix” without re-explaining them every time.
