Simon Bjørnøy

Engineer, writer, builder. Notes on tech and ideas.

Pre-training: An Overview

This is a flyover: a high-altitude tour of the pre-training pipeline. Each stage gets its own deep dive elsewhere in the series; this post exists to give those posts a shared vocabulary to refer back to. Pre-training is the process that turns a freshly initialized neural network into a base model that can produce plausible continuations of text. Everything else built on top of a large language model (instruction tuning, RLHF, deployment) assumes pre-training has already happened. ...

Tokens and Tokenization

Ask GPT-4 how many r’s are in “strawberry” and it will confidently say two. The right answer is three. This isn’t because the model can’t count. It’s because it never sees the letters at all. Every Large Language Model (LLM) starts with the same operation: text comes in, gets chopped into chunks called tokens, and those chunks become integer IDs that index into an embedding matrix. The chunks aren’t characters and they aren’t words. They’re something more specific, and the specificity matters more than most people realize. ...
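To make the text-to-tokens-to-IDs-to-embeddings path concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the vocabulary is a hand-picked toy list, the splitting rule is a simple greedy longest match (not the learned merge procedure real tokenizers use), and the embedding matrix is random rather than trained. The split it produces is not a claim about how GPT-4 actually tokenizes this word.

```python
import random

# Hypothetical toy vocabulary. Real tokenizers learn tens of thousands of
# subword pieces from data; these entries are invented for illustration.
VOCAB = ["str", "aw", "berry", "s", "t", "r", "a", "w", "b", "e", "y"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def tokenize(text):
    """Split text into vocabulary pieces using greedy longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOKEN_TO_ID:
                tokens.append(piece)
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[pos]!r}")
    return tokens


def encode(text):
    """Map text to the integer IDs of its tokens."""
    return [TOKEN_TO_ID[tok] for tok in tokenize(text)]


# Toy embedding matrix: one row of EMBED_DIM random floats per vocabulary entry.
# In a real model these rows are learned during pre-training.
EMBED_DIM = 4
_rng = random.Random(0)
EMBEDDINGS = [[_rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in VOCAB]


def embed(ids):
    """Look up one embedding row per token ID."""
    return [EMBEDDINGS[i] for i in ids]


tokens = tokenize("strawberry")
ids = encode("strawberry")
vectors = embed(ids)

print(tokens)        # ['str', 'aw', 'berry']
print(ids)           # [0, 1, 2]
print(len(vectors))  # 3 rows, one per token
```

With this toy vocabulary the model-facing input is three integers, not ten letters. The three r's are spread across "str" and "berry", and nothing in the IDs 0, 1, 2 says so.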