Tokenization

Ask GPT-4 how many r’s are in “strawberry” and it will often confidently say two. The right answer is three. This isn’t because the model can’t count. It’s because it never sees the letters at all. Every large language model (LLM) starts with the same operation: text comes in, gets chopped into chunks called tokens, and each chunk becomes an integer ID that indexes a row of an embedding matrix. The chunks aren’t characters and they aren’t words. They’re something more specific, and the specificity matters more than most people realize. ...
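To make that pipeline concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the vocabulary, the IDs, and the embedding values are made up, and the splitting rule is a simple greedy longest-match stand-in rather than the algorithm any production tokenizer uses. The point is only the shape of the operation: text, then chunks, then integer IDs, then rows of a matrix.

```python
# Hypothetical toy vocabulary: token string -> integer ID.
# Real vocabularies hold tens of thousands of entries; these are invented.
VOCAB = {
    "str": 0, "aw": 1, "berry": 2,
    "s": 3, "t": 4, "r": 5, "a": 6, "w": 7, "b": 8, "e": 9, "y": 10,
}

# Toy embedding matrix: one row per token ID. Values are arbitrary placeholders.
EMBED_DIM = 4
EMBEDDINGS = [
    [float(i * EMBED_DIM + j) for j in range(EMBED_DIM)]
    for i in range(len(VOCAB))
]


def tokenize(text):
    """Split text into vocabulary chunks by greedy longest match.

    This is a stand-in chosen for brevity, not how real tokenizers split text.
    """
    tokens = []
    max_len = max(len(tok) for tok in VOCAB)
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in VOCAB:
                tokens.append(piece)
                i += length
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens


def encode(text):
    """Map text to the list of integer token IDs."""
    return [VOCAB[tok] for tok in tokenize(text)]


def embed(ids):
    """Look up one embedding row per token ID."""
    return [EMBEDDINGS[token_id] for token_id in ids]


tokens = tokenize("strawberry")
ids = encode("strawberry")
vectors = embed(ids)

print(tokens)        # ['str', 'aw', 'berry']
print(ids)           # [0, 1, 2]
print(len(vectors))  # 3
```

With this toy vocabulary, what reaches the embedding lookup is the list `[0, 1, 2]`. The three r’s are spread across the chunks and are not visible as separate items anywhere in that list.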