Tokenization: The Step That Happens Before AI Reads Anything

When you type a message into an AI tool, your words don't arrive as words.

They arrive as tokens, which are the units the model actually works with. Tokenization is the process of converting raw text into those units, and it happens before any of the more visible AI processing begins. It's infrastructure, in the same way that loading a document before you can read it is infrastructure. Invisible, taken for granted, and consequential in ways that only become apparent when you understand what's happening.

A token is roughly a word or a word fragment, though the mapping is imprecise in ways worth knowing. Common short words like "the," "is," and "and" typically map to a single token each. Longer words often split into two or more tokens: "tokenization" might become "token" and "ization." Punctuation and line breaks consume tokens too, and in many tokenizers a leading space is folded into the token for the word that follows it. Numbers, code, and text in languages other than English tend to tokenize differently, sometimes less efficiently, which is one reason AI models handle some languages and content types better than others.
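One way to see why non-English text can be less efficient is to look at raw byte counts. The sketch below assumes a byte-level tokenizer, meaning one that starts from UTF-8 bytes before any merging; not every model works this way, and the function name is made up for illustration. It only shows how many base units a string starts with, not how many tokens it ends up as.

```python
def utf8_units(text):
    """Return (character count, UTF-8 byte count) for a string.

    Assumption: a byte-level tokenizer starts from these bytes, so scripts
    that need more bytes per character start with more units to merge.
    """
    return len(text), len(text.encode("utf-8"))

english = utf8_units("hello")       # 5 characters, 5 bytes
japanese = utf8_units("こんにちは")  # 5 characters, 15 bytes

print("English :", english)
print("Japanese:", japanese)
```

Both strings are five characters long, but the Japanese greeting starts out as three times as many bytes. A tokenizer trained mostly on English text will have learned fewer merges to compress those bytes back down.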

The component that performs tokenization is called a tokenizer, and the set of tokens it can produce is called the model's vocabulary, which is fixed at training time. One of the most common approaches is byte pair encoding, or BPE, which builds a vocabulary by iteratively merging the most frequent adjacent pairs of characters or character sequences in the training data. The result is a vocabulary of subword units that balances coverage (no text should be unrepresentable) against efficiency (common words and word fragments should be single tokens).
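The merging loop is easier to follow in code. What follows is a minimal, toy sketch of BPE training on a handful of words, not the implementation any particular model uses; the function name `train_bpe` and the tiny corpus are illustrative assumptions, and production tokenizers add details such as byte-level fallbacks and pre-tokenization rules.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each distinct word starts as a tuple of single characters, with a count.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus, replacing each occurrence of the best pair
        # with one merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = train_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(dict(corpus))
```

After two merges on this corpus, the frequent word "low" has become a single symbol, while "lower" and "lowest" are "low" followed by leftover characters. That is the efficiency-versus-coverage trade-off in miniature.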

Why does any of this matter in practice? Several reasons.

Token counts are how AI models measure length. When a model has a context window of 128,000 tokens, that limit is in tokens, not words. A rough rule of thumb is that one token equals about three to four characters of English text, or approximately 75 words per 100 tokens. But that ratio shifts depending on what you're processing. Code tokenizes differently than prose. Non-English text often uses more tokens per word. Dense technical content with unusual vocabulary may tokenize less efficiently than everyday language. If you're building applications where context window limits matter, understanding that the limit is in tokens rather than words, and that token density varies, is a practical necessity rather than background knowledge.
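The rule of thumb can be turned into a quick budget check. This is a rough sketch only: it uses the three-to-four-characters-per-token range for English prose, the function names are hypothetical, and `reserved_for_output` is an illustrative allowance for the model's reply. For anything that matters, count with the model's actual tokenizer.

```python
import math

def estimate_token_range(text):
    """Return a (low, high) token estimate from character count.

    Assumes English prose at roughly 4 characters per token (low estimate)
    down to 3 characters per token (high estimate).
    """
    n_chars = len(text)
    return math.ceil(n_chars / 4), math.ceil(n_chars / 3)

def fits_in_context(text, context_window=128_000, reserved_for_output=4_000):
    """Conservative check: use the high estimate and leave room for output."""
    _, high = estimate_token_range(text)
    return high + reserved_for_output <= context_window

sample = "a" * 1200  # stand-in for 1,200 characters of prose
print(estimate_token_range(sample))
print(fits_in_context(sample))
```

Using the high end of the range is a deliberate choice: code, non-English text, and dense technical content tend to use more tokens than the rule of thumb predicts, so an estimate should err on the side of too many.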

Token counts also affect cost. Most commercial AI APIs price by token, counting both the tokens in your input and the tokens in the model's output. A prompt that includes a large document, a detailed system instruction, and a conversation history may consume far more tokens than a simple question, with corresponding cost implications. Knowing how your content tokenizes helps you design prompts and pipelines that are efficient without sacrificing the context the model needs to do good work.
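The arithmetic is simple enough to sketch. The prices and token counts below are hypothetical placeholders, not any provider's actual rates, and `estimate_cost` is an illustrative helper; real pricing varies by model and often differs between input and output tokens.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Return the estimated cost of one request, in the currency of the prices."""
    input_cost = input_tokens * input_price_per_million / 1_000_000
    output_cost = output_tokens * output_price_per_million / 1_000_000
    return input_cost + output_cost

# Hypothetical token counts for the parts of one prompt.
prompt_parts = {
    "system_instruction": 800,
    "document": 40_000,
    "conversation_history": 6_000,
    "question": 50,
}
input_tokens = sum(prompt_parts.values())

# Hypothetical prices per million tokens.
cost = estimate_cost(input_tokens, output_tokens=1_000,
                     input_price_per_million=2.0,
                     output_price_per_million=8.0)
print(f"{input_tokens} input tokens, estimated cost {cost:.4f}")
```

In this made-up request the attached document accounts for most of the input tokens, which is usually where prompt trimming pays off first.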

There's also a subtler effect on model behavior. Because tokenization splits text at subword boundaries rather than at word boundaries, the model's view of a word depends on how that word was tokenized. A word that maps to a single token is represented differently from one that is split across several tokens, even when the two look nearly identical to a human reader. This is one of the reasons AI models can behave unexpectedly on unusual proper nouns, technical jargon, or misspellings: the tokenization of unfamiliar text is less predictable, and the model has less training signal for those token sequences.
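A toy example shows how a small spelling change can alter what the model receives. The sketch below uses greedy longest-match splitting against a tiny made-up vocabulary. That is a simplification: real BPE tokenizers apply learned merge rules rather than longest match, and `TOY_VOCAB` is invented for illustration.

```python
# A made-up vocabulary; real vocabularies hold tens of thousands of entries.
TOY_VOCAB = {"tokenization", "token", "ization", "ation"}

def tokenize(word, vocab):
    """Split `word` greedily into the longest pieces found in `vocab`.

    Single characters are always accepted as a fallback, so every word
    can be represented even when no longer piece matches.
    """
    tokens, i = [], 0
    while i < len(word):
        # Try the longest candidate first, shrinking down to one character.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

print(tokenize("tokenization", TOY_VOCAB))  # one piece
print(tokenize("tokenisation", TOY_VOCAB))  # several pieces
```

With this vocabulary the familiar spelling comes through as a single piece, while the variant spelling breaks into "token", "i", "s", and "ation". The model sees two quite different sequences for what a reader would call the same word.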

Tokenization is one of those foundational concepts that rewards understanding early. It explains why context windows are measured in tokens rather than words, why AI costs are token-based, and why some content types are handled more reliably than others. The tokens piece elsewhere in this blog is a useful companion read. The model doesn't see your text. It sees a sequence of integers, each representing a token from its vocabulary. Everything else follows from that.