Introduction
While we often speak about ‘the great convergence’ in machine learning as a convergence in architecture, it is more precisely a convergence on a shared block, and it is crucial to recognize that this convergence has not eliminated model variation entirely. Despite the shift from specialized architectures such as LSTMs, CNNs, and RNNs towards the transformer block, transformers themselves have varied significantly in both architecture and pre-training methodology. Early approaches included BERT-style encoder-only masked language modelling, RetroMAE-style masked auto-encoding, ELECTRA-style discriminative (replaced-token-detection) training, and T5-style span corruption. However, recent development has moved predominantly towards GPT-style next-token prediction, where embedding, vision, and image-generation architectures converge around a decoder-only autoregressive design, leading to shared block structures across LLMs, diffusion transformers (DiTs), and vision transformers (ViTs).
Inductive Biases
When discussing inductive biases, it is critical to acknowledge that all models inherently possess biases suited to specific data or tasks. CNNs exhibit translation equivariance, RNNs leverage Markovian dependencies, and modern LLMs introduce a different set of biases entirely. Modern transformer models predominantly rely on an inductive bias grounded in tokenization: they implicitly assume that text decomposes into sub-word units (tokens), typically derived from whitespace-aware pre-tokenization, over which statistical relationships (such as chi-squared co-occurrence statistics) can be computed and sequential next-token predictions can be made.
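As a minimal sketch of this bias in action, the snippet below runs a standard sub-word tokenizer over a short string; the `gpt2` tokenizer is used only because it is small and public, and any BPE-style tokenizer would show the same behaviour.

```python
from transformers import AutoTokenizer

# Any BPE-style tokenizer works here; gpt2 is just a small, public example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization is an inductive bias, not just preprocessing."
ids = tokenizer.encode(text)
tokens = tokenizer.convert_ids_to_tokens(ids)

# Whitespace-aware pre-tokenization plus BPE merges yields sub-word pieces:
# common words survive intact, rarer words are split into several pieces.
print(tokens)
print(f"{len(text)} characters -> {len(ids)} tokens")
```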
An illustrative example is observed in the embedding layers (`model.embed_tokens.weight`) and prediction heads (`lm_head.weight`) across model evolutions (a sketch for reproducing these figures follows the list):
- In ‘OPT 6.7B’ (released 3 years ago) [8], these layers comprised 🤏 6% of the model parameters,
- In ‘Meta-Llama-3-8B’ (released 11 months ago) [9], they comprised 📏 13% of the model parameters, and
- In ‘Gemma-2-2B’ (released 5 months ago) [10], they comprise 22% of the model parameters, even with weight-sharing!
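The percentages above can be approximated with a short script. This is a sketch, assuming a `transformers`-style model whose `get_input_embeddings()` / `get_output_embeddings()` expose the vocabulary matrices; the small `facebook/opt-125m` checkpoint is named purely so the example is cheap to run.

```python
from transformers import AutoModelForCausalLM

# A small checkpoint, chosen only so the sketch is cheap to run.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

total = sum(p.numel() for p in model.parameters())  # tied weights are counted once
embed = model.get_input_embeddings().weight
head = model.get_output_embeddings().weight

# If the embedding matrix and the LM head share storage (weight tying), count once.
vocab_params = embed.numel() if embed is head else embed.numel() + head.numel()

print(f"embedding + head: {100 * vocab_params / total:.1f}% of {total:,} parameters")
```

Run against the three checkpoints listed above (or their safetensors indices), the 6% / 13% / 22% progression largely tracks their growing vocabulary sizes (roughly 50k, 128k, and 256k entries respectively).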
The reason behind this steady increase, as shown by ByteDance’s analysis of over-tokenization [11], is simple: performance. The larger the token vocabulary, the shorter the tokenized context for a given input, and the lower the compute and latency for a given user query. More vocabulary entries mean fewer word-pieces per word, and a higher likelihood that, given a large enough pre-training corpus, the model will learn rare and domain-specific words outright, which it can use to generate more accurate and relevant responses, rather than relying on multi-head attention or other attention mechanisms to learn the combinatorial relationships between sub-word tokens.
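To see the trade-off concretely, the sketch below encodes the same domain-specific phrase with a small-vocabulary and a large-vocabulary tokenizer. The checkpoints named are only examples (the Llama 3 tokenizer is gated, so substitute any large-vocabulary tokenizer you have access to).

```python
from transformers import AutoTokenizer

text = "Pharmacokinetics of intravenous immunoglobulin therapy"

# ~50k-entry vocabulary vs ~128k-entry vocabulary; both are illustrative choices.
for name in ["gpt2", "meta-llama/Meta-Llama-3-8B"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok.encode(text, add_special_tokens=False)
    print(f"{name}: vocab size {len(tok)}, {len(ids)} tokens")
```

The larger vocabulary typically keeps rare clinical terms intact as single pieces, while the smaller one fragments them into several sub-words, inflating sequence length and generation cost.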
This is a crucial point for LLM product managers to understand: the tokenization process is not just a preprocessing step, but a fundamental part of the model’s architecture and a driver of its performance. The choice of tokenization method can significantly impact the model’s ability to understand and generate text, in line with key application or business use cases.
cls➡️instruct➡️user➡️think
| Token(s) | First Prominent Use | Key Literature / Release | Feature(s) Unlocked |
|---|---|---|---|
| `<cls>` | BERT (2018) | Devlin et al. [1] | Sequence-level tasks: sentiment, intent, similarity |
| `<instruct>` | InstructGPT (2022) | Ouyang et al. [2] | Instruction-following chatbots (ChatGPT) |
| `<user>` / `<assistant>` | ChatGPT (2022) | OpenAI blog [3] | Multi-turn, context-aware conversational agents |
| Image tokens (`<img>`…) | CLIP → Flamingo / LLaVA (2021-23) | Radford et al.; Alayrac et al. [4] | Multimodal search, VQA, GPT-4V |
| Code-aware delimiters | Codex / GPT-4 (2021-) | Chen et al. [5] | Copilot, code generation & refactoring |
| “Thinking” tokens (`<think>`) | Chain-of-Thought, R1 (2022-25) | Narang et al. [6] | Intermediate reasoning, improved math & planning |
| Tool-use tokens (`<tool>`…) | Toolformer, Gorilla (2023-24) | Schick et al.; Patil et al. [7] | Agents that call APIs, browse, execute code |
As models introduce new tokens, they unlock new features. For example, the `<cls>` token in BERT [1] enabled sequence-level tasks like sentiment analysis and intent classification. The `<instruct>` token in InstructGPT [2] allowed for instruction-following chatbots like ChatGPT [3]. The `<user>` and `<assistant>` tokens in ChatGPT enabled multi-turn, context-aware conversations. Image tokens in CLIP and Flamingo unlocked multimodal search and visual question answering. Code-aware delimiters in Codex and GPT-4 facilitated code generation and refactoring. “Thinking” tokens in R1 and Chain-of-Thought improved intermediate reasoning, math, and planning. Finally, tool-use tokens in Toolformer and Gorilla enabled agents to call APIs, browse the web, and execute code.
For GPT-style models, the space of input tokens defines the sensory landscape of the model, and hence the world-model that is developed. For text-only GPT-style pre-trained models, the token space covers only Unicode or ASCII characters, and the model is trained to predict the next token in a sequence. For these models the sensory landscape is limited to text, and the model’s world-model covers only the space of entities and actions which that sensory landscape can describe. Given that world-model, the model is able to embody an agent, but cannot interpret a meta-narrative about its purpose or behaviour.
With the introduction of InstructGPT, the `<instruct>` token, and the `<user>` and `<assistant>` tokens in ChatGPT, the sensory landscape of the model is expanded to include instructions and conversational context. This allows the model to understand and respond to user queries in a more meaningful way, effectively embodying an agent that can interpret a meta-narrative about its purpose and behaviour. With the introduction of these tokens and post-training, the world-model of the LLM is expanded to include the space of entities and actions that can be described by instructions and conversational context. This allows the model to perform tasks and support features that let it be used in a wider range of applications, such as chatbots, virtual assistants, copilots, and various agentic workflows.
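As a sketch of how these role tokens actually reach the model, the snippet below renders a conversation through a chat template; `HuggingFaceH4/zephyr-7b-beta` is only an example checkpoint, and each model defines its own exact role markers.

```python
from transformers import AutoTokenizer

# Example checkpoint; any chat-tuned model with a chat template works similarly.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarise our Q3 results in one sentence."},
]

# The chat template wraps each turn in the model's own role markers
# (<|system|>, <|user|>, <|assistant|> in this particular template)
# before the text is tokenized.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```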
The introduction of image tokens in CLIP and LLaVA further expands the sensory landscape of the model to include visual information, re-orienting the model’s world-model to include the space of entities and actions that visual information can describe. Because these images map onto a pre-defined token embedding space, a local minimum may be found that limits the model’s ability to understand and interpolate between colors, textures and shapes, as explored in ColorBench. However, with staged fine-tuning, the model is able to learn to interpolate between these tokens and adapt its internal representation to accommodate the new sensory landscape. Using this landscape, the model can perform tasks such as visual question answering and classification, and use that information to inform its responses in a conversational context. To users and product managers, this unlocks new features and applications, which may solve business problems in OCR, scene understanding, and image captioning.
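The mechanics can be sketched in a few lines: image patches are linearly projected into the same embedding space as the text tokens and spliced in where an image placeholder sits in the sequence. This is a conceptual sketch rather than any particular model’s implementation; the dimensions and splice position are arbitrary.

```python
import torch
import torch.nn as nn

d_model, patch, channels = 768, 16, 3

# Patchify-and-project: a strided convolution acts as the linear patch embedding.
patchify = nn.Conv2d(channels, d_model, kernel_size=patch, stride=patch)

image = torch.randn(1, channels, 224, 224)                  # dummy image
image_tokens = patchify(image).flatten(2).transpose(1, 2)   # (1, 196, 768)

text_embeddings = torch.randn(1, 10, d_model)               # dummy text token embeddings

# Splice the image tokens into the text sequence at the <img> placeholder position.
sequence = torch.cat(
    [text_embeddings[:, :5], image_tokens, text_embeddings[:, 5:]], dim=1
)
print(sequence.shape)  # torch.Size([1, 206, 768])
```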
Similar to the `<user>` and `<assistant>` tokens, `<think>` and `<tool>` look to develop a world-model that may be self-referential, allowing the model to understand and respond to its own reasoning process. The `<think>` token is used in R1 and Chain-of-Thought prompting to improve intermediate reasoning, math, and planning. This allows the model to break down complex tasks into smaller steps, improving its ability to solve problems and perform reasoning tasks. The `<tool>` token is used in Toolformer and Gorilla to enable agents to call APIs, browse the web, and execute code. This expands the model’s capabilities beyond text generation, allowing it to interact with external tools (or entities) and resources to perform tasks that require real-time information or complex computations.
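On the consumption side, a minimal sketch of how an application layer might handle these spans is shown below; the `<think>` / `<tool>` tag names simply mirror the table above, and real models each define their own exact markers and tool-call formats.

```python
import re

# Hypothetical model output using the tag names from the table above.
output = (
    "<think>The user wants the weather; I should call the weather tool.</think>"
    '<tool>{"name": "get_weather", "arguments": {"city": "Berlin"}}</tool>'
)

thoughts = re.findall(r"<think>(.*?)</think>", output, flags=re.DOTALL)
tool_calls = re.findall(r"<tool>(.*?)</tool>", output, flags=re.DOTALL)

# Reasoning spans are typically hidden from the end user; tool spans are parsed
# and dispatched to the corresponding API before generation continues.
print("hidden reasoning:", thoughts)
print("tool call payloads:", tool_calls)
```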
Throughout these developments, the introduction of new tokens serves to expand the sensory landscape of the model; each new landscape defines a new world-model that might understand objectness (`<user>`), shades (`<img>`), or reasoning (`<think>`). This new world-model allows the model to perform tasks and support features that were previously impossible, such as multi-turn conversations, visual question answering, and API interactions. As models continue to evolve, we can expect to see even more innovative uses of tokens to enhance their performance and capabilities.
Conclusion
Ultimately, the evolution of large language models has hinged fundamentally on advancements in tokenization. Each new token introduced reshapes the sensory landscape, redefining the model’s internal world-model and unlocking new functionality. From simple sequence-level embeddings to multimodal tokens and self-referential reasoning capabilities, tokens continue to shape LLMs’ developmental trajectory profoundly. Looking forward, innovations like Byte Latent Transformers [7] and Super-BPE suggest future tokens might transcend traditional limitations, driving further innovation in machine learning applications and expanding their business potential.
For LLM product managers, comprehending token-driven innovation is crucial—not only for recognizing present limitations but also for harnessing the full potential of emerging technological capabilities.
References
[1] CLS Embedding Pooling: https://blog.ml6.eu/the-art-of-pooling-embeddings-c56575114cf8
[2] InstructGPT: https://openai.com/index/instruction-following/
[3] ChatGPT: https://openai.com/index/chatgpt/
[4] R1: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B/raw/main/tokenizer.json#L53
[5] Octopus: https://huggingface.co/NexaAIDev/octo-net/raw/main/tokenizer.json#L143
[6] Prefix-finetuning: https://arxiv.org/abs/2101.00190
[7] Byte Latent Transformer: https://www.youtube.com/watch?v=loaTGpqfctI
[8] OPT 6.7B (this is an older model not safetensors so I point to a recent FT): https://huggingface.co/pkarypis/opt-6.7b-sft?show_file_info=model.safetensors.index.json
[9] Meta-Llama-3-8B SafeTensors: https://huggingface.co/meta-llama/Meta-Llama-3-8B?show_file_info=model.safetensors.index.json
[10] Gemma-2-2B-it: https://huggingface.co/google/gemma-2-2b-it?show_file_info=model.safetensors.index.json
[11] Over-Tokenization: https://arxiv.org/abs/2501.16975
[12] Token Drag: https://github.com/jmward01/lmplay/wiki/Unified-Embeddings#how-i-think-they-work-token-drag-and-the-two-jobs-of-embeddings