Tokenization
Tokenization is the process by which input text is broken down into smaller components, such as words, subwords, or even individual characters. These tokens are the units that AI models like ChatGPT process. For instance, the phrase "ChatGPT is amazing" might be tokenized as ["Chat", "GPT", "is", "amazing"]. This process is vital because the model operates on numerical representations of these tokens, not the raw text itself.
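To make this concrete, here is a minimal sketch of turning text into token IDs and back. The original text does not name a specific tokenizer, so the use of the tiktoken library and its cl100k_base encoding is an assumption for illustration; the exact splits and ID values depend on the encoding used.

```python
# Minimal sketch: text -> token IDs -> token strings.
# Assumption: the tiktoken library with the cl100k_base encoding
# (the article itself does not specify a tokenizer).
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

text = "ChatGPT is amazing"
token_ids = encoding.encode(text)                        # text -> list of integer token IDs
tokens = [encoding.decode([tid]) for tid in token_ids]   # each ID back to its text piece

print(token_ids)  # integer IDs; exact values depend on the encoding
print(tokens)     # token strings; the split may not match word boundaries
```

Note that the model never sees the strings on the right; it only works with the integer IDs on the left.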
The choice of tokenization method affects the model's efficiency and accuracy. Subword tokenization, for example, allows the AI to handle rare words by breaking them into common subword units, improving its capacity to deal with diverse languages and expressions. However, tokenization introduces limits, as each model has a maximum token count (e.g., 4096 tokens for some GPT models). Input and output text must fit within this constraint, requiring users to prioritize essential information in prompts.
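Because the limit applies to input and output together, it can help to count a prompt's tokens before sending it. The sketch below assumes the same tiktoken encoding as above and uses the 4096-token figure mentioned in the text; the amount reserved for the output is an arbitrary illustrative choice.

```python
# Hedged sketch: check whether a prompt leaves room for the model's reply.
# Assumptions: tiktoken/cl100k_base as the tokenizer, 4096 as the context
# limit (from the text above), and 500 tokens reserved for output.
import tiktoken

MAX_TOKENS = 4096  # combined budget for input and output

def fits_in_context(prompt: str, reserved_for_output: int = 500) -> bool:
    """Return True if the prompt plus the reserved output budget fits the limit."""
    encoding = tiktoken.get_encoding("cl100k_base")
    prompt_tokens = len(encoding.encode(prompt))
    return prompt_tokens + reserved_for_output <= MAX_TOKENS

print(fits_in_context("ChatGPT is amazing"))  # True: a short prompt fits easily
```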
As you can see above, the tokens are not always full words or anything meaningful to a human, but they definitely mean something to the LLM. The Transformer in GPT predicts the next token, not the next word or character, and those predicted tokens, decoded back into text, become the AI's response.
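The generation loop itself is conceptually simple: predict one token, append it, repeat. The sketch below is not GPT's actual implementation; `predict_next_token` is a hypothetical placeholder standing in for the model, and the end-of-sequence ID is an assumed convention.

```python
# Conceptual sketch of autoregressive generation: the model repeatedly
# predicts the next token ID and appends it until a stop condition.
# `predict_next_token` is a hypothetical placeholder, not a real API.
from typing import Callable, List

def generate(token_ids: List[int],
             predict_next_token: Callable[[List[int]], int],
             eos_id: int,
             max_new_tokens: int = 50) -> List[int]:
    """Append predicted tokens one at a time, stopping at EOS or the length cap."""
    for _ in range(max_new_tokens):
        next_id = predict_next_token(token_ids)  # model scores the vocabulary, picks one token
        token_ids.append(next_id)
        if next_id == eos_id:                    # end-of-sequence token ends the reply
            break
    return token_ids

# Toy usage with a dummy predictor that always emits the EOS token (ID 0):
print(generate([1, 2, 3], predict_next_token=lambda ids: 0, eos_id=0))  # [1, 2, 3, 0]
```

In a real model, the predictor is the Transformer itself, and the final list of token IDs is decoded back into text to produce the reply you read.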