# Tokenization

Tokenization is the process by which input text is broken down into smaller components, such as words, subwords, or even individual characters. These tokens are the units that AI models like ChatGPT process. For instance, the phrase "ChatGPT is amazing" might be tokenized as \["Chat", "GPT", "is", "amazing"]. This process is vital because the model operates on numerical representations of these tokens, not the raw text itself.
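To make this concrete, here is a minimal sketch of greedy longest-match tokenization over a tiny, hypothetical vocabulary (the token IDs are made up for illustration). Real tokenizers such as GPT's are learned with byte pair encoding over a vocabulary of tens of thousands of entries, but the idea of mapping text spans to numeric IDs is the same.

```python
# Hypothetical vocabulary for illustration only; real models learn
# their subword vocabulary (e.g. via byte pair encoding).
vocab = {"Chat": 0, "GPT": 1, "is": 2, "amazing": 3, " ": 4}

def tokenize(text):
    """Greedy longest-match: repeatedly consume the longest vocab
    entry that prefixes the remaining text, emitting its ID."""
    ids = []
    while text:
        match = max((t for t in vocab if text.startswith(t)),
                    key=len, default=None)
        if match is None:
            raise ValueError(f"no token matches {text!r}")
        ids.append(vocab[match])
        text = text[len(match):]
    return ids

print(tokenize("ChatGPT is amazing"))  # [0, 1, 4, 2, 4, 3]
```

Note how "ChatGPT" is split into the two subword tokens "Chat" and "GPT", matching the example above, while the model ultimately sees only the numeric IDs.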

The choice of tokenization method affects the model's efficiency and accuracy. Subword tokenization, for example, allows the AI to handle rare words by breaking them into common subword units, improving its capacity to deal with diverse languages and expressions. However, tokenization also introduces limits: each model has a maximum context length measured in tokens (e.g., 4096 tokens for some GPT models). The input and output together must fit within this window, so users need to prioritize essential information in their prompts.
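The budgeting implied by a context window can be sketched as a simple check: the prompt's token count must leave room for the reply. The 500-token reservation below is an arbitrary example value, not anything mandated by a real model.

```python
MAX_TOKENS = 4096  # example context limit mentioned above

def fits_context(prompt_token_ids, reserved_for_output=500):
    """Return True if the prompt leaves at least `reserved_for_output`
    tokens of the context window free for the model's reply."""
    return len(prompt_token_ids) <= MAX_TOKENS - reserved_for_output

print(fits_context(list(range(3000))))  # True: 3000 <= 3596
print(fits_context(list(range(4000))))  # False: over budget
```

In practice you would count tokens with the model's own tokenizer rather than guessing from character or word counts, since the two can differ substantially.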

<figure><img src="/files/pZudzO16ksGj2RTUJxO4" alt=""><figcaption></figcaption></figure>

As you can see above, the tokens are not always full words or anything meaningful to a human. But they definitely mean something to the LLM. The Transformer in GPT predicts the next token, not the next word or character, and that sequence of predictions is what becomes the AI's response.
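Next-token prediction can be caricatured with a lookup table: here a hypothetical table maps each token to its single most likely successor. A real Transformer instead produces a probability distribution over the entire vocabulary, conditioned on the whole context, but the generation loop has the same shape.

```python
# Hypothetical "most likely next token" table, standing in for a
# Transformer's output distribution. Purely illustrative.
next_token = {"Chat": "GPT", "GPT": " is", " is": " amazing"}

def generate(start, max_steps):
    """Autoregressive loop: append the predicted token and feed it
    back in, until no prediction exists or the step budget runs out."""
    out = [start]
    for _ in range(max_steps):
        tok = next_token.get(out[-1])
        if tok is None:
            break
        out.append(tok)
    return "".join(out)

print(generate("Chat", 3))  # ChatGPT is amazing
```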


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://learn-with-nathan.gitbook.io/learnwithnathan/ai-concepts/tokenization.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.
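As a sketch, the query URL can be built by URL-encoding the question into the `ask` parameter (the question string below is just an example):

```python
from urllib.parse import urlencode

BASE = ("https://learn-with-nathan.gitbook.io/learnwithnathan/"
        "ai-concepts/tokenization.md")

def ask_url(question):
    """Build the documentation-query URL with the question
    URL-encoded into the `ask` parameter."""
    return f"{BASE}?{urlencode({'ask': question})}"

url = ask_url("What is subword tokenization?")
print(url)
# ...tokenization.md?ask=What+is+subword+tokenization%3F
# The URL can then be fetched with any HTTP client,
# e.g. urllib.request.urlopen(url).
```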

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
