Learn With Nathan
  • AI Chat Tools
    • ChatGPT - OpenAI
      • Start with ChatGPT
      • Account Settings
      • ChatGPT Free Plan
      • ChatGPT Account Settings
    • Claude - Anthropic
      • Sign Up for Claude
      • User Interface
    • Gemini - Google
  • AI Concepts
    • Context
    • Tokenization
    • Prompt Engineering
    • Temperature
    • Max Tokens
    • Fine-Tuning
    • System Prompt
    • Persona
    • Memory
    • Hallucination
    • Model Bias
    • Embedding
    • Latency
    • User Intent
    • Multimodal AI
    • Safety Layers
    • Chain of Thought
    • Prompt Templates
    • Retrieval-Augmented Generation (RAG)
  • Introduction to Prompting
    • Beginner's Prompting Strategies
      • Understanding the Purpose of a Prompt
      • Be Specific and Clear
      • Using Contextual Information
      • Direct vs. Open-Ended Prompts
      • Step-by-Step Instructions
      • Role-Based Prompts
      • Sequential Prompts
      • Multi-Step Questions
      • Incorporating Examples
    • Common Prompting Mistakes to Avoid
      • Being Too Vague or Ambiguous
      • Overloading with Multiple Questions
      • Ignoring Context Limitations
      • Not Specifying the Desired Output
      • Lack of Iteration and Refinement
      • Neglecting to Set the Right Tone or Role
      • Using Jargon or Complex Language Unnecessarily
      • Ignoring Feedback from the AI
      • Overly Long or Short Prompts
    • Output Formatting Techniques
      • Using Headings and Subheadings
      • Bulleted and Numbered Lists
      • Paragraph Structure
      • Tables and Charts
      • Direct Answers vs. Detailed Explanations
      • Incorporating Summaries and Conclusions
    • Leveraging Formatting for Clarity
      • Highlighting Key Points
      • Guiding the AI on Tone and Style
      • Requesting Examples or Case Studies
      • Formatting for Different Audiences
      • Using Questions to Clarify Information
      • Prompting for Step-by-Step Guides
      • Customizing Responses for Presentation or Reports
      • Avoiding Over-Complicated Formatting
  • Types of Prompts
    • Direct Prompts
    • Instructional Prompts
    • Conversational Prompts
    • Contextual Prompts
    • Example-Based Prompts
    • Reflective or Feedback Prompts
    • Multi-Step Prompts
    • Open-Ended Prompts
    • Role-Based Prompts
    • Comparative Prompts
    • Conditional Prompts
    • Summarization Prompts
    • Exploratory Prompts
    • Problem-Solving Prompts
    • Clarification Prompts
    • Sequential Prompts
    • Hypothetical Prompts
    • Ethical or Judgment-Based Prompts
    • Diagnostic Prompts
    • Instructional Design Prompts
  • Advanced Prompting Techniques
    • Zero-Shot
    • Few-Shot
    • Chain-of-Thought
    • Meta Prompting
    • Self-Consistency
    • Generated Knowledge
    • Prompt Chaining
    • Tree of Thoughts (ToT)
    • Retrieval-Augmented Generation (RAG)
    • Automatic Prompt Engineer (APE)
    • Active Prompt
    • Directional Stimulus
  • Live Examples
    • Legal
      • Non-Disclosure Agreement (NDA)
      • Employment Contract
      • Lease Agreement
      • Service Agreement
      • Sales Agreement
    • Zero-Shot Prompting
    • Few-Shot Prompting

Tokenization


Tokenization is the process by which input text is broken down into smaller components, such as words, subwords, or even individual characters. These tokens are the units that AI models like ChatGPT process. For instance, the phrase "ChatGPT is amazing" might be tokenized as ["Chat", "GPT", "is", "amazing"]. This process is vital because the model operates on numerical representations of these tokens, not the raw text itself.
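You can inspect this yourself. The snippet below is a minimal sketch using OpenAI's open-source tiktoken library; the exact IDs and splits shown in the comments are illustrative and depend on the encoding you load:

```python
import tiktoken

# cl100k_base is the encoding used by several GPT models
enc = tiktoken.get_encoding("cl100k_base")

text = "ChatGPT is amazing"
token_ids = enc.encode(text)                   # text -> list of integer token IDs
pieces = [enc.decode([t]) for t in token_ids]  # each ID back to its text piece

print(token_ids)                    # integer IDs; exact values vary by encoding
print(pieces)                       # e.g. ['Chat', 'G', 'PT', ' is', ' amazing']
print(enc.decode(token_ids) == text)  # lossless round trip: True
```

Note that the model never sees "ChatGPT is amazing" as text; it sees only the list of integer IDs.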

The choice of tokenization method affects the model's efficiency and accuracy. Subword tokenization, for example, allows the AI to handle rare words by breaking them into common subword units, improving its capacity to deal with diverse languages and expressions. However, tokenization introduces limits, as each model has a maximum token count (e.g., 4096 tokens for some GPT models). Input and output text must fit within this constraint, requiring users to prioritize essential information in prompts.
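As a sketch of how you might guard against that limit, the snippet below counts a prompt's tokens with tiktoken; the 4096-token budget is just the example figure from above, not a universal constant:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
MAX_TOKENS = 4096  # example budget; real limits vary by model

# A rare word is split into familiar subword units instead of being
# mapped to a single "unknown" token.
rare = "antidisestablishmentarianism"
print([enc.decode([t]) for t in enc.encode(rare)])

prompt = "Summarize the following document: ..."
used = len(enc.encode(prompt))
if used > MAX_TOKENS:
    print(f"Prompt is {used} tokens; trim it to fit the {MAX_TOKENS}-token window.")
else:
    print(f"Prompt uses {used} of {MAX_TOKENS} tokens.")
```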

As the examples above show, tokens are not always full words or units that are meaningful to a human, but they definitely mean something to LLMs. The Transformer in GPT predicts the next token, not the next word or character, and that stream of predictions is what becomes the AI's response.
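To make that loop concrete, here is a toy sketch of greedy autoregressive decoding; `model` and its `next_token_logits` method are hypothetical stand-ins, not a real API:

```python
def generate(model, prompt_ids, max_new_tokens=50, eos_id=0):
    """Greedy autoregressive decoding over token IDs (toy illustration)."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model.next_token_logits(ids)  # one score per vocabulary token
        next_id = max(range(len(logits)), key=logits.__getitem__)  # pick the top token
        if next_id == eos_id:                  # end-of-sequence token stops generation
            break
        ids.append(next_id)                    # the prediction feeds the next step
    return ids
```

Each predicted token is appended to the input and the model is asked again, one token at a time, until a stop token or the length limit is reached; real systems usually sample rather than always taking the top token.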