How Large Language Models Work
Understand tokens, training, context windows, and the mechanics behind ChatGPT and other LLMs.
The Basics: Predicting the Next Word
At their core, LLMs do one thing: predict the next word (or, more precisely, token) in a sequence. When you type a prompt, the model looks at your input and generates a response one token at a time, each time choosing a likely next token given everything that came before.
This simple mechanism — repeated billions of times during training on enormous datasets — produces remarkably intelligent-seeming behavior. The model learns grammar, facts, reasoning patterns, coding syntax, and even creative writing styles, all from the statistical patterns in its training data.
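In code, that generation loop looks roughly like this. A minimal sketch in Python, where a small hypothetical lookup table stands in for the neural network's next-token probability estimates:

```python
# Toy illustration of next-token generation (not a real LLM):
# the "model" is a hypothetical lookup table mapping a context
# to next-token probabilities, standing in for a neural network.
TOY_MODEL = {
    ("the",): {"cat": 0.6, "dog": 0.4},
    ("the", "cat"): {"sat": 0.7, "ran": 0.3},
    ("the", "cat", "sat"): {"<end>": 1.0},
}

def generate(prompt_tokens, max_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        probs = TOY_MODEL.get(tuple(tokens))
        if probs is None:
            break
        # Greedy decoding: pick the single most likely next token.
        next_token = max(probs, key=probs.get)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens

print(generate(["the"]))  # ['the', 'cat', 'sat']
```

A real model computes those probabilities over tens of thousands of tokens with billions of parameters, but the outer loop is the same: look at everything so far, pick one token, repeat.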
What Are Tokens?
LLMs don't read words — they read tokens. A token is roughly 3-4 characters, or about 0.75 words. Understanding tokens matters because:
- Pricing is based on tokens (input tokens + output tokens)
- Context windows are measured in tokens
- Long or unusual words split into more tokens than short, common ones
Examples (approximate; exact counts vary by tokenizer):
- "Hello" = 1 token
- "Artificial intelligence" = 2 tokens
- "Supercalifragilisticexpialidocious" = 5 tokens
Most model APIs report token usage. GPT-4o supports a context window of roughly 128,000 tokens; Claude 3.5 Sonnet supports roughly 200,000.
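Because exact counts depend on the tokenizer, a quick rule of thumb (~4 characters per token for English) is often good enough for budgeting. A sketch under that assumption; real tokenizers such as OpenAI's tiktoken use learned byte-pair encodings, not character counts:

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text.
    Real tokenizers split on learned subword units, so actual
    counts differ; use the model's own tokenizer for billing."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("Hello"))  # 1
print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 11
```

For anything where accuracy matters (cost estimates, truncation logic), count with the actual tokenizer for your model rather than a heuristic like this.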
Context Windows Explained
The context window is the total amount of text (in tokens) that the model can "see" at once. It includes both your input and the model's output.
Think of it as the model's working memory:
- Small context (4K tokens): Can handle a short conversation
- Medium context (32K tokens): Can process a long document
- Large context (128K-200K tokens): Can analyze an entire book
When a conversation exceeds the context window, the model starts "forgetting" earlier messages. This is why long conversations sometimes lose coherence.
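This "forgetting" is often implemented by trimming the oldest messages until the conversation fits the window again. A simplified sketch; count_tokens here is a rough character-based stand-in for a real tokenizer:

```python
def count_tokens(message: str) -> int:
    # Rough stand-in for a real tokenizer (~4 chars per token).
    return max(1, len(message) // 4)

def trim_history(messages, max_tokens):
    """Drop the oldest messages until the total fits the window."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)  # the model "forgets" the earliest message
    return kept

history = ["hi" * 50, "tell me about tokens" * 5, "short question"]
trimmed = trim_history(history, max_tokens=40)
# The oldest message no longer fits, so it is dropped.
```

Production chat apps use smarter strategies (summarizing old turns, pinning the system prompt), but the constraint is the same: input plus output must fit inside the window.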
Temperature and Creativity
Temperature is a setting that controls how random or creative the model's responses are:
- Temperature 0: Always picks the most likely next token. Responses are consistent and close to deterministic (in practice, minor variation can still occur). Best for factual tasks, coding, data analysis.
- Temperature 0.5-0.7: Balanced. Good for most general tasks.
- Temperature 1.0+: More random and creative. Good for brainstorming, creative writing, generating diverse ideas.
Most chat interfaces use a default temperature around 0.7. Some let you adjust it in settings.
Training vs Inference
Training is when the model learns from data. This happens once (or periodically) and costs millions of dollars. The model processes billions of text examples and adjusts its internal parameters to predict text better.
Inference is when you use the trained model. Each time you send a prompt and get a response, that's inference. It's much cheaper than training but still requires significant computing power.
The model's knowledge is frozen at training time. This is why AI models have a "knowledge cutoff date" — they don't know about events after their training data was collected.
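The training/inference split can be illustrated at toy scale with a bigram model: "training" scans a corpus once and records next-word counts, and "inference" reuses those frozen counts. A hypothetical miniature, not how a real LLM is implemented:

```python
from collections import defaultdict

def train(corpus):
    """'Training': scan the data once, recording how often each
    word follows each other word. A real LLM adjusts billions of
    neural-network parameters instead of keeping counts, but the
    idea is the same: learn next-token statistics from data."""
    counts = defaultdict(lambda: defaultdict(int))
    words = corpus.split()
    for current, nxt in zip(words, words[1:]):
        counts[current][nxt] += 1
    return counts

def infer(model, word):
    """'Inference': use the frozen model to predict the next word.
    The model knows nothing the training data didn't contain."""
    followers = model.get(word)
    if not followers:
        return None  # the "knowledge cutoff": unseen words draw a blank
    return max(followers, key=followers.get)

model = train("the cat sat on the mat the cat ran")
print(infer(model, "the"))  # "cat"
```

Note that infer never updates the counts: everything the model "knows" was fixed when train finished, which is the toy version of a knowledge cutoff.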
Ask ChatGPT or Claude: "How many tokens is this sentence?" Then ask it to explain its token counting. Note: LLMs aren't reliably accurate at counting their own tokens (the model processes token IDs but keeps no running tally of them), which demonstrates an interesting limitation.
- ✓ LLMs predict the next token in a sequence — that's the core mechanism
- ✓ Tokens are ~0.75 words and determine pricing and context limits
- ✓ Context windows are the model's working memory — a bigger window fits more text at once
- ✓ Temperature controls creativity vs consistency
- ✓ Models have knowledge cutoff dates because training happens before deployment, not continuously