How Large Language Models Work
Understand tokens, training, context windows, and the mechanics behind ChatGPT and other LLMs.
The Basics: Predicting the Next Word
At their core, LLMs do one thing: predict the next word (or, more precisely, token) in a sequence. When you type a prompt, the model looks at your input and generates a response one token at a time, each time choosing a likely next token given everything that came before.
This simple mechanism — repeated billions of times during training on enormous datasets — produces remarkably intelligent-seeming behavior. The model learns grammar, facts, reasoning patterns, coding syntax, and even creative writing styles, all from the statistical patterns in its training data.
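In code, that generation loop looks roughly like this. A minimal sketch in Python, where a small hypothetical lookup table stands in for the neural network's next-token probability estimates:

```python
# Toy illustration of next-token generation (not a real LLM):
# the "model" is a hypothetical lookup table mapping a context
# to next-token probabilities, standing in for a neural network.
TOY_MODEL = {
    ("the",): {"cat": 0.6, "dog": 0.4},
    ("the", "cat"): {"sat": 0.7, "ran": 0.3},
    ("the", "cat", "sat"): {"<end>": 1.0},
}

def generate(prompt_tokens, max_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        probs = TOY_MODEL.get(tuple(tokens))
        if probs is None:
            break
        # Greedy decoding: pick the single most likely next token.
        next_token = max(probs, key=probs.get)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens

print(generate(["the"]))  # ['the', 'cat', 'sat']
```

A real model computes those probabilities over tens of thousands of tokens with billions of parameters, but the outer loop is the same: look at everything so far, pick one token, repeat.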
What Are Tokens?
LLMs don't read words — they read tokens. A token is roughly 3-4 characters, or about 0.75 words. Understanding tokens matters because:
- Pricing is based on tokens (input tokens + output tokens)
- Context windows are measured in tokens
- Long or unusual words split into more tokens than short, common ones
Examples (approximate; exact counts vary by tokenizer):
- "Hello" = 1 token
- "Artificial intelligence" = 2 tokens
- "Supercalifragilisticexpialidocious" = 5 tokens
Most model APIs report token usage. GPT-4o supports a context window of roughly 128,000 tokens; Claude 3.5 Sonnet supports roughly 200,000.
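Because exact counts depend on the tokenizer, a quick rule of thumb (~4 characters per token for English) is often good enough for budgeting. A sketch under that assumption; real tokenizers such as OpenAI's tiktoken use learned byte-pair encodings, not character counts:

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: ~4 characters per token for English text.
    Real tokenizers split on learned subword units, so actual
    counts differ; use the model's own tokenizer for billing."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("Hello"))  # 1
print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 11
```

For anything where accuracy matters (cost estimates, truncation logic), count with the actual tokenizer for your model rather than a heuristic like this.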
Context Windows Explained
The context window is the total amount of text (in tokens) that the model can "see" at once. It includes both your input and the model's output.
Think of it as the model's working memory:
- Small context (4K tokens): Can handle a short conversation
- Medium context (32K tokens): Can process a long document
- Large context (128K-200K tokens): Can analyze an entire book
When a conversation exceeds the context window, the model starts "forgetting" earlier messages. This is why long conversations sometimes lose coherence.
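This "forgetting" is often implemented by trimming the oldest messages until the conversation fits the window again. A simplified sketch; count_tokens here is a rough character-based stand-in for a real tokenizer:

```python
def count_tokens(message: str) -> int:
    # Rough stand-in for a real tokenizer (~4 chars per token).
    return max(1, len(message) // 4)

def trim_history(messages, max_tokens):
    """Drop the oldest messages until the total fits the window."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)  # the model "forgets" the earliest message
    return kept

history = ["hi" * 50, "tell me about tokens" * 5, "short question"]
trimmed = trim_history(history, max_tokens=40)
# The oldest message no longer fits, so it is dropped.
```

Production chat apps use smarter strategies (summarizing old turns, pinning the system prompt), but the constraint is the same: input plus output must fit inside the window.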
Temperature and Creativity
Temperature is a setting that controls how random or creative the model's responses are:
- Temperature 0: Always picks the most likely next token. Responses are consistent and close to deterministic (in practice, minor variation can still occur). Best for factual tasks, coding, data analysis.
- Temperature 0.5-0.7: Balanced. Good for most general tasks.
- Temperature 1.0+: More random and creative. Good for brainstorming, creative writing, generating diverse ideas.
Most chat interfaces use a default temperature around 0.7. Some let you adjust it in settings.
Training vs Inference
Training is when the model learns from data. This happens once (or periodically) and costs millions of dollars. The model processes billions of text examples and adjusts its internal parameters to predict text better.
Inference is when you use the trained model. Each time you send a prompt and get a response, that's inference. It's much cheaper than training but still requires significant computing power.
The model's knowledge is frozen at training time. This is why AI models have a "knowledge cutoff date" — they don't know about events after their training data was collected.
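The training/inference split can be illustrated at toy scale with a bigram model: "training" scans a corpus once and records next-word counts, and "inference" reuses those frozen counts. A hypothetical miniature, not how a real LLM is implemented:

```python
from collections import defaultdict

def train(corpus):
    """'Training': scan the data once, recording how often each
    word follows each other word. A real LLM adjusts billions of
    neural-network parameters instead of keeping counts, but the
    idea is the same: learn next-token statistics from data."""
    counts = defaultdict(lambda: defaultdict(int))
    words = corpus.split()
    for current, nxt in zip(words, words[1:]):
        counts[current][nxt] += 1
    return counts

def infer(model, word):
    """'Inference': use the frozen model to predict the next word.
    The model knows nothing the training data didn't contain."""
    followers = model.get(word)
    if not followers:
        return None  # the "knowledge cutoff": unseen words draw a blank
    return max(followers, key=followers.get)

model = train("the cat sat on the mat the cat ran")
print(infer(model, "the"))  # "cat"
```

Note that infer never updates the counts: everything the model "knows" was fixed when train finished, which is the toy version of a knowledge cutoff.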
Ask ChatGPT or Claude: "How many tokens is this sentence?" Then ask it to explain its token counting. Note: LLMs aren't reliably accurate at counting their own tokens (the model processes token IDs but keeps no running tally of them), which demonstrates an interesting limitation.
- ✓ LLMs predict the next token in a sequence — that's the core mechanism
- ✓ Tokens are ~0.75 words and determine pricing and context limits
- ✓ Context windows are the model's working memory — a bigger window fits more text at once
- ✓ Temperature controls creativity vs consistency
- ✓ Models have knowledge cutoff dates because training happens before deployment, not continuously