Level 1 · Chapter 1.3

Generative AI & Large
Language Models

ChatGPT, Claude, Gemini, Llama — you have heard the names. Now understand how they actually work. This chapter takes you inside the technology, from the transformer architecture to tokens to multimodal models, using plain language and zero equations.


Why Generative AI Changed Everything

For decades, AI was primarily about analysis: classifying emails, detecting fraud, recognizing faces, recommending products. These systems are powerful, but they are fundamentally about sorting and scoring existing information. Generative AI is different. It creates new content that did not exist before.

When you ask ChatGPT to write a poem, it does not retrieve a stored poem from a database. It generates a brand-new poem, word by word, by applying the statistical patterns it learned during training. When you ask an image generator to create "a cat wearing a business suit in a boardroom," it does not search for a matching photograph. It constructs an entirely new image from scratch.

This creative capability is what makes generative AI feel qualitatively different from earlier AI systems. It is also what makes it simultaneously more useful and more dangerous: more useful because it can help with creative and knowledge tasks that were previously impossible to automate, and more dangerous because it can generate convincing misinformation, fake images, and deceptive content at scale.

What Are Large Language Models?

The Core Concept

A Large Language Model (LLM) is an AI system that has been trained on massive amounts of text data to understand and generate human-like language. The "large" in the name refers to two things: the enormous amount of training data (often trillions of words) and the enormous number of parameters in the model (often hundreds of billions).

The fundamental task an LLM performs is deceptively simple: predict the next token. Given a sequence of words (or more precisely, tokens), what word is most likely to come next? That is it. Every impressive thing an LLM does, from writing essays to solving math problems to generating code, emerges from this single core ability, applied at massive scale.

Next-Token Prediction: How It Actually Works

Imagine you are playing a word game where you see the beginning of a sentence and need to guess the next word. Given "The capital of France is ___," you would probably guess "Paris." You can do this because you have read and heard many sentences about France, capitals, and Paris in your life. You are using your experience (your "training data") to predict the most likely continuation.

LLMs do exactly the same thing, but at an incomprehensible scale. During training, the model processes billions of text documents. From this data, it learns the statistical relationships between words and phrases. It learns that "The capital of France is" is overwhelmingly likely to be followed by "Paris." It also learns more subtle patterns: that academic papers use certain vocabulary, that casual conversations have different structures than business emails, that Python code follows specific syntax rules.

When you type a prompt, the model generates a response one token at a time. For each token, it calculates the probability of every possible next token and selects one (usually one of the highest-probability options, with some randomness to avoid being too predictable). Then it adds that token to the sequence and repeats the process for the next token. This continues until the response is complete.
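The generate-one-token-at-a-time loop described above can be sketched with a toy model. This is an illustration, not a real LLM: instead of a neural network, it "trains" by counting which word follows which in a tiny corpus, then samples the next word from those counts, appends it, and repeats.

```python
import random
from collections import Counter, defaultdict

# A tiny "training corpus" (a real model would see trillions of tokens).
corpus = "the capital of france is paris . the capital of italy is rome .".split()

# "Training": count which word follows which.
next_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    next_counts[current][nxt] += 1

def predict_next(word):
    """Return candidate next words with their probabilities."""
    counts = next_counts[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def generate(prompt_word, length=4):
    """Generate text one token at a time, sampling from the learned distribution."""
    out = [prompt_word]
    for _ in range(length):
        probs = predict_next(out[-1])
        if not probs:
            break
        words, weights = zip(*probs.items())
        out.append(random.choices(words, weights=weights)[0])
    return " ".join(out)
```

In this toy corpus, "france" is always followed by "is," while "is" splits 50/50 between "paris" and "rome" — which is why sampling (rather than always taking the top choice) produces varied output, exactly the behavior described above.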

The result is text that follows the statistical patterns of human-written text so closely that it often appears to understand, reason, and create. But it is important to remember that it is fundamentally performing statistical prediction, not understanding in the human sense.

The Autocomplete Analogy

Your phone's autocomplete is a tiny, simple version of an LLM. It predicts the next word based on what you have typed so far. LLMs do the same thing, but with vastly more training data, vastly more parameters, and vastly more context, producing results that are incomparably more sophisticated. The fundamental mechanism, however, is the same: statistical prediction of what comes next.

Tokens: The Language AI Speaks

Before we go further, you need to understand tokens, because they are fundamental to how LLMs work and they directly affect how you use these tools.

Humans read and write in words. LLMs process text in tokens. A token is a chunk of text that the model treats as a single unit. Common short words like "the," "and," and "is" are typically one token each. Longer or less common words get split into multiple tokens. The word "understanding" might become two tokens: "understand" and "ing." Very rare or technical words might be split into even more tokens.

On average, one token equals roughly three-quarters of a word in English. So 1,000 tokens is approximately 750 words. This ratio varies across languages: some languages require more tokens per word than English.
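The rule of thumb above can be written as a pair of helpers. The 0.75 ratio is a rough approximation for English text, and these function names are illustrative, not part of any real tokenizer's API.

```python
def estimate_words(tokens, words_per_token=0.75):
    """Rough rule of thumb: one English token is about 3/4 of a word."""
    return tokens * words_per_token

def estimate_tokens(words, words_per_token=0.75):
    """Inverse: estimate how many tokens a given word count will consume."""
    return words / words_per_token
```

By this estimate, 1,000 tokens comes out to about 750 words, and 96,000 words fills a 128,000-token context window.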

Why does this matter to you? Because LLMs have a limited "context window" measured in tokens. The context window is the maximum amount of text the model can consider at one time, including both your prompt and the model's response. If a model has a 128,000-token context window (approximately 96,000 words or about 150 pages), that is its working memory limit. Anything beyond that window is invisible to the model.

This has practical implications. In a very long conversation, the model may "forget" things you discussed earlier because those messages have scrolled out of the context window. Similarly, a very long uploaded document may exceed the window, leaving the model unable to see all of it at once. Understanding this limitation helps you use AI tools more effectively by structuring your interactions to stay within the window.
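The "scrolling out" behavior can be sketched as a trimming function: keep the newest messages that fit the token budget and drop the oldest. This is a simplification — real chat systems also reserve room for system prompts and the model's reply, and may summarize rather than drop old turns.

```python
def fit_to_window(messages, estimate_tokens, window_limit):
    """Keep the most recent messages that fit within the token budget.

    Oldest messages are dropped first, mirroring how long conversations
    "forget" their beginnings once the context window fills up.
    """
    kept, used = [], 0
    for msg in reversed(messages):       # walk newest-first
        cost = estimate_tokens(msg)
        if used + cost > window_limit:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))          # restore chronological order

# Crude token estimate: roughly 4 characters per English token.
rough_tokens = lambda text: len(text) // 4
```

With a budget of 25 tokens and three 10-token messages, only the last two survive — the first has effectively scrolled out of view.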

The Transformer Architecture

Why Transformers Were a Breakthrough

Before transformers, language models processed text sequentially: one word at a time, left to right. This created two major problems. First, it was slow, because each word had to be processed before the next one could start. Second, the model struggled with long-range dependencies, meaning it had trouble connecting information from the beginning of a passage to information at the end.

Transformers solved both problems. Instead of processing words one at a time, transformers process the entire input simultaneously. They use a mechanism called "attention" that allows every part of the input to interact with every other part, regardless of position. The model can directly connect the word "it" in the last sentence of a paragraph to the noun it refers to in the first sentence, without having to pass information through every word in between.

Attention: The Key Innovation

The attention mechanism is what makes transformers powerful, and it is surprisingly intuitive once you understand it. Think about how you read a complex sentence. When you encounter a pronoun like "it" or "they," your brain automatically looks back to figure out what the pronoun refers to. You pay more "attention" to the relevant earlier words and less attention to the irrelevant ones.

Transformers do something similar. For each token in the input, the attention mechanism calculates how relevant every other token is. When processing the word "it" in a sentence, the model assigns high attention weights to the noun "it" refers to and low weights to unrelated words. This happens across multiple "attention heads" that each focus on different types of relationships: some might focus on grammatical relationships, others on topical relationships, and others on positional patterns.

The result is a model that can capture extraordinarily complex relationships in text. It can track multiple threads of meaning, maintain coherence across long passages, and generate text that follows sophisticated linguistic patterns. All from the simple principle of letting every part of the input attend to every other part.

How LLMs Are Trained

Phase 1: Pre-training

The first phase of LLM training is called pre-training, and it is where the model learns the fundamental patterns of language. The model is shown enormous amounts of text, typically trillions of words scraped from the internet, books, academic papers, code repositories, and other sources. For each passage, the model tries to predict the next token, compares its prediction to the actual next token, and adjusts its parameters to improve.

This process requires extraordinary computational resources. Training a state-of-the-art LLM might require thousands of high-end GPUs running continuously for months, at a cost of tens or hundreds of millions of dollars. The result is a "base model" that can complete text in a general way but is not yet optimized for following instructions or being helpful.

Phase 2: Fine-tuning

The base model can predict text, but it does not know how to have a conversation or follow instructions. Fine-tuning teaches it to be useful. In this phase, the model is trained on a smaller, carefully curated dataset of instruction-following examples. These typically consist of prompts paired with high-quality responses that demonstrate the desired behavior: being helpful, following instructions accurately, and avoiding harmful outputs.

Phase 3: RLHF (Reinforcement Learning from Human Feedback)

The final phase uses human evaluators to rate the model's responses and further refine its behavior. Multiple responses to the same prompt are generated, and human raters rank them from best to worst. The model then learns to produce responses that are more like the highly-rated ones and less like the poorly-rated ones. This is the process described in Chapter 1.2 as reinforcement learning, and it is what makes modern chatbots notably more helpful and aligned with human preferences than earlier language models.

The Training Cutoff

Because LLMs learn from a fixed dataset, their knowledge has a "cutoff date" after which they have no information. If a model's training data was collected through early 2025, it genuinely does not know about events that happened after that date. This is not a bug; it is a fundamental characteristic of how these systems are built. Always consider the training cutoff when evaluating whether an LLM's response on current events is reliable.

The Major Models Today

The LLM landscape is evolving rapidly, but understanding the major players helps you navigate the ecosystem.

OpenAI GPT Series (ChatGPT). The model that brought generative AI to mainstream attention in late 2022. GPT-4 and its successors are multimodal, handling text, images, and code. Available through ChatGPT (consumer) and via API for developers. Known for broad general knowledge and strong instruction-following.

Anthropic Claude Series. Developed with a strong emphasis on safety and helpfulness. Claude models are known for thoughtful, nuanced responses and strong performance on analysis and writing tasks. Available through the Claude interface and API.

Google Gemini Series. Google's multimodal models integrated into Google's product ecosystem. Strong performance on reasoning tasks and deep integration with Google Search, Workspace, and other Google services.

Meta Llama Series. Open-source models released by Meta that can be downloaded, modified, and deployed by anyone. Llama models have become the foundation of a thriving open-source AI ecosystem, enabling researchers and companies to build custom solutions without depending on a commercial API.

Other Notable Models. Mistral (French company known for efficient, high-performing models), Cohere (enterprise-focused), and numerous specialized models for specific tasks like code generation (GitHub Copilot), image generation (Midjourney, DALL-E, Stable Diffusion), and music generation.

Multimodal AI: Beyond Text

The latest generation of AI models does not just process text. Multimodal models can work across multiple types of media simultaneously.

Text to Image. Systems like DALL-E, Midjourney, and Stable Diffusion generate images from text descriptions. You describe what you want to see, and the model creates a novel image matching your description. These use a different architecture (typically diffusion models) but the same fundamental principle of learning patterns from training data.

Image to Text. Modern LLMs can "see" images uploaded by users. You can show Claude or GPT-4 a photograph, chart, screenshot, or handwritten note and ask questions about it. The model converts visual information into its text-based understanding to generate relevant responses.

Text to Code. LLMs trained on code repositories can generate working software from natural language descriptions. This is not a separate capability but an extension of language modeling: code is a structured language, and models can learn its patterns just as they learn natural language patterns.

Text to Audio and Video. Emerging models can generate realistic speech from text, create music from descriptions, and even generate short video clips. These are at earlier stages of development than text and image generation, but they are advancing rapidly.

Why Multimodal Matters

Multimodal AI dramatically expands what you can do with these tools. You can photograph a whiteboard and ask the AI to transcribe it, extract text from a screenshot, analyze data in a chart you photographed, describe an image for accessibility purposes, or generate visual content from text descriptions. Understanding that these capabilities exist (and where they are reliable versus unreliable) is key to getting maximum value from modern AI tools.

Temperature and Creativity

When an LLM generates text, it does not always pick the single most likely next token. A "temperature" parameter controls how random or deterministic the output is.

Low temperature (0.0 - 0.3): The model strongly favors the most probable tokens. Outputs are more predictable and consistent (though not necessarily more accurate). Well suited to tasks where you want focused, reproducible answers, like data analysis, factual questions, and code generation. The same prompt will produce very similar outputs each time.

Medium temperature (0.4 - 0.7): A balance between predictability and variety. Good for most general-purpose tasks including email drafting, summarization, and conversation.

High temperature (0.8 - 1.0+): The model considers less probable tokens more frequently. Outputs are more creative, surprising, and varied. Good for brainstorming, creative writing, and generating diverse ideas. The same prompt may produce quite different outputs each time.

Understanding temperature helps you get better results. If an AI tool gives you the same generic response every time, it might be set too low. If it produces inconsistent or unreliable outputs, it might be set too high. Many AI platforms let you adjust this setting directly.
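Mechanically, temperature is a division applied to the model's raw scores (logits) before they are converted to probabilities. The sketch below shows the effect on three hypothetical candidate tokens; the scores are invented for illustration.

```python
import math

def apply_temperature(logits, temperature):
    """Turn raw model scores into sampling probabilities.

    Dividing by the temperature before the softmax sharpens the
    distribution when temperature is low (the top token dominates)
    and flattens it when temperature is high (more variety).
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                # hypothetical scores for 3 tokens
low = apply_temperature(logits, 0.2)    # near-deterministic
high = apply_temperature(logits, 1.5)   # more varied
```

At temperature 0.2 the top token takes almost all of the probability mass, so sampling returns the same word nearly every time; at 1.5 the runners-up get a real chance, which is where the "creative, surprising" behavior comes from.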

What LLMs Cannot Do (Yet)

Having understood how LLMs work, it is equally important to understand their inherent limitations:

They cannot access real-time information. Unless connected to external tools (like web search), an LLM only knows what was in its training data. It cannot check current stock prices, read today's news, or access your email.

They cannot learn from your conversation permanently. Each conversation exists independently (unless the platform implements memory features). The model does not become smarter or more knowledgeable based on your interactions.

They cannot reason reliably. While LLMs can perform impressive reasoning-like tasks, they are fundamentally performing pattern matching, not logical deduction. Complex multi-step reasoning, mathematical proofs, and novel logical problems remain areas where LLMs frequently fail.

They cannot verify their own outputs. An LLM has no internal mechanism for checking whether its output is factually correct. It generates text that patterns suggest should follow, regardless of truth value. This is why human verification remains essential.

They do not have beliefs, desires, or consciousness. Despite the conversational interface that makes it feel like you are talking to a thinking being, LLMs are sophisticated pattern-matching systems. They do not "want" anything, "believe" anything, or have any subjective experience.

Key Takeaway

Large language models generate text by predicting the most likely next token, based on statistical patterns learned from massive training datasets. The transformer architecture, with its attention mechanism, is what makes this possible at the sophistication level you see in ChatGPT, Claude, and Gemini.

These systems are extraordinarily capable within their strengths (text generation, analysis, translation, summarization, creative tasks) and consistently unreliable in their weaknesses (factual accuracy, real-time information, complex reasoning, self-verification). The professionals who get the most value from generative AI are the ones who understand this clearly and use the tools accordingly: trusting the strengths, compensating for the weaknesses, and always applying human judgment to the output.

What Comes Next

Chapter 1.4 builds directly on this understanding to give you a comprehensive framework for assessing AI capabilities and limitations. You will learn the specific tasks where AI excels, the specific failure modes to watch for (including hallucinations in detail), and how to develop the critical judgment that separates effective AI users from everyone else. This is the chapter that turns your conceptual understanding into practical decision-making ability.