What is Prompt Engineering?
Prompt engineering is the practice of designing and refining the text you provide to a language model to elicit better responses. The remarkable insight is that a model can be made dramatically more effective at specific tasks through careful prompt design alone, without any changes to the model itself. Two different prompts to the same model, on the same task, can produce responses of vastly different quality.
Prompt engineering works because language models are fundamentally pattern-matching systems that respond to context. The context you provide through your prompt shapes what patterns the model recalls and how it applies them. By designing prompts strategically, you can steer models toward better reasoning, more appropriate tone, and higher accuracy on domain-specific tasks.
Prompt Engineering Techniques
Few-Shot Prompting
Few-shot prompting provides examples of (input, output) pairs in the prompt, teaching the model by example how to handle your specific task. Instead of saying "classify this email sentiment," you show the model several sentiment classification examples, then ask it to classify your actual email. The model learns the pattern from examples and applies it to new inputs.
Few-shot prompting is remarkably effective because it grounds the model in domain-specific patterns without modifying the model itself. A financial analyst can show a model examples of how to analyze quarterly earnings, then ask it to analyze new earnings. A customer service leader can show examples of good responses to common issues, then ask the model to generate responses to new customer queries.
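The mechanics are simple: concatenate the example pairs, then append the new input with the output left blank for the model to complete. A minimal sketch in Python (the sentiment-classification framing and the `build_few_shot_prompt` helper name are illustrative assumptions, not a specific library's API):

```python
def build_few_shot_prompt(examples, query):
    """Assemble a few-shot prompt from (input, output) example pairs.

    Each example is shown in the same "Email: ... / Sentiment: ..." layout,
    and the query is appended with the Sentiment field left blank so the
    model completes the pattern.
    """
    parts = [f"Email: {text}\nSentiment: {label}" for text, label in examples]
    parts.append(f"Email: {query}\nSentiment:")
    return "\n\n".join(parts)

examples = [
    ("The refund arrived quickly, thank you!", "positive"),
    ("I've been on hold for an hour.", "negative"),
    ("Please confirm my order number.", "neutral"),
]
prompt = build_few_shot_prompt(examples, "This is the third time my delivery is late.")
```

The consistent layout matters: the model picks up on the repeated "Email:/Sentiment:" structure and continues it, which is why formatting your examples uniformly usually improves reliability.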
Chain-of-Thought Reasoning
Chain-of-thought prompting asks models to explain their reasoning step-by-step before providing an answer. Instead of asking "What is the solution?" you ask the model to "Walk me through your reasoning step-by-step, then provide your answer." This simple change dramatically improves accuracy on complex reasoning tasks by forcing the model to articulate reasoning rather than leap to conclusions.
Chain-of-thought is particularly effective for mathematical reasoning, logic problems, and domain-specific analysis. When a model thinks through a problem step-by-step, it is more likely to catch errors in its own reasoning and arrive at a better conclusion.
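In practice, chain-of-thought prompting pairs two small pieces of plumbing: wrapping the question with a reasoning instruction, and pulling the final answer back out of the model's verbose reply. A sketch under assumed conventions (the "Answer:" marker and both helper names are illustrative choices, not a standard):

```python
def with_chain_of_thought(question):
    """Wrap a question so the model reasons before answering."""
    return (
        f"{question}\n\n"
        "Walk me through your reasoning step-by-step, then state your "
        "final answer on its own line beginning with 'Answer:'."
    )

def extract_answer(response):
    """Pull the final answer out of a chain-of-thought response."""
    for line in response.splitlines():
        if line.startswith("Answer:"):
            return line[len("Answer:"):].strip()
    return None  # model did not follow the requested format
```

Asking for the answer on a marked line is what makes the longer reasoning output usable downstream; without it, you would have to guess which sentence is the conclusion.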
System Prompts and Role-Playing
System prompts—meta-instructions about how to behave—can shape model responses dramatically. A system prompt like "You are a financial analyst with 20 years of experience. Provide detailed, evidence-based analysis" can significantly improve the quality and tone of financial analysis. Role-playing prompts teach models to adopt specific personas, perspectives, or constraints.
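In chat-style interfaces, the system prompt is typically supplied as a separate message with a dedicated role. A minimal sketch using the `{"role", "content"}` message shape common to many chat-completion APIs (the `make_messages` helper and the example query are assumptions for illustration):

```python
def make_messages(system_prompt, user_query):
    """Build a chat-style message list with a system role and a user role."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_query},
    ]

ANALYST_SYSTEM = (
    "You are a financial analyst with 20 years of experience. "
    "Provide detailed, evidence-based analysis."
)
messages = make_messages(
    ANALYST_SYSTEM,
    "Summarize the key risks in this quarter's earnings report.",
)
```

Keeping the persona in the system message rather than the user message means it persists across every turn of a conversation without being repeated.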
Structured Output Prompting
Rather than letting models choose output format, you can specify exactly what format you want (JSON, XML, bullet points, specific fields). Structured output prompts dramatically improve consistency and downstream usability. A prompt that specifies "provide your answer in JSON format with fields for {risk_level, justification, confidence}" produces structured output ready for downstream processing.
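Structured output works best as a pair: a prompt that names the exact fields, and a parser that validates the reply before anything downstream consumes it. A sketch using the risk-assessment schema from the example above (the helper names and prompt wording are illustrative assumptions):

```python
import json

REQUIRED_FIELDS = {"risk_level", "justification", "confidence"}

def build_structured_prompt(document):
    """Ask the model to answer in a fixed JSON schema."""
    return (
        "Assess the risk described in the document below.\n\n"
        f"{document}\n\n"
        "Respond with ONLY a JSON object containing exactly these fields: "
        "risk_level, justification, confidence."
    )

def parse_structured_response(raw):
    """Validate that the model's reply is JSON with the required fields."""
    data = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"response missing fields: {sorted(missing)}")
    return data
```

Validating on the way in matters because models occasionally ignore format instructions; failing loudly at the parser is far easier to debug than malformed data propagating through a pipeline.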
Limitations of Prompt Engineering
While prompt engineering is powerful, it has limits. As tasks become more complex or domain-specific, prompt engineering alone becomes insufficient. Models cannot learn complex reasoning patterns from a few examples. Highly specialized terminology or concepts not well-represented in training data cannot be fully taught through prompts. Tasks requiring understanding of proprietary systems or recent developments exceed what prompts can provide.
When prompt engineering reaches its limits, you must consider RAG for accessing external knowledge or fine-tuning for teaching complex patterns. The key skill is diagnosing when you've maxed out prompt engineering and need to move to more complex customization.
Effective prompt engineering follows an iterative process. Start with a simple prompt, test it, measure results, identify gaps, and refine. Don't try to engineer the perfect prompt upfront. Instead, use an experimental mindset: test one prompt variation at a time, measure impact, and build on what works. This disciplined approach yields better results than random tweaking.
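The measure-and-refine loop above can be sketched as a tiny evaluation harness: fix a test set, score each prompt variant against it, and keep the winner. Everything here is a hypothetical stand-in, including `model_fn`, which represents whatever function actually calls your model:

```python
def evaluate_prompt(prompt_template, test_set, model_fn):
    """Score one prompt variant on a fixed test set.

    test_set is a list of (input, expected_output) pairs; model_fn is a
    stand-in for whatever sends a prompt to your model and returns its reply.
    """
    correct = sum(
        1 for inp, expected in test_set
        if model_fn(prompt_template.format(input=inp)) == expected
    )
    return correct / len(test_set)

# Fake model for illustration only: "classifies" by keyword lookup.
def fake_model(prompt):
    return "negative" if "late" in prompt or "hold" in prompt else "positive"

test_set = [
    ("My delivery is late again.", "negative"),
    ("Great service, thanks!", "positive"),
]
baseline = evaluate_prompt("Classify the sentiment: {input}", test_set, fake_model)
```

Holding the test set constant while changing one prompt variable at a time is what turns prompt tweaking into a controlled experiment rather than guesswork.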
When Prompt Engineering is Sufficient
Prompt engineering can be sufficient for: tasks that require pattern matching rather than learning complex concepts; domains where terminology is fairly standard; situations where RAG-retrieved context provides needed information; and tasks where moderate improvements are acceptable. Prompt engineering is particularly cost-effective—it requires nothing beyond your design time and ordinary inference costs.
Prompt engineering is insufficient for: teaching complex domain-specific reasoning patterns; tasks requiring knowledge not in model training; situations where precision and accuracy are critical and prompting alone doesn't achieve them. In these cases, RAG or fine-tuning becomes necessary.
Key Takeaway
Prompt engineering is the simplest and fastest customization approach, and it's more powerful than most people realize. Professional prompt engineers can achieve remarkable results through careful prompt design. However, prompt engineering has limits, and knowing those limits is as important as the engineering itself. The most successful practitioners master prompt engineering for what it does well, recognize its boundaries, and graduate to RAG or fine-tuning when necessary. This layered approach—starting simple, measuring results, adding complexity only when justified—delivers better outcomes than jumping to complex customization strategies prematurely.
Frequently Asked Questions
How many examples do I need for few-shot prompting?
Usually 3-10 examples are sufficient. One or two examples sometimes work but are unreliable. Too many examples consume precious context space. The optimal number depends on task complexity and example quality.
How do I measure if my prompt is working?
Define metrics before optimizing prompts. Compare new prompts against baselines on consistent test sets. Track not just accuracy but consistency, latency, and cost. Red team your prompts by deliberately trying to break them.
Should I use longer or shorter prompts?
Use the minimal prompt necessary. Longer prompts consume context space and can confuse models. Clear, concise prompts usually work better than verbose ones. However, context and examples sometimes require length.