Why Evaluation Frameworks Matter
Without a framework, evaluation is ad hoc and inconsistent. You might accept output on Monday that you would reject on Tuesday. You might overlook problems because you are not systematically checking. You might spend time evaluating unimportant aspects while missing critical flaws.
A framework forces consistency. It ensures you check all important dimensions. It helps you prioritize. It gives you language to explain why something is or is not acceptable. This chapter introduces the ACRE framework, which stands for Accuracy, Completeness, Relevance, and Appropriateness.
Frameworks are not rigid rules. They are thinking tools. Use ACRE to organize your evaluation. But the actual judgment call is yours. Different contexts weight dimensions differently. For creative writing, appropriateness might matter more than accuracy. For financial analysis, accuracy is paramount. Use the framework to ensure you are thinking systematically, not to replace judgment.
Dimension 1: Accuracy
What Accuracy Means
Accuracy is whether the facts in the output are true. Does the output contain verifiable claims? Are those claims correct? Accuracy includes factual accuracy (dates, names, statistics), conceptual accuracy (correctly explaining concepts), and logical accuracy (sound reasoning).
Assessing Accuracy
For factual claims: Verify against authoritative sources. If the output says "The earth orbits the sun in 365.25 days," check that. If it says "Company X was founded in 2010," verify that.
For conceptual claims: Check against your domain knowledge. If you are an expert, you can often spot conceptual errors immediately. If you are not an expert, consult with someone who is.
For logical claims: Walk through the reasoning. Does A actually lead to B? Are the assumptions stated? Are they reasonable?
Accuracy Red Flags
Watch for: specific numbers without sources, confident assertions about recent events (models have knowledge cutoffs), claims about proprietary information, statements that contradict what you know to be true, inconsistent information within the same output (says one thing here, contradicts it there).
Accuracy Trade-offs
Perfect accuracy is often impossible. The question is: what level of accuracy is acceptable for this use case? A marketing email can have minor inaccuracies. A legal document cannot. A brainstorm can be speculative. A technical specification cannot.
Dimension 2: Completeness
What Completeness Means
Completeness is whether the output covers all necessary ground. Does it address your question fully? Does it include all necessary elements? Are important aspects omitted?
Assessing Completeness
Create a checklist of what should be included: Before evaluating, write down what you expect to see. Then check whether the output covers those elements. Example: "A product proposal should include: problem statement, proposed solution, competitive differentiation, timeline, resource requirements, success metrics."
Check for gaps: Are there obvious topics the output should have covered but did not? Is the analysis superficial in any area?
Check for depth: Completeness is not just about listing topics. It is about adequate depth. A one-sentence explanation of a complex topic is incomplete.
Completeness Red Flags
Watch for: output that ends abruptly, missing sections that should be there, shallow treatment of complex topics, one-sided analysis that does not address counterarguments, missing context that would be needed to understand the output fully.
Completeness vs. Conciseness
There is tension between completeness and conciseness. You do not want a 10,000-word answer to a simple question. The answer is that completeness is context-dependent. For an email, shorter is better as long as it covers the essentials. For a strategy document, more depth is expected. Define what "complete" means for your specific context.
Dimension 3: Relevance
What Relevance Means
Relevance is whether the output actually addresses what you asked for. It is possible to have accurate and complete output that is completely irrelevant because it answers the wrong question.
Assessing Relevance
Compare to your original request: Reread what you asked for. Does the output address it? Or does it address something tangentially related?
Check focus: Is the output focused on your specific situation, or generic and broadly applicable? If you asked "How should we price our SaaS product," and the output gives generic SaaS pricing advice, that is less relevant than output that accounts for your specific product, market, and positioning.
Check audience alignment: If you asked for advice for "executives," is the output at the executive level, or is it too detailed or too basic? If you asked for content for "beginners," is it accessible to beginners?
Relevance Red Flags
Watch for: output that answers a related but different question, generic advice when you asked for specific recommendations, content aimed at the wrong audience, output that addresses your question in a way you did not intend, missing context about your specific situation.
Relevance Problems Are Common
This is one of the most common evaluation problems. An output can seem great because it is accurate and well-written, but if it does not actually address your need, it is not useful. Always check relevance carefully.
Dimension 4: Appropriateness
What Appropriateness Means
Appropriateness is whether the output is suitable for its intended use and audience. Is the tone right? Is the level of formality right? Does it contain anything that should not be there? Is it suitable for the context?
Assessing Appropriateness
Check tone: Should this be formal or casual? Professional or conversational? Is the tone appropriate for the audience and context?
Check for offensive content: Does the output contain anything offensive, inappropriate, or harmful? This includes biased language, inappropriate jokes, or content that could upset the audience.
Check for proprietary or sensitive information: Does the output inadvertently expose sensitive information? Could it be used against you if shared?
Check structure and format: Is the output formatted for its intended use? If you need bullet points, is it bullet points? If you need a narrative, is it narrative?
Check audience fit: Is the output appropriate for its intended audience? Would a C-suite executive read this, or is it too junior-level? Would a beginner understand this, or is it too technical?
Appropriateness Red Flags
Watch for: tone that does not match the context (too casual for a formal setting, too stuffy for a creative context), content that could be offensive to some audience members, inappropriate humor, overly technical language for a non-technical audience, overly simple language for a technical audience, anything that reveals proprietary information.
Putting It All Together: Using ACRE
Here is how to apply ACRE systematically:
1. Read the output. Get a complete picture.
2. Assess Accuracy. Are the facts correct? Is the reasoning sound? Note any inaccuracies.
3. Assess Completeness. Did it cover all necessary ground? Are there gaps? Is the depth adequate?
4. Assess Relevance. Does it actually address what was asked? Or does it go off track?
5. Assess Appropriateness. Is it suitable for its intended use and audience?
6. Make a decision. Based on all four dimensions, is the output acceptable? What needs to be fixed?
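The steps above can be sketched as a simple evaluation record. This is a minimal illustration in Python, not a prescribed tool: the class name, note fields, and the accept/revise rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class AcreEvaluation:
    """One pass over a piece of output, one note list per ACRE dimension."""
    accuracy_notes: list = field(default_factory=list)         # inaccuracies found
    completeness_notes: list = field(default_factory=list)     # gaps, shallow areas
    relevance_notes: list = field(default_factory=list)        # off-track content
    appropriateness_notes: list = field(default_factory=list)  # tone/format issues

    def decision(self) -> str:
        """Accept only if no dimension raised an issue; otherwise list the fixes."""
        issues = (self.accuracy_notes + self.completeness_notes
                  + self.relevance_notes + self.appropriateness_notes)
        return "accept" if not issues else "revise: " + "; ".join(issues)

# Hypothetical review of one output
review = AcreEvaluation()
review.accuracy_notes.append("founding year unverified")
review.relevance_notes.append("advice is generic, not specific to our market")
print(review.decision())
```

Keeping the notes per dimension, rather than as one undifferentiated list, is what makes the final decision explainable: you can say exactly which dimension failed and why.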
Weighted Evaluation
Not all dimensions are equally important for every task. You can weight them differently depending on context. Examples:
Financial analysis: Accuracy 40%, Completeness 30%, Relevance 20%, Appropriateness 10%
Marketing copy: Appropriateness 35%, Relevance 30%, Completeness 25%, Accuracy 10%
Code review: Accuracy 40%, Completeness 35%, Appropriateness 15%, Relevance 10%
Adjust the weights for your specific use case. This helps you prioritize your evaluation effort.
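The weighting idea reduces to simple arithmetic. Here is a minimal sketch: the weights mirror the financial-analysis example above, while the 1-to-5 rating scale and the individual scores are hypothetical.

```python
# Weighted ACRE score: rate each dimension 1-5, weights sum to 1.0.
# Weights follow the financial-analysis example; scores are made up.
weights = {"accuracy": 0.40, "completeness": 0.30,
           "relevance": 0.20, "appropriateness": 0.10}
scores = {"accuracy": 5, "completeness": 3,
          "relevance": 4, "appropriateness": 4}

weighted = sum(weights[d] * scores[d] for d in weights)
print(f"weighted score: {weighted:.2f} / 5")  # weighted score: 4.10 / 5
```

Note how the weighting changes the verdict: a completeness score of 3 drags a marketing-weighted total down far less than it would here, where completeness carries 30% of the weight.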
Key Takeaway
The ACRE framework gives you a systematic way to evaluate AI output. Accuracy checks if facts are correct. Completeness checks if nothing important is missing. Relevance checks if it addresses what you asked. Appropriateness checks if it is suitable for its context and audience. No single dimension is enough. Good evaluation checks all four.
Use ACRE as your evaluation checklist. Adapt the weighting for your specific use case. With this framework, you will catch problems that you would otherwise miss, and you will evaluate consistently across different pieces of output.