OVERVIEW

Exam details


Format	Multiple choice, scenario-based
Scenarios	4 of 6 selected at random
Passing score	720 / 1000
Cost	Free

The 5 domains

Module	Exam Domain	Weight	Topics
1	Agentic Architecture & Orchestration	27%	Agentic loops, multi-agent orchestration, hooks, session management, task decomposition
2	Tool Design & MCP Integration	18%	Tool descriptions, structured errors, tool distribution, MCP configuration, built-in tools
3	Claude Code Configuration & Workflows	20%	CLAUDE.md hierarchy, slash commands, path-specific rules, plan mode, CI/CD integration
4	Prompt Engineering & Structured Output	20%	Explicit criteria, few-shot prompting, JSON schemas, validation loops, batch processing
5	Context Management & Reliability	15%	Context management, escalation, error propagation, codebase exploration, accuracy validation

The 6 exam scenarios

The exam draws 4 at random. Each frames multiple questions:

Customer Support Resolution Agent -- agentic loops, tool design, escalation
Code Generation with Claude Code -- CLAUDE.md, commands, plan mode
Multi-Agent Research System -- coordinator-subagent orchestration, context passing
Developer Productivity -- built-in tools, codebase exploration
Claude Code for CI/CD -- non-interactive mode, structured output, independent review
Structured Data Extraction -- JSON schemas, validation, batch processing

After the 5 modules: a Final Scenario integrating all domains, then a 60-question Practice Exam.

MODULE 1: AGENTIC ARCHITECTURE & ORCHESTRATION

Domain 1 — 27% of exam Task Statements 1.1 – 1.7

Key Terms for Module 1

Agentic loop: while True loop where your code sends requests to Claude, checks stop_reason, executes any requested tools, and repeats until Claude signals end_turn.
stop_reason: The only valid loop control signal. "tool_use" means Claude needs tool results before continuing. "end_turn" means Claude has finished.
tool_use: Content block containing a tool call (name + JSON input). Your code executes it and returns the result.
tool_result: The tool output sent back to Claude as a role: "user" message. Every tool_use block must have a matching tool_result.
Hub-and-spoke: One coordinator manages all communication. Subagents talk only to the coordinator, never to each other.
Coordinator: Central agent responsible for task decomposition, subagent delegation, result aggregation, and gap detection.
Subagent: Specialized agent with isolated context and scoped tools. Does not inherit the coordinator's conversation history.
Task tool: Agent SDK mechanism for spawning subagents. Coordinator must include "Task" in allowedTools or spawning fails.
AgentDefinition: Subagent configuration: name, description, system prompt, and tool list.
Prerequisite gate: Programmatic block on a tool call until a prior step completes. Deterministic -- unlike prompt instructions, it cannot be bypassed.
PostToolUse hook: Intercepts tool results after execution, before Claude sees them. Used for normalization, redaction, enrichment.
PreToolUse hook: Intercepts tool calls before execution. Used for blocking policy violations, validating parameters, enforcing ordering.
Prompt chaining: Fixed sequential pipeline. Steps are predetermined regardless of intermediate results.
Dynamic decomposition: Adaptive pipeline. Next steps depend on what prior steps discovered.
--fork-session: Creates an independent exploration branch from a shared baseline. Changes in the fork do not affect the original session.
--resume: Continues a named session with full prior context. Risk: tool results may be stale if files changed.

LAB 1.1: THE AGENTIC LOOP

What the exam tests

The agentic loop lifecycle: send request, check stop_reason, execute tools, append results, repeat
tool_result messages appended as role: "user" -- every tool_use block requires a matching tool_result
Model-driven decision-making: Claude selects tools based on context, not a hardcoded decision tree
Anti-patterns: text parsing for loop control, arbitrary iteration caps, checking for text presence

The agentic loop pattern

Every Claude agent -- including Claude Code -- runs an agentic loop. Your code loops, Claude decides what to do on each iteration.

Each iteration:

Send the conversation (with tool definitions) to Claude
Check stop_reason in the response
If "tool_use" -- execute the requested tools, append results, continue
If "end_turn" -- extract the final text, exit

stop_reason is the only valid loop control signal. Not text content, not iteration count, not token usage. Claude decides when the task is finished.

How it works in code

import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "get_customer",
        "description": "Look up customer by email. Returns ID, name, tier.",
        "input_schema": {
            "type": "object",
            "properties": {"email": {"type": "string"}},
            "required": ["email"]
        }
    },
    {
        "name": "lookup_order",
        "description": "Look up order by ID. Returns status, items, tracking.",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"]
        }
    }
]

messages = [{"role": "user", "content": "I'm jane@email.com, check order ORD-555"}]

# ── THE AGENTIC LOOP ──────────────────────────
while True:
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        tools=tools,
        messages=messages
    )

    # ┌──────────────────────────────────────────┐
    # │  stop_reason is the ONLY loop signal     │
    # └──────────────────────────────────────────┘

    if response.stop_reason == "end_turn":           # ← Claude is done
        final = response.content[0].text
        print(final)
        break

    if response.stop_reason == "tool_use":           # ← Claude needs tools
        # Append Claude's response to conversation
        messages.append({
            "role": "assistant",
            "content": response.content
        })

        # Execute each tool Claude requested
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,        # ← Must match
                    "content": result
                })

        # CRITICAL: Send results back as role "user"
        messages.append({
            "role": "user",
            "content": tool_results
        })

What happens step by step:

Iteration	stop_reason	Claude's action	Your code's action
1	`tool_use`	Calls `get_customer(email="jane@email.com")`	Executes tool, appends result to messages
2	`tool_use`	Calls `lookup_order(order_id="ORD-555")`	Executes tool, appends result to messages
3	`end_turn`	Returns "Your order has shipped, tracking: 1Z999..."	Prints final response, exits loop

Your code never chose which tool to call. Claude evaluated the context and made every decision. That's model-driven decision-making -- the opposite of a hardcoded decision tree.

If the customer isn't found, Claude adapts. get_customer returns {"error": "No customer found"}, Claude skips lookup_order entirely and responds with "I couldn't find that account." A decision tree would have called lookup_order anyway -- model-driven control handles the edge case without extra code.

Anti-pattern: Text parsing for loop control

# ✗ ANTI-PATTERN: Parsing text to decide when to stop
while True:
    response = get_response()
    text = response.content[0].text
    if "Final answer:" in text.lower():
        break                    # ← Exits mid-task
    if len(text) > 100:
        break                    # ← Exits when text is "long enough"

Claude includes text alongside tool calls. Iteration 1: Claude says "Let me look up your account" while also requesting get_customer. A text-presence check exits on iteration 1 -- before the order lookup ever happens.

# ✓ CORRECT: Check stop_reason
while True:
    response = client.messages.create(...)
    if response.stop_reason == "end_turn":
        break                    # ← Claude decided it's done
    if response.stop_reason == "tool_use":
        execute_and_append()     # ← Claude needs more data

stop_reason is deterministic. Text parsing is not.

Anti-pattern: Missing tool results

Omitting tool results from the conversation produces an API error:

RuntimeError: API Error: every tool_use block needs a matching
tool_result in the next message. Missing tool_result for: call_001

Every tool_use block requires a matching tool_result with the same tool_use_id in the next role: "user" message. Without it, Claude never receives the data it requested and the loop breaks.

Check your understanding

Q1. You comment out the line that appends tool results to the conversation. What happens when the agent runs?

A) Claude detects the missing result and automatically retries the same tool call with corrected parameters.

B) The loop exits immediately with stop_reason: "end_turn" because Claude sees no pending work.

C) The API returns an error because the next request is missing the required tool_result entries.

D) The while True loop throws a KeyError when accessing the response content on the next iteration.

Correct: C. Tool results must be sent back as role: "user" messages. Without them, the API rejects the request entirely. Why not A: Claude doesn't auto-retry -- your code is responsible for sending results back. Why not B: The loop doesn't exit; the API rejects the request before Claude can respond. Why not D: The error comes from API validation, not a Python key lookup.

Q2. In the first iteration, Claude returns both text ("Let me look up your account") and a tool_use block. A developer checks: if any(b.type == "text" for b in response.content): break. What happens?

A) The loop works correctly -- Claude only includes text in its final response.

B) The loop exits after iteration 1, before the order lookup, because text was present alongside the tool call.

C) The check is redundant but harmless; stop_reason takes precedence over any text-based checks.

D) The loop enters an infinite cycle because the text block is consumed but the tool block is not.

Correct: B. Claude can include text and tool calls in the same response. Checking for text presence exits the loop on iteration 1, skipping lookup_order entirely. Only stop_reason == "end_turn" reliably signals completion. Why not A: Claude regularly includes text alongside tool calls as intermediate reasoning. Why not C: The check actively breaks the loop -- it's not harmless. Why not D: The loop exits, it doesn't cycle.

Q3. Your agentic loop exits mid-task. The termination check is: if "Final answer:" in response.content[0].text: break. What is the root cause?

A) The model sometimes includes "Final answer:" in intermediate reasoning before all tools complete.

B) The max_tokens limit is cutting off the response before the model can finish its work.

C) The tool results are not being appended to the conversation history correctly.

D) The model's temperature setting is too high, causing unpredictable text generation.

Correct: A. Parsing natural language for loop termination is an anti-pattern. The model may write "Final answer:" while still planning tool calls. Use stop_reason == "end_turn" instead. Why not B: A max_tokens cutoff produces stop_reason: "max_tokens", not a clean text match. Why not C: Missing tool results cause API errors, not premature text matches. Why not D: Temperature affects word choice, not whether specific phrases appear at the wrong time.

Exam tips

"Agent exits mid-task" --> the answer involves stop_reason, NOT text parsing or iteration caps
Wrong answer trap: "Set a maximum iteration count" -- iteration caps are a safety net, not a termination strategy
Wrong answer trap: "Check if response contains a final answer" -- text parsing exits early when Claude includes text alongside tool requests
Wrong answer trap: "Increase temperature for more reliable termination" -- temperature affects word choice, not loop control signals
The core pattern: On the exam, if the question is about loop termination, the answer is always stop_reason. Every other option is a distractor.

LAB 1.2: MULTI-AGENT COORDINATOR

What the exam tests

Hub-and-spoke architecture: one coordinator, specialized subagents, no direct subagent-to-subagent communication
Subagent context isolation -- subagents do not inherit the coordinator's conversation history
Coordinator responsibilities: decomposition, delegation, gap detection, re-delegation, synthesis
Narrow decomposition as the root cause of missing topic coverage (not subagent failure)
Dynamic subagent selection based on query needs (not always running the full pipeline)
Iterative refinement: evaluate synthesis for gaps, re-delegate targeted queries, re-synthesize

Hub-and-spoke architecture

One coordinator agent manages all communication. Specialized subagents handle individual tasks. Subagents never talk to each other -- everything flows through the coordinator.

Hub-and-spoke gives you four properties that peer-to-peer architectures lack:

Context isolation: Each subagent receives only what it needs, not the full conversation
Observability: Single communication point -- every delegation and result is loggable
Error recovery: Coordinator retries, re-delegates, or escalates when a subagent fails
Controlled information flow: Coordinator decides what context to pass and what to withhold

How it works in code

from anthropic import Agent, Task

# ── Coordinator: has the Task tool to spawn subagents ──
coordinator = Agent(
    model="claude-sonnet-4-5",
    tools=[Task],                    # ← Required for spawning
    prompt="""You are a research coordinator. For each query:
    1. Analyze what research is needed
    2. Delegate to the right subagents
    3. Evaluate findings for gaps
    4. Re-delegate if coverage is insufficient
    5. Synthesize the final report"""
)

# ── Subagents: specialized, scoped tools ──
search_agent = Agent(
    name="search",
    tools=[web_search, extract_urls],    # Only search tools
    prompt="Search the web for the given topic. Return structured findings."
)

analysis_agent = Agent(
    name="analysis",
    tools=[read_document, extract_data], # Only analysis tools
    prompt="Analyze the given documents. Return key findings."
)

synthesis_agent = Agent(
    name="synthesis",
    tools=[format_report],               # Only formatting tools
    prompt="Combine findings into a coherent report with citations."
)

Subagents do not inherit coordinator context. The coordinator calls Task(agent="search", prompt="Find articles about...") to spawn each subagent. That subagent sees only the prompt it was given -- not the coordinator's conversation history, not other subagents' results.

Dynamic selection vs full pipeline

The coordinator selects which subagents a query actually needs -- not every query requires the full pipeline:

# ✗ ANTI-PATTERN: Always run the full pipeline
def handle_query(query):
    search_results = run_subagent("search", query)
    analysis = run_subagent("analysis", search_results)
    report = run_subagent("synthesis", analysis)
    return report
# What if the query only needs analysis? Wasted search call.

# ✓ CORRECT: Coordinator decides which subagents to invoke
coordinator_prompt = """Analyze the query and decide:
- Does this need web search, document analysis, or both?
- Invoke only the subagents that are relevant.
- Don't route through the full pipeline if only one subagent is needed."""

Iterative refinement: gap detection

Synthesis is not the final step. The coordinator evaluates the report for coverage gaps and re-delegates if needed:

# Coordinator evaluates synthesis output
if "gaps" in evaluation:
    # Re-delegate with specific gap-filling queries
    additional = run_subagent("search",
        f"Find information about: {gaps}")
    # Re-synthesize with new findings merged in
    final = run_subagent("synthesis",
        f"Previous report: {report}\nNew findings: {additional}")

The narrow decomposition trap

Coordinator decomposes "impact of AI on creative industries" into visual arts subtasks only (digital art, graphic design, photography). Report misses music, writing, and film. Every subagent executed perfectly -- the failure is in what they were assigned, not how they performed.

When a report misses entire topic areas, the root cause is the coordinator's decomposition -- not subagent failure. The exam tests this distinction repeatedly.

Check your understanding

Q1. A compliance monitoring system uses three subagents to audit "employee benefits programs." The coordinator decomposes it into: "health insurance plans," "retirement 401k options," and "dental coverage." The final report says the company's benefits are fully compliant. Legal later finds violations in stock option vesting schedules and parental leave policies. What is the root cause?

A) The compliance-checking subagent relied on outdated regulatory data that missed recent policy changes.

B) The synthesis agent failed to identify and flag the missing benefit categories in its final report.

C) The subagents didn't search deeply enough within their assigned compliance areas.

D) The coordinator's task decomposition was too narrow -- it covered only health-related benefits and missed equity compensation and leave policies.

Correct: D. The coordinator defined "benefits" as health-related only. Stock options and parental leave were never assigned to any subagent. The subagents can't audit topics they were never asked to investigate. Why not A: The regulatory data doesn't matter if the topics were never assigned. Why not B: The synthesis agent can only synthesize findings it received -- it can't flag topics that were never investigated. Why not C: The subagents executed their assignments correctly -- the assignments were incomplete.

Q2. A coordinator passes the synthesis subagent a one-paragraph summary instead of full search and analysis findings. The final report is vague and lacks specific data. What should the coordinator pass instead?

A) Complete structured findings from prior agents -- full search results, analysis outputs, and source URLs.

B) The coordinator's full conversation history so the synthesis agent has maximum context available.

C) Only the original user query so the synthesis agent can re-research the topic from scratch.

D) A list of topic headings for the synthesis agent to fill in using its own knowledge base.

Correct: A. The synthesis agent needs complete findings to produce a detailed report. Summaries lose the specifics. Why not B: Passing full conversation history wastes tokens and confuses the agent with irrelevant coordinator reasoning. Why not C: Re-researching defeats the purpose of the multi-agent pipeline. Why not D: The synthesis agent should synthesize findings, not generate content from its own knowledge.

Q3. A legal document review system has a coordinator, OCR agent, clause extraction agent, and risk scoring agent. Every document goes through all three agents. But 60% of incoming documents are already digital PDFs that don't need OCR. What should change?

A) Run OCR anyway -- it's harmless on digital PDFs and keeps the pipeline consistent.

B) Have the coordinator inspect the document format first and skip OCR for digital PDFs, invoking only the agents each document needs.

C) Split into two separate pipelines: one for scanned documents and one for digital PDFs.

D) Replace the OCR agent with a more capable model that handles both formats natively.

Correct: B. The coordinator should assess what each document needs and invoke only the relevant agents. Skipping OCR for digital PDFs saves 60% of unnecessary processing. Why not A: Running OCR on digital PDFs wastes compute and can introduce artifacts. Why not C: Two separate pipelines duplicate the extraction and scoring logic. Why not D: The issue is unnecessary invocation, not OCR capability.

Exam tips

"Report misses entire topic areas" --> coordinator decomposition too narrow, not subagent failure
Wrong answer trap: "Improve the search agent's queries" -- if the coordinator never assigned the topic, better search cannot find it
Wrong answer trap: "Share full coordinator context with subagents" -- wastes tokens and confuses subagents with irrelevant information
The core pattern: The coordinator's job: select, delegate, evaluate coverage, re-delegate gaps. If topics are missing, fix the decomposition.

LAB 1.3: SUBAGENT CONTEXT PASSING AND SPAWNING

What the exam tests

The Task tool as the mechanism for spawning subagents (allowedTools must include "Task")
Subagent context must be explicitly provided -- subagents don't inherit parent context
AgentDefinition configuration: descriptions, system prompts, tool restrictions
Fork-based session management for exploring divergent approaches
Passing complete findings (not summaries) between agents
Structured data formats to preserve attribution (source URLs, dates, page numbers)
Parallel spawning: multiple Task calls in a single coordinator response
Goal-based coordinator prompts vs step-by-step procedural instructions

Spawning subagents with the Task tool

The Task tool is the spawning mechanism. If "Task" is not in the coordinator's allowedTools, spawning fails -- no error handling can work around this.

from anthropic import Agent, AgentDefinition

# ── Define subagent types ──────────────────────
search_agent_def = AgentDefinition(
    name="search",
    description="Searches the web for information on a given topic",
    prompt="You are a web search specialist. Return structured "
           "findings with source URLs and publication dates.",
    tools=["web_search", "extract_urls"]     # ← Scoped tools
)

analysis_agent_def = AgentDefinition(
    name="analysis",
    description="Analyzes documents and extracts key findings",
    prompt="You are a document analyst. Return structured data "
           "with page numbers and direct quotes.",
    tools=["read_document", "extract_data"]  # ← Scoped tools
)

# ── Coordinator must have Task in allowedTools ──
coordinator = Agent(
    model="claude-sonnet-4-5",
    allowedTools=["Task"],                   # ← Required
    agents=[search_agent_def, analysis_agent_def]
)

Explicit context passing

Subagents start with an empty conversation. Everything they need must appear in their prompt -- there is no implicit inheritance:

# ✗ ANTI-PATTERN: Assuming subagent inherits context
Task(
    agent="synthesis",
    prompt="Now synthesize the findings."
    # synthesis agent has NO IDEA what the findings are
)

# ✓ CORRECT: Pass complete findings explicitly
Task(
    agent="synthesis",
    prompt=f"""Synthesize these findings into a report:

    SEARCH FINDINGS:
    {json.dumps(search_results, indent=2)}

    ANALYSIS FINDINGS:
    {json.dumps(analysis_results, indent=2)}

    Requirements: Cite every claim with its source URL."""
)

If the subagent needs data from a prior step, that data must appear in the subagent's prompt. No implicit context, no exceptions.

Structured data preserves attribution

Pass structured formats between agents, not flat strings. Structured data separates content from metadata so attribution survives:

# ✗ ANTI-PATTERN: Flat string loses attribution
context = "AI market is growing at 35% CAGR. Battery costs dropped 89%."
# Which source? What date? Can't trace.

# ✓ CORRECT: Structured format preserves source
context = [
    {
        "claim": "AI market is growing at 35% CAGR",
        "source": "https://example.com/ai-report",
        "date": "2025-03-15",
        "confidence": "high"
    },
    {
        "claim": "Battery costs dropped 89% since 2010",
        "source": "https://example.com/energy-report",
        "date": "2024-11-20",
        "confidence": "medium"
    }
]

Parallel subagent spawning

Parallel execution requires multiple Task calls in a single coordinator response -- not one per turn:

# ✗ ANTI-PATTERN: Sequential spawning (one per turn)
# Turn 1: Task(agent="search", prompt="...")
# Turn 2: Task(agent="analysis", prompt="...")
# Each waits for the previous to complete.

# ✓ CORRECT: Parallel spawning (multiple in one response)
# The coordinator emits BOTH Task calls in the same response:
# [Task(agent="search", prompt="..."),
#  Task(agent="analysis", prompt="...")]
# Both execute simultaneously.

Goal-based vs procedural coordinator prompts

# ✗ ANTI-PATTERN: Step-by-step procedural instructions
coordinator_prompt = """
Step 1: Call the search agent with query X
Step 2: Call the analysis agent with the search results
Step 3: Call the synthesis agent
Step 4: Return the report"""
# Rigid. Can't adapt if step 2 finds nothing.

# ✓ CORRECT: Research goals and quality criteria
coordinator_prompt = """
Research goal: Produce a comprehensive report on {topic}.
Quality criteria:
- Cover at least 3 distinct subtopics
- Include data from both web sources and documents
- Flag any conflicting findings with source attribution
- Re-investigate if initial coverage is insufficient"""
# Adaptive. Coordinator can adjust based on findings.

Check your understanding

Q1. A coordinator tries to spawn a subagent but gets an error. The coordinator's allowedTools list includes ["web_search", "read_doc"] but not "Task". What is the fix?

A) Add the subagent's tools to the coordinator's allowedTools so it has access to them.

B) Use --fork-session instead of the Task tool to create a parallel execution branch.

C) Define the subagent inline rather than as a separate AgentDefinition to avoid the error.

D) Add "Task" to the coordinator's allowedTools so it can spawn subagents.

Correct: D. The coordinator needs "Task" in its allowedTools to spawn subagents. Without it, no subagent creation is possible. Why not A: Adding subagent tools to the coordinator gives the coordinator those tools directly, it doesn't enable spawning. Why not B: --fork-session is for session branching, not subagent invocation. Why not C: The Task tool is still required regardless of how the agent is defined.

Q2. A market analysis subagent produces a report stating "several competitors have raised prices recently" with no specifics. The data collection agents returned structured findings with exact company names, price changes, and effective dates. What went wrong?

A) The coordinator summarized the data collection findings into a brief paragraph before passing them -- the specific numbers, company names, and dates were lost.

B) The market analysis agent needs a better system prompt with clearer output requirements.

C) The analysis agent hallucinated vague claims instead of using the structured data it received.

D) The data collection agents returned too much raw information for the analysis agent to process.

Correct: A. The coordinator compressed detailed structured data into a vague summary. The analysis agent can only work with what it receives -- if specifics were summarized away, the output will be vague. Why not B: Even a perfect prompt can't recover data that was never provided. Why not C: The agent isn't hallucinating -- it's accurately reflecting the vague summary it received. Why not D: Detailed data is exactly what the analysis agent needs.

Q3. A due diligence system runs four independent checks: financial audit, legal review, IP assessment, and regulatory compliance. Each check takes ~20 seconds. The coordinator spawns them one at a time, waiting for each to finish. Total time: 80 seconds. The checks don't depend on each other. How do you cut the latency?

A) Use a smaller, faster model for each check to reduce per-task processing time.

B) Merge all four checks into a single agent prompt so only one call is needed.

C) Have the coordinator emit all four Task tool calls in a single response so they execute simultaneously.

D) Cache results from previous due diligence runs to avoid redundant processing.

Correct: C. Independent tasks should run in parallel. Four 20-second tasks in parallel complete in ~20 seconds instead of 80. The coordinator emits all Task calls in one turn. Why not A: Faster models reduce per-task time but the sequential bottleneck remains. Why not B: Merging loses specialization and makes each check less focused. Why not D: Caching doesn't apply to new due diligence targets.

Exam tips

"Subagent has no context" --> pass findings explicitly in the subagent's prompt
Wrong answer trap: "Subagents inherit coordinator history" -- they never do
"Coordinator can't spawn subagents" --> "Task" missing from allowedTools
Parallel spawning = multiple Task calls in ONE response, not one per turn
The core pattern: Pass complete structured data, not summaries. Use goal-based prompts, not step-by-step procedures. Spawn in parallel when tasks are independent.

LAB 1.4: PREREQUISITE GATES AND HANDOFF PATTERNS

What the exam tests

Programmatic enforcement (gates, hooks) vs prompt-based guidance -- and why prompts have a non-zero failure rate
Prerequisite gates that block tool execution until a prior step completes
Structured handoff protocols: summary, root cause, partial resolution, recommended next action
Decomposing multi-concern requests into parallel investigation tracks
Why financial, legal, and compliance operations require deterministic enforcement

Prerequisite gates: programmatic enforcement

A prerequisite gate blocks a tool call until a prior step has completed. process_refund cannot execute until get_customer returns a verified customer ID -- enforced in code, not by prompt instruction.

verified_customer_id = None   # ← Gate variable

def get_customer(email):
    global verified_customer_id
    customer = database.lookup(email)
    if customer:
        verified_customer_id = customer["id"]  # ← Opens the gate
    return customer

def process_refund(order_id, amount):
    # ── GATE: Must verify customer first ──────
    if verified_customer_id is None:
        return {
            "isError": True,
            "errorCategory": "prerequisite",
            "message": "Customer must be verified before "
                       "processing refunds. Call get_customer first."
        }
    # Gate passed -- process the refund
    return execute_refund(verified_customer_id, order_id, amount)

Prompt instructions have a non-zero failure rate -- and for financial operations, any failure rate is unacceptable:

# ✗ ANTI-PATTERN: Prompt-based enforcement
system_prompt = """IMPORTANT: Always verify the customer
with get_customer before calling process_refund."""
# Works ~95-97% of the time. In production, 3-5% of
# refunds process without verification. Unacceptable
# for financial operations.

# ✓ CORRECT: Programmatic gate
# The code BLOCKS the refund call if customer isn't verified.
# 100% enforcement. Zero exceptions.

If the consequence is financial, legal, or compliance-related: use a programmatic gate. If it's a style preference: use a prompt. The exam tests this distinction in every domain.

Structured handoff for human escalation

The human reviewer does not have access to the agent's conversation. The handoff must contain everything they need to act without starting from scratch:

def escalate_to_human(customer_id, summary, root_cause,
                      recommended_action):
    """Structured handoff for human agents."""
    return {
        "escalation": {
            "customer_id": customer_id,
            "summary": summary,
            # What the agent found
            "root_cause": root_cause,
            # What the agent recommends
            "recommended_action": recommended_action,
            "conversation_context": {
                "tools_called": ["get_customer", "lookup_order"],
                "issues_found": ["order delayed", "wrong item"],
                "partial_resolution": "Refund initiated for wrong item"
            }
        }
    }

"APP-002: needs review" forces the reviewer to reconstruct the entire case. A structured handoff with summary, root cause, attempted resolution, and recommended next action puts them 80% through the problem on arrival.

Multi-concern decomposition

Customer says: "Wrong item, other order is late, and I need to update my address." Three distinct issues requiring three investigation tracks with shared customer context. The agent decomposes, investigates each, and synthesizes a unified response -- the same coordinator pattern from Lab 1.2, applied within a single conversation.

Check your understanding

Q1. Production data shows 12% of cases skip get_customer and call lookup_order using only the customer's name, leading to misidentified accounts. What change addresses this?

A) Add a programmatic prerequisite that blocks lookup_order and process_refund until get_customer has returned a verified ID.

B) Enhance the system prompt to explicitly state that customer verification is mandatory before any order operations.

C) Add few-shot examples showing the agent always calling get_customer first in every scenario.

D) Implement a routing classifier that dynamically enables tool subsets based on the request type.

Correct: A. When specific tool ordering is required for critical business logic, programmatic enforcement provides deterministic guarantees. Prompt-based approaches (B, C) are probabilistic and insufficient when errors have financial consequences. Why not D: Tool availability doesn't enforce tool ordering -- the agent could still skip verification.

Q2. An agent escalates a case to a human reviewer. The handoff contains only the customer ID. The reviewer spends 20 minutes reconstructing the case. What should the handoff include?

A) The full conversation transcript so the reviewer has complete context from the interaction.

B) A link to the customer's account page so the reviewer can look up the relevant details.

C) The model's confidence score for the escalation decision along with the customer ID.

D) Customer ID, root cause analysis, partial resolution status, and recommended next action.

Correct: D. A structured handoff provides everything the reviewer needs: what was found, what was tried, and what to do next. Why not A: Full transcripts are too verbose -- reviewers need focused summaries. Why not B: An account link forces the reviewer to start from scratch. Why not C: Model confidence isn't calibrated and doesn't help the reviewer resolve the case.

Q3. A medical records system has a prompt instruction: "Always confirm patient identity with verify_patient before accessing records via get_records." Audit logs reveal that in 7% of requests, the agent accesses records without verification. What is the most effective fix?

A) Add "THIS IS A HIPAA REQUIREMENT" to the prompt instruction to increase compliance urgency.

B) Implement a programmatic prerequisite gate that blocks get_records until verify_patient returns a confirmed identity.

C) Include few-shot examples demonstrating the correct verify-then-access sequence in every case.

D) Reduce the temperature to make the model more deterministic and consistent in following instructions.

Correct: B. Compliance requirements like HIPAA demand deterministic enforcement -- 100% of the time. Prompt instructions (A, B) reduce violations but can't eliminate them. A gate makes it architecturally impossible to access records without verification. Why not A: Emphasizing importance in the prompt improves but can't guarantee compliance. Why not C: Few-shot examples are probabilistic. Why not D: Temperature affects word variation, not instruction adherence.

Exam tips

"Agent skips a required step" --> programmatic gate, not better prompting
"Reviewer starts from scratch" --> structured handoff with summary, root cause, and recommendation
Wrong answer trap: "Make the prompt stronger" -- stronger prompts reduce violations but cannot eliminate them
Wrong answer trap: "Add more few-shot examples" -- examples are probabilistic, gates are deterministic
The core pattern: Guaranteed compliance = programmatic gate. Soft guidance = prompt. No exceptions for financial, legal, or compliance operations.

LAB 1.5: AGENT SDK HOOKS FOR TOOL INTERCEPTION

What the exam tests

PostToolUse hooks: intercept and transform tool results before Claude processes them
PreToolUse hooks: intercept and block outgoing tool calls that violate policy
The deterministic/probabilistic distinction: hooks guarantee compliance, prompts do not
Normalizing heterogeneous data formats (timestamps, status codes, currency) via PostToolUse
Blocking policy violations and redirecting to escalation via PreToolUse
Decision rule for hooks vs prompts based on business impact

Two types of hooks

PreToolUse fires before execution. It can block the call, modify parameters, or redirect to a different workflow entirely.

PostToolUse fires after execution but before Claude sees the result. It transforms, normalizes, or redacts the output.

User message → Claude decides to call tool
    → PreToolUse hook (can BLOCK or MODIFY)
        → Tool executes
            → PostToolUse hook (can TRANSFORM result)
                → Claude sees the (possibly modified) result

PostToolUse: Data normalization

Backend tools return data in inconsistent formats -- Unix timestamps from one, ISO 8601 from another, cents instead of dollars from a third. A PostToolUse hook normalizes before Claude ever sees the result:

def normalize_data_hook(tool_name, tool_input, tool_output):
    """PostToolUse: Normalize data formats before
    Claude sees the result."""
    result = json.loads(tool_output)

    # Normalize timestamps to ISO 8601
    if "timestamp" in result:
        if isinstance(result["timestamp"], (int, float)):
            # Unix timestamp → ISO 8601
            from datetime import datetime, timezone
            dt = datetime.fromtimestamp(
                result["timestamp"], tz=timezone.utc)
            result["timestamp"] = dt.isoformat()
            #  1713200000 → "2025-04-15T18:13:20+00:00"

    # Normalize status codes to human-readable
    status_map = {200: "active", 404: "not_found",
                  503: "unavailable"}
    if "status" in result and isinstance(result["status"], int):
        result["status"] = status_map.get(
            result["status"], f"unknown_{result['status']}")
        #  200 → "active"

    # Normalize currency (cents → dollars)
    if "amount_cents" in result:
        result["amount"] = f"${result['amount_cents'] / 100:.2f}"
        del result["amount_cents"]
        #  14999 → "$149.99"

    return json.dumps(result)

Without this hook, Claude interprets raw formats on every response -- and inconsistently. A hook runs the same normalization code every time.

PreToolUse: Policy enforcement

A PreToolUse hook blocks policy-violating actions before they ever execute:

def refund_limit_hook(tool_name, tool_input, tool_output):
    """PreToolUse: Block refunds above $500."""
    if tool_name == "process_refund":
        amount = tool_input.get("amount", 0)
        if amount > 500:
            return {
                "blocked": True,
                "reason": f"Refund ${amount} exceeds $500 limit",
                "redirect": "escalate_to_human",
                "context": {
                    "customer_id": tool_input.get("customer_id"),
                    "requested_amount": amount
                }
            }
    return None  # Allow all other calls

Hooks vs prompts: the decision rule

Scenario	Use hooks	Use prompts
Refund limit ($500 max)	Deterministic	Model may ignore
Compliance blocking	Zero exceptions	Non-zero failure rate
Audit logging	Every call logged	May skip
Style preferences	Overkill	Soft guidance fits
Escalation judgment	Too rigid	Model reasons well
Output formatting	Not critical	Flexible guidance

Guaranteed compliance (financial, legal, security) = hooks. Preferences (style, tone, formatting) = prompts. There is no middle ground for critical business rules.

Check your understanding

Q1. A PostToolUse hook normalizes timestamps from multiple MCP tools. One tool returns Unix timestamps, another returns ISO 8601, and a third returns "March 15, 2025" strings. A developer suggests adding normalization instructions to the system prompt instead. Is this a good alternative?

A) No -- Claude may misinterpret formats or normalize inconsistently across iterations. A hook guarantees deterministic results.

B) Yes -- prompt-based normalization is simpler and equally reliable for standard timestamp formats. C) Yes -- but only if you include few-shot examples demonstrating each format conversion.

D) No -- but only because prompts lack the ability to process and convert timestamp formats.

Correct: A. Hooks run as code -- they normalize deterministically. Prompt-based normalization relies on Claude interpreting formats correctly every time, which is unreliable with heterogeneous inputs. Why not B: Prompt-based normalization is inconsistent across format variations. Why not C: Few-shot examples help but can't guarantee correct parsing of every format variant. Why not D: Prompts can process timestamps; the issue is reliability, not capability.

Q2. A PreToolUse hook blocks a process_refund call because the amount exceeds $500. The hook returns {blocked: true, redirect: "escalate_to_human"}. What does Claude see?

A) Nothing -- the tool call silently fails and Claude continues without acknowledging the refund.

B) An error message that crashes the agent loop and requires a manual restart to recover.

C) The blocking response -- Claude sees the refund was blocked and the redirect to human escalation, then adapts accordingly.

D) The refund processes anyway because hooks are advisory and don't prevent tool execution.

Correct: C. PreToolUse hooks return their response to Claude as the tool result. Claude sees the block reason and redirect, then tells the customer the refund is being escalated to a human agent. Why not A: Silent failures would leave Claude confused about what happened. Why not B: Hook responses are structured data, not crash-inducing errors. Why not D: Hooks are deterministic, not advisory -- the tool does not execute.

Q3. An agent occasionally grants discount codes above 30%, violating pricing policy. The system prompt says "Never issue discounts above 30%." A developer proposes adding few-shot examples showing the agent declining a 50% discount request. Is this sufficient?

A) Yes -- few-shot examples are the most reliable enforcement mechanism for pricing rules.

B) No -- remove the discount tool entirely so no discount above 30% is possible.

C) Yes -- if combined with a stronger prompt like "CRITICAL: NEVER exceed 30%" to reinforce the rule.

D) No -- add a PreToolUse hook that programmatically blocks any apply_discount call where percentage > 30.

Correct: D. Pricing rules with direct revenue impact require deterministic enforcement. A PreToolUse hook guarantees every discount call is validated before execution. Why not A/C: Prompt-based approaches reduce frequency but leave a non-zero violation rate. Why not B: Removing the tool eliminates the agent's ability to apply any discounts, including valid ones.

Exam tips

"Hook or prompt?" --> financial/legal/security consequence = hook. Style preference = prompt.
PostToolUse = transform results after execution (normalization, redaction, enrichment)
PreToolUse = intercept calls before execution (blocking, validation, enforcement)
Wrong answer trap: "Add normalization instructions to the prompt" -- hooks are deterministic, prompts are probabilistic
The core pattern: Hooks run as code. They cannot be ignored, skipped, or misinterpreted. That's the entire point.

LAB 1.6: TASK DECOMPOSITION STRATEGIES

What the exam tests

When to use fixed sequential pipelines (prompt chaining) vs dynamic adaptive decomposition
Prompt chaining for predictable, multi-aspect reviews (e.g., per-file code review)
Dynamic decomposition for open-ended investigation (e.g., "why is this slow?")
Splitting large reviews into per-file local passes + cross-file integration pass
Adaptive investigation plans that generate subtasks from discoveries

Two decomposition patterns

Prompt chaining: fixed sequential pipeline. Steps are known before execution starts.

Dynamic decomposition: adaptive pipeline. Steps emerge from intermediate results.

	Prompt chaining	Dynamic decomposition
Steps known in advance?	Yes	No
Adapts to findings?	No	Yes
Best for	Predictable reviews, multi-step extraction	Investigation, debugging, open-ended research
Example	Per-file code review → cross-file integration	"Why is the API slow?" → discover it's a DB issue

Do you know the steps before you start? Yes → chaining. No → dynamic decomposition.

Prompt chaining: Per-file + cross-file review

Large code reviews need two passes to avoid attention dilution -- the model spreading focus too thin across many files:

# Pass 1: Per-file local analysis
# Each file analyzed independently for local issues
file_findings = []
for file in changed_files:
    finding = analyze_single_file(file)
    file_findings.append(finding)
    # Finds: unused imports, null checks, naming issues

# Pass 2: Cross-file integration analysis
# Analyze how files interact (data flow, API contracts)
integration = analyze_cross_file(file_findings)
    # Finds: broken interfaces, inconsistent types across files

Single-pass review of 12 files gives each file ~8% of attention. Per-file analysis gives each file 100% attention. The integration pass then focuses exclusively on cross-file relationships -- broken interfaces, inconsistent types, mismatched contracts.

Dynamic decomposition: Adaptive investigation

Open-ended tasks require plans that evolve from discoveries:

# Phase 1: Map the landscape
structure = agent.run("Map the codebase structure. "
    "Identify modules, dependencies, and test coverage.")

# Phase 2: Identify high-impact areas (based on Phase 1)
priorities = agent.run(f"Given this structure: {structure}\n"
    "Identify the highest-impact areas for improvement.")

# Phase 3: Create adaptive plan (based on Phase 2)
plan = agent.run(f"Given these priorities: {priorities}\n"
    "Create a prioritized implementation plan. "
    "Adapt if you discover new dependencies.")

If Phase 2 discovers an undocumented helper function that multiple modules depend on, Phase 3 prioritizes testing that function first. A fixed pipeline would never adjust -- dynamic decomposition makes the plan a living document.

Check your understanding

Q1. You need to add comprehensive tests to a legacy codebase you've never seen. What decomposition pattern should you use?

A) Prompt chaining -- define the test plan upfront and execute each step in a fixed sequence.

B) Prompt chaining with a fallback to dynamic decomposition if the first pass fails to cover enough code.

C) No decomposition -- have the agent write all tests in a single comprehensive pass across the codebase.

D) Dynamic decomposition -- first map the codebase structure, then identify high-impact areas, then adapt the plan as you go.

Correct: D. An unfamiliar codebase requires exploration before planning. Dynamic decomposition adapts as you discover the structure, dependencies, and critical paths. Why not A: You can't define a test plan for a codebase you've never seen. Why not B: The task is clearly open-ended -- there's no reason to start with chaining. Why not C: A single pass on a large codebase causes attention dilution.

Q2. During a dynamic investigation of a slow API, the agent discovers the bottleneck is in the database layer, not the application code it was originally investigating. What should happen?

A) Report the finding and stop -- the agent was assigned to investigate application code only.

B) The investigation plan adapts -- the agent re-prioritizes to investigate the database layer where the bottleneck is.

C) Start a new investigation from scratch targeting the database layer with fresh context.

D) Flag the database issue and continue investigating the application code as originally assigned.

Correct: B. Dynamic decomposition means the plan adapts based on discoveries. Finding the real bottleneck in the database should shift the investigation there. Why not A: Stopping at the boundary misses the actual problem. Why not C: Starting fresh wastes the context already gathered. Why not D: Continuing to investigate application code ignores the actual cause.

Q3. A migration script needs to update configuration files across 15 services. When processed in a single prompt, the agent correctly updates the first 4 and last 3 services but produces incorrect configurations for services 5-12. What decomposition strategy fixes this?

A) Process each service's configuration independently in its own prompt, then run a cross-service consistency check.

B) Use a more powerful model with a larger context window to handle all 15 services at once.

C) Retry the same prompt -- the errors in the middle services may be random and non-repeatable.

D) Sort the services by complexity and process only the most complex ones that need attention.

Correct: A. Independent per-service processing gives each configuration full attention. The cross-service pass catches inconsistencies between services. Why not B: A larger context window doesn't fix the attention distribution problem. Why not C: The errors aren't random -- they're caused by attention dilution in the middle of the input. Why not D: Skipping services guarantees missed migrations.

Exam tips

"Open-ended investigation" --> dynamic decomposition, not prompt chaining
"Large code review misses files in the middle" --> split into per-file + cross-file passes
Wrong answer trap: "Run the review multiple times" -- repetition does not fix attention dilution
Wrong answer trap: "Increase context window" -- more context does not improve attention distribution
The core pattern: Steps known → chain. Steps unknown → decompose. Large reviews → per-file local pass + cross-file integration pass.

LAB 1.7: SESSION STATE, RESUMPTION, AND FORKING

What the exam tests

--resume for continuing a named session with full prior context
--fork-session for creating independent exploration branches from a shared baseline
Stale context risk: resumed sessions contain tool results from before file changes
Fresh start with injected summary as the safer alternative when tool results are stale
Decision rule for resume vs fork vs fresh start

Session operations

Three session operations, each for a different situation:

# Start a named session
claude --session-name "auth-redesign"

# Resume a named session (preserves full context)
claude --resume "auth-redesign"

# Fork from a session (independent branch)
claude --resume "auth-redesign" --fork-session

Resume continues with the same conversation and context. Use when prior analysis is still valid and files have not changed.

Fork creates an independent branch from the current state. Changes in the fork do not affect the original session. Use for comparing divergent approaches from a shared baseline.

The stale context problem

Monday: you analyze a codebase. Tuesday: a teammate refactors three files. You resume Monday's session. Claude references line numbers and function signatures from Monday that no longer exist.

The resumed session contains tool results from before the refactor. Claude treats those results as current. Every reference to "the handler at line 45" is wrong -- the handler moved to line 72.

Three approaches to stale context

Approach	When to use	Risk
Resume + inform about changes	Short break, few files changed	Moderate -- Claude may not fully re-analyze
Fresh start + injected summary	Significant changes, stale tool results	Low -- clean context, explicit summary
Resume as-is	Nothing changed since last session	None

Informed resume: small changes

# Resume and immediately inform about changes
claude --resume "auth-redesign"
# "Since our last session, auth.py and middleware.py were
# refactored. Re-read those files before continuing."

Fresh start with injected summary: significant changes

When tool results are stale across many files, start a new session with a structured summary of prior findings:

claude --session-name "auth-redesign-v2"
# "Previous analysis found: 1) Auth middleware uses deprecated
# JWT library (auth.py:45-80), 2) Session tokens stored in
# plaintext (middleware.py:23), 3) No rate limiting on login
# endpoint. Files may have changed -- re-verify before acting."

The summary captures important findings without stale line numbers or outdated tool outputs. Claude re-verifies current state before acting on any of them.

Forking: compare approaches from a shared baseline

# Base session: analyze the codebase
claude --session-name "perf-investigation"
# "I've analyzed the codebase. The main bottleneck is in
# the data processing pipeline."

# Fork 1: explore caching approach
claude --resume "perf-investigation" --fork-session
# "Explore adding a Redis cache layer to the pipeline."

# Fork 2: explore batching approach
claude --resume "perf-investigation" --fork-session
# "Explore converting to batch processing instead."

# Both forks start from the same analysis baseline
# but explore different solutions independently.
# Neither fork sees the other's changes.

Check your understanding

Q1. You resume a codebase analysis session after your teammate refactored several files overnight. Claude references functions and line numbers that no longer exist. What happened?

A) The session data corrupted during the overnight idle period and lost some stored context.

B) Claude's context contains stale tool results from before the refactor -- it's referencing old contents that changed.

C) The model's context window overflowed, causing it to lose track of earlier code references.

D) The teammate's changes created git conflicts with Claude's pending suggestions from the session.

Correct: B. Resumed sessions contain tool results from prior reads. If files changed, those results are stale. Claude references old line numbers because it hasn't re-read the files. Why not A: Sessions don't corrupt -- the data is just outdated. Why not C: Context overflow produces different symptoms (degraded responses, not incorrect references). Why not D: Git conflicts are a separate concern from stale session context.

Q2. After a 30-minute debugging session, you've narrowed a memory leak to two possible causes: a connection pool issue or an event listener leak. You want to investigate both independently without one investigation polluting the other's context. Which session operation?

A) Start two fresh sessions and repeat the 30 minutes of debugging work in each one.

B) Investigate one cause, undo your changes, then investigate the other cause separately.

C) Fork the session to create two branches from the current debugging state, then investigate each in its own fork.

D) Ask Claude to investigate both causes simultaneously in the same session context.

Correct: C. --fork-session branches from the current state. Both forks inherit the 30 minutes of debugging context but explore different causes independently. Why not A: Re-doing the debugging wastes 30 minutes per branch. Why not B: Undoing is error-prone and loses findings from the first investigation. Why not D: Investigating both in one session risks context interference between the two hypotheses.

Q3. A developer resumes a 3-hour analysis session. Claude starts giving generic answers like "this module typically uses handler patterns" instead of referencing specific classes it found earlier. What is the most likely cause and fix?

A) The model is rate-limited -- wait and retry after the rate limit window resets.

B) The codebase is too large for Claude to analyze effectively in a single session.

C) The model was updated between sessions, causing it to lose prior analysis context.

D) Context degradation -- the session accumulated so much content that Claude loses specificity on earlier findings.

Correct: D. Long sessions accumulate verbose tool outputs that consume context. Claude loses specificity as important findings get pushed out by newer content. Use /compact or start fresh with a structured summary. Why not A: Rate limiting produces errors, not generic responses. Why not B: Large codebases are handled through decomposition, not avoided. Why not C: Model updates don't affect existing sessions.

Exam tips

"Stale references after resuming" --> inform about changes or start fresh with injected summary
"Compare two approaches" --> --fork-session, not two separate sessions
"Generic answers after long session" --> context degradation, use /compact or fresh start with summary
Wrong answer trap: "Resume the session" -- resuming with stale tool results produces incorrect references
The core pattern: Nothing changed → resume. Files changed → fresh start + summary. Compare approaches → fork.

MODULE 2: TOOL DESIGN & MCP INTEGRATION

Domain 2 — 18% of exam Task Statements 2.1 – 2.5

Key Terms for Module 2

Tool description: The primary mechanism for tool selection. Claude reads descriptions to decide which tool to call -- not tool names, not system prompt instructions.
isError: Boolean flag in tool responses. true = tool failed (access failure, timeout, policy violation). false = tool succeeded (even if results are empty).
errorCategory: Classifies the failure: transient (retry), validation (fix input), business (policy violation, escalate), permission (access denied, escalate).
isRetryable: Whether the agent should retry. true for transient errors. false for business and permission errors -- retrying cannot help.
tool_choice: API parameter constraining tool selection. "auto" = Claude decides freely. "any" = Claude must call a tool. {"type": "tool", "name": "X"} = forced specific tool.
MCP (Model Context Protocol): Standard protocol for extending Claude with external tools, data sources, and content catalogs.
.mcp.json: Project-level MCP configuration. Committed to git. Uses ${ENV_VAR} expansion for secrets.
~/.claude.json: User-level MCP configuration. Personal, not in version control.
MCP resources: Read-only content catalogs (API specs, schemas, issue lists) that give agents visibility without exploratory tool calls.
Grep: Built-in tool for searching file contents by text pattern or regex.
Glob: Built-in tool for finding files by name pattern with wildcards.
Edit: Built-in tool for targeted file modifications. Requires unique text match; fails on ambiguous matches.

LAB 2.1: TOOL DESCRIPTION QUALITY

What the exam tests

Tool descriptions as the primary selection mechanism -- not tool names, not system prompt keywords
Vague or overlapping descriptions as the root cause of misrouting
The five-part description template: purpose, input format, output, when to use, when NOT to use
System prompt keywords creating unintended tool associations
Splitting generic multi-purpose tools into focused single-purpose tools
Auditing system prompts for word overlap with tool descriptions

Tool descriptions drive tool selection

Claude selects tools by reading descriptions, not names. Two tools with similar descriptions produce misrouting regardless of how different their names are. A vague description ("Retrieves data") forces Claude to guess scope, input format, and boundaries.

The five-part tool description template:

Purpose: What the tool does in one sentence
Input format: Expected parameter types and formats
Output: What the tool returns
When to use: Specific scenarios where this tool is the right choice
When NOT to use: Boundaries -- what this tool should not be used for

How it works in code

# ✗ ANTI-PATTERN: Vague, overlapping descriptions
tools = [
    {
        "name": "analyze_content",
        "description": "Analyzes content and returns results.",
        # Too vague. What content? What results?
    },
    {
        "name": "analyze_document",
        "description": "Analyzes documents and extracts data.",
        # Nearly identical to above. Claude can't distinguish.
    }
]

# ✓ CORRECT: Specific, differentiated descriptions
tools = [
    {
        "name": "extract_web_results",
        "description": (
            "Extracts structured data from web search results. "
            "Input: URL string from a web search. "
            "Returns: title, snippet, publication date, source domain. "
            "Use for: processing web search output only. "
            "Do NOT use for: PDFs, local files, or database records."
        ),
    },
    {
        "name": "extract_document_data",
        "description": (
            "Extracts structured fields from uploaded documents "
            "(PDF, DOCX, TXT). "
            "Input: document content as text string. "
            "Returns: extracted fields matching the document schema. "
            "Use for: processing uploaded or local documents. "
            "Do NOT use for: web search results or URLs."
        ),
    }
]

Split generic tools into focused ones

A tool that does three things has a vague description by definition. Split into focused tools:

# ✗ ANTI-PATTERN: One generic tool
{"name": "analyze_document",
 "description": "Analyzes a document in various ways."}

# ✓ CORRECT: Three specific tools
{"name": "extract_data_points",
 "description": "Extracts numerical data (dates, amounts, IDs) "
                "from a document. Returns structured JSON."}

{"name": "summarize_content",
 "description": "Produces a 2-3 sentence summary of a document. "
                "Use when the user needs an overview, not details."}

{"name": "verify_claim_against_source",
 "description": "Checks whether a specific claim is supported by "
                "a source document. Returns: supported/unsupported "
                "with relevant excerpt."}

System prompt keyword impact

System prompt keywords bias tool selection. "Always analyze documents carefully" pushes Claude toward tools with "analyze" or "document" in their descriptions -- even when another tool fits better. Audit system prompts for word overlap with tool names and descriptions.

Check your understanding

Q1. An inventory management agent has two tools: check_warehouse_stock and check_store_inventory. Both descriptions say "Returns current inventory levels." The agent calls the wrong one 35% of the time. What is the most effective first fix?

A) Add few-shot examples mapping common queries to the correct tool. B) Rewrite each tool's description to include specific boundaries: check_warehouse_stock handles bulk/distribution queries with SKU inputs, while check_store_inventory handles retail location queries with store ID inputs. C) Build a keyword classifier that pre-routes "warehouse" queries to one tool and "store" queries to the other. D) Merge both tools into a single check_inventory tool.

Correct: B. Claude selects tools based on descriptions. When two descriptions are near-identical, the model can't distinguish them. Adding input formats, use cases, and explicit boundaries resolves the ambiguity at the source. Why not A: Few-shot examples add tokens without fixing the root cause -- ambiguous descriptions. Why not C: A keyword classifier is over-engineered when the real problem is that descriptions don't differentiate the tools. Why not D: Merging may work but is a larger architectural change when the fix is just clearer descriptions.

Q2. After improving tool descriptions, routing accuracy improves from 70% to 90%. But the agent still misroutes when users say "check this document" -- it calls extract_web_results instead of extract_document_data. The system prompt says "When checking content, always analyze thoroughly." What is the likely cause?

A) The model needs more training data for document-related queries. B) The keyword "content" in the system prompt creates an unintended association with extract_web_results, which also mentions "content" in its description. C) The model's temperature is too high. D) The tool names are too similar.

Correct: B. System prompt keywords can bias tool selection. "Content" appears in both the system prompt and extract_web_results, creating an unintended association. Fix: change the system prompt wording or rename the tool. Why not A: This is a configuration problem, not a training problem. Why not C: Temperature affects word choice, not tool selection logic. Why not D: The names are different; the descriptions (and system prompt keywords) are the issue.

Q3. A generic analyze_document tool handles three tasks: data extraction, summarization, and claim verification. Users report inconsistent behavior -- sometimes they get a summary when they wanted data extraction. What should you do?

A) Add more detailed instructions to the single tool's description. B) Split the generic tool into three purpose-specific tools: extract_data_points, summarize_content, and verify_claim_against_source, each with focused descriptions. C) Add a task_type parameter to the tool so the user specifies what they want. D) Use tool_choice to force the correct behavior.

Correct: B. Splitting into purpose-specific tools gives each a focused description that Claude can match precisely. Why not A: A longer description on a multi-purpose tool still requires Claude to interpret intent -- splitting removes the ambiguity. Why not C: Adding parameters shifts the burden to the user and doesn't help Claude's selection logic. Why not D: tool_choice forces a specific tool but requires your code to know which tool is needed -- that defeats model-driven selection.

Exam tips

"Agent picks the wrong tool" --> improve tool descriptions, not routing classifiers or few-shot examples
Wrong answer trap: "Add a routing classifier" -- over-engineered when descriptions are the root cause
Wrong answer trap: "Consolidate into one tool" -- makes ambiguity worse, not better
System prompt keywords can bias tool selection -- audit for word overlap with tool descriptions
The core pattern: More detail in descriptions = better selection. Five parts: purpose, input format, output, when to use, when NOT to use.

LAB 2.2: STRUCTURED ERROR RESPONSES

What the exam tests

isError flag as the primary failure signal -- true for failures, false for successes (including empty results)
Four error categories: transient (retry), validation (fix input), business (escalate), permission (escalate)
isRetryable field determining whether the agent should retry or escalate
Why generic "Operation failed" strings prevent intelligent recovery
Customer-friendly messages for business rule violations
The critical distinction: access failure (isError: true) vs empty result (isError: false)

The isError flag and error categories

A tool that returns "Operation failed" gives the agent nothing to work with. Structured errors tell the agent what happened, why, and what to do next.

isError: true means the tool failed to execute. isError: false means the tool succeeded, even if results are empty. Conflating the two is the most common error design mistake.

Four error categories:

Category	Example	isRetryable	Action
Transient	Database timeout, rate limit	Yes	Retry after delay
Validation	Invalid email format	No	Fix input and retry
Business	Refund exceeds $500 limit	No	Escalate to human
Permission	User lacks access to resource	No	Inform user, escalate

How it works in code

def make_tool_response(result=None, error=None):
    """Return structured MCP tool response."""
    if error:
        return {
            "isError": True,                    # ← Tool FAILED
            "content": [{
                "type": "text",
                "text": json.dumps({
                    "errorCategory": error["category"],
                    "isRetryable": error["retryable"],
                    "message": error["message"],
                    "customer_friendly": error.get("friendly")
                })
            }]
        }
    return {
        "isError": False,                       # ← Tool SUCCEEDED
        "content": [{"type": "text", "text": json.dumps(result)}]
    }

# Transient error -- agent should retry
make_tool_response(error={
    "category": "transient",
    "retryable": True,
    "message": "Database timeout after 5s"
})

# Business error -- agent should escalate
make_tool_response(error={
    "category": "business",
    "retryable": False,
    "message": "Refund $750 exceeds $500 agent limit",
    "friendly": "I'll connect you with a supervisor "
                "who can process this refund."
})

The critical distinction: access failure vs empty result

The exam tests this concept heavily. Get it wrong and the agent lies to users:

# ACCESS FAILURE: Search could not execute
# isError: True -- the tool FAILED
{
    "isError": True,
    "content": [{"type": "text", "text": json.dumps({
        "errorCategory": "transient",
        "isRetryable": True,
        "message": "Search service unavailable"
    })}]
}

# EMPTY RESULT: Search executed, found nothing
# isError: False -- the tool SUCCEEDED
{
    "isError": False,
    "content": [{"type": "text", "text": json.dumps({
        "results": [],
        "message": "No matching records found"
    })}]
}

Never return isError: false with empty results for an access failure. Database down + {"results": []} = Claude tells the user "no records found." The truth: it never searched at all.

Check your understanding

Q1. A shipping tracker tool returns {"isError": false, "shipments": []} when the carrier's API times out. The agent tells the customer "You have no shipments in transit." The customer actually has three active shipments. What is the problem?

A) The agent should display a "no results" message differently. B) The agent should always retry shipping queries before responding. C) The tool is signaling success (isError: false) for a failed API call -- it should return isError: true with errorCategory: "transient" so the agent knows the lookup never executed. D) The agent's system prompt needs instructions to handle empty shipping results.

Correct: C. The tool treats a timeout the same as a legitimate empty result. The agent trusts the isError: false signal and concludes the customer has no shipments. The fix: return isError: true when the upstream call fails. Why not A: The display isn't the problem -- the agent genuinely believes there are no shipments. Why not B: The agent doesn't know it should retry because the response indicates success. Why not D: Prompt-level handling can't override a tool that says the query succeeded.

Q2. A subagent encounters a rate limit error (HTTP 429) from an external API. It has retried 3 times without success. What should it do?

A) Keep retrying until the rate limit resets. B) Return a generic "search unavailable" message to the coordinator. C) Return structured error context to the coordinator including: failure type (transient/rate_limit), what was attempted, partial results if any, and that retries were exhausted -- so the coordinator can decide on an alternative approach. D) Silently return empty results so the coordinator can continue.

Correct: C. The subagent should propagate structured error context after exhausting local retries. The coordinator needs to know: what failed, what was tried, and what partial results exist -- so it can decide whether to try an alternative source or escalate. Why not A: Infinite retries waste time and block the pipeline. Why not B: Generic messages hide valuable context the coordinator needs for recovery. Why not D: Silent suppression is the worst anti-pattern -- it leads to incorrect conclusions based on missing data.

Q3. An agent receives this error: {"errorCategory": "business", "isRetryable": false, "message": "Refund exceeds $500 limit", "customer_friendly": "I'll connect you with a supervisor."}. What should the agent do?

A) Retry the refund with a lower amount. B) Use the customer_friendly message to inform the customer and escalate to a human agent. C) Ignore the error and try a different tool. D) Override the business rule by splitting the refund into two smaller transactions.

Correct: B. Business errors with isRetryable: false are policy violations that require escalation, not workarounds. The customer_friendly message tells the agent exactly how to communicate with the customer. Why not A: Retrying with a lower amount changes the customer's request without permission. Why not C: Ignoring structured errors loses important context. Why not D: Splitting to circumvent a business rule is a compliance violation.

Exam tips

"Tool returns empty for failed search" --> isError: true for access failures, isError: false for genuine empty results
"Generic error string" --> structured metadata: errorCategory, isRetryable, message
Wrong answer trap: "Retry everything" -- only transient errors are retryable. Business and permission errors require escalation.
Wrong answer trap: "Silently return empty results" -- the worst anti-pattern; it makes the agent lie to users
The core pattern: Structured errors enable recovery. Generic strings prevent it. Silent suppression causes harm.

LAB 2.3: TOOL CHOICE AND SCOPED ACCESS

What the exam tests

4-5 tools per agent is optimal; 18+ tools degrades selection quality significantly
Scoped tool access: each agent gets only tools relevant to its role
tool_choice modes: "auto" (optional), "any" (must call something), forced (must call specific tool)
Replacing generic tools with constrained alternatives that validate inputs
Forced tool selection to guarantee a step runs first in a pipeline
"any" to guarantee structured output instead of text responses

Scoped tool access

More tools per agent = worse selection. An agent with 18 tools calls delete_account when a user asks about billing. An agent with 4 focused tools never sees delete_account at all.

Keep each agent to 4-5 tools relevant to its role.

# ✗ ANTI-PATTERN: One agent with 18 tools
support_agent = Agent(
    tools=[
        get_customer, lookup_order, process_refund,
        update_address, send_email, create_ticket,
        search_kb, escalate, check_inventory,
        modify_subscription, cancel_account,        # ← Dangerous!
        delete_account,                              # ← Dangerous!
        generate_report, update_billing,
        schedule_callback, merge_tickets,
        apply_coupon, check_fraud
    ]
    # 18 tools. Agent calls delete_account when user
    # asks about billing. Selection degrades.
)

# ✓ CORRECT: Scoped agents with 4-5 tools each
customer_agent = Agent(
    tools=[get_customer, update_address, check_fraud,
           send_email]                               # 4 tools
)
order_agent = Agent(
    tools=[lookup_order, check_inventory,
           process_refund, apply_coupon]              # 4 tools
)
escalation_agent = Agent(
    tools=[escalate, create_ticket,
           schedule_callback, merge_tickets]          # 4 tools
)

tool_choice modes

# "auto" -- model decides whether to call a tool at all
# Risk: model may return text instead of calling a tool
response = client.messages.create(
    tool_choice={"type": "auto"},  # ← May skip tools entirely
    ...
)

# "any" -- model MUST call a tool, but chooses which one
# Use when you need guaranteed structured output
response = client.messages.create(
    tool_choice={"type": "any"},   # ← Guaranteed tool call
    ...
)

# Forced -- model MUST call this specific tool
# Use to ensure a step runs first (e.g., extract metadata
# before enrichment)
response = client.messages.create(
    tool_choice={"type": "tool",
                 "name": "extract_metadata"},  # ← This tool only
    ...
)

Constrained tool replacement

Generic tools accept any input. Constrained tools validate before execution:

# ✗ ANTI-PATTERN: Generic fetch accepts any URL
def fetch_url(url):
    return requests.get(url).text    # Accepts anything!

# ✓ CORRECT: Constrained tool validates inputs
def load_document(url):
    allowed = ["docs.example.com", "wiki.example.com"]
    if not any(url.startswith(f"https://{d}") for d in allowed):
        return {"isError": True,
                "message": f"URL must be from: {allowed}"}
    return requests.get(url).text

Check your understanding

Q1. A resume parsing pipeline uses tool_choice: "auto". For about 1 in 8 resumes, Claude responds with a text analysis instead of calling the parse_resume tool, and the downstream database insert fails on the unexpected format. What is the fix?

A) Add a system prompt instruction: "Always call parse_resume." B) Switch to tool_choice: "any" to guarantee Claude calls a tool instead of returning text. C) Add a try/catch that parses the text response when the tool call is missing. D) Filter out resumes that cause text responses.

Correct: B. tool_choice: "any" guarantees a tool call. Under "auto", Claude can choose to respond with text, which breaks pipelines expecting structured output. Why not A: Prompt instructions are probabilistic -- the 12% failure rate proves they can't guarantee tool calls. Why not C: Parsing text as a fallback adds fragile logic instead of fixing the root cause. Why not D: Filtering discards valid resumes because of a configuration problem.

Q2. An HR onboarding agent has access to 20 tools, including terminate_employee and revoke_system_access. When processing a new hire's benefits enrollment, the agent occasionally calls terminate_employee due to a keyword match on "termination date" in the benefits form. What should change?

A) Add a system prompt instruction: "Never call terminate_employee during onboarding." B) Restrict the onboarding agent's tool set to only HR onboarding tools -- terminate_employee and revoke_system_access should not be in its scope. C) Rename terminate_employee to make it less likely to be triggered by benefits keywords. D) Add a confirmation dialog before every tool call.

Correct: B. Scoping tools to the agent's role removes dangerous tools from consideration entirely. An agent that can't see terminate_employee can't call it, regardless of keyword overlap. Why not A: Prompt instructions have a non-zero failure rate -- unacceptable for destructive operations. Why not C: Renaming helps with keyword confusion but the tool still shouldn't be in the onboarding agent's scope. Why not D: Confirmation on every call adds friction without addressing why a dangerous tool is accessible.

Q3. You need to ensure extract_metadata runs before any enrichment tools in a pipeline. How should you configure the first API call?

A) Add prompt instructions: "Always call extract_metadata first." B) Use tool_choice: {"type": "tool", "name": "extract_metadata"} to force the specific tool, then switch to "auto" for subsequent turns. C) List extract_metadata first in the tools array. D) Use tool_choice: "any" and hope Claude picks the right one.

Correct: B. Forced tool selection guarantees extract_metadata runs first. Subsequent turns use "auto" so Claude can choose enrichment tools freely. Why not A: Prompt instructions are probabilistic. Why not C: Tool array ordering doesn't guarantee selection order. Why not D: "any" guarantees a tool call but doesn't control which tool.

Exam tips

"Agent uses wrong tool / dangerous tool" --> scope tool access per role, 4-5 tools max
"Pipeline crashes because Claude returns text" --> switch from "auto" to "any" or forced selection
Wrong answer trap: "Add prompt instructions to avoid the tool" -- scoping is deterministic, prompts are not
tool_choice modes: auto = optional, any = must call something, forced = must call THIS tool
The core pattern: Fewer tools = better selection. Scoped access = no misuse. Forced selection = guaranteed ordering.

LAB 2.4: MCP SERVER CONFIGURATION

What the exam tests

Two configuration scopes: .mcp.json (project, in git) vs ~/.claude.json (user, personal)
${ENV_VAR} expansion for credentials -- never hardcode secrets in config files
MCP tools discovered automatically at connection time from all configured servers
MCP resources as read-only content catalogs that eliminate exploratory tool calls
Enhancing MCP tool descriptions when Claude prefers built-in tools over MCP equivalents
Community MCP servers over custom implementations for standard integrations

Project vs user scope

Two configuration scopes with different sharing models:

Scope	File	Shared via git?	Use for
Project	`.mcp.json`	Yes	Team-shared tools (GitHub, Jira, database)
User	`~/.claude.json`	No	Personal/experimental tools

How it works in practice

// .mcp.json -- PROJECT scope (committed to git)
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["@modelcontextprotocol/server-github"],
      "env": {
        "GITHUB_TOKEN": "${GITHUB_TOKEN}"     // ← Env var, NOT hardcoded
      }
    },
    "postgres": {
      "command": "npx",
      "args": ["@modelcontextprotocol/server-postgres"],
      "env": {
        "DATABASE_URL": "${DATABASE_URL}"     // ← Each dev sets their own
      }
    }
  }
}

// ~/.claude.json -- USER scope (personal, not in git)
{
  "mcpServers": {
    "my-experiment": {
      "command": "node",
      "args": ["./my-local-tool/server.js"]   // ← Personal tool
    }
  }
}

Never hardcode secrets in .mcp.json. Use ${ENV_VAR} expansion. The config file names the environment variable, not the secret itself. Hardcoded tokens in version control are a critical security failure that persists in git history even after removal.

MCP resources vs tools

Tools are actions (search, create, update). Resources are read-only content catalogs. Resources eliminate the exploratory tool calls an agent would otherwise need to discover what data exists.

# Without resources: Claude makes discovery calls
# "What endpoints exist?" → tool call
# "What tables are in the DB?" → tool call
# "What issues are open?" → tool call
# Each costs tokens and time.

# With resources: Claude reads the catalog directly
# resource://api-docs/endpoints → list of all endpoints
# resource://db-schema/tables → all table definitions
# resource://jira/open-issues → current issue summaries
# Zero tool calls needed for discovery.

Community vs custom servers

Use existing community MCP servers for standard integrations (GitHub, Jira, Postgres, Slack). Build custom servers only when no community server covers your team-specific workflow.

Check your understanding

Q1. A new team member clones the repo but Claude Code doesn't connect to the GitHub MCP server. The .mcp.json is configured correctly with ${GITHUB_TOKEN}. What is the most likely cause?

A) The MCP server needs to be installed globally. B) The team member hasn't set the GITHUB_TOKEN environment variable on their machine. C) The .mcp.json file needs to be in the user's home directory. D) Claude Code doesn't support environment variable expansion.

Correct: B. Environment variable expansion reads from the developer's local environment. If GITHUB_TOKEN isn't set, the server can't authenticate. Why not A: MCP servers are run via npx which downloads on demand. Why not C: .mcp.json is project-scoped and belongs in the repo root. Why not D: Claude Code supports ${VAR} expansion in .mcp.json.

Q2. A developer commits .mcp.json with "GITHUB_TOKEN": "ghp_abc123xyz" hardcoded instead of using ${GITHUB_TOKEN}. What is the risk?

A) The token will expire faster. B) The token is now in version control history -- anyone with repo access can see it, and rotating it requires rewriting git history. C) The server will fail to connect. D) Other team members won't be able to use the server.

Correct: B. Hardcoded secrets in version control are a critical security failure. Even if removed in a later commit, the secret persists in git history. Why not A: Token expiration isn't related to where it's stored. Why not C: The server will connect fine -- the security problem is that the token is exposed. Why not D: Other members could use it (which is actually part of the problem).

Q3. Claude Code has both a built-in Grep tool and an MCP-provided code_search tool with advanced features (semantic search, cross-repo). But Claude keeps using Grep instead. What should you do?

A) Remove the built-in Grep tool. B) Enhance the MCP tool's description to clearly explain its advanced capabilities and when to prefer it over Grep. C) Add a system prompt instruction: "Always use code_search instead of Grep." D) Rename the MCP tool to grep_advanced.

Correct: B. Claude selects tools based on descriptions. If the MCP tool's description doesn't explain why it's better than Grep for certain queries, Claude defaults to the familiar built-in. Why not A: You can't remove built-in tools. Why not C: System prompt instructions are probabilistic. Why not D: Renaming to include "grep" may increase confusion about which tool is which.

Exam tips

"MCP server not connecting for new team member" --> environment variable not set on their machine
"Hardcoded token in config" --> always use ${ENV_VAR} expansion, never literal secrets
"Claude prefers built-in tools over MCP tools" --> enhance MCP tool descriptions to explain advantages
Wrong answer trap: "Put credentials in .mcp.json" -- always wrong, regardless of context
The core pattern: .mcp.json for the team (with ${ENV_VAR}). ~/.claude.json for personal experiments. Resources for discovery without tool calls.

LAB 2.5: BUILT-IN TOOLS

What the exam tests

Grep searches file contents; Glob searches file names -- confusing them is a common exam trap
Read for viewing, Write for creating new files, Edit for targeted modifications to existing files
Edit fails on non-unique text matches; Read + Write is the fallback
Incremental codebase understanding: Grep → Read → trace dependencies
Preferring built-in tools over Bash equivalents (Read over cat, Grep over grep)

Grep vs Glob: the critical distinction

Grep searches inside files. Glob searches file names. Every other detail follows from this.

Tool	Searches	Input	Example
Grep	File contents	Text pattern or regex	"Find all files that call `processRefund`"
Glob	File names	Path pattern with wildcards	"Find all files named `*.test.tsx`"

# ✗ WRONG: Using Glob to search for function calls
Glob("processRefund")
# Glob searches file NAMES, not contents.
# This finds files literally named "processRefund", not
# files that contain a call to processRefund().

# ✓ CORRECT: Use Grep for content search
Grep("processRefund")
# Finds every file containing "processRefund" in its text.

# ✓ CORRECT: Use Glob for file patterns
Glob("**/*.test.tsx")
# Finds all files ending in .test.tsx, regardless of content.

Read, Write, and Edit

Tool	Purpose	When to use
Read	View file contents	Understanding existing code
Write	Create new files or full replacement	New files only
Edit	Targeted modifications	Changing specific lines in existing files

Edit requires a unique text match. If the target text appears multiple times, Edit fails with a non-unique-match error. The fallback ladder has three rungs -- try them in order:

Add surrounding context to the old string so the match becomes unique (usually enough).
Use replace_all: true if every occurrence should change identically.
Read + Write if the occurrences differ and context can't disambiguate them.

# ✓ Edit works: unique text match
Edit(file="auth.py",
     old="def validate(token):",
     new="def validate(token: str) -> bool:")

# ✗ Edit fails: text appears multiple times
Edit(file="config.py",
     old="return None",         # Appears 5 times in file!
     new="return default_value")
# Error: non-unique match

# ✓ Fallback 1: Add context to make the match unique
Edit(file="config.py",
     old="def get_timeout():\n    return None",  # ← now unique
     new="def get_timeout():\n    return default_value")

# ✓ Fallback 2: replace_all when every occurrence should change
Edit(file="config.py",
     old="return None",
     new="return default_value",
     replace_all=True)          # ← atomic multi-replace

# ✓ Fallback 3: Read + Write when occurrences must change differently
content = Read("config.py")
modified = content.replace(
    "# Line 45\n    return None",
    "# Line 45\n    return default_value"
)
Write("config.py", modified)

Incremental codebase understanding

Reading every file upfront wastes context. Build understanding by following the dependency chain:

# ✗ ANTI-PATTERN: Read everything first
for file in all_files:
    Read(file)                  # 500 files = context overflow

# ✓ CORRECT: Grep → Read → trace
# Step 1: Find entry points
Grep("processRefund")           # → found in 3 files

# Step 2: Read the primary implementation
Read("services/refund.py")      # → imports from billing.py

# Step 3: Follow the dependency
Read("services/billing.py")     # → calls payment gateway

# Step 4: Trace the full flow
Grep("PaymentGateway")          # → used in 2 more files

Check your understanding

Q1. You need to find all test files in the codebase. Which tool do you use?

A) Grep("test") B) Glob("**/.test.tsx") C) Read("tests/") D) Bash("find . -name '.test.tsx'")

Correct: B. Glob matches file name patterns. **/*.test.tsx finds all files ending in .test.tsx regardless of directory depth. Why not A: Grep searches file contents -- it would find files containing the word "test" in their code, not test files by name. Why not C: Read reads file contents, not directory listings. Why not D: Bash works but Glob is the dedicated tool -- prefer purpose-built tools over shell commands.

Q2. You try to Edit a configuration file to change a retry_count: 3 setting, but Edit returns "non-unique match" because retry_count: 3 appears in four different config blocks. What is the correct fallback?

A) Use Edit with a larger surrounding context string to make the match unique. B) Use Read to load the full file, locate the correct occurrence, make the change, then Write the entire modified file. C) Use Bash with sed to replace all occurrences. D) Delete the config file and rewrite it from scratch.

Correct: A. The first approach should be to include more surrounding context in the Edit call to make the match unique -- for example, including the section header or adjacent settings. Read + Write is the fallback if that still fails. Why not B: Read + Write works but is heavier than needed when adding context to Edit would resolve it. Why not C: Replacing all occurrences changes settings you didn't intend to modify. Why not D: Deleting and recreating loses file history and is unnecessarily destructive.

Q3. You need to understand how processRefund works in an unfamiliar codebase. What is the best approach?

A) Read every file in the services/ directory to build a complete picture. B) Start with Grep("processRefund") to find all references, then Read the primary implementation, then follow imports to trace the full flow. C) Search the documentation for processRefund. D) Use Glob to find the refund-related files by name.

Correct: B. Incremental understanding: Grep finds entry points, Read reveals implementation, then you trace imports. This builds understanding without loading unnecessary files. Why not A: Reading every file wastes context on irrelevant code. Why not C: Documentation may be outdated -- the code is the source of truth. Why not D: Glob finds files by name pattern, which may miss files that contain processRefund but aren't named obviously.

Exam tips

"Find files containing X" --> Grep (content search)
"Find files named X" --> Glob (path pattern matching)
"Edit failed, non-unique match" --> escalation ladder: add context → replace_all: true → Read + Write
Wrong answer trap: "Read all files first" -- incremental understanding avoids context overflow
Wrong answer trap: "Use Bash for file operations" -- built-in tools (Read, Write, Edit, Grep, Glob) are always preferred
The core pattern: Grep to find references. Read to understand implementation. Edit to change (Read + Write if Edit fails on non-unique match).

MODULE 3: CLAUDE CODE CONFIGURATION & WORKFLOWS

Domain 3 — 20% of exam Task Statements 3.1 – 3.6

Key Terms for Module 3

CLAUDE.md: Configuration files loaded automatically when Claude Code starts. Provide project context, coding standards, and behavioral rules.
User-level config: ~/.claude/CLAUDE.md -- personal preferences. Not in version control. New team members do not receive it.
Project-level config: .claude/CLAUDE.md -- team standards. Committed to git. Every developer who clones the repo gets it.
Directory-level config: CLAUDE.md in any subdirectory -- scoped rules for that directory and its children. Most specific level wins.
@import: Includes external files in CLAUDE.md. Keeps configuration modular instead of monolithic.
.claude/rules/: Topic-specific rule files that load automatically. No @import needed. Supports YAML paths frontmatter for conditional loading.
Slash command: Custom command in .claude/commands/ (project) or ~/.claude/commands/ (personal). Invoked with /command-name. Runs in current session context.
Skill: Complex reusable behavior in .claude/skills/ with SKILL.md frontmatter: context: fork, allowed-tools, argument-hint.
context: fork: Runs the skill in an isolated sub-agent context. Verbose output stays in the fork; only the final result returns to the main conversation.
Plan mode: Claude explores and designs before changing files. For complex, multi-file, architecturally significant tasks.
Direct execution: Claude makes changes immediately. For single-file, obvious fixes.
Explore subagent: Isolates verbose codebase discovery. Returns concise summaries to the main context instead of raw file contents.
-p flag: Non-interactive mode. Required for CI/CD pipelines. Without it, Claude Code hangs waiting for input.

LAB 3.1: CLAUDE.MD HIERARCHY AND MODULAR ORGANIZATION

What the exam tests

The three-level CLAUDE.md hierarchy: user, project, directory
User-level settings are personal and not shared via version control
@import syntax for modular configuration
.claude/rules/ directory for topic-specific rule files
Diagnosing hierarchy issues (new team member missing conventions)
Using /memory to verify which files are loaded

The CLAUDE.md hierarchy

Three configuration levels. More specific levels add to and can override broader ones:

Level	Location	In git?	Use for
User	`~/.claude/CLAUDE.md`	No	Personal preferences (editor style, verbosity)
Project	`.claude/CLAUDE.md`	Yes	Team standards (language, testing, conventions)
Directory	`src/api/CLAUDE.md`	Yes	Scoped rules (API validation, model conventions)

How it works in practice

~/.claude/
  CLAUDE.md                        # USER: personal only
    "Prefer concise responses"
    "Use dark theme formatting"

project/
  .claude/
    CLAUDE.md                      # PROJECT: team standards
      "Use TypeScript strict mode"
      "All functions need JSDoc"
      @import ./rules/testing.md   # ← Modular include
    rules/
      testing.md                   # Auto-loaded rule file
        "Use vitest for all tests"
        "Mock external APIs, not DB"
      api-conventions.md           # Auto-loaded rule file
        "REST endpoints return JSON"
        "Include pagination headers"

  src/
    api/
      CLAUDE.md                    # DIRECTORY: scoped rules
        "Validate auth tokens on every endpoint"
        "Use Zod schemas for request validation"

New team member missing conventions? The rules are in user-level config (~/.claude/CLAUDE.md) instead of project-level (.claude/CLAUDE.md). User-level files are not in version control -- only the developer who created them sees them.

@import and .claude/rules/

Two mechanisms for modular configuration instead of a monolithic file:

@import -- explicit includes from CLAUDE.md:

# .claude/CLAUDE.md
Use TypeScript strict mode.
@import ./rules/testing.md
@import ./rules/api-conventions.md

.claude/rules/ -- files load automatically when present:

.claude/rules/
  testing.md          # Loaded automatically
  api-conventions.md  # Loaded automatically
  deployment.md       # Loaded automatically

Run /memory to verify which files are loaded. Diagnose inconsistent behavior by checking whether expected rule files appear in the loaded list.

Check your understanding

Q1. A contractor starts working on your Python project and asks why Claude Code uses tabs instead of the team's 4-space standard. All full-time developers see the correct behavior. The project's .claude/CLAUDE.md contains only deployment instructions. Where is the indentation rule most likely configured?

A) In a directory-level CLAUDE.md inside src/. B) In ~/.claude/CLAUDE.md on each full-time developer's machine -- user-level config that isn't shared with the contractor via git. C) In .claude/rules/python.md. D) In the project's .editorconfig file.

Correct: B. User-level config is personal and not committed to the repository. The contractor doesn't have it. The fix: move team-wide coding conventions to project-level config so everyone gets them automatically. Why not A: Directory-level CLAUDE.md would be in the repo and the contractor would have it. Why not C: .claude/rules/ files are in the repo too. Why not D: .editorconfig configures the editor, not Claude Code's behavior.

Q2. Your team's CLAUDE.md has grown to 600 lines covering React component patterns, GraphQL query conventions, accessibility standards, and CI/CD procedures. Three developers edited the same file in the same sprint, causing merge conflicts. What is the best way to reorganize?

A) Add a table of contents to the file so developers can find sections faster. B) Split into focused files in .claude/rules/: react-patterns.md, graphql.md, accessibility.md, ci-cd.md -- each independently editable without merge conflicts. C) Move infrequently changed rules to user-level config. D) Create one CLAUDE.md per team member with their specific sections.

Correct: B. Topic-specific files in .claude/rules/ are auto-loaded and independently maintainable. Different developers can update different rules files without merge conflicts. Why not A: A table of contents helps navigation but doesn't solve the merge conflict problem. Why not C: User-level config isn't shared with the team. Why not D: Per-developer files fragment team conventions and create inconsistency.

Q3. You run /memory and notice a rule file isn't loading. The file is at .claude/rules/api-design.md. What should you check?

A) Whether the file has the correct YAML frontmatter. B) Whether the file is referenced with @import in CLAUDE.md. C) Whether the file exists and Claude Code has been restarted -- rules files in .claude/rules/ load automatically, but changes require a restart. D) Whether the file has the .md extension.

Correct: C. Files in .claude/rules/ auto-load without @import, but Claude Code needs to be restarted to pick up new files. Why not A: Rules files don't require YAML frontmatter (that's for path-scoped rules). Why not B: .claude/rules/ files auto-load -- @import is for files outside this directory. Why not D: The file already has .md.

Exam tips

"New team member missing conventions" --> rules are in user-level config, not project-level
"800-line CLAUDE.md" --> split into .claude/rules/ topic files
Wrong answer trap: "Add more detail to the monolithic file" -- splitting is always better than growing
/memory verifies which files are loaded -- use it to diagnose missing or unexpected behavior
The core pattern: Personal preferences → user-level. Team standards → project-level. Scoped rules → directory-level or path-specific. Monolithic files → split into .claude/rules/.

LAB 3.2: CUSTOM SLASH COMMANDS AND SKILLS

What the exam tests

Project-scoped commands (.claude/commands/) vs user-scoped (~/.claude/commands/)
Skills with SKILL.md frontmatter: context: fork, allowed-tools, argument-hint
context: fork for isolated execution preventing context pollution
Personal skill customization with different names
When to use skills vs CLAUDE.md

Commands vs skills

	Commands	Skills
Location	`.claude/commands/` (project) or `~/.claude/commands/` (personal)	`.claude/skills/` (project) or `~/.claude/skills/` (personal)
Invocation	`/command-name`	On-demand or auto-matched
Isolation	Runs in current session context	Can fork to isolated context
Tool restriction	No	Yes (allowed-tools frontmatter)
Use for	Simple, quick operations	Complex behaviors needing isolation or restricted tools

Commands: simple slash actions

# .claude/commands/review.md
# Invoked with: /review

Review the current PR for:
1. Functions exceeding 50 lines
2. Missing error handling on async operations
3. Hardcoded credentials
Report findings with file path, line number, and severity.

Skills: complex isolated behaviors

# .claude/skills/refactor/SKILL.md
---
context: fork           # ← Isolated from main conversation
allowed-tools:          # ← Only these tools available
  - Read
  - Edit
  - Grep
argument-hint: "file or directory to refactor"
---

Analyze the given code for SOLID violations.
Plan the refactoring approach before making changes.
Apply changes incrementally using Edit.
Never delete existing tests.

context: fork prevents verbose exploration output from polluting the main conversation. The skill runs in a separate context and only the final result comes back.

allowed-tools restricts which tools the skill can use. A refactoring skill limited to Read, Edit, and Grep can't accidentally delete files or run destructive shell commands.

Skills vs CLAUDE.md

Use skills for	Use CLAUDE.md for
On-demand task-specific workflows	Always-loaded universal standards
Operations needing isolation (context: fork)	Rules that apply to every interaction
Behaviors requiring tool restrictions	Project-wide coding conventions

Check your understanding

Q1. You want a /review command available to every developer who clones the repo. Where should you create it?

A) ~/.claude/commands/review.md B) .claude/commands/review.md in the project repository. C) In the root CLAUDE.md file. D) .claude/skills/review/SKILL.md

Correct: B. Project-scoped commands in .claude/commands/ are version-controlled and available to all developers who clone the repo. Why not A: ~/.claude/commands/ is personal -- not shared via git. Why not C: CLAUDE.md is for standards, not command definitions. Why not D: A skill is for complex behaviors needing isolation -- a review command is simpler.

Q2. A codebase exploration skill produces 500 lines of verbose discovery output that fills the main conversation context. What frontmatter should you add?

A) allowed-tools: [Read, Grep] B) context: fork -- runs the skill in an isolated sub-agent context so verbose output doesn't pollute the main conversation. C) argument-hint: "directory to explore" D) max-tokens: 100

Correct: B. context: fork isolates the skill's execution. The 500 lines of discovery stay in the fork; only the final summary returns to the main conversation. Why not A: Tool restriction is useful but doesn't solve the context pollution problem. Why not C: argument-hint prompts for parameters but doesn't isolate output. Why not D: max-tokens is not a valid SKILL.md frontmatter field.

Q3. A skill needs to refactor code but must not run shell commands or delete files. How do you restrict it?

A) Add prompt instructions: "Never use Bash or Write." B) Configure allowed-tools in the SKILL.md frontmatter to list only the safe tools (Read, Edit, Grep). C) Remove dangerous tools from the project configuration. D) Use context: fork to prevent destructive actions.

Correct: B. allowed-tools in SKILL.md frontmatter restricts which tools are available during skill execution. If Bash and Write aren't listed, the skill can't use them. Why not A: Prompt instructions are probabilistic. Why not C: Removing tools project-wide affects all operations, not just this skill. Why not D: Fork isolates context but doesn't restrict tools.

Exam tips

"Command available to whole team" --> .claude/commands/ (project-scoped, in git)
"Verbose output pollutes conversation" --> context: fork in skill frontmatter
"Restrict tool access during skill" --> allowed-tools frontmatter
Wrong answer trap: "Add instructions to the prompt" -- tool restriction is deterministic, prompts are not
The core pattern: Simple slash action → command. Complex behavior needing isolation or tool restriction → skill with context: fork and allowed-tools.

LAB 3.3: PATH-SPECIFIC RULES

What the exam tests

.claude/rules/ files with YAML frontmatter paths fields for conditional activation
Path-scoped rules load only when editing matching files (reduces irrelevant context)
Glob-pattern rules vs directory-level CLAUDE.md for cross-directory conventions

Path-scoped rules

Rules in .claude/rules/ support YAML frontmatter with paths fields. The rule activates only when editing files that match the glob pattern -- reducing irrelevant context:

# .claude/rules/test-conventions.md
---
paths:
  - "**/*.test.tsx"
  - "**/*.spec.ts"
  - "**/__tests__/**"
---

Use vitest for all test files.
Mock external APIs but use real database connections.
Each test file must have at least one describe block.

Loads for any test file across the entire codebase. Does not load when editing non-test files. Less irrelevant context, fewer wasted tokens.

Glob patterns vs directory-level CLAUDE.md

Approach	Best for	Limitation
Path-specific rules (glob patterns)	Conventions that span multiple directories (test files, config files)	Requires YAML frontmatter
Directory-level CLAUDE.md	Rules scoped to one directory and its children	Can't target files spread across directories

# Path-specific: targets ALL Terraform files, anywhere
# .claude/rules/terraform.md
---
paths:
  - "terraform/**/*"
  - "**/*.tf"
---
Use terraform fmt conventions.
Never hardcode AWS credentials.

# Directory-level: only targets files in infra/
# infra/CLAUDE.md
Use terraform fmt conventions.
# Misses .tf files in other directories!

Test files spread across the codebase? Path-specific rules with **/*.test.tsx. Files confined to one directory? Directory-level CLAUDE.md. The exam tests this distinction directly.

Check your understanding

Q1. Test files are spread throughout the codebase (each next to the code it tests). You want all test files to follow the same conventions. What's the most maintainable approach?

A) Place a CLAUDE.md in every directory that contains test files. B) Create a rule file in .claude/rules/ with paths: ["**/*.test.tsx", "**/*.spec.ts"] so conventions apply to all test files regardless of location. C) Put all test conventions in the root CLAUDE.md. D) Move all test files into a single tests/ directory.

Correct: B. Path-specific rules with glob patterns apply to matching files regardless of directory. One rule file covers all test files across the entire codebase. Why not A: Creating CLAUDE.md in every directory is unmaintainable. Why not C: Root CLAUDE.md loads for all files, wasting context on non-test interactions. Why not D: Restructuring the codebase to fit the tooling is backwards.

Q2. A path-specific rule uses paths: ["*.test.tsx"] but only matches test files in the project root, not in subdirectories. What's wrong?

A) The file extension is incorrect. B) The pattern needs **/*.test.tsx -- the ** prefix is required for recursive matching across subdirectories. C) Path-specific rules don't support glob patterns. D) The YAML frontmatter is malformed.

Correct: B. Without **, the pattern only matches files in the current directory. **/*.test.tsx matches test files at any depth. Why not A: .test.tsx is a valid extension. Why not C: Path-specific rules are built on glob patterns. Why not D: The frontmatter format is correct; only the pattern is too narrow.

Q3. A developer adds conventions for API endpoints to the root CLAUDE.md. These rules load on every interaction -- even when editing frontend components. What's the impact?

A) No impact -- extra context is harmless. B) The API rules waste tokens and may confuse Claude when working on unrelated files. Move them to a path-specific rule with paths: ["src/api/**"] so they load only when editing API files. C) The rules will override other conventions. D) The developer should split into directory-level CLAUDE.md files.

Correct: B. Irrelevant context wastes tokens and can confuse Claude into applying API conventions to frontend code. Path-specific rules with paths: ["src/api/**"] load only when relevant. Why not A: Extra context is not harmless -- it increases token usage and can cause irrelevant suggestions. Why not C: CLAUDE.md doesn't have override mechanics by default. Why not D: Path-specific rules are more flexible than directory-level when the pattern spans a specific path.

Exam tips

"Test files spread across codebase" --> path-specific rules with **/*.test.tsx, not directory CLAUDE.md
"Rules loading for irrelevant files" --> add paths frontmatter to scope activation
** is required for recursive subdirectory matching -- without it, only root-level files match
The core pattern: Cross-directory conventions → path-specific rules with glob patterns. Single-directory rules → directory-level CLAUDE.md.

LAB 3.4: PLAN MODE VS DIRECT EXECUTION

What the exam tests

Plan mode for complex, multi-file tasks with architectural decisions
Direct execution for simple, well-scoped, single-file changes
Explore subagent for isolating verbose discovery output from the main context
The decision rule: multiple files + architectural choices = plan. Single file + obvious fix = direct.

When to use each mode

Multiple files + architectural decision = plan mode. Single file + obvious fix = direct execution.

Signal	Mode
Multiple files affected (5+)	Plan
Architectural decision required	Plan
Multiple valid approaches exist	Plan
Single file, clear fix	Direct
Clear stack trace, obvious bug	Direct
Adding one validation check	Direct

Plan mode: think first, then act

# Enter plan mode for a complex task
> /plan

# Claude explores the codebase without making changes
# Reads files, traces dependencies, identifies approaches
# Outputs a plan: "Here's what I'd do and why"

# After reviewing the plan, switch to direct execution
> /execute
# Claude implements the approved plan

Plan mode prevents costly rework. A microservice restructuring that starts with direct execution discovers halfway through that two modules have circular dependencies. Plan mode finds this before any code changes.

Explore subagent: isolate verbose discovery

Plan mode investigation can fill the main context with raw file contents. The Explore subagent absorbs the verbose work and returns only the summary:

Main context:
  "Analyze the authentication module"
  → Spawns Explore subagent
  
Explore subagent (isolated):
  Reads 23 files
  Traces 15 import chains
  Generates 2000 tokens of discovery notes
  
Returns to main context:
  "Summary: Auth module has 3 entry points,
   depends on jwt and session libraries,
   no test coverage for token refresh."
  (50 tokens instead of 2000)

Check your understanding

Q1. You're tasked with restructuring a monolithic application into microservices. This affects 45+ files across multiple packages. Which approach?

A) Start with direct execution and refactor incrementally. B) Enter plan mode to explore the codebase, understand dependencies, and design the service boundaries before making any changes. C) Use direct execution with a detailed upfront specification. D) Start direct execution and switch to plan mode only if problems emerge.

Correct: B. Plan mode is designed for complex tasks with architectural decisions and multi-file impact. Explore before committing to changes. Why not A: Direct execution risks costly rework when you discover unexpected dependencies. Why not C: A detailed spec assumes you know the codebase structure before exploring it. Why not D: The complexity is stated upfront -- there's no need to wait for problems to use plan mode.

Q2. During a plan mode investigation, Claude has read 30 files and the main context is filling with verbose discovery output. What should you do?

A) Use /compact to compress the conversation. B) Use the Explore subagent to isolate verbose discovery and return only summaries to the main context. C) Start a new session. D) Continue -- the context will manage itself.

Correct: B. The Explore subagent isolates verbose output. Discovery happens in the subagent; only a concise summary returns to the main context. Why not A: /compact helps but doesn't prevent future verbose output. Why not C: Starting fresh loses the investigation progress. Why not D: Context overflow causes degraded responses.

Q3. A bug fix has a clear stack trace pointing to a single null check in one function. Which mode?

A) Plan mode -- always plan before fixing. B) Direct execution -- the scope is clear, the fix is obvious, and only one file is affected. C) Plan mode to investigate if there are similar bugs elsewhere. D) Use the Explore subagent to analyze the function first.

Correct: B. Simple, well-scoped fixes with clear stack traces are ideal for direct execution. Plan mode adds unnecessary overhead here. Why not A: Plan mode for a single null check is overkill. Why not C: The question asks about this specific bug, not a broader investigation. Why not D: A single function with a clear stack trace doesn't need exploration.

Exam tips

"Microservice restructuring / library migration / 45+ files" --> plan mode
"Single-file bug fix / clear stack trace" --> direct execution
"Context filling with verbose discovery" --> Explore subagent
Wrong answer trap: "Always use plan mode" -- simple fixes gain nothing from planning overhead
The core pattern: Plan for complex. Execute for simple. Explore for verbose. Never plan a single null-check fix.

What the exam tests

Concrete input/output examples as the most effective technique for consistent formatting
Test-driven iteration: define behavior with tests, iterate by sharing specific failure messages
The interview pattern: Claude asks clarifying questions before implementing in an unfamiliar domain
Interacting issues reported together vs independent issues fixed sequentially

Three refinement techniques

1. Concrete examples beat prose

Prose descriptions produce inconsistent results. Concrete examples eliminate ambiguity:

# ✗ ANTI-PATTERN: Vague prose description
"Convert the date fields to a standard format."
# Which format? ISO 8601? US? European? Claude guesses differently each time.

# ✓ CORRECT: Concrete examples
"Convert date fields as shown:
 Input: 'March 15, 2025' → Output: '2025-03-15'
 Input: '3/15/25' → Output: '2025-03-15'
 Input: '15-Mar-2025' → Output: '2025-03-15'"
# Claude sees the pattern: always ISO 8601, handles all input variants.

2. Test-driven iteration

Write tests first, then iterate by sharing failures:

# Step 1: Write the test (defines the goal)
"Write a test for getUserById that:
  - Returns user object with id, name, email
  - Throws NotFoundError if user doesn't exist
  - Validates that id is a positive integer"

# Step 2: Implement to pass the test
"Implement getUserById to pass all tests."

# Step 3: Share failures, iterate
"Test 3 fails: getUserById(0) should throw but returns null.
 Fix the validation to reject zero and negative IDs."

# Step 4: Add edge cases
"Add tests for: null input, string input, very large IDs.
 Make all tests pass."

Test failures are specific, actionable feedback -- "test 3 fails because getUserById(0) returns null instead of throwing" beats "it's not quite right."

3. The interview pattern

Unfamiliar domain? Have Claude ask questions before implementing:

# ✗ ANTI-PATTERN: Jump straight to implementation
"Build a caching layer for our API."
# Claude picks defaults that may not fit your needs.

# ✓ CORRECT: Interview first
"I need a caching layer for our API. Before implementing,
ask me 3-5 questions about requirements I may not have
considered -- things like invalidation strategy, TTL
policies, cache warming, and failure modes."

# Claude asks:
# 1. "What's your invalidation strategy? Time-based, event-based, or hybrid?"
# 2. "Should cache misses block the response or serve stale data?"
# 3. "Do you need cache warming on deploy?"
# You answer, THEN Claude implements with full context.

Interacting vs independent issues

Interacting problems (fixes affect each other) --> report all in a single message so Claude sees the interactions
Independent problems (fixes don't affect each other) --> fix sequentially, one at a time

# Interacting: both affect the same function
"Fix these together -- they interact:
 1. getUserById returns null instead of throwing NotFoundError
 2. The caller catches NotFoundError but not null"

# Independent: different functions, no interaction
"Fix this: getUserById missing null check"
# Then separately:
"Fix this: formatDate doesn't handle timezone offsets"

Check your understanding

Q1. You describe a date formatting requirement in prose, but Claude produces inconsistent formats across runs. What's the most effective fix?

A) Add more detail to the prose description. B) Provide 2-3 concrete input/output examples showing the exact transformation you want. C) Increase the model temperature for more creative solutions. D) Use a system prompt with the formatting rules.

Correct: B. Concrete examples eliminate ambiguity. Claude sees the pattern from examples and applies it consistently, even to formats not shown. Why not A: More prose still leaves room for interpretation. Why not C: Higher temperature increases variation, not consistency. Why not D: System prompt rules are still prose -- examples are more effective.

Q2. You need Claude to build a caching layer for an unfamiliar system. What's the best approach?

A) Provide a detailed specification of exactly how the cache should work. B) Use the interview pattern -- have Claude ask 3-5 questions about requirements (invalidation strategy, failure modes, cache warming) before implementing. C) Let Claude choose the best caching approach based on its training data. D) Write the caching tests first and have Claude implement to pass them.

Correct: B. The interview pattern surfaces considerations you may not have anticipated. Claude asks about invalidation, failure modes, and edge cases -- then implements with full context. Why not A: If you knew the exact spec, you'd be in a familiar domain and wouldn't need this pattern. Why not C: Claude's defaults may not match your system's constraints. Why not D: Test-driven iteration works better when you know the requirements -- the interview pattern is for when you don't.

Q3. Two bugs interact: a function returns null instead of throwing an error, and the caller catches the error but not null. How should you report them to Claude?

A) Fix them separately in two sequential messages. B) Report both in a single message so Claude can see the interaction and fix them together. C) Fix the function first, then fix the caller in a separate session. D) Describe only the caller bug and let Claude find the root cause.

Correct: B. Interacting problems should be reported together so Claude can see both sides and produce a coherent fix. Fixing one without the other may create new bugs. Why not A: Sequential fixes for interacting problems risk the first fix breaking the second. Why not C: Separate sessions lose the context of the interaction. Why not D: Withholding information forces Claude to investigate when you already know the answer.

Exam tips

"Inconsistent output format" --> concrete examples, not more prose
"Unfamiliar domain" --> interview pattern (ask questions before implementing)
"Multiple bugs that affect each other" --> single message with all issues
"Test fails" --> share the specific failure message for targeted iteration
The core pattern: Inconsistent format → examples. Unknown requirements → interview. Failing tests → share the exact error message.

LAB 3.6: CI/CD INTEGRATION

What the exam tests

The -p (--print) flag for non-interactive mode in CI pipelines
--output-format json and --json-schema for structured CI output
CLAUDE.md as the mechanism for providing project context to CI-invoked Claude Code
Session isolation: why self-review in the same session is less effective than independent review
Including prior review findings to avoid duplicate PR comments
Providing existing test files so test generation avoids duplicates
Documenting testing standards in CLAUDE.md for CI

Non-interactive mode: the -p flag

Without -p, Claude Code waits for interactive input. In CI, the job hangs indefinitely.

# ✗ ANTI-PATTERN: Hangs in CI (waits for interactive input)
claude "Review this PR"

# ✓ CORRECT: Non-interactive mode
claude -p "Review this PR for security issues and missing tests"

Structured output for CI

# Plain text output -- can't be parsed by CI tools
claude -p "Review this diff"

# JSON output -- machine-parseable
claude -p "Review this diff" --output-format json

# Schema-enforced JSON -- guaranteed structure
claude -p "Review this diff" \
  --output-format json \
  --json-schema '{
    "type": "object",
    "properties": {
      "issues": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "file": {"type": "string"},
            "line": {"type": "integer"},
            "severity": {"type": "string",
                         "enum": ["critical","warning","info"]},
            "description": {"type": "string"}
          },
          "required": ["file", "severity", "description"]
        }
      }
    }
  }'

Session isolation: independent review

Same-session self-review has confirmation bias. The generator retains its reasoning context and defends its own decisions instead of questioning them:

# ✗ ANTI-PATTERN: Same-session self-review
claude -p "Write a new auth module"           # Session A
claude --resume -p "Now review your code"     # SAME session
# Retains generation context = confirmation bias

# ✓ CORRECT: Independent review
claude -p "Write a new auth module"           # Session A
claude -p "Review this diff: $(git diff)"     # Session B (fresh)
# No prior context = unbiased review

CLAUDE.md for CI context

CI-invoked Claude Code loads CLAUDE.md the same way interactive sessions do. Put testing standards and review criteria there so CI runs have project context:

# .claude/CLAUDE.md -- loaded by CI runs too

## Testing standards for CI
- Use vitest with coverage threshold 80%
- Fixtures are in tests/fixtures/ -- reference them, don't create new ones
- Mock external APIs but use real database connections
- Every new public function needs at least one test

## Review criteria for CI
- Flag functions exceeding 50 lines
- Flag async operations without try-catch
- Flag hardcoded credentials (patterns: sk-, pk-, key-)
- Do NOT flag: local naming conventions, import ordering

Prior review findings: avoid duplicate comments

When re-running reviews after new commits, include previous findings so Claude reports only new or still-unaddressed issues:

claude -p "Review this PR. Previous review found these issues:
$(cat previous-findings.json)
Report only NEW issues or issues that are still unaddressed.
Do not duplicate previously reported findings."

Check your understanding

Q1. You add a Claude Code step to your GitHub Actions workflow for automated PR summaries. The workflow runs for 6 hours until the runner times out. The logs show Claude Code printed a prompt and is waiting. What is the fix?

A) Increase the GitHub Actions timeout to 12 hours. B) Use the -p (--print) flag so Claude Code runs in non-interactive mode, processes the input, and exits. C) Pipe the PR diff into Claude Code's stdin. D) Add --no-prompt to suppress the input prompt.

Correct: B. The -p flag runs Claude Code non-interactively -- it takes the prompt as an argument, produces output, and exits. Without it, Claude Code waits for interactive terminal input that never comes in CI. Why not A: A longer timeout doesn't fix the root cause -- Claude Code will still wait indefinitely. Why not C: Piping stdin doesn't replace the need for non-interactive mode. Why not D: --no-prompt is not a real Claude Code flag.

Q2. Your CI pipeline needs Claude Code to output a list of security findings in a format your SIEM dashboard can ingest -- specifically JSON with severity, file, line, and description fields. How should you configure this?

A) Grep the natural language output for keywords like "critical" and "warning." B) Use --output-format json with a --json-schema that defines the exact field structure your SIEM expects. C) Ask Claude Code to write the findings to a findings.json file during the review. D) Post-process the output with a Python script that converts prose to JSON.

Correct: B. --output-format json with --json-schema guarantees the output conforms to your schema -- structured, machine-parseable, and consistent across runs. Why not A: Keyword matching on natural language is fragile and misses findings phrased differently. Why not C: File-based output adds filesystem complexity when structured stdout is available. Why not D: Post-processing prose is brittle and breaks when Claude's phrasing changes.

Q3. A CI step uses Claude Code to generate database migration scripts, then immediately reviews them for safety in the same session. The review passes every time. But when a separate CI step reviews the same migrations in an isolated session, it flags 2-3 issues per run. Why does the same-session review miss these?

A) The separate CI step uses stricter review criteria. B) The generation session retains the reasoning that produced the migration -- Claude recalls why it made each decision and is biased toward confirming its own work. A separate session evaluates the SQL on its merits. C) The issues are introduced during the handoff between CI steps. D) The separate session has access to more context about the database schema.

Correct: B. Same-session self-review inherits generation context. Claude remembers its rationale for each migration choice and is biased toward agreement. A separate session has no generation context and evaluates the code independently. Why not A: Both steps can use identical review prompts. Why not C: The migration file is the same -- nothing changes between steps. Why not D: Both sessions can be given the same schema context.

Exam tips

"CI job hangs" --> add -p flag for non-interactive mode
"Need machine-parseable output" --> --output-format json + --json-schema
"Self-review finds nothing, independent review finds bugs" --> session isolation (separate sessions)
"Duplicate PR comments on re-review" --> include prior findings in the review prompt
Wrong answer trap: "Parse natural language with regex" -- use structured output flags instead
The core pattern: -p for CI. --json-schema for structure. Separate sessions for review. Prior findings to prevent duplicate comments.

MODULE 4: PROMPT ENGINEERING & STRUCTURED OUTPUT

Domain 4 — 20% of exam Task Statements 4.1 – 4.6

Key Terms for Module 4

Explicit criteria: Specific, measurable review rules (e.g., "flag functions over 50 lines") vs vague instructions ("flag long functions").
False positive: A finding the system reports as an issue that isn't actually an issue. High false positive rates destroy developer trust.
Few-shot prompting: Providing 2-4 examples in the prompt to demonstrate the expected output format and reasoning pattern.
tool_use: Content block in Claude's response containing a structured tool call with JSON input matching the tool's schema.
tool_choice: API parameter controlling tool selection. "auto" = Claude chooses freely. "any" = Claude must call a tool. {"type": "tool", "name": "X"} = must call a specific tool.
Semantic validation: Checking whether extracted values are correct (line items sum to total, dates are plausible). Not caught by JSON schema validation.
Retry-with-feedback: Appending specific validation errors to the prompt and asking Claude to re-extract, giving it targeted guidance for self-correction.
detected_pattern: Field added to structured findings that records what code construct triggered the finding, enabling analysis of false positive patterns.
Message Batches API: 50% cost savings, 24-hour processing window, no multi-turn tool calling. For non-blocking workloads.
custom_id: Identifier in batch requests for correlating request/response pairs and resubmitting failures.
Self-review limitation: A model retains reasoning context from generation, making same-session review less effective than independent review.

LAB 4.1: EXPLICIT CRITERIA TO REDUCE FALSE POSITIVES

What the exam tests

Explicit criteria over vague instructions (e.g., "flag functions over 50 lines" vs "flag long functions")
General instructions like "be conservative" fail to improve precision
False positive rates destroy developer trust -- even in accurate categories
Writing specific review criteria (what to report vs skip)
Disabling high false-positive categories to restore trust
Defining severity criteria with concrete code examples

Explicit criteria beat vague instructions

Vague prompts produce inconsistent results. "Review this code for quality issues" gives Claude no boundaries -- it flags everything from missing semicolons to architectural concerns, and the results change between runs.

Explicit criteria define exactly what to report, what to skip, and how to classify severity:

# ✗ ANTI-PATTERN: Vague criteria
vague_prompt = """Review this code for quality issues.
Be thorough and conservative."""
# "Thorough" and "conservative" mean different things each run.
# Result: 30 findings, 20 are false positives.

# ✓ CORRECT: Explicit criteria
explicit_prompt = """Review this code. Flag ONLY:
1. Functions exceeding 50 lines
2. Async operations missing try-catch
3. Hardcoded strings matching: sk-, pk-, key-
4. SQL queries using string concatenation
5. Public functions missing JSDoc

DO NOT flag:
- Minor style issues (spacing, naming preferences)
- Local patterns the team has established
- Import ordering

Severity:
- critical: items 3, 4 (security)
- warning: items 1, 2 (reliability)
- info: item 5 (documentation)"""

The false positive trust spiral

When 20 of 30 findings are false positives, developers stop reading ALL findings -- including the 10 real ones. High false positive rates in one category undermine trust in every category.

Fix: Temporarily disable the high-FP category, improve the criteria offline, then re-enable when precision improves.

Check your understanding

Q1. A code review tool generates 30 findings per PR. Developers report that 20 are false positives, mostly in the "code style" category. They've started ignoring all findings. What should you do first?

A) Add "be more conservative" to the prompt. B) Increase the confidence threshold to filter low-confidence findings. C) Disable the "code style" category to restore trust, improve its criteria offline, then re-enable when false positives are reduced. D) Reduce the number of findings to 10 per PR.

Correct: C. Disabling the high-FP category stops the trust erosion immediately. Improve the criteria offline, then re-enable. Why not A: "Be conservative" is vague and doesn't improve precision measurably. Why not B: Confidence thresholds rely on uncalibrated model confidence. Why not D: Limiting finding count may hide real issues -- the problem is precision, not volume.

Q2. Two developers run the same review prompt on the same PR. Developer A gets 12 findings, Developer B gets 8 findings with different severity ratings. What is the root cause?

A) The model is non-deterministic and cannot produce consistent results. B) The review criteria are vague, allowing Claude to interpret them differently each run. Explicit, measurable criteria would produce consistent results. C) Developer A used a different model version. D) The PR was modified between reviews.

Correct: B. Vague criteria lead to inconsistent interpretation. Explicit criteria ("functions over 50 lines" instead of "long functions") produce the same findings regardless of who runs them. Why not A: With explicit criteria, Claude produces highly consistent results. Why not C/D: The question states same prompt, same PR.

Q3. Your review system flags every instance of catch(e) { log(e) } as "insufficient error handling." But the team uses this pattern intentionally in non-critical background tasks. What should you do?

A) Remove the error handling rule entirely. B) Add an explicit exception: "Do NOT flag catch-and-log in files under background-tasks/ or functions tagged @non-critical." C) Lower the severity from "warning" to "info." D) Add "be conservative" to the prompt.

Correct: B. Add explicit boundaries that define where the rule applies and where it doesn't. This preserves the rule for critical code while respecting the team's intentional patterns. Why not A: Removing the rule loses real findings in critical code. Why not C: Lower severity doesn't eliminate the false positive -- developers still have to dismiss it. Why not D: "Be conservative" is vague and doesn't specifically address this pattern.

Exam tips

"Inconsistent review results" --> explicit criteria, not vague instructions
"Developers ignoring findings" --> false positive trust spiral -- disable the high-FP category
Wrong answer trap: "Be more conservative" -- vague instructions do not improve precision
Wrong answer trap: "Increase confidence threshold" -- model confidence is not calibrated
The core pattern: Vague criteria ("be thorough") produce inconsistent results. Explicit criteria ("functions over 50 lines") produce consistent results. Always specify what to report, what to skip, and how to classify severity.

LAB 4.2: FEW-SHOT PROMPTING

What the exam tests

Few-shot examples as the most effective technique for consistent, actionable output
Examples demonstrate ambiguous-case handling and reasoning
Few-shot enables generalization to novel patterns (not just matching examples)
Reducing hallucination in extraction tasks (handling missing fields, varied structures)
Creating 2-4 targeted examples for ambiguous scenarios
Including at least one edge case / negative example

Why few-shot works

Detailed instructions tell Claude what to do. Few-shot examples show Claude how to do it -- including edge cases, output format, and reasoning for ambiguous inputs.

Optimal range: 2-4 examples. Fewer than 2 doesn't establish a pattern. More than 6 wastes tokens with diminishing returns.

How it works in code

few_shot_prompt = """Extract customer support ticket data.

Example 1 (standard ticket):
Input: "Hi, I'm Jane Smith (jane@co.com). Order #ORD-555
is missing two items."
Output: {"customer": "Jane Smith", "email": "jane@co.com",
         "order_id": "ORD-555", "issue": "missing items",
         "priority": "medium"}

Example 2 (messy ticket with abbreviations):
Input: "bob j here acct bob@mail.com - wheres my stuff??
order 777 said 2 day shipping its been a WEEK"
Output: {"customer": "Bob J", "email": "bob@mail.com",
         "order_id": "777", "issue": "late delivery",
         "priority": "high"}

Example 3 (missing data):
Input: "My package never arrived. Order number is 12345."
Output: {"customer": null, "email": null,
         "order_id": "12345", "issue": "missing package",
         "priority": "medium"}

Example 4 (not a support ticket -- edge case):
Input: "What are your business hours?"
Output: {"customer": null, "email": null,
         "order_id": null, "issue": null,
         "priority": null}

Now extract from this ticket:
Input: "{new_ticket}"
"""

Key details:

Example 2 handles messy, informal input
Example 3 returns null for missing fields instead of fabricating values
Example 4 is a negative example -- shows what NOT to extract

Few-shot enables generalization

The model doesn't just match the examples -- it learns the pattern and applies it to novel inputs. Three examples covering clean, messy, and missing data teach Claude to handle any variation, not just those specific formats.

Check your understanding

Q1. Your extraction system produces inconsistent output formats -- sometimes JSON, sometimes markdown, sometimes plain text. Detailed format instructions don't help. What is the most effective fix?

A) Add stricter format instructions in the system prompt. B) Provide 2-4 few-shot examples showing the exact desired JSON output format for different input types. C) Use regex to parse whatever format Claude returns. D) Lower the temperature to reduce variation.

Correct: B. Few-shot examples establish the output format by demonstration. Claude sees the JSON structure in every example and replicates it consistently. Why not A: Instructions tell what format to use; examples show it -- showing is more reliable. Why not C: Post-processing can't fix inconsistent structure reliably. Why not D: Temperature affects word choice, not structural format decisions.

Q2. Your extraction prompt has 8 few-shot examples but Claude's output quality hasn't improved over 4 examples. What should you do?

A) Add more examples until quality improves. B) Stop at 4 examples -- 2-4 is the optimal range. Beyond that, returns diminish and token costs increase. Focus on making existing examples more diverse instead. C) Replace examples with more detailed instructions. D) Use a larger model.

Correct: B. 2-4 examples is optimal. Additional examples add tokens without proportional quality gains. Better to make existing examples diverse (cover edge cases, ambiguous inputs, negative cases). Why not A: More examples hit diminishing returns and waste tokens. Why not C: Examples are more effective than instructions for format consistency. Why not D: Model size doesn't fix example count issues.

Q3. An extraction system correctly identifies sentiment in standard reviews but fabricates data for reviews that don't contain the expected fields (e.g., reviews without a product name). How can few-shot examples fix this?

A) Add examples that always include all fields. B) Add a few-shot example showing a review with missing fields, where the output returns null for those fields instead of fabricating values. C) Add a system prompt instruction: "Never fabricate data." D) Use tool_use to enforce the schema.

Correct: B. A few-shot example demonstrating null for missing fields teaches Claude the pattern: "if the data isn't there, return null." This prevents fabrication more reliably than instructions. Why not A: Examples with all fields don't teach how to handle missing data. Why not C: Instructions help but examples are more effective for this specific behavior. Why not D: tool_use enforces structure but doesn't prevent fabrication of values within valid fields.

Exam tips

"Inconsistent output format" --> few-shot examples (2-4), not more instructions
"System fabricates values for missing fields" --> add a negative example returning null
Optimal range: 2-4 examples. More than 6 is always wrong on the exam.
Include at least one edge case (missing data, ambiguous input, negative case)
The core pattern: Examples teach format and judgment. Instructions teach rules. 2-4 examples with at least one edge case.

LAB 4.3: STRUCTURED OUTPUT WITH TOOL USE AND JSON SCHEMAS

What the exam tests

tool_use with JSON schemas guarantees schema-compliant output (structure, not semantics)
tool_choice: "auto" vs "any" vs forced selection
Schemas eliminate syntax errors but NOT semantic errors
Schema design: required vs optional fields, enum with "other" + detail
Nullable fields prevent fabrication when data is missing
Format normalization rules alongside schemas

tool_use guarantees structure, not semantics

When you define a tool with a JSON schema, Claude's output is guaranteed to match the schema -- correct types, required fields present, valid enum values. But the values might be wrong.

import anthropic

client = anthropic.Anthropic()

extract_tool = {
    "name": "extract_invoice",
    "description": "Extract structured data from an invoice",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor_name": {"type": "string"},
            "total": {"type": "number"},
            "date": {"type": "string",
                     "description": "ISO 8601 format"},
            "category": {
                "type": "string",
                "enum": ["standard", "credit_note",
                         "proforma", "other"]
            },
            "category_detail": {
                "type": ["string", "null"],         # ← Nullable
                "description": "Required if category is 'other'"
            },
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "amount": {"type": "number"}
                    }
                }
            }
        },
        "required": ["vendor_name", "total", "date",
                      "category"]
    }
}

response = client.messages.create(
    model="claude-sonnet-4-5",
    tools=[extract_tool],
    tool_choice={"type": "tool",
                 "name": "extract_invoice"},  # ← Forced
    messages=[{"role": "user",
               "content": f"Extract: {invoice_text}"}]
)
# Structure guaranteed. Values need semantic validation.

Schema guarantees:

vendor_name is a string (syntax) -- but might be the wrong vendor (semantic)
total is a number (syntax) -- but might not match the actual total (semantic)
category is a valid enum value (syntax) -- but might be miscategorized (semantic)

tool_choice modes

# "auto" -- Claude may return text instead of calling a tool
# Risk: pipeline expecting JSON gets text
tool_choice={"type": "auto"}

# "any" -- Claude MUST call a tool, but chooses which
# Use when: multiple extraction schemas, unknown doc type
tool_choice={"type": "any"}

# Forced -- Claude MUST call this specific tool
# Use when: you know exactly which extraction to run
tool_choice={"type": "tool", "name": "extract_invoice"}

Schema design: nullable fields and enum extensibility

# ✗ ANTI-PATTERN: Required field forces fabrication
"vendor_phone": {"type": "string"}  # required
# If the invoice doesn't show a phone number,
# Claude fabricates one to satisfy the schema.

# ✓ CORRECT: Nullable field allows honest "missing"
"vendor_phone": {"type": ["string", "null"]}  # nullable
# Claude returns null when the data isn't present.

# ✓ CORRECT: Enum with "other" + detail field
"category": {
    "type": "string",
    "enum": ["standard", "credit_note", "proforma",
             "unclear", "other"]  # ← "unclear" and "other"
}
"category_detail": {
    "type": ["string", "null"],
    "description": "Explain if category is 'other' or 'unclear'"
}

Check your understanding

Q1. Your extraction system uses tool_use with a JSON schema. The output always has correct field types and valid enum values, but the vendor_name field is wrong in 8% of cases. What type of error is this?

A) Schema syntax error -- fix by tightening the schema. B) API configuration error -- fix by changing tool_choice. C) Semantic error -- the structure is correct but the values are wrong. Add semantic validation (e.g., cross-reference vendor name with known vendors). D) Few-shot example error -- add more examples.

Correct: C. tool_use eliminates syntax errors (wrong types, missing fields) but cannot prevent semantic errors (wrong values in correct fields). Semantic validation is a separate layer. Why not A: The schema is working -- it's enforcing correct types. Why not B: tool_choice controls which tool runs, not value accuracy. Why not D: Few-shot examples help but semantic validation is the direct fix.

Q2. Your contract extraction schema has "termination_clause": {"type": "string"} as a required field. For contracts that don't include a termination clause, Claude invents plausible-sounding legal language. What is the fix?

A) Add a prompt instruction: "Only extract clauses that actually exist in the document." B) Make the field nullable: {"type": ["string", "null"]} so Claude can return null when no termination clause exists instead of fabricating one. C) Remove the termination_clause field from the schema entirely. D) Add a regex validator to catch fabricated legal language.

Correct: B. A required string field forces Claude to produce a value. Making it nullable lets Claude honestly signal "this clause doesn't exist in the document." Why not A: The schema requirement overrides prompt instructions -- Claude must produce a string to satisfy the schema. Why not C: Many contracts do have termination clauses -- you still want to extract them when present. Why not D: Distinguishing real from fabricated legal language via regex is impractical.

Q3. Your pipeline uses tool_choice: "auto" with multiple extraction tools. For unknown document types, Claude sometimes returns a text summary instead of calling any tool. What should you change?

A) Add prompt instructions to always use a tool. B) Set tool_choice: "any" to guarantee Claude calls a tool, letting it choose the appropriate extraction schema based on the document type. C) Remove the text response option from the API. D) Force a specific tool with tool_choice: {"type": "tool", "name": "..."}.

Correct: B. "any" guarantees Claude calls a tool but lets it choose which one -- ideal when the document type is unknown and multiple extraction schemas exist. Why not A: Instructions are probabilistic. Why not C: You can't remove the text option from "auto" -- switch to "any". Why not D: Forcing a specific tool doesn't work when you don't know which extraction schema fits.

Exam tips

"tool_use eliminates all errors" --> WRONG. It eliminates syntax errors, not semantic errors.
"Claude fabricates values" --> make fields nullable (["string", "null"])
"Pipeline gets text instead of JSON" --> switch from "auto" to "any" or forced
Enum extensibility: include "unclear" and "other" + detail field
The core pattern: tool_use guarantees structure. Semantic validation guarantees correctness. Nullable fields guarantee honesty.

LAB 4.4: VALIDATION AND RETRY LOOPS

What the exam tests

Retry-with-error-feedback: appending specific validation errors to guide correction
Limits of retry: retries don't work when information is absent from the source
Feedback loop design with detected_pattern for analyzing dismissal patterns
Semantic validation errors vs schema syntax errors
Self-correction validation: calculated_total vs stated_total

Retry with specific error feedback

When extraction fails validation, don't just say "try again." Append the specific errors so Claude knows exactly what to fix:

def extract_with_retry(document, max_retries=3):
    messages = [{"role": "user",
                 "content": f"Extract from: {document}"}]

    for attempt in range(max_retries):
        response = client.messages.create(
            model="claude-sonnet-4-5",
            tools=[extract_tool],
            tool_choice={"type": "tool",
                         "name": "extract_invoice"},
            messages=messages
        )

        data = parse_tool_response(response)
        errors = validate_semantics(data)

        if not errors:
            return data              # ← Valid, done

        # CRITICAL: Append SPECIFIC errors for retry
        messages.append({"role": "assistant",
                         "content": response.content})
        messages.append({
            "role": "user",
            "content": (
                "Validation failed. Fix these errors:\n"
                + "\n".join(f"- {e}" for e in errors)
                + "\nRe-extract with corrections."
            )
        })
        # ← Claude sees exactly what went wrong

    raise ExtractionError("Failed after retries")


def validate_semantics(data):
    """Check values, not structure."""
    errors = []
    # Self-correction: compare calculated vs stated
    line_sum = sum(item["amount"]
                   for item in data.get("line_items", []))
    if abs(line_sum - data["total"]) > 0.01:
        errors.append(
            f"Line items sum to {line_sum} but total "
            f"is {data['total']} -- mismatch")
    if data["total"] <= 0:
        errors.append(
            f"Total must be positive, got {data['total']}")
    return errors

When retries don't work

Retries are effective for format and structural errors (Claude misread the layout, picked the wrong field). Retries are ineffective when the information simply isn't in the source document:

Error type	Retryable?	Example
Format mismatch	Yes	Extracted date as "March 15" instead of "2025-03-15"
Wrong field	Yes	Put vendor name in the address field
Missing from source	No	Invoice doesn't include a purchase order number
External reference	No	"See appendix B" but appendix not provided

detected_pattern for false positive analysis

When developers dismiss findings, track what code construct triggered them:

finding = {
    "file": "auth.py",
    "line": 45,
    "severity": "warning",
    "description": "Function exceeds 50 lines",
    "detected_pattern": "function_length > 50",  # ← Track this
    "dismissed": True,
    "dismiss_reason": "Intentional -- complex auth flow"
}
# After 100 reviews, analyze:
# "function_length > 50" dismissed 73% of the time
# → This criterion needs refinement or scoping

Check your understanding

Q1. An extraction produces {"total": 150.00} but the line items sum to $175.50. What type of error is this, and what's the fix?

A) Schema error -- tighten the schema. B) Semantic error -- add validation that compares calculated_total (line item sum) to stated_total, then retry with the specific discrepancy. C) Parsing error -- use a different document parser. D) Model error -- switch to a more capable model.

Correct: B. This is a semantic error (values don't add up). Validate by computing the sum independently and flag the discrepancy. Retry with: "Line items sum to $175.50 but total says $150.00 -- re-examine the invoice." Why not A: The schema is correct -- total is a valid number. Why not C: The parser extracted the data; the values are inconsistent. Why not D: Any model can make semantic errors -- validation catches them.

Q2. An extraction fails because the invoice references "See Appendix B for pricing" but the appendix wasn't provided. Retrying 3 times produces the same failure. What should you do?

A) Increase max_retries to 10. B) Recognize that retries are ineffective when the required information is absent from the source. Return null for the pricing field and flag it as "requires external document." C) Use a more powerful model. D) Add few-shot examples of appendix references.

Correct: B. The information simply isn't in the provided document. No amount of retrying will extract data that doesn't exist. Return null and flag for human review. Why not A: More retries on absent data waste time and tokens. Why not C: No model can extract data that isn't there. Why not D: Examples don't help when the source material is incomplete.

Q3. Developers dismiss 73% of "function_length > 50" findings. How should you use this data?

A) Remove the rule entirely. B) Analyze the dismissed findings to understand why -- if the team intentionally uses long functions for complex flows, scope the rule to exclude known patterns (e.g., exclude files in core/workflows/). C) Lower the threshold to 30 lines. D) Stop tracking dismissals.

Correct: B. High dismissal rates signal that the criterion needs refinement, not removal. Analyze patterns and scope the rule. Why not A: Some long functions are genuine issues -- refine, don't remove. Why not C: A lower threshold would increase false positives. Why not D: Dismissal tracking is the feedback loop that enables improvement.

Exam tips

"Retry fails repeatedly" --> check if the information is actually in the source document
"Values don't add up" --> semantic validation (calculated vs stated totals)
"Generic retry message" --> append SPECIFIC errors to the retry prompt
"High dismissal rate" --> use detected_pattern to analyze and refine criteria
The core pattern: Retry fixes format errors. Retry cannot fix absent data -- return null and flag. Track dismissal patterns to refine criteria over time.

LAB 4.5: BATCH PROCESSING WITH MESSAGE BATCHES API

What the exam tests

Message Batches API: 50% cost savings, up to 24-hour processing window
Appropriate for non-blocking, latency-tolerant workloads
The batch API does NOT support multi-turn tool calling
custom_id fields for correlating request/response pairs
Matching API approach to workflow latency requirements
Calculating batch submission frequency based on SLA constraints
Handling failures: resubmit only failed items by custom_id
Prompt refinement on a sample set before batch processing

When to use batch vs synchronous

Workload	API	Why
Pre-merge PR review	Synchronous	Blocking -- developer is waiting
Nightly code audit	Batch	Non-blocking -- runs overnight
Weekly compliance report	Batch	No one waits for weekly reports
Real-time chat response	Synchronous	User is waiting for reply
Test generation for 1000 files	Batch	No urgency, 50% savings

Decision rule: Is someone waiting? Synchronous. No one waiting? Batch (50% off).

How it works in code

import anthropic

client = anthropic.Anthropic()

# Build batch requests with custom_id for tracking
requests = []
for doc in documents:
    requests.append({
        "custom_id": f"doc-{doc['id']}",      # ← For tracking
        "params": {
            "model": "claude-sonnet-4-5",
            "max_tokens": 1024,
            "messages": [{"role": "user",
                          "content": f"Extract: {doc['text']}"}]
        }
    })

# Submit batch
batch = client.messages.batches.create(requests=requests)
# Processes within 24 hours at 50% cost

# Poll for results
results = client.messages.batches.results(batch.id)
for result in results:
    if result.result.type == "succeeded":
        process(result.custom_id, result.result.message)
    else:
        # Resubmit ONLY failed items
        failed_ids.append(result.custom_id)

Key limitation: Batch API does NOT support multi-turn tool calling. Each request is a single message exchange -- you can't execute tools mid-request and return results. For multi-turn workflows, use the synchronous API.

Calculating batch submission frequency

If your SLA requires results within 30 hours and the batch API has a 24-hour processing window:

SLA window:           30 hours
Batch processing max: 24 hours
Buffer needed:        30 - 24 = 6 hours
Submit every:         6 hours (worst case: submitted at hour 0,
                      completed at hour 24, within 30-hour SLA)

More conservative:    Submit every 4 hours
                      (gives 6 hours of buffer)

Prompt refinement before batch

Before processing 10,000 documents in batch, test your prompt on 10-20 documents synchronously. Fix issues on the small set -- then batch the full volume. This prevents expensive re-processing.

Check your understanding

Q1. A pre-merge CI check needs Claude to review PRs before merging. A developer suggests using the Batch API for 50% cost savings. Is this appropriate?

A) Yes -- the savings justify the 24-hour window. B) No -- pre-merge checks are blocking. Developers wait for results. Use the synchronous API. C) Yes -- if you configure a shorter timeout. D) No -- but only because the Batch API doesn't support code review.

Correct: B. Pre-merge checks are blocking -- developers can't merge until the review completes. The 24-hour batch window is unacceptable for this use case. Why not A: The 50% savings don't matter if developers are blocked for 24 hours. Why not C: The batch API's processing window is up to 24 hours with no latency guarantee. Why not D: The batch API can do code review -- the issue is latency, not capability.

Q2. A batch of 500 extraction requests completes. 480 succeed and 20 fail with "context length exceeded." What should you do?

A) Resubmit the entire batch of 500. B) Use the custom_id field to identify the 20 failed documents, chunk them into smaller pieces, and resubmit only those 20. C) Increase the context window for the next batch. D) Skip the failed documents.

Correct: B. Resubmit only the failed items (identified by custom_id) with modifications (chunking for context-length errors). Don't reprocess the 480 that succeeded. Why not A: Resubmitting all 500 wastes tokens on already-successful extractions. Why not C: Context window is model-specific and can't be changed per-request. Why not D: Skipping loses data.

Q3. You plan to batch-process 10,000 documents. Your first batch of 100 has a 30% error rate due to prompt issues. What should you do?

A) Submit the full 10,000 and fix errors after. B) Refine the prompt on a small synchronous sample (10-20 documents), fix the issues, then batch-process the full volume. C) Increase the number of retries per document. D) Switch to the synchronous API for all 10,000.

Correct: B. Prompt refinement on a small sample before batch processing maximizes first-pass success and avoids expensive re-processing. Why not A: 30% error rate on 10,000 documents = 3,000 failures to reprocess. Why not C: Retries fix transient errors, not prompt issues. Why not D: Synchronous for 10,000 documents is 2x the cost with no benefit.

Exam tips

"Someone is waiting" --> synchronous API. "No one waiting" --> batch API (50% savings)
"Batch request fails" --> resubmit only failed items by custom_id
Wrong answer trap: "Use batch for pre-merge checks" -- blocking workflows need synchronous
Batch limitation: no multi-turn tool calling within a single request
The core pattern: Is someone waiting for the result? Synchronous. No one waiting? Batch. Always test on 10-20 documents before batching thousands.

LAB 4.6: MULTI-INSTANCE AND MULTI-PASS REVIEW

What the exam tests

Self-review limitations: retains reasoning context from generation
Independent instances (no prior context) catch more issues than self-review
Multi-pass: per-file local analysis + cross-file integration passes
Confidence reporting alongside findings for calibrated review routing

Why self-review fails

When Claude generates code and then reviews it in the same session, it retains the reasoning context from generation. It defends its own decisions instead of questioning them:

# ✗ ANTI-PATTERN: Same-session self-review
session = create_session()
session.run("Write an auth module")          # Generates code
session.run("Now review that code for bugs") # Same session
# Claude remembers WHY it made each decision.
# Result: "No issues found" (because it agrees with itself)

# ✓ CORRECT: Independent review instance
session_a = create_session()
session_a.run("Write an auth module")        # Session A generates

session_b = create_session()                 # Fresh session
session_b.run(f"Review this code:\n{code}")  # Session B reviews
# No generation context. Reviews the code on its own merits.
# Result: Finds 3 issues the generator was blind to.

Multi-pass review: per-file + cross-file

Large PRs reviewed in a single pass suffer from attention dilution -- files in the middle get less attention. Split into two passes:

Pass 1 -- Per-file local analysis: Each file reviewed independently. Catches: unused imports, missing null checks, style violations.

Pass 2 -- Cross-file integration analysis: Review how files interact. Catches: broken interfaces, type mismatches across modules, inconsistent error handling patterns.

# Pass 1: Per-file (each file gets 100% attention)
file_findings = []
for f in changed_files:
    findings = review_file(f)       # Focused on one file
    file_findings.extend(findings)

# Pass 2: Cross-file (focus on interactions)
integration_findings = review_integration(
    changed_files, file_findings)    # How do files interact?

Confidence reporting for review routing

Each finding can include a confidence score for routing decisions:

finding = {
    "file": "auth.py",
    "issue": "Missing rate limiting on login endpoint",
    "severity": "critical",
    "confidence": 0.95     # ← High: auto-report
}

finding = {
    "file": "utils.py",
    "issue": "Possible memory leak in connection pool",
    "severity": "warning",
    "confidence": 0.55     # ← Low: route to human review
}

# Routing thresholds:
# confidence > 0.85  → auto-report as finding
# confidence 0.5-0.85 → route to human reviewer
# confidence < 0.5   → suppress (likely false positive)

Important: These confidence scores are calibrated through labeled validation sets (callback to Lab 5.5), not raw model self-assessment.

Check your understanding

Q1. A team uses two review architectures: System A has Claude write API documentation and review it in the same session. System B has one Claude instance write the docs and a separate instance review them. System A consistently rates its own docs as "excellent" while System B's reviewer catches missing error codes, incorrect parameter types, and outdated examples. What architectural principle explains the difference?

A) System B's reviewer has access to more documentation sources. B) System A retains the author's reasoning context -- the reviewer "remembers" why it wrote each section and confirms its own choices instead of questioning them. System B's reviewer evaluates the docs independently. C) System A uses a less capable model for review. D) System B's reviewer applies stricter review criteria.

Correct: B. The author and reviewer sharing a session means the reviewer inherits the author's reasoning. It confirms its own decisions rather than evaluating the output independently. Separate instances eliminate this confirmation bias. Why not A: Both systems can access the same sources. Why not C: Both can use the same model -- the difference is session isolation, not model capability. Why not D: Both can use identical review criteria.

Q2. A security audit reviews 20 API endpoints in a single Claude pass. The audit produces thorough findings for the authentication and payment endpoints (reviewed first and last) but gives superficial "looks fine" assessments for the 12 middleware endpoints in between. What should you change?

A) Use a model with a larger context window for security audits. B) Run the same audit pass three times and aggregate the findings. C) Audit each endpoint independently (per-endpoint pass), then run a cross-endpoint pass to check for systemic patterns like inconsistent auth middleware or missing rate limiting across endpoints. D) Prioritize only the endpoints with the most code changes.

Correct: C. Per-endpoint analysis gives each endpoint the model's full attention. The cross-endpoint pass catches systemic issues that span multiple endpoints. Why not A: A larger context window doesn't fix the attention distribution problem -- middle items still receive less focus. Why not B: Repeating the same single-pass approach may reproduce the same blind spots. Why not D: Security vulnerabilities can exist in any endpoint regardless of change volume.

Q3. Your multi-pass review generates findings with confidence scores. A finding has confidence 0.55 -- not high enough to auto-report but not low enough to suppress. How should you route it?

A) Auto-report it since any finding is worth reporting. B) Suppress it since it's probably a false positive. C) Route it to a human reviewer for assessment, prioritizing limited reviewer capacity on uncertain findings. D) Retry the review to get a higher confidence score.

Correct: C. Mid-confidence findings are the ideal candidates for human review -- they're uncertain enough that automated decisions may be wrong. Route to humans who can apply judgment. Why not A: Auto-reporting uncertain findings may increase false positives. Why not B: Suppressing at 0.55 may miss real issues. Why not D: Retrying doesn't reliably change confidence -- the uncertainty is inherent.

Exam tips

"Self-review finds nothing" --> use independent review instance (fresh session, no generation context)
"Review misses files in the middle" --> per-file + cross-file passes
Wrong answer trap: "Run the review again" -- repetition does not fix attention dilution
Confidence routing: high → auto-report, medium → human review, low → suppress
The core pattern: Same-session review has confirmation bias. Use a separate session. Large reviews → per-file local pass + cross-file integration pass.

MODULE 5: CONTEXT MANAGEMENT & RELIABILITY

Domain 5 — 15% of exam Task Statements 5.1 – 5.6

Key Terms for Module 5

Progressive summarization: Repeatedly condensing conversation history, losing specific numbers, dates, and customer details each round.
Case facts block: Structured, immutable reference section containing critical transactional data (amounts, dates, IDs) that persists unchanged across conversation turns.
Lost-in-the-middle effect: Models reliably process information at the beginning and end of long inputs but may miss findings from middle sections.
Escalation triggers: Valid reasons to hand off to a human: explicit customer request, policy gap, capability limit, business threshold exceeded.
Structured error context: Error responses that include failure type, what was attempted, partial results, and alternatives -- enabling intelligent recovery by the coordinator.
Access failure vs empty result: A search that couldn't execute (isError: true) vs a search that executed and found nothing (isError: false). Conflating these misleads the agent.
Context degradation: Response quality decreasing as conversations lengthen -- models start giving generic answers instead of referencing specific findings.
Scratchpad files: External files for persisting key findings across context boundaries and session resets.
Stratified random sampling: Sampling N items per category (document type, field) rather than random sampling from the full population. Ensures rare categories are represented.
Field-level accuracy: Measuring extraction accuracy per individual field (vendor_name, total_amount) rather than per document. Reveals which fields are weak.
Claim-source mapping: Structured data linking every extracted claim to its source document, URL, date, and relevant excerpt -- preserving provenance through synthesis.
Temporal metadata: Publication or collection dates attached to findings, preventing temporal differences from being misinterpreted as contradictions.

LAB 5.1: CONTEXT MANAGEMENT ACROSS LONG INTERACTIONS

What the exam tests

Progressive summarization risks: condensing loses numbers, dates, customer expectations
The lost-in-the-middle effect: beginning and end get more attention than middle sections
Tool results accumulate tokens disproportionately (40+ fields when only 5 are relevant)
Passing complete conversation history in subsequent API requests
Extracting transactional facts into persistent case facts blocks
Trimming verbose tool outputs to relevant fields
Placing key findings at beginning of inputs, organizing with section headers

Progressive summarization destroys critical details

Summarization is lossy. Each round removes specifics:

Turn 1: "Customer Jane Smith (CUST-001) was charged
$149.99 for order ORD-555 but expected $99.99 under
promotion SUMMER2026. Overcharge: $50.00."

Turn 5 (after summarization): "Customer was overcharged
on a recent order due to a promotion issue."

Turn 10: "Customer has a billing concern."

The exact amount, order number, and promotion code are gone. No agent -- human or AI -- can process the refund from "billing concern."

Case facts block: the fix

Extract critical transactional data into a structured block that persists unchanged regardless of summarization:

## CASE FACTS (Do not summarize)
| Field | Value |
|-------|-------|
| Customer | Jane Smith (CUST-001) |
| Order | ORD-555 |
| Expected price | $99.99 (promotion SUMMER2026) |
| Charged price | $149.99 |
| Overcharge | $50.00 |
| Resolution | Refund $50.00 to original payment |

The conversation can be summarized. The case facts cannot. Include the block in every prompt, outside the summarized history.

Trimming verbose tool outputs

lookup_order returns 40+ fields. Only 5-6 matter for the current issue. Untrimmed results accumulate fast -- 40 fields x 20 calls fills context with noise:

# ✗ ANTI-PATTERN: Full tool output in context
# 40 fields x 20 tool calls = context fills with irrelevant data

# ✓ CORRECT: Trim to relevant fields
RELEVANT_FIELDS = {
    "refund": ["order_id", "total", "items", "status",
               "payment_method"],
    "shipping": ["order_id", "status", "tracking",
                 "estimated_delivery"],
}

def trim_tool_result(result, issue_type):
    fields = RELEVANT_FIELDS.get(issue_type, result.keys())
    return {k: result[k] for k in fields if k in result}

Lost-in-the-middle mitigation

Place key findings summaries at the beginning of aggregated inputs. Organize detailed results with explicit section headers:

## KEY FINDINGS (read first)
- Order ORD-555 overcharged by $50.00
- Promotion SUMMER2026 was not applied at checkout
- Customer is a 7-year gold-tier member

## DETAILED RESULTS
### Order lookup
[full order details here]
### Promotion verification
[promotion details here]
### Customer history
[account details here]

Check your understanding

Q1. A claims processing agent handles insurance disputes. After 12 turns of investigation, it refers to "the policyholder's claim" but can no longer cite the policy number, claim amount ($4,237.50), or the date of the incident. What is the root cause?

A) The model's context window is too small for insurance conversations. B) Progressive summarization condensed the specific policy number, dollar amount, and date into generic descriptions like "the claim." Extract these identifiers into an immutable case facts block at the top of the context. C) The agent needs to re-query the claims database more frequently. D) The conversation history is being truncated between API calls.

Correct: B. Each round of summarization makes details vaguer: "$4,237.50 claim on policy #HM-9912" becomes "the policyholder's claim." Case facts blocks sit outside the summarization process and preserve identifiers verbatim. Why not A: Context size affects how much fits, not whether specifics survive summarization. Why not C: Re-querying wastes tokens when the data was already retrieved -- the problem is how it's stored in context. Why not D: Truncation would cause total loss of early messages, not gradual loss of specifics.

Q2. An agent retrieves order details with 40+ fields, but only needs 5 for the current issue. After 20 tool calls, the context is nearly full. What should you do?

A) Increase the context window. B) Trim tool outputs to only the fields relevant to the current issue type before they accumulate in context. C) Summarize the tool outputs after every 5 calls. D) Limit the agent to 10 tool calls per conversation.

Correct: B. Trimming at the source prevents bloat. 5 relevant fields per call instead of 40 reduces context consumption by 87%. Why not A: Larger windows delay the problem without solving it. Why not C: Summarization loses specifics (the progressive summarization trap). Why not D: Arbitrary call limits may prevent the agent from completing the task.

Q3. A report synthesized from 8 sources covers findings from sources 1-3 and 7-8 but omits findings from sources 4-6. What is the likely cause?

A) Sources 4-6 had no relevant information. B) The lost-in-the-middle effect -- information positioned centrally in long inputs receives less attention. Place key findings from all sources at the beginning with section headers. C) The agent ran out of context space. D) Sources 4-6 were in a different format.

Correct: B. Models attend more to the beginning and end of long inputs. Middle-positioned findings get overlooked. Mitigate by placing summaries at the beginning and using explicit section headers for each source. Why not A: The question implies relevant information was present but missed. Why not C: Context overflow would affect all sources, not just the middle ones. Why not D: Format differences wouldn't consistently affect middle positions.

Exam tips

"Agent loses specific details over time" --> case facts block, not better summarization
"Context filling up with tool results" --> trim to relevant fields before they accumulate
"Report misses middle sources" --> lost-in-the-middle effect, place summaries at beginning
Wrong answer trap: "Summarize more aggressively" -- summarization is the cause, not the fix
The core pattern: Summarization destroys details. Case facts blocks preserve them. Trim tool outputs at the source. Place key findings at the beginning of long inputs.

LAB 5.2: ESCALATION AND AMBIGUITY RESOLUTION

What the exam tests

Valid escalation triggers: explicit customer request, policy gaps, inability to progress
Immediate escalation on explicit request vs offering resolution for straightforward issues
Why sentiment-based and self-reported confidence are unreliable proxies
Multiple customer matches require clarification, not heuristic selection
Adding explicit escalation criteria with few-shot examples
Acknowledging frustration while offering resolution

Valid vs invalid escalation triggers

Sentiment is NEVER a valid escalation trigger on the exam. Self-reported confidence is NEVER a valid trigger. Both are always wrong answers.

Valid triggers	Invalid triggers
Customer explicitly requests a human	Negative sentiment detected
Policy gap (no rule covers this situation)	Model self-reports low confidence
Agent can't make meaningful progress	Customer uses profanity
Business threshold exceeded ($500+ refund)	Conversation is long

How it works in code

def should_escalate(context):
    """Check valid escalation triggers."""

    # VALID: Customer explicitly asked for human
    if context.customer_requested_human:
        return True, "Customer requested human agent"

    # VALID: Policy gap -- no rule covers this
    if context.policy_gap:
        return True, "No policy covers this situation"

    # VALID: Business threshold exceeded
    if context.refund_amount > 500:
        return True, f"${context.refund_amount} exceeds limit"

    # VALID: Agent stuck after multiple attempts
    if context.attempts >= 3 and not context.progress_made:
        return True, "Unable to make progress"

    # ✗ INVALID: Sentiment is NOT a valid trigger
    # if context.sentiment == "negative":
    #     return True  # WRONG -- sentiment != complexity

    # ✗ INVALID: Self-reported confidence is NOT reliable
    # if context.confidence < 0.7:
    #     return True  # WRONG -- confidence is uncalibrated

    return False, None

Handling ambiguity: clarify, don't guess

Multiple customer matches require clarification, not heuristic selection:

# ✗ ANTI-PATTERN: Heuristic selection
customers = lookup("John Smith")  # Returns 3 matches
selected = customers[0]           # Just pick the first one
# Wrong customer → wrong refund → compliance violation

# ✓ CORRECT: Ask for clarification
customers = lookup("John Smith")  # Returns 3 matches
if len(customers) > 1:
    return "I found 3 accounts for John Smith. "
           "Could you provide your email address or "
           "account number so I can find the right one?"

Acknowledging frustration vs escalating

Frustrated customer ≠ escalation trigger. If the issue is within the agent's capability, resolve it:

Customer: "This is ridiculous, I've been waiting a week!"

# ✗ WRONG: Escalate because of negative sentiment
→ Escalate to human (sentiment-based)

# ✓ CORRECT: Acknowledge frustration, offer resolution
→ "I understand your frustration with the wait. Let me
look into this right now and get it resolved."
→ Only escalate if the customer EXPLICITLY asks for a human
or if you can't actually resolve the issue.

Check your understanding

Q1. An agent achieves 55% first-contact resolution, well below the 80% target. It escalates standard damage replacements while attempting complex policy exceptions autonomously. What is the fix?

A) Have the agent self-report confidence scores and escalate below a threshold. B) Add explicit escalation criteria with few-shot examples showing when to escalate (policy gaps, customer requests) vs when to resolve autonomously (standard replacements with clear procedures). C) Deploy a sentiment analysis classifier to detect frustrated customers. D) Add a secondary model to verify escalation decisions.

Correct: B. Explicit criteria with examples teach the agent the boundary between "I can handle this" and "this needs a human." The problem is unclear decision boundaries, not missing infrastructure. Why not A: Self-reported confidence is poorly calibrated -- the agent is already incorrectly confident on hard cases. Why not C: Sentiment doesn't correlate with case complexity. Why not D: Over-engineered when prompt optimization hasn't been tried.

Q2. A customer writes "I've explained this three times already, please transfer me to someone who can help." The agent responds with "I understand your frustration. Let me look into this one more time" and continues troubleshooting. Is this correct?

A) Yes -- the agent should exhaust all resolution options before transferring. B) No -- the customer has explicitly requested a transfer. Honor the request immediately. Continuing to troubleshoot after a direct request erodes trust further. C) Yes -- but only if the agent hasn't attempted a resolution yet. D) No -- the agent should apologize more emphatically, then continue troubleshooting.

Correct: B. When a customer explicitly asks to be transferred, further troubleshooting communicates that their request was ignored. Immediate escalation is both the correct policy and the trust-preserving response. Why not A: Explicit transfer requests override the agent's assessment of remaining options. Why not C: The customer has already been through multiple attempts -- they've decided. Why not D: A better apology doesn't fix the problem of ignoring the customer's request.

Q3. A customer lookup returns 3 matching accounts for "Jane Smith." The agent selects the first result and processes a refund. What should the agent have done?

A) Selected the account with the most recent activity. B) Asked the customer for additional identifying information (email, account number, phone) to disambiguate before proceeding. C) Processed the refund for all 3 accounts. D) Escalated to a human agent immediately.

Correct: B. Multiple matches require clarification. Selecting by heuristic (first result, most active) risks acting on the wrong account. Why not A: "Most recent activity" is still a heuristic that may pick the wrong account. Why not C: Processing for all accounts is incorrect and wasteful. Why not D: Escalation is unnecessary -- the agent can resolve this by asking a question.

Exam tips

"Agent escalates easy cases, handles hard ones" --> add explicit escalation criteria with few-shot examples
"Customer requests human" --> honor immediately, no investigation first
Wrong answer trap: "Use sentiment analysis" -- sentiment does not equal complexity, always wrong on the exam
Wrong answer trap: "Self-reported confidence threshold" -- model confidence is uncalibrated
"Multiple matches" --> ask for additional identifiers, never select by heuristic
The core pattern: Explicit request → escalate immediately. Policy gap → escalate. Business threshold → escalate. Negative sentiment → acknowledge and resolve. Multiple matches → clarify.

LAB 5.3: ERROR PROPAGATION IN MULTI-AGENT SYSTEMS

What the exam tests

Structured error context (failure type, attempted query, partial results) enables coordinator recovery
Access failures (timeout) vs valid empty results (no matches) -- the critical distinction
Generic error statuses ("search unavailable") hide valuable context
Silent error suppression and full termination on single failures are both anti-patterns
Subagents should attempt local recovery before propagating
Synthesis output should include coverage annotations for gaps

Structured error context enables recovery

When a subagent fails, the coordinator needs to know: what failed, what was tried, what partial results exist, and what alternatives are available.

# ✗ ANTI-PATTERN: Generic error
return {"error": "Search failed"}
# Coordinator has nothing to work with.

# ✓ CORRECT: Structured error context
return {
    "isError": True,
    "errorContext": {
        "failure_type": "rate_limit",
        "attempted_query": "AI market size 2025",
        "partial_results": [
            {"source": "cached", "data": "..."}
        ],
        "alternatives": [
            "Try a different search provider",
            "Use cached results from last week"
        ],
        "retries_attempted": 3
    }
}
# Coordinator can decide: use partial results,
# try alternative source, or escalate.

Access failure vs empty result

This distinction is critical and heavily tested:

# ACCESS FAILURE: The search couldn't execute
# isError: True -- something went wrong
{
    "isError": True,
    "errorContext": {
        "failure_type": "timeout",
        "message": "Database unreachable after 5s"
    }
}

# EMPTY RESULT: The search executed, found nothing
# isError: False -- the search worked fine
{
    "isError": False,
    "results": [],
    "message": "No matching records found"
}

Never return isError: false with empty results for an access failure. If the database was down and you return {"results": []}, Claude tells the customer "no records found" when the truth is it couldn't search at all.

Local recovery before propagation

Subagents should handle transient errors locally (retry 2-3 times) and only propagate to the coordinator what they can't resolve:

def search_agent(query):
    for attempt in range(3):
        try:
            return search(query)         # Local retry
        except RateLimitError:
            time.sleep(2 ** attempt)     # Backoff
        except TimeoutError:
            continue

    # Exhausted retries -- propagate with context
    return {
        "isError": True,
        "errorContext": {
            "failure_type": "exhausted_retries",
            "attempted_query": query,
            "retries": 3,
            "partial_results": cached_results,
            "recommendation": "Try alternative source"
        }
    }

Coverage annotations in synthesis

When synthesis output is based on incomplete data (some sources failed), annotate which findings are well-supported vs which have gaps:

synthesis = {
    "findings": [...],
    "coverage": {
        "web_sources": "complete",
        "internal_docs": "partial -- 2 of 5 sources unavailable",
        "database": "failed -- timeout after 3 retries"
    }
}

Check your understanding

Q1. A pricing subagent returns {"error": "service unavailable"} to the coordinator after failing to reach the pricing API. The coordinator responds to the customer: "We don't have pricing information for that product." But the product does have pricing -- the API was just temporarily down. What went wrong?

A) The coordinator should cache pricing data for common products. B) The generic error string doesn't distinguish between "the pricing API is down" and "this product has no pricing." Return structured error context (failure type: "transient", what was attempted, whether cached data is available) so the coordinator can respond accurately. C) The pricing API timeout should be increased. D) The coordinator should always escalate pricing questions to a human.

Correct: B. A generic "service unavailable" gives the coordinator no basis for an accurate response. Structured context lets it distinguish an API failure (temporary, retryable) from a genuine "no pricing exists" result and respond accordingly. Why not A: Caching helps but doesn't fix the coordinator's inability to interpret the error. Why not C: A longer timeout may help but the coordinator still needs to understand what happened. Why not D: Escalating all pricing questions is overkill -- only unresolvable failures need escalation.

Q2. In a 3-source research system, one source fails completely. A developer suggests terminating the entire workflow. What is the correct approach?

A) Terminate -- incomplete data is worse than no data. B) Continue with the 2 successful sources. Include coverage annotations in the synthesis noting which source failed and what topic areas may have gaps. C) Silently omit the failed source and present results as complete. D) Retry the failed source indefinitely until it succeeds.

Correct: B. Partial results with annotated gaps are more useful than no results. The coverage annotations let the reader know which findings are well-supported and which areas may be incomplete. Why not A: Terminating discards the successful results. Why not C: Silent omission misleads the reader about coverage completeness. Why not D: Indefinite retries block the entire workflow.

Q3. A subagent encounters a database timeout. It immediately propagates the error to the coordinator without attempting any retry. What should the subagent do differently?

A) Nothing -- error propagation is correct. B) Attempt local recovery (2-3 retries with backoff) for transient errors, propagating to the coordinator only after local retries are exhausted, with structured context about what was attempted. C) Return empty results instead of an error. D) Increase the timeout limit.

Correct: B. Subagents should handle transient failures locally before burdening the coordinator. Propagate only unresolvable errors with full context (retries attempted, partial results, alternatives). Why not A: Immediate propagation forces the coordinator to handle a recoverable error. Why not C: Empty results for a timeout is the worst anti-pattern -- it misleads the agent. Why not D: Increasing timeout doesn't guarantee success and doesn't address the retry strategy.

Exam tips

"Generic error message" --> return structured error context (failure type, attempted query, partial results, alternatives)
"Search returned empty" --> is it a genuine empty result or an access failure? Check isError.
"One source fails, terminate everything" --> WRONG. Continue with partial results + coverage annotations.
Wrong answer trap: "Return empty results for failed searches" -- silent suppression is the worst anti-pattern
The core pattern: Structured errors enable recovery. Local retry before propagation. Partial results with coverage annotations beat full termination.

LAB 5.4: CONTEXT MANAGEMENT IN LARGE CODEBASE EXPLORATION

What the exam tests

Context degradation in extended sessions (generic answers instead of specific findings)
Scratchpad files for persisting findings across context boundaries
Subagent delegation for isolating verbose exploration output
Structured state persistence for crash recovery (manifest files)
Using /compact to reduce context during extended sessions

Context degradation

After extended codebase exploration, Claude shifts from "the PaymentHandler class at line 45 of billing.py uses the Strategy pattern" to "this module typically uses handler patterns." Specific findings get buried under newer tool outputs as context fills.

Scratchpad files: external persistence

Scratchpad files survive context compression. Write findings to a file as you discover them:

# As you explore, persist findings to a scratchpad
echo "## Key Findings
- PaymentHandler (billing.py:45) uses Strategy pattern
- 3 untested edge cases in refund_flow.py
- auth middleware depends on deprecated jwt library
" > progress.md

# After /compact or session restart, re-read the scratchpad
cat progress.md
# Claude sees the findings without having to re-discover them

Subagent delegation: isolate verbose output

Reading 50 files in the coordinator's context buries key findings under raw file contents. Delegate exploration to subagents:

# ✗ ANTI-PATTERN: Coordinator reads 50 files directly
for file in all_files:
    Read(file)      # 50 files in coordinator context
# Context full of file contents, key findings buried

# ✓ CORRECT: Subagent explores, returns summary
coordinator.run("""
Delegate to explore subagent:
'Find all test files and identify gaps in coverage.'
The subagent reads the files (verbose).
You receive only: 'Found 12 test files, 3 modules
have zero coverage: billing, auth, notifications.'
""")
# 50 files stay in subagent context.
# Coordinator sees 2-line summary.

Crash recovery with state manifests

Long-running multi-agent analysis needs state persistence. Each agent exports its state; on crash recovery, the coordinator loads the manifest and resumes from the incomplete phase:

# Each agent writes state on completion
agent_state = {
    "agent": "search",
    "status": "complete",
    "findings": [...],
    "files_analyzed": ["billing.py", "auth.py"],
    "timestamp": "2025-04-15T10:30:00Z"
}
write_file("state/search_agent.json", agent_state)

# Coordinator writes manifest
manifest = {
    "agents_completed": ["search", "analysis"],
    "agents_pending": ["synthesis"],
    "overall_progress": "2/3 phases complete"
}
write_file("state/manifest.json", manifest)

# On crash recovery:
manifest = read_file("state/manifest.json")
# Resume from synthesis phase, skip search and analysis

/compact for context management

/compact compresses conversation history while preserving the essential thread. Combine with scratchpad files: write key findings to the scratchpad before compacting, re-read the scratchpad after. The findings survive because they exist outside the conversation.

Check your understanding

Q1. After reading 80+ files across a monorepo, Claude's responses shift from citing specific function names and line numbers to generic statements like "the service layer follows standard patterns." You haven't changed topics -- Claude is answering the same types of questions with less specificity. What should you do?

A) Start over with a new session. B) Write key findings (function names, file paths, architectural decisions) to a scratchpad file, run /compact to reclaim context, then re-read the scratchpad to restore the critical details. C) Switch to a model with a larger context window. D) Re-read every file you've already examined.

Correct: B. A scratchpad file externalizes your findings to disk where they persist across context compression. After /compact clears the verbose file-reading output, re-reading the concise scratchpad restores the important details. Why not A: Starting over discards everything you've learned. Why not C: Larger context delays degradation but doesn't prevent it -- the same problem occurs later. Why not D: Re-reading 80 files adds even more content to already-strained context.

Q2. A codebase analysis requires reading 200 files. How should you structure the exploration?

A) Read all 200 files in the main agent's context. B) Spawn subagents to investigate specific questions (e.g., "find test files," "trace refund dependencies"). Each subagent reads the relevant files and returns a concise summary to the coordinator. C) Read the 20 most important files and skip the rest. D) Use a larger model to handle more files.

Correct: B. Subagent delegation keeps verbose file contents out of the coordinator's context. Each subagent handles a focused question and returns a summary. Why not A: 200 files in one context causes degradation. Why not C: Skipping files may miss critical dependencies. Why not D: Model size doesn't solve context management.

Q3. A multi-agent codebase analysis crashes after completing 2 of 3 phases. No state was saved. What should have been done?

A) Run the entire analysis again. B) Each agent should export structured state to a manifest file on completion. On crash recovery, the coordinator loads the manifest and resumes from the incomplete phase. C) Use longer session timeouts. D) Run all phases in parallel to avoid crashes.

Correct: B. Structured state persistence (manifests) enables crash recovery. The coordinator knows which phases completed and which need re-running. Why not A: Re-running completed phases wastes time and tokens. Why not C: Longer timeouts don't prevent crashes. Why not D: Parallel execution doesn't address state persistence.

Exam tips

"Generic answers after long session" --> context degradation. Scratchpad + /compact.
"200 files to explore" --> subagent delegation (verbose output stays isolated)
"Crash recovery" --> structured state manifests exported by each agent
Wrong answer trap: "Re-read all files" -- adds more content to already-full context
The core pattern: Generic answers = context degradation. Fix: scratchpad for persistence, subagents for isolation, manifests for crash recovery.

LAB 5.5: HUMAN REVIEW WORKFLOWS AND CONFIDENCE CALIBRATION

What the exam tests

Aggregate accuracy metrics (e.g., 97% overall) may mask poor performance on specific document types or fields
Stratified random sampling for measuring error rates in high-confidence extractions
Field-level confidence scores calibrated using labeled validation sets
Validating accuracy by document type and field before automating
Implementing stratified sampling for ongoing measurement
Analyzing accuracy by document type and field across all segments
Calibrating review thresholds using labeled validation sets
Routing low-confidence or ambiguous extractions to human review

Aggregate accuracy hides per-type failures

94% overall accuracy might mean:

99% on standard invoices (80% of volume)
88% on purchase orders (12% of volume)
62% on handwritten receipts (8% of volume)

The weighted average looks fine. One in three handwritten receipts has errors flowing into your accounting system.

Aggregate accuracy is misleading. Always break down by document type AND by field before automating.

How to validate

def validate_accuracy(results, ground_truth):
    """Compute accuracy by document type and field."""

    # Per-document-type accuracy
    by_type = {}
    for doc in results:
        dt = doc["doc_type"]
        if dt not in by_type:
            by_type[dt] = {"correct": 0, "total": 0}
        by_type[dt]["total"] += 1
        if doc["extracted"] == ground_truth[doc["id"]]:
            by_type[dt]["correct"] += 1

    print("ACCURACY BY DOCUMENT TYPE:")
    for dt, stats in by_type.items():
        acc = stats["correct"] / stats["total"]
        flag = " <-- FAILING" if acc < 0.80 else ""
        print(f"  {dt}: {acc:.0%}{flag}")

    # Per-field accuracy
    by_field = {}
    for doc in results:
        for field, value in doc["fields"].items():
            if field not in by_field:
                by_field[field] = {"correct": 0, "total": 0}
            by_field[field]["total"] += 1
            truth = ground_truth[doc["id"]][field]
            if value == truth:
                by_field[field]["correct"] += 1

    print("\nACCURACY BY FIELD:")
    for field, stats in by_field.items():
        acc = stats["correct"] / stats["total"]
        flag = " <-- WEAK" if acc < 0.85 else ""
        print(f"  {field}: {acc:.0%}{flag}")

Stratified random sampling

Random sampling from a 1000-document corpus may include zero handwritten receipts (8% of volume). Stratified sampling guarantees every type is represented:

def stratified_sample(documents, n_per_type=5):
    """Sample N documents per type, not N total."""
    by_type = {}
    for doc in documents:
        by_type.setdefault(doc["doc_type"], []).append(doc)

    sample = []
    for doc_type, docs in by_type.items():
        picks = random.sample(docs, min(n_per_type, len(docs)))
        sample.extend(picks)
    return sample
    # 3 types x 5 = 15 documents
    # EVERY type represented, including rare ones

Why this matters for ongoing monitoring: After initial validation, you still need to catch novel error patterns. New vendor formats, regulatory changes, or data drift can introduce errors the model is confidently wrong about. Stratified sampling per review cycle catches these.

Confidence calibration with labeled data

Model-reported confidence (0.92) doesn't mean 92% of those extractions are correct. Calibrate using a labeled validation set:

# Among docs where model reports confidence > 0.95:
#   Actual accuracy: 97% (well-calibrated)

# Among docs where model reports confidence 0.85-0.90:
#   Actual accuracy: 71% (overconfident!)

# Handwritten receipts specifically:
#   Model confidence: 0.88 average
#   Actual accuracy: 62% (severely overconfident)

High confidence does not equal high accuracy. Calibration reveals where the model is overconfident -- enabling per-type automation thresholds instead of one global cutoff.

Check your understanding

Q1. Your legal contract extraction system achieves 91% overall accuracy across 3 contract types. Management wants to deploy it without human review. What should you do first?

A) Agree -- 91% exceeds the acceptable error threshold for legal documents. B) Break down accuracy by contract type and field -- the aggregate may mask that NDAs extract at 99% accuracy while employment contracts extract at 68%, which is unacceptable for legal use. C) Set the confidence threshold to 0.95 to filter out uncertain extractions. D) Run the test on 10x more contracts to get a more reliable aggregate number.

Correct: B. Aggregate accuracy masks per-type variation. A system that excels on NDAs but fails on employment contracts looks adequate in aggregate but is dangerous for the failing type. Validate each contract type independently before removing human review. Why not A: 91% aggregate could hide a contract type with an unacceptable error rate. Why not C: Confidence thresholds assume the model's confidence is calibrated -- which hasn't been verified. Why not D: A larger test set may confirm the aggregate without revealing per-type failures.

Q2. Your medical records extraction system has maintained 97% accuracy for four months. Currently, only extractions with confidence below 0.80 go to human review. A director suggests removing all human review to cut costs. What is the right approach?

A) Agree -- four months of 97% accuracy demonstrates the system is reliable. B) Keep reviewing low-confidence extractions AND add stratified random sampling of high-confidence extractions to catch errors the model is confidently wrong about. Data drift and new record formats can introduce silent failures. C) Reduce the confidence threshold to 0.60 to review even fewer extractions. D) Switch to flat random 3% sampling across all extraction types.

Correct: B. High-confidence errors are the most dangerous -- the system is wrong but doesn't know it. Stratified sampling by record type ensures rare formats aren't missed. Past accuracy doesn't protect against future data drift. Why not A: Four months of stability doesn't account for new record formats, provider changes, or seasonal variation. Why not C: Lowering the threshold reviews even fewer items -- you'd catch less, not more. Why not D: Flat random sampling undersample rare record types where errors concentrate.

Q3. Among documents with model confidence > 0.90, actual accuracy is 98% for invoices but 65% for handwritten receipts. Both are in the same confidence band. What does this reveal?

A) The confidence threshold should be higher. B) Confidence scores are well-calibrated and reliable. C) Confidence must be validated by document type -- a single confidence band can contain very different actual accuracy rates across document types. Calibrate thresholds per type, not globally. D) Handwritten receipts should be excluded from the system.

Correct: C. Same confidence band, wildly different actual accuracy by type. Global thresholds are misleading -- calibrate per document type. Why not A: Higher thresholds don't fix per-type variation within a band. Why not B: This proves confidence is NOT well-calibrated for receipts. Why not D: Exclusion is extreme -- route to human review instead while improving the extraction.

Exam tips

"94% accuracy, automate everything" --> validate by document type and field first
"Review only low-confidence" --> add stratified sampling of HIGH-confidence extractions
Wrong answer trap: "Increase confidence threshold" -- thresholds assume calibration, which must be verified first
Wrong answer trap: "Larger test set" -- confirms aggregate without revealing per-type failures
Wrong answer trap: "Random 5% sampling" -- misses rare document types; use stratified sampling
The core pattern: Aggregate accuracy → per-type accuracy → per-field accuracy → stratified sampling → calibrated thresholds. Skip a step and you miss failures.

LAB 5.6: INFORMATION PROVENANCE AND MULTI-SOURCE SYNTHESIS

What the exam tests

Source attribution lost during summarization (claim-source mappings compressed away)
Structured claim-source mappings that synthesis agents preserve and merge
Conflicting statistics from credible sources: annotate with source, don't silently pick one
Temporal data: publication dates prevent false contradictions
Requiring subagents to output structured mappings (source, URL, date, excerpt)
Distinguishing well-established findings from contested ones
Rendering content types appropriately (financial as tables, news as prose)

Source attribution is lost in summarization

Seamless synthesis paragraphs destroy traceability. The reader cannot determine which fact came from which source:

# ✗ ANTI-PATTERN: Unsourced synthesis
"Battery costs have dropped 89% and the AI market is
growing at 35% CAGR."
# Which source? What date? Can't verify.

# ✓ CORRECT: Claim-source mappings preserved
[
    {
        "claim": "Battery costs dropped 89% since 2010",
        "source": "BloombergNEF Energy Report",
        "url": "https://example.com/energy-2024",
        "date": "2024-11-20",
        "excerpt": "Lithium-ion pack prices fell 89%..."
    },
    {
        "claim": "AI market growing at 35% CAGR",
        "source": "Gartner Technology Forecast",
        "url": "https://example.com/ai-forecast",
        "date": "2025-03-15",
        "excerpt": "The global AI market is projected..."
    }
]

Handling conflicting sources

Two credible sources disagree: annotate the conflict with both values. Never silently pick one. Never average.

# ✗ ANTI-PATTERN: Silently select one value
market_size = "$150B"  # Picked Source A, ignored Source B

# ✓ CORRECT: Annotate the conflict
{
    "claim": "AI market size",
    "values": [
        {"value": "$150B", "source": "Gartner",
         "date": "2025-03"},
        {"value": "$184B", "source": "IDC",
         "date": "2025-01"}
    ],
    "conflict": True,
    "resolution_note": "Difference likely due to scope: "
        "Gartner excludes hardware, IDC includes it."
}

Temporal metadata prevents false contradictions

# Without dates, this looks like a contradiction:
# "Battery cost: $400/kWh" vs "Battery cost: $139/kWh"

# With dates, it's a trend:
{
    "claim": "Battery cost per kWh",
    "values": [
        {"value": "$400/kWh", "source": "DOE",
         "date": "2015-06"},
        {"value": "$139/kWh", "source": "DOE",
         "date": "2024-12"}
    ],
    "conflict": False,
    "note": "Price decline over 9 years, same source"
}

Rendering content types appropriately

Match the output format to the content type:

Financial data --> tables with precise numbers
News/events --> prose with dates and source attribution
Technical findings --> structured lists with code references
Conflicting data --> side-by-side comparison with source metadata

Check your understanding

Q1. A research report combines findings from 4 sources into a seamless narrative. A stakeholder acts on a pricing figure that turns out to be from an outdated Slack message. What should the synthesis agent have done?

A) Only use official sources, excluding Slack. B) Preserve claim-source mappings throughout synthesis so every fact can be traced to its source, URL, and date -- enabling the reader to assess reliability. C) Add a disclaimer that sources may be outdated. D) Have the coordinator verify all facts before synthesis.

Correct: B. Claim-source mappings let the reader see that the pricing figure came from a Slack message (lower reliability) vs an official document (higher reliability). Why not A: Slack messages can contain valuable information -- the issue is traceability, not exclusion. Why not C: Disclaimers don't help the reader identify which specific facts are unreliable. Why not D: Verification before synthesis is impractical for large reports.

Q2. Two credible sources report different AI market sizes: $150B (Gartner, March 2025) and $184B (IDC, January 2025). What should the synthesis agent do?

A) Average the two numbers. B) Use the more recent source. C) Annotate the conflict with both values, sources, dates, and a note explaining the likely reason for the difference (e.g., different scope definitions). D) Exclude both values since they conflict.

Correct: C. Conflicting values from credible sources should be preserved with full attribution, not silently resolved. The resolution note helps the reader understand the discrepancy. Why not A: Averaging creates a number neither source reported. Why not B: Recency doesn't determine correctness -- the scopes may differ. Why not D: Excluding conflicting data hides important information.

Q3. A synthesis report lists "Battery cost: $400/kWh" and "Battery cost: $139/kWh" as contradictory findings. Both are from the same DOE source. What information is missing?

A) The methodology used for each measurement. B) Publication dates -- $400/kWh is from 2015, $139/kWh from 2024. With dates, this is a trend, not a contradiction. C) The confidence level of each finding. D) The geographic region of each measurement.

Correct: B. Temporal metadata (publication dates) reveals this is a price decline over 9 years, not a contradiction. Without dates, the synthesis agent misinterprets the difference. Why not A: Methodology is relevant but dates alone resolve this case. Why not C: Confidence doesn't explain the numerical difference. Why not D: Geographic region isn't the differentiating factor here.

Exam tips

"Can't trace which source" --> preserve claim-source mappings through synthesis
"Two sources disagree" --> annotate the conflict with both values, sources, and dates
"Apparent contradiction from same source" --> check publication dates (temporal metadata resolves it)
Wrong answer trap: "Average the conflicting values" -- creates a number no source reported
Wrong answer trap: "Use the most recent source" -- recency does not determine correctness
The core pattern: Every claim linked to source + date. Conflicts annotated, never silently resolved. Temporal metadata prevents false contradictions.

LAB FINAL: CAPSTONE EXAM SCENARIO

Domains reinforced: 1, 2, 3, 4, 5 Combines: agentic loops, tool design, Claude Code config, prompt engineering, context management

Scenario: Customer Support Resolution Agent

Customer support resolution agent using the Claude Agent SDK. Handles returns, billing disputes, and account issues via MCP tools (get_customer, lookup_order, process_refund, escalate_to_human). Target: 80%+ first-contact resolution.

This scenario integrates all five domains -- the exam tests each domain's patterns in an integrated context:

Domain 1 (Agentic Architecture): The agentic loop pattern, prerequisite gates, hooks
Domain 2 (Tool Design): Structured error responses, scoped tool access
Domain 3 (Claude Code): CLAUDE.md for project context, path-specific rules
Domain 4 (Prompt Engineering): Explicit escalation criteria, few-shot examples
Domain 5 (Context Management): Case facts blocks, escalation patterns, accuracy validation

The integrated architecture

# Domain 1: Agentic loop with stop_reason control
while True:
    response = client.messages.create(
        model="claude-sonnet-4-5",
        tools=support_tools,
        messages=messages
    )
    if response.stop_reason == "end_turn":
        break
    if response.stop_reason == "tool_use":
        # Domain 1: Prerequisite gate
        # get_customer must succeed before process_refund
        execute_and_append(response, messages)

# Domain 2: Structured error responses from tools
def process_refund(order_id, amount):
    if amount > 500:
        return {
            "isError": True,
            "errorCategory": "business",
            "isRetryable": False,
            "customer_friendly": "I'll connect you with "
                "a supervisor for refunds over $500."
        }

# Domain 5: Case facts block
case_facts = """
## CASE FACTS (Do not summarize)
| Customer | Jane Smith (CUST-001) |
| Order | ORD-555 |
| Issue | Overcharged $50.00 |
| Resolution | Refund to original payment |
"""

# Domain 3: CLAUDE.md for CI-deployed support agent
# .claude/CLAUDE.md
Use the support agent configuration.
@import ./rules/escalation.md

# .claude/rules/escalation.md
---
paths: ["support/**"]
---
Escalate when: customer requests human, policy gap,
refund > $500. Do NOT escalate on sentiment alone.

# Domain 4: Explicit criteria with few-shot examples
system_prompt = """You are a support agent. Escalate ONLY when:
1. Customer explicitly requests a human
2. No policy covers this situation
3. Refund amount exceeds $500

Example (resolve autonomously):
Customer: "I got the wrong item, order ORD-123"
Action: Look up order, process replacement. No escalation.

Example (escalate):
Customer: "I want to speak to a manager"
Action: Escalate immediately. Do not investigate first."""

Check your understanding

Q1. (Domain 1) The agent occasionally transfers funds between accounts without confirming the destination account with the customer. The system prompt says "always confirm destination before transfers." Audit shows 3% of transfers go to unconfirmed accounts. What is the correct fix?

A) Rewrite the system prompt with stronger language about confirmation requirements. B) Add a programmatic prerequisite gate that blocks transfer_funds until confirm_destination has been called and the customer has approved. C) Add few-shot examples showing the confirmation step. D) Require the customer to enter the destination account number twice.

Correct: B. Financial transfers require deterministic enforcement. A gate makes it architecturally impossible to transfer without confirmation -- no prompt compliance gap. Why not A/C: Prompt improvements reduce violations but leave a non-zero failure rate -- unacceptable for fund transfers. Why not D: Double-entry is a UX pattern, not an agent architecture solution.

Q2. (Domain 2) The check_warranty tool returns {"warranty": null} when the warranty service is temporarily unreachable. The agent tells the customer "your product is not covered under warranty." The customer actually has an active warranty. What should the tool return?

A) A more descriptive null value like {"warranty": "unknown"}. B) isError: true with errorCategory: "transient" and isRetryable: true -- so the agent knows the warranty check failed and can retry or escalate instead of assuming no coverage. C) The same response but log the failure internally. D) A default warranty status of "active" to avoid false negatives.

Correct: B. The tool is signaling success when it should signal failure. isError: true tells the agent the check didn't execute, preventing it from making false claims about warranty status. Why not A: "Unknown" is better than null but still lacks the error metadata the agent needs for recovery. Why not C: Internal logging doesn't help the agent make a correct real-time decision. Why not D: Defaulting to "active" creates false positives -- equally wrong in the opposite direction.

Q3. (Domain 3) You want Claude Code to apply escalation rules only when editing support agent files, not when working on billing or shipping code. How?

A) Put the rules in the root CLAUDE.md. B) Create a path-specific rule in .claude/rules/escalation.md with paths: ["support/**"] so the escalation rules load only when editing support files. C) Create a directory-level CLAUDE.md in every support directory. D) Add the rules to a skill.

Correct: B. Path-specific rules with glob patterns load conditionally based on which files you're editing. Why not A: Root CLAUDE.md loads for all files. Why not C: Directory-level CLAUDE.md works but is less maintainable across multiple support directories. Why not D: Skills are for on-demand workflows, not always-loaded rules.

Q4. (Domain 4) The agent handles password resets autonomously (should escalate -- requires identity verification) but escalates billing inquiries to humans (should resolve -- straightforward lookups). The escalation logic is inverted. What fixes this?

A) Train a sentiment model to detect when customers need human help. B) Add explicit escalation criteria with examples: "Escalate: identity verification, account recovery, security concerns. Resolve autonomously: billing lookups, order status, plan details." C) Have the agent self-report confidence and escalate when uncertain. D) Add a secondary model that audits escalation decisions.

Correct: B. Explicit criteria with examples define clear boundaries between "must escalate" (security-sensitive) and "can resolve" (informational) categories. Why not A: Sentiment doesn't predict which tasks require human involvement. Why not C: Self-reported confidence is uncalibrated -- the agent is already confident about the wrong decisions. Why not D: Over-engineered when the core problem is missing criteria.

Q5. (Domain 5) After a lengthy troubleshooting conversation, the agent tells the customer their shipping address is "the one on file" but can no longer recall the specific address, tracking number (TRK-88421), or the promised delivery date (April 25). What is the fix?

A) Increase the context window to hold more conversation history. B) Extract transactional details (address, tracking number, delivery date) into an immutable case facts block that persists unchanged regardless of conversation length. C) Have the agent re-call get_shipping_details to refresh the information. D) Implement more aggressive summarization to keep the context lean.

Correct: B. Case facts blocks preserve specific identifiers verbatim outside the summarization process. The tracking number and delivery date survive any amount of conversation. Why not A: A larger context delays the problem but summarization still erodes specifics. Why not C: Re-querying wastes tokens when the data was already retrieved -- and the problem will recur. Why not D: Summarization is what's causing the loss of specifics -- more of it makes the problem worse.

Exam tips

This capstone demonstrates how each domain's pattern applies in an integrated system:

Agentic loops (Domain 1) drive the conversation via stop_reason
Structured errors (Domain 2) enable intelligent tool failure handling
CLAUDE.md and path rules (Domain 3) provide project context in CI and interactive sessions
Explicit criteria (Domain 4) calibrate escalation decisions with few-shot examples
Case facts (Domain 5) preserve critical data across long conversations

Each domain's anti-patterns remain wrong in the integrated context. A prompt-based gate is still wrong in the capstone. Sentiment-based escalation is still wrong. Silent error suppression is still wrong. The integration doesn't change the rules -- it combines them.

Take the Practice Exam. Score 900+ before scheduling the real exam. If a domain scores below 80%, re-read those sections.

SCENARIO WALKTHROUGHS

The exam draws 4 of 6 scenarios at random. Each of these walkthroughs frames what the scenario tests, which labs cover it, and the traps that catch exam-takers who studied the labs in isolation without seeing how the patterns combine.

SCENARIO 1: CUSTOMER SUPPORT RESOLUTION AGENT

The setup

A support agent built on the Claude Agent SDK handles returns, billing disputes, and account issues. It integrates with backend systems via custom MCP tools (get_customer, lookup_order, process_refund, escalate_to_human). The target is 80%+ first-contact resolution while correctly escalating cases that exceed the agent's scope. This scenario is the one worked in depth as the Final Capstone Lab — the walkthrough here is a study-aid summary; the full implementation is in the capstone.

Primary domains tested

Domain 1: Agentic Architecture & Orchestration
Domain 2: Tool Design & MCP Integration
Domain 5: Context Management & Reliability

Key architectural decisions

1. Prerequisite gate on customer verification

process_refund must not execute until get_customer returns a verified ID. Enforce it programmatically (a code-level check that blocks the tool call), not in the prompt — prompt instructions have a non-zero failure rate, and financial operations cannot tolerate any. See Lab 1.4.

2. PreToolUse hook on refund limits

Refunds above $500 require human approval. A PreToolUse hook that blocks the tool call and redirects to escalate_to_human is deterministic; a system-prompt instruction is probabilistic. See Lab 1.5.

3. Structured error responses from every tool

When lookup_order fails, the tool returns {isError, errorCategory, isRetryable, customer_friendly} — not a generic failure. Access failures must be distinguishable from empty results so the agent knows whether to retry, try an alternative, or escalate. See Lab 2.2.

4. Case facts block for long conversations

Customer ID, order number, issue summary, and amount go into an immutable "CASE FACTS" block at the top of every prompt. Progressive summarization erodes specifics over long conversations; the case facts block preserves them verbatim. See Lab 5.1.

5. Explicit escalation criteria, not sentiment

Escalate on policy gaps, capability limits, explicit customer requests, or business thresholds. Never on sentiment — an angry customer requesting a simple address change does not need a human. See Lab 5.2.

6. Structured handoff when escalating

The handoff to a human must include customer ID, summary, root cause analysis, partial resolution status, and recommended next action. "APP-002: needs review" forces the reviewer to reconstruct the case. See Lab 1.4 "Structured handoff" section.

What the exam actually tests

Choosing programmatic gates over prompt instructions for financial/compliance rules
Distinguishing hooks (deterministic) from prompts (probabilistic) for business-rule enforcement
Identifying sentiment-based escalation as an anti-pattern
Recognizing when a case facts block is the fix vs. when more summarization is the fix (it's almost always the former)
Structured error responses that let the agent choose retry vs alternative vs escalate

Common wrong answers

"Strengthen the prompt" for a compliance rule — always probabilistic, always wrong for critical operations
"Escalate because the customer is angry" — sentiment ≠ complexity
"Summarize more aggressively" when specifics are disappearing — more summarization makes it worse
"Return {} on database timeout" — use a structured error so the agent can recover

Worked implementation: LAB FINAL: Capstone Exam Scenario.

SCENARIO 2: CODE GENERATION WITH CLAUDE CODE

The setup

Your team uses Claude Code across the software-delivery workflow: refactoring, debugging, generating boilerplate, authoring docs. You need it to behave consistently across the team, integrate with CI/CD, and stay useful across long sessions. The architectural questions are about configuration and workflow selection, not about the model's reasoning.

Primary domains tested

Domain 3: Claude Code Configuration & Workflows (dominant)
Domain 5: Context Management & Reliability (supporting, for longer sessions)

Key architectural decisions

1. Where does each rule belong?

The CLAUDE.md hierarchy is the single most-tested concept in this scenario.

Rule type	Goes in	Why
Team coding standards, testing conventions	`.claude/CLAUDE.md` (project)	Committed to git, every developer gets it automatically
Verbosity / tone / editor-style preferences	`~/.claude/CLAUDE.md` (user)	Personal, not imposed on the team
Scoped rules (e.g., API directory must use Zod)	Directory-level `CLAUDE.md` or `.claude/rules/` with `paths:` frontmatter	Applies only where relevant
Topic-specific rules (testing, accessibility, etc.)	Individual files in `.claude/rules/`	Avoids monolithic CLAUDE.md merge conflicts

Exam trap: "A new contractor sees different Claude behavior than the existing team." The cause is almost always a rule that lives in a full-time developer's ~/.claude/CLAUDE.md but should live in .claude/CLAUDE.md. The fix is moving the rule, not rewriting the prompt.

See Lab 3.1, Lab 3.3.

2. Slash command or skill?

A slash command runs in the current session's context. A Skill runs in a forked sub-context with restricted allowed-tools. Use a Skill when the command would otherwise pollute the session with exploration noise, or when you want to restrict what tools Claude can use for that specific operation.

Exam trap: "My /review command fills the session with file contents so the rest of the conversation gets confused." A Skill with context: fork isolates the verbose exploration and returns only the summary.

See Lab 3.2.

3. Plan mode or direct execution?

Plan mode is for tasks where the design is the hard part: multi-file refactors, architectural changes, anything where reviewing the plan first is cheaper than reviewing the diff. Direct execution is for tasks where the design is obvious: single-file fixes, typos, straightforward renames.

Exam trap: A question describing a one-line typo fix with plan mode as a distractor answer — plan mode is overkill. Conversely, a multi-file refactor with direct execution as a distractor — direct execution skips the review step that the task deserves.

See Lab 3.4.

4. Iterative refinement

When Claude's first attempt is wrong, replacing vague prompts with concrete examples works better than making the prompt longer. Test-driven iteration (write the failing test, then implement) produces more consistent results than free-form iteration. The interview pattern (Claude asks clarifying questions before implementing) beats guessing defaults.

See Lab 3.5.

5. CI/CD integration

CI runs need the -p flag for non-interactive mode (without it, Claude Code hangs waiting for terminal input). Structured output uses --output-format json with --json-schema when the pipeline needs to parse findings. Separate sessions for generation and review — same-session self-review has confirmation bias; independent review finds issues the generator was blind to.

See Lab 3.6, Lab 4.6.

6. Context across long sessions

Long refactor sessions accumulate verbose tool outputs. Symptoms: Claude starts giving generic answers like "this pattern typically handles X" instead of referencing specific classes it found earlier. Mitigation: /compact or a fresh session seeded with a structured summary of prior findings — not a full transcript.

See Lab 5.1, Lab 1.7.

What the exam actually tests

Diagnosing configuration problems from symptoms (not from reading the config)
Choosing between commands, skills, and CLAUDE.md rules for a given requirement
Matching plan mode vs direct execution to task shape
Recognizing session-isolation problems in CI (generator + reviewer in the same session)
Recognizing context degradation in long sessions and choosing the right recovery

Common wrong answers

"Strengthen the system prompt" when the fix is moving a rule to project-level CLAUDE.md
"Restart the session" when the fix is /memory to surface loaded rules
"Use plan mode everywhere" — overkill for small tasks
"Keep the session open indefinitely" — accumulated context degrades answer quality

SCENARIO 3: MULTI-AGENT RESEARCH SYSTEM

The setup

A coordinator agent receives research topics and delegates to specialized subagents: one searches the web, one analyzes documents, one synthesizes findings, and one formats the final cited report. The system needs to produce comprehensive reports without gaps, without duplicated work, and with traceable provenance for every claim. This is the hardest scenario in the exam because it exercises multi-agent orchestration deeply — most wrong answers come from subtle context-isolation or provenance violations.

Primary domains tested

Domain 1: Agentic Architecture & Orchestration (dominant)
Domain 2: Tool Design & MCP Integration
Domain 5: Context Management & Reliability

Key architectural decisions

1. Hub-and-spoke, never flat

The coordinator is the only agent that talks to other agents. Subagents don't message each other or share state. Flat architectures where every agent sees the global conversation look simpler but produce worse results: subagents get distracted by irrelevant context, tool selection degrades, and the coordinator loses the ability to arbitrate conflicts.

See Lab 1.2.

2. Pass only what the subagent needs

The coordinator's conversation history is long and full of context from other delegations. Forwarding it to each subagent wastes tokens and confuses the subagent about what its specific task is. Each Task call should pass a focused, task-scoped prompt — not coordinator.full_conversation_history.

Exam trap: A question shows a Task call with context=coordinator_history and asks what's wrong. The answer is always that subagents need explicit, scoped context — not the full coordinator state.

See Lab 1.3.

3. Parallel spawning, not sequential

Four independent 20-second subtasks done sequentially take 80 seconds. Four Task calls emitted in one coordinator response run in parallel and finish in ~20 seconds. The exam tests this specifically — a scenario describing "slow research system" where the fix is parallel spawning, not switching to a faster model.

4. Provenance on every claim

When two subagents return conflicting statistics (one says 23%, another says 27%), the coordinator can't resolve the conflict without provenance — which source, from when, at what confidence level. The synthesis agent attaches {source, confidence, timestamp, agent_id} to every claim and surfaces conflicts explicitly rather than silently averaging or picking one.

See Lab 5.6.

5. Structured error propagation

A subagent whose tool call fails must distinguish access failure ("couldn't reach the database") from empty result ("searched, found nothing"). Silent empty-returns on access failures cause the coordinator to synthesize confidently from nothing. Structured error dicts with failure_type, attempted_query, and partial_results let the coordinator decide: use partial data, retry, or escalate.

See Lab 5.3.

6. Narrow decomposition is the usual root cause of "missing coverage"

If the final report has gaps, the instinct is to blame the search agent. The exam-relevant answer is almost always: the coordinator's decomposition was too narrow. Broaden the task split and the coverage improves without changing any subagent.

See Lab 1.2 "narrow decomposition" section.

What the exam actually tests

Recognizing context-isolation violations from code (pass-full-history anti-pattern)
Diagnosing slow multi-agent systems (sequential vs parallel)
Conflict resolution rules (provenance-based, not averaging)
Distinguishing access failures from empty results in subagent responses
Narrow-decomposition diagnosis when coverage is incomplete

Common wrong answers

"Give each subagent the full coordinator context" — always wrong
"Average conflicting statistics" — wrong; surface the conflict with source attribution
"Return an empty list on database timeout" — wrong; structured error with failure_type: "access_failure"
"Add more tools to the search agent" — wrong; the fix is usually decomposition, not tooling

SCENARIO 4: DEVELOPER PRODUCTIVITY WITH CLAUDE

The setup

An agent built on the Claude Agent SDK helps engineers explore unfamiliar codebases, understand legacy systems, generate boilerplate, and automate repetitive tasks. It uses the built-in tools (Read, Write, Edit, Bash, Grep, Glob) and integrates with MCP servers. Most questions in this scenario reward engineers who know the tool taxonomy cold — Grep vs Glob, Edit's failure modes, when Bash is wrong — and who think carefully about agent scope.

Primary domains tested

Domain 2: Tool Design & MCP Integration (dominant)
Domain 3: Claude Code Configuration & Workflows
Domain 1: Agentic Architecture & Orchestration

Key architectural decisions

1. Keep each agent to 4-5 tools

An agent with 18 tools calls delete_account when asked about billing. The fix is not better tool descriptions or a bigger model — it is scoping. A billing agent gets billing tools. Account-modification tools live behind a different subagent that the coordinator invokes only when needed.

See Lab 2.3.

2. Grep searches content; Glob searches names

This is the single most-tested tool-taxonomy distinction. "Find files that call processRefund" → Grep (content). "Find all files named *.test.tsx" → Glob (names). Passing a function name to Glob returns zero hits; passing a glob pattern to Grep returns confusing content matches on the literal pattern string.

See Lab 2.5.

3. Edit's non-unique-match failure has an escalation ladder

Edit fails when the old string appears multiple times. Try these in order: (1) add surrounding context to make the match unique; (2) if every occurrence should change identically, use replace_all: true; (3) if the occurrences must change differently, Read the full file, modify in memory, Write the result. Writing over an existing file as the first move is always wrong — Write replaces everything, losing any content you didn't touch.

4. MCP secrets via `${ENV_VAR}`, never hardcoded

Team tooling lives in .mcp.json (committed). Personal experiments live in ~/.claude.json (not committed). Authentication tokens always use ${ENV_VAR} expansion. A hardcoded ghp_abc123 in .mcp.json survives in git history forever even after removal — treat it as already compromised.

See Lab 2.4.

5. Incremental exploration, not read-everything

Reading 500 files upfront overflows the context window before Claude finds the relevant code. The correct pattern: Grep to find entry points → Read the primary implementation → follow imports via further Grep calls. Each step narrows the search based on what was found.

See Lab 2.5 "Incremental codebase understanding" section.

6. Built-in tool over Bash equivalent

Bash("cat config.json") is wrong when Read("config.json") exists. Bash("grep pattern file") is wrong when Grep("pattern") exists. Purpose-built tools give Claude structured results and are safer than shelling out. Bash is for operations that have no built-in equivalent.

What the exam actually tests

Tool-taxonomy questions (Grep vs Glob, Read vs Write vs Edit)
Scope diagnosis (18-tool agent misbehaving → distribute, don't describe)
MCP config placement (.mcp.json vs ~/.claude.json)
Edit failure recovery (ladder: context → replace_all → Read+Write)
Built-in-over-Bash preference

Common wrong answers

"Make tool descriptions more detailed" when the real fix is reducing tool count
"Use Bash" when a built-in tool exists
"Use Write to change a line" — Write replaces the whole file
"Put the token in .mcp.json directly" — always wrong

SCENARIO 5: CLAUDE CODE FOR CONTINUOUS INTEGRATION

The setup

Claude Code runs inside a CI/CD pipeline: on every PR it reviews the diff, flags issues, and sometimes generates tests. The pipeline needs structured output that other tools can parse, consistent review quality across runs, and behavior that doesn't hang a runner for hours. Teams care about false-positive rates because noisy findings train engineers to ignore the tool entirely.

Primary domains tested

Domain 3: Claude Code Configuration & Workflows (dominant)
Domain 4: Prompt Engineering & Structured Output

Key architectural decisions

1. Non-interactive mode is mandatory

Without -p (--print), Claude Code waits for terminal input. In a GitHub Actions runner there is no terminal — the job hangs until the runner times out. The first question on CI/CD questions is almost always: "why is my CI job running for 6 hours?" Answer: missing -p.

See Lab 3.6.

2. Structured output with schema enforcement

-p --output-format json --json-schema '{…}' produces JSON that matches a declared shape every time. A SIEM or PR-comment bot needs severity, file, line, description — enforce that with a schema, not with regex on prose. Parsing natural-language findings with regex breaks the moment Claude phrases something slightly differently.

3. Separate session for review

Same-session self-review inherits the generator's reasoning and rationalizes its own decisions. An independent session sees only the code and evaluates it on its merits. The classic pattern: claude -p "write X" in Session A, then claude -p "review diff: $(git diff)" in Session B. The review catches things the generator was blind to.

See Lab 3.6 and Lab 4.6.

4. Batch API when nobody is waiting

Nightly test-generation for 1000 files? Batch API, 50% cheaper, processes within 24 hours. Pre-merge PR check where the developer is waiting to merge? Synchronous, never Batch — the 24-hour window is unacceptable. The decision rule: is someone waiting → sync; nobody waiting → Batch.

See Lab 4.5.

5. Explicit criteria to avoid the false-positive trust spiral

"Flag long functions" returns 30 findings with 20 false positives. Engineers stop reading the output. "Flag functions exceeding 50 lines of code" returns 8 findings, all actionable. Quantify everything: line thresholds, complexity thresholds, severity cutoffs.

See Lab 4.1.

6. Prior review findings prevent duplicate comments

When CI re-runs on a new commit, feed the previous review's findings back in: "previous review found X, Y, Z. Report only NEW issues or issues still unaddressed." Without this, Claude re-flags the same issues and drowns the PR thread.

What the exam actually tests

Diagnosing hanging CI jobs (missing -p)
Designing structured output for downstream parsing
Identifying same-session self-review bias
Choosing Batch vs sync based on latency requirement
Fixing false-positive trust problems via explicit criteria

Common wrong answers

"Increase the job timeout" instead of adding -p
"Parse the review output with regex" instead of --json-schema
"Have Claude review its own generation" — always wrong in CI
"Use Batch API for pre-merge checks" — 24-hour window is a deal-breaker for blocking work

SCENARIO 6: STRUCTURED DATA EXTRACTION

The setup

Extract structured data from unstructured documents — invoices, receipts, contracts, resumes — validate against a JSON schema, and hand off to a downstream system. The pipeline must handle edge cases (missing fields, unexpected document types, handwritten entries) without either crashing or fabricating values. Aggregate accuracy metrics will look fine; the real question is whether invoices specifically extract at 97% or 70%.

Primary domains tested

Domain 4: Prompt Engineering & Structured Output (dominant)
Domain 5: Context Management & Reliability

Key architectural decisions

1. `tool_use` for schema compliance, not plain-text prompting

"Output as JSON" in a system prompt is probabilistic — the model can return prose instead, and pipelines downstream break. tool_use with a JSON schema guarantees the response matches the schema's structure. Force a specific tool with tool_choice: {"type": "tool", "name": "extract_invoice"} so the model can't return text at all.

See Lab 4.3.

2. Structure ≠ semantics

Schema compliance means the JSON fields are in the right shape. It does not mean the values are correct. An invoice phone number may be fabricated even though the field is valid. Validate the values against business rules (sum checks, date ranges, format constraints) as a separate semantic layer.

Exam trap: An extraction system shows schema-valid but semantically-wrong output. The fix is business-rule validation, not "strengthen the schema."

3. Nullable fields, not required-everywhere

Required fields force the model to fabricate when the source doesn't contain them. If the invoice has no phone number, "phone": null is honest; a made-up "+1-555-0000" satisfies the schema but is wrong. Mark fields nullable: true when the source may legitimately lack them.

4. Retry with specific errors, never generic

Validation fails. You retry. "Please try again" produces the same error. "The total field was 150.00 but the sum of line_items is 149.00 — check for a missing line item or rounding" gives Claude actionable signal. Append the specific field, expected value, and what was detected.

See Lab 4.4.

5. Few-shot examples cover the edge cases

2-4 examples — one canonical, one ambiguous, one negative/edge-case. More than 6 wastes tokens with no benefit. Include an "other" enum value and a companion detail field for unexpected document types; a rigid enum without "other" forces the model to misclassify.

See Lab 4.2.

6. Stratified accuracy, not aggregate

"97% accuracy" across 10,000 documents can hide 62% accuracy on handwritten receipts (8% of the volume). Stratify accuracy by document type and by field before automating. Flat-random sampling for validation misses rare types entirely — use stratified sampling to guarantee every type appears in the review pool.

See Lab 5.5.

What the exam actually tests

Choosing tool_use + forced tool_choice over prompt-based JSON
Distinguishing structural validation from semantic validation
Recognizing required-field-forces-fabrication as the cause of hallucinated values
Designing retry messages with actionable specifics
Detecting aggregate-metrics traps and applying stratified sampling

Common wrong answers

"Add 'output JSON' to the prompt" — probabilistic, not guaranteed
"Trust schema validation and skip business rules" — structure ≠ semantics
"Make every field required" — forces fabrication on absent data
"Retry with 'there were errors'" — no signal for what to fix
"Deploy based on 97% aggregate" — per-type metrics will reveal the hidden failure modes

ANTI-PATTERNS CHEAT SHEET

A consolidated reference across all five domains. Use this in the final week before the exam to drill pattern recognition. Each entry maps to the lab where it is taught in depth.

Priority legend: Critical appears on nearly every exam form; High appears on most; Medium is context-dependent.

Domain 1 — Agentic Architecture & Orchestration

Parsing text for loop termination Critical

Symptom: Agent exits mid-task when Claude writes "Final answer:" during intermediate reasoning alongside a tool_use block. Fix: Check stop_reason == "end_turn" exclusively. Never inspect text content for control flow. See: Lab 1.1.

Arbitrary iteration caps as primary stop signal Critical

Symptom: Loop terminates at max_iter before Claude reaches end_turn, leaving work unfinished. Fix: Iteration caps are a safety net, not a termination strategy. stop_reason is the signal. See: Lab 1.1.

Missing tool_result blocks Critical

Symptom: API error "every tool_use block needs a matching tool_result." Fix: Every tool_use block requires a tool_result with the same tool_use_id in the next role: "user" message. See: Lab 1.1.

Running the full pipeline when dynamic routing fits High

Symptom: Coordinator always invokes search → analyze → synthesize even when the query only needs analysis. Fix: Let the coordinator decide which subagents to invoke per query. Procedural pipelines waste tokens and latency. See: Lab 1.2.

Subagents inherit coordinator context (assumed) Critical

Symptom: Subagent produces vague output because it was told "continue the analysis" with no data. Fix: Pass complete findings explicitly. Subagents see only the prompt they are given -- no conversation history. See: Lab 1.3.

Flat-string handoff loses source attribution High

Symptom: Final report says "several competitors raised prices" with no companies, dates, or amounts. Fix: Pass structured data between agents with source, confidence, and timestamp fields. See: Lab 1.3.

Sequential spawning for independent subagents High

Symptom: Four 20-second tasks take 80 seconds instead of 20. Fix: Emit all Task calls in a single coordinator response so they execute in parallel. See: Lab 1.3.

Step-by-step procedural coordinator prompts Medium

Symptom: Coordinator can't adapt when step 2 finds nothing. Fix: Write goal-based prompts with quality criteria, not rigid instruction lists. See: Lab 1.3.

Prompt-based enforcement for critical rules Critical

Symptom: 3-5% of refunds process without verification, or discount policy violated occasionally. Fix: PreToolUse hook for deterministic enforcement. Prompts are probabilistic. See: Lab 1.4, Lab 1.5.

Self-reported confidence scores High

Symptom: Agent reports "95% confident" but is wrong 40% of the time at that level. Fix: Use structured criteria and programmatic checks. Model confidence is uncalibrated. See: Lab 5.5.

Domain 2 — Tool Design & MCP Integration

Vague or overlapping tool descriptions Critical

Symptom: Agent calls extract_web_results when the user says "check this document." Fix: Specific, differentiated descriptions. State what each tool is for AND what it is not for. See: Lab 2.1.

One generic tool where three specific tools fit High

Symptom: Generic extract_data(source) forces the agent to guess source type. Fix: Split into extract_document_data, extract_web_results, extract_api_response with distinct schemas. See: Lab 2.1.

Silent empty results (access failure masked as no-data) Critical

Symptom: Agent tells user "no orders found" when the orders database was unreachable. Fix: Distinguish access failures (isError: true) from genuinely empty results. Never return [] on connection failure. See: Lab 2.2.

Generic error messages without category or retry guidance High

Symptom: Agent retries indefinitely on business errors, or escalates transient glitches. Fix: Return isError, errorCategory (transient/business/access), and isRetryable. See: Lab 2.2.

18-tool agent degrades tool selection High

Symptom: Agent calls delete_account when user asks about billing. Fix: Keep each agent to 4-5 focused tools. Distribute to specialized subagents. See: Lab 2.3.

Generic fetch accepting any URL High

Symptom: Agent hits internal metadata endpoints or arbitrary third-party URLs. Fix: Constrained tool validates inputs against an allowlist. See: Lab 2.3.

Hardcoded API keys in .mcp.json Critical

Symptom: Secrets leaked via version control. Fix: Environment variable expansion: ${ENV_VAR} in configuration files. See: Lab 2.4.

Glob for content search; Grep for file names Medium

Symptom: Glob("processRefund") returns no results (it searches names, not contents). Fix: Grep searches file contents. Glob searches file paths. Never confuse them. See: Lab 2.5.

Read-everything-first codebase exploration Medium

Symptom: Reading 500 files overflows context before finding the relevant code. Fix: Incremental understanding: Grep to find references → Read implementation → follow imports. See: Lab 2.5.

Domain 3 — Claude Code Configuration & Workflows

Personal preferences in project CLAUDE.md Medium

Symptom: New team member sees different behavior than experienced developers. Fix: Team standards go in .claude/CLAUDE.md (committed). Personal preferences go in ~/.claude/CLAUDE.md. See: Lab 3.1.

Monolithic 800-line CLAUDE.md Medium

Symptom: Merge conflicts every sprint as multiple developers edit the same file. Fix: Split by topic into .claude/rules/ -- each file auto-loads, independently editable. See: Lab 3.1.

Commands for complex context-polluting tasks High

Symptom: /review command fills session context with exploration noise. Fix: Use a Skill with context: fork so verbose output stays isolated. See: Lab 3.2.

Plan mode for trivial single-file fixes Medium

Symptom: Simple typo fix takes three times longer because Claude produces a plan first. Fix: Plan mode is for multi-file architectural work. Direct execution for obvious fixes. See: Lab 3.4.

Same-session self-review in CI Critical

Symptom: Self-review finds no issues; independent review finds 2-3 per migration. Fix: Run generator and reviewer in separate sessions. Same-session review inherits generation context. See: Lab 3.6, Lab 4.6.

Missing -p flag in CI Critical

Symptom: GitHub Actions job runs 6 hours until runner times out. Fix: -p (--print) flag for non-interactive mode. Without it, Claude Code waits for terminal input. See: Lab 3.6.

Parsing natural-language CI output High

Symptom: SIEM pipeline breaks whenever Claude rephrases findings. Fix: --output-format json with --json-schema for deterministic structure. See: Lab 3.6.

Domain 4 — Prompt Engineering & Structured Output

Vague instructions ("make it better", "flag long functions") Critical

Symptom: 30 findings per PR, 20 are false positives. Reviewers stop trusting the output. Fix: Explicit measurable criteria: "flag functions exceeding 50 lines." See: Lab 4.1.

More than 6 few-shot examples Medium

Symptom: Prompt is huge, quality plateaus. Fix: 2-4 examples is optimal. More than 6 is always wrong on the exam. See: Lab 4.2.

Assuming tool_use guarantees semantic correctness High

Symptom: Schema validation passes, but extracted phone numbers are fabricated. Fix: tool_use guarantees JSON structure only. Values still need business-rule validation. See: Lab 4.3.

Required fields forcing hallucination High

Symptom: Invoice has no phone number; Claude invents one to satisfy the schema. Fix: Mark fields optional or nullable when the source may not contain them. See: Lab 4.3.

Generic retry messages ("please try again") High

Symptom: Retry loop produces the same error. Fix: Append field-specific details: expected format, detected value, what to fix. See: Lab 4.4.

Batch API for blocking workflows Medium

Symptom: Pre-merge PR check blocks for 24 hours. Fix: Batch API (50% savings) is for non-blocking work. Someone waiting → synchronous. See: Lab 4.5.

Domain 5 — Context Management & Reliability

Progressive summarization destroys specifics Critical

Symptom: "Customer John Smith (ACC-12345)" becomes "billing issue" after three summaries. Fix: Case facts block -- immutable structured reference preserving IDs, amounts, dates verbatim. See: Lab 5.1.

Lost-in-the-middle placement of critical data High

Symptom: Agent ignores important information buried mid-context. Fix: Place key findings at the beginning or end of long inputs. See: Lab 5.1.

Sentiment-based escalation Critical

Symptom: Angry customer requesting a simple address change gets escalated to a human. Fix: Escalate on policy gaps, capability limits, explicit requests, or business thresholds -- not sentiment. See: Lab 5.2.

Access failure returned as empty result Critical

Symptom: Upstream tool couldn't connect; downstream agent acts as if there's no data. Fix: Propagate structured errors with error_type: "access_failure" vs "empty_result". See: Lab 5.3.

Aggregate accuracy hides per-category failures Critical

Symptom: Overall 95% accuracy masks invoices at 70% vs receipts at 97.8%. Fix: Track accuracy stratified by document type. Aggregate metrics lie. See: Lab 5.5.

No provenance tracking in multi-agent systems High

Symptom: Subagents disagree; coordinator can't determine which to trust. Fix: Tag every claim with source, confidence, timestamp, and agent id. Verified > extracted > inferred. See: Lab 5.6.

How to use this sheet: In the week before your exam, read one domain per day and quiz yourself on the "Symptom" without looking at the "Fix." If you can recall the fix and cite the pattern, that anti-pattern is locked in. If not, re-read the linked lab.

FAQ

Practical answers for learners using this guide to prepare for the Claude Certified Architect — Foundations exam.

About the exam

What is the CCA Foundations exam?

Anthropic's entry-level certification for solution architects who design and build production applications with Claude. It validates practical judgment across the Claude Agent SDK, the Claude API, Claude Code, and the Model Context Protocol (MCP).

What's the format?

60 multiple-choice questions. Each question has one correct answer and three distractors. Unanswered questions are scored as incorrect. The exam is scenario-based: 4 of 6 scenarios are selected at random, and each scenario frames a set of questions.

What's the passing score and time limit?

720 out of 1000, scaled. The scaled score adjusts for form-to-form difficulty differences, so you do not literally need 72% correct. On the Practice Exam in this guide, aim for 90%+ in Exam Mode before scheduling the real exam. Time limit is 120 minutes.

What does the exam cost?

Free at the time of writing.

Do I need to write code during the exam?

No. You will read Python, bash, and JSON fluently — recognizing correct implementations vs anti-patterns in code shown in questions — but you will not write code. Every question is multiple choice.

How do I register?

Anthropic's registration page.

Using this guide

What's the intended order?

Work Modules 1 → 5 in order. After the Final Capstone, take the Practice Exam in Exam Mode. If any domain scores below 80%, re-read those labs and retake the exam.

Can I skip labs on topics I already know?

The Check-Your-Understanding questions and Anti-pattern blocks are where most exam-relevant details live. A concept you think you understand may have a wrong-answer trap you haven't internalized yet — skimming a lab you think you know costs little and often catches gaps you didn't know you had.

How do I know I'm ready?

Three criteria: (1) you complete all 31 labs including the capstone, (2) you score 900+ on the Practice Exam in Exam Mode, and (3) for any domain below 90%, you can close your eyes and recite the two or three hardest anti-patterns without prompting. If all three hold, schedule the exam.

What if a domain scores low on the practice exam?

Start with the Anti-Patterns Cheat Sheet entries for that domain. Each entry links back to the lab that teaches the pattern. Re-read the lab, redo its Check-Your-Understanding questions, then retry the practice exam. Most domain weaknesses resolve with one focused re-read.

About this guide

Where does the canonical exam content live?

Anthropic's official exam guide is the authoritative source. If anything here contradicts it, trust the official guide.

Which Claude model do the code examples use?

claude-sonnet-4-5. If you are running the examples yourself against the Anthropic API, update the model ID to the latest Sonnet available at the time.

Where do I report errors or suggest improvements?

Open an issue on the GitHub repo. Include the lab number and a short description of what's wrong or could be clearer.

PRACTICE EXAM

60 scenario-based questions across 5 domains. Simulates the real CCA Foundations certification exam.

Questions

720

Pass Score

120m

Time Limit

Score 900+ in Exam Mode before scheduling the real CCA Foundations exam. If a domain scores below 80%, re-read those module labs and focus on the Exam tips sections.

Keys: A–D select · ←/→ navigate · F flag

Review Questions

AnsweredUnansweredFlagged

Exam Results

Domain Breakdown

Question Review

CorrectIncorrect