Building a Multi-Agent Clue Board Game with LLMs: A Complete Technical Guide

A detailed walkthrough of building an autonomous AI-powered Clue (Cluedo) board game using CrewAI, Google Gemini 2.5 Flash, deterministic state tracking, evals, and comprehensive tests — a blueprint for your own LLM-powered game or agent project.
Tags: AI, LLM, CrewAI, Multi-Agent, Python, Testing, Evals, Gemini

Author: Naveen Madhire

Published: July 6, 2025

Introduction

What happens when you hand the classic murder-mystery board game Clue over to six AI agents and let them compete to solve the case? That’s exactly the question I set out to answer by building clue-board-game-with-llm — a fully autonomous, multi-agent implementation of Cluedo powered by CrewAI and Google Gemini 2.5 Flash.

This blog walks through every technical decision in that project: why LLMs alone are not enough for structured game logic, how a deterministic detective notebook prevents agents from “forgetting” what they’ve deduced, how validation tools act as built-in evaluators (evals), and how a comprehensive test suite keeps the whole thing reliable. If you want to build something similar — a game, a simulation, or any long-running multi-agent workflow — this is the guide for you.


The Problem: Why Is This Interesting?

Clue (Cluedo) is deceptively complex to automate with LLMs. Each round, every player:

  1. Moves across a 9-room mansion with specific doors and secret passages
  2. Makes a suggestion — naming a suspect, weapon, and the current room
  3. Observes disproval — one card is secretly shown by the first opponent who can refute the suggestion
  4. Updates their mental model — narrowing down the 21 possible cards
  5. Decides when to accuse — a wrong accusation eliminates that player permanently

The game spans many turns. A raw LLM conversation thread cannot reliably track structured state across dozens of back-and-forth exchanges — it hallucinates cards it “showed” or “received,” confuses room names, and makes illogical accusations. The challenge, then, is not just prompting an LLM; it is engineering an architecture that keeps the LLM in its sweet spot (natural language reasoning) while offloading state tracking and rule enforcement to deterministic code.
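
A quick sanity check on those numbers: with 6 suspects, 6 weapons, and 9 rooms, the 21 cards split into a 3-card solution envelope and 18 dealt cards (3 per player with six players). A minimal, standalone setup sketch (just the arithmetic, not the project's GameState):

```python
import random

SUSPECTS = ["Miss Scarlet", "Colonel Mustard", "Mrs. White",
            "Mr. Green", "Mrs. Peacock", "Professor Plum"]
WEAPONS = ["Knife", "Candlestick", "Revolver", "Rope", "Lead Pipe", "Wrench"]
ROOMS = ["Kitchen", "Ballroom", "Conservatory", "Dining Room", "Billiard Room",
         "Library", "Lounge", "Hall", "Study"]

def deal_game(players, seed=None):
    """Pick one card per category for the envelope, shuffle and deal the rest."""
    rng = random.Random(seed)
    solution = {
        "suspect": rng.choice(SUSPECTS),
        "weapon": rng.choice(WEAPONS),
        "room": rng.choice(ROOMS),
    }
    # 21 cards minus the 3 in the envelope leaves 18 to deal
    deck = [c for c in SUSPECTS + WEAPONS + ROOMS if c not in solution.values()]
    rng.shuffle(deck)
    # Round-robin deal: with 6 players, each receives exactly 3 cards
    hands = {p: deck[i::len(players)] for i, p in enumerate(players)}
    return solution, hands

solution, hands = deal_game([f"P{i}" for i in range(6)], seed=1)
```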


Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                     main.py (game loop)                     │
│  ┌──────────┐   ┌───────────────────────────────────────┐   │
│  │ Moderator│   │  6 Player Agents (Scarlet, Mustard…)  │   │
│  │  Agent   │   │  Each runs a CrewAI mini-crew per turn│   │
│  └──────────┘   └───────────────────────────────────────┘   │
│       │                            │                        │
│       ▼                            ▼                        │
│  ┌─────────────┐   ┌──────────────────────────────────┐     │
│  │ Moderator   │   │           Player Tools           │     │
│  │ Tools       │   │ game_tools   │   notebook_tools  │     │
│  │ (validation)│   └──────────────────────────────────┘     │
│  └─────────────┘                    │                       │
│                                     ▼                       │
│                        ┌────────────────────────┐           │
│                        │       GameState        │           │
│                        │ (single source of      │           │
│                        │  truth, Python obj)    │           │
│                        └────────────────────────┘           │
│                                     │                       │
│                                     ▼                       │
│                        ┌────────────────────────┐           │
│                        │   Detective Notebook   │           │
│                        │  (deterministic grid)  │           │
│                        └────────────────────────┘           │
└─────────────────────────────────────────────────────────────┘

There are four distinct layers:

Layer                        Responsibility
LLM Agent (CrewAI + Gemini)  Natural language reasoning, planning, decision-making
Tools                        Thin wrappers that translate LLM decisions into game actions
GameState                    Authoritative Python object holding all game facts
Detective Notebook           Per-player deterministic deduction grid

Project Structure

src/clue_game/
├── __init__.py
├── main.py              # Game loop — orchestrates all turns
├── game_state.py        # All game state (rooms, players, cards, turn order)
├── crew.py              # CrewAI agent & task definitions
├── notebook.py          # Deterministic per-player deduction notebook
├── toon_utils.py        # TOON format utilities for token efficiency
├── config/
│   ├── agents.yaml      # Agent personalities and rules knowledge
│   └── tasks.yaml       # Task templates
└── tools/
    ├── __init__.py
    ├── game_tools.py       # Move, suggest, accuse, get-status tools
    ├── notebook_tools.py   # Notebook read/write tools
    └── validation_tools.py # Moderator validation & quality tracking
tests/
├── test_game_state.py
├── test_game_tools.py
├── test_notebook.py
├── test_toon_format.py
├── test_validation.py
└── test_main.py

The LLM Layer: CrewAI + Google Gemini 2.5 Flash

Why CrewAI?

CrewAI is a Python framework for orchestrating multiple LLM agents, each with its own role, backstory, and tools. It maps cleanly onto the Clue model: one moderator agent (an impartial referee) and six player agents (autonomous detectives).

Each agent is defined in agents.yaml with:

  • role: e.g., "Detective Miss Scarlet" or "Game Moderator"
  • goal: what this agent is trying to achieve
  • backstory: personality flavor that influences LLM tone
  • tools: the Python functions this agent can call

Why Gemini 2.5 Flash?

Gemini 2.5 Flash offers a large context window (1M tokens), strong reasoning, tool-calling support, and competitive cost. The 1M context is important: a full game of Clue involves many turns of tool outputs, and you want the model to be able to see its own history without truncation.

The Agent Definition

# crew.py (simplified)
from crewai import Agent, Crew, Process, Task
from crewai.project import CrewBase, agent

@CrewBase
class ClueGameCrew:
    agents_config = "config/agents.yaml"
    tasks_config  = "config/tasks.yaml"

    @agent
    def player_scarlet(self) -> Agent:
        return Agent(
            config=self.agents_config["player_scarlet"],
            tools=PLAYER_TOOLS,
            verbose=False,
        )

    @agent
    def game_moderator(self) -> Agent:
        return Agent(
            config=self.agents_config["game_moderator"],
            tools=MODERATOR_TOOLS,
            verbose=False,
        )

The key insight is that players and the moderator have different tool sets. Players get game-action tools and notebook tools; the moderator gets validation and status tools.


The Autonomy Loop: Perceive → Reason → Plan → Act

Every player agent is instructed — in its task description — to follow a four-step autonomy loop on each turn:

1. PERCEIVE  — Use tools to observe: current location, cards, notebook, event log
2. REASON    — Analyze observations: update notebook, deduce what is known/unknown
3. PLAN      — Formulate strategy: move? suggest? accuse?
4. ACT       — Execute using tools: move_to_room, make_suggestion, make_accusation

This structure appears explicitly in the CrewAI task prompt:

# crew.py — player turn task description (excerpt)
description = f"""
It's your turn in the Clue game! You are {player_name}.

═══════════════════════════════════════════════════
AUTONOMOUS AGENT LOOP: Perceive → Reason → Plan → Act
═══════════════════════════════════════════════════
1. **Perceive**: Get your cards, location, notebook grid, unknown cards, event log.
2. **Reason**:   Deduce what you know. Update your notebook.
3. **Plan**:     Decide: accuse? move? suggest?
4. **Act**:      Execute using the right tool(s).

Always use your detective notebook tools. Do not rely on memory.
"""

Why does this matter? Because LLMs tend to act immediately on whatever is most salient in their context. The explicit loop forces the agent to gather information first, then reason about it, rather than jumping straight to a move.
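
Stripped of CrewAI, the loop is just an enforced ordering of four phases, where each phase can only consume what the previous one produced. A toy skeleton (hypothetical callables, purely illustrative):

```python
def run_turn(perceive, reason, plan, act):
    """One turn: each phase feeds the next, so the agent cannot act before observing."""
    observations = perceive()          # 1. PERCEIVE: gather facts via tools
    knowledge = reason(observations)   # 2. REASON: update notebook, deduce
    action = plan(knowledge)           # 3. PLAN: accuse? move? suggest?
    return act(action)                 # 4. ACT: execute via the chosen tool

# Toy usage: the phases are plain callables
result = run_turn(
    perceive=lambda: {"location": "Kitchen"},
    reason=lambda obs: {"can_accuse": False, **obs},
    plan=lambda know: "accuse" if know["can_accuse"] else "suggest",
    act=lambda action: f"ACTED: {action}",
)
# result == "ACTED: suggest"
```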


The Core Challenge: LLMs Forget

LLMs cannot maintain reliable structured state across dozens of tool calls. Specifically:

  • They may “remember” a card was shown to them when it was not
  • They may forget that Player B couldn’t disprove a suggestion three turns ago
  • They may attempt to accuse based on stale information
  • They may waste suggestions by proposing cards they already know are innocent

The solution is the Detective Notebook.


The Detective Notebook: Deterministic State Tracking

The notebook (notebook.py) is a Python object — completely outside the LLM — that deterministically tracks what each player knows:

# notebook.py (simplified)
from enum import Enum
from typing import Dict

class KnowledgeState(Enum):
    """Knowledge states a cell in the grid can take."""
    UNKNOWN  = "?"     # No information yet
    HAS      = "Y"     # Player definitely has this card
    NOT_HAS  = "N"     # Player definitely does NOT have this card
    SOLUTION = "S"     # This card is in the solution envelope

class DetectiveNotebook:
    """Per-player deduction grid tracking card ownership."""

    # Grid: card → {player → KnowledgeState}
    grid: Dict[str, Dict[str, KnowledgeState]]

Every time an agent observes something — a card is shown, a player can’t disprove, a card is identified as their own — it calls a notebook tool to record the fact. The notebook then applies deterministic inference rules:

Auto-Deduction Rules

def auto_deduce(self, card: str, player: str):
    """
    If every player except one is marked NOT_HAS for a card,
    the remaining player MUST have it.
    """
    unknown_players = [
        p for p in self.players
        if self.grid[card][p] == KnowledgeState.UNKNOWN
    ]
    if len(unknown_players) == 1:
        self.mark_has(card, unknown_players[0])

def check_solution_category(self, category: str):
    """
    If all players are marked NOT_HAS for a card,
    it must be the solution card.
    """
    for card in self.get_cards_by_category(category):
        all_not_has = all(
            self.grid[card][p] == KnowledgeState.NOT_HAS
            for p in self.players
        )
        if all_not_has:
            self.grid[card]["SOLUTION"] = KnowledgeState.SOLUTION
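
To see the two rules working together, here is a self-contained toy version using plain dicts (a hypothetical simplified grid, not the project's notebook class):

```python
UNKNOWN, HAS, NOT_HAS = "?", "Y", "N"

def auto_deduce(grid, card, players):
    """If all but one player are NOT_HAS for a card, the last unknown player has it."""
    unknown = [p for p in players if grid[card][p] == UNKNOWN]
    if len(unknown) == 1 and not any(grid[card][p] == HAS for p in players):
        grid[card][unknown[0]] = HAS

def solution_card(grid, cards, players):
    """A card no player holds must be in the solution envelope."""
    for card in cards:
        if all(grid[card][p] == NOT_HAS for p in players):
            return card
    return None

players = ["Alice", "Bob", "Carol"]
grid = {c: {p: UNKNOWN for p in players} for c in ["Knife", "Rope"]}

grid["Knife"]["Alice"] = NOT_HAS
grid["Knife"]["Bob"] = NOT_HAS
auto_deduce(grid, "Knife", players)      # Carol must hold the Knife

for p in players:
    grid["Rope"][p] = NOT_HAS
weapon = solution_card(grid, ["Knife", "Rope"], players)  # Rope is in the envelope
```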

Accusation Safety

The notebook blocks wrong accusations. Before an agent can call make_accusation, the get_possible_solution notebook tool checks whether all three categories have been narrowed to exactly one card. If not, the tool returns a warning instead of allowing the accusation.

# notebook_tools.py
@tool("Get Possible Solution")
def get_possible_solution(player_name: str) -> str:
    """Check if all 3 solution categories are confirmed. If not, warns the agent."""
    notebook = get_notebook(player_name)
    solution = notebook.get_possible_solution()
    
    if solution["can_accuse"]:
        return f"READY TO ACCUSE: {solution['suspect']} with {solution['weapon']} in {solution['room']}"
    else:
        remaining = solution["remaining_possibilities"]
        return f"NOT READY. Remaining: {remaining['suspects']} suspects, {remaining['weapons']} weapons, {remaining['rooms']} rooms"

This prevents one of the most common LLM failure modes: making a premature accusation and getting eliminated.


Tools: The LLM–Game Interface

Tools are thin Python functions decorated with @tool from CrewAI. They translate the LLM’s natural-language decisions into structured game actions and return results the LLM can reason about.

Game Action Tools (game_tools.py)

Tool                 Purpose
roll_dice            Returns dice roll result (1–12; 1 = magnifying glass free clue)
get_available_moves  Returns rooms reachable from current position
move_to_room         Validates and executes movement, enforcing door/passage rules
make_suggestion      Executes a suggestion, runs automatic disproval, returns result
make_accusation      Validates the accusation against the solution, wins or eliminates
get_my_cards         Returns the agent’s own dealt cards
get_my_knowledge     Returns a summary of the agent’s current knowledge state

The make_suggestion Tool in Depth

This is the most complex tool. When an agent calls it:

  1. The tool validates that the agent is in a room (required by rules)
  2. It checks the suggested room matches the agent’s current room
  3. It moves the suggested suspect to the current room (official rule)
  4. It iterates through all other players clockwise looking for a card to disprove
  5. The first player who has a matching card is forced to show exactly one card
  6. The result (card shown, or “no one could disprove”) is returned to the agent
  7. If no one disproved, it marks this as a strong lead

import random

from crewai.tools import tool

@tool("Make Suggestion")
def make_suggestion(player_name: str, suspect: str, weapon: str) -> str:
    """
    Make an official Clue suggestion. Room is always the player's current room.
    Triggers automatic disproval from all other players clockwise.
    """
    game_state = get_game_state()
    player = game_state.get_player_by_name(player_name)
    
    # Validate: must be in a room
    if not player.current_room:
        return "ERROR: You must be in a room to make a suggestion."
    
    room = player.current_room
    
    # Move suspect to current room (official rule)
    game_state.move_suspect_to_room(suspect, room)
    
    # Automatic disproval — clockwise from suggesting player
    for other_player in game_state.get_players_clockwise(player):
        matching_cards = other_player.get_matching_cards([suspect, weapon, room])
        if matching_cards:
            # Other player must show ONE card (randomly chosen from matches)
            shown_card = random.choice(matching_cards)
            return f"{other_player.name} showed you: {shown_card}"
    
    return "No one could disprove your suggestion! Strong lead."
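
The clockwise scan in step 4 is a simple list rotation. A sketch of how a helper like get_players_clockwise could be implemented (an assumption; the project's actual code may differ):

```python
def players_clockwise(players, current):
    """All other players in turn order, starting just after `current`."""
    i = players.index(current)
    # Rotate the list so iteration begins with the next player clockwise
    return players[i + 1:] + players[:i]

order = players_clockwise(["Scarlet", "Mustard", "Green", "Peacock"], "Mustard")
# → ["Green", "Peacock", "Scarlet"]
```

Getting this rotation right matters: the first matching player in this order is the one forced to show a card, so an off-by-one here silently changes who discloses information.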

Notebook Tools (notebook_tools.py)

These are the tools that prevent the LLM from forgetting. The key ones:

Tool                            When to Use
initialize_notebook             First turn only — loads the agent’s dealt cards into the grid
mark_player_has_card            When someone shows you a card
mark_player_not_has_card        When someone can’t disprove (doesn’t have any of the three cards)
record_suggestion_in_notebook   After any suggestion — stores full context
get_unknown_cards               Find which cards still have no owner assigned
get_possible_solution           Check if ready to accuse
get_strategic_suggestion        Get the best suspect/weapon to suggest to gain new information
view_notebook_grid              Display the full deduction grid

Evaluations (Evals): The Validation System

One of the hardest problems in LLM agent engineering is knowing whether your agent is making good decisions versus just plausible-sounding decisions. The project tackles this with a built-in eval system via the validation tools.

The Moderator as Evaluator

The moderator agent watches every suggestion and can call:

@tool("Track Suggestion Quality")
def track_suggestion_quality(player_name: str, is_wasted: bool, reason: str = "") -> str:
    """
    Track whether a suggestion was logical or wasted.
    A suggestion is 'wasted' if it includes cards already known to be innocent.
    """

A wasted suggestion is one where the agent asks about a card it already knows the owner of — it learns nothing new. A logical suggestion uses at least one unknown card, potentially eliminating it from the solution.
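
The wasted/logical distinction is mechanical once the notebook exists. A sketch of the check, assuming a simple card-to-resolved mapping rather than the project's internal API:

```python
def is_wasted_suggestion(suggested_cards, known_owner):
    """A suggestion is wasted if every named card is already resolved.

    `known_owner` maps card -> True once the notebook knows who holds it
    (or that it is in the envelope); a logical suggestion probes at least
    one unresolved card.
    """
    return all(known_owner.get(card, False) for card in suggested_cards)

known = {"Knife": True, "Kitchen": True, "Colonel Mustard": True,
         "Miss Scarlet": False}
is_wasted_suggestion(["Colonel Mustard", "Knife", "Kitchen"], known)  # True
is_wasted_suggestion(["Miss Scarlet", "Knife", "Kitchen"], known)     # False
```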

Grading System

Every agent receives a grade at the end of the game:

Grade                  Criteria
A (Excellent)          ≥80% logical suggestions, 0 invalid move attempts
B (Good)               ≥60% logical suggestions, ≤2 invalid attempts
C (Fair)               ≥40% logical suggestions
D (Needs Improvement)  <40% logical suggestions
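
The thresholds translate directly into a small function. A sketch based on my reading of the table, not the project's exact implementation:

```python
def grade_agent(logical, wasted, invalid_attempts):
    """Grade an agent from its suggestion quality and invalid move attempts."""
    total = logical + wasted
    pct = logical / total if total else 0.0
    if pct >= 0.80 and invalid_attempts == 0:
        return "A (Excellent)"
    if pct >= 0.60 and invalid_attempts <= 2:
        return "B (Good)"
    if pct >= 0.40:
        return "C (Fair)"
    return "D (Needs Improvement)"

grade_agent(logical=8, wasted=2, invalid_attempts=0)   # "A (Excellent)"
grade_agent(logical=5, wasted=5, invalid_attempts=4)   # "C (Fair)"
```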

Validation Warnings

The moderator logs structured warnings for any rule violations:

@tool("Log Validation Warning")
def log_validation_warning(
    player_name: str,
    warning_type: str,   # "invalid_move", "wasted_suggestion", "illogical_accusation"
    details: str,
    severity: str = "warning"  # "info", "warning", "error"
) -> str:
    """Log a validation event for agent performance tracking."""
    game_state = get_game_state()
    player = game_state.get_player_by_name(player_name)
    
    warning_entry = {
        "turn": game_state.turn_number,
        "player": player_name,
        "type": warning_type,
        "details": details,
        "severity": severity
    }
    
    player.validation_warnings.append(warning_entry)
    if severity == "error":
        player.invalid_move_attempts += 1
    
    game_state.validation_log.append(warning_entry)
    return f"Logged {severity}: {warning_type} for {player_name} (turn {game_state.turn_number})"

End-Game Quality Report

At the end of every game, the moderator calls get_game_quality_report() which aggregates:

  • Per-agent logical/wasted suggestion counts
  • Per-agent invalid move attempts
  • Overall game suggestion quality percentage
  • All validation warning entries with turn numbers

This gives you immediate, quantitative feedback on whether your prompts and notebook integration are working — the LLM equivalent of unit test coverage.
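
The roll-up itself is plain bookkeeping. A sketch of the aggregation, assuming warning and suggestion entries shaped like the dicts shown earlier:

```python
from collections import Counter

def quality_report(warnings, suggestions):
    """Roll per-turn events up into per-agent and game-level quality stats.

    `warnings`: list of {"player", "type", "severity", "turn"} dicts.
    `suggestions`: list of {"player", "wasted": bool} dicts.
    """
    report = {
        "invalid_moves": Counter(w["player"] for w in warnings
                                 if w["type"] == "invalid_move"),
        "wasted": Counter(s["player"] for s in suggestions if s["wasted"]),
        "logical": Counter(s["player"] for s in suggestions if not s["wasted"]),
    }
    total = len(suggestions)
    logical_total = sum(report["logical"].values())
    report["game_quality_pct"] = 100 * logical_total / total if total else 0.0
    return report

report = quality_report(
    warnings=[{"player": "Scarlet", "type": "invalid_move",
               "severity": "error", "turn": 4}],
    suggestions=[{"player": "Scarlet", "wasted": False},
                 {"player": "Mustard", "wasted": True}],
)
# report["game_quality_pct"] == 50.0
```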


Testing: Validating Every Layer

The project has six test files covering every layer of the stack. Here is what each tests and why.

test_game_state.py — Rules Correctness

These tests validate the GameState class implements official Cluedo rules correctly:

def test_initial_setup():
    """Verify exactly 1 suspect, 1 weapon, 1 room in solution envelope."""
    state = create_fresh_game_state()
    assert len(state.solution) == 3
    assert state.solution["suspect"] in SUSPECTS
    assert state.solution["weapon"] in WEAPONS
    assert state.solution["room"] in ROOMS

def test_suggestion_moves_suspect():
    """Official rule: suggesting a suspect teleports them to the current room."""
    state = create_fresh_game_state()
    player = state.players[0]
    player.current_room = "Kitchen"
    state.make_suggestion(player.name, "Miss Scarlet", "Knife")
    assert state.suspect_locations["Miss Scarlet"] == "Kitchen"

def test_wrong_accusation_eliminates_player():
    """A wrong accusation eliminates the player but they still disprove."""
    state = create_fresh_game_state()
    player = state.players[0]
    result = state.make_accusation(player.name, "Wrong Suspect", "Wrong Weapon", "Wrong Room")
    assert result["correct"] is False
    assert player.is_eliminated is True
    assert player.still_disproves is True

test_notebook.py — Deduction Logic

Tests for the deterministic notebook that is critical for agent correctness:

def test_auto_deduce_last_unknown():
    """If all but one player are marked NOT_HAS, the last one MUST have it."""
    notebook = DetectiveNotebook(players=["Alice", "Bob", "Carol"], ...)
    notebook.mark_not_has("Knife", "Alice")
    notebook.mark_not_has("Knife", "Bob")
    # Carol is the only unknown — notebook should auto-deduce
    assert notebook.get_state("Knife", "Carol") == KnowledgeState.HAS

def test_solution_detection():
    """If all players are NOT_HAS for a card, it's the solution."""
    notebook = DetectiveNotebook(players=["Alice", "Bob"], ...)
    for player in ["Alice", "Bob"]:
        notebook.mark_not_has("Ballroom", player)
    assert notebook.get_possible_solution()["room"] == "Ballroom"

def test_accusation_blocked_when_not_ready():
    """get_possible_solution returns can_accuse=False if any category unclear."""
    notebook = DetectiveNotebook(...)
    solution = notebook.get_possible_solution()
    assert solution["can_accuse"] is False

test_game_tools.py — Tool Contract Testing

These tests verify the tool outputs match what the LLM expects to see:

def test_make_suggestion_requires_room():
    """Agent not in a room should receive an error, not crash."""
    setup_game_with_player_in_hallway("Scarlet")
    result = make_suggestion.run({"player_name": "Scarlet", "suspect": "Mustard", "weapon": "Knife"})
    assert "ERROR" in result
    assert "must be in a room" in result.lower()

def test_disproval_shows_exactly_one_card():
    """When multiple cards match, exactly ONE is returned to the suggesting player."""
    setup_game_where_opponent_has_multiple_matching_cards()
    result = make_suggestion.run({"player_name": "Scarlet", "suspect": "Mustard", "weapon": "Knife"})
    # Result should show exactly ONE card name
    shown_count = result.count("showed you")
    assert shown_count == 1

test_toon_format.py — Token Efficiency

These tests validate that the TOON format actually reduces token counts:

def test_toon_reduces_tokens_vs_verbose():
    """TOON output must use fewer tokens than equivalent verbose text."""
    verbose = generate_verbose_player_status(player)
    toon = generate_toon_player_status(player)
    assert token_count(toon) < token_count(verbose)

def test_toon_savings_at_least_30_percent():
    """Savings must be at least 30% to be worth the format complexity."""
    verbose = generate_verbose_game_status(state)
    toon = generate_toon_game_status(state)
    savings = (token_count(verbose) - token_count(toon)) / token_count(verbose)
    assert savings >= 0.30

test_validation.py — Eval System

These tests verify the grading and quality-tracking system works:

def test_wasted_suggestion_is_tracked():
    """Calling track_suggestion_quality with is_wasted=True increments counter."""
    setup_player("Scarlet")
    track_suggestion_quality.run({"player_name": "Scarlet", "is_wasted": True, "reason": "Knew the card"})
    metrics = get_player_performance_metrics.run({"player_name": "Scarlet"})
    assert "Wasted suggestions: 1" in metrics

def test_grade_a_requires_80_percent_logical():
    """Grade A is only awarded for ≥80% logical suggestions with 0 invalid attempts."""
    setup_player_with_metrics("Scarlet", logical=8, wasted=2, invalid=0)
    report = get_game_quality_report.run({})
    assert "A (Excellent)" in report

test_main.py — Agent Autonomy Loop

These tests validate the high-level game loop behavior:

def test_each_player_follows_prpa_loop():
    """Every turn task description must contain the PRPA headings."""
    task = create_player_turn_crew("Scarlet", agent, moderator)
    description = task.tasks[0].description
    assert "Perceive" in description
    assert "Reason" in description
    assert "Plan" in description
    assert "Act" in description

def test_first_turn_includes_notebook_init():
    """First turn must include notebook initialization instruction."""
    task = create_player_turn_crew("Scarlet", agent, moderator, is_first_turn=True)
    description = task.tasks[0].description
    assert "Initialize My Notebook" in description

Token Efficiency: TOON Format

LLM APIs charge by the token. In a multi-turn game, tool outputs are fed back to the LLM on every turn, and verbose, human-readable text can cost roughly twice as many tokens as a compact encoding of the same information. The project uses TOON (Token-Oriented Object Notation), a compact format that combines YAML-like keys with CSV-style lists.

Before and After

Verbose text (≈70 tokens):

=== PLAYER STATUS ===

Player: Miss Scarlet
Your cards (3 total):
  - Knife
  - Ballroom
  - Colonel Mustard

Current location: Kitchen
Can suggest: Yes

Unknown suspects: Miss Scarlet, Mr. Green
Unknown weapons: Rope, Lead Pipe

TOON format (≈35 tokens):

player: Miss Scarlet
cards[3]: Knife,Ballroom,Colonel Mustard
location: Kitchen
can_suggest: true
unknown_suspects[2]: Miss Scarlet,Mr. Green
unknown_weapons[2]: Rope,Lead Pipe

Savings: ~50% fewer tokens for identical information. Across a full game, this translates to meaningful API cost reduction and more available context for reasoning.

Configuring TOON

TOON is on by default. To disable it (useful for debugging):

export CLUE_TOON_ENABLED=false

The toon_utils.py module handles formatting. Every tool output that can be compressed uses it.
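
A toy encoder conveys the idea; this shape is an assumption reverse-engineered from the example above, and toon_utils.py remains the authoritative implementation:

```python
def to_toon(data):
    """Compact key: value lines; lists become key[n]: a,b,c with no bullets."""
    lines = []
    for key, value in data.items():
        if isinstance(value, (list, tuple)):
            # CSV-style list with an explicit length marker
            lines.append(f"{key}[{len(value)}]: {','.join(map(str, value))}")
        elif isinstance(value, bool):
            # Lowercase booleans, matching the example output
            lines.append(f"{key}: {str(value).lower()}")
        else:
            lines.append(f"{key}: {value}")
    return "\n".join(lines)

print(to_toon({
    "player": "Miss Scarlet",
    "cards": ["Knife", "Ballroom", "Colonel Mustard"],
    "can_suggest": True,
}))
# player: Miss Scarlet
# cards[3]: Knife,Ballroom,Colonel Mustard
# can_suggest: true
```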


LLM Observability: MLflow Tracing

Understanding what your LLM agent is doing requires more than just reading terminal output. The project integrates MLflow Tracing to capture every LLM call, tool invocation, and agent execution in a queryable UI.

After running a game:

mlflow ui
# Open http://localhost:5000

You’ll see:

  • Which agent ran each task and for how long
  • Every LLM call with the full input prompt and response
  • Every tool call with arguments and output
  • Token usage per call (helps identify expensive operations)
  • Exceptions if any agent crashed

Configuration

# Disable MLflow tracing (saves overhead during rapid iteration)
export CLUE_MLFLOW_ENABLED=false

# Use a remote MLflow server
export MLFLOW_TRACKING_URI=http://your-mlflow-server:5000

# Custom experiment name
export MLFLOW_EXPERIMENT_NAME=My-Clue-Experiment

MLflow tracing is the difference between “it seemed to work” and “I know exactly what the LLM was thinking and how much it cost.”


Getting Started: Running the Game Yourself

Prerequisites

  • Python 3.11+
  • uv (fast Python package manager)
  • A Google AI API key (free tier available at aistudio.google.com)

Installation

# Clone the repository
git clone https://github.com/nmadhire-agents/clue-board-game-with-llm.git
cd clue-board-game-with-llm

# Install all dependencies
uv sync

# Configure your API key
cp .env.example .env
# Edit .env and set GOOGLE_API_KEY=your-key-here

Run the Game

uv run clue-game

Run the Tests

uv run pytest

How to Build a Similar Project

If you want to adapt this architecture for your own LLM-powered game or simulation, here is the step-by-step blueprint.

Step 1: Define Your State

Identify all game state that must be tracked perfectly across many turns. In Clue, that is:

  • Player locations, cards, and elimination status
  • The solution envelope
  • Suggestion history
  • The detective notebooks

Put it all in a single authoritative Python object (your GameState). Never rely on the LLM to remember this. The LLM is your reasoning engine, not your database.

# game_state.py pattern
class GameState:
    players: List[Player]
    solution: Dict[str, str]
    turn_number: int
    suggestion_history: List[Suggestion]
    validation_log: List[ValidationEntry]

_game_state: Optional[GameState] = None

def get_game_state() -> GameState:
    return _game_state  # Global singleton accessed by all tools

Step 2: Define Your Tools

Write one Python function per game action. Decorate with @tool. Return strings that tell the LLM exactly what happened and what its options are.

from crewai.tools import tool

@tool("Move To Room")
def move_to_room(player_name: str, room: str) -> str:
    """Move the player to the specified room if it is reachable."""
    state = get_game_state()
    player = state.get_player(player_name)
    
    if room not in state.get_available_moves(player):
        available = state.get_available_moves(player)
        return f"ERROR: Cannot move to {room}. Available: {available}"
    
    player.current_room = room
    return f"Moved to {room}. You can now make a suggestion here."

Key principles:

  • Validate everything — the LLM will attempt invalid actions
  • Return actionable strings — tell the agent what to do next
  • Never raise exceptions — return error strings instead; LLMs handle those better
  • Be explicit about what changed — “Moved to Kitchen” is better than just “OK”

Step 3: Build Your Deterministic Memory Layer

For any state the agent must reason about across many turns, create a deterministic Python class (like DetectiveNotebook) and expose it through tools. The pattern is:

# State stored outside the LLM
_notebooks: Dict[str, DetectiveNotebook] = {}

@tool("Mark Player Has Card")
def mark_player_has_card(player_name: str, card: str, card_holder: str) -> str:
    """Record that card_holder definitely has this card."""
    notebook = _notebooks[player_name]
    notebook.mark_has(card, card_holder)
    notebook.auto_deduce()  # Run inference rules
    return f"Recorded: {card_holder} has {card}. Notebook updated."

This is the single most important architectural pattern: move all structured reasoning out of the LLM and into code.

Step 4: Create Your Agents

Define agents in YAML with clear roles, goals, and backstories. Use different tool sets for different roles (player vs. moderator).

# config/agents.yaml
player_scarlet:
  role: "Detective Miss Scarlet"
  goal: >
    Be the first to identify the murderer, weapon, and room.
    Use your detective notebook to track all information and make
    logical suggestions to narrow down the solution.
  backstory: >
    Miss Scarlet is cunning and methodical. She never wastes a suggestion
    and always consults her notebook before acting.

Step 5: Write Your Prompts with the Autonomy Loop

Every player task prompt should explicitly encode the Perceive → Reason → Plan → Act loop and remind the agent to use notebook tools:

task_description = f"""
You are {player_name}. Your goal is to solve the mystery.

AUTONOMY LOOP — follow this every turn:
1. PERCEIVE: Call get_my_cards, get_current_location, view_notebook_grid, get_event_log
2. REASON:   Analyze your notebook. What do you know? What is still unknown?
3. PLAN:     Decide your action. Can you accuse? Should you move or suggest?
4. ACT:      Execute using the appropriate tool(s).

WARNING: ONLY use make_accusation when get_possible_solution shows all 3 confirmed.
"""

Step 6: Add Evals from Day One

Don’t wait until the game is “working” to add quality tracking. Build your evaluation system at the same time as your tools. For every decision your agent makes, ask: Is this decision good? and How can I measure that?

For Clue:

  • Is this suggestion logical? (uses unknown cards)
  • Did the agent follow the Perceive → Reason → Plan → Act loop?
  • Did the agent make any invalid move attempts?

Encode these checks in your moderator agent’s tools and run them on every turn.

Step 7: Write Tests for Each Layer

Follow this test hierarchy:

Layer             What to Test
State layer       Rules are enforced correctly (setup, movement, elimination)
Notebook layer    Deduction inference is correct (auto-deduce, solution detection)
Tool layer        Tool outputs match LLM expectations, errors are handled gracefully
Eval layer        Quality metrics are tracked correctly, grading thresholds work
Agent loop layer  Task prompts contain all required loop steps

Run tests on every code change with pytest. This is the only way to know that your non-LLM code is correct — and you need the non-LLM code to be perfect, because the LLM is already the uncertain part.


Key Lessons Learned

1. The LLM is the reasoning engine, not the state machine

Never ask an LLM to track structured state in its context window. Use Python for state; use LLMs for decision-making. The boundary is: if it requires perfect recall across many turns, it belongs in code.

2. Tools are your contract with the LLM

Every tool return value is a message to the LLM. Make them explicit, actionable, and complete. “ERROR: Cannot move to Kitchen. Available rooms: Library, Ballroom” is far better than just “invalid move.”

3. Evals need to be built in, not bolted on

The validation/quality-tracking system was designed alongside the game logic, not added afterward. This means every turn generates quantitative data about agent quality. Without this, you’re flying blind.

4. Autonomy requires explicit structure

Left to its own devices, an LLM will often skip steps (e.g., act without perceiving). The explicit Perceive → Reason → Plan → Act loop in the task prompt significantly improved agent consistency.

5. Tests save you from yourself

The notebook auto-deduction logic and tool validation logic are subtle. Without tests, you would introduce regressions with every edit. The test suite caught several bugs during development — wrong clockwise ordering, edge cases in auto-deduction, and incorrect card counting.

6. Token efficiency matters at scale

A single game can involve hundreds of tool calls. If every tool output is verbose, you burn through context and API budget quickly. Invest in compact output formats (like TOON) early.


Summary

Building an LLM-powered Clue game is a microcosm of real-world agentic AI engineering:

  • Architecture first: Separate LLM reasoning from deterministic state and rules.
  • Tools as contracts: Every tool output is a structured message your agent depends on.
  • Deterministic memory: Never trust the LLM to remember; provide a notebook.
  • Evals from day one: Quality tracking gives you data-driven insight into agent behavior.
  • Tests at every layer: State, notebook, tools, evals, and agent loops all need tests.
  • Observability: MLflow tracing lets you see inside the black box.

The full source code is available at github.com/nmadhire-agents/clue-board-game-with-llm. Clone it, run a game, and then try adapting the architecture for your own project.


Want to see more projects like this? Follow me on GitHub or connect on LinkedIn.