
Debugging Guide

This guide covers systematic debugging of OSTwin components, from log file analysis to common failure patterns and their resolutions.

Log Files

OSTwin generates logs at multiple levels:

Log            Location                               Content
Manager log    .agents/logs/manager.log               Orchestration events, state transitions
Agent logs     .agents/logs/room-NNN.log              Per-room agent output
MCP log        .agents/mcp/mcp.log                    MCP server requests and responses
Dashboard log  dashboard/logs/                        API and frontend logs
Audit log      .agents/war-rooms/room-NNN/audit.log   Room-level audit trail

Viewing Logs

.agents/logs.sh tail
.agents/logs.sh tail --room room-003

War-Room Inspection

When a room is stuck or failing, inspect its state files directly:

  1. Check current status

    cat .agents/war-rooms/room-001/status
    cat .agents/war-rooms/room-001/retries
    cat .agents/war-rooms/room-001/state_changed_at
  2. Read the latest channel messages

    python .agents/bin/channel_cmd.py read --room .agents/war-rooms/room-001 --last 5
  3. Check the lifecycle definition

    cat .agents/war-rooms/room-001/lifecycle.json | python -m json.tool

    Verify the current status maps to a valid state with appropriate signals.

  4. Inspect the room config

    cat .agents/war-rooms/room-001/config.json | python -m json.tool

    Check depends_on, assigned_role, and constraints.

  5. Review the audit log

    cat .agents/war-rooms/room-001/audit.log

    The audit log records every state transition with timestamps.
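
The first inspection step can be bundled into a small helper when several rooms need checking. The following is a minimal sketch, not part of OSTwin: the function name `room_snapshot` is hypothetical, and it assumes only that the three state files named above are plain text files inside the room directory.

```python
from pathlib import Path

# File names taken from the inspection steps above.
STATE_FILES = ("status", "retries", "state_changed_at")


def room_snapshot(room_dir):
    """Return {file name: stripped contents, or None if the file is missing}.

    Hypothetical helper for illustration; OSTwin does not ship it.
    """
    room = Path(room_dir)
    snapshot = {}
    for name in STATE_FILES:
        path = room / name
        snapshot[name] = path.read_text().strip() if path.is_file() else None
    return snapshot


if __name__ == "__main__":
    import sys

    # Falls back to the room used in the examples above.
    room_dir = sys.argv[1] if len(sys.argv) > 1 else ".agents/war-rooms/room-001"
    for name, value in room_snapshot(room_dir).items():
        print(f"{name:>16}: {value if value is not None else '(missing)'}")
```

A missing file is reported rather than raised, since an absent state file is itself a useful clue when a room is stuck.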

MCP Audit Logs

The MCP server logs all tool calls and responses:

cat .agents/mcp/mcp.log | python -m json.tool

Key things to look for:

  • Tool call failures (non-zero exit codes)
  • Timeout errors
  • Permission denied errors
  • Malformed tool parameters

MCP Configuration

The MCP config lives at .agents/mcp/config.json:

cat .agents/mcp/config.json | python -m json.tool

Verify the transport mode (stdio vs SSE) and server endpoints.

Memory Debugging

When cross-room context is missing or incorrect:

  1. Check if memory is enabled

    python -c "import json; print(json.load(open('.agents/config.json'))['memory'])"
  2. Verify the memory daemon is running

    .agents/health.sh

    Look for the memory daemon status in the output.

  3. List all memory entries

    Use the memory CLI:

    python .agents/memory/memory-cli.py list
    python .agents/memory/memory-cli.py search "grid rendering"
  4. Check for superseded entries

    Superseded entries are excluded from queries. If an expected memory is missing, it may have been superseded:

    python .agents/memory/memory-cli.py list --include-superseded
  5. Restart the memory daemon

    .agents/memory/start-memory-daemon.sh

Dashboard Issues

# Check if the API process is running
cat .agents/dashboard.pid
ps aux | grep uvicorn
# Restart the dashboard
.agents/dashboard.sh stop
.agents/dashboard.sh start
# Check API logs
tail -50 dashboard/logs/api.log
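
A PID file can outlive its process, so it may help to confirm that the recorded PID is still alive. The sketch below is illustrative only: it assumes `.agents/dashboard.pid` contains a single integer PID, the helper names are hypothetical, and the `os.kill(pid, 0)` existence check is POSIX-specific.

```python
import os
from pathlib import Path


def pid_is_running(pid):
    """True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def dashboard_running(pid_file=".agents/dashboard.pid"):
    """ASSUMPTION: the PID file holds one integer PID and nothing else."""
    path = Path(pid_file)
    if not path.is_file():
        return False
    try:
        pid = int(path.read_text().strip())
    except ValueError:
        return False  # unreadable PID file; treat as not running
    return pid_is_running(pid)


if __name__ == "__main__":
    print("dashboard running:", dashboard_running())
```

If this reports `False` while the PID file exists, the file is stale and a stop followed by a start, as shown above, is the likely fix.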

Common Failure Patterns

Room Stuck in “developing”

Symptoms: Room status shows developing but no agent is running.

Diagnosis:

cat .agents/war-rooms/room-NNN/status
cat .agents/war-rooms/room-NNN/state_changed_at

If the time elapsed since state_changed_at exceeds state_timeout_seconds, the room has timed out.

Resolution: The manager should detect this and transition to failed. If the manager itself is stuck, restart it.
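
The timeout comparison can be done by script instead of by eye. This is a rough sketch under stated assumptions: it assumes `state_changed_at` stores a Unix timestamp in seconds (the actual format may differ), the function names are hypothetical, and `state_timeout_seconds` is passed in by the caller because this guide does not say where that value is stored.

```python
import time
from pathlib import Path


def seconds_in_state(room_dir, now=None):
    """Seconds elapsed since the room last changed state.

    ASSUMPTION: state_changed_at holds a Unix timestamp in seconds.
    float() raises ValueError if the file uses another format.
    """
    raw = (Path(room_dir) / "state_changed_at").read_text().strip()
    changed_at = float(raw)
    current = time.time() if now is None else now
    return current - changed_at


def has_timed_out(room_dir, state_timeout_seconds, now=None):
    """True if the room has been in its current state longer than the limit."""
    return seconds_in_state(room_dir, now) > state_timeout_seconds
```

For example, `has_timed_out(".agents/war-rooms/room-001", 900)` would report whether the room has been in its current state for more than 15 minutes; the 900 here is an arbitrary illustrative value, not an OSTwin default.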

Retry Loop

Symptoms: Room cycles between optimize and review without passing.

Diagnosis:

cat .agents/war-rooms/room-NNN/retries
cat .agents/war-rooms/room-NNN/channel.jsonl | grep '"type":"fail"'

Resolution: Read the QA fail messages to understand why the review keeps failing. Consider escalating via triage or increasing max_retries.
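
The grep above matches only when the file contains `"type":"fail"` with no space after the colon. If the channel writer emits `"type": "fail"` instead, a filter that parses each line is more reliable. The following is a minimal sketch: the function name is hypothetical, and it assumes only that each line of channel.jsonl is a JSON object with a `type` field, as the grep pattern implies.

```python
import json


def messages_of_type(channel_path, msg_type):
    """Yield (line_number, message) for channel entries whose "type" matches.

    Malformed and blank lines are skipped rather than raised.
    """
    with open(channel_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                msg = json.loads(line)
            except json.JSONDecodeError:
                continue  # corrupted line; ignore it here
            if isinstance(msg, dict) and msg.get("type") == msg_type:
                yield lineno, msg


if __name__ == "__main__":
    import sys

    # Usage: python filter_channel.py <path to channel.jsonl> [type]
    if len(sys.argv) >= 2:
        wanted = sys.argv[2] if len(sys.argv) > 2 else "fail"
        for lineno, msg in messages_of_type(sys.argv[1], wanted):
            print(lineno, json.dumps(msg))
```

Passing `error` instead of `fail` gives the equivalent of the error grep used for agent timeouts.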

DAG Cycle Detection

Symptoms: ostwin run fails with “Circular dependency detected.”

Diagnosis: Check depends_on fields in the plan for circular references.

Resolution: Remove the cycle. Use ostwin run PLAN.md --dry-run to validate without executing.
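
When the plan is large, finding the cycle by hand is tedious. The sketch below shows one standard way to locate a cycle, a depth-first search with three node states. It is not OSTwin's own detection code, and it assumes the `depends_on` relationships have already been extracted into a dictionary; how they are parsed from the plan file is not covered here.

```python
def find_cycle(depends_on):
    """Return one dependency cycle as a list of nodes, or None if acyclic.

    depends_on maps each item to the list of items it depends on.
    Illustrative DFS; recursion depth grows with the longest chain.
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    stack = []  # current DFS path

    def visit(node):
        state[node] = IN_PROGRESS
        stack.append(node)
        for dep in depends_on.get(node, ()):
            dep_state = state.get(dep, UNSEEN)
            if dep_state == IN_PROGRESS:
                # dep is already on the current path, so the path loops back
                return stack[stack.index(dep):] + [dep]
            if dep_state == UNSEEN:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        state[node] = DONE
        return None

    for node in list(depends_on):
        if state.get(node, UNSEEN) == UNSEEN:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```

The returned list starts and ends with the same node, so it reads as the chain of dependencies to break.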

Agent Timeout

Symptoms: Room transitions to failed with “timeout exceeded” in the error message.

Diagnosis:

cat .agents/war-rooms/room-NNN/channel.jsonl | grep '"type":"error"'

Resolution: Increase timeout_seconds in the role config or reduce the scope of the epic.

Channel File Corruption

Symptoms: JSON parse errors when reading channel.jsonl.

Diagnosis:

python -c "
import json
with open('.agents/war-rooms/room-NNN/channel.jsonl') as f:
    for i, line in enumerate(f, 1):
        try:
            json.loads(line)
        except json.JSONDecodeError as e:
            print(f'Line {i}: {e}')
"

Resolution: The corrupted line was likely caused by an interrupted write. Remove or fix the malformed line. Future writes will use file locking to prevent recurrence.
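
Removing the malformed lines can also be scripted. The following is a hedged sketch rather than an OSTwin tool: the function name and the `.bak` backup suffix are hypothetical, and it should only be run while nothing is writing to the room.

```python
import json
import shutil
from pathlib import Path


def drop_malformed_lines(channel_path):
    """Back up the channel file, then rewrite it with only parseable lines.

    Returns a list of (line_number, raw_text) for the lines removed.
    Blank lines are dropped without being reported.
    """
    path = Path(channel_path)
    backup = path.with_name(path.name + ".bak")  # hypothetical backup name
    shutil.copy2(path, backup)

    kept, removed = [], []
    with open(backup, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError:
                removed.append((lineno, line.rstrip("\n")))
            else:
                # Ensure every kept line is newline-terminated.
                kept.append(line if line.endswith("\n") else line + "\n")

    with open(path, "w", encoding="utf-8") as f:
        f.writelines(kept)
    return removed
```

The untouched original stays in the backup file, so a removed line can still be inspected and repaired by hand if its content matters.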

Memory Daemon Crash

Symptoms: Agents report “memory operation failed” but continue working.

Diagnosis:

.agents/health.sh # Check memory daemon status
cat .agents/memory/mcp.log # Check for crash logs

Resolution: Restart the daemon with .agents/memory/start-memory-daemon.sh. Memory operations are designed to fail without interrupting the agent, so no data is lost; only cross-room context is temporarily unavailable.