Debugging Guide
This guide covers systematic debugging of OSTwin components, from log file analysis to common failure patterns and their resolutions.
Log Files
OSTwin generates logs at multiple levels:
| Log | Location | Content |
|---|---|---|
| Manager log | .agents/logs/manager.log | Orchestration events, state transitions |
| Agent logs | .agents/logs/room-NNN.log | Per-room agent output |
| MCP log | .agents/mcp/mcp.log | MCP server requests and responses |
| Dashboard log | dashboard/logs/ | API and frontend logs |
| Audit log | .agents/war-rooms/room-NNN/audit.log | Room-level audit trail |
Viewing Logs

    .agents/logs.sh tail
    .agents/logs.sh tail --room room-003
    .agents/logs.sh search "error"
    .agents/logs.sh search "EPIC-001"
    tail -f .agents/logs/manager.log
    tail -100 .agents/logs/room-003.log

War-Room Inspection
When a room is stuck or failing, inspect its state files directly:
1. Check current status

       cat .agents/war-rooms/room-001/status
       cat .agents/war-rooms/room-001/retries
       cat .agents/war-rooms/room-001/state_changed_at

2. Read the latest channel messages

       python .agents/bin/channel_cmd.py read --room .agents/war-rooms/room-001 --last 5

   Or read the raw file directly. Because channel.jsonl holds one JSON object per line, json.tool needs --json-lines (Python 3.8+):

       tail -5 .agents/war-rooms/room-001/channel.jsonl | python -m json.tool --json-lines

3. Check the lifecycle definition

       cat .agents/war-rooms/room-001/lifecycle.json | python -m json.tool

   Verify that the current status maps to a valid state with appropriate signals.

4. Inspect the room config

       cat .agents/war-rooms/room-001/config.json | python -m json.tool

   Check depends_on, assigned_role, and constraints.

5. Review the audit log

       cat .agents/war-rooms/room-001/audit.log

   The audit log records every state transition with timestamps.
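When the same room has to be inspected repeatedly, the first two steps can be gathered into one call. The following is a minimal sketch, not part of OSTwin: `room_snapshot` and `read_text` are hypothetical helper names, and the code assumes only that the state files are plain text and that channel.jsonl holds one JSON object per line.

```python
import json
from pathlib import Path
from typing import Optional


def read_text(path: Path) -> Optional[str]:
    """Return the stripped file contents, or None if the file is missing."""
    try:
        return path.read_text(encoding="utf-8").strip()
    except FileNotFoundError:
        return None


def room_snapshot(room_dir: str, last: int = 5) -> dict:
    """Collect the room's state files and latest channel messages into one dict."""
    room = Path(room_dir)
    snapshot = {
        name: read_text(room / name)
        for name in ("status", "retries", "state_changed_at")
    }

    # Keep lines that fail to parse so corruption is visible rather than hidden.
    raw = read_text(room / "channel.jsonl") or ""
    lines = [line for line in raw.splitlines() if line.strip()]
    messages = []
    for line in lines[-last:]:
        try:
            messages.append(json.loads(line))
        except json.JSONDecodeError:
            messages.append({"unparseable": line})
    snapshot["last_messages"] = messages
    return snapshot
```

Calling `room_snapshot(".agents/war-rooms/room-001")` returns a dict that can be printed with `json.dumps(..., indent=2)`. A missing state file shows up as `None` instead of raising.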
MCP Audit Logs
The MCP server logs all tool calls and responses:

    cat .agents/mcp/mcp.log | python -m json.tool

Key things to look for:
- Tool call failures (non-zero exit codes)
- Timeout errors
- Permission denied errors
- Malformed tool parameters
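A quick tally can show which of these failure classes dominates a large log. The sketch below assumes the log can be read line by line and that failures contain recognisable phrases. The regular expressions in `PATTERNS` are illustrative guesses, not the MCP server's actual wording, so adjust them to what your log contains.

```python
import re
from collections import Counter
from typing import Iterable

# Hypothetical patterns: replace with the phrases your MCP log actually uses.
PATTERNS = {
    "timeout": re.compile(r"timed? ?out", re.IGNORECASE),
    "permission_denied": re.compile(r"permission denied", re.IGNORECASE),
    "malformed": re.compile(r"malformed|invalid (param|argument)", re.IGNORECASE),
}


def count_failures(lines: Iterable[str]) -> Counter:
    """Count how many lines match each failure pattern."""
    counts: Counter = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


# Example use:
#   with open(".agents/mcp/mcp.log", encoding="utf-8") as log:
#       print(count_failures(log))
```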
MCP Configuration
The MCP config lives at .agents/mcp/config.json:

    cat .agents/mcp/config.json | python -m json.tool

Verify the transport mode (stdio vs SSE) and server endpoints.
Memory Debugging
When cross-room context is missing or incorrect:
1. Check if memory is enabled

       python -c "import json; print(json.load(open('.agents/config.json'))['memory'])"

2. Verify the memory daemon is running

       .agents/health.sh

   Look for the memory daemon status in the output.

3. List all memory entries

   Use the memory CLI:

       python .agents/memory/memory-cli.py list
       python .agents/memory/memory-cli.py search "grid rendering"

4. Check for superseded entries

   Superseded entries are excluded from queries. If an expected memory is missing, it may have been superseded:

       python .agents/memory/memory-cli.py list --include-superseded

5. Restart the memory daemon

       .agents/memory/start-memory-daemon.sh

Dashboard Issues

    # Check if the API process is running
    cat .agents/dashboard.pid
    ps aux | grep uvicorn

    # Restart the dashboard
    .agents/dashboard.sh stop
    .agents/dashboard.sh start

    # Check API logs
    tail -50 dashboard/logs/api.log

    # Check Node.js version
    node --version  # Should be 20+

    # Rebuild
    cd dashboard/fe
    pnpm install
    pnpm build

    # Check for TypeScript errors
    pnpm tsc --noEmit

    # Check if the API supports WebSocket
    curl -v http://localhost:8000/ws

    # Verify CORS settings in API config
    # Check browser console for connection errors

Common Failure Patterns
Room Stuck in “developing”
Symptoms: Room status shows developing but no agent is running.
Diagnosis:

    cat .agents/war-rooms/room-NNN/status
    cat .agents/war-rooms/room-NNN/state_changed_at

If state_changed_at is older than state_timeout_seconds, the room has timed out.
Resolution: The manager should detect this and transition to failed. If the manager itself is stuck, restart it.
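The timeout comparison can be done mechanically. This is a minimal sketch under one loud assumption: that state_changed_at contains a Unix epoch timestamp in seconds. The guide does not state the file's format, so check it before relying on this. The function names are hypothetical, and `state_timeout_seconds` is passed in by the caller because the guide does not say where that setting is stored.

```python
import time
from pathlib import Path
from typing import Optional


def seconds_in_state(room_dir: str, now: Optional[float] = None) -> float:
    """Seconds elapsed since the room last changed state.

    ASSUMPTION: state_changed_at holds a Unix epoch timestamp in seconds.
    """
    changed_at = float(Path(room_dir, "state_changed_at").read_text().strip())
    current = time.time() if now is None else now
    return current - changed_at


def has_timed_out(room_dir: str, state_timeout_seconds: float,
                  now: Optional[float] = None) -> bool:
    """True if the room has been in its current state longer than the timeout."""
    return seconds_in_state(room_dir, now) > state_timeout_seconds
```

If the file turns out to hold an ISO 8601 string instead, `datetime.fromisoformat` would replace the `float(...)` conversion.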
Retry Loop
Symptoms: Room cycles between optimize and review without passing.
Diagnosis:

    cat .agents/war-rooms/room-NNN/retries
    cat .agents/war-rooms/room-NNN/channel.jsonl | grep '"type":"fail"'

Resolution: Read the QA fail messages to understand why. Consider escalating via triage or increasing max_retries.
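The grep pattern matches only when the message is serialized without a space after the colon. Parsing each line as JSON avoids that dependency. The sketch below assumes each message is a JSON object with a `type` field, as the grep pattern implies. `messages_of_type` is a hypothetical helper, not an OSTwin API.

```python
import json
from typing import Iterable, Iterator


def messages_of_type(lines: Iterable[str], wanted: str) -> Iterator[dict]:
    """Yield parsed channel messages whose "type" field equals `wanted`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            message = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed lines are skipped here
        if isinstance(message, dict) and message.get("type") == wanted:
            yield message


# Example use:
#   with open(".agents/war-rooms/room-NNN/channel.jsonl", encoding="utf-8") as f:
#       for message in messages_of_type(f, "fail"):
#           print(message)
```

The same helper works for other message types, for example `messages_of_type(f, "error")`.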
DAG Cycle Detection
Symptoms: ostwin run fails with “Circular dependency detected.”
Diagnosis: Check depends_on fields in the plan for circular references.
Resolution: Remove the cycle. Use ostwin run PLAN.md --dry-run to validate without executing.
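To locate the cycle by hand, it helps to see which items form the loop. The sketch below runs a depth-first search over a mapping from each item to its depends_on list. It is an illustration of the general technique, not OSTwin's own detector, and it assumes you have already extracted the depends_on values from the plan into a dict.

```python
from typing import Dict, List, Optional


def find_cycle(depends_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of ids, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {node: WHITE for node in depends_on}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        colour[node] = GREY
        path.append(node)
        for dep in depends_on.get(node, []):
            state = colour.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path, so we have looped back.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in depends_on:
        if colour[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None
```

For example, `find_cycle({"EPIC-001": ["EPIC-002"], "EPIC-002": ["EPIC-001"]})` returns the loop as a list that starts and ends with the same id. The search is recursive, which is fine for plans with up to a few hundred items.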
Agent Timeout
Symptoms: Room transitions to failed with “timeout exceeded” in the error message.
Diagnosis:

    cat .agents/war-rooms/room-NNN/channel.jsonl | grep '"type":"error"'

Resolution: Increase timeout_seconds in the role config or reduce the scope of the epic.
Channel File Corruption
Symptoms: JSON parse errors when reading channel.jsonl.
Diagnosis:

    python -c "
    import json
    with open('.agents/war-rooms/room-NNN/channel.jsonl') as f:
        for i, line in enumerate(f, 1):
            try:
                json.loads(line)
            except json.JSONDecodeError as e:
                print(f'Line {i}: {e}')
    "

Resolution: The corrupted line was likely caused by an interrupted write. Remove or fix the malformed line. Future writes will use file locking to prevent recurrence.
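Removing the malformed line can be done without touching the original file. The sketch below writes the valid lines to a sibling file and reports what it dropped. The `.cleaned` suffix and both function names are hypothetical choices for illustration. Review the cleaned copy before replacing channel.jsonl with it, and stop the manager first if it may still be writing to the room.

```python
import json
from pathlib import Path
from typing import List, Tuple


def split_valid_lines(text: str) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Separate parseable JSON lines from malformed ones (with line numbers)."""
    valid: List[str] = []
    rejected: List[Tuple[int, str]] = []
    for number, line in enumerate(text.splitlines(), 1):
        if not line.strip():
            continue  # blank lines carry no message
        try:
            json.loads(line)
            valid.append(line)
        except json.JSONDecodeError:
            rejected.append((number, line))
    return valid, rejected


def write_cleaned_copy(channel_path: str) -> Path:
    """Write the valid lines to '<name>.cleaned' next to the original file."""
    source = Path(channel_path)
    valid, rejected = split_valid_lines(source.read_text(encoding="utf-8"))
    cleaned = source.with_name(source.name + ".cleaned")
    cleaned.write_text("".join(line + "\n" for line in valid), encoding="utf-8")
    for number, line in rejected:
        print(f"dropped line {number}: {line[:80]}")
    return cleaned
```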
Memory Daemon Crash
Symptoms: Agents report “memory operation failed” but continue working.
Diagnosis:

    .agents/health.sh              # Check memory daemon status
    cat .agents/memory/mcp.log     # Check for crash logs

Resolution: Restart the daemon with start-memory-daemon.sh. Memory operations fail silently by design, so no data is lost; only cross-room context is temporarily unavailable.