Debugging Guide

This guide covers systematic debugging of OSTwin components, from log file analysis to common failure patterns and their resolutions.

Log Files

OSTwin generates logs at multiple levels:

| Log | Location | Content |
| --- | --- | --- |
| Manager log | `.agents/logs/manager.log` | Orchestration events, state transitions |
| Agent logs | `.agents/logs/room-NNN.log` | Per-room agent output |
| MCP log | `.agents/mcp/mcp.log` | MCP server requests and responses |
| Dashboard log | `dashboard/logs/` | API and frontend logs |
| Audit log | `.agents/war-rooms/room-NNN/audit.log` | Room-level audit trail |

Viewing Logs

.agents/logs.sh tail
.agents/logs.sh tail --room room-003

.agents/logs.sh search "error"
.agents/logs.sh search "EPIC-001"

tail -f .agents/logs/manager.log
tail -100 .agents/logs/room-003.log

War-Room Inspection

When a room is stuck or failing, inspect its state files directly:

Check current status

cat .agents/war-rooms/room-001/status
cat .agents/war-rooms/room-001/retries
cat .agents/war-rooms/room-001/state_changed_at

Read the latest channel messages

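Using the Python CLI:

python .agents/bin/channel_cmd.py read --room .agents/war-rooms/room-001 --last 5

Or reading the file directly (each line of `channel.jsonl` is a separate JSON document, so `json.tool` needs `--json-lines`, available in Python 3.8+):

tail -5 .agents/war-rooms/room-001/channel.jsonl | python -m json.tool --json-lines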

Check the lifecycle definition
```
cat .agents/war-rooms/room-001/lifecycle.json | python -m json.tool
```
Verify that the current status maps to a valid state with appropriate signals.

Inspect the room config
```
cat .agents/war-rooms/room-001/config.json | python -m json.tool
```
Check `depends_on`, `assigned_role`, and `constraints`.

Review the audit log
```
cat .agents/war-rooms/room-001/audit.log
```
The audit log records every state transition with timestamps.

MCP Audit Logs

The MCP server logs all tool calls and responses:

cat .agents/mcp/mcp.log | python -m json.tool

Key things to look for:

Tool call failures (non-zero exit codes)
Timeout errors
Permission denied errors
Malformed tool parameters
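The checklist above can be partly automated. The following is a minimal sketch that treats the log as plain text and flags lines matching a few failure patterns. The patterns, including the `exit_code` field name, are assumptions about how entries are worded rather than the documented MCP log format, so adjust them to match real entries. Malformed tool parameters are not covered because their shape depends on the tool.

```python
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical patterns derived from the checklist; the exact wording
# in real log entries is an assumption and may need adjusting.
FAILURE_PATTERNS = {
    "timeout": re.compile(r"time(d)?[ -]?out", re.IGNORECASE),
    "permission": re.compile(r"permission denied", re.IGNORECASE),
    "exit_code": re.compile(r'exit[_ ]?code"?\s*[:=]\s*[1-9]', re.IGNORECASE),
}


def scan_log(lines: Iterable[str]) -> Iterator[Tuple[int, str, str]]:
    """Yield (line number, label, text) for each line matching a pattern."""
    for lineno, line in enumerate(lines, 1):
        for label, pattern in FAILURE_PATTERNS.items():
            if pattern.search(line):
                yield lineno, label, line.rstrip("\n")
                break  # report each line once, under the first matching label


# Usage against the real log file:
#     with open(".agents/mcp/mcp.log") as f:
#         for lineno, label, text in scan_log(f):
#             print(f"{lineno} [{label}] {text}")
```

Because the scan is text-based, it works whether the log is one JSON object per line or free-form text, at the cost of occasional false positives.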

MCP Configuration

The MCP config lives at `.agents/mcp/config.json`:

cat .agents/mcp/config.json | python -m json.tool

Verify the transport mode (stdio vs SSE) and server endpoints.

Memory Debugging

When cross-room context is missing or incorrect:

Check if memory is enabled

python -c "import json; print(json.load(open('.agents/config.json'))['memory'])"

Verify the memory daemon is running
```
.agents/health.sh
```
Look for the memory daemon status in the output.

List all memory entries

Use the memory CLI:

python .agents/memory/memory-cli.py list
python .agents/memory/memory-cli.py search "grid rendering"

Check for superseded entries

Superseded entries are excluded from queries. If an expected memory is missing, it may have been superseded:
```
python .agents/memory/memory-cli.py list --include-superseded
```

Restart the memory daemon
```
.agents/memory/start-memory-daemon.sh
```

Dashboard Issues

# Check if the API process is running
cat .agents/dashboard.pid
ps aux | grep uvicorn

# Restart the dashboard
.agents/dashboard.sh stop
.agents/dashboard.sh start

# Check API logs
tail -50 dashboard/logs/api.log

# Check Node.js version
node --version  # Should be 20+

# Rebuild
cd dashboard/fe
pnpm install
pnpm build

# Check for TypeScript errors
pnpm tsc --noEmit

# Check if the API supports WebSocket
curl -v http://localhost:8000/ws

# Verify CORS settings in API config
# Check browser console for connection errors

Common Failure Patterns

Room Stuck in “developing”

Symptoms: Room status shows developing but no agent is running.

Diagnosis:

cat .agents/war-rooms/room-NNN/status
cat .agents/war-rooms/room-NNN/state_changed_at

If the time elapsed since `state_changed_at` exceeds `state_timeout_seconds`, the room has timed out.

Resolution: The manager should detect this and transition the room to `failed`. If the manager itself is stuck, restart it.
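The timeout comparison can be checked by hand with a short script. This is a sketch under stated assumptions: the format of the `state_changed_at` file is not specified here, so the helper accepts either epoch seconds or an ISO-8601 string, and the timeout value is passed in explicitly because where `state_timeout_seconds` is configured is not covered in this guide. The function names are illustrative, not part of OSTwin.

```python
import time
from datetime import datetime, timezone
from typing import Optional


def parse_timestamp(raw: str) -> float:
    """Return epoch seconds from a numeric or ISO-8601 string (format assumed)."""
    raw = raw.strip()
    try:
        return float(raw)
    except ValueError:
        pass
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11.
    dt = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive timestamps are UTC
    return dt.timestamp()


def has_timed_out(state_changed_at: str,
                  state_timeout_seconds: float,
                  now: Optional[float] = None) -> bool:
    """True if more than state_timeout_seconds have elapsed since the change."""
    now = time.time() if now is None else now
    return now - parse_timestamp(state_changed_at) > state_timeout_seconds


# Usage against a real room (replace NNN and the timeout value):
#     raw = open(".agents/war-rooms/room-NNN/state_changed_at").read()
#     print(has_timed_out(raw, state_timeout_seconds=900))
```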

Retry Loop

Symptoms: Room cycles between optimize and review without passing.

Diagnosis:

cat .agents/war-rooms/room-NNN/retries
grep -E '"type": ?"fail"' .agents/war-rooms/room-NNN/channel.jsonl

Resolution: Read the QA fail messages to understand why the review keeps failing. Consider escalating via triage or increasing `max_retries`.
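A text search depends on how the JSON was serialized (for example, whether a space follows the colon). As an alternative, the sketch below parses each line and filters on the `type` field, which is the only field name taken from this guide; the `body` field in the usage comment is hypothetical. Lines that fail to parse are skipped.

```python
import json
from typing import Iterable, List


def messages_of_type(lines: Iterable[str], msg_type: str) -> List[dict]:
    """Return parsed channel messages whose "type" equals msg_type."""
    matches = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            msg = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip corrupted lines instead of aborting
        if isinstance(msg, dict) and msg.get("type") == msg_type:
            matches.append(msg)
    return matches


# Usage against a real room (replace NNN):
#     with open(".agents/war-rooms/room-NNN/channel.jsonl") as f:
#         for msg in messages_of_type(f, "fail"):
#             print(msg.get("body"))  # "body" is a hypothetical field name
```

The same helper works for the agent timeout case by passing `"error"` as the type.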

DAG Cycle Detection

Symptoms: `ostwin run` fails with “Circular dependency detected.”

Diagnosis: Check the `depends_on` fields in the plan for circular references.

Resolution: Remove the cycle. Use `ostwin run PLAN.md --dry-run` to validate the plan without executing it.
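For large plans, tracing `depends_on` chains by eye is error-prone. The sketch below shows one standard way to locate a cycle, a depth-first search with three-state marking. It is not necessarily how `ostwin` implements its check, and it assumes the dependencies have already been extracted into a dictionary mapping each epic to the epics it depends on; parsing the plan file is out of scope.

```python
from typing import Dict, List, Optional


def find_cycle(depends_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in depends_on}
    stack: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        stack.append(node)
        for dep in depends_on.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path: a cycle closes here.
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(depends_on):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical plan in which three epics depend on each other in a loop.
plan = {
    "EPIC-001": ["EPIC-003"],
    "EPIC-002": ["EPIC-001"],
    "EPIC-003": ["EPIC-002"],
}
print(find_cycle(plan))
```

The returned list starts and ends with the same node, so it names every edge that must be inspected to break the cycle.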

Agent Timeout

Symptoms: Room transitions to failed with “timeout exceeded” in the error message.

Diagnosis:

grep -E '"type": ?"error"' .agents/war-rooms/room-NNN/channel.jsonl

Resolution: Increase `timeout_seconds` in the role config or reduce the scope of the epic.

Channel File Corruption

Symptoms: JSON parse errors when reading `channel.jsonl`.

Diagnosis:

python -c "
import json
with open('.agents/war-rooms/room-NNN/channel.jsonl') as f:
    for i, line in enumerate(f, 1):
        try:
            json.loads(line)
        except json.JSONDecodeError as e:
            print(f'Line {i}: {e}')
"

Resolution: The corrupted line was likely caused by an interrupted write. Remove or fix the malformed line. Future writes will use file locking to prevent recurrence.
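Removing malformed lines can be scripted. The following sketch separates valid JSON lines from malformed ones so the rejected lines can be reviewed before anything is discarded. The function name and the `.cleaned` output path in the usage comment are illustrative, and it is assumed that nothing is writing to the channel file while the cleanup runs.

```python
import json
from typing import Iterable, List, Tuple


def split_valid_lines(
    lines: Iterable[str],
) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Split lines into valid JSON lines and (line number, text) rejects."""
    valid: List[str] = []
    rejected: List[Tuple[int, str]] = []
    for lineno, line in enumerate(lines, 1):
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no message
        try:
            json.loads(stripped)
        except json.JSONDecodeError:
            rejected.append((lineno, stripped))
        else:
            valid.append(stripped)
    return valid, rejected


# Usage against a real room (replace NNN); writes a cleaned copy and
# leaves the original file untouched:
#     path = ".agents/war-rooms/room-NNN/channel.jsonl"
#     with open(path) as f:
#         valid, rejected = split_valid_lines(f)
#     with open(path + ".cleaned", "w") as out:
#         out.write("\n".join(valid) + "\n")
#     for lineno, text in rejected:
#         print(f"dropped line {lineno}: {text}")
```

Writing to a separate file keeps the original available in case a rejected line can be repaired by hand.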

Memory Daemon Crash

Symptoms: Agents report “memory operation failed” but continue working.

Diagnosis:

.agents/health.sh  # Check memory daemon status
cat .agents/memory/mcp.log  # Check for crash logs

Resolution: Restart the daemon with `.agents/memory/start-memory-daemon.sh`. Memory operations fail silently by design, so no data is lost; only cross-room context is temporarily unavailable.