worker-bee¶

The guardian of your AI infrastructure

Worker-bee continuously monitors the health of all F3L1X realms, logs activity for audit trails, and alerts you when services degrade or fail. It is your operational safety net: it notices problems before you do.

What It Does¶

Worker-bee runs background checks on all 40+ realms in the F3L1X ecosystem. It:

- Pings each realm: verifies that every service is responsive
- Checks dependencies: ensures Herald, Redis, and PostgreSQL are running
- Monitors resources: tracks CPU, memory, and disk usage
- Logs all activity: creates an audit trail for compliance and debugging
- Alerts on failures: notifies you when services degrade

Think of worker-bee as your on-call operations engineer, working 24/7 without needing coffee breaks.

Key Capabilities¶

Realm Monitoring¶

- Health Checks: Pings all realms every 30 seconds
- Dependency Verification: Ensures Herald, Redis, and PostgreSQL are accessible
- Port Availability: Detects port conflicts and blocked ports
- Service Responsiveness: Tracks response times (P50, P95, P99 latency)
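
To illustrate how the P50, P95, and P99 figures above could be derived from raw health-check timings, here is a minimal sketch using only the Python standard library. The function name `latency_percentiles` and the millisecond unit are assumptions made for this example; worker-bee's actual calculation is not documented on this page.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return P50, P95, and P99 for a list of response times (milliseconds).

    Hypothetical helper: the name and return shape are illustrative only.
    """
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples to compute percentiles")
    # quantiles(n=100) returns the 99 cut points P1..P99 (index 0 is P1).
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

With checks every 30 seconds, one realm yields about 120 samples per hour, which is enough for a stable P95 but leaves P99 sensitive to a single slow response.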

Auto-Recovery¶

- Service Restart: Auto-restart crashed realms (configurable per realm)
- Dependency Recovery: Restart Redis if message queuing fails
- Graceful Degradation: Continue operating if secondary services fail
- Recovery Logging: Record all restart events for audit trail

Activity Logging¶

- Request Logs: Every HTTP request logged with timestamp and status
- Error Logs: Full stack traces for exceptions
- Audit Trail: Compliance logging for security-sensitive operations
- Compressed Storage: Logs rotated and compressed daily
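
Daily rotation with compression, as described above, can be approximated with Python's standard `logging.handlers.TimedRotatingFileHandler`. The sketch below is an assumption about how such a setup might look, not worker-bee's actual configuration; `gz_namer`, `gzip_rotator`, and `build_daily_handler` are hypothetical names.

```python
import gzip
import logging.handlers
import os
import shutil


def gz_namer(default_name):
    """Give each rotated file a .gz suffix."""
    return default_name + ".gz"


def gzip_rotator(source, dest):
    """Compress the rotated log into dest and remove the uncompressed file."""
    with open(source, "rb") as src, gzip.open(dest, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(source)


def build_daily_handler(path):
    """Create a handler that rotates at midnight and gzips each rotated file.

    Retention is deliberately not handled here (backupCount is left at its
    default of 0, which keeps every rotated file).
    """
    handler = logging.handlers.TimedRotatingFileHandler(
        path, when="midnight", encoding="utf-8"
    )
    handler.namer = gz_namer
    handler.rotator = gzip_rotator
    return handler
```

Attach the handler with `logging.getLogger("worker-bee").addHandler(build_daily_handler("worker-bee.log"))`; the logger name and file name are placeholders.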

Alerting System¶

- Email Alerts: Critical failures sent to configured email
- Slack Integration: Channel notifications for service issues
- Severity Levels: Critical/warning/info classification
- Silence Rules: Prevent alert fatigue on expected downtime
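
The interaction between severity levels and silence rules can be sketched as follows. This is an illustrative model only: the `SilenceRule` class, the `channels_for` function, and the routing (critical to email and Slack, warning to Slack, info to no channel) are assumptions, since this page does not specify how worker-bee routes each severity.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Severity(Enum):
    INFO = "info"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass(frozen=True)
class SilenceRule:
    """Hypothetical rule: suppress alerts for one realm during a time window."""
    realm: str
    start: datetime
    end: datetime

    def matches(self, realm, when):
        return realm == self.realm and self.start <= when < self.end


def channels_for(realm, severity, when, rules):
    """Return the notification channels for an alert, or [] if it is silenced."""
    if any(rule.matches(realm, when) for rule in rules):
        return []  # expected downtime: no notification
    if severity is Severity.CRITICAL:
        return ["email", "slack"]  # assumed routing
    if severity is Severity.WARNING:
        return ["slack"]  # assumed routing
    return []  # assumption: info-level events are not pushed to any channel
```

A silence rule covering a planned maintenance window returns an empty channel list for that realm, while alerts for other realms, or for the same realm outside the window, are routed normally.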

Accessing worker-bee¶

Web Dashboard¶

URL: http://127.0.0.1:8082

The worker-bee dashboard shows:
- Realm status grid (green/yellow/red)
- Uptime percentages for each service
- Resource usage graphs
- Recent alerts and recovery actions

Command Line Interface¶

# Run ecosystem verification
python manage.py verify-ecosystem

# Check specific realm
python manage.py check-realm <realm-name>

# View recent alerts
python manage.py alerts --limit 50

# Generate health report
python manage.py health-report

Metrics & Alerts API¶

| Endpoint | Purpose | Method |
|---|---|---|
| `/api/health/` | Overall ecosystem status | GET |
| `/api/realms/` | Status of all realms | GET |
| `/api/realms/<name>/` | Status of single realm | GET |
| `/api/alerts/` | Recent alerts | GET |
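
The endpoints above can be queried from a script. The sketch below uses only the standard library and makes two assumptions that this page does not confirm: that the API is served from the same address as the dashboard (`http://127.0.0.1:8082`), and that responses are JSON. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the API shares the dashboard's address.
BASE_URL = "http://127.0.0.1:8082"


def endpoint_url(path, base_url=BASE_URL):
    """Join the base URL and an endpoint path without doubling slashes."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def realm_path(name):
    """Build the single-realm endpoint path, escaping the realm name."""
    return "/api/realms/" + urllib.parse.quote(name, safe="") + "/"


def get_json(path, timeout=5.0):
    """GET an endpoint and decode the body as JSON (assumed response format)."""
    with urllib.request.urlopen(endpoint_url(path), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, `get_json("/api/health/")` or `get_json(realm_path("herald"))`. The response fields are not documented here, so inspect the returned object before relying on specific keys.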

Common Use Cases¶

Use Case 1: Check Ecosystem Health Before Work¶

Goal: Verify all services are healthy before starting session

1. Open the worker-bee dashboard at http://127.0.0.1:8082
2. View the "Realm Status Grid"
3. Check for red (down) or yellow (degraded) services
4. If every realm is green, the ecosystem is healthy and you are ready to work
5. If you find problems, click the affected realm for details

Use Case 2: Investigate Service Outage¶

Goal: Understand what happened when a realm crashed

1. Open the worker-bee dashboard
2. Click the "Recent Alerts" section
3. Find the alert for the failed realm
4. Review the timestamp and error message
5. Click the realm name to see its recovery log
6. Check whether the auto-restart succeeded

Use Case 3: Monitor Long-Running Job¶

Goal: Ensure services stay healthy during intensive tasks

1. Start the long-running operation
2. Open the worker-bee dashboard
3. Monitor the resource graphs (CPU, memory, disk)
4. Watch for service degradation
5. Worker-bee alerts you if problems occur

Use Case 4: Configure Auto-Recovery¶

Goal: Enable automatic service restart on failure

Configuration in worker-bee settings:

Essential Realms (auto-restart enabled):
- Herald (critical dependency)
- PostgreSQL (critical dependency)
- Redis (critical dependency)

Non-Critical Realms (manual restart):
- Dashboard features
- Analysis tools
- Experimental services

Important Notes¶

Monitoring Standards¶

Health Check Criteria:
- Service responds to HTTP request within 5 seconds
- PostgreSQL accepts connections
- Redis responds to PING command
- Disk space >5% available
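
The four criteria above could be probed with standard-library calls, as in the rough sketch below. It is not worker-bee's implementation. The function names are hypothetical, and several simplifications are assumed: a plain TCP connect stands in for "PostgreSQL accepts connections", Redis is assumed to need no authentication, and the ports shown are the usual defaults (6379 for Redis; PostgreSQL typically uses 5432), which may differ in your setup.

```python
import shutil
import socket
import urllib.request


def http_ok(url, timeout=5.0):
    """True if the service answers with a non-error status.

    The timeout applies to each blocking socket operation, so it only
    approximates the 5-second criterion.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:  # connection errors, timeouts, HTTP 4xx/5xx
        return False


def tcp_accepts(host, port, timeout=5.0):
    """True if a TCP connection can be opened (weak proxy for PostgreSQL)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def redis_ping_ok(host="127.0.0.1", port=6379, timeout=5.0):
    """Send an inline PING and expect +PONG (assumes no AUTH is required)."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"PING\r\n")
            return sock.recv(16).startswith(b"+PONG")
    except OSError:
        return False


def disk_ok(path="/", min_free_fraction=0.05):
    """True if more than 5% (by default) of the disk holding path is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total > min_free_fraction
```

A realm would count as healthy only when every applicable check returns `True`.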

Alert Thresholds:
- Critical: Service down 2+ minutes
- Warning: Response time >2 seconds
- Info: Resource usage >80%
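
The thresholds above translate directly into a small classification function. This sketch assumes that the highest matching severity wins and that "2+ minutes" means 120 seconds or more; the function name and parameters are hypothetical.

```python
def classify(down_seconds, response_seconds, resource_percent):
    """Map raw measurements to a severity level, or None if all are in range."""
    if down_seconds >= 120:  # Critical: service down 2+ minutes
        return "critical"
    if response_seconds > 2.0:  # Warning: response time >2 seconds
        return "warning"
    if resource_percent > 80.0:  # Info: resource usage >80%
        return "info"
    return None
```

For example, a realm that responds in 2.5 seconds while using 85% of memory is reported once, as a warning, rather than as both a warning and an info event.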

Auto-Recovery Behavior¶

When a realm crashes:
1. Worker-bee detects the failure.
2. It sends an alert, if alerting is configured.
3. If auto-restart is enabled, it waits 10 seconds and then restarts the realm.
4. If auto-restart is disabled, it alerts you to restart the realm manually.
5. It logs the recovery attempt and its result.
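
The sequence above can be expressed as a short function. This is a sketch of the documented flow rather than worker-bee's actual code: `handle_crash` and its parameters are hypothetical, and the restart, alert, and log actions are passed in as callables so the flow can be exercised without touching real services.

```python
import time


def handle_crash(realm, auto_restart, restart, send_alert, log,
                 delay_seconds=10, sleep=time.sleep):
    """Run the recovery steps after a failure has been detected (step 1)."""
    # Step 2: send an alert (pass a no-op callable if alerting is off).
    send_alert(f"{realm} is down")
    if auto_restart:
        # Step 3: wait, then restart the realm.
        sleep(delay_seconds)
        result = "restarted" if restart(realm) else "restart failed"
    else:
        # Step 4: ask the operator to restart manually.
        send_alert(f"{realm} requires a manual restart")
        result = "manual restart required"
    # Step 5: record the attempt and its outcome.
    log(f"recovery attempt for {realm}: {result}")
    return result
```

Injecting `sleep` keeps the 10-second delay out of tests, for example `handle_crash("herald", True, restart_fn, alerts.append, logs.append, sleep=lambda s: None)`, where `restart_fn` returns `True` on success.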

Critical services will auto-restart:
- Herald (all realms depend on it)
- PostgreSQL (data storage)
- Redis (message queuing)

Optional services require manual restart:
- Most realm services
- Analysis/experimentation tools

Log Retention & Compliance¶

Log Storage:
- Recent 30 days: Full detail, searchable
- 31-90 days: Compressed, archive access
- 91+ days: Long-term archive (rarely accessed)

Log Deletion:
- Automatically deleted after 1 year
- Manual deletion requires confirmation
- Compliance mode keeps all logs indefinitely
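
The storage tiers and deletion rules above amount to a lookup on the age of a log. The sketch below assumes that "1 year" means 365 days and that compliance mode only affects the deletion step; the function name and tier labels are hypothetical.

```python
from datetime import date


def retention_tier(log_date, today, compliance_mode=False):
    """Return the storage tier for a log written on log_date."""
    age_days = (today - log_date).days
    if age_days <= 30:
        return "full"  # full detail, searchable
    if age_days <= 90:
        return "compressed"  # compressed, archive access
    if compliance_mode or age_days <= 365:
        return "archive"  # long-term archive
    return "delete"  # older than 1 year, not in compliance mode
```

For a log from 2024-01-01 evaluated on 2026-02-17, the function returns `"delete"` in normal mode and `"archive"` in compliance mode.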

Search logs:

python manage.py search-logs --realm <name> --date 2026-02-17

Troubleshooting¶

Dashboard shows red for realm but service is running¶

Symptom: Realm appears down but you know it's running
Fix: The realm's port may be blocked. Check the firewall, then restart the realm.

Constant alerts for non-critical service¶

Symptom: Getting too many alerts for optional realm
Fix: Adjust alert severity rules or disable alerts for that realm

Auto-restart not working¶

Symptom: Realm crashes but doesn't automatically restart
Fix: Check if auto-restart is enabled for that realm in worker-bee settings

Logs growing too large¶

Symptom: Disk usage increasing rapidly
Fix: Run `python manage.py rotate-logs` to compress old logs

Resource graphs show memory leak¶

Symptom: Memory usage increasing over time
Fix: Check the realm's logs for errors; the affected service may need to be restarted.

- herald: Worker-bee monitors herald for overall ecosystem health
- doc-u-me: Worker-bee logs activity accessible via doc-u-me search
- f3l1x-dashboard: Shows worker-bee health data in realm status
- All other realms: All services monitored by worker-bee