worker-bee¶
The guardian of your AI infrastructure
Worker-bee continuously monitors the health of all F3L1X realms, logs activity for audit trails, and alerts you when services degrade or fail. It's your operational safety net: the thing that notices when things go wrong before you do.
What It Does¶
Worker-bee runs background checks on all 40+ realms in the F3L1X ecosystem. It:
- Pings each realm: Verifies every service is responsive
- Checks dependencies: Ensures Herald, Redis, and PostgreSQL are running
- Monitors resources: Tracks CPU, memory, and disk usage
- Logs all activity: Creates an audit trail for compliance and debugging
- Alerts on failures: Notifies you when services degrade
Think of worker-bee as your on-call operations engineer, working 24/7 without needing coffee breaks.
Key Capabilities¶
Realm Monitoring¶
- Health Checks: Ping all realms every 30 seconds
- Dependency Verification: Ensure Herald, Redis, and PostgreSQL are accessible
- Port Availability: Detect port conflicts and blocked ports
- Service Responsiveness: Track response times (P50, P95, P99 latency)
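To make the latency percentiles concrete, here is a minimal sketch of how P50, P95, and P99 could be derived from a window of response times using the nearest-rank method. The function name and the sample data are illustrative assumptions; this page does not document how worker-bee computes its percentiles.

```python
import math

def percentile(samples, pct):
    """Return the nearest-rank percentile (0 < pct <= 100) of the samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

# Hypothetical response times (milliseconds) from ten recent health checks.
response_times_ms = [12, 15, 11, 240, 14, 13, 18, 16, 12, 950]

latency = {f"P{p}": percentile(response_times_ms, p) for p in (50, 95, 99)}
print(latency)  # {'P50': 14, 'P95': 950, 'P99': 950}
```

With only ten samples, a single slow check dominates P95 and P99, which is why tail percentiles are usually read over a larger window than the median.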
Auto-Recovery¶
- Service Restart: Auto-restart crashed realms (configurable per realm)
- Dependency Recovery: Restart Redis if message queuing fails
- Graceful Degradation: Continue operating if secondary services fail
- Recovery Logging: Record all restart events for audit trail
Activity Logging¶
- Request Logs: Every HTTP request logged with timestamp and status
- Error Logs: Full stack traces for exceptions
- Audit Trail: Compliance logging for security-sensitive operations
- Compressed Storage: Logs rotated and compressed daily
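As an illustration of daily rotation with compression, the sketch below uses Python's standard `logging.handlers.TimedRotatingFileHandler` with a gzip rotator. This is one common way to get the behavior described above, not necessarily how worker-bee implements it; the file name in the comment is hypothetical.

```python
import gzip
import logging
import logging.handlers
import os
import shutil

def gzip_rotator(source, dest):
    """Compress the rotated log file to dest, then remove the uncompressed copy."""
    if not os.path.exists(source):
        return  # nothing was written since the last rollover
    with open(source, "rb") as src, gzip.open(dest, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(source)

def build_daily_handler(path):
    """Return a handler that rolls over at midnight and gzips each rotated file."""
    handler = logging.handlers.TimedRotatingFileHandler(
        path, when="midnight", delay=True  # delay=True: open the file on first write
    )
    handler.namer = lambda name: name + ".gz"  # e.g. worker-bee.log.2026-02-17.gz
    handler.rotator = gzip_rotator
    return handler
```

Attach the handler to a logger with `logger.addHandler(build_daily_handler("worker-bee.log"))`. Retention limits are left out on purpose, since they are a separate policy.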
Alerting System¶
- Email Alerts: Critical failures sent to configured email
- Slack Integration: Channel notifications for service issues
- Severity Levels: Critical/warning/info classification
- Silence Rules: Prevent alert fatigue on expected downtime
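A silence rule can be thought of as a realm name plus a time window during which alerts are suppressed. The sketch below shows that idea only; the `SilenceRule` class and `is_silenced` function are hypothetical names, and worker-bee's real rule format is not documented on this page.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class SilenceRule:
    realm: str       # realm the rule applies to
    start: datetime  # beginning of the expected downtime (inclusive)
    end: datetime    # end of the expected downtime (exclusive)

def is_silenced(realm, when, rules):
    """Return True if an alert for `realm` at time `when` falls inside a rule's window."""
    return any(r.realm == realm and r.start <= when < r.end for r in rules)

# Example: planned maintenance on doc-u-me between 01:00 and 03:00.
rules = [SilenceRule("doc-u-me", datetime(2026, 2, 17, 1, 0), datetime(2026, 2, 17, 3, 0))]

print(is_silenced("doc-u-me", datetime(2026, 2, 17, 2, 0), rules))  # True
print(is_silenced("herald", datetime(2026, 2, 17, 2, 0), rules))    # False
```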
Accessing worker-bee¶
Web Dashboard¶
URL: http://127.0.0.1:8082
The worker-bee dashboard shows:
- Realm status grid (green/yellow/red)
- Uptime percentages for each service
- Resource usage graphs
- Recent alerts and recovery actions
Command Line Interface¶
# Run ecosystem verification
python manage.py verify-ecosystem
# Check specific realm
python manage.py check-realm <realm-name>
# View recent alerts
python manage.py alerts --limit 50
# Generate health report
python manage.py health-report
Metrics & Alerts API¶
| Endpoint | Purpose | Method |
|---|---|---|
| /api/health/ | Overall ecosystem status | GET |
| /api/realms/ | Status of all realms | GET |
| /api/realms/<name>/ | Status of single realm | GET |
| /api/alerts/ | Recent alerts | GET |
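The endpoints can be queried with any HTTP client. The sketch below uses only the Python standard library and makes two assumptions that this page does not confirm: that the API is served from the same address as the dashboard, and that responses are JSON. The helper names are illustrative.

```python
import json
import urllib.request
from urllib.parse import quote

# Assumption: the API shares the dashboard's address.
BASE_URL = "http://127.0.0.1:8082"

def endpoint(path, base_url=BASE_URL):
    """Join the base URL and an API path such as /api/health/."""
    return base_url.rstrip("/") + path

def realm_url(name, base_url=BASE_URL):
    """Build the single-realm status URL, escaping the realm name."""
    return endpoint(f"/api/realms/{quote(name, safe='')}/", base_url)

def get_json(url, timeout=5):
    """GET a URL and decode the body as JSON (the response schema is not documented here)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

For example, `get_json(endpoint("/api/alerts/"))` would fetch recent alerts and `get_json(realm_url("herald"))` the status of a single realm, provided worker-bee is running.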
Common Use Cases¶
Use Case 1: Check Ecosystem Health Before Work¶
Goal: Verify all services are healthy before starting a session
- Open the worker-bee dashboard at http://127.0.0.1:8082
- View "Realm Status Grid"
- Check for red (down) or yellow (degraded) services
- All green means every service is healthy and you are ready to work
- If problems found, click realm for details
Use Case 2: Investigate Service Outage¶
Goal: Understand what happened when a realm crashed
- Open worker-bee dashboard
- Click "Recent Alerts" section
- Find alert for failed realm
- Review timestamp and error message
- Click realm name to see recovery log
- Check if auto-restart succeeded
Use Case 3: Monitor Long-Running Job¶
Goal: Ensure services stay healthy during intensive tasks
- Start long-running operation
- Open worker-bee dashboard
- Monitor resource graphs (CPU, memory, disk)
- Watch for service degradation
- Worker-bee alerts you if problems occur
Use Case 4: Configure Auto-Recovery¶
Goal: Enable automatic service restart on failure
Configuration in worker-bee settings:
- Essential Realms (auto-restart enabled):
  - Herald (critical dependency)
  - PostgreSQL (critical dependency)
  - Redis (critical dependency)
- Non-Critical Realms (manual restart):
  - Dashboard features
  - Analysis tools
  - Experimental services
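Expressed as a settings fragment, the split above might look like the following. The setting names `AUTO_RESTART` and `AUTO_RESTART_DEFAULT` and the lookup helper are hypothetical; check worker-bee's own settings for the real keys.

```python
# Hypothetical settings fragment: these names are illustrative,
# not worker-bee's documented configuration schema.
AUTO_RESTART = {
    "herald": True,      # critical dependency
    "postgresql": True,  # critical dependency
    "redis": True,       # critical dependency
}
AUTO_RESTART_DEFAULT = False  # everything else is restarted manually

def auto_restart_enabled(realm):
    """Look up whether a realm should be restarted automatically."""
    return AUTO_RESTART.get(realm.lower(), AUTO_RESTART_DEFAULT)

print(auto_restart_enabled("Herald"))          # True
print(auto_restart_enabled("analysis-tools"))  # False
```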
Important Notes¶
Monitoring Standards¶
Health Check Criteria:
- Service responds to HTTP request within 5 seconds
- PostgreSQL accepts connections
- Redis responds to PING command
- Disk space >5% available
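The four criteria above can be approximated with standard-library checks, as in the sketch below. It is an assumption-laden illustration rather than worker-bee's implementation: the function names are made up, the PostgreSQL check only confirms that a TCP connection succeeds (it does not log in), and the Redis check sends an unauthenticated inline `PING`, so a server that requires a password reports as failing.

```python
import http.client
import shutil
import socket
import urllib.request

def http_ok(url, timeout=5.0):
    """True if the service answers an HTTP request successfully within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, timeouts, and refused connections all land here.
        return False

def tcp_accepts(host, port, timeout=5.0):
    """True if a TCP connection succeeds, e.g. PostgreSQL on its listening port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def redis_pings(host="127.0.0.1", port=6379, timeout=5.0):
    """True if the server replies +PONG to an inline PING command."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            return conn.recv(16).startswith(b"+PONG")
    except OSError:
        return False

def disk_ok(path="/", min_free=0.05):
    """True if more than min_free (5% by default) of the disk is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total > min_free
```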
Alert Thresholds:
- Critical: Service down 2+ minutes
- Warning: Response time >2 seconds
- Info: Resource usage >80%
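The thresholds above translate directly into a small classification function. This sketch assumes that when several thresholds are crossed at once, the highest severity wins; that precedence, like the function name and its parameters, is an assumption and not something this page specifies.

```python
def classify(down_seconds=0.0, response_seconds=0.0, resource_percent=0.0):
    """Map one realm's measurements to the highest matching severity, or None."""
    if down_seconds >= 120:    # Critical: service down 2+ minutes
        return "critical"
    if response_seconds > 2:   # Warning: response time >2 seconds
        return "warning"
    if resource_percent > 80:  # Info: resource usage >80%
        return "info"
    return None

print(classify(down_seconds=180))                           # critical
print(classify(response_seconds=2.5, resource_percent=90))  # warning
print(classify(resource_percent=85))                        # info
print(classify(response_seconds=1.2, resource_percent=40))  # None
```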
Auto-Recovery Behavior¶
When a realm crashes:
1. Worker-bee detects failure
2. Sends alert if configured
3. If auto-restart is enabled: Waits 10 seconds, then restarts the realm
4. If auto-restart is disabled: Alerts you to restart the realm manually
5. Logs recovery attempt with result
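The flow above can be sketched as a single function that runs once a failure has been detected (step 1). Restarting, alerting, and logging are passed in as callables so the sketch stays self-contained; `handle_crash` and its parameters are hypothetical names, not worker-bee's API.

```python
import time

def handle_crash(realm, auto_restart, restart, send_alert, log,
                 delay_seconds=10, sleep=time.sleep):
    """Run steps 2-5 for a realm that has just been detected as down."""
    send_alert(f"{realm} is down")                     # step 2 (may be a no-op)
    if not auto_restart:
        send_alert(f"{realm} needs a manual restart")  # step 4
        log(f"{realm}: auto-restart disabled, manual restart requested")
        return "manual"
    sleep(delay_seconds)                               # step 3: wait, then restart
    succeeded = restart(realm)
    log(f"{realm}: restart {'succeeded' if succeeded else 'failed'}")  # step 5
    return "restarted" if succeeded else "failed"

# Usage with stand-in callables; sleep is stubbed so the example runs instantly.
events = []
result = handle_crash("herald", True, restart=lambda r: True,
                      send_alert=events.append, log=events.append,
                      sleep=lambda s: None)
print(result)  # restarted
print(events)  # ['herald is down', 'herald: restart succeeded']
```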
Critical services will auto-restart:
- Herald (all realms depend on it)
- PostgreSQL (data storage)
- Redis (message queuing)
Optional services require manual restart:
- Most realm services
- Analysis/experimentation tools
Log Retention & Compliance¶
Log Storage:
- Most recent 30 days: Full detail, searchable
- 31-90 days: Compressed, archive access
- Older than 90 days: Long-term archive (rarely accessed)
Log Deletion:
- Automatically deleted after 1 year
- Manual deletion requires confirmation
- Compliance mode keeps all logs indefinitely
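The retention rules can be read as a function of a log's age. The sketch below assumes "1 year" means 365 days and that compliance mode keeps old logs in the long-term archive; both are assumptions, as are the function name and the tier labels.

```python
from datetime import date

def retention_tier(log_date, today, compliance_mode=False):
    """Return the storage tier for a log written on log_date, as of today."""
    age_days = (today - log_date).days
    if age_days > 365 and not compliance_mode:
        return "deleted"     # automatically deleted after 1 year
    if age_days <= 30:
        return "full"        # full detail, searchable
    if age_days <= 90:
        return "compressed"  # compressed, archive access
    return "archive"         # long-term archive

today = date(2026, 2, 17)
print(retention_tier(date(2026, 2, 1), today))                         # full
print(retention_tier(date(2025, 12, 1), today))                        # compressed
print(retention_tier(date(2025, 6, 1), today))                         # archive
print(retention_tier(date(2024, 1, 1), today))                         # deleted
print(retention_tier(date(2024, 1, 1), today, compliance_mode=True))   # archive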
Search logs:
python manage.py search-logs --realm <name> --date 2026-02-17
Troubleshooting¶
Dashboard shows red for realm but service is running¶
Symptom: Realm appears down but you know it's running
Fix: The realm's port may be blocked. Check your firewall rules, then restart the realm.
Constant alerts for non-critical service¶
Symptom: Getting too many alerts for optional realm
Fix: Adjust alert severity rules or disable alerts for that realm
Auto-restart not working¶
Symptom: Realm crashes but doesn't automatically restart
Fix: Check if auto-restart is enabled for that realm in worker-bee settings
Logs growing too large¶
Symptom: Disk usage increasing rapidly
Fix: Run python manage.py rotate-logs to compress old logs
Resource graphs show memory leak¶
Symptom: Memory usage increasing over time
Fix: Check the realm's logs for errors. You may need to restart the affected service.
Related Realms¶
- herald - Worker-bee monitors herald for overall ecosystem health
- doc-u-me - Worker-bee logs activity accessible via doc-u-me search
- f3l1x-dashboard - Shows worker-bee health data in realm status
- All other realms - All services monitored by worker-bee