Ep 26: Never Crash — Error Handling, Retry Strategies & Alert Systems
Production Reality
Workflows that run perfectly in dev hit a different set of problems in production: API timeouts, rate limits, expired credentials, malformed data...
graph TB
L1["🛡️ Layer 1: Node-level retry & fallback"]
L2["🛡️ Layer 2: Error Trigger global catch"]
L3["🛡️ Layer 3: Multi-channel alerts + auto-recovery"]
L1 --> L2 --> L3
style L1 fill:#22c55e,stroke:#16a34a,color:#fff
style L2 fill:#f59e0b,stroke:#d97706,color:#fff
style L3 fill:#ef4444,stroke:#dc2626,color:#fff
1. Node-Level Error Strategies
// Settings → On Error:
// Retry on Fail: max 3 retries, 2s wait — for transient failures
// Continue on Error: pass error as Item — for batch processing
// Stop Workflow: halt immediately — for critical path nodes
// Output Error Data: structured error in Item for downstream handling
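To make the first two settings concrete, here is a minimal sketch of what "retry on fail" and "continue on error" amount to in plain JavaScript. The names `retryOnFail` and `processBatch` are hypothetical helpers written for illustration, not part of any platform API; the defaults of 3 retries and a 2s wait simply mirror the values listed above.

```javascript
// Hypothetical helper: retry an async operation with a fixed wait between attempts.
// Defaults mirror the settings above: 3 retries, 2000 ms wait.
async function retryOnFail(operation, { maxRetries = 3, waitMs = 2000 } = {}) {
  let lastError;
  // One initial attempt plus up to maxRetries retries.
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      // Wait only if another attempt is still coming.
      if (attempt < maxRetries) {
        await new Promise((resolve) => setTimeout(resolve, waitMs));
      }
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}

// Hypothetical helper: "continue on error" for a batch. A failing item becomes
// an error item instead of aborting the whole run.
async function processBatch(items, handler) {
  const results = [];
  for (const item of items) {
    try {
      results.push({ json: await handler(item) });
    } catch (error) {
      results.push({ json: { ...item, error: { message: error.message } } });
    }
  }
  return results;
}
```

Retrying suits transient failures such as timeouts, while the batch variant keeps one bad record from blocking the rest; downstream steps can filter on the `error` field.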
2. Error Trigger (Global Catch)
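The Parse Error step in this section's flow could look roughly like the sketch below. The payload field names (`workflow.name`, `execution.id`, `execution.lastNodeExecuted`, `execution.error.message`) are assumptions about what the Error Trigger emits, and `parseErrorEvent` is a hypothetical function name; check the actual payload in your own instance before relying on any of them.

```javascript
// Hypothetical parser for an error event. Field names are assumptions about
// the payload shape, not a confirmed schema.
function parseErrorEvent(event, now = new Date()) {
  const execution = event.execution ?? {};
  const workflow = event.workflow ?? {};
  const record = {
    timestamp: now.toISOString(),
    workflow: workflow.name ?? 'unknown workflow',
    executionId: execution.id ?? null,
    failedNode: execution.lastNodeExecuted ?? 'unknown node',
    message: execution.error?.message ?? 'no error message',
  };
  return {
    // Row for the log table.
    logRow: record,
    // Human-readable text for Slack or email.
    alertText: `Workflow "${record.workflow}" failed at "${record.failedNode}": ${record.message}`,
    // Illustrative heuristic: only transient-looking failures are candidates
    // for auto-recovery.
    retryable: /timeout|rate limit|ECONNRESET|429|503/i.test(record.message),
  };
}
```

One parsed object feeds all three branches: `logRow` goes to the log, `alertText` to the alert channel, and `retryable` decides whether auto-recovery is worth attempting.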
graph TB
ET[⚠️ Error Trigger] --> Parse[Parse Error]
Parse --> Log[📝 Data Table]
Parse --> Alert[🚨 Slack/Email]
Parse --> Retry[🔄 Auto-recover]
style ET fill:#ef4444,stroke:#dc2626,color:#fff
3. Graceful Degradation
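One way to express this section's fallback chain in code is the sketch below: try the API, and on failure queue the call for a later retry, then serve cached data if available or defaults otherwise. Everything here is hypothetical scaffolding (`fetchWithFallback`, `callApi`, `cache`, `defaults`, `retryQueue`); the ordering of cache before defaults is an assumption.

```javascript
// Hypothetical fallback wrapper. callApi is any async function; cache is a Map;
// retryQueue is a plain array standing in for a real queue.
async function fetchWithFallback(key, callApi, { cache, defaults, retryQueue }) {
  try {
    const data = await callApi(key);
    cache.set(key, data); // refresh the cache on every success
    return { source: 'api', data };
  } catch (error) {
    // Record the failed call so it can be retried later.
    retryQueue.push({ key, error: error.message });
    if (cache.has(key)) {
      return { source: 'cache', data: cache.get(key) };
    }
    // Nothing cached yet: fall back to static defaults.
    return { source: 'defaults', data: defaults };
  }
}
```

Returning the `source` alongside the data lets downstream steps tell fresh results from degraded ones, for example to mark a report as stale.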
graph TB
API[Call API] --> Check{Success?}
Check -->|"✅"| Normal[Normal]
Check -->|"❌"| FB{Fallback}
FB --> Cache["📦 Use cached data"]
FB --> Default["📋 Use defaults"]
FB --> Queue["📬 Retry queue"]
style FB fill:#f59e0b,stroke:#d97706
4. Alert Severity Levels
| Level | Trigger | Channel |
|---|---|---|
| P0 Critical | Payment flow down | Slack + SMS + PagerDuty |
| P1 Important | API fails ≥ 5x | Slack + Email |
| P2 Warning | Single fail, auto-recovered | Data Table log |
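The table can be turned into a small routing function, sketched below. The channel lists are copied from the table; the event fields (`flow`, `failureCount`) and the name `classifyAlert` are hypothetical. Treating everything the table does not cover as P2 is also an assumption.

```javascript
// Channel routing copied from the severity table.
const CHANNELS = {
  P0: ['Slack', 'SMS', 'PagerDuty'],
  P1: ['Slack', 'Email'],
  P2: ['Data Table log'],
};

// Hypothetical classifier. The event fields are illustrative, not a real schema.
function classifyAlert({ flow, failureCount = 1 }) {
  let level;
  if (flow === 'payment') {
    level = 'P0'; // payment flow down
  } else if (failureCount >= 5) {
    level = 'P1'; // API fails 5 or more times
  } else {
    level = 'P2'; // assumption: everything else is only logged
  }
  return { level, channels: CHANNELS[level] };
}
```

For example, `classifyAlert({ flow: 'crm', failureCount: 5 })` yields level `P1` with the Slack and Email channels.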
Next Episode
Ep 27: Sub-Workflow architecture — splitting complex logic into reusable modular components.