Ep 26: Never Crash — Error Handling, Retry Strategies & Alert Systems

⏱ Est. reading time: 9 min · Updated on 4/9/2026

Production Reality

Workflows that run perfectly in development eventually hit production failures: API timeouts, rate limits, expired credentials, malformed data, and more. The goal is a layered defense so that no single failure crashes the whole workflow.

graph TB
    L1["🛡️ Layer 1: Node-level retry & fallback"]
    L2["🛡️ Layer 2: Error Trigger global catch"]
    L3["🛡️ Layer 3: Multi-channel alerts + auto-recovery"]
    L1 --> L2 --> L3
    style L1 fill:#22c55e,stroke:#16a34a,color:#fff
    style L2 fill:#f59e0b,stroke:#d97706,color:#fff
    style L3 fill:#ef4444,stroke:#dc2626,color:#fff

1. Node-Level Error Strategies

// Settings → On Error:
// Retry on Fail: max 3 retries, 2s wait — for transient failures
// Continue on Error: pass error as Item — for batch processing
// Stop Workflow: halt immediately — for critical path nodes
// Output Error Data: structured error in Item for downstream handling
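n8n applies Retry on Fail for you, but the underlying logic is worth understanding when you need retry behavior inside a Code node. A minimal sketch — the function names here are illustrative, not n8n APIs:

```javascript
// Sketch of the behaviour behind "Retry on Fail": call a function,
// wait between failed attempts, give up after maxRetries.
// (Illustrative only — n8n handles this natively via node settings.)
async function fetchWithRetry(fn, { maxRetries = 3, waitMs = 2000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt <= maxRetries) {
        // Fixed wait, matching the node's "2s wait" setting.
        // Multiply by attempt here for exponential backoff instead.
        await new Promise((resolve) => setTimeout(resolve, waitMs));
      }
    }
  }
  // Retries exhausted — rethrow so the Error Trigger (Layer 2) catches it.
  throw lastError;
}
```

Note the final rethrow: node-level retry is only Layer 1, so unrecoverable errors must still propagate upward to the global catch.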

2. Error Trigger (Global Catch)

graph TB
    ET[⚠️ Error Trigger] --> Parse[Parse Error]
    Parse --> Log[📝 Data Table]
    Parse --> Alert[🚨 Slack/Email]
    Parse --> Retry[🔄 Auto-recover]
    style ET fill:#ef4444,stroke:#dc2626,color:#fff
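The Parse Error step usually flattens the Error Trigger payload into something alert-friendly before fanning out to logging, alerting, and recovery. A sketch of that Code-node logic, assuming the `execution`/`workflow` payload shape the Error Trigger emits (field access is defensive in case a key is missing):

```javascript
// Sketch of a "Parse Error" step: reduce the Error Trigger payload
// to the fields the downstream log/alert branches need.
function parseError(payload) {
  const exec = payload.execution || {};
  const wf = payload.workflow || {};
  return {
    workflowName: wf.name || 'unknown workflow',
    failedNode: exec.lastNodeExecuted || 'unknown node',
    message: (exec.error && exec.error.message) || 'no error message',
    executionUrl: exec.url || null, // deep link back to the failed run
    timestamp: new Date().toISOString(),
  };
}
```

Keeping this parsing in one place means the Slack, email, and Data Table branches all receive the same normalized shape.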

3. Graceful Degradation

graph TB
    API[Call API] --> Check{Success?}
    Check -->|"✅"| Normal[Normal]
    Check -->|"❌"| FB{Fallback}
    FB --> Cache["📦 Use cached data"]
    FB --> Default["📋 Use defaults"]
    FB --> Queue["📬 Retry queue"]
    style FB fill:#f59e0b,stroke:#d97706
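The fallback branch above is a chain: try the API, fall back to cached data, then to defaults, and push every failure onto a retry queue. An in-memory sketch — the cache and queue here are plain stand-ins, not n8n features:

```javascript
// Illustrative fallback chain for the diagram above.
// cache: a Map, retryQueue: an Array — stand-ins for real storage.
async function getDataWithFallback(callApi, cache, defaults, retryQueue) {
  try {
    const data = await callApi();
    cache.set('last-good', data); // refresh the cache on every success
    return { source: 'api', data };
  } catch (err) {
    // Record the failure so it can be retried later.
    retryQueue.push({ failedAt: Date.now(), reason: err.message });
    if (cache.has('last-good')) {
      return { source: 'cache', data: cache.get('last-good') };
    }
    return { source: 'defaults', data: defaults };
  }
}
```

Tagging each result with its `source` lets downstream nodes (and alerts) distinguish fresh data from degraded data.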

4. Alert Severity Levels

| Level | Trigger | Channel |
|---|---|---|
| P0 Critical | Payment flow down | Slack + SMS + PagerDuty |
| P1 Important | API fails ≥ 5x | Slack + Email |
| P2 Warning | Single fail, auto-recovered | Data Table log |
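The severity table maps naturally onto a small dispatch function inside the error workflow. A sketch — thresholds and channel names mirror the table; adapt them to your own routing:

```javascript
// Route an error to alert channels by severity, per the table above.
// `critical` and `consecutiveFails` are illustrative inputs you would
// derive from the parsed error and your failure log.
function routeAlert({ critical = false, consecutiveFails = 0 } = {}) {
  if (critical) {
    // P0: wake someone up.
    return { level: 'P0', channels: ['slack', 'sms', 'pagerduty'] };
  }
  if (consecutiveFails >= 5) {
    // P1: repeated failures need human attention soon.
    return { level: 'P1', channels: ['slack', 'email'] };
  }
  // P2: single auto-recovered failure — just log it.
  return { level: 'P2', channels: ['datatable-log'] };
}
```

In n8n this branching is typically a Switch node; the function form just makes the thresholds explicit and testable.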

Next Episode

Ep 27: Sub-Workflow architecture — splitting complex logic into reusable modular components.