While many AI workflows in software engineering focus on assisting with tasks like writing code, explaining stack traces, or summarizing pull requests, the fundamental operational loop for incident management often remains heavily human-dependent: a developer still has to manually notice, prioritize, investigate, and judge the criticality of each issue.
The author recently encountered this limitation firsthand after shipping a new feature. Despite proper testing, a significant crash went unnoticed because it failed to trigger notification thresholds in Firebase Crashlytics. This experience highlighted a critical gap: relying solely on passive alerts meant serious issues could slip by unless someone inspected the dashboard manually, sparking the idea for a more proactive, automated solution.
Leveraging existing access to Crashlytics through internal MCP tooling, the author began experimenting with automating the entire crash discovery and investigation process. The resulting workflow, powered by a system named Hermes, is now largely automated.
The workflow operates as follows: Hermes checks Crashlytics for new issues every four hours. Detected issues are automatically documented and prioritized, then delegated one at a time to a specialized AI agent workflow. The agent gathers additional context via MCP, attempts to reproduce the crash, writes or updates relevant tests, builds the necessary platform bundle, works toward a fix, and, if successful, opens a pull request. The issue document is updated throughout this process.
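To make the shape of this loop concrete, here is a minimal TypeScript sketch. Every name in it (`fetchNewIssues`, `runAgentWorkflow`, `prioritize`) is a hypothetical placeholder; the post does not describe Hermes's actual implementation, and the real system talks to Crashlytics through internal MCP tooling rather than these stubs.

```typescript
interface CrashIssue {
  id: string;
  title: string;
  impactedUsers: number;
}

const POLL_INTERVAL_MS = 4 * 60 * 60 * 1000; // check every four hours

// Placeholder: in the real system this call goes through MCP to Crashlytics.
async function fetchNewIssues(): Promise<CrashIssue[]> {
  return [];
}

// Placeholder for the delegated agent workflow: gather context, reproduce,
// test, build, work toward a fix, open a PR, updating the issue doc as it goes.
async function runAgentWorkflow(issue: CrashIssue): Promise<void> {
  console.log(`investigating ${issue.id}: ${issue.title}`);
}

// Handle the most impactful crashes first.
function prioritize(issues: CrashIssue[]): CrashIssue[] {
  return [...issues].sort((a, b) => b.impactedUsers - a.impactedUsers);
}

async function hermesLoop(): Promise<void> {
  for (;;) {
    const issues = await fetchNewIssues();
    for (const issue of prioritize(issues)) {
      await runAgentWorkflow(issue); // one issue at a time
    }
    await new Promise((resolve) => setTimeout(resolve, POLL_INTERVAL_MS));
  }
}

hermesLoop();
```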
Once the pull request is merged, Hermes automatically closes the corresponding Crashlytics issue. Notably, CodeRabbit provides an initial review of the PR before human developers engage. The end-to-end process is roughly 90% automated, with human involvement focused primarily on final review and validation of the proposed fixes.
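The merge-to-close step could be wired up in several ways; the sketch below assumes a GitHub-style pull request webhook (whose `action` and `pull_request.merged` fields are real) plus two invented helpers, `extractIssueId` and `closeCrashlyticsIssue`, standing in for whatever convention and internal tooling Hermes actually uses.

```typescript
interface PullRequestEvent {
  action: string;
  pull_request: { merged: boolean; body: string };
}

// Hypothetical: marks the issue resolved in Crashlytics via internal tooling.
async function closeCrashlyticsIssue(issueId: string): Promise<void> {
  console.log(`closing Crashlytics issue ${issueId}`);
}

// Assumed convention: the PR body embeds the id of the issue it fixes.
function extractIssueId(body: string): string | null {
  const match = body.match(/Crashlytics-Issue:\s*(\S+)/);
  return match ? match[1] : null;
}

export async function onPullRequestEvent(event: PullRequestEvent): Promise<void> {
  // Only act on PRs that were actually merged, not merely closed.
  if (event.action !== "closed" || !event.pull_request.merged) return;
  const issueId = extractIssueId(event.pull_request.body);
  if (issueId) await closeCrashlyticsIssue(issueId);
}
```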
A crucial realization from this project is that the most significant breakthroughs stemmed less from raw AI model intelligence and more from achieving optimal operational fit. While the underlying models are important, their ability to integrate seamlessly and perform effectively within the existing operational context proved far more critical.
A model that excels at creative writing might be unsuitable for a complex codebase, and a seemingly cost-effective model, despite good benchmarks, could waste substantial time if it produces poor investigations or weak fixes. The ultimate goal isn't to minimize AI expenditure, but to accelerate problem-solving and enhance reliability.
Different stages of the workflow demand distinct AI capabilities. Triage, for instance, is relatively lightweight; Crashlytics already furnishes severity ratings, impacted user counts, stack traces, and environmental data. Smaller, more efficient models can typically handle prioritization effectively here.
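One plausible way to set up that lightweight triage step is to pack the fields Crashlytics already reports into a compact prompt for a small model. The sketch below is an assumption about how such a step might look; `callSmallModel` is a placeholder for whatever inexpensive model endpoint the system uses, and the P0-P3 labels are illustrative.

```typescript
interface CrashSummary {
  id: string;
  severity: string;       // Crashlytics-reported severity
  impactedUsers: number;  // Crashlytics-reported user count
  topFrame: string;       // first frame of the stack trace
}

// Placeholder for a cheap, fast model call.
async function callSmallModel(prompt: string): Promise<string> {
  return "P2";
}

async function triage(issue: CrashSummary): Promise<string> {
  // The data is already structured, so the prompt can stay tiny.
  const prompt = [
    `Severity: ${issue.severity}`,
    `Impacted users: ${issue.impactedUsers}`,
    `Top stack frame: ${issue.topFrame}`,
    "Assign a priority (P0-P3) and answer with the label only.",
  ].join("\n");
  return callSmallModel(prompt);
}
```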
However, investigation requires a higher degree of reasoning quality. In this stage, the system must comprehend platform limitations, validate assumptions, parse documentation, and logically justify why a problem might be unresolvable if that's the outcome.
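One way to keep that reasoning auditable is to require the agent to emit a structured verdict, including a written justification whenever it declares a crash unresolvable. The shape below is an assumption for illustration, not the author's actual schema.

```typescript
type Verdict =
  | { kind: "fixable"; proposedFix: string }
  // An "unresolvable" verdict must carry its own justification.
  | { kind: "unresolvable"; justification: string };

interface InvestigationReport {
  issueId: string;
  assumptionsChecked: string[]; // e.g. platform limits validated against docs
  reproduced: boolean;
  verdict: Verdict;
}
```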
The workflow also proved valuable in surfacing unforeseen edge cases. One production issue, for instance, stemmed from a user attempting to upload a 200MB file on Android. The feature was not inherently broken, nor was the testing inadequate; the specific limitation around large file uploads on Android (which iOS handled differently) simply hadn't been accounted for. The scenario underscores the system's ability to uncover overlooked design considerations, not just outright code defects.
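A fix for that kind of gap might amount to little more than a platform-aware guard before the upload reaches the platform layer. The snippet below is purely illustrative: the threshold value and the Android/iOS split are assumptions drawn from the anecdote, not the actual fix that shipped.

```typescript
// Assumed limit for illustration; the post does not state the real threshold.
const MAX_ANDROID_UPLOAD_BYTES = 100 * 1024 * 1024;

function canUpload(sizeBytes: number, platform: "android" | "ios"): boolean {
  // iOS handled large files differently, so only Android is capped here.
  return platform !== "android" || sizeBytes <= MAX_ANDROID_UPLOAD_BYTES;
}
```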