kaushikpaul90/ai-driven-incident-management
An AI-powered system for autonomous incident detection, diagnosis, and remediation in high-performance computing (HPC) environments. Leverages agentic workflows, RAG-based knowledge retrieval, and ML on log data to reduce downtime and enhance reliability. Features runbooks for HPC issues like disk failures, memory errors, and network timeouts.
Source: https://github.com/kaushikpaul90/ai-driven-incident-management
Discovered from GitHub repositories pushed in the last 24 hours for agent skills, Claude/Codex/Gemini workflows, MCP tooling, and adjacent AI-agent automation.