You'll plan and design distributed systems that enable AI-driven incident management, define integrations with AI/ML services, and develop Kubernetes-native operators that autonomously remediate infrastructure issues based on AI insights. You'll tackle challenges like telemetry correlation across logs, metrics, and traces, automated root cause analysis, and knowledge graph systems that power runbook automation. You'll design production systems that make effective use of AI/ML services while keeping core functionality intact during infrastructure failures and network partitions.

* Expert Programming: Deep expertise in languages such as Python, Java, or Go, with a focus on distributed systems, service integration, and cloud-native architecture at scale
* Observability Skills: Experience with technologies like Prometheus, OpenTelemetry, Grafana, time-series databases, and distributed ...