AI DevOps Engineer — Digital Worker
AI DevOps Engineer automates operational tasks: incident diagnosis, log analysis, infrastructure optimization, IaC generation (Terraform, Ansible), and CI/CD pipeline configuration. It acts as a first responder during incidents and offloads L1 tasks from on-call engineers.
Incident Response Agent
```python
import json
import operator
from datetime import datetime, timezone
from typing import TypedDict, Annotated, Optional

from langgraph.graph import StateGraph, END
from langgraph.prebuilt import create_react_agent
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool

# Assumed to be configured elsewhere: loki_client, prometheus (query clients)
# and k8s_client, k8s_apps (kubernetes CoreV1Api / AppsV1Api wrappers).

llm = ChatOpenAI(model="gpt-4o", temperature=0)


class IncidentState(TypedDict):
    alert_data: dict
    investigation_steps: Annotated[list, operator.add]
    root_cause: Optional[str]
    severity: Optional[str]
    actions_taken: Annotated[list, operator.add]
    resolved: bool
    escalation_required: bool


@tool
def get_recent_logs(service: str, minutes: int = 30, level: str = "ERROR") -> str:
    """Get recent service logs from Loki/Elasticsearch.

    Args:
        service: Service name
        minutes: Period in minutes
        level: Log level (ERROR, WARN, INFO)
    """
    logs = loki_client.query(
        query=f'{{app="{service}"}} |= "{level}"',
        start=f"-{minutes}m",
        limit=100,
    )
    return "\n".join(logs[:50])


@tool
def get_metrics(service: str, metric_names: list[str], minutes: int = 60) -> str:
    """Get service metrics from Prometheus."""
    metrics = {}
    for metric in metric_names:
        result = prometheus.query_range(
            query=f'{metric}{{service="{service}"}}',
            start=f"-{minutes}m",
            step="1m",
        )
        metrics[metric] = result
    return json.dumps(metrics)


@tool
def check_kubernetes_pods(namespace: str, label_selector: str = "") -> str:
    """Check Kubernetes pod status."""
    pods = k8s_client.list_pods(namespace=namespace, label_selector=label_selector)
    pod_status = [{
        "name": p.metadata.name,
        "phase": p.status.phase,
        "ready": all(c.ready for c in (p.status.container_statuses or [])),
        "restarts": sum(c.restart_count for c in (p.status.container_statuses or [])),
        # creation_timestamp is timezone-aware; use total_seconds(), not .seconds,
        # which wraps around every 24 hours
        "age_minutes": int(
            (datetime.now(timezone.utc) - p.metadata.creation_timestamp).total_seconds() // 60
        ),
    } for p in pods.items]
    return json.dumps(pod_status)


@tool
def restart_deployment(namespace: str, deployment_name: str) -> str:
    """Restart a deployment in Kubernetes (rollout restart)."""
    k8s_apps.patch_namespaced_deployment(
        name=deployment_name,
        namespace=namespace,
        body={"spec": {"template": {"metadata": {"annotations": {
            "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat()
        }}}}},
    )
    return f"Deployment {deployment_name} is restarting"


@tool
def scale_deployment(namespace: str, deployment_name: str, replicas: int) -> str:
    """Scale a deployment."""
    if replicas > 20:
        return "Error: scaling limit exceeded (20 replicas)"
    k8s_apps.patch_namespaced_deployment_scale(
        name=deployment_name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )
    return f"Deployment {deployment_name} scaled to {replicas} replicas"


# Incident response agent
incident_tools = [get_recent_logs, get_metrics, check_kubernetes_pods,
                  restart_deployment, scale_deployment]

INCIDENT_RESPONSE_PROMPT = """You are a Senior SRE/DevOps Engineer. Investigate incidents autonomously.

When investigating:
1. First, collect data (logs, metrics, pod status)
2. Determine the root cause
3. Try to fix automatically if safe (restart, scale up)
4. If manual intervention is required — escalate with detailed context

Never do automatically:
- Changes to production databases
- Deployment rollback without explicit instruction
- Scaling to > 10 replicas
- Resource deletion"""

# create_react_agent binds the tools itself — pass the bare model,
# not llm.bind_tools(...)
incident_agent = create_react_agent(
    llm,
    tools=incident_tools,
    state_modifier=INCIDENT_RESPONSE_PROMPT,
)
```
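To wire the agent to real alerts, an incoming payload has to be rendered into an investigation request. A minimal sketch; the field names are illustrative assumptions, not the actual PagerDuty schema:

```python
def format_alert_message(alert: dict) -> str:
    """Render an alert dict into the user message handed to the incident agent."""
    return (
        f"Incident alert: {alert['alert_name']} on service {alert['service']} "
        f"(namespace {alert['namespace']}). Details: {alert['description']}. "
        "Investigate, and remediate only if it is safe."
    )

# Hypothetical alert payload
alert = {
    "service": "checkout-api",
    "alert_name": "HighErrorRate",
    "description": "5xx rate above 5% for 10 minutes",
    "namespace": "prod",
}
message = format_alert_message(alert)
# The ReAct agent is then invoked with:
# result = incident_agent.invoke({"messages": [("user", message)]})
```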
Log Analysis Agent
```python
import json
import re
from collections import Counter


class LogAnalyzer:
    async def analyze_error_pattern(
        self,
        service: str,
        time_range: str = "1h",
    ) -> dict:
        """Analyze error patterns in logs."""
        # Fetch and cluster errors
        error_logs = await loki_client.query_errors(service, time_range)
        clustered = self.cluster_errors(error_logs)

        # LLM analyzes the patterns; get_time_pattern and detect_anomalies
        # are assumed to be implemented elsewhere in the class
        analysis = await llm.ainvoke(f"""Analyze error patterns:

Top errors (clusters):
{json.dumps(clustered[:10], ensure_ascii=False, indent=2)}

Time pattern: {self.get_time_pattern(error_logs)}

Determine:
1. Root cause of the most frequent errors
2. Anomalous patterns (sudden spikes, cyclicity)
3. Remediation recommendations""")

        return {
            "clusters": clustered,
            "analysis": analysis.content,
            "anomalies": self.detect_anomalies(error_logs),
        }

    def cluster_errors(self, logs: list[dict]) -> list[dict]:
        """Simple error clustering by fingerprint."""
        fingerprints = Counter()
        examples = {}
        for log in logs:
            # Normalize the error message (strip dynamic parts)
            fingerprint = re.sub(r'\b\d+\b', 'N', log.get("message", ""))
            fingerprint = re.sub(
                r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}',
                'UUID',
                fingerprint,
            )
            fingerprints[fingerprint] += 1
            if fingerprint not in examples:
                examples[fingerprint] = log["message"]
        return [
            {"fingerprint": fp[:100], "count": count, "example": examples[fp]}
            for fp, count in fingerprints.most_common(20)
        ]
```
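The normalization step inside `cluster_errors` can be exercised on its own. A standalone sketch using the same digit masking plus a strict 8-4-4-4-12 UUID pattern (the log lines are made up for illustration):

```python
import re
from collections import Counter

def error_fingerprint(message: str) -> str:
    """Mask dynamic parts (numbers, UUIDs) so repeated errors collapse into one cluster."""
    fp = re.sub(r'\b\d+\b', 'N', message)
    fp = re.sub(r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}', 'UUID', fp)
    return fp

logs = [
    "Timeout after 5000 ms for request 9f1c2d3e-0a1b-4c2d-8e3f-a1b2c3d4e5f6",
    "Timeout after 3000 ms for request 1a2b3c4d-5e6f-4a1b-9c2d-0f1e2d3c4b5a",
    "Connection refused to db-primary:5432",
]
counts = Counter(error_fingerprint(m) for m in logs)
# Both timeouts collapse into the single cluster "Timeout after N ms for request UUID"
```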
IaC Generator
```python
from typing import Optional


class InfrastructureCodeGenerator:
    async def generate_terraform(
        self,
        infrastructure_description: str,
        cloud_provider: str = "aws",
        existing_modules: Optional[list[str]] = None,
    ) -> str:
        """Generate a Terraform configuration."""
        modules_context = f"\nAvailable modules: {existing_modules}" if existing_modules else ""
        response = await llm.ainvoke(f"""Generate Terraform configuration for:
{infrastructure_description}

Provider: {cloud_provider}

Requirements:
- Use latest stable provider versions
- Follow best practices: don't hardcode credentials, use variables and outputs
- Add tags for cost allocation
- Include basic security groups / IAM policies
{modules_context}

Return complete HCL code with comments.""")
        return response.content

    async def generate_ansible_playbook(
        self,
        task_description: str,
        target_os: str = "ubuntu",
        idempotency_required: bool = True,
    ) -> str:
        """Generate an Ansible playbook."""
        response = await llm.ainvoke(f"""Generate Ansible playbook for:
{task_description}

Target OS: {target_os}
Idempotency: {'required — all tasks must be idempotent' if idempotency_required else 'desired'}

Requirements:
- Use ansible-lint best practices
- Handlers for services
- Check before/after if applicable
- Verifiable — add verify tasks

Return YAML playbook.""")
        return response.content
```
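Generated HCL often arrives wrapped in markdown fences, which breaks `terraform validate` if written to `main.tf` verbatim. A sketch of the cleanup step, assuming the model may or may not fence its output:

```python
import re

FENCE = "`" * 3  # triple backtick, built up to keep this snippet's own fencing intact

def extract_hcl(llm_output: str) -> str:
    """Return the code inside the first markdown fence, or the raw text if unfenced."""
    pattern = re.compile(FENCE + r"(?:hcl|terraform)?\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(llm_output)
    return (match.group(1) if match else llm_output).strip()
```

The cleaned text can then be written to a scratch directory and checked with `terraform validate` before a PR is opened.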
CI/CD Pipeline Generator
```python
async def generate_github_actions_pipeline(
    project_type: str,        # "python-fastapi", "node-react", "go"
    deployment_target: str,   # "kubernetes", "lambda", "ecs"
    requirements: list[str],  # ["tests", "security-scan", "docker", "terraform"]
) -> str:
    response = await llm.ainvoke(f"""Generate GitHub Actions workflow for:

Project type: {project_type}
Deployment: {deployment_target}
Requirements: {requirements}

Include:
- Parallel jobs where possible
- Dependency caching
- Correct conditions (push to main → deploy prod, PR → tests only)
- Environment protection rules for production
- Notifications on failure

Return complete YAML workflow.""")
    return response.content
```
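Before committing a generated workflow, a cheap sanity gate helps: the output must parse as YAML and contain the top-level keys GitHub Actions requires. A sketch assuming PyYAML is available (note that YAML 1.1 loads a bare `on:` key as the boolean `True`):

```python
import yaml  # PyYAML, assumed available

def workflow_is_plausible(workflow: str) -> bool:
    """Parse generated workflow YAML and check for required top-level keys.
    PyYAML loads the bare key `on:` as boolean True, hence the double check."""
    try:
        doc = yaml.safe_load(workflow)
    except yaml.YAMLError:
        return False
    if not isinstance(doc, dict):
        return False
    return "jobs" in doc and ("on" in doc or True in doc)
```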
Practical Case Study: Startup, Two DevOps Engineers for 15 Developers
Situation: two DevOps engineers, 40+ microservices, and night on-call duty that was exhausting the team. L1 incidents (OOMKilled pods, high load, slow queries) consumed 60% of on-call time.
AI DevOps First-Responder:
- Handles PagerDuty alerts autonomously
- Collects diagnostic data (logs, metrics, k8s status)
- Performs safe automatic actions (restart, scale up)
- For complex cases: wakes engineer with full context instead of raw alert
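The "full context instead of a raw alert" step can be sketched using the IncidentState fields defined earlier; the key names mirror that state, while the page format itself is an illustrative assumption:

```python
def build_escalation_context(state: dict) -> str:
    """Assemble the page sent to the on-call engineer in place of a raw alert."""
    lines = [
        f"Root cause hypothesis: {state.get('root_cause') or 'unknown'}",
        f"Severity: {state.get('severity') or 'unknown'}",
        "Investigation steps:",
        *[f"  - {step}" for step in state.get("investigation_steps", [])],
        "Actions already taken:",
        *[f"  - {action}" for action in state.get("actions_taken", [])],
    ]
    return "\n".join(lines)

page = build_escalation_context({
    "root_cause": "OOMKilled: payment-worker exceeds 512Mi limit under load",
    "severity": "high",
    "investigation_steps": ["checked pod status", "pulled 30m of ERROR logs"],
    "actions_taken": ["restarted deployment payment-worker"],
})
```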
Results:
- L1 incidents closed autonomously: 61%
- Average time to wake engineer at night: reduced by 58%
- Mean Time to Recovery (MTTR): 45 min → 18 min
- DevOps focus: architecture, optimization, not routine restarts
- Night wake-ups: -63%
IaC generation: 180 PRs with Terraform/Ansible code over 3 months; 91% were accepted without major revisions.
Timeline
- Incident Response agent with K8s tools: 2–3 weeks
- Log Analysis system: 1–2 weeks
- IaC Generator for main resources: 1–2 weeks
- CI/CD Generator + integrations: 1–2 weeks
- PagerDuty/OpsGenie integration: 1 week
- Total: 6–10 weeks