Alerting and on-call practices ensure that security and reliability issues are detected quickly, routed to the correct team, and resolved as fast as possible. In DevSecOps environments, alerting must be predictable, actionable, and low-noise so that engineers can respond effectively during incidents. On-call practices define how teams rotate responsibilities, escalate issues, document responses, and maintain resilience under real-world conditions.
Foundations of Alerting
Alerts fire when monitoring systems detect conditions that require human attention. Good alerting prevents:
• missed outages
• delayed security detection
• alert fatigue
• duplicated escalation
• unclear ownership
Alerting must be tied to clear service-level expectations and defined boundaries.
What Makes a Good Alert
A good alert must be:
• actionable
• urgent
• routed to the correct team
• specific enough to guide investigation
• reproducible
• tied to documented runbooks
Bad alerts create noise and divert attention from real incidents.
Alert Types in DevSecOps
Reliability Alerts
Triggered by:
• high latency
• pod failures
• CrashLoopBackOff
• scaling failures
• ingress errors
• degraded API performance
Security Alerts
Triggered by:
• Falco events
• API abuse
• failed authorization attempts
• CI/CD anomalies
• image signature failures
• admission controller denials
Infrastructure Alerts
Triggered by:
• node failures
• disk pressure
• network anomalies
• storage issues
Compliance Alerts
Triggered by:
• policy violations
• resource configuration drift
• unapproved registry usage
The category of an alert determines which team receives it.
On-call Structure
An on-call system defines:
• who responds
• when they respond
• how incidents are escalated
• how handoffs occur
• how incidents are documented
On-call schedules rotate regularly to ensure consistent coverage.
On-call Roles
Primary On-call
First responder. Handles alerts immediately.
Secondary On-call
Supports primary in complex incidents.
Incident Commander
Coordinates communication during major incidents.
Communications Lead
Handles updates, status pages, and stakeholder notifications.
Clear role separation prevents confusion during real incidents.
Escalation Rules
Escalations are triggered when:
• primary cannot resolve the issue
• incident severity exceeds threshold
• issue affects production for a prolonged time
• security breach indicators appear
Escalation paths must be predefined and documented.
SLAs and SLOs for Alerting
Alerting ties into service-level expectations:
• SLA → external promise
• SLO → internal target
• SLI → measured metric
Alerts should fire when SLOs are at risk, not after SLAs are broken.
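One common way to detect that an SLO is "at risk" is to measure how fast the error budget is being consumed. The sketch below assumes an availability SLO expressed as a success fraction (for example 0.999); the function names and the default paging threshold of 14.4 are illustrative assumptions, not values from this section.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is burning.

    error_ratio: observed fraction of failed requests in the window (the SLI).
    slo_target:  target success fraction, e.g. 0.999.
    """
    budget = 1.0 - slo_target          # allowed failure fraction
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def slo_at_risk(error_ratio: float, slo_target: float, threshold: float = 14.4) -> bool:
    # threshold is an assumed paging level; tune it per service
    return burn_rate(error_ratio, slo_target) >= threshold
```

A burn rate of 1 means the budget would last exactly the SLO period; higher values mean the SLO will be missed sooner unless someone intervenes.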
Runbooks
Runbooks contain:
• steps to investigate
• expected log files
• mitigation actions
• rollback steps
• contact points
• validation steps
Runbooks must be accessible, version-controlled, and kept up to date, ideally revised after every incident that exposes a gap.
Alert Prioritization
P1 – Critical
Production down, major breach indicator.
P2 – High
Service degraded, potential security compromise.
P3 – Medium
Minor functionality issues or suspicious activity.
P4 – Low
Informational or non-urgent.
Prioritization avoids overloading on-call staff.
Reducing Alert Noise
Noise reduction includes:
• threshold tuning
• suppression during deployments
• grouping repeated alerts
• routing to relevant teams
• eliminating duplicate alerts
• avoiding alerts for expected behavior
Good alert hygiene keeps on-call attention focused on real incidents.
Alert Routing
Routing sends alerts to:
• Slack
• Teams
• PagerDuty
• Opsgenie
• Email
• SIEM/SOAR
• SMS and phone calls
Each routing path must match severity requirements.
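As a sketch of matching routing paths to severity, the mapping below sends higher priorities to more intrusive channels. The specific priority-to-channel assignments and the `category` field are hypothetical; the priority labels P1 to P4 are the ones used in this section.

```python
# Hypothetical severity-to-channel mapping; channel names are placeholders.
ROUTES = {
    "P1": ["pagerduty", "phone", "slack"],
    "P2": ["pagerduty", "slack"],
    "P3": ["slack"],
    "P4": ["email"],
}


def route(alert: dict) -> list[str]:
    """Return the channels for an alert; security alerts are also copied to the SIEM."""
    # Unknown or missing priorities fall back to the least intrusive channel.
    channels = list(ROUTES.get(alert.get("priority"), ["email"]))
    if alert.get("category") == "security":
        channels.append("siem")
    return channels
```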
Full-Length Practical Section
Hands-on tasks for building strong alerting and on-call practices.
Practical 1: Define Escalation Policy in PagerDuty or Opsgenie
Create escalation chain:
• primary → secondary → team lead
• timed escalation windows
• overnight fallback routing
Test policy manually.
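Before configuring the policy in a paging tool, it can help to model the timed windows and check them against expectations. This is a tool-independent sketch; the 10 and 25 minute windows and the target names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    target: str          # who gets notified at this step
    after_minutes: int   # minutes without acknowledgement before this step starts


# Hypothetical chain mirroring primary -> secondary -> team lead.
CHAIN = [Step("primary", 0), Step("secondary", 10), Step("team-lead", 25)]


def notified_so_far(minutes_unacked: int, chain=CHAIN) -> list[str]:
    """Everyone who should have been notified after `minutes_unacked` minutes."""
    return [s.target for s in chain if minutes_unacked >= s.after_minutes]
```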
Practical 2: Configure Service-level Alerts
Define SLO-based thresholds:
• latency > 300ms
• error rate > 2%
• CPU > 90% for 5 minutes
Turn them into actionable alerts.
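The "for 5 minutes" qualifier matters: a threshold only becomes actionable when the breach is sustained rather than a single spike. The helper below is a minimal sketch of that check, assuming samples arrive as `(unix_timestamp, value)` pairs; the function name is hypothetical.

```python
def breached_for(samples, threshold, duration_s):
    """True if every sample in the trailing `duration_s` seconds exceeds `threshold`.

    samples: list of (unix_timestamp, value) pairs sorted by time.
    Returns False when the samples do not yet span the full duration.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    window_start = latest - duration_s
    if samples[0][0] > window_start:
        return False  # not enough history to cover the window
    window = [v for t, v in samples if t >= window_start]
    return all(v > threshold for v in window)
```

For the CPU example, this would be called with `threshold=90` and `duration_s=300`.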
Practical 3: Create Security Alerts From Falco
In Falcosidekick (config.yaml; each output is a top-level key):
slack:
  webhookurl: <url>
Send real-time alerts for:
• unexpected shell
• privilege escalation
• host access attempts
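To illustrate what happens between a Falco event and a chat notification, the sketch below filters Falco's JSON output by priority and builds a Slack incoming-webhook body. It relies on the `priority`, `rule`, and `output` fields of Falco's JSON events; the function name and the default minimum priority are assumptions.

```python
import json

# Falco priorities from most to least severe.
PRIORITIES = ["Emergency", "Alert", "Critical", "Error",
              "Warning", "Notice", "Informational", "Debug"]


def slack_payload(event_json: str, minimum: str = "Warning"):
    """Return a Slack webhook payload for a Falco event, or None if below `minimum`."""
    event = json.loads(event_json)
    # A higher index means a less severe event.
    if PRIORITIES.index(event["priority"]) > PRIORITIES.index(minimum):
        return None
    text = f"[{event['priority']}] {event['rule']}: {event['output']}"
    return {"text": text}
```

Filtering by priority before notifying keeps low-severity events out of the on-call channel while still leaving them available in logs.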
Practical 4: Create Alert for Kubernetes Pod Crashes
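Prometheus rule (the restart counter is cumulative, so alert on its recent increase rather than its lifetime total; the 1h window is an example):
increase(kube_pod_container_status_restarts_total[1h]) > 5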
Route to on-call engineer.
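Restart counts are cumulative counters, so the meaningful signal is how much the count grew recently, not its absolute value. The sketch below shows that idea on a plain list of samples; it is similar in spirit to PromQL's `increase()` but does not reproduce its extrapolation, and the function names are hypothetical.

```python
def restart_increase(samples):
    """Growth of a cumulative restart counter across `samples` (oldest first).

    A drop between samples is treated as a counter reset, so the new value
    counts in full.
    """
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def crash_looping(samples, max_restarts=5):
    """True if the counter grew by more than `max_restarts` over the samples."""
    return restart_increase(samples) > max_restarts
```

A pod that restarted 43 times last month but only 3 times in the current window would not page, which is usually the desired behavior.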
Practical 5: Alert on Admission Controller Denials
Forward API server audit logs:
• filter events with denied status
• send alerts to SIEM or Slack
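The filtering step could look roughly like the sketch below. It assumes one JSON audit event per line and that a denial shows up as a 4xx `responseStatus` whose message mentions "denied"; the exact code and wording depend on the admission controller, so treat both conditions as assumptions to adjust.

```python
import json


def admission_denials(audit_lines):
    """Yield short summaries of audit events that look like admission denials."""
    for line in audit_lines:
        event = json.loads(line)
        status = event.get("responseStatus", {})
        message = status.get("message", "")
        # Assumed heuristic: client-error status plus "denied" in the message.
        if 400 <= status.get("code", 0) < 500 and "denied" in message.lower():
            ref = event.get("objectRef", {})
            yield {
                "user": event.get("user", {}).get("username"),
                "resource": ref.get("resource"),
                "namespace": ref.get("namespace"),
                "message": message,
            }
```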
Practical 6: Detect CI/CD Pipeline Anomalies
Monitor:
• unexpected workflow runs
• new privileged steps
• new secrets added
Send alerts to on-call security.
Practical 7: Notification Grouping
Configure Alertmanager to group alerts:
• by service
• by namespace
• by severity
Suppress duplicates.
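In Alertmanager, grouping is configured with `group_by` on a route. To make the behavior concrete, here is a small sketch of the same idea applied to alert label sets; the function name and label keys are assumptions for illustration.

```python
from collections import defaultdict


def group_alerts(alerts, keys=("service", "namespace", "severity")):
    """Group alert label dicts by `keys`, dropping exact duplicates within a group."""
    groups = defaultdict(list)
    for labels in alerts:
        group_key = tuple(labels.get(k, "") for k in keys)
        if labels not in groups[group_key]:   # suppress verbatim repeats
            groups[group_key].append(labels)
    return dict(groups)
```

Each resulting group would become one notification instead of one page per alert.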
Practical 8: On-call Rotation Setup
Create weekly rotation:
week1: engineer A
week2: engineer B
week3: engineer C
Document handoff guidelines.
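A weekly rotation like this reduces to simple date arithmetic. The sketch below assumes handoff happens on the weekday the rotation started; the function name and the start date passed in are illustrative.

```python
from datetime import date

ENGINEERS = ["engineer A", "engineer B", "engineer C"]


def on_call(day: date, start: date, engineers=ENGINEERS) -> str:
    """Engineer on call on `day`, for a weekly rotation that began on `start`."""
    if day < start:
        raise ValueError("day is before the rotation started")
    weeks_elapsed = (day - start).days // 7
    return engineers[weeks_elapsed % len(engineers)]
```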
Practical 9: Build Incident Runbook Template
Include:
• immediate steps
• log commands
• rollback steps
• owner contact
• validation checks
Use version control to track updates.
Practical 10: Create Alert for Unauthorized Registry Usage
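Prometheus rule (assuming a custom counter named registry_unapproved_image_total exported by your admission or scanning tooling; the 10m window is an example):
increase(registry_unapproved_image_total[10m]) > 0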
Route to DevSecOps.
Practical 11: Alert on Suspicious Kubernetes API Calls
From audit logs:
• create clusterrolebinding
• delete secrets
• patch deployments
• exec into pods
Forward alerts to SIEM.
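The four calls listed above can be expressed as match rules over the `verb` and `objectRef` fields of a parsed audit event. This is a minimal sketch with a hypothetical rule table; real deployments would usually also exclude known service accounts.

```python
# (verb, resource, subresource); None matches anything.
SUSPICIOUS = [
    ("create", "clusterrolebindings", None),
    ("delete", "secrets", None),
    ("patch", "deployments", None),
    (None, "pods", "exec"),   # exec may be logged as create or get depending on transport
]


def is_suspicious(event: dict) -> bool:
    """True if a parsed Kubernetes audit event matches one of the rules above."""
    ref = event.get("objectRef", {})
    actual = (event.get("verb"), ref.get("resource"), ref.get("subresource"))
    return any(
        all(want is None or want == got for want, got in zip(rule, actual))
        for rule in SUSPICIOUS
    )
```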
Practical 12: Alert on Host Node Issues
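Prometheus (node_exporter metrics, expressed as ratios):
node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.15
These rules catch disk and memory exhaustion before it destabilizes a node.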
Practical 13: Alert on Attack Indicators
Detect:
• brute-force login attempts
• port scans in cluster
• outbound connections to unusual IPs
Route to security on-call.
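Brute-force detection is typically a sliding-window count of failures per source. The class below is a minimal sketch of that technique; the class name, the limit of 5, and the 60 second window are assumptions, not recommended values.

```python
from collections import defaultdict, deque


class BruteForceDetector:
    """Flags a source once it has `limit` failed logins within `window_s` seconds."""

    def __init__(self, limit=5, window_s=60):
        self.limit = limit
        self.window_s = window_s
        self._failures = defaultdict(deque)   # source -> timestamps of recent failures

    def record_failure(self, source: str, ts: float) -> bool:
        """Record a failed login at time `ts`; return True if the source is now flagged."""
        recent = self._failures[source]
        recent.append(ts)
        while recent and recent[0] <= ts - self.window_s:
            recent.popleft()                  # drop failures outside the window
        return len(recent) >= self.limit
```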
Practical 14: Automatic Annotation of Alerts
Add metadata:
• pod name
• namespace
• logs link
• dashboard link
Improves triage speed.
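As a sketch of automatic annotation, the function below copies pod and namespace labels into an alert's annotations and derives log and dashboard links from them. The base URLs and query parameter names are placeholders, and the alert shape (`labels` and `annotations` dicts) is an assumption modeled on Alertmanager-style alerts.

```python
from urllib.parse import quote

# Placeholder base URLs; replace with your logging and dashboard endpoints.
LOGS_BASE = "https://logs.example.com/search"
DASHBOARD_BASE = "https://grafana.example.com/d/pods"


def annotate(alert: dict) -> dict:
    """Return a copy of `alert` with triage links added to its annotations."""
    labels = alert.get("labels", {})
    pod, ns = labels.get("pod", ""), labels.get("namespace", "")
    extra = {
        "pod": pod,
        "namespace": ns,
        "logs_url": f"{LOGS_BASE}?namespace={quote(ns)}&pod={quote(pod)}",
        "dashboard_url": f"{DASHBOARD_BASE}?var-namespace={quote(ns)}&var-pod={quote(pod)}",
    }
    # Build a new dict so the incoming alert is left unmodified.
    return {**alert, "annotations": {**alert.get("annotations", {}), **extra}}
```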
Practical 15: Create PagerDuty Integration for P1 Incidents
Configure:
• phone calls
• SMS
• app push notifications
Ensure 24/7 response.
Practical 16: Producing Weekly On-call Reports
Generate summaries:
• number of alerts
• severity distribution
• noise alerts identified
• runbook updates needed
Review with team.
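The summary can be generated from an export of the week's alerts. The sketch below assumes each alert record carries a `name`, a `severity`, and an `actionable` flag set by the responder; those field names are hypothetical.

```python
from collections import Counter


def weekly_report(alerts):
    """Summarize a week of alerts.

    Non-actionable alerts are counted as noise candidates for tuning.
    """
    noise = Counter(a["name"] for a in alerts if not a["actionable"])
    return {
        "total": len(alerts),
        "by_severity": dict(Counter(a["severity"] for a in alerts)),
        "noisiest": noise.most_common(3),   # top candidates for threshold tuning
    }
```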
Practical 17: Run Incident Simulation
Simulate:
• pod crash
• API server outage
• CI pipeline compromise
Review on-call response.
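Practical 18: Set Up Alert Silencing During Deployments
Alertmanager does not read deployment annotations, so create a time-boxed silence from the deployment pipeline with amtool or the Alertmanager API:
amtool silence add service=<service> --duration=15m --comment="deploy"
This prevents false positives while the rollout is in progress.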
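As a sketch of scripting a time-boxed silence from a deployment pipeline, the function below builds a request body for Alertmanager's v2 silences endpoint. It assumes alerts carry a `service` label; the function name and comment text are illustrative, and actually sending the request is left out.

```python
from datetime import datetime, timedelta, timezone


def deployment_silence(service: str, minutes: int, author: str, now=None) -> dict:
    """Build a body for POST /api/v2/silences covering one service for `minutes`."""
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(minutes=minutes)
    return {
        "matchers": [{"name": "service", "value": service, "isRegex": False}],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),   # the silence expires on its own
        "createdBy": author,
        "comment": f"deployment of {service}",
    }
```

Giving every silence an explicit end time avoids the common failure mode where a deployment silence is forgotten and hides a real incident.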
Practical 19: Test Escalation Paths
Trigger synthetic alerts to ensure primary and secondary responders receive notifications.
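A synthetic alert can be injected by posting to Alertmanager's v2 alerts endpoint. The sketch below only builds the request body; the alert name, the `team` and `synthetic` labels, and the 5 minute lifetime are assumptions chosen so that drills are easy to tell apart from real pages.

```python
from datetime import datetime, timedelta, timezone


def synthetic_alert(team: str, severity: str = "P1", ttl_minutes: int = 5, now=None) -> list:
    """Build a body for POST /api/v2/alerts that expires after `ttl_minutes`."""
    start = now or datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": "SyntheticEscalationTest",  # hypothetical name
            "severity": severity,
            "team": team,
            "synthetic": "true",                     # lets routes and reports single out drills
        },
        "annotations": {"summary": "Escalation path test, no action required"},
        "startsAt": start.isoformat(),
        "endsAt": (start + timedelta(minutes=ttl_minutes)).isoformat(),
    }]
```

Leaving the alert unacknowledged on purpose then shows whether the secondary responder is paged within the configured window.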
Practical 20: Build Complete Alerting & On-call Architecture
Architecture includes:
• Prometheus → Alertmanager → PagerDuty
• Falco → Sidekick → SIEM + Slack
• CI/CD anomaly alerts
• SLO-based service alerts
• automated alert grouping + de-duplication
• runbooks + escalation policies
• on-call rotations
• incident simulations
• post-incident reviews
This provides comprehensive and reliable DevSecOps alerting with strong on-call readiness.
Intel Dump
• alerting detects reliability and security issues in real time
• on-call practices define response, escalation, and responsibility
• alerts must be actionable, urgent, and noise-free
• use SLO-based thresholds, Falco alerts, API audit monitoring, and CI/CD anomaly detection
• practicals included escalation policy setup, pipeline monitoring, admission log alerting, runbook creation, incident simulation, and full alerting architecture