Alerting & On-call Practices

Alerting and on-call practices ensure that security and reliability issues are detected quickly, routed to the correct team, and resolved with minimal delay. In DevSecOps environments, alerting must be predictable, actionable, and low-noise so that engineers can respond effectively during incidents. On-call practices define how teams rotate responsibilities, escalate issues, document responses, and maintain resilience under real-world conditions.

Foundations of Alerting

Alerting triggers when monitoring systems detect conditions that require human attention. Good alerting prevents:

• missed outages
• delayed security detection
• alert fatigue
• duplicated escalation
• unclear ownership

Alerting must be tied to clear service-level expectations and defined boundaries.

What Makes a Good Alert

A good alert must be:

• actionable
• urgent
• routed to the correct team
• specific enough to guide investigation
• reproducible
• tied to documented runbooks

Bad alerts create noise and draw attention away from real incidents.

Alert Types in DevSecOps

Reliability Alerts

Triggered by:

• high latency
• pod failures
• CrashLoopBackOff
• scaling failures
• ingress errors
• degraded API performance

Security Alerts

Triggered by:

• Falco events
• API abuse
• failed authorization attempts
• CI/CD anomalies
• image signature failures
• admission controller denials

Infrastructure Alerts

Triggered by:

• node failures
• disk pressure
• network anomalies
• storage issues

Compliance Alerts

Triggered by:

• policy violations
• resource configuration drift
• unapproved registry usage

The category of an alert determines which team receives it.

On-call Structure

An on-call system defines:

• who responds
• when they respond
• how incidents are escalated
• how handoffs occur
• how incidents are documented

On-call schedules rotate regularly to ensure consistent coverage.

On-call Roles

Primary On-call

First responder. Handles alerts immediately.

Secondary On-call

Supports primary in complex incidents.

Incident Commander

Coordinates communication during major incidents.

Communications Lead

Handles updates, status pages, and stakeholder notifications.

Clear role separation prevents confusion during real incidents.

Escalation Rules

Escalations are triggered when:

• primary cannot resolve the issue
• incident severity exceeds threshold
• issue affects production for a prolonged time
• security breach indicators appear

Escalation paths must be predefined and documented.

SLAs and SLOs for Alerting

Alerting ties into service-level expectations:

• SLA → external promise
• SLO → internal target
• SLI → measured metric

Alerts should fire when SLOs are at risk, not after SLAs are broken.
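
One common way to detect that an SLO is at risk is to measure how fast the error budget is being consumed. The sketch below is a minimal illustration of that idea, not a prescribed method: the function name, the 99.9% target, and the request counts are all assumptions chosen for the example.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    1.0 means errors arrive exactly at the rate the SLO allows;
    values above 1.0 mean the budget will run out early.
    """
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target        # allowed failure ratio, e.g. 0.001
    observed_error_ratio = failed / total  # measured SLI over the window
    return observed_error_ratio / error_budget


# Hypothetical window: 50 failed requests out of 10,000, SLO target 99.9%
rate = burn_rate(failed=50, total=10_000, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")
```

Alerting on a sustained burn rate above 1 gives the team time to act before the external SLA is affected.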

Runbooks

Runbooks contain:

• steps to investigate
• expected log files
• mitigation actions
• rollback steps
• contact points
• validation steps

Runbooks must be accessible, version-controlled, and kept up to date.

Alert Prioritization

P1 – Critical

Production down, major breach indicator.

P2 – High

Service degraded, potential security compromise.

P3 – Medium

Minor functionality issues or suspicious activity.

P4 – Low

Informational or non-urgent.

Prioritization avoids overloading on-call staff.

Reducing Alert Noise

Noise reduction includes:

• threshold tuning
• suppression during deployments
• grouping repeated alerts
• routing to relevant teams
• eliminating duplicate alerts
• avoiding alerts for expected behavior

Good alert hygiene keeps the signal-to-noise ratio high, so real incidents stand out.

Alert Routing

Routing sends alerts to:

• Slack
• Teams
• PagerDuty
• Opsgenie
• Email
• SIEM/SOAR
• SMS and phone calls

Each routing path must match severity requirements.
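
Matching routing paths to severity can be expressed as a simple lookup. The sketch below is illustrative only: the channel names and the mapping itself are hypothetical, and a real setup would configure this in the alerting tool rather than in application code.

```python
# Hypothetical mapping from priority to notification channels.
ROUTES = {
    "P1": ["pagerduty", "sms", "slack"],
    "P2": ["pagerduty", "slack"],
    "P3": ["slack"],
    "P4": ["email"],
}


def channels_for(priority: str) -> list[str]:
    """Return the channels for a priority; unknown priorities fall back to email."""
    return ROUTES.get(priority.upper(), ["email"])


for priority in ("P1", "P3", "unknown"):
    print(priority, "->", channels_for(priority))
```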


Full-Length Practical Section

Hands-on tasks for building strong alerting and on-call practices.


Practical 1: Define Escalation Policy in PagerDuty or Opsgenie

Create escalation chain:

• primary → secondary → team lead
• timed escalation windows
• overnight fallback routing

Test policy manually.
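
Before configuring the policy in a paging tool, it can help to model the chain and its timed windows. This is a minimal sketch under assumed timings (0, 15, and 30 minutes); the class, the target names, and the timings are hypothetical and do not reflect the PagerDuty or Opsgenie APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    target: str
    after_minutes: int  # minutes without acknowledgement before this step is paged


# Hypothetical chain: primary -> secondary -> team lead
POLICY = [
    EscalationStep("primary", 0),
    EscalationStep("secondary", 15),
    EscalationStep("team-lead", 30),
]


def targets_to_page(minutes_unacknowledged: int, policy=POLICY) -> list[str]:
    """Return every target that should have been paged by now."""
    return [s.target for s in policy if minutes_unacknowledged >= s.after_minutes]


print(targets_to_page(20))
```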


Practical 2: Configure Service-level Alerts

Define SLO-based thresholds:

• latency > 300ms
• error rate > 2%
• CPU > 90% for 5 minutes

Turn them into actionable alerts.
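
A threshold such as "CPU > 90% for 5 minutes" only fires when the breach is sustained. The following sketch shows one way to evaluate that condition over a list of samples; the function name and sample layout are assumptions for illustration, and monitoring systems normally evaluate this for you.

```python
def sustained_breach(samples, threshold, for_seconds):
    """Check whether the value stayed above threshold for at least for_seconds.

    samples: list of (unix_timestamp, value) tuples sorted by time.
    The breach must still be ongoing at the latest sample.
    """
    if not samples:
        return False
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts   # breach begins here
        else:
            breach_start = None     # any dip resets the timer
    latest_ts = samples[-1][0]
    return breach_start is not None and latest_ts - breach_start >= for_seconds


# Hypothetical CPU samples, one per minute, all at 95%
cpu = [(t, 95.0) for t in range(0, 301, 60)]
print(sustained_breach(cpu, threshold=90.0, for_seconds=300))
```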


Practical 3: Create Security Alerts From Falco

In the Falcosidekick configuration (config.yaml), the Slack output is a top-level key:

slack:
  webhookurl: <url>

Send real-time alerts for:

• unexpected shell
• privilege escalation
• host access attempts


Practical 4: Create Alert for Kubernetes Pod Crashes

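Prometheus rule (requires kube-state-metrics; the restart metric is a cumulative counter, so alert on its recent increase rather than its raw value):

increase(kube_pod_container_status_restarts_total[1h]) > 5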

Route to on-call engineer.


Practical 5: Alert on Admission Controller Denials

Forward API server audit logs:

• filter events with denied status
• send alerts to SIEM or Slack
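
The filtering step can be sketched as follows. It assumes audit events arrive as JSON lines and that admission webhook denials carry a responseStatus message containing "denied the request"; which fields are present depends on the audit policy level, so treat the field checks and the sample events as assumptions.

```python
import json


def denied_events(lines):
    """Yield audit events whose response message indicates an admission denial."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        status = event.get("responseStatus") or {}
        if "denied the request" in status.get("message", ""):
            yield event


# Hypothetical audit log lines
sample = [
    json.dumps({"verb": "create", "responseStatus": {"code": 201}}),
    json.dumps({
        "verb": "create",
        "responseStatus": {
            "code": 403,
            "status": "Failure",
            "message": 'admission webhook "policy.example.com" denied the request: image not allowed',
        },
    }),
]

for event in denied_events(sample):
    print(event["responseStatus"]["code"], event["responseStatus"]["message"])
```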


Practical 6: Detect CI/CD Pipeline Anomalies

Monitor:

• unexpected workflow runs
• new privileged steps
• new secrets added

Send alerts to on-call security.


Practical 7: Notification Grouping

Configure Alertmanager to group alerts:

• by service
• by namespace
• by severity

Suppress duplicates.
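
To make the grouping behavior concrete, the sketch below groups alerts by service, namespace, and severity and drops exact duplicates. It only illustrates the concept; it is not Alertmanager configuration, and the alert dictionaries are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, keys=("service", "namespace", "severity")):
    """Group alert dicts by the given label keys, dropping exact duplicates."""
    groups = defaultdict(list)
    for alert in alerts:
        group_key = tuple(alert.get(k, "") for k in keys)
        if alert not in groups[group_key]:
            groups[group_key].append(alert)
    return dict(groups)


# Hypothetical alerts: the first two are identical
alerts = [
    {"name": "HighLatency", "service": "api", "namespace": "prod", "severity": "P2"},
    {"name": "HighLatency", "service": "api", "namespace": "prod", "severity": "P2"},
    {"name": "ErrorRate", "service": "api", "namespace": "prod", "severity": "P2"},
    {"name": "DiskPressure", "service": "db", "namespace": "prod", "severity": "P3"},
]

for key, members in group_alerts(alerts).items():
    print(key, [a["name"] for a in members])
```

Each group would then produce one notification instead of one per alert.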


Practical 8: On-call Rotation Setup

Create weekly rotation:

week1: engineer A  
week2: engineer B  
week3: engineer C

Document handoff guidelines.
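
A weekly rotation like the one above can be computed from a fixed start date. This is a minimal sketch: the start date is an assumption (a Monday picked for illustration), and paging tools normally manage schedules for you.

```python
from datetime import date

ENGINEERS = ["engineer A", "engineer B", "engineer C"]
ROTATION_START = date(2024, 1, 1)  # hypothetical start of week1 (a Monday)


def on_call_for(day: date) -> str:
    """Return who is on call on the given day for a repeating weekly rotation."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return ENGINEERS[weeks_elapsed % len(ENGINEERS)]


print(on_call_for(date(2024, 1, 10)))
```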


Practical 9: Build Incident Runbook Template

Include:

• immediate steps
• log commands
• rollback steps
• owner contact
• validation checks

Use version control to track updates.


Practical 10: Create Alert for Unauthorized Registry Usage

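Prometheus rule (registry_unapproved_image_total is not a standard metric; this assumes a custom counter exported by your admission controller or image scanner):

increase(registry_unapproved_image_total[5m]) > 0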

Route to DevSecOps.


Practical 11: Alert on Suspicious Kubernetes API Calls

From audit logs:

• create clusterrolebinding
• delete secrets
• patch deployments
• exec into pods

Forward alerts to SIEM.
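
The four call patterns above map onto the verb, objectRef.resource, and objectRef.subresource fields of a Kubernetes audit event. The sketch below matches on those fields; the rule set and sample events are illustrative assumptions, and exec is matched on the subresource alone because the recorded verb can differ between client versions.

```python
# (verb, resource) pairs taken from the list above
SUSPICIOUS_CALLS = {
    ("create", "clusterrolebindings"),
    ("delete", "secrets"),
    ("patch", "deployments"),
}


def is_suspicious(event: dict) -> bool:
    """Return True if an audit event matches one of the watched API calls."""
    ref = event.get("objectRef") or {}
    resource = ref.get("resource", "")
    # exec into a pod shows up as the "exec" subresource of pods
    if resource == "pods" and ref.get("subresource") == "exec":
        return True
    return (event.get("verb", ""), resource) in SUSPICIOUS_CALLS


# Hypothetical audit events
events = [
    {"verb": "delete", "objectRef": {"resource": "secrets", "namespace": "prod"}},
    {"verb": "get", "objectRef": {"resource": "pods", "namespace": "prod"}},
    {"verb": "create", "objectRef": {"resource": "pods", "subresource": "exec"}},
]
print([is_suspicious(e) for e in events])
```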


Practical 12: Alert on Host Node Issues

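Prometheus (node_exporter metrics; thresholds are written as ratios because PromQL has no % literal):

node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.15

These alerts warn before a node runs out of disk or memory.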


Practical 13: Alert on Attack Indicators

Detect:

• brute-force login attempts
• port scans in cluster
• outbound connections to unusual IPs

Route to security on-call.
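
Brute-force detection usually comes down to counting failures per source inside a sliding time window. The sketch below shows that technique in isolation; the class name and the thresholds (5 failures in 60 seconds) are assumptions, and production detection would typically live in a SIEM or in Falco rules.

```python
from collections import defaultdict, deque


class FailedLoginDetector:
    """Flag a source once it reaches max_failures within window_seconds."""

    def __init__(self, max_failures: int = 5, window_seconds: int = 60):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = defaultdict(deque)  # source -> recent failure timestamps

    def record_failure(self, source: str, ts: float) -> bool:
        recent = self._failures[source]
        recent.append(ts)
        # Drop failures that fell out of the sliding window
        while ts - recent[0] > self.window_seconds:
            recent.popleft()
        return len(recent) >= self.max_failures


detector = FailedLoginDetector(max_failures=3, window_seconds=10)
for ts in (0, 1, 2):
    alert = detector.record_failure("203.0.113.7", ts)
print("alert:", alert)
```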


Practical 14: Automatic Annotation of Alerts

Add metadata:

• pod name
• namespace
• logs link
• dashboard link

Improves triage speed.


Practical 15: Create PagerDuty Integration for P1 Incidents

Configure:

• phone calls
• SMS
• app push notifications

Ensure 24/7 response.


Practical 16: Producing Weekly On-call Reports

Generate summaries:

• number of alerts
• severity distribution
• noise alerts identified
• runbook updates needed

Review with team.
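
The summary fields listed above can be derived from a simple alert log. This sketch assumes each alert record carries name, severity, and two optional review flags (actionable, runbook_gap); those field names are hypothetical and would be filled in during on-call review.

```python
from collections import Counter


def weekly_summary(alerts):
    """Summarize a week's alerts for the on-call review."""
    by_severity = Counter(a["severity"] for a in alerts)
    noisy = Counter(a["name"] for a in alerts if not a.get("actionable", True))
    return {
        "total": len(alerts),
        "by_severity": dict(by_severity),
        # Most frequent non-actionable alerts first
        "noise_candidates": [name for name, _ in noisy.most_common()],
        "runbook_updates": sorted({a["name"] for a in alerts if a.get("runbook_gap")}),
    }


# Hypothetical week of alerts
week = [
    {"name": "HighLatency", "severity": "P2"},
    {"name": "PodRestart", "severity": "P3", "actionable": False},
    {"name": "PodRestart", "severity": "P3", "actionable": False},
    {"name": "DiskPressure", "severity": "P2", "runbook_gap": True},
]
print(weekly_summary(week))
```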


Practical 17: Run Incident Simulation

Simulate:

• pod crash
• API server outage
• CI pipeline compromise

Review on-call response.


Practical 18: Setup Alert Silencing During Deployments

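Alertmanager does not read deployment annotations on its own. Create a time-boxed silence from the deployment pipeline instead, for example with amtool (assumes amtool is configured with the Alertmanager URL; the matcher and duration are placeholders):

amtool silence add service=<name> --duration=30m --comment="deployment in progress"

An annotation such as alertmanager.io/silence only helps if your own tooling watches for it and creates the silence.

This prevents false positives while a rollout is in progress.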


Practical 19: Test Escalation Paths

Trigger synthetic alerts to ensure primary and secondary responders receive notifications.
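
A synthetic alert can be built as a JSON payload in the shape accepted by the Alertmanager v2 alerts endpoint (a list of objects with labels, annotations, startsAt, and endsAt). The sketch below only constructs the payload; the alert name, label values, and the decision not to send it are assumptions for illustration.

```python
import json
from datetime import datetime, timedelta, timezone


def synthetic_alert(service: str, severity: str, minutes: int = 5) -> list[dict]:
    """Build a short-lived test alert that expires on its own."""
    now = datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": "SyntheticEscalationTest",  # hypothetical name
            "service": service,
            "severity": severity,
        },
        "annotations": {
            "summary": "Synthetic alert for escalation testing; no action needed",
        },
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


body = json.dumps(synthetic_alert("checkout", "P1"), indent=2)
print(body)
# Sending is left out on purpose: with a reachable Alertmanager, this body
# would be POSTed to <alertmanager-url>/api/v2/alerts as application/json.
```

Label the test alert clearly so responders can tell it apart from a real incident, and confirm that both primary and secondary were notified before it expires.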


Practical 20: Build Complete Alerting & On-call Architecture

Architecture includes:

• Prometheus → Alertmanager → PagerDuty
• Falco → Sidekick → SIEM + Slack
• CI/CD anomaly alerts
• SLO-based service alerts
• automated alert grouping + de-duplication
• runbooks + escalation policies
• on-call rotations
• incident simulations
• post-incident reviews

This provides comprehensive and reliable DevSecOps alerting with strong on-call readiness.


Intel Dump

• alerting detects reliability and security issues in real time
• on-call practices define response, escalation, and responsibility
• alerts must be actionable, urgent, and noise-free
• use SLO-based thresholds, Falco alerts, API audit monitoring, and CI/CD anomaly detection
• practicals included escalation policy setup, pipeline monitoring, admission log alerting, runbook creation, incident simulation, and full alerting architecture
