Alert Rules

Alert rules define the conditions under which DIBOP sends notifications. This page is a detailed reference for creating and managing alert rules.


Rule Structure

Every alert rule has the following components:

| Component | Description |
| --- | --- |
| Name | Unique, descriptive name for the rule |
| Description | Explanation of what the rule monitors and why |
| Type | The category of alert (SLA Breach, Execution Failure, etc.) |
| Scope | What the rule applies to (all orchestrations, specific orchestrations, or specific systems) |
| Condition | The threshold that triggers the alert |
| Notification | How and where to send the alert |
| Cooldown | Minimum time between repeated notifications for the same rule |
| Enabled | Whether the rule is currently active |
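
To make the components concrete, here is a minimal sketch of a rule modeled as a Python dataclass. This is not DIBOP's actual schema or API: the class name `AlertRule`, the field types, the `["*"]` convention for "all", and the keys inside `condition` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class AlertRule:
    """Hypothetical in-memory model of the rule components (not DIBOP's schema)."""
    name: str                  # unique, descriptive name
    description: str           # what the rule monitors and why
    type: str                  # e.g. "SLA Breach", "Execution Failure"
    scope: list[str]           # assumed convention: ["*"] means everything
    condition: dict            # threshold settings; keys vary by rule type
    notifications: list[dict]  # one entry per notification channel
    cooldown: timedelta        # minimum time between repeated notifications
    enabled: bool              # whether the rule is currently active


overall_sla = AlertRule(
    name="Overall SLA",
    description="Catch systemic issues across all orchestrations",
    type="SLA Breach",
    scope=["*"],
    condition={"below_pct": 99, "window_days": 7},  # hypothetical keys
    notifications=[{"channel": "email", "recipients": ["ops-team@example.com"]}],
    cooldown=timedelta(hours=4),
    enabled=True,
)
```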

Alert Rule Types

Execution Failure

Fires when any execution of the scoped orchestration(s) fails.

| Setting | Description | Default |
| --- | --- | --- |
| Scope | All orchestrations, or select specific ones | All |
| Trigger On | Any failure, or only critical failures (where error handling did not recover) | Any |

Consecutive Failures

Fires when a specific number of executions fail in a row for the same orchestration.

| Setting | Description | Default |
| --- | --- | --- |
| Scope | Select orchestration(s) | Required |
| Threshold | Number of consecutive failures | 3 |

Why Consecutive Matters

A single failure might be a transient issue. Three consecutive failures suggest a persistent problem that needs human attention.
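
The difference between "any failure" and "consecutive failures" can be sketched as a count of failures at the end of an orchestration's execution history. The function names and the list-of-booleans history are assumptions for illustration; DIBOP's internal evaluation is not documented here.

```python
def trailing_failures(outcomes: list[bool]) -> int:
    """Count failures at the end of the history (True = success, False = failure)."""
    count = 0
    for succeeded in reversed(outcomes):
        if succeeded:
            break  # a success ends the failure streak
        count += 1
    return count


def consecutive_failures_reached(outcomes: list[bool], threshold: int = 3) -> bool:
    """True once the most recent executions contain `threshold` failures in a row."""
    return trailing_failures(outcomes) >= threshold
```

With the default threshold of 3, a history ending in fail, fail, fail triggers the rule, while fail, fail, success, fail does not, because the success resets the streak.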

SLA Breach

Fires when the success rate drops below a threshold over a rolling time window.

| Setting | Description | Default |
| --- | --- | --- |
| Scope | All orchestrations, or select specific ones | All |
| Metric | Success rate percentage | 99% |
| Window | Rolling time window | 7 days |
| Evaluate Every | How often the condition is checked | 1 hour |
| Minimum Executions | Minimum number of executions in the window for the rule to apply | 10 |

The "minimum executions" setting prevents false alerts for low-volume orchestrations where a single failure could cause a large percentage drop.
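
The following sketch shows how a minimum-executions guard interacts with the success-rate threshold. It assumes the success and total counts for the rolling window have already been collected; the function name and parameters are hypothetical, with defaults taken from the table above.

```python
def sla_breached(successes: int, total: int,
                 threshold_pct: float = 99.0,
                 min_executions: int = 10) -> bool:
    """Return True when the success rate in the window is below the threshold."""
    # Too few executions: the rule does not apply, so it cannot fire.
    if total == 0 or total < min_executions:
        return False
    success_rate = 100.0 * successes / total
    return success_rate < threshold_pct
```

For example, 4 successes out of 5 executions is an 80% success rate, but the rule stays silent because fewer than 10 executions ran. 98 successes out of 100 executions is below 99% and would count as a breach.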

System Down

Fires when a connected system fails health checks.

| Setting | Description | Default |
| --- | --- | --- |
| Scope | All systems, or select specific ones | All |
| Consecutive Failures | Number of failed health checks before alerting | 2 |
| Check Interval | How often health checks run | 5 minutes |
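
These two settings together determine roughly how long an outage can last before the rule fires. The sketch below works through that arithmetic under stated assumptions: health checks run at a fixed interval, the alert fires at the Nth consecutive failed check, and the time a check takes to time out is ignored. The function name is hypothetical.

```python
def detection_delay_bounds(consecutive_failures: int = 2,
                           check_interval_min: float = 5.0) -> tuple[float, float]:
    """Return (best_case, worst_case) minutes from outage start to alert."""
    if consecutive_failures < 1:
        raise ValueError("consecutive_failures must be at least 1")
    # Best case: the outage starts just before a check, so the first failure
    # is observed almost immediately and the rest follow one interval apart.
    best = (consecutive_failures - 1) * check_interval_min
    # Worst case: the outage starts just after a successful check, so a full
    # interval passes before the first failure is observed.
    worst = consecutive_failures * check_interval_min
    return best, worst


print(detection_delay_bounds())  # (5.0, 10.0)
```

Under these assumptions, the defaults mean an outage is reported between about 5 and 10 minutes after it begins. Lowering either setting shortens that delay at the cost of more alerts for brief blips.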

Latency Threshold

Fires when the average API response time exceeds a limit.

| Setting | Description | Default |
| --- | --- | --- |
| Scope | Select system(s) | Required |
| Threshold | Maximum average latency in milliseconds | 5000 |
| Window | Rolling time window for the average | 1 hour |
| Minimum Calls | Minimum calls in the window for the rule to apply | 5 |
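
A rolling-window average with a minimum-calls guard might be evaluated along these lines. This is a sketch only: the `(timestamp, latency_ms)` tuple format and the function name are assumptions, and the defaults mirror the table above.

```python
from datetime import datetime, timedelta


def latency_exceeded(calls: list[tuple[datetime, float]],
                     now: datetime,
                     threshold_ms: float = 5000,
                     window: timedelta = timedelta(hours=1),
                     min_calls: int = 5) -> bool:
    """calls: (timestamp, latency_ms) pairs. True if the windowed average is too high."""
    # Keep only the calls that fall inside the rolling window ending at `now`.
    recent = [ms for ts, ms in calls if now - window <= ts <= now]
    # Too few calls in the window: the rule does not apply.
    if not recent or len(recent) < min_calls:
        return False
    return sum(recent) / len(recent) > threshold_ms
```

Because the rule uses an average, a few very slow calls can be masked by many fast ones in the same window.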

Error Rate Spike

Fires when the error rate exceeds a percentage over a time window.

| Setting | Description | Default |
| --- | --- | --- |
| Scope | All systems, or select specific ones | All |
| Threshold | Maximum error rate percentage | 10% |
| Window | Rolling time window | 1 hour |
| Minimum Calls | Minimum calls in the window for the rule to apply | 10 |

Notification Configuration

Email

channel: email
recipients:
  - admin@example.com
  - ops-team@example.com
include_trace_link: true
include_suggested_actions: true

Webhook

channel: webhook
url: https://hooks.example.com/dibop-alerts
method: POST
headers:
  Authorization: Bearer <token>
  Content-Type: application/json
retry_on_failure: true
max_retries: 3
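
The retry settings above can be read as "one initial attempt plus up to `max_retries` further attempts". The sketch below illustrates that reading. It is an assumption, not documented DIBOP behavior: the function name, the exponential backoff, and the injectable `send` callable (which returns True on successful delivery) are all hypothetical.

```python
import time


def deliver_with_retries(send, max_retries: int = 3, base_delay: float = 1.0) -> bool:
    """Call `send()` until it succeeds or the retries are exhausted."""
    for attempt in range(max_retries + 1):  # initial attempt + retries
        try:
            if send():
                return True
        except OSError:
            pass  # treat network errors like a failed delivery
        if attempt < max_retries:
            # Assumed backoff: 1x, 2x, 4x, ... the base delay between attempts.
            time.sleep(base_delay * 2 ** attempt)
    return False
```

A receiving endpoint that is briefly unavailable can therefore still get the alert, but one that stays down through every attempt will miss it, so the receiver should not be the only record of the alert.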

Slack

channel: slack
webhook_url: https://hooks.slack.com/services/T00/B00/xxx
channel_name: "#ops-alerts"
mention: "@oncall"

Multiple Channels

A single rule can notify through multiple channels. For example, one rule can email the team, post to Slack, and fire a webhook to PagerDuty.
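
Fanning one alert out to several channels could look like the sketch below. It assumes each channel is represented by a callable that raises on failure, and that a failure in one channel should not block the others. Both are illustrative choices rather than documented DIBOP behavior, and `notify_all` is a hypothetical name.

```python
def notify_all(channels: dict, message: str) -> dict:
    """Send `message` through every channel; report success per channel."""
    results = {}
    for name, send in channels.items():
        try:
            send(message)
            results[name] = True
        except Exception:
            # One broken channel should not stop delivery to the rest.
            results[name] = False
    return results
```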


Cooldown Period

The cooldown period prevents alert fatigue by limiting how often a rule can fire.

| Cooldown | Meaning |
| --- | --- |
| 5 minutes | After firing, the rule will not fire again for 5 minutes |
| 1 hour | After firing, the rule will not fire again for 1 hour |
| 4 hours | After firing, the rule will not fire again for 4 hours; suitable for SLA-level alerts that need time to recover |

During the cooldown, the condition may remain true, but no additional notifications are sent. If the condition is still true when the cooldown expires, a new notification is sent.
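
The cooldown behavior described above can be sketched as a small gate that remembers when the rule last fired. The class and method names are hypothetical, and the sketch assumes the cooldown is measured from the moment of the last notification.

```python
from datetime import datetime, timedelta


class CooldownGate:
    """Suppress repeat notifications until the cooldown has elapsed."""

    def __init__(self, cooldown: timedelta):
        self.cooldown = cooldown
        self.last_fired = None  # no notification sent yet

    def should_notify(self, condition_true: bool, now: datetime) -> bool:
        if not condition_true:
            return False
        # Still inside the cooldown: the condition holds, but stay silent.
        if self.last_fired is not None and now - self.last_fired < self.cooldown:
            return False
        self.last_fired = now
        return True
```

With a 1-hour cooldown, a condition that stays true yields a notification at the first evaluation, nothing 30 minutes later, and a second notification once the hour has elapsed.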


Managing Rules

Viewing All Rules

Navigate to MONITOR > Alert Rules to see all configured rules in a table:

| Column | Description |
| --- | --- |
| Name | Rule name |
| Type | Alert type |
| Scope | What it applies to |
| Status | Enabled, Disabled, or Silenced |
| Last Fired | When the rule last triggered |
| Fires (30d) | How many times it fired in the last 30 days |

Editing a Rule

  1. Click on a rule name to open its editor
  2. Modify any settings
  3. Click Save

Changes take effect on the next evaluation cycle.

Disabling a Rule

Click the toggle in the Status column to disable a rule. Disabled rules are not evaluated and will not fire.

Deleting a Rule

  1. Click the three-dot menu on a rule
  2. Select Delete
  3. Confirm the deletion

Deleting a rule also deletes its firing history. If you may need the rule again, disable it instead.


Suggested Starter Rules

For a new DIBOP deployment, consider these starting rules:

| Rule | Type | Threshold | Why |
| --- | --- | --- | --- |
| Overall SLA | SLA Breach | < 99% over 7d | Catch systemic issues |
| Critical Orchestration Failures | Consecutive Failures | 3 in a row | Detect persistent problems |
| System Health | System Down | 2 failed checks | Know when a system goes offline |
| Slow API | Latency Threshold | > 5000ms avg over 1h | Detect performance degradation |
| Error Spike | Error Rate Spike | > 10% over 1h | Catch sudden increases in errors |

Start Simple

Begin with 3-5 rules and refine based on experience. It is better to have a few well-tuned rules than many rules that create noise.


Troubleshooting

| Issue | Cause | Solution |
| --- | --- | --- |
| Alert not firing | Cooldown period still active | Wait for cooldown to expire |
| Alert not firing | Rule is disabled or silenced | Check the rule status |
| Alert not firing | Minimum executions/calls not met | Lower the minimum threshold |
| Too many alerts | Threshold too sensitive | Relax the threshold or lengthen the cooldown |
| Email not received | Email in spam folder | Check spam; add DIBOP to allowed senders |
| Webhook not received | URL unreachable from DIBOP | Verify the webhook URL is accessible |

Next Steps