Alert Rules¶
Alert rules define the conditions under which DIBOP sends notifications. This page is a detailed reference for creating and managing alert rules.
Rule Structure¶
Every alert rule has the following components:
| Component | Description |
|---|---|
| Name | Unique, descriptive name for the rule |
| Description | Explanation of what the rule monitors and why |
| Type | The category of alert (SLA Breach, Execution Failure, etc.) |
| Scope | What the rule applies to (all orchestrations, specific orchestrations, or specific systems) |
| Condition | The threshold that triggers the alert |
| Notification | How and where to send the alert |
| Cooldown | Minimum time between repeated notifications for the same rule |
| Enabled | Whether the rule is currently active |
Alert Rule Types¶
Execution Failure¶
Fires when any execution of the scoped orchestration(s) fails.
| Setting | Description | Default |
|---|---|---|
| Scope | All orchestrations, or select specific ones | All |
| Trigger On | Any failure, or only critical failures (where error handling did not recover) | Any |
Consecutive Failures¶
Fires when a specific number of executions fail in a row for the same orchestration.
| Setting | Description | Default |
|---|---|---|
| Scope | Select orchestration(s) | Required |
| Threshold | Number of consecutive failures | 3 |
Why Consecutive Matters
A single failure might be a transient issue. Three consecutive failures suggest a persistent problem that needs human attention.
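The check itself is simple. A minimal Python sketch of the idea (function and argument names are illustrative, not DIBOP's API):

```python
def consecutive_failure_alert(statuses, threshold=3):
    """Return True when the most recent `threshold` executions all failed.

    `statuses` lists execution outcomes ordered oldest to newest,
    e.g. ["success", "failure", "failure", "failure"].
    Illustrative sketch only; not DIBOP's implementation.
    """
    if len(statuses) < threshold:
        return False
    # Only the tail matters: a success anywhere in the last
    # `threshold` executions resets the streak.
    return all(s == "failure" for s in statuses[-threshold:])
```

Note that a single success breaks the streak, so a flapping orchestration (fail, succeed, fail) never fires this rule; the Error Rate Spike type covers that pattern instead.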
SLA Breach¶
Fires when the success rate drops below a threshold over a rolling time window.
| Setting | Description | Default |
|---|---|---|
| Scope | All orchestrations, or select specific ones | All |
| Metric | Minimum success rate percentage | 99% |
| Window | Rolling time window | 7 days |
| Evaluate Every | How often the condition is checked | 1 hour |
| Minimum Executions | Minimum number of executions in the window for the rule to apply | 10 |
The "minimum executions" setting prevents false alerts for low-volume orchestrations where a single failure could cause a large percentage drop.
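A rough sketch of how this evaluation could work, assuming the rule sees the list of outcomes that fell inside the rolling window (names are illustrative, not DIBOP's implementation):

```python
def sla_breach(executions, min_success_rate=99.0, min_executions=10):
    """Evaluate the SLA Breach condition for one rolling window.

    `executions` holds the outcomes ("success"/"failure") inside the
    window. Fires only when the window has enough executions AND the
    success rate is below the threshold. Illustrative sketch only.
    """
    if len(executions) < min_executions:
        # Too few executions: one failure would swing the percentage
        # wildly, so the rule is skipped to avoid false alerts.
        return False
    successes = sum(1 for e in executions if e == "success")
    rate = 100.0 * successes / len(executions)
    return rate < min_success_rate
```

With the defaults, one failure among four executions (75% success) does not fire because the window is below the 10-execution minimum, while one failure among twenty (95%) does.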
System Down¶
Fires when a connected system fails health checks.
| Setting | Description | Default |
|---|---|---|
| Scope | All systems, or select specific ones | All |
| Consecutive Failures | Number of failed health checks before alerting | 2 |
| Check Interval | How often health checks run | 5 minutes |
Latency Threshold¶
Fires when the average API response time exceeds a limit.
| Setting | Description | Default |
|---|---|---|
| Scope | Select system(s) | Required |
| Threshold | Maximum average latency in milliseconds | 5000 |
| Window | Rolling time window for the average | 1 hour |
| Minimum Calls | Minimum calls in the window for the rule to apply | 5 |
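The same window-plus-minimum pattern applies here, but over an average rather than a rate. A hedged sketch (illustrative names, not DIBOP's implementation):

```python
def latency_alert(latencies_ms, threshold_ms=5000, min_calls=5):
    """Fire when the average latency over the window exceeds the limit.

    `latencies_ms` holds the response times (ms) of calls inside the
    rolling window. Fewer than `min_calls` samples skips the rule,
    mirroring the Minimum Calls setting. Illustrative sketch only.
    """
    if len(latencies_ms) < min_calls:
        return False
    return sum(latencies_ms) / len(latencies_ms) > threshold_ms
```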
Error Rate Spike¶
Fires when the error rate exceeds a percentage over a time window.
| Setting | Description | Default |
|---|---|---|
| Scope | All systems, or select specific ones | All |
| Threshold | Maximum error rate percentage | 10% |
| Window | Rolling time window | 1 hour |
| Minimum Calls | Minimum calls in the window for the rule to apply | 10 |
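The evaluation mirrors the SLA Breach check, applied to API calls instead of executions. A sketch under the same assumptions (illustrative names only):

```python
def error_rate_spike(calls, max_error_rate=10.0, min_calls=10):
    """Fire when the error percentage over the window exceeds the limit.

    `calls` is a list of booleans, True for an errored call. Windows
    below `min_calls` are skipped, per the Minimum Calls setting.
    Illustrative sketch only; not DIBOP's implementation.
    """
    if len(calls) < min_calls:
        return False
    rate = 100.0 * sum(calls) / len(calls)
    return rate > max_error_rate
```

Note the comparison is strictly greater than: with the defaults, exactly 10% errors does not fire.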
Notification Configuration¶
Email¶
```yaml
channel: email
recipients:
  - admin@example.com
  - ops-team@example.com
include_trace_link: true
include_suggested_actions: true
```
Webhook¶
```yaml
channel: webhook
url: https://hooks.example.com/dibop-alerts
method: POST
headers:
  Authorization: Bearer <token>
  Content-Type: application/json
retry_on_failure: true
max_retries: 3
```
Slack¶
```yaml
channel: slack
webhook_url: https://hooks.slack.com/services/T00/B00/xxx
channel_name: "#ops-alerts"
mention: "@oncall"
```
Multiple Channels¶
A single rule can notify through multiple channels. For example, send an email to the team AND post to Slack AND fire a webhook to PagerDuty.
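One way to picture multi-channel delivery is a simple fan-out over the per-channel configs. Everything below (the sender functions, the `SENDERS` registry) is hypothetical and only mirrors the shape of the YAML snippets above:

```python
def send_email(alert, cfg):
    # Placeholder sender, for illustration only.
    print(f"email -> {cfg['recipients']}: {alert}")

def send_slack(alert, cfg):
    # Placeholder sender, for illustration only.
    print(f"slack -> {cfg['channel_name']}: {alert}")

# Hypothetical registry mapping channel type to a send function.
SENDERS = {"email": send_email, "slack": send_slack}

def dispatch(alert, channels):
    """Fan one alert out to every configured channel.

    `channels` is a list of dicts shaped like the YAML snippets
    above, e.g. {"channel": "email", "recipients": [...]}.
    """
    for cfg in channels:
        SENDERS[cfg["channel"]](alert, cfg)
    return [cfg["channel"] for cfg in channels]
```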
Cooldown Period¶
The cooldown period prevents alert fatigue by limiting how often a rule can fire.
| Cooldown | Meaning |
|---|---|
| 5 minutes | After firing, the rule will not fire again for 5 minutes |
| 1 hour | After firing, the rule will not fire again for 1 hour |
| 4 hours | Suitable for SLA-level alerts that need time to recover |
During the cooldown, the condition may continue to be true, but no additional notifications are sent. When the cooldown expires and the condition is still true, a new notification is sent.
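The behavior described above can be sketched as a small gate that a rule passes through before notifying (illustrative Python, not DIBOP's implementation):

```python
from datetime import datetime, timedelta

class CooldownGate:
    """Suppress repeat notifications while a cooldown is active.

    The rule's condition may keep matching, but `should_notify`
    returns True at most once per cooldown period. Illustrative
    sketch only; not DIBOP's implementation.
    """

    def __init__(self, cooldown: timedelta):
        self.cooldown = cooldown
        self.last_fired = None

    def should_notify(self, now: datetime, condition_true: bool) -> bool:
        if not condition_true:
            return False
        if self.last_fired is not None and now - self.last_fired < self.cooldown:
            return False  # still cooling down: swallow the notification
        self.last_fired = now
        return True
```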
Managing Rules¶
Viewing All Rules¶
Navigate to MONITOR > Alert Rules to see all configured rules in a table:
| Column | Description |
|---|---|
| Name | Rule name |
| Type | Alert type |
| Scope | What it applies to |
| Status | Enabled, Disabled, or Silenced |
| Last Fired | When the rule last triggered |
| Fires (30d) | How many times it fired in the last 30 days |
Editing a Rule¶
1. Click on a rule name to open its editor
2. Modify any settings
3. Click Save
Changes take effect immediately on the next evaluation cycle.
Disabling a Rule¶
Click the toggle in the Status column to disable a rule. Disabled rules are not evaluated and will not fire.
Deleting a Rule¶
1. Click the three-dot menu on a rule
2. Select Delete
3. Confirm the deletion
Deleting a rule also deletes its firing history. If you might need the rule again later, disable it instead.
Suggested Starter Rules¶
For a new DIBOP deployment, consider these starting rules:
| Rule | Type | Threshold | Why |
|---|---|---|---|
| Overall SLA | SLA Breach | < 99% over 7d | Catch systemic issues |
| Critical Orchestration Failures | Consecutive Failures | 3 in a row | Detect persistent problems |
| System Health | System Down | 2 failed checks | Know when a system goes offline |
| Slow API | Latency Threshold | > 5000ms avg over 1h | Detect performance degradation |
| Error Spike | Error Rate Spike | > 10% over 1h | Catch sudden increases in errors |
Start Simple
Begin with 3-5 rules and refine based on experience. It is better to have a few well-tuned rules than many rules that create noise.
Troubleshooting¶
| Issue | Cause | Solution |
|---|---|---|
| Alert not firing | Cooldown period still active | Wait for cooldown to expire |
| Alert not firing | Rule is disabled or silenced | Check the rule status |
| Alert not firing | Minimum executions/calls not met | Lower the minimum threshold |
| Too many alerts | Threshold too sensitive | Increase the threshold or cooldown |
| Email not received | Email in spam folder | Check spam; add DIBOP to allowed senders |
| Webhook not received | URL unreachable from DIBOP | Verify the webhook URL is accessible |
Next Steps¶
- Setting Up Alerts -- overview and getting started guide
- Observability Dashboard -- see alert indicators in context
- Platform Metrics -- understand the metrics that feed your rules