Alert Rules¶
Alert rules define the conditions under which DIBOP sends notifications. This page is a detailed reference for creating and managing alert rules.
Rule Structure¶
Every alert rule has the following components:
| Component | Description |
|---|---|
| Name | Unique, descriptive name for the rule |
| Description | Explanation of what the rule monitors and why |
| Type | The category of alert (SLA Breach, Execution Failure, etc.) |
| Scope | What the rule applies to (all orchestrations, specific orchestrations, or specific systems) |
| Condition | The threshold that triggers the alert |
| Notification | How and where to send the alert |
| Cooldown | Minimum time between repeated notifications for the same rule |
| Enabled | Whether the rule is currently active |
Alert Rule Types¶
Execution Failure¶
Fires when any execution of the scoped orchestration(s) fails.
| Setting | Description | Default |
|---|---|---|
| Scope | All orchestrations, or select specific ones | All |
| Trigger On | Any failure, or only critical failures (where error handling did not recover) | Any |
Consecutive Failures¶
Fires when a specific number of executions fail in a row for the same orchestration.
| Setting | Description | Default |
|---|---|---|
| Scope | Select orchestration(s) | Required |
| Threshold | Number of consecutive failures | 3 |
Why Consecutive Matters
A single failure might be a transient issue. Three consecutive failures suggest a persistent problem that needs human attention.
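The check itself is simple. A minimal Python sketch of the idea (function and argument names are illustrative, not DIBOP's API):

```python
def consecutive_failure_alert(statuses, threshold=3):
    """Return True when the most recent `threshold` executions all failed.

    `statuses` lists execution outcomes ordered oldest to newest,
    e.g. ["success", "failure", "failure", "failure"].
    Illustrative sketch only; not DIBOP's implementation.
    """
    if len(statuses) < threshold:
        return False
    # Only the tail matters: a success anywhere in the last
    # `threshold` executions resets the streak.
    return all(s == "failure" for s in statuses[-threshold:])
```

Note that a single success breaks the streak, so a flapping orchestration (fail, succeed, fail) never fires this rule; the Error Rate Spike type covers that pattern instead.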
SLA Breach¶
Fires when the success rate drops below a threshold over a rolling time window.
| Setting | Description | Default |
|---|---|---|
| Scope | All orchestrations, or select specific ones | All |
| Metric | Minimum success rate percentage | 99% |
| Window | Rolling time window | 7 days |
| Evaluate Every | How often the condition is checked | 1 hour |
| Minimum Executions | Minimum number of executions in the window for the rule to apply | 10 |
The "minimum executions" setting prevents false alerts for low-volume orchestrations where a single failure could cause a large percentage drop.
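A rough sketch of how this evaluation could work, assuming the rule sees the list of outcomes that fell inside the rolling window (names are illustrative, not DIBOP's implementation):

```python
def sla_breach(executions, min_success_rate=99.0, min_executions=10):
    """Evaluate the SLA Breach condition for one rolling window.

    `executions` holds the outcomes ("success"/"failure") inside the
    window. Fires only when the window has enough executions AND the
    success rate is below the threshold. Illustrative sketch only.
    """
    if len(executions) < min_executions:
        # Too few executions: one failure would swing the percentage
        # wildly, so the rule is skipped to avoid false alerts.
        return False
    successes = sum(1 for e in executions if e == "success")
    rate = 100.0 * successes / len(executions)
    return rate < min_success_rate
```

With the defaults, one failure among four executions (75% success) does not fire because the window is below the 10-execution minimum, while one failure among twenty (95%) does.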
System Down¶
Fires when a connected system fails health checks.
| Setting | Description | Default |
|---|---|---|
| Scope | All systems, or select specific ones | All |
| Consecutive Failures | Number of failed health checks before alerting | 2 |
| Check Interval | How often health checks run | 5 minutes |
Latency Threshold¶
Fires when the average API response time exceeds a limit.
| Setting | Description | Default |
|---|---|---|
| Scope | Select system(s) | Required |
| Threshold | Maximum average latency in milliseconds | 5000 |
| Window | Rolling time window for the average | 1 hour |
| Minimum Calls | Minimum calls in the window for the rule to apply | 5 |
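The same window-plus-minimum pattern applies here, but over an average rather than a rate. A hedged sketch (illustrative names, not DIBOP's implementation):

```python
def latency_alert(latencies_ms, threshold_ms=5000, min_calls=5):
    """Fire when the average latency over the window exceeds the limit.

    `latencies_ms` holds the response times (ms) of calls inside the
    rolling window. Fewer than `min_calls` samples skips the rule,
    mirroring the Minimum Calls setting. Illustrative sketch only.
    """
    if len(latencies_ms) < min_calls:
        return False
    return sum(latencies_ms) / len(latencies_ms) > threshold_ms
```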
Error Rate Spike¶
Fires when the error rate exceeds a percentage over a time window.
| Setting | Description | Default |
|---|---|---|
| Scope | All systems, or select specific ones | All |
| Threshold | Maximum error rate percentage | 10% |
| Window | Rolling time window | 1 hour |
| Minimum Calls | Minimum calls in the window for the rule to apply | 10 |
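The evaluation mirrors the SLA Breach check, applied to API calls instead of executions. A sketch under the same assumptions (illustrative names only):

```python
def error_rate_spike(calls, max_error_rate=10.0, min_calls=10):
    """Fire when the error percentage over the window exceeds the limit.

    `calls` is a list of booleans, True for an errored call. Windows
    below `min_calls` are skipped, per the Minimum Calls setting.
    Illustrative sketch only; not DIBOP's implementation.
    """
    if len(calls) < min_calls:
        return False
    rate = 100.0 * sum(calls) / len(calls)
    return rate > max_error_rate
```

Note the comparison is strictly greater than: with the defaults, exactly 10% errors does not fire.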
Notification Configuration¶
Email¶
```yaml
channel: email
recipients:
  - admin@example.com
  - ops-team@example.com
include_trace_link: true
include_suggested_actions: true
```
Webhook¶
```yaml
channel: webhook
url: https://hooks.example.com/dibop-alerts
method: POST
headers:
  Authorization: Bearer <token>
  Content-Type: application/json
retry_on_failure: true
max_retries: 3
```
Slack¶
```yaml
channel: slack
webhook_url: https://hooks.slack.com/services/T00/B00/xxx
channel_name: "#ops-alerts"
mention: "@oncall"
```
Multiple Channels¶
A single rule can notify through multiple channels. For example, send an email to the team AND post to Slack AND fire a webhook to PagerDuty.
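One way to picture multi-channel delivery is a simple fan-out over the per-channel configs. Everything below (the sender functions, the `SENDERS` registry) is hypothetical and only mirrors the shape of the YAML snippets above:

```python
def send_email(alert, cfg):
    # Placeholder sender, for illustration only.
    print(f"email -> {cfg['recipients']}: {alert}")

def send_slack(alert, cfg):
    # Placeholder sender, for illustration only.
    print(f"slack -> {cfg['channel_name']}: {alert}")

# Hypothetical registry mapping channel type to a send function.
SENDERS = {"email": send_email, "slack": send_slack}

def dispatch(alert, channels):
    """Fan one alert out to every configured channel.

    `channels` is a list of dicts shaped like the YAML snippets
    above, e.g. {"channel": "email", "recipients": [...]}.
    """
    for cfg in channels:
        SENDERS[cfg["channel"]](alert, cfg)
    return [cfg["channel"] for cfg in channels]
```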
Cooldown Period¶
The cooldown period prevents alert fatigue by limiting how often a rule can fire.
| Cooldown | Meaning |
|---|---|
| 5 minutes | After firing, the rule will not fire again for 5 minutes |
| 1 hour | After firing, the rule will not fire again for 1 hour |
| 4 hours | Suitable for SLA-level alerts that need time to recover |
During the cooldown, the condition may continue to be true, but no additional notifications are sent. When the cooldown expires and the condition is still true, a new notification is sent.
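The behavior described above can be sketched as a small gate that a rule passes through before notifying (illustrative Python, not DIBOP's implementation):

```python
from datetime import datetime, timedelta

class CooldownGate:
    """Suppress repeat notifications while a cooldown is active.

    The rule's condition may keep matching, but `should_notify`
    returns True at most once per cooldown period. Illustrative
    sketch only; not DIBOP's implementation.
    """

    def __init__(self, cooldown: timedelta):
        self.cooldown = cooldown
        self.last_fired = None

    def should_notify(self, now: datetime, condition_true: bool) -> bool:
        if not condition_true:
            return False
        if self.last_fired is not None and now - self.last_fired < self.cooldown:
            return False  # still cooling down: swallow the notification
        self.last_fired = now
        return True
```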
Managing Rules¶
Viewing All Rules¶
Navigate to MONITOR > Alert Rules to see all configured rules in a table:
| Column | Description |
|---|---|
| Name | Rule name |
| Type | Alert type |
| Scope | What it applies to |
| Status | Enabled, Disabled, or Silenced |
| Last Fired | When the rule last triggered |
| Fires (30d) | How many times it fired in the last 30 days |
Editing a Rule¶
1. Click on a rule name to open its editor
2. Modify any settings
3. Click Save
Changes take effect immediately on the next evaluation cycle.
Disabling a Rule¶
Click the toggle in the Status column to disable a rule. Disabled rules are not evaluated and will not fire.
Deleting a Rule¶
1. Click the three-dot menu on a rule
2. Select Delete
3. Confirm the deletion
Deleting a rule also deletes its firing history. If you might need the rule again later, disable it instead.
Suggested Starter Rules¶
For a new DIBOP deployment, consider these starting rules:
| Rule | Type | Threshold | Why |
|---|---|---|---|
| Overall SLA | SLA Breach | < 99% over 7d | Catch systemic issues |
| Critical Orchestration Failures | Consecutive Failures | 3 in a row | Detect persistent problems |
| System Health | System Down | 2 failed checks | Know when a system goes offline |
| Slow API | Latency Threshold | > 5000ms avg over 1h | Detect performance degradation |
| Error Spike | Error Rate Spike | > 10% over 1h | Catch sudden increases in errors |
Start Simple
Begin with 3-5 rules and refine based on experience. It is better to have a few well-tuned rules than many rules that create noise.
Troubleshooting¶
| Issue | Cause | Solution |
|---|---|---|
| Alert not firing | Cooldown period still active | Wait for cooldown to expire |
| Alert not firing | Rule is disabled or silenced | Check the rule status |
| Alert not firing | Minimum executions/calls not met | Lower the minimum threshold |
| Too many alerts | Threshold too sensitive | Increase the threshold or cooldown |
| Email not received | Email in spam folder | Check spam; add DIBOP to allowed senders |
| Webhook not received | URL unreachable from DIBOP | Verify the webhook URL is accessible |
Next Steps¶
- Setting Up Alerts -- overview and getting started guide
- Observability Dashboard -- see alert indicators in context
- Platform Metrics -- understand the metrics that feed your rules