Incidents

Report, track, and communicate service disruptions to your users.

Overview

Incidents are the primary way to communicate service issues to your users. A well-managed incident:

Keeps users informed during outages
Builds trust through transparency
Documents issues for future reference

Incident Lifecycle

An incident typically moves through these statuses, in order:

| Status | Description | Typical Duration |
|---|---|---|
| Investigating | Issue reported, team is looking into it | 5-30 minutes |
| Identified | Root cause found, working on fix | 15-60 minutes |
| Monitoring | Fix deployed, watching for stability | 15-30 minutes |
| Resolved | Issue fully fixed | Final state |

Creating Incidents

Manual Creation

1. Navigate to Dashboard > Incidents.
2. Click "New Incident".
3. Fill in the required fields:
   - Title: Clear, concise description (e.g., "API Response Delays")
   - Status: Starting status (usually "Investigating")
   - Impact: Severity level (Minor, Major, Critical)
   - Affected Components: Select one or more components
   - Message: Initial update explaining the situation
4. Click "Create Incident".
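
The required fields can also be checked before a draft is submitted. The following is a minimal sketch, not the product's own validation: it assumes the field names used in the API examples in this document (`title`, `status`, `impact`, `message`, `componentIds`), and the helper `validate_incident`, the lowercase value sets, and the `monitoring` value are assumptions made for illustration.

```python
# Hypothetical helper: checks a draft incident before it is submitted.
REQUIRED_FIELDS = ("title", "status", "impact", "message", "componentIds")
# Lowercase values are assumed from the API examples in this document.
ALLOWED_STATUSES = {"investigating", "identified", "monitoring", "resolved"}
ALLOWED_IMPACTS = {"minor", "major", "critical"}


def validate_incident(draft: dict) -> list[str]:
    """Return a list of problems; an empty list means the draft looks complete."""
    problems = []
    for field in REQUIRED_FIELDS:
        # Empty strings and empty lists count as missing.
        if not draft.get(field):
            problems.append(f"missing required field: {field}")
    status = draft.get("status")
    if status and status not in ALLOWED_STATUSES:
        problems.append(f"unknown status: {status}")
    impact = draft.get("impact")
    if impact and impact not in ALLOWED_IMPACTS:
        problems.append(f"unknown impact: {impact}")
    return problems


draft = {
    "title": "API Response Delays",
    "status": "investigating",
    "impact": "major",
    "message": "We are investigating reports of slow API responses.",
    "componentIds": [],  # no component selected yet
}
print(validate_incident(draft))  # ['missing required field: componentIds']
```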

Using Templates

For consistent messaging:

1. Click "New Incident".
2. Click "Use Template".
3. Select a template.
4. Customize the pre-filled content.
5. Create the incident.

Create templates for common incident types like "Database Issues", "Network Outage", or "Third-party Provider Down".

Automatic Creation

ENDPOINT components can create incidents automatically:

1. Edit the ENDPOINT component.
2. Enable "Auto Create Incident".
3. Set "Failure Threshold" (the number of consecutive failed checks before an incident is created).
4. Configure auto-resolve settings (see Auto-Resolution).

When the threshold is reached:

An incident is created with "Investigating" status
Affected components are set to "Major Outage"
Subscribers are notified
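
The threshold rule can be pictured as a check over the most recent health-check results. This is an illustrative sketch only: the function name `should_open_incident` and the list-of-booleans input are assumptions, and the product's internal logic may differ.

```python
def should_open_incident(check_results: list[bool], failure_threshold: int) -> bool:
    """True when the most recent `failure_threshold` checks all failed.

    `check_results` is ordered oldest to newest; True means the check passed.
    """
    if failure_threshold < 1 or len(check_results) < failure_threshold:
        return False
    # Any passing check inside the window breaks the run of consecutive failures.
    return not any(check_results[-failure_threshold:])


print(should_open_incident([True, False, False, False], 3))  # True
print(should_open_incident([False, False, True, False], 3))  # False
```

A single successful check resets the count, so intermittent failures below the threshold do not open an incident.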

Adding Updates

Keep users informed with regular updates:

1. Open the incident.
2. Click "Add Update".
3. Select the new status.
4. Write the update message.
5. Optionally update component status.
6. Click "Post Update".

Update Guidelines

| Phase | Frequency | Content |
|---|---|---|
| Investigating | Every 15-20 min | What we know, what we're checking |
| Identified | Every 20-30 min | Root cause, ETA if known |
| Monitoring | Every 30-60 min | Fix status, stability observations |
| Resolved | Once | Summary, apology if appropriate |
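
One way to act on these frequencies is to compute a deadline for the next update. The sketch below uses the upper bound of each interval in the table; the helper `next_update_due` and the lowercase status keys are assumptions for illustration and are not part of the product.

```python
from datetime import datetime, timedelta
from typing import Optional

# Upper bound of each update interval from the guidelines, in minutes.
MAX_UPDATE_INTERVAL_MIN = {"investigating": 20, "identified": 30, "monitoring": 60}


def next_update_due(status: str, last_update: datetime) -> Optional[datetime]:
    """Latest time the next update should go out.

    Returns None for "resolved" (or any status without an interval),
    because no further update is expected.
    """
    minutes = MAX_UPDATE_INTERVAL_MIN.get(status)
    if minutes is None:
        return None
    return last_update + timedelta(minutes=minutes)


last = datetime(2024, 1, 1, 12, 0)
print(next_update_due("investigating", last))  # 2024-01-01 12:20:00
print(next_update_due("resolved", last))       # None
```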

Status Transitions

Typical progression:

Investigating → Identified → Monitoring → Resolved

You can skip statuses (e.g., go directly from Investigating to Resolved for quick fixes).
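
The progression and the skip rule can be expressed as a comparison of positions in the lifecycle. This is a sketch under a stated assumption: this page does not say whether an incident may move backward (for example, from Monitoring to Identified), so the forward-only rule below is an assumption, and `can_move_to` is a hypothetical helper.

```python
LIFECYCLE = ["investigating", "identified", "monitoring", "resolved"]


def can_move_to(current: str, new: str) -> bool:
    """True if `new` is the same status as `current` or a later one.

    Staying on the same status is allowed because several updates may be
    posted within one phase. Skipping ahead is allowed. Moving backward is
    rejected here, which is an assumption rather than documented behavior.
    Raises ValueError for a status outside the lifecycle.
    """
    return LIFECYCLE.index(new) >= LIFECYCLE.index(current)


print(can_move_to("investigating", "resolved"))  # True (skips two statuses)
print(can_move_to("monitoring", "identified"))   # False (backward)
```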

Resolving Incidents

When the issue is fixed:

1. Open the incident.
2. Click "Add Update".
3. Set status to "Resolved".
4. Write a resolution message:
   - Confirm the fix
   - Explain what was done
   - Apologize if appropriate
5. Important: Set affected components back to "Operational".
6. Click "Post Update".

Auto-Resolution

For ENDPOINT components with auto-incidents:

1. Edit the component.
2. Enable "Auto Resolve".
3. Set "Recovery Threshold" (the number of consecutive successful checks before the incident is resolved).

The incident resolves automatically when:

Health checks succeed for the number of consecutive checks set by the recovery threshold
The component returns to Operational
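
Taken together, the failure and recovery thresholds behave like a small state machine. The sketch below is one plausible reading of that behavior, not the product's implementation: the class `AutoIncidentMonitor` and its return values are hypothetical.

```python
class AutoIncidentMonitor:
    """Tracks consecutive health-check results for one ENDPOINT component."""

    def __init__(self, failure_threshold: int, recovery_threshold: int):
        self.failure_threshold = failure_threshold
        self.recovery_threshold = recovery_threshold
        self.incident_open = False
        self._streak = 0  # consecutive results counting toward the next change

    def record(self, check_passed: bool) -> str:
        """Record one check and return 'opened', 'resolved', or 'no_change'."""
        # With no incident open we count failures; with one open we count successes.
        counts = check_passed if self.incident_open else not check_passed
        self._streak = self._streak + 1 if counts else 0
        if not self.incident_open and self._streak >= self.failure_threshold:
            self.incident_open, self._streak = True, 0
            return "opened"
        if self.incident_open and self._streak >= self.recovery_threshold:
            self.incident_open, self._streak = False, 0
            return "resolved"
        return "no_change"


monitor = AutoIncidentMonitor(failure_threshold=2, recovery_threshold=3)
checks = [False, False, True, False, True, True, True]
events = [monitor.record(ok) for ok in checks]
print(events)
```

In this run the incident opens on the second failure and resolves only after three successes in a row; the failure in the middle resets the recovery count.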

Postmortems

Document major incidents for learning:

1. Open a resolved incident.
2. Click "Add Postmortem".
3. Write the analysis:
   - Summary: Brief description of what happened
   - Impact: Who was affected and how
     - Duration
     - Affected users/requests
     - Financial impact (if applicable)
   - Root Cause: Why it happened
     - Technical explanation
     - Contributing factors
   - Timeline: Sequence of events
     - When detected
     - Key investigation steps
     - When fixed
   - Action Items: How to prevent recurrence
     - Immediate fixes
     - Long-term improvements
     - Process changes
4. Toggle "Publish" to show the postmortem on the status page.
5. Save.

Incident Templates

Create reusable templates:

1. Navigate to Settings > Templates.
2. Click "New Template".
3. Configure:
   - Name: Template identifier
   - Title Pattern: Default incident title
   - Impact: Default severity
   - Components: Pre-selected components
   - Message: Default update text

Template Variables

Use variables in templates:

| Variable | Description |
|---|---|
| `{{component}}` | Affected component name |
| `{{timestamp}}` | Current date/time |
| `{{status}}` | Current status |
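
To illustrate how such placeholders might be expanded, here is a minimal substitution sketch. It is not the product's renderer: `render_template` is a hypothetical helper, and tolerating spaces inside the braces, leaving unknown variables untouched, and the timestamp format are all assumptions.

```python
import re
from datetime import datetime, timezone

# Matches {{name}}, optionally with spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(text: str, values: dict[str, str]) -> str:
    """Replace {{name}} placeholders; unknown names are left as written."""
    return PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), text)


values = {
    "component": "API",
    "status": "investigating",
    "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC"),
}
print(render_template("{{component}} issue, currently {{status}} as of {{timestamp}}", values))
```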

Incident Notifications

When incidents are created or updated:

| Event | Who Gets Notified |
|---|---|
| New incident | All subscribers, on-call team |
| Update posted | Subscribers opted in to updates |
| Resolved | All subscribers |
| Postmortem published | Optional (configurable) |
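
The routing in the table can be sketched as a lookup from event to recipients. Everything named here is hypothetical: the event strings, the subscriber shape (`email`, `wants_updates`), and the `notify_on_postmortem` flag are assumptions, and the table does not say who receives a postmortem notice when it is enabled, so sending it to all subscribers is a guess.

```python
def recipients(event: str, subscribers: list[dict], on_call: list[str],
               notify_on_postmortem: bool = False) -> list[str]:
    """Return the addresses to notify for an event, following the table."""
    everyone = [s["email"] for s in subscribers]
    if event == "new_incident":
        return everyone + on_call
    if event == "update_posted":
        # Only subscribers who opted in to per-update notifications.
        return [s["email"] for s in subscribers if s.get("wants_updates")]
    if event == "resolved":
        return everyone
    if event == "postmortem_published":
        # Assumption: when enabled, the notice goes to all subscribers.
        return everyone if notify_on_postmortem else []
    raise ValueError(f"unknown event: {event}")


subs = [
    {"email": "a@example.com", "wants_updates": True},
    {"email": "b@example.com", "wants_updates": False},
]
print(recipients("update_posted", subs, ["oncall@example.com"]))  # ['a@example.com']
```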

Notification Channels

Subscribers can receive notifications via:

Email
SMS
Webhook
Slack/Discord/Teams (via notification channels)

Filtering Incidents

The incidents page supports:

Status filter: Open, Resolved, All
Impact filter: Minor, Major, Critical
Date range: Filter by creation date
Component filter: Show incidents affecting specific components
Search: Find by title or content
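
The same filters can be applied to incident data outside the dashboard. The sketch below assumes incidents are dictionaries with the field names from the API examples plus a hypothetical `createdAt` date, and it treats "Open" as any status other than resolved, which is an assumption.

```python
from datetime import date


def filter_incidents(incidents, status="all", impact=None, component_id=None,
                     created_from=None, created_to=None, search=None):
    """Return the incidents that match every filter that is set."""
    matches = []
    for inc in incidents:
        is_resolved = inc["status"] == "resolved"
        if status == "open" and is_resolved:
            continue
        if status == "resolved" and not is_resolved:
            continue
        if impact and inc["impact"] != impact:
            continue
        if component_id and component_id not in inc["componentIds"]:
            continue
        if created_from and inc["createdAt"] < created_from:
            continue
        if created_to and inc["createdAt"] > created_to:
            continue
        if search:
            # Case-insensitive match on title or message text.
            haystack = (inc["title"] + " " + inc["message"]).lower()
            if search.lower() not in haystack:
                continue
        matches.append(inc)
    return matches


incidents = [
    {"title": "API Response Delays", "message": "Slow API responses.",
     "status": "monitoring", "impact": "major",
     "componentIds": ["component-id-1"], "createdAt": date(2024, 5, 1)},
    {"title": "Database Issues", "message": "Connection errors.",
     "status": "resolved", "impact": "critical",
     "componentIds": ["component-id-2"], "createdAt": date(2024, 4, 20)},
]
open_major = filter_incidents(incidents, status="open", impact="major")
print([inc["title"] for inc in open_major])  # ['API Response Delays']
```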

API Access

Create Incident

curl -X POST http://localhost:3000/api/v1/incidents \
  -H "Authorization: Bearer sk_live_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "API Response Delays",
    "status": "investigating",
    "impact": "major",
    "message": "We are investigating reports of slow API responses.",
    "componentIds": ["component-id-1"],
    "componentStatuses": {
      "component-id-1": "degraded_performance"
    }
  }'

Add Update

# Set INCIDENT_ID to the ID of the incident you want to update
curl -X POST "http://localhost:3000/api/v1/incidents/$INCIDENT_ID/updates" \
  -H "Authorization: Bearer sk_live_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "status": "identified",
    "message": "Root cause identified as database connection issues."
  }'

Resolve

# Set INCIDENT_ID to the ID of the incident you want to resolve
curl -X POST "http://localhost:3000/api/v1/incidents/$INCIDENT_ID/updates" \
  -H "Authorization: Bearer sk_live_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "status": "resolved",
    "message": "The issue has been resolved.",
    "componentStatuses": {
      "component-id-1": "operational"
    }
  }'
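
The same calls can be made from a script. The sketch below builds the resolve request with Python's standard library, reusing the URL, headers, and payload fields from the curl examples; the helper names `build_request` and `resolve_request` and the incident ID are hypothetical, and the response format is not documented here, so the request is built but not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000/api/v1"  # same host as the curl examples
API_KEY = "sk_live_xxx"  # placeholder; substitute a real key


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def resolve_request(incident_id: str, component_ids: list[str], message: str):
    """Request that resolves an incident and sets its components to operational."""
    return build_request(
        f"/incidents/{incident_id}/updates",
        {
            "status": "resolved",
            "message": message,
            "componentStatuses": {cid: "operational" for cid in component_ids},
        },
    )


req = resolve_request("incident-id-1", ["component-id-1"], "The issue has been resolved.")
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req)
```

Setting every affected component to `operational` in the same request mirrors the manual step of restoring component status when resolving.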

Best Practices

Writing Incident Titles

Be specific but concise
Include affected service/area
Avoid jargon

Good: "API - Elevated Response Times"
Bad: "Issue with the thing"

Communication Tone

Be professional but human
Acknowledge user impact
Avoid blame language
Thank users for patience

Timing

Create incidents quickly when issues are detected
Don't wait until you have all the answers
Update regularly during active incidents
Resolve promptly when fixed