Incidents

Report, track, and communicate service disruptions to your users.

Overview

Incidents are the primary way to communicate service issues to your users. A well-managed incident:

  • Keeps users informed during outages
  • Builds trust through transparency
  • Documents issues for future reference

Incident Lifecycle

Every incident progresses through these statuses:

Status | Description | Typical Duration
Investigating | Issue reported, team is looking into it | 5-30 minutes
Identified | Root cause found, working on fix | 15-60 minutes
Monitoring | Fix deployed, watching for stability | 15-30 minutes
Resolved | Issue fully fixed | Final state

Creating Incidents

Manual Creation

  1. Navigate to Dashboard > Incidents
  2. Click "New Incident"
  3. Fill in the details:

Required fields:

  • Title: Clear, concise description (e.g., "API Response Delays")
  • Status: Starting status (usually "Investigating")
  • Impact: Severity level (Minor, Major, Critical)
  • Affected Components: Select one or more components
  • Message: Initial update explaining the situation
  4. Click "Create Incident"
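
The required fields above can be checked before a form is submitted. The following is a minimal sketch, not the product's own validation: the field names (`title`, `status`, `impact`, `components`, `message`) and the lowercase impact values are assumptions chosen for illustration.

```python
# Hypothetical field names mirroring the required fields listed above.
REQUIRED_FIELDS = ("title", "status", "impact", "components", "message")
IMPACT_LEVELS = {"minor", "major", "critical"}


def validation_errors(form):
    """Return a list of human-readable problems; an empty list means the form is complete."""
    errors = [f"{name} is required" for name in REQUIRED_FIELDS if not form.get(name)]
    impact = form.get("impact")
    if impact and impact not in IMPACT_LEVELS:
        errors.append(f"impact must be one of {sorted(IMPACT_LEVELS)}")
    return errors


draft = {
    "title": "API Response Delays",
    "status": "investigating",
    "impact": "major",
    "components": [],  # no component selected yet
    "message": "We are investigating reports of slow API responses.",
}
print(validation_errors(draft))  # ['components is required']
```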

Using Templates

For consistent messaging:

  1. Click "New Incident"
  2. Click "Use Template"
  3. Select a template
  4. Customize the pre-filled content
  5. Create the incident

Create templates for common incident types like "Database Issues", "Network Outage", or "Third-party Provider Down".

Automatic Creation

ENDPOINT components can create incidents automatically:

  1. Edit the ENDPOINT component
  2. Enable "Auto Create Incident"
  3. Set "Failure Threshold" (the number of consecutive failed checks before an incident is created)
  4. Configure auto-resolve settings

When the threshold is reached:

  • An incident is created with "Investigating" status
  • Affected components are set to "Major Outage"
  • Subscribers are notified
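
To make the threshold behavior concrete, here is a rough sketch of how consecutive failures might be counted. It is an illustration under assumptions, not the product's implementation: the function names are hypothetical, and `"major_outage"` is an assumed identifier for the "Major Outage" component status.

```python
def failure_streak(results):
    """Count consecutive failures at the end of a check history (True = healthy)."""
    streak = 0
    for ok in reversed(results):
        if ok:
            break
        streak += 1
    return streak


def auto_incident(component_id, results, failure_threshold):
    """Return a draft incident once the failure threshold is reached, else None."""
    if failure_streak(results) < failure_threshold:
        return None
    return {
        "status": "investigating",
        "componentIds": [component_id],
        # Assumed identifier for the "Major Outage" component status.
        "componentStatuses": {component_id: "major_outage"},
    }


print(auto_incident("component-id-1", [True, False, False], 3))         # None
print(auto_incident("component-id-1", [True, False, False, False], 3))  # draft incident
```

A single successful check resets the streak, so intermittent failures below the threshold never open an incident. A real monitor would also need to avoid opening a second incident while one is already active.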

Adding Updates

Keep users informed with regular updates:

  1. Open the incident
  2. Click "Add Update"
  3. Select the new status
  4. Write the update message
  5. Optionally update component status
  6. Click "Post Update"

Update Guidelines

Phase | Frequency | Content
Investigating | Every 15-20 min | What we know, what we're checking
Identified | Every 20-30 min | Root cause, ETA if known
Monitoring | Every 30-60 min | Fix status, stability observations
Resolved | Once | Summary, apology if appropriate
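
As an illustration, the cadence in the table can be turned into a simple reminder check. This sketch uses the upper bound of each range as the limit, and the function name and lowercase status identifiers are assumptions.

```python
# Upper bound of each frequency range from the table, in minutes.
MAX_MINUTES_BETWEEN_UPDATES = {
    "investigating": 20,
    "identified": 30,
    "monitoring": 60,
}


def update_overdue(status, minutes_since_last_update):
    """True when an active incident has gone longer than the guideline without an update."""
    limit = MAX_MINUTES_BETWEEN_UPDATES.get(status)
    if limit is None:
        # Resolved incidents get one final update, so no cadence applies.
        return False
    return minutes_since_last_update > limit


print(update_overdue("investigating", 25))  # True
print(update_overdue("monitoring", 45))     # False
```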

Status Transitions

Typical progression:

Investigating → Identified → Monitoring → Resolved

You can skip statuses (e.g., go directly from Investigating to Resolved for quick fixes).
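
The progression can be modeled as an ordered list, which makes it easy to see which statuses a given update skips. This is a sketch only: the guide does not say whether an incident can move backward (for example, from Monitoring to Investigating), so rejecting backward moves here is an assumption.

```python
LIFECYCLE = ["investigating", "identified", "monitoring", "resolved"]


def skipped_statuses(current, new):
    """Return the statuses bypassed when moving from `current` to `new`.

    Raises ValueError for unknown statuses and for backward moves
    (rejecting backward moves is an assumption, not documented behavior).
    """
    start = LIFECYCLE.index(current)  # list.index raises ValueError if unknown
    end = LIFECYCLE.index(new)
    if end < start:
        raise ValueError(f"{current} -> {new} moves backward in the lifecycle")
    return LIFECYCLE[start + 1:end]


print(skipped_statuses("investigating", "resolved"))    # ['identified', 'monitoring']
print(skipped_statuses("investigating", "identified"))  # []
```

Posting an update without changing the status (for example, a second "Investigating" update) returns an empty list.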

Resolving Incidents

When the issue is fixed:

  1. Open the incident
  2. Click "Add Update"
  3. Set status to "Resolved"
  4. Write a resolution message:
    • Confirm the fix
    • Explain what was done
    • Apologize if appropriate
  5. Important: Set affected components back to "Operational"
  6. Click "Post Update"

Auto-Resolution

For ENDPOINT components with auto-incidents:

  1. Edit the component
  2. Enable "Auto Resolve"
  3. Set "Recovery Threshold" (consecutive successes before resolving)

The incident resolves automatically when:

  • Health checks succeed consecutively as many times as the recovery threshold requires
  • Component returns to Operational
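
The recovery condition can be sketched as a check on the most recent health-check results. The function name is hypothetical and the logic is an assumption about how "consecutive successes" is evaluated.

```python
def should_auto_resolve(results, recovery_threshold):
    """True when the most recent `recovery_threshold` checks all succeeded (True = healthy)."""
    if recovery_threshold <= 0 or len(results) < recovery_threshold:
        return False
    return all(results[-recovery_threshold:])


print(should_auto_resolve([False, False, True, True], 2))  # True
print(should_auto_resolve([False, True, False, True], 2))  # False
```

A higher recovery threshold delays resolution but reduces the chance of resolving an incident while the service is still flapping.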

Postmortems

Document major incidents for learning:

  1. Open a resolved incident
  2. Click "Add Postmortem"
  3. Write the analysis:

Summary: Brief description of what happened

Impact: Who was affected and how

  • Duration
  • Affected users/requests
  • Financial impact (if applicable)

Root Cause: Why it happened

  • Technical explanation
  • Contributing factors

Timeline: Sequence of events

  • When detected
  • Key investigation steps
  • When fixed

Action Items: How to prevent recurrence

  • Immediate fixes
  • Long-term improvements
  • Process changes
  4. Toggle "Publish" to show on status page
  5. Save

Incident Templates

Create reusable templates:

  1. Navigate to Settings > Templates
  2. Click "New Template"
  3. Configure:
    • Name: Template identifier
    • Title Pattern: Default incident title
    • Impact: Default severity
    • Components: Pre-selected components
    • Message: Default update text

Template Variables

Use variables in templates:

Variable | Description
{{component}} | Affected component name
{{timestamp}} | Current date/time
{{status}} | Current status
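
The sketch below shows one way `{{variable}}` placeholders could be substituted. It is an illustration, not the product's renderer: leaving unknown variables untouched and formatting the timestamp as ISO 8601 are both assumptions.

```python
import re
from datetime import datetime, timezone

# Matches placeholders written exactly as {{name}}.
_VARIABLE = re.compile(r"\{\{(\w+)\}\}")


def render_template(text, values):
    """Replace each {{name}} with values[name]; unknown names are left as-is (assumed behavior)."""
    def substitute(match):
        name = match.group(1)
        return str(values.get(name, match.group(0)))
    return _VARIABLE.sub(substitute, text)


values = {
    "component": "API",
    "status": "investigating",
    # Timestamp format is an assumption.
    "timestamp": datetime.now(timezone.utc).isoformat(timespec="minutes"),
}
print(render_template("{{component}} - Elevated Response Times", values))
# API - Elevated Response Times
```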

Incident Notifications

When incidents are created or updated:

Event | Who Gets Notified
New incident | All subscribers, on-call team
Update posted | Subscribers opted in to updates
Resolved | All subscribers
Postmortem published | Optional (configurable)

Notification Channels

Subscribers can receive notifications via:

  • Email
  • SMS
  • Webhook
  • Slack/Discord/Teams (via notification channels)

Filtering Incidents

The incidents page supports:

  • Status filter: Open, Resolved, All
  • Impact filter: Minor, Major, Critical
  • Date range: Filter by creation date
  • Component filter: Show incidents affecting specific components
  • Search: Find by title or content
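
The filters combine so that an incident must match every one that is set. The sketch below illustrates that on a list of dictionaries. The `createdAt` field, the parameter names, and the reading of "Open" as "any status other than resolved" are assumptions.

```python
from datetime import date


def filter_incidents(incidents, status="all", impact=None, component_id=None,
                     created_from=None, created_to=None, search=None):
    """Return the incidents that satisfy every filter that is set."""
    matches = []
    for inc in incidents:
        is_resolved = inc["status"] == "resolved"
        if status == "open" and is_resolved:
            continue
        if status == "resolved" and not is_resolved:
            continue
        if impact and inc["impact"] != impact:
            continue
        if component_id and component_id not in inc["componentIds"]:
            continue
        if created_from and inc["createdAt"] < created_from:
            continue
        if created_to and inc["createdAt"] > created_to:
            continue
        if search:
            # Case-insensitive match against title and message content.
            haystack = f'{inc["title"]} {inc["message"]}'.lower()
            if search.lower() not in haystack:
                continue
        matches.append(inc)
    return matches
```

For example, `filter_incidents(incidents, status="open", impact="major", created_from=date(2024, 1, 1))` would return only unresolved major incidents created on or after that date.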

API Access

Create Incident

curl -X POST http://localhost:3000/api/v1/incidents \
  -H "Authorization: Bearer sk_live_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "API Response Delays",
    "status": "investigating",
    "impact": "major",
    "message": "We are investigating reports of slow API responses.",
    "componentIds": ["component-id-1"],
    "componentStatuses": {
      "component-id-1": "degraded_performance"
    }
  }'
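
The same request can be built from a script. This sketch uses only Python's standard library and mirrors the curl call above. The helper name is hypothetical, and deriving `componentIds` from the keys of `componentStatuses` is an assumption based on the sample payload.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000/api/v1"


def build_create_request(api_key, title, status, impact, message, component_statuses):
    """Build (but do not send) the POST request that creates an incident."""
    payload = {
        "title": title,
        "status": status,
        "impact": impact,
        "message": message,
        # Assumption: every component with a status is also an affected component.
        "componentIds": list(component_statuses),
        "componentStatuses": component_statuses,
    }
    return urllib.request.Request(
        f"{BASE_URL}/incidents",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_create_request(
    "sk_live_xxx",
    title="API Response Delays",
    status="investigating",
    impact="major",
    message="We are investigating reports of slow API responses.",
    component_statuses={"component-id-1": "degraded_performance"},
)
# To send it against a running instance: urllib.request.urlopen(req)
```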

Add Update

curl -X POST http://localhost:3000/api/v1/incidents/{id}/updates \
  -H "Authorization: Bearer sk_live_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "status": "identified",
    "message": "Root cause identified as database connection issues."
  }'

Resolve

curl -X POST http://localhost:3000/api/v1/incidents/{id}/updates \
  -H "Authorization: Bearer sk_live_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "status": "resolved",
    "message": "The issue has been resolved.",
    "componentStatuses": {
      "component-id-1": "operational"
    }
  }'
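
Forgetting to reset component statuses is an easy mistake when resolving by hand. A small helper can generate the resolve payload so that every affected component is set back to operational. The helper name is hypothetical, and the payload shape simply follows the curl example above.

```python
import json


def build_resolve_payload(message, affected_component_ids):
    """Build the update body that resolves an incident and restores its components."""
    return {
        "status": "resolved",
        "message": message,
        "componentStatuses": {cid: "operational" for cid in affected_component_ids},
    }


payload = build_resolve_payload("The issue has been resolved.", ["component-id-1"])
print(json.dumps(payload, indent=2))
```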

Best Practices

Writing Incident Titles

  • Be specific but concise
  • Include affected service/area
  • Avoid jargon

Good: "API - Elevated Response Times"
Bad: "Issue with the thing"

Communication Tone

  • Be professional but human
  • Acknowledge user impact
  • Avoid blame language
  • Thank users for patience

Timing

  • Create incidents quickly when issues are detected
  • Don't wait until you have all answers
  • Update regularly during active incidents
  • Resolve promptly when fixed
