Your First Incident

Learn how to create, update, and resolve incidents in ReliaPulse.

Overview

This tutorial covers the complete incident lifecycle:

  1. Creating an incident
  2. Adding updates
  3. Resolving the incident
  4. Writing a postmortem

The Incident Lifecycle

   ┌──────────────────────────────────────────────────────────────────────┐
   │                                                                      │
   │  ┌───────────────┐   ┌────────────┐   ┌────────────┐   ┌──────────┐  │
   │  │ Investigating │ → │ Identified │ → │ Monitoring │ → │ Resolved │  │
   │  └───────────────┘   └────────────┘   └────────────┘   └──────────┘  │
   │                                                                      │
   │       Report           Root cause      Fix applied        Issue      │
   │        issue             found           watching         fixed      │
   │                                                                      │
   └──────────────────────────────────────────────────────────────────────┘
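
The four stages in the diagram can be read as a simple ordered state machine. The sketch below models only that ordering; the names `Status`, `LIFECYCLE`, and `next_status` are illustrative and not part of ReliaPulse, and this tutorial does not say whether ReliaPulse enforces the order or lets you skip stages.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    INVESTIGATING = "Investigating"
    IDENTIFIED = "Identified"
    MONITORING = "Monitoring"
    RESOLVED = "Resolved"


# Order taken from the lifecycle diagram (left to right).
LIFECYCLE = [
    Status.INVESTIGATING,
    Status.IDENTIFIED,
    Status.MONITORING,
    Status.RESOLVED,
]


def next_status(current: Status) -> Optional[Status]:
    """Return the stage that follows `current`, or None once resolved."""
    index = LIFECYCLE.index(current)
    if index + 1 < len(LIFECYCLE):
        return LIFECYCLE[index + 1]
    return None
```

For example, `next_status(Status.IDENTIFIED)` returns `Status.MONITORING`, and `next_status(Status.RESOLVED)` returns `None`.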

Create an Incident

When you discover an issue:

  1. Navigate to Dashboard > Incidents

  2. Click "New Incident"

  3. Fill in the details:

    Basic Information:

    • Title: API Response Times Elevated
    • Status: Investigating
    • Impact: Choose the severity level

    Affected Components:

    • Select API (or your component)
    • Set component status to Degraded Performance

    Initial Message:

    We are investigating reports of slow API response times.
    Some users may experience delays when making requests.
    We will provide updates as we learn more.

  4. Click "Create Incident"

The incident immediately appears on your public status page, and subscribers receive notifications.

Add an Update (Identified)

Once you've found the root cause:

  1. Open the incident from the incidents list
  2. Click "Add Update"
  3. Fill in the update:
    • Status: Identified
    • Message:
      We have identified the root cause as a database connection pool
      exhaustion. Our team is working on increasing the pool size
      and implementing additional connection management.
  4. Optionally update component status (keep as Degraded Performance)
  5. Click "Post Update"

Add an Update (Monitoring)

After applying a fix:

  1. Click "Add Update" again
  2. Fill in the update:
    • Status: Monitoring
    • Message:
      A fix has been deployed to increase database connection pool
      capacity. Response times are returning to normal levels.
      We are monitoring the system to ensure stability.
  3. Click "Post Update"

Resolve the Incident

Once the issue is fully resolved:

  1. Click "Add Update"
  2. Fill in the resolution:
    • Status: Resolved
    • Message:
      This incident has been resolved. API response times have
      returned to normal levels and have been stable for the
      past 30 minutes.

      We apologize for any inconvenience caused.
  3. Important: Update component status back to Operational
  4. Click "Post Update"

The incident is now marked as resolved and moves to the incident history.
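
To make the sequence of posts concrete, here is a minimal in-memory sketch of the walkthrough above: one incident, four updates, and the component status reset on resolution. The `Incident` and `Update` classes and the `post_update` method are hypothetical and exist only for illustration; they are not ReliaPulse's data model or API, and the messages are shortened versions of the examples above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Update:
    status: str
    message: str


@dataclass
class Incident:
    title: str
    component: str
    component_status: str
    updates: List[Update] = field(default_factory=list)

    def post_update(self, status: str, message: str, component_status: str = "") -> None:
        """Append an update; change the component status only if one is given."""
        self.updates.append(Update(status, message))
        if component_status:
            self.component_status = component_status

    @property
    def status(self) -> str:
        # The incident's current status is the status of its latest update.
        return self.updates[-1].status


incident = Incident("API Response Times Elevated", "API", "Degraded Performance")
incident.post_update("Investigating", "We are investigating reports of slow API response times.")
incident.post_update("Identified", "We have identified the root cause as a database connection pool exhaustion.")
incident.post_update("Monitoring", "A fix has been deployed to increase database connection pool capacity.")
# Resolving: remember to set the component back to Operational.
incident.post_update("Resolved", "This incident has been resolved.", component_status="Operational")
```

The last call mirrors step 3 of the resolution: the component status does not change unless you set it explicitly.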

Write a Postmortem

For significant incidents, add a postmortem:

  1. Open the resolved incident

  2. Click "Add Postmortem"

  3. Write a thorough analysis:

    Summary:

    On [date], users experienced elevated API response times for
    approximately 45 minutes due to database connection pool exhaustion.

    Impact:

    - Duration: 45 minutes
    - Scope: ~15% of API requests affected
    - Services impacted: API, Web Application

    Root Cause:

    A recent deployment increased concurrent request handling without
    proportionally increasing the database connection pool size.
    During peak traffic, connections were exhausted, causing requests
    to queue and time out.

    Timeline:

    14:23 - Monitoring alerts for elevated response times
    14:25 - Engineering notified, investigation begins
    14:35 - Root cause identified as connection pool exhaustion
    14:45 - Pool size increase deployed to production
    14:55 - Response times normalized
    15:08 - Incident resolved after stability monitoring

    Action Items:

    - [ ] Add connection pool metrics to monitoring dashboard
    - [ ] Create deployment checklist for resource requirements
    - [ ] Implement connection pool auto-scaling

  4. Toggle "Publish Postmortem" to show on status page

  5. Click "Save"
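
When you write the Impact section, it helps to derive the durations from the timeline rather than estimate them. The sketch below does that arithmetic for the example timeline above; `minutes_between` is an illustrative helper, and it assumes all entries fall on the same day.

```python
from datetime import datetime

# Timeline entries copied from the example postmortem.
TIMELINE = {
    "alert": "14:23",
    "root_cause_identified": "14:35",
    "fix_deployed": "14:45",
    "resolved": "15:08",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_identify = minutes_between(TIMELINE["alert"], TIMELINE["root_cause_identified"])
time_to_fix = minutes_between(TIMELINE["alert"], TIMELINE["fix_deployed"])
total_duration = minutes_between(TIMELINE["alert"], TIMELINE["resolved"])
```

Here `total_duration` is 45, which matches the "Duration: 45 minutes" figure in the Impact section; `time_to_identify` is 12 and `time_to_fix` is 22.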

Best Practices

Communication Style

Do:

  • Be clear and concise
  • Use simple language, avoid jargon
  • Provide estimated times when possible
  • Update frequently during active incidents

Don't:

  • Make promises you can't keep
  • Blame individuals or teams
  • Use overly technical language
  • Leave users without updates for long periods

Update Frequency

  Incident Phase    Update Frequency
  Investigating     Every 15-20 minutes
  Identified        Every 20-30 minutes
  Monitoring        Every 30-60 minutes
  Resolved          Final update only
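
One way to apply this cadence is to compute a deadline for the next update from the time of the last one. The following is a small sketch that uses the upper bound of each range; `UPDATE_WINDOW_MINUTES` and `next_update_deadline` are illustrative names, not a ReliaPulse feature.

```python
from datetime import datetime, timedelta
from typing import Optional

# (minimum, maximum) minutes between updates, taken from the table above.
# Resolved has no entry because only a final update is expected.
UPDATE_WINDOW_MINUTES = {
    "Investigating": (15, 20),
    "Identified": (20, 30),
    "Monitoring": (30, 60),
}


def next_update_deadline(phase: str, last_update: datetime) -> Optional[datetime]:
    """Latest time the next update should go out, or None if the phase has no cadence."""
    window = UPDATE_WINDOW_MINUTES.get(phase)
    if window is None:
        return None
    return last_update + timedelta(minutes=window[1])
```

For an incident in Investigating whose last update went out at 14:25, the deadline is 14:45.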

Incident Templates

Use templates for consistent messaging:

  1. Navigate to Settings > Templates
  2. Create templates for common incident types:
    • Network issues
    • Database problems
    • Third-party outages
    • Planned maintenance

Templates save time during high-pressure situations and ensure consistent communication.
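
The idea behind a template is a fixed message with a few blanks to fill in. This tutorial does not describe ReliaPulse's template syntax, so the sketch below uses Python's standard `string.Template` with `$placeholder` fields as a stand-in; the `TEMPLATES` dictionary, the `render` helper, and the field names are assumptions for illustration.

```python
from string import Template

# Hypothetical templates keyed by incident type.
TEMPLATES = {
    "investigating": Template(
        "We are investigating reports of $symptom affecting $component. "
        "We will provide updates as we learn more."
    ),
    "third_party": Template(
        "$provider is experiencing an outage that affects $component. "
        "We are monitoring their status and will post updates here."
    ),
}


def render(kind: str, **fields: str) -> str:
    """Fill in a template; raises KeyError if a placeholder is left unfilled."""
    return TEMPLATES[kind].substitute(**fields)


message = render("investigating", symptom="slow response times", component="the API")
```

Using `substitute` rather than `safe_substitute` means a forgotten field fails loudly instead of publishing a message with a raw placeholder in it.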

Automatic Incidents

ENDPOINT components can automatically create incidents when health checks fail:

  1. Edit your ENDPOINT component
  2. Enable "Auto Create Incident"
  3. Set the failure threshold (e.g., 3 consecutive failures)
  4. Configure auto-resolve behavior

When the monitor detects failures:

  • An incident is created with Investigating status
  • The affected component is set to Major Outage
  • When the component recovers, the incident is resolved automatically (if auto-resolve is enabled)
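
The threshold logic described above can be sketched as a counter of consecutive failed health checks. This is a simplified model, not ReliaPulse's implementation: the class name `AutoIncidentMonitor` and its attributes are hypothetical, and resetting the component to Operational on recovery is an assumption that the tutorial does not state.

```python
from typing import Optional


class AutoIncidentMonitor:
    def __init__(self, failure_threshold: int = 3, auto_resolve: bool = True) -> None:
        self.failure_threshold = failure_threshold
        self.auto_resolve = auto_resolve
        self.consecutive_failures = 0
        self.incident_status: Optional[str] = None  # None means no incident yet
        self.component_status = "Operational"

    def record_check(self, healthy: bool) -> None:
        """Feed in one health check result and update incident state."""
        if healthy:
            self.consecutive_failures = 0
            if self.incident_status == "Investigating" and self.auto_resolve:
                self.incident_status = "Resolved"
                # Assumption: recovery also restores the component status.
                self.component_status = "Operational"
            return

        self.consecutive_failures += 1
        threshold_reached = self.consecutive_failures >= self.failure_threshold
        if threshold_reached and self.incident_status != "Investigating":
            self.incident_status = "Investigating"
            self.component_status = "Major Outage"


monitor = AutoIncidentMonitor(failure_threshold=3)
for result in (False, False, False):
    monitor.record_check(result)
# After the third consecutive failure an incident is open.
```

A single successful check resets the counter, so two failures followed by a success never open an incident with a threshold of 3.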

Next Steps