Your First Incident
Learn how to create, update, and resolve incidents in ReliaPulse.
Overview
This tutorial covers the complete incident lifecycle:
- Creating an incident
- Adding updates
- Resolving the incident
- Writing a postmortem
The Incident Lifecycle
┌───────────────┐   ┌────────────┐   ┌────────────┐   ┌──────────┐
│ Investigating │ → │ Identified │ → │ Monitoring │ → │ Resolved │
└───────────────┘   └────────────┘   └────────────┘   └──────────┘
  Report              Root cause       Fix applied,     Issue
  issue               found            watching         fixed
Create an Incident
When you discover an issue:
- Navigate to Dashboard > Incidents
- Click "New Incident"
- Fill in the details:
  Basic Information:
  - Title: API Response Times Elevated
  - Status: Investigating
  - Impact: Choose the severity level
  Affected Components:
  - Select API (or your component)
  - Set component status to Degraded Performance
  Initial Message:
  We are investigating reports of slow API response times. Some users may experience delays when making requests. We will provide updates as we learn more.
- Click "Create Incident"
The incident immediately appears on your public status page, and subscribers receive notifications.
Add an Update (Identified)
Once you've found the root cause:
- Open the incident from the incidents list
- Click "Add Update"
- Fill in the update:
  - Status: Identified
  - Message: We have identified the root cause as database connection pool exhaustion. Our team is working on increasing the pool size and implementing additional connection management.
- Optionally update the component status (keep as Degraded Performance)
- Click "Post Update"
Add an Update (Monitoring)
After applying a fix:
- Click "Add Update" again
- Fill in the update:
  - Status: Monitoring
  - Message: A fix has been deployed to increase database connection pool capacity. Response times are returning to normal levels. We are monitoring the system to ensure stability.
- Click "Post Update"
Resolve the Incident
Once the issue is fully resolved:
- Click "Add Update"
- Fill in the resolution:
  - Status: Resolved
  - Message: This incident has been resolved. API response times have returned to normal levels and have been stable for the past 30 minutes. We apologize for any inconvenience caused.
- Important: Update component status back to Operational
- Click "Post Update"
The incident is now marked as resolved and moves to the incident history.
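The walkthrough above moves one incident through four statuses in order. As a rough illustration, the lifecycle can be modeled as a small state machine. This is a minimal sketch, not ReliaPulse's actual data model or API: the `Incident` class, its fields, and the rule that an update may never move the status backward are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Statuses in the order the tutorial walks through them.
LIFECYCLE = ["Investigating", "Identified", "Monitoring", "Resolved"]


@dataclass
class Incident:
    """Hypothetical in-memory model; not ReliaPulse's actual data model."""
    title: str
    status: str = "Investigating"
    component_status: str = "Degraded Performance"
    updates: list = field(default_factory=list)

    def add_update(self, status, message, component_status=None):
        # Assumption: an update may keep the current status or move forward,
        # but never move backward or reopen a resolved incident.
        if self.status == "Resolved":
            raise ValueError("incident is already resolved")
        if LIFECYCLE.index(status) < LIFECYCLE.index(self.status):
            raise ValueError(f"cannot go from {self.status} back to {status}")
        self.status = status
        if component_status is not None:
            self.component_status = component_status
        self.updates.append((status, message))


incident = Incident(title="API Response Times Elevated")
incident.add_update("Identified", "Root cause: database connection pool exhaustion.")
incident.add_update("Monitoring", "A fix has been deployed; monitoring for stability.")
incident.add_update("Resolved", "Response times are back to normal.",
                    component_status="Operational")
print(incident.status, incident.component_status)  # Resolved Operational
```

The final update sets the component back to Operational explicitly, mirroring the "Important" step in the resolution form.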
Write a Postmortem
For significant incidents, add a postmortem:
- Open the resolved incident
- Click "Add Postmortem"
- Write a thorough analysis:
  Summary:
  On [date], users experienced elevated API response times for approximately 45 minutes due to database connection pool exhaustion.
  Impact:
  - Duration: 45 minutes
  - Users affected: ~15% of API requests
  - Services impacted: API, Web Application
  Root Cause:
  A recent deployment increased concurrent request handling without proportionally increasing the database connection pool size. During peak traffic, connections were exhausted, causing requests to queue and time out.
  Timeline:
  14:23 - Monitoring alerts for elevated response times
  14:25 - Engineering notified, investigation begins
  14:35 - Root cause identified as connection pool exhaustion
  14:45 - Pool size increase deployed to production
  14:55 - Response times normalized
  15:08 - Incident resolved after stability monitoring
  Action Items:
  - [ ] Add connection pool metrics to monitoring dashboard
  - [ ] Create deployment checklist for resource requirements
  - [ ] Implement connection pool auto-scaling
- Toggle "Publish Postmortem" to show on status page
- Click "Save"
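Before publishing, it can help to check that the durations quoted in the summary agree with the timeline. The sketch below derives them from the sample timestamps above. The `timeline` dictionary keys and the `minutes_between` helper are illustrative names, and the code assumes every entry falls on the same day.

```python
from datetime import datetime

# Timestamps copied from the sample postmortem timeline.
timeline = {
    "alert": "14:23",
    "root_cause_identified": "14:35",
    "fix_deployed": "14:45",
    "resolved": "15:08",
}


def minutes_between(start, end):
    """Minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


time_to_identify = minutes_between(timeline["alert"], timeline["root_cause_identified"])
time_to_fix = minutes_between(timeline["alert"], timeline["fix_deployed"])
total_duration = minutes_between(timeline["alert"], timeline["resolved"])

print(time_to_identify, time_to_fix, total_duration)  # 12 22 45
```

The 45-minute total matches the duration stated in the Summary and Impact sections of the sample.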
Best Practices
Communication Style
Do:
- Be clear and concise
- Use simple language, avoid jargon
- Provide estimated times when possible
- Update frequently during active incidents
Don't:
- Make promises you can't keep
- Blame individuals or teams
- Use overly technical language
- Leave users without updates for long periods
Update Frequency
| Incident Phase | Update Frequency |
|---|---|
| Investigating | Every 15-20 minutes |
| Identified | Every 20-30 minutes |
| Monitoring | Every 30-60 minutes |
| Resolved | Final update only |
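The table can be turned into a simple reminder rule. The following is a minimal sketch that takes the upper bound of each range as the deadline for the next update; the `MAX_INTERVAL_MINUTES` mapping and `next_update_due` function are hypothetical helpers, not a ReliaPulse feature.

```python
from datetime import datetime, timedelta

# Upper bound of each range in the table, in minutes.
# "Resolved" needs only the final update, so it maps to None.
MAX_INTERVAL_MINUTES = {
    "Investigating": 20,
    "Identified": 30,
    "Monitoring": 60,
    "Resolved": None,
}


def next_update_due(phase, last_update):
    """Return the latest time the next update should be posted, or None."""
    interval = MAX_INTERVAL_MINUTES[phase]
    if interval is None:
        return None
    return last_update + timedelta(minutes=interval)


# Arbitrary example timestamp for the most recent update.
last = datetime(2024, 1, 1, 14, 25)
print(next_update_due("Investigating", last))  # 2024-01-01 14:45:00
print(next_update_due("Resolved", last))       # None
```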
Incident Templates
Use templates for consistent messaging:
- Navigate to Settings > Templates
- Create templates for common incident types:
- Network issues
- Database problems
- Third-party outages
- Planned maintenance
Templates save time during high-pressure situations and ensure consistent communication.
Automatic Incidents
ENDPOINT components can automatically create incidents when health checks fail:
- Edit your ENDPOINT component
- Enable "Auto Create Incident"
- Set the failure threshold (e.g., 3 consecutive failures)
- Configure auto-resolve behavior
When the monitor detects failures:
- An incident is created with Investigating status
- Affected component is set to Major Outage
- When recovered, incident is resolved automatically
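The threshold behavior described above can be pictured as a counter of consecutive failures. This is a minimal sketch of one plausible implementation, not ReliaPulse's actual logic: the `AutoIncidentMonitor` class is hypothetical, and it assumes that a single passing check counts as recovery and that the component returns to Operational when the incident auto-resolves.

```python
class AutoIncidentMonitor:
    """Hypothetical model of threshold-based auto-create and auto-resolve."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.incident_open = False
        self.component_status = "Operational"

    def record_check(self, healthy):
        """Feed one health check result; return the action taken, if any."""
        if healthy:
            self.consecutive_failures = 0
            if self.incident_open:
                # Assumption: one passing check counts as recovery.
                self.incident_open = False
                self.component_status = "Operational"
                return "resolved"
            return None
        self.consecutive_failures += 1
        if not self.incident_open and self.consecutive_failures >= self.failure_threshold:
            # Threshold reached: open an incident and flag the component.
            self.incident_open = True
            self.component_status = "Major Outage"
            return "created"
        return None


monitor = AutoIncidentMonitor(failure_threshold=3)
checks = [False, False, True, False, False, False, True]
results = [monitor.record_check(ok) for ok in checks]
print(results)
# [None, None, None, None, None, 'created', 'resolved']
```

Two failures followed by a success do not open an incident, because the counter resets on every passing check; only three failures in a row do.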
Next Steps
- Learn about monitors - Automate incident creation
- Set up notifications - Alert your team
- Configure on-call - Escalate to the right people