Datadog Integration
Connect ReliaPulse with Datadog for application and infrastructure metrics.
Overview
The Datadog integration allows you to:
- Display Datadog metrics on your status page
- Create METRIC components based on Datadog queries
- Set thresholds for automatic status updates
- Track multi-series metrics with tag grouping
Prerequisites
- A Datadog account
- A Datadog API key
- A Datadog application key that is allowed to read metrics
Setup
1. Create Datadog API Key
- Log in to Datadog
- Navigate to Organization Settings > API Keys
- Click "New Key"
- Copy the API key
2. Create Application Key
- Navigate to Organization Settings > Application Keys
- Click "New Key"
- Copy the application key
Application keys are tied to the user who creates them. For production use, create the key under a service account so that it does not depend on an individual's account.
3. Add Integration in ReliaPulse
- Navigate to Settings > Integrations
- Click "Add Integration"
- Select Datadog
- Configure:
| Field | Description |
|---|---|
| Name | Display name (e.g., "Datadog Production") |
| API Key | Your Datadog API key |
| App Key | Your Datadog application key |
| Site | Datadog site (US1, US3, US5, EU1, AP1) |
- Click "Test Connection"
- Click "Save"
Datadog Sites
| Site | URL | Region |
|---|---|---|
| US1 | datadoghq.com | US |
| US3 | us3.datadoghq.com | US |
| US5 | us5.datadoghq.com | US |
| EU1 | datadoghq.eu | EU |
| AP1 | ap1.datadoghq.com | Asia Pacific |
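When calling Datadog directly (for example, to check keys outside ReliaPulse), the site code has to be turned into an API host. The sketch below is a minimal illustration, not ReliaPulse's own code: the `DATADOG_SITES` mapping copies the table above, `api_base_url` is a hypothetical helper name, and the `https://api.` prefix follows Datadog's public API convention.

```python
# Site codes and domains are taken from the table above.
DATADOG_SITES = {
    "US1": "datadoghq.com",
    "US3": "us3.datadoghq.com",
    "US5": "us5.datadoghq.com",
    "EU1": "datadoghq.eu",
    "AP1": "ap1.datadoghq.com",
}


def api_base_url(site: str) -> str:
    """Return the Datadog API base URL for a site code such as "EU1".

    Hypothetical helper: the name and error handling are illustrative.
    """
    try:
        domain = DATADOG_SITES[site.upper()]
    except KeyError:
        raise ValueError(f"Unknown Datadog site: {site!r}") from None
    return f"https://api.{domain}"
```

Keys created on one site are not valid on another, so a wrong site code usually surfaces as an authentication failure rather than a missing-data error.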
Creating Metrics Queries
Basic Query
- Go to the integration settings
- Click the "Metrics" tab
- Click "Add Query"
- Configure:
| Field | Value |
|---|---|
| Name | CPU Usage |
| Query | avg:system.cpu.user{*} |
| Polling Interval | 60 seconds |
| Warning Threshold | 70 |
| Critical Threshold | 90 |
- Save
Query Syntax
Datadog queries follow this pattern:
<aggregation>:<metric>{<scope>}
Examples:
avg:system.cpu.user{*} # Average CPU across all hosts
sum:http.requests{service:api}.as_count() # Request count for API service
avg:aws.rds.dbload{*} # RDS database load
p95:trace.request.duration{*} # P95 request duration
Aggregation Functions
| Function | Description |
|---|---|
| avg | Average value |
| sum | Sum of values |
| min | Minimum value |
| max | Maximum value |
| count | Number of points |
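To make the `<aggregation>:<metric>{<scope>}` pattern concrete, here is a small parser sketch. It is an illustration only: `parse_query` and the regular expression are hypothetical, cover just the basic pattern plus trailing modifiers such as `.as_count()`, and do not attempt to handle the full Datadog query language.

```python
import re

# <aggregation>:<metric>{<scope>} followed by optional modifiers like .as_count()
QUERY_RE = re.compile(
    r"^(?P<aggregation>\w+):"
    r"(?P<metric>[\w.]+)"
    r"\{(?P<scope>[^}]*)\}"
    r"(?P<modifier>(?:\.\w+\(\))*)$"
)


def parse_query(query: str) -> dict:
    """Split a basic Datadog query into its parts (hypothetical helper)."""
    match = QUERY_RE.match(query.strip())
    if match is None:
        raise ValueError(f"Query does not match the basic pattern: {query!r}")
    return match.groupdict()


parts = parse_query("sum:http.requests{service:api}.as_count()")
print(parts["aggregation"], parts["metric"], parts["scope"])
```

A check like this can catch a missing scope or a stray character before the query is saved, but the Datadog UI remains the authoritative validator.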
Scope (Tags)
Filter by tags:
avg:system.cpu.user{host:web-1} # Specific host
avg:system.cpu.user{env:production} # Production environment
avg:system.cpu.user{env:prod,service:api} # Multiple tags
Multi-Series Metrics
Track metrics split by tags:
Enabling Multi-Series
- Edit a metrics query
- Enable "Multi-Series Mode"
- Configure:
| Field | Description |
|---|---|
| Group By Tags | Comma-separated tag names |
| Aggregation | How to aggregate (AVG, SUM) |
| Max Series | Maximum series to track |
Example: Per-Host CPU
Query: avg:system.cpu.user{*}
Group By Tags: host
This creates a separate series for each host:
web-1: 45%
web-2: 52%
api-1: 38%
Query Transformation
ReliaPulse automatically appends a by {tags} clause, built from your Group By Tags, to the query:
Your query: avg:system.cpu.user{*}
With groupByTags: host,env
Effective query: avg:system.cpu.user{*} by {host,env}
Series Discovery
- Save the query with multi-series enabled
- Click "Discover Series"
- ReliaPulse queries Datadog and creates an entry for each series it finds
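The query transformation used in multi-series mode can be sketched as a simple string operation. This is an assumption about the behavior described under Query Transformation, not ReliaPulse's actual implementation; `apply_group_by` is a hypothetical name, and the sketch only covers queries without trailing modifiers (with a modifier such as `.as_count()`, Datadog expects the `by {...}` clause before the modifier).

```python
def apply_group_by(query: str, group_by_tags: list[str]) -> str:
    """Append a `by {tag1,tag2}` clause to a query (hypothetical helper).

    Returns the query unchanged when no tags are given or when it
    already contains a group-by clause.
    """
    tags = [tag.strip() for tag in group_by_tags if tag.strip()]
    if not tags or " by {" in query:
        return query
    return f"{query} by {{{','.join(tags)}}}"


# The UI field is comma-separated, so split it before calling the helper.
effective = apply_group_by("avg:system.cpu.user{*}", "host,env".split(","))
print(effective)  # avg:system.cpu.user{*} by {host,env}
```

Running the effective query in the Datadog UI shows exactly which series the discovery step should find.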
Common Metrics
Infrastructure
| Metric | Query |
|---|---|
| CPU Usage | avg:system.cpu.user{*} |
| Memory Usage | avg:system.mem.used{*} |
| Disk Usage | avg:system.disk.in_use{*} |
| Network In | sum:system.net.bytes_rcvd{*}.as_rate() |
AWS
| Metric | Query |
|---|---|
| RDS CPU | avg:aws.rds.cpuutilization{*} |
| RDS Connections | avg:aws.rds.database_connections{*} |
| Lambda Errors | sum:aws.lambda.errors{*}.as_count() |
| ELB Latency | avg:aws.elb.latency{*} |
APM
| Metric | Query |
|---|---|
| Request Rate | sum:trace.http.request.hits{*}.as_rate() |
| Error Rate | sum:trace.http.request.errors{*}.as_rate() |
| P95 Latency | p95:trace.http.request.duration{*} |
Thresholds
Set thresholds for automatic status updates:
| Threshold | Effect |
|---|---|
| Warning | Component status → Degraded |
| Critical | Component status → Major Outage |
Threshold Direction
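By default, a value above a threshold is treated as unhealthy. For metrics where lower values are worse (for example, free disk space):
- Set the critical threshold lower than the warning threshold
- ReliaPulse detects the inverted thresholds and treats values below them as unhealthy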
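The threshold rules above, including the inverted case, can be sketched as follows. This is a minimal illustration under stated assumptions: `evaluate_status` is a hypothetical function, the "Operational" status name is assumed, and treating a value exactly equal to a threshold as a breach is a guess rather than documented behavior.

```python
def evaluate_status(value: float, warning: float, critical: float) -> str:
    """Map a metric value to a component status (hypothetical helper)."""
    inverted = critical < warning  # lower values are worse
    if inverted:
        if value <= critical:
            return "Major Outage"
        if value <= warning:
            return "Degraded"
    else:
        if value >= critical:
            return "Major Outage"
        if value >= warning:
            return "Degraded"
    return "Operational"  # assumed name for the healthy state


print(evaluate_status(75, warning=70, critical=90))  # normal direction
print(evaluate_status(15, warning=20, critical=10))  # inverted direction
```

With warning 70 and critical 90, a CPU reading of 75 maps to Degraded. With warning 20 and critical 10, a free-space reading of 15 also maps to Degraded, because the direction is inferred from the order of the two thresholds.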
Troubleshooting
Authentication Failed
- Verify API key is correct
- Check application key permissions
- Confirm correct site selected
- Ensure keys haven't been revoked
No Data Returned
- Verify metric name is correct
- Check scope tags exist in Datadog
- Confirm metric has recent data points
- Try the query in Datadog UI first
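If the Datadog UI is not convenient, the same check can be made against Datadog's timeseries query endpoint (`GET /api/v1/query`, authenticated with the `DD-API-KEY` and `DD-APPLICATION-KEY` headers). The sketch below only builds the request; `build_query_request` is a hypothetical helper, the default host assumes the US1 site, and the 30-minute default mirrors the window described under Delayed Data.

```python
import time
import urllib.parse
import urllib.request


def build_query_request(api_key, app_key, query, window_seconds=1800,
                        api_base="https://api.datadoghq.com", now=None):
    """Build (but do not send) a Datadog timeseries query request."""
    to_ts = int(now if now is not None else time.time())
    from_ts = to_ts - window_seconds
    params = urllib.parse.urlencode(
        {"from": from_ts, "to": to_ts, "query": query}
    )
    return urllib.request.Request(
        f"{api_base}/api/v1/query?{params}",
        headers={"DD-API-KEY": api_key, "DD-APPLICATION-KEY": app_key},
        method="GET",
    )


request = build_query_request("your-api-key", "your-app-key",
                              "avg:system.cpu.user{*}")
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` and decoding the JSON body shows whether Datadog returns any series for the query. An empty series list points at the metric name, scope, or time range rather than at ReliaPulse.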
Delayed Data
Cloud metrics (AWS, GCP, Azure) often have 5-10 minute delays.
ReliaPulse uses a 30-minute time window to account for cloud metric delays. If data is still missing:
- Check Datadog for data availability
- Increase polling interval
- Verify metric is actively reporting
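One way to picture how a time window absorbs delayed data is to take the most recent non-null point inside the window instead of the point at the current timestamp. The sketch below assumes Datadog's `pointlist` shape (`[timestamp_ms, value]` pairs in ascending time order, with `None` for gaps); `latest_value` is a hypothetical helper and not ReliaPulse's actual logic.

```python
from typing import Optional


def latest_value(pointlist: list, now_ms: int,
                 window_ms: int = 30 * 60 * 1000) -> Optional[float]:
    """Return the newest non-null value within the window, or None."""
    cutoff = now_ms - window_ms
    for timestamp_ms, value in reversed(pointlist):
        if timestamp_ms < cutoff:
            break  # older points fall outside the window
        if value is not None:
            return value
    return None


points = [[1000, 1.0], [2000, 2.0], [3000, None]]
print(latest_value(points, now_ms=3000, window_ms=1500))
```

If a function like this returns `None` for a metric, the metric has not reported at all within the window, which is when the checks listed above become relevant.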
Empty Series
If multi-series mode returns no series:
- Verify groupByTags match actual Datadog tags
- Check tags exist on the metric
- Try querying without groupBy first
API Integration
Create Datadog Integration
curl -X POST https://your-domain.com/api/v1/integrations \
  -H "Authorization: Bearer sk_live_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Datadog Production",
    "type": "DATADOG",
    "config": {
      "apiKey": "your-api-key",
      "appKey": "your-app-key",
      "site": "US1"
    }
  }'
Create Metrics Query
curl -X POST https://your-domain.com/api/v1/integrations/{id}/metrics \
  -H "Authorization: Bearer sk_live_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "API CPU Usage",
    "query": "avg:system.cpu.user{service:api}",
    "pollingInterval": 60,
    "warningThreshold": 70,
    "criticalThreshold": 90,
    "isMultiSeries": true,
    "groupByTags": ["host"]
  }'
Best Practices
- Use specific scopes - Narrow queries to relevant data
- Set appropriate intervals - 60 seconds for most metrics
- Group related metrics - Use tags consistently
- Monitor polling - Check worker logs for errors
- Test queries first - Validate in Datadog UI
Related Documentation
- Metrics - Using metrics in ReliaPulse
- Components - METRIC component type
- Integrations Overview - Other integrations