Datadog Integration

Connect ReliaPulse with Datadog for application and infrastructure metrics.

Overview

The Datadog integration allows you to:

  • Display Datadog metrics on your status page
  • Create METRIC components based on Datadog queries
  • Set thresholds for automatic status updates
  • Track multi-series metrics with tag grouping

Prerequisites

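  • A Datadog account
  • A Datadog API key
  • A Datadog application key that is allowed to read metrics (Datadog API keys carry no permissions of their own; read access is governed by the application key)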

Setup

1. Create a Datadog API Key

  1. Log in to Datadog
  2. Navigate to Organization Settings > API Keys
  3. Click "New Key"
  4. Copy the API key

2. Create a Datadog Application Key

  1. Navigate to Organization Settings > Application Keys
  2. Click "New Key"
  3. Copy the application key

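Application keys are tied to the user who created them and inherit that user's permissions. For production use, create the key under a service account so that it does not depend on an individual's account.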

3. Add Integration in ReliaPulse

  1. Navigate to Settings > Integrations
  2. Click "Add Integration"
  3. Select Datadog
  4. Configure:

| Field   | Description                                 |
|---------|---------------------------------------------|
| Name    | Display name (e.g., "Datadog Production")   |
| API Key | Your Datadog API key                        |
| App Key | Your Datadog application key                |
| Site    | Datadog site (US1, US3, US5, EU1, AP1)      |

  5. Click "Test Connection"
  6. Click "Save"

Datadog Sites

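| Site | URL               | Region       |
|------|-------------------|--------------|
| US1  | datadoghq.com     | US           |
| US3  | us3.datadoghq.com | US           |
| US5  | us5.datadoghq.com | US           |
| EU1  | datadoghq.eu      | EU           |
| AP1  | ap1.datadoghq.com | Asia Pacific |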
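
For scripts that talk to Datadog directly, the table above maps onto API hosts by prefixing each domain with "api.". The following is a minimal sketch under that assumption; `DATADOG_SITES` and `api_base_url` are illustrative names and are not part of ReliaPulse.

```python
# Illustrative mapping of the site codes above to their Datadog domains.
DATADOG_SITES = {
    "US1": "datadoghq.com",
    "US3": "us3.datadoghq.com",
    "US5": "us5.datadoghq.com",
    "EU1": "datadoghq.eu",
    "AP1": "ap1.datadoghq.com",
}


def api_base_url(site: str) -> str:
    """Return the Datadog API base URL for a site code such as "EU1"."""
    try:
        domain = DATADOG_SITES[site.upper()]
    except KeyError:
        raise ValueError(f"Unknown Datadog site: {site!r}") from None
    return f"https://api.{domain}"


print(api_base_url("US3"))
```

Keys are only valid on the site where they were created, so a wrong site selection typically shows up as an authentication failure rather than as missing data.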

Creating Metrics Queries

Basic Query

  1. Go to the integration settings
  2. Click "Metrics" tab
  3. Click "Add Query"
  4. Configure:

| Field              | Value                  |
|--------------------|------------------------|
| Name               | CPU Usage              |
| Query              | avg:system.cpu.user{*} |
| Polling Interval   | 60 seconds             |
| Warning Threshold  | 70                     |
| Critical Threshold | 90                     |

  5. Click "Save"

Query Syntax

Datadog queries follow this pattern:

<aggregation>:<metric>{<scope>}

Examples:

avg:system.cpu.user{*}                    # Average CPU across all hosts
sum:http.requests{service:api}.as_count() # Request count for API service
avg:aws.rds.dbload{*}                     # RDS database load
p95:trace.request.duration{*}             # P95 request duration
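
To make the pattern concrete, here is a small sketch that splits a simple query into its parts. It covers only the basic form shown above (one aggregation, one metric, one scope, optional trailing modifiers), not the full Datadog query language; `QUERY_PATTERN` and `parse_query` are hypothetical names used for illustration.

```python
import re

# Matches only the simple form <aggregation>:<metric>{<scope>} plus
# optional trailing modifiers such as .as_count().
QUERY_PATTERN = re.compile(
    r"^(?P<aggregation>[a-z0-9]+):"     # e.g. avg, sum, p95
    r"(?P<metric>[A-Za-z][\w.]*)"       # e.g. system.cpu.user
    r"\{(?P<scope>[^}]*)\}"             # e.g. * or env:prod,service:api
    r"(?P<modifiers>(?:\.\w+\(\))*)$"   # e.g. .as_count()
)


def parse_query(query: str) -> dict:
    """Split a simple Datadog query into its named parts."""
    match = QUERY_PATTERN.match(query.strip())
    if match is None:
        raise ValueError(f"Not a simple Datadog query: {query!r}")
    return match.groupdict()


print(parse_query("sum:http.requests{service:api}.as_count()"))
```

A check like this can catch a missing scope or a stray character before the query is saved, but the Datadog UI remains the authoritative validator.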

Aggregation Functions

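| Function | Description      |
|----------|------------------|
| avg      | Average value    |
| sum      | Sum of values    |
| min      | Minimum value    |
| max      | Maximum value    |
| count    | Number of points |

Percentile aggregations such as p95, used in the examples above, are available in Datadog only for distribution metrics that have percentiles enabled.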

Scope (Tags)

Filter by tags:

avg:system.cpu.user{host:web-1}           # Specific host
avg:system.cpu.user{env:production}       # Production environment
avg:system.cpu.user{env:prod,service:api} # Multiple tags
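
Datadog treats comma-separated tags in a scope as a logical AND. The sketch below assembles a scoped query from a dictionary of tags; `build_query` is an illustrative helper, not a ReliaPulse or Datadog function.

```python
from typing import Dict, Optional


def build_query(aggregation: str, metric: str,
                tags: Optional[Dict[str, str]] = None) -> str:
    """Build <aggregation>:<metric>{<scope>}; no tags means the {*} scope."""
    if tags:
        scope = ",".join(f"{key}:{value}" for key, value in tags.items())
    else:
        scope = "*"
    return f"{aggregation}:{metric}{{{scope}}}"


print(build_query("avg", "system.cpu.user", {"env": "prod", "service": "api"}))
```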

Multi-Series Metrics

Track metrics split by tags:

Enabling Multi-Series

  1. Edit a metrics query
  2. Enable "Multi-Series Mode"
  3. Configure:

| Field         | Description                     |
|---------------|---------------------------------|
| Group By Tags | Comma-separated tag names       |
| Aggregation   | How to aggregate (AVG, SUM)     |
| Max Series    | Maximum series to track         |

Example: Per-Host CPU

Query: avg:system.cpu.user{*}
Group By Tags: host

This creates a separate series for each host:

  • web-1: 45%
  • web-2: 52%
  • api-1: 38%

Query Transformation

ReliaPulse automatically appends a "by {<tags>}" clause, built from your Group By Tags, to your query:

Your query: avg:system.cpu.user{*}
With groupByTags: host,env
Effective query: avg:system.cpu.user{*} by {host,env}
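
The transformation above can be sketched as a plain string append. This mirrors the documented example only; `with_group_by` is a hypothetical name, and ReliaPulse's actual implementation may differ.

```python
from typing import Sequence


def with_group_by(query: str, group_by_tags: Sequence[str]) -> str:
    """Append a "by {tag1,tag2}" clause; return the query unchanged if no tags."""
    tags = [tag.strip() for tag in group_by_tags if tag.strip()]
    if not tags:
        return query
    return f"{query} by {{{','.join(tags)}}}"


print(with_group_by("avg:system.cpu.user{*}", ["host", "env"]))
```

Datadog's own syntax places the grouping clause directly after the scope, so a query that ends in a modifier such as .as_count() would need the clause inserted before the modifier. How ReliaPulse handles that case is not stated here, so it is worth confirming with a test query.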

Series Discovery

  1. Save the query with multi-series enabled
  2. Click "Discover Series"
  3. ReliaPulse queries Datadog and creates an entry for each series it finds
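
As a rough illustration of what discovery involves, the sketch below reads a response shaped like the one returned by Datadog's v1 timeseries query endpoint (a "series" list whose items carry "scope", "tag_set", and "pointlist") and picks the latest non-null value per series. The sample data is made up to match the per-host example above, and `latest_value_per_series` is a hypothetical helper; ReliaPulse's internal logic is not documented here.

```python
from typing import Dict, Optional


def latest_value_per_series(response: dict) -> Dict[str, Optional[float]]:
    """Map each series scope to its most recent non-null value."""
    latest: Dict[str, Optional[float]] = {}
    for series in response.get("series", []):
        # pointlist entries are [timestamp_ms, value]; value may be None.
        values = [value for _, value in series.get("pointlist", [])
                  if value is not None]
        latest[series.get("scope", "")] = values[-1] if values else None
    return latest


# Made-up sample response for illustration only.
sample = {
    "status": "ok",
    "series": [
        {"scope": "host:web-1", "tag_set": ["host:web-1"],
         "pointlist": [[1700000000000, 44.0], [1700000060000, 45.0]]},
        {"scope": "host:web-2", "tag_set": ["host:web-2"],
         "pointlist": [[1700000000000, 52.0], [1700000060000, None]]},
        {"scope": "host:api-1", "tag_set": ["host:api-1"],
         "pointlist": [[1700000000000, 38.0]]},
    ],
}

print(latest_value_per_series(sample))
```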

Common Metrics

Infrastructure

| Metric       | Query                                   |
|--------------|-----------------------------------------|
| CPU Usage    | avg:system.cpu.user{*}                  |
| Memory Usage | avg:system.mem.used{*}                  |
| Disk Usage   | avg:system.disk.in_use{*}               |
| Network In   | sum:system.net.bytes_rcvd{*}.as_rate()  |

AWS

| Metric          | Query                                 |
|-----------------|---------------------------------------|
| RDS CPU         | avg:aws.rds.cpuutilization{*}         |
| RDS Connections | avg:aws.rds.database_connections{*}   |
| Lambda Errors   | sum:aws.lambda.errors{*}.as_count()   |
| ELB Latency     | avg:aws.elb.latency{*}                |

APM

| Metric       | Query                                      |
|--------------|--------------------------------------------|
| Request Rate | sum:trace.http.request{*}.as_rate()        |
| Error Rate   | sum:trace.http.request.errors{*}.as_rate() |
| P95 Latency  | p95:trace.http.request.duration{*}         |

Thresholds

Set thresholds for automatic status updates:

| Threshold | Effect                          |
|-----------|---------------------------------|
| Warning   | Component status → Degraded     |
| Critical  | Component status → Major Outage |

Threshold Direction

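By default, a value above a threshold is treated as unhealthy. For metrics where lower values are worse (for example, free disk space):

  • Set the critical threshold lower than the warning threshold
  • ReliaPulse detects the inverted thresholds and treats values below them as unhealthy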
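
The threshold rules above can be sketched as follows. This is an illustration, not ReliaPulse's implementation: the function name, the "Operational" status label, and the choice to treat a value exactly at a threshold as breaching it are all assumptions.

```python
def evaluate_status(value: float, warning: float, critical: float) -> str:
    """Map a metric value to a component status using the two thresholds."""
    inverted = critical < warning  # lower values are worse
    if inverted:
        if value <= critical:
            return "Major Outage"
        if value <= warning:
            return "Degraded"
    else:
        if value >= critical:
            return "Major Outage"
        if value >= warning:
            return "Degraded"
    return "Operational"  # assumed label for the healthy state


# Normal direction: CPU usage with warning 70 and critical 90.
print(evaluate_status(75, warning=70, critical=90))
# Inverted direction: free disk percentage with warning 20 and critical 10.
print(evaluate_status(15, warning=20, critical=10))
```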

Troubleshooting

Authentication Failed

  1. Verify that the API key is correct
  2. Check the application key's permissions
  3. Confirm that the correct Datadog site is selected
  4. Ensure that neither key has been revoked

No Data Returned

  1. Verify metric name is correct
  2. Check scope tags exist in Datadog
  3. Confirm metric has recent data points
  4. Try the query in Datadog UI first

Delayed Data

Cloud metrics (AWS, GCP, Azure) often have 5-10 minute delays.

ReliaPulse uses a 30-minute time window to account for cloud metric delays. If data is still missing:

  1. Check Datadog for data availability
  2. Increase polling interval
  3. Verify metric is actively reporting
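
To check data availability by hand, you can query Datadog over the same 30-minute window. The sketch below computes the window and the query-string parameters (from, to, query, with times in epoch seconds) that Datadog's v1 timeseries query endpoint accepts. `query_window` and `query_params` are hypothetical helpers, and how ReliaPulse selects a point within the window is not documented here.

```python
import time
from typing import Optional, Tuple
from urllib.parse import urlencode


def query_window(now: Optional[float] = None,
                 minutes: int = 30) -> Tuple[int, int]:
    """Return (start, end) in epoch seconds for the trailing window."""
    end = int(now if now is not None else time.time())
    return end - minutes * 60, end


def query_params(query: str, now: Optional[float] = None) -> str:
    """Build the URL-encoded from/to/query parameters for the window."""
    start, end = query_window(now)
    return urlencode({"from": start, "to": end, "query": query})


print(query_params("avg:aws.rds.cpuutilization{*}"))
```

If the newest point in that window is already several minutes old, the delay originates upstream of ReliaPulse.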

Empty Series

If multi-series mode returns no series:

  1. Verify groupByTags match actual Datadog tags
  2. Check tags exist on the metric
  3. Try querying without groupBy first

API Integration

Create Datadog Integration

curl -X POST https://your-domain.com/api/v1/integrations \
  -H "Authorization: Bearer sk_live_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Datadog Production",
    "type": "DATADOG",
    "config": {
      "apiKey": "your-api-key",
      "appKey": "your-app-key",
      "site": "US1"
    }
  }'

Create Metrics Query

# Replace {id} with the ID of your Datadog integration
curl -X POST https://your-domain.com/api/v1/integrations/{id}/metrics \
  -H "Authorization: Bearer sk_live_xxx" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "API CPU Usage",
    "query": "avg:system.cpu.user{service:api}",
    "pollingInterval": 60,
    "warningThreshold": 70,
    "criticalThreshold": 90,
    "isMultiSeries": true,
    "groupByTags": ["host"]
  }'
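
The same request can be prepared from Python using only the standard library. This sketch builds the request without sending it; the base URL, token, and integration ID ("int-123") are placeholders, and `build_metrics_query_request` is an illustrative name. The response format is not described in this document, so response handling is left out.

```python
import json
import urllib.request


def build_metrics_query_request(base_url: str, integration_id: str,
                                token: str, payload: dict) -> urllib.request.Request:
    """Prepare the POST request that creates a metrics query."""
    url = f"{base_url}/api/v1/integrations/{integration_id}/metrics"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_metrics_query_request(
    base_url="https://your-domain.com",
    integration_id="int-123",   # placeholder ID
    token="sk_live_xxx",        # placeholder token
    payload={
        "name": "API CPU Usage",
        "query": "avg:system.cpu.user{service:api}",
        "pollingInterval": 60,
        "warningThreshold": 70,
        "criticalThreshold": 90,
        "isMultiSeries": True,
        "groupByTags": ["host"],
    },
)
# urllib.request.urlopen(request) would send it; omitted in this sketch.
print(request.get_method(), request.full_url)
```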

Best Practices

  1. Use specific scopes - Narrow queries to relevant data
  2. Set appropriate intervals - 60 seconds for most metrics
  3. Group related metrics - Use tags consistently
  4. Monitor polling - Check worker logs for errors
  5. Test queries first - Validate in Datadog UI
