How-To: Using Service Pattern Discovery and Dashboard Templates

Introduction

This guide walks you through setting up baseline monitoring for your service using the platform team's discovery dashboards and templates. By the end, you'll have a customized dashboard showing your service's health and a set of alerts tuned to your service's actual behavior patterns.

The whole process takes about an hour if you're methodical about it. Most of that time is waiting for the discovery dashboard to show you enough data to make informed decisions about thresholds.

Before You Start: This guide assumes you've read the explanation document "Understanding ALB Service Metrics and Baseline Monitoring". That document covers the why behind everything here — how the metrics pipeline works, what baselines mean statistically, and the rationale for different alert patterns. This guide focuses on the how.

Prerequisites

Before diving in, make sure you have:

Grafana access with permissions to create dashboards and alerts in your team's folder
A service running behind ALB that's been deployed for at least a few days (ideally 1-2 weeks for good baseline data)
Your service's target group pattern — we'll show you how to find this in the first section
30-60 minutes of uninterrupted time to do this properly

Finding Your Service Label

Every metric in this system uses a service label derived from the ALB target group name. Before you can query anything, you need to know what pattern to match.

Step 1: Find Your Target Group in AWS

Open the AWS Console and navigate to EC2 → Target Groups
Search for your service name (try partial matches like analytic or ebook)
Look for target groups following the pattern k8s-<namespace>-<service>-<random_id>

AWS Console Target Groups list with search filter applied

Step 2: Extract the Service Pattern

From the target group name, identify the portion you'll use for regex matching.

For example, if your target group is:

k8s-ebook-analytic-4fe95c5902

Your service pattern for queries would be analytic (the unique identifying portion).

Don't include the random suffix in your pattern. Target groups can be recreated with different IDs during infrastructure changes. Use just the stable part of the name that identifies your service.
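The extraction rule above can be sketched in a few lines of Python. This is only an illustration, not part of the platform tooling: the function name `extract_service_pattern` is hypothetical, and the sketch assumes the namespace itself contains no hyphens, which may not hold for every target group.

```python
def extract_service_pattern(target_group_name: str) -> str:
    """Return the stable service portion of k8s-<namespace>-<service>-<random_id>.

    Assumption: the namespace contains no hyphens; the service part may.
    """
    prefix = "k8s-"
    if not target_group_name.startswith(prefix):
        raise ValueError(f"unexpected target group name: {target_group_name!r}")
    body = target_group_name[len(prefix):]
    # Drop the random suffix: everything after the last hyphen.
    stable, _, random_id = body.rpartition("-")
    # Split the namespace (first segment) from the service (the rest).
    namespace, _, service = stable.partition("-")
    if not namespace or not service or not random_id:
        raise ValueError(f"unexpected target group name: {target_group_name!r}")
    return service


print(extract_service_pattern("k8s-ebook-analytic-4fe95c5902"))
```

If your target group name does not follow this shape, pick the stable portion by eye as described above rather than relying on a script.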

Step 3: Verify the Pattern Works

Open Grafana and run a quick test query in the Explore view:

Go to Explore (compass icon in the left sidebar)
Select the Prometheus datasource
Enter this query, replacing YOUR_SERVICE with your pattern:

alb:requests:rate5m{service=~".*YOUR_SERVICE.*", environment="prod"}

If you see data, you've got the right pattern. If not, try variations or check that your service has been receiving traffic.
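Prometheus regex matchers are fully anchored, so the leading and trailing `.*` are what let a partial pattern such as analytic match the full service label. The sketch below builds the same selector from a pattern. The helper name `build_selector` and the allowed-character check are assumptions made for illustration.

```python
import re


def build_selector(metric: str, service_pattern: str, environment: str = "prod") -> str:
    """Build a PromQL selector that matches the service label by substring."""
    # Assumption: patterns are lowercase letters, digits, and hyphens only,
    # so nothing in them needs regex escaping.
    if not re.fullmatch(r"[a-z0-9-]+", service_pattern):
        raise ValueError(f"unexpected service pattern: {service_pattern!r}")
    return (
        f'{metric}{{service=~".*{service_pattern}.*", '
        f'environment="{environment}"}}'
    )


print(build_selector("alb:requests:rate5m", "analytic"))
```

Paste the printed query into Explore to run the same check as above.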

Grafana Explore view showing a successful query with data returned

Phase 1: Discovering Your Service's Baselines

This is the most important phase. You're going to analyze your service's historical behavior to understand what "normal" looks like. Rush this, and you'll end up with thresholds that either never fire or fire constantly.

Step 1: Open the Service Pattern Discovery Dashboard

Navigate to: Dashboards → Platform Team → Service Pattern Discovery (Fast)

Or use this direct link: Service Pattern Discovery (Fast)

Service Pattern Discovery (Fast) dashboard location in the Grafana folder structure

Step 2: Configure the Dashboard Variables

At the top of the dashboard, you'll see dropdown selectors for filtering the data:

service: Enter your service pattern (e.g., analytic, nortonau, metadata)
environment: Select prod — you want baselines from production behavior
prometheus_datasource: Should default correctly, but verify it's pointing to the right Prometheus instance

Variable selectors at the top of the dashboard with example values filled in

Step 3: Set the Analysis Time Range

Click the time picker in the top-right corner and select a range that gives you meaningful data:

Minimum: 7 days (enough to see weekly patterns)
Recommended: 14 days (captures two full weekly cycles). For this range you will have to enter the dates manually.
Maximum useful: 30 days (beyond this, Prometheus retention may limit data)

Why 14 days? Services often behave differently on different days of the week. A single week might not show you that Friday traffic is always lower, or that Monday mornings have a spike. Two weeks smooths out anomalies and shows you the real patterns.

Time picker with "Last 14 days" selected

Step 4: Analyze Traffic Patterns

Start with the "Heartbeat - Request Rate Over Time" panel. This shows your service's traffic volume over the selected period.

What to look for:

Look at the legend table below the graph. It shows calculated statistics:

Stat	What It Tells You
Mean	Your average traffic level — the center of "normal"
Max	Peak traffic observed — useful for capacity planning
Min	Lowest traffic — check if zeros are expected (overnight?)
Std Dev	How much traffic varies — low means consistent, high means bursty

Record these values. You'll use them to calculate thresholds.

Heartbeat panel with the legend table visible showing Mean, Max, Min, and Std Dev values

Identify patterns:

Do you see daily cycles? (Higher during business hours, lower at night)
Do you see weekly cycles? (Different on weekends vs weekdays)
Are there regular spikes? (Batch jobs, scheduled tasks)

These patterns affect how you'll configure alerts. A traffic drop alert that fires at 2 AM when traffic is supposed to be low is just noise.

Step 5: Analyze Latency Patterns

Scroll to the "Latency Percentiles" section. This shows P50, P95, and P99 response times.

Focus on P95, which is typically the metric you'll alert on. 95% of requests complete at or below this value, so it reflects what the slowest 5% of requests experience without being dominated by a handful of extreme outliers.

From the legend, record:

P95 Mean: Your baseline latency
P95 Max: Worst observed latency (was there an incident during this period?)
P95 Std Dev: How much latency varies

Latency Percentiles panel showing P50, P95, and P99 lines with legend statistics

Check for anomalies in your analysis window. If there was an outage or incident during the 14 days you're analyzing, those data points will skew your statistics. Consider adjusting your time range to exclude known incidents, or mentally account for them when setting thresholds.
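To see how much a single incident can distort a threshold, here is a small sketch with made-up numbers. The sample values are hypothetical, and the use of the population standard deviation is an assumption, since this guide does not state which variant the dashboard legend uses.

```python
from statistics import mean, pstdev


def warning_threshold(samples_ms, sigmas=2.0):
    """Return mean + sigmas * population standard deviation."""
    return mean(samples_ms) + sigmas * pstdev(samples_ms)


# Hypothetical P95 samples in milliseconds.
normal = [200, 210, 190, 205, 195]
# The same samples plus one data point recorded during an incident.
with_incident = normal + [2000]

clean = warning_threshold(normal)
skewed = warning_threshold(with_incident)
print(f"without incident: {clean:.0f} ms, with incident: {skewed:.0f} ms")
```

One extreme sample raises both the mean and the standard deviation, so a threshold derived from the skewed window would sit far above normal behavior and might never fire.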

Step 6: Analyze Error Patterns

Look at the "Error Rate Analysis" section showing 5xx and 4xx error percentages.

For 5xx errors:

What's the typical rate? Many services run at 0% most of the time
Are there patterns? Errors during deployments? Scheduled maintenance windows?

For 4xx errors:

These are often higher than 5xx (client errors like 404s, 401s)
Authentication services naturally see more 4xx (failed login attempts)

Record the mean and maximum for both.

Error Rate Analysis panel showing 5xx and 4xx percentage lines

Step 7: Document Your Baselines

Before moving on, write down what you've learned. You'll need this when configuring alerts.

## [Your Service Name] Baseline Analysis

**Analysis Period:** [start date] to [end date]
**Environment:** Production

### Traffic

- Mean: ____ req/s
- Max: ____ req/s
- Std Dev: ____ req/s
- Pattern Notes: [daily cycles? weekend differences?]

### Latency (P95)

- Mean: ____ ms
- Max: ____ ms
- Std Dev: ____ ms

### Error Rates

- 5xx Mean: ____ %
- 5xx Max: ____ %
- 4xx Mean: ____ %

### Calculated Thresholds

- Traffic Drop Warning: ____ req/s (mean - 2σ)
- Traffic Spike Warning: ____ req/s (mean + 2σ)
- P95 Latency Warning: ____ ms (mean + 2σ)
- P95 Latency Critical: ____ ms (mean + 3σ)
- 5xx Error Warning: ____ %
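The calculated thresholds in the template are simple arithmetic on the recorded mean and standard deviation. The following sketch works through them with hypothetical baseline numbers. The `Baseline` class and `sigma_threshold` helper are illustrative names, not part of any platform library.

```python
from dataclasses import dataclass


@dataclass
class Baseline:
    mean: float
    std_dev: float


def sigma_threshold(baseline: Baseline, sigmas: float) -> float:
    """Return mean + sigmas * std_dev, floored at zero.

    The floor is there because a negative rate or latency is meaningless.
    """
    return max(0.0, baseline.mean + sigmas * baseline.std_dev)


# Hypothetical values as they might be read from the discovery dashboard legend.
traffic = Baseline(mean=120.0, std_dev=25.0)      # req/s
latency_p95 = Baseline(mean=180.0, std_dev=40.0)  # ms

thresholds = {
    "traffic_drop_warning": sigma_threshold(traffic, -2),      # mean - 2σ
    "traffic_spike_warning": sigma_threshold(traffic, 2),      # mean + 2σ
    "p95_latency_warning": sigma_threshold(latency_p95, 2),    # mean + 2σ
    "p95_latency_critical": sigma_threshold(latency_p95, 3),   # mean + 3σ
}

for name, value in thresholds.items():
    print(f"{name}: {value:g}")
```

If the traffic drop threshold comes out as 0 (a bursty service where 2σ exceeds the mean), the mean - 2σ rule cannot detect drops for that service, and a "below" alert on that value would never fire.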

Phase 2: Creating Your Service Dashboard

Now that you understand your service's baseline behavior, let's create a dashboard you can use for day-to-day monitoring.

Step 1: Open the Service Dashboard Template

Navigate to: Dashboards → Platform Team → Service Dashboard Template

Or use this direct link: Service Dashboard Template

Service Dashboard Template dashboard location

Step 2: Clone the Dashboard

Click the Settings icon (gear) in the top-right corner
Click "Save As..."
Enter a name for your dashboard: [ServiceName] Service Dashboard
- Example: Analytics Service Dashboard
Select your team's folder from the dropdown
Click Save

Save dashboard dialog with name and folder fields

Why clone instead of edit? The template is shared. If you modify it directly, you'll break it for everyone else. Always clone first, then customize your copy.

Step 3: Update the Service Variable Default

Your cloned dashboard still has placeholder values. Let's fix that.

Click Settings (gear icon) → Variables
Find the service variable and click on it
Update the default value to your service pattern
Click Update, then Save dashboard

Variables settings page with the service variable selected

Step 4: Verify Data Appears

Go back to the dashboard view. All panels should now show data for your service. If any panels show "No data":

Check that your service pattern is correct
Verify the environment variable matches where your service runs
Confirm the time range includes recent data

Step 5: Configure Panel Thresholds

This is where your baseline analysis pays off. For each key panel, you'll add visual thresholds that color the graph based on value ranges.

For the Request Rate panel:

Click the panel title → Edit
In the right sidebar, find Thresholds
Add thresholds based on your baseline:
- Base (green): Up to your normal max
- Yellow: Above your "spike" threshold
- Red: Significantly above normal (potential incident)
Click Apply

Panel edit view showing the Thresholds configuration section

For the P95 Latency panel:

Edit the panel
Add thresholds:
- Green: Below your warning threshold
- Yellow: Between warning and critical
- Red: Above critical threshold
Apply changes

Repeat for Error Rate panels with appropriate thresholds from your analysis.
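If you prefer to review thresholds in the dashboard's JSON model rather than the sidebar, the green/yellow/red steps above correspond to a small structure under the panel's field configuration. The sketch below generates that structure. It assumes absolute-mode thresholds; the helper name and the example values are hypothetical, so compare the output against your cloned dashboard's JSON before relying on it.

```python
import json


def threshold_steps(warning: float, critical: float) -> dict:
    """Build a thresholds object with green, yellow, and red steps."""
    if not warning < critical:
        raise ValueError("warning must be below critical")
    return {
        "mode": "absolute",
        "steps": [
            {"color": "green", "value": None},      # base step: below warning
            {"color": "yellow", "value": warning},  # warning up to critical
            {"color": "red", "value": critical},    # critical and above
        ],
    }


# Hypothetical warning and critical values, expressed in the panel's own unit.
print(json.dumps(threshold_steps(0.26, 0.30), indent=2))
```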

Step 6: Remove the Setup Instructions Panel

The template includes an instructional panel at the top that you no longer need.

Hover over the "Setup Instructions" panel
Click the panel menu (three dots)
Select Remove
Save the dashboard

Step 7: Save and Share

Click Save dashboard
Optionally, star the dashboard for quick access
Share the dashboard link with your team

Phase 3: Creating Alert Rules

Dashboards show you what's happening now. Alerts tell you when something's wrong even when you're not looking. Let's create alerts based on your baselines.

Understanding Alert Structure

Before we create alerts, here's how Grafana alerts work:

Query: Fetches the metric data (same queries your dashboard uses)
Expression: Reduces the data to a single value and applies conditions
Evaluation: How often to check and how long a condition must persist
Notification: Where to send the alert
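The interaction between evaluation interval and pending period is easiest to see in a small simulation. This is a simplified model for intuition only, not Grafana's actual implementation: it assumes the rule is evaluated at a fixed interval and that any evaluation where the condition is not met resets the pending timer. The function name is hypothetical.

```python
from typing import Optional, Sequence


def first_firing_index(
    breaches: Sequence[bool], interval_s: int, pending_s: int
) -> Optional[int]:
    """Return the index of the evaluation at which the alert would fire.

    breaches[i] is True when the condition is met at evaluation i.
    Returns None if the condition never holds for the full pending period.
    """
    pending_since = None
    for i, breached in enumerate(breaches):
        now = i * interval_s
        if not breached:
            pending_since = None  # condition cleared: back to Normal
            continue
        if pending_since is None:
            pending_since = now   # condition first met: enter Pending
        if now - pending_since >= pending_s:
            return i              # held long enough: Firing
    return None


# 1-minute evaluations with a 3-minute pending period.
print(first_firing_index([False, True, True, True, True], 60, 180))
# A brief recovery resets the timer.
print(first_firing_index([True, True, False, True, True, True, True], 60, 180))
```

The second call shows why a short anomaly that clears before the pending period elapses never produces a notification.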

Step 1: Create a Service Down Alert

This is your most critical alert. Zero traffic almost always means something is broken.

Navigate to Alerting → Alert rules
Click + New alert rule
Configure the alert:

Rule name: Service Down - [YourServiceName]

Query A:

sum(alb:requests:rate1m{service=~".*YOUR_SERVICE.*", environment="prod"})

Add a Reduce expression:
- Expression: B
- Function: Last
- Input: A
- Mode: Strict
Add a Math expression:
- Expression: $B == 0

Reduce and Math expressions configured

Set evaluation behavior:
- Folder: Select or create your team's alert folder
- Evaluation group: Create a new group or use existing
- Pending period: 3m (fires after 3 minutes of zero traffic)
Add annotations:
- Summary: [YourServiceName] is not receiving any traffic
- Description: No requests have been recorded for the past 3 minutes. The service may be down or unreachable.
Add labels:
- severity: critical
Configure notifications:
- Select your team's notification channel
Click Save rule and exit

Step 2: Create a Traffic Drop Alert (Peak Hours)

This alert detects when traffic drops below expected levels during business hours. We'll use time-based filtering to avoid false positives overnight.

Rule name: Traffic Drop (Peak Hours) - [YourServiceName]

Query A:

(
  sum(alb:requests:rate1m{service=~".*YOUR_SERVICE.*", environment="prod"})
  and
  hour() >= 9 and hour() < 18
  and
  day_of_week() != 0 and day_of_week() != 6
)

Understanding the time filter:

hour() >= 9 and hour() < 18 — Only evaluate between 9 AM and 6 PM UTC
day_of_week() != 0 and day_of_week() != 6 — Exclude Sunday (0) and Saturday (6)

Adjust these to match your service's expected traffic patterns and timezone.
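Because hour() and day_of_week() work in UTC, it helps to check which local times the filter actually covers before adjusting it. The sketch below mirrors the filter's logic in Python. The function name `in_peak_window` is hypothetical, and the UTC-5 offset is only an example.

```python
from datetime import datetime, timedelta, timezone


def in_peak_window(ts: datetime, start_hour: int = 9, end_hour: int = 18) -> bool:
    """Mirror the PromQL filter: weekdays, start_hour <= UTC hour < end_hour."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    utc = ts.astimezone(timezone.utc)
    # PromQL day_of_week(): 0 = Sunday ... 6 = Saturday.
    prom_day_of_week = utc.isoweekday() % 7
    return start_hour <= utc.hour < end_hour and prom_day_of_week not in (0, 6)


utc_minus_5 = timezone(timedelta(hours=-5))
# 08:00 local on a Wednesday is 13:00 UTC, inside the default window.
print(in_peak_window(datetime(2024, 1, 3, 8, 0, tzinfo=utc_minus_5)))
# 16:00 local on the same day is 21:00 UTC, outside the default window.
print(in_peak_window(datetime(2024, 1, 3, 16, 0, tzinfo=utc_minus_5)))
```

For a team working 9 AM to 6 PM at UTC-5, the equivalent UTC window would be 14 to 23, so the hour bounds in the query need shifting accordingly.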

Expressions:

Reduce: Last value of A
Threshold: IS BELOW [your traffic drop threshold from baseline]

Pending period: 10m — Traffic dips happen; wait 10 minutes to confirm it's sustained.

Severity: warning

Complete traffic drop alert configuration

Step 3: Create a P95 Latency Warning Alert

Rule name: P95 Latency Warning - [YourServiceName]

Query A:

alb:latency:p95{service=~".*YOUR_SERVICE.*", environment="prod"}

Expressions:

Reduce: Last value of A
Threshold: IS ABOVE [your P95 warning threshold, e.g., 0.5 for 500ms]

Latency is in seconds, not milliseconds. The ALB logs record response time in seconds. A 500ms threshold is 0.5, not 500.
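Since the baseline worksheet records latency in milliseconds while the alert compares against seconds, convert before entering the threshold. A trivial sketch with a hypothetical helper name:

```python
def ms_to_alert_seconds(threshold_ms: float) -> float:
    """Convert a latency threshold recorded in milliseconds to seconds."""
    if threshold_ms < 0:
        raise ValueError("latency threshold cannot be negative")
    return threshold_ms / 1000.0


print(ms_to_alert_seconds(500))   # value to enter for a 500 ms threshold
print(ms_to_alert_seconds(1500))  # value to enter for a 1.5 s threshold
```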

Pending period: 10m

Severity: warning

Step 4: Create a P95 Latency Critical Alert

Use the same query and Reduce expression as the warning alert, but with a higher threshold and a shorter pending period.

Rule name: P95 Latency Critical - [YourServiceName]

Threshold: IS ABOVE [your P95 critical threshold, e.g., 1.5 for 1.5s]

Pending period: 5m — Faster notification for severe degradation

Severity: critical

Step 5: Create Error Rate Alerts

Rule name: Error Rate Warning - [YourServiceName]

Query A:

alb:error_rate_5xx:percent{service=~".*YOUR_SERVICE.*", environment="prod"}

Threshold: IS ABOVE 1 (1% error rate)

Pending period: 10m

Severity: warning

Rule name: Error Rate Critical - [YourServiceName]

Threshold: IS ABOVE 5 (5% error rate)

Pending period: 5m

Severity: critical

Step 6: Link Alerts to Dashboard Panels

For easier investigation, link each alert to its corresponding dashboard panel.

Edit the alert rule
In the Link dashboard and panel section:
- Select your service dashboard
- Select the relevant panel
Save the rule

When the alert fires, the notification will include a direct link to the panel visualization.

Dashboard and panel linking configuration

Recommended Alert Set Summary

Here's a complete set of alerts for a typical service:

Alert	Metric	Threshold	Pending	Severity
Service Down	`rate == 0`	Equals 0	3m	Critical
Traffic Drop (Peak)	`rate < baseline`	Below threshold	10m	Warning
Traffic Spike	`rate > baseline + 3σ`	Above threshold	5m	Warning
P95 Warning	`p95 > warning`	Above threshold	10m	Warning
P95 Critical	`p95 > critical`	Above threshold	5m	Critical
P99 Elevated	`p99 > threshold`	Above threshold	15m	Warning
5xx Warning	`error% > 1`	Above 1%	10m	Warning
5xx Critical	`error% > 5`	Above 5%	5m	Critical
4xx Elevated	`error% > 10`	Above 10%	15m	Info

You can add more alert rules if you need them, but these are the ones we consider most important.

Validation and Testing

Before you call this done, verify everything works correctly.

Verify Dashboard Functionality

Set the time range to "Last 24 hours"
Confirm all panels display data
Check that threshold colors appear correctly on graphs
Test the variable dropdowns — switching environments should refresh data

Verify Alert Rules

Navigate to Alerting → Alert rules
Find your newly created rules
Check that each shows status "Normal" or "Pending" (not "Error")
Click into each rule and verify the preview shows reasonable values

Alert Rules Verification with all rules showing "Normal" or "Pending" status

If an alert shows "Error" status, there's likely a syntax problem in your query or expression. Click into the rule, check the error message, and fix the issue before proceeding.

Test an Alert (Optional)

If you want to verify the notification pipeline works:

Temporarily lower a threshold to trigger the alert
Wait for the pending period to elapse
Confirm you receive the notification
Restore the original threshold

Troubleshooting

No Data in Dashboard Panels

Possible causes:

Service variable doesn't match your target group pattern
Environment variable doesn't match where your service runs
Time range is outside Prometheus retention (older than ~30 days)
Service genuinely has no traffic

To diagnose:

Go to Explore
Run the raw query with explicit filters
Check if data exists for different time ranges
Verify your service pattern against AWS target groups

Gaps in Data Visualization

For longer time ranges (7+ days), you might see gaps in the graph. This is expected behavior due to ALB log batching — see the explanation document for details.

Workarounds:

Use the 5-minute rate metrics (rate5m) instead of 1-minute for trend analysis
Enable "Connect null values" in the panel options
Accept that some granularity is lost at longer time scales

Alert Shows "No Data"

Possible causes:

Query returns no results (check your filters)
Prometheus datasource is unreachable
Metric doesn't exist for your service

To diagnose:

Copy the query from the alert rule
Run it in Explore
If no data, debug the query
If data exists, check that the datasource UID in the alert rule matches the datasource you used in Explore

Alert Never Fires

Possible causes:

Threshold is set too high (or too low for "below" conditions)
Pending period is longer than the anomaly duration
Time-based filters exclude the time when issues occur

To diagnose:

Look at the alert rule's evaluation history
Compare actual metric values against your threshold
Check if the condition was met but resolved before pending period elapsed

Alert Fires Too Often

Possible causes:

Threshold is too aggressive
Pending period is too short
Normal traffic patterns weren't accounted for

To fix:

Re-analyze baselines with a longer time range
Increase thresholds (mean + 3σ instead of mean + 2σ)
Lengthen pending periods
Add time-of-day filters for known low-traffic periods

Maintenance

Baselines aren't set-and-forget. Services change over time.

Quarterly Review

Every 3 months (or after significant service changes):

Re-run baseline analysis with fresh data
Compare new baselines against current thresholds
Adjust thresholds if patterns have shifted
Document changes

After Major Changes

If your service undergoes significant changes (new features, traffic shifts, architecture changes):

Wait 1-2 weeks for new patterns to stabilize
Re-analyze baselines
Update thresholds accordingly

Alert Hygiene

Periodically review your alert history:

Are there alerts that fire frequently but don't indicate real problems? Tune them.
Are there incidents that weren't caught by alerts? Add coverage.
Are alert descriptions still accurate? Update them.

Quick Reference

Metrics Available

Metric	Description
`alb:requests:rate1m`	Request rate (1-min window)
`alb:requests:rate5m`	Request rate (5-min window, smoother)
`alb:latency:p50`	Median response time
`alb:latency:p95`	95th percentile response time
`alb:latency:p99`	99th percentile response time
`alb:error_rate_5xx:percent`	Server error percentage
`alb:error_rate_4xx:percent`	Client error percentage
`alb:bytes_sent:rate1m`	Outbound throughput
`alb:bytes_received:rate1m`	Inbound throughput

For complete details on each metric, see the Explanation Document.

Useful PromQL Patterns

Filter by service and environment:

alb:requests:rate1m{service=~".*YOUR_SERVICE.*", environment="prod"}

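Business hours only (on() is required because hour() and day_of_week() return a series with no labels, which would otherwise never match a labeled metric):

your_metric and on() (hour() >= 9 and hour() < 18 and day_of_week() != 0 and day_of_week() != 6)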

Average over time range:

avg_over_time(alb:latency:p95{service=~".*YOUR_SERVICE.*"}[$__range])

Standard deviation over time range:

stddev_over_time(alb:latency:p95{service=~".*YOUR_SERVICE.*"}[$__range])
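The two patterns above can be combined into a single expression that computes a mean + Nσ value directly, which is handy for cross-checking thresholds you calculated by hand. The sketch below only assembles the query string. The helper name is hypothetical, and the resulting query has not been run against the platform's Prometheus, so treat it as a starting point.

```python
def sigma_threshold_expr(selector: str, sigmas: int = 2, window: str = "$__range") -> str:
    """Build a PromQL expression for mean + sigmas * standard deviation."""
    series = f"{selector}[{window}]"
    return f"avg_over_time({series}) + {sigmas} * stddev_over_time({series})"


print(sigma_threshold_expr('alb:latency:p95{service=~".*YOUR_SERVICE.*"}'))
```

Run the printed expression in a Grafana panel or Explore, where `$__range` expands to the selected time range.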
