Automating Alerts with KQL in Azure Monitor: A Step-by-Step Guide
Catching Performance Gremlins Before They Wreck Your Azure Party
Monitoring application performance and infrastructure health is critical for maintaining reliable systems. Azure Monitor provides a powerful platform to collect, analyze, and act on telemetry data, and its Kusto Query Language (KQL) enables precise, customizable queries to detect issues. In this blog post, I’ll walk you through setting up automated alerts in Azure Monitor using KQL to track application performance or infrastructure issues, ensuring you can respond proactively to potential problems.
Why Use KQL for Alerts in Azure Monitor?
Azure Monitor aggregates metrics, logs, and traces from your Azure resources, applications, and infrastructure. KQL, the query language used in Azure Data Explorer and Azure Monitor, allows you to write sophisticated queries to filter and analyze this data. By combining KQL with Azure Monitor’s alerting capabilities, you can:
Detect anomalies in application performance (e.g., high error rates, slow response times).
Monitor infrastructure health (e.g., CPU spikes, memory exhaustion).
Automate notifications to teams via email, SMS, or integrations like Microsoft Teams or PagerDuty.
Reduce manual monitoring by triggering alerts based on specific conditions.
Let’s dive into the process of setting up an alert using KQL in Azure Monitor.
Prerequisites
Before you begin, ensure you have:
An Azure subscription with access to Azure Monitor.
A resource (e.g., an Azure App Service, VM, or Kubernetes cluster) generating telemetry data.
Log Analytics workspace configured to store logs and metrics.
Basic familiarity with KQL (don’t worry, I’ll provide example queries).
Permissions to create alerts in Azure Monitor.
Step 1: Understand Your Monitoring Needs
First, identify what you want to monitor. Common scenarios include:
Application performance:
High error rates in an API (e.g., HTTP 500 errors).
Slow response times for user requests.
Increased latency in a web app.
Infrastructure issues:
CPU usage exceeding 80% for a virtual machine.
Disk space running low on a server.
Memory leaks in a containerized application.
For this walkthrough, let’s assume we’re monitoring an Azure App Service and want to trigger an alert if the average response time exceeds 2 seconds over a 5-minute period.
Step 2: Write a KQL Query
KQL queries are the foundation of your alert. You’ll write a query to analyze telemetry data in your Log Analytics workspace and return results when the condition is met.
Example Scenario
We want to monitor the requests table in Azure Monitor Logs, which captures HTTP request data for our App Service. The goal is to detect when the average response time (duration) exceeds 2000 milliseconds (2 seconds) over a 5-minute window.
Here’s a KQL query to achieve this:
requests
| where timestamp > ago(5m)
| summarize AvgResponseTime = avg(duration) by bin(timestamp, 5m)
| where AvgResponseTime > 2000
Query Breakdown
requests: The table containing HTTP request data.
where timestamp > ago(5m): Filters data from the last 5 minutes.
summarize AvgResponseTime = avg(duration) by bin(timestamp, 5m): Calculates the average response time in 5-minute intervals.
where AvgResponseTime > 2000: Returns results only when the average response time exceeds 2000ms.
Testing the Query
Go to the Azure Portal > Log Analytics workspace > Logs.
Paste the KQL query and run it.
Verify that the query returns results only when the condition is met (e.g., response time > 2 seconds). If no results are returned, the condition isn’t currently met, which is expected in a healthy system.
Step 3: Create an Alert Rule
Now that you have a working KQL query, let’s create an alert rule in Azure Monitor.
Navigate to Azure Monitor
In the Azure Portal, go to Monitor > Alerts > Create > Alert rule.
Select the scope (e.g., your App Service or Log Analytics workspace) and click Next.
Define the Condition
Under Condition, click Add condition.
Choose Log search alert as the signal type.
In the Search query field, paste your KQL query:
requests
| where timestamp > ago(5m)
| summarize AvgResponseTime = avg(duration) by bin(timestamp, 5m)
| where AvgResponseTime > 2000
Configure the Measure:
Set Aggregation type to Count of rows.
Set Aggregation granularity to 5 minutes (to match the query’s time window).
Set the Threshold:
Operator: Greater than.
Threshold value: 0 (since any returned rows indicate the condition is met).
Set the Evaluation frequency:
Check every: 5 minutes.
Look-back period: 5 minutes (to align with the query’s time range).
Configure Actions
Under Actions, click Add action groups.
Create a new action group or select an existing one.
Define the action:
Email/SMS: Send an email or SMS to your team.
Webhook: Trigger a webhook for tools like Microsoft Teams or PagerDuty.
Azure Function: Run automated remediation (e.g., scale up resources).
Save the action group.
Set Alert Details
Provide an Alert rule name (e.g., HighResponseTimeAlert).
Add a Description (e.g., “Alerts when App Service response time exceeds 2 seconds”).
Select the Severity (e.g., Sev 2 for moderate issues).
Choose the Resource group and Region for the alert rule.
Enable the alert rule upon creation.
Review and Create
Review the configuration and click Create. Your alert rule is now active!
Step 4: Test the Alert
To ensure the alert works, simulate the condition (if possible) or monitor it in a real-world scenario. For example:
Introduce artificial latency in your App Service (e.g., in a test environment).
Check the Alerts tab in Azure Monitor to confirm the alert fires when response time exceeds 2 seconds.
Verify that notifications are sent to the configured action group (e.g., email or Teams).
If the alert doesn’t fire, revisit your KQL query or alert rule settings to ensure the threshold and evaluation frequency are correct.
Step 5: Refine and Expand
Once your alert is working, consider refining it or adding more alerts for other scenarios. Here are some additional KQL query examples:
Monitor HTTP 500 Errors
Alert if the number of HTTP 500 errors exceeds 10 in a 5-minute period:
requests
| where timestamp > ago(5m)
| where resultCode == "500"
| summarize ErrorCount = count() by bin(timestamp, 5m)
| where ErrorCount > 10
Monitor VM CPU Usage
Alert if CPU usage on a virtual machine exceeds 80% for 5 minutes:
Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where timestamp > ago(5m)
| summarize AvgCPU = avg(CounterValue) by bin(timestamp, 5m), Computer
| where AvgCPU > 80
Monitor Disk Space
Alert if free disk space on a server falls below 10%:
InsightsMetrics
| where Name == "FreeSpacePercentage"
| where timestamp > ago(5m)
| summarize AvgFreeSpace = avg(Val) by bin(timestamp, 5m), Computer
| where AvgFreeSpace < 10
Best Practices for KQL Alerts
Optimize Queries: Ensure KQL queries are efficient to avoid performance issues in large datasets. Use filters like where timestamp > ago() early in the query.
Set Appropriate Thresholds: Avoid false positives by testing thresholds in your environment.
Use Action Groups: Centralize notification logic by reusing action groups across multiple alerts.
Leverage Severity Levels: Assign appropriate severity (e.g., Sev 0 for critical, Sev 4 for informational) to prioritize responses.
Monitor Alert Health: Periodically review fired alerts to ensure they’re still relevant and not generating noise.
Integrate with Automation: Use Azure Logic Apps or Azure Functions to automate responses, such as restarting a service or scaling resources.
Troubleshooting Common Issues
No data returned by query: Ensure your resource is sending telemetry to the Log Analytics workspace and the table (e.g., requests) is populated.
Alerts not firing: Check the threshold, evaluation frequency, and look-back period. Verify the query returns results when the condition is met.
Too many alerts: Adjust the threshold or aggregation granularity to reduce noise.
Notifications not received: Confirm the action group is correctly configured and test the notification channel.
TLDR
Automating alerts with KQL in Azure Monitor empowers you to proactively monitor application performance and infrastructure health. By writing targeted KQL queries and configuring alert rules, you can detect issues early and notify the right teams—or even trigger automated remediation. Whether you’re tracking slow response times, server errors, or resource exhaustion, Azure Monitor’s flexibility makes it a powerful tool for maintaining system reliability.
Start small with a single alert, test it thoroughly, and expand to cover more scenarios as you gain confidence. With KQL and Azure Monitor, you’re well-equipped to keep your systems running smoothly.