Statistical Aggregations for Anomaly Detection Using KQL
Unmasking Outliers: How KQL’s Statistical Superpowers Catch Sneaky Anomalies
Anomaly detection is a critical tool for identifying unusual patterns that could signal issues like security breaches, system failures, or operational inefficiencies. While basic aggregations like averages (using avg()) are commonplace in introductory KQL tutorials, they often fall short for robust outlier identification. Kusto Query Language (KQL), the querying powerhouse behind Azure Data Explorer and Azure Monitor, offers advanced statistical functions such as stdev(), variance(), percentiles_array(), and avgif() to delve deeper into data distributions. These help quantify variability and condition-based summaries, enabling more nuanced anomaly detection.
Beyond static summaries, KQL shines in time-series analysis with functions like series_decompose_anomalies(), which decomposes data into trends, seasonal patterns, and residuals to spot deviations in monitoring pipelines. This post explores these tools, with practical examples applied to metrics like network traffic and login attempts—scenarios where outliers might indicate threats or anomalies.
Beyond Averages: Key Statistical Aggregations in KQL
Simple averages provide a central tendency but ignore spread and conditions. Advanced aggregations address this by measuring dispersion (via standard deviation and variance), distribution points (percentiles), and conditional averages. These are typically used with the summarize operator to group and compute metrics.
Standard Deviation (stdev()) and Variance (variance())
Standard deviation measures how much values spread around the mean, and variance is simply its square—both are essential for identifying outliers. High values of either indicate erratic behavior, perfect for flagging unusual spikes in metrics.
Syntax for stdev(): stdev(expression) Returns the standard deviation of the expression across the group.
Syntax for variance(): variance(expression) Returns the variance of the expression across the group.
Consider monitoring login attempts in SigninLogs (a common table in Azure Monitor). KQL does not allow nesting aggregates (avg(count()) is invalid), so first count attempts per hour, then compute the average and standard deviation of those hourly counts, and flag hours where attempts exceed the average plus three standard deviations (a common threshold for outliers):
let HourlyLogins = SigninLogs
| where TimeGenerated > ago(1d)
| summarize Attempts = count() by bin(TimeGenerated, 1h);
HourlyLogins
| extend AvgAttempts = toscalar(HourlyLogins | summarize avg(Attempts)),
         StdevAttempts = toscalar(HourlyLogins | summarize stdev(Attempts)),
         VarianceAttempts = toscalar(HourlyLogins | summarize variance(Attempts))
| extend UpperThreshold = AvgAttempts + 3 * StdevAttempts
| where Attempts > UpperThreshold
| project TimeGenerated, Attempts, AvgAttempts, StdevAttempts, VarianceAttempts, UpperThreshold
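If you want to sanity-check the three-sigma arithmetic outside of Kusto, it is easy to reproduce. Here is a small Python sketch with hypothetical hourly counts (KQL's stdev() is the sample standard deviation, which statistics.stdev matches):

```python
import statistics

# Hypothetical hourly login counts for one day; the last hour spikes.
hourly_attempts = [40, 42, 37, 45, 39, 41, 44, 38, 43, 40, 36, 47,
                   41, 39, 42, 38, 44, 40, 43, 37, 45, 41, 39, 400]

avg_attempts = statistics.mean(hourly_attempts)
stdev_attempts = statistics.stdev(hourly_attempts)  # sample stdev, like KQL stdev()
upper_threshold = avg_attempts + 3 * stdev_attempts

# Flag hours above mean + 3 * stdev, mirroring the query's where clause.
outliers = [a for a in hourly_attempts if a > upper_threshold]
print(outliers)
```

Only the 400-attempt spike clears the threshold; the ordinary hours stay well below it.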
This query groups logins by hour, computes the metrics, and identifies potential brute-force attacks if attempts spike unusually. Similarly, for network traffic in CommonSecurityLog:
let Traffic = CommonSecurityLog
| where TimeGenerated > ago(1d)
| summarize TotalBytes = sum(SentBytes + ReceivedBytes) by bin(TimeGenerated, 5m);
Traffic
| extend AvgBytes = toscalar(Traffic | summarize avg(TotalBytes)),
         StdevBytes = toscalar(Traffic | summarize stdev(TotalBytes)),
         VarianceBytes = toscalar(Traffic | summarize variance(TotalBytes))
| extend AnomalyThreshold = AvgBytes + 3 * StdevBytes
| where TotalBytes > AnomalyThreshold
| project TimeGenerated, TotalBytes, AvgBytes, StdevBytes, VarianceBytes, AnomalyThreshold
High variance here might reveal data exfiltration or DDoS attempts.
Percentiles (percentiles_array())
Percentiles help understand data distribution by showing values below which a given percentage of observations fall. percentiles_array() returns an array of multiple percentiles, useful for spotting outliers in skewed datasets.
Syntax: percentiles_array(expression, percentiles) Where percentiles can be individual values or a dynamic array of values between 0 and 100. Returns a dynamic array of approximate percentile values (KQL percentile calculations are estimates, not exact).
For login attempts, compute the 95th and 99th percentiles to identify extreme values:
let UserLogins = SigninLogs
| where TimeGenerated > ago(7d)
| summarize Logins = count() by UserPrincipalName;
UserLogins
| extend Percentiles = toscalar(UserLogins | summarize percentiles_array(Logins, dynamic([95, 99])))
| extend Percentile95 = todouble(Percentiles[0]), Percentile99 = todouble(Percentiles[1])
| where Logins > Percentile99
| project UserPrincipalName, Logins, Percentile95, Percentile99
This flags users with login counts in the top 1%, potentially indicating compromised accounts. For network traffic:
let HourlyTraffic = NetworkLogs
| summarize TotalTransferred = sum(DataTransferred) by bin(TimeGenerated, 1h);
HourlyTraffic
| extend PercentilesTraffic = toscalar(HourlyTraffic | summarize percentiles_array(TotalTransferred, dynamic([50, 90, 99])))
| extend Median = todouble(PercentilesTraffic[0]), P90 = todouble(PercentilesTraffic[1]), P99 = todouble(PercentilesTraffic[2])
| where TotalTransferred > P99
| project TimeGenerated, TotalTransferred, Median, P90, P99
Outliers above the 99th percentile could signal unusual bursts.
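The percentile cutoff itself is simple to reason about. Here is a Python sketch using the nearest-rank definition on hypothetical per-user counts (KQL's percentiles are estimates, so results can differ slightly at scale):

```python
import math

def nearest_rank(values, p):
    """p-th percentile via the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# Hypothetical weekly login counts for 100 users; one extreme account.
login_counts = list(range(1, 100)) + [5000]

p95 = nearest_rank(login_counts, 95)
p99 = nearest_rank(login_counts, 99)
flagged = [c for c in login_counts if c > p99]
print(p95, p99, flagged)
```

Only the one account above the 99th percentile survives the filter—exactly the "top 1%" logic the query applies.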
Conditional Averaging (avgif())
avgif() computes averages only for rows meeting a condition, ideal for focused analysis like successful vs. failed logins.
Syntax: avgif(expression, predicate) Averages the expression where the predicate is true.
Example for login attempts: since aggregates cannot be nested, count failures per hour first, then average only the hours that actually saw failures (in SigninLogs, ResultType "0" means success):
SigninLogs
| where TimeGenerated > ago(7d)
| summarize FailedCount = countif(ResultType != "0") by bin(TimeGenerated, 1h)
| summarize FailedAvg = avgif(FailedCount, FailedCount > 0)
| where FailedAvg > 50 // Custom threshold for alerting
For network traffic, average bytes only for outbound traffic:
CommonSecurityLog
| summarize OutboundAvg = avgif(SentBytes, CommunicationDirection == "OUTBOUND") by DeviceVendor
| where OutboundAvg > 1000000 // Flag high outbound averages
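Under the hood, avgif() is nothing more than an average taken over the rows where the predicate holds, as this tiny Python sketch with made-up (bytes, direction) rows shows:

```python
# Hypothetical (bytes, direction) rows standing in for CommonSecurityLog.
rows = [(1200, "OUTBOUND"), (300, "INBOUND"),
        (1800, "OUTBOUND"), (500, "INBOUND")]

# Equivalent of avgif(SentBytes, CommunicationDirection == "OUTBOUND"):
# filter by the predicate, then average what remains.
outbound = [b for b, d in rows if d == "OUTBOUND"]
outbound_avg = sum(outbound) / len(outbound)
print(outbound_avg)  # 1500.0
```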
These aggregations provide a foundation for anomaly detection by highlighting deviations in summaries.
Advanced Time-Series Anomaly Detection with series_decompose_anomalies()
For time-series data in monitoring pipelines, series_decompose_anomalies() goes further by decomposing a series into baseline (expected), anomalies, and scores. It's built for sequential data like metrics over time.
Syntax: series_decompose_anomalies(Series, [Threshold, Seasonality, Trend, Test_points, AD_method, Seasonality_threshold])
Series: Numeric array (e.g., from make-series).
Threshold: Anomaly sensitivity threshold (default 1.5).
Seasonality: Period detection (-1 for auto-detect, or an integer number of bins).
Trend: 'avg' (default), 'linefit', or 'none'.
Test_points: Number of points at the end of the series to exclude from the baseline (default 0).
AD_method: 'ctukey' (default) or 'tukey' for the residual outlier test.
Seasonality_threshold: Score threshold for auto-seasonality detection (default 0.6).
Returns three series: anomaly flags (+1/-1/0), anomaly scores, and the calculated baseline.
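The 'tukey' test is the classic fence rule: build quartile-based fences and flag residuals outside them ('ctukey' uses custom percentiles instead of quartiles). A rough Python sketch of the idea—not the function's exact internals:

```python
import statistics

def tukey_outliers(residuals, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(residuals, n=4)  # quartiles
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [r for r in residuals if r < low or r > high]

# Hypothetical residuals after trend/seasonality removal; one is extreme.
residuals = [-2, -1, 0, 1, 2, -1, 0, 1, 25]
print(tukey_outliers(residuals))
```

The small residuals sit inside the fences; only the large one is flagged, which is the shape of behavior the Threshold parameter tunes.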
Building Custom Thresholds in Monitoring Pipelines
In pipelines, use this to create dynamic thresholds based on historical data. For login attempts:
let starttime = 7d;
let timeframe = 1h;
SigninLogs
| make-series Logins = count() default=0 on TimeGenerated from ago(starttime) to now() step timeframe
| extend (Anomalies, Score, Baseline) = series_decompose_anomalies(Logins, 2.0, -1, 'linefit') // Custom threshold 2.0 for stricter detection
| mv-expand Logins to typeof(double), TimeGenerated to typeof(datetime), Anomalies to typeof(double), Score to typeof(double), Baseline to typeof(double)
| where Anomalies != 0
| project TimeGenerated, Logins, Baseline, Score, Anomalies
Here, raising the threshold from the default 1.5 to 2.0 keeps only stronger anomalies, while 'linefit' models the trend as a fitted regression line rather than a simple average.
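To see what 'linefit' trend removal amounts to, here is a minimal Python sketch of the least-squares step on hypothetical hourly counts (illustrative only, not the library's implementation):

```python
def linear_fit(ys):
    """Ordinary least squares over x = 0..n-1; returns (slope, intercept)."""
    n = len(ys)
    x_mean = (n - 1) / 2          # mean of 0..n-1
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

# Hypothetical hourly counts: a rising trend plus one spike at index 5.
series = [10, 12, 14, 16, 18, 60, 22, 24]
slope, intercept = linear_fit(series)
residuals = [y - (slope * x + intercept) for x, y in enumerate(series)]
print(residuals.index(max(residuals)))  # the spike stands out in the residuals
```

After the trend line is subtracted, the steady growth contributes nothing and the spike dominates the residuals—which is what the outlier test then scores.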
For network traffic volume, a comprehensive pipeline query:
let starttime = 14d;
let timeframe = 1h;
let scorethreshold = 5; // Custom score threshold
CommonSecurityLog
| where isnotempty(DestinationIP) and isnotempty(SourceIP)
| make-series Total = count() default=0 on TimeGenerated from ago(starttime) to now() step timeframe by DeviceVendor
| extend (Anomalies, Score, Baseline) = series_decompose_anomalies(Total, scorethreshold, -1, 'linefit')
| mv-expand Total to typeof(double), TimeGenerated to typeof(datetime), Anomalies to typeof(double), Score to typeof(double), Baseline to typeof(double)
| where Anomalies > 0 and Score > scorethreshold // Custom filtering
| project DeviceVendor, TimeGenerated, Total, Baseline, Anomalies, Score
This builds a monitoring flow: aggregate traffic, decompose for anomalies, and alert on high scores. Adjust scorethreshold based on your environment's noise level for custom thresholds.
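To make the baseline/score/flag contract concrete, here is a toy Python version: it uses a median baseline and a MAD-scaled score instead of the real function's seasonal/trend decomposition and Tukey fences, but it returns the same +1/-1/0 flag shape the queries above filter on:

```python
import statistics

def toy_anomalies(series, threshold=3.0):
    """Return +1/-1/0 flags per point (toy stand-in, not the real algorithm)."""
    baseline = statistics.median(series)
    # Median absolute deviation as a robust spread estimate.
    mad = statistics.median(abs(v - baseline) for v in series) or 1.0
    scores = [(v - baseline) / mad for v in series]
    return [1 if s > threshold else -1 if s < -threshold else 0
            for s in scores]

# Hypothetical 5-minute traffic totals with one burst.
traffic = [100, 110, 95, 105, 100, 900, 98, 102]
print(toy_anomalies(traffic))
```

Normal fluctuation scores near zero and stays unflagged; the burst scores far past the threshold and gets a +1—the same pattern the pipeline's where Anomalies > 0 clause keys on.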
TLDR
By leveraging stdev(), variance(), percentiles_array(), and avgif(), you can uncover outliers in metrics like network traffic and login attempts that averages alone miss. For time-series data, series_decompose_anomalies() enables sophisticated decomposition and custom thresholding in automated pipelines, turning raw logs into proactive insights. Experiment with these in Azure Data Explorer or Sentinel to refine your detection strategies—start small, iterate on thresholds, and integrate with alerts for real impact.