Similarity and Distance Metrics in KQL Queries: Advanced Techniques for Threat Intelligence Matching
When Exact Matches Fail, KQL's Fuzzy Logic Prevails
Exact matches aren't always enough. Attackers often use subtle variations in URLs, domains, or indicators of compromise (IOCs) to evade detection—think phishing sites with slightly altered domain names or mutated malware signatures. This is where similarity and distance metrics come into play. Kusto Query Language (KQL), the powerhouse behind Azure Data Explorer, Azure Monitor, and Microsoft Sentinel, offers built-in tools for set-based similarity and can be extended for more advanced string comparisons.
In this post, we'll explore how to leverage KQL's jaccard_index function combined with to_utf8 (or its modern alias unicode_codepoints_from_string) for detecting similar URLs or IOCs in scenarios like web crawling or phishing detection. We'll also discuss extending KQL with Levenshtein distance or custom fuzzy matching using Python plugins for entity resolution tasks, such as matching user names or file hashes across datasets.
Whether you're a threat hunter, data analyst, or security engineer, these techniques can level up your queries for more resilient detections.
Understanding the Jaccard Index in KQL
The Jaccard index, also known as Jaccard similarity coefficient, measures the similarity between two finite sets. It's calculated as the size of the intersection divided by the size of the union of the sets:
Values range from 0 (no similarity) to 1 (identical sets). In KQL, the jaccard_index function takes two dynamic arrays as inputs and computes this metric, treating the arrays as sets (duplicates are ignored).
This is particularly useful for threat intelligence because IOCs like URLs or domains can be broken down into sets of characters, n-grams (substrings of length n), or tokens. For example, in phishing detection, you might compare a suspicious URL against known malicious ones to flag high-similarity variants.
Basic Syntax and Example
Here's the basic syntax:
jaccard_index(set1: dynamic, set2: dynamic)A simple example with numeric sets:
print set1 = dynamic([1, 2, 3]), set2 = dynamic([2, 3, 4])
| extend similarity = jaccard_index(set1, set2)Output: similarity = 0.5 (intersection: {2,3}; union: {1,2,3,4}).
Using Jaccard Index with to_utf8 for URL or IOC Similarity
To apply Jaccard to strings like URLs, convert them to arrays of Unicode codepoints using to_utf8 (deprecated alias for unicode_codepoints_from_string). This treats the string as a set of characters, allowing you to compute character-level similarity.
Why to_utf8? It encodes the string into an array of integers representing Unicode codepoints, which jaccard_index can handle directly.
Example: Basic String Similarity
Suppose you're analyzing web logs for phishing attempts. Compare a suspicious URL like "micros0ft-login.com" against a known one "microsoft-login.com":
let suspicious_url = "micros0ft-login.com";
let known_url = "microsoft-login.com";
print suspicious_chars = to_utf8(suspicious_url),
known_chars = to_utf8(known_url)
| extend similarity = jaccard_index(suspicious_chars, known_chars)This might yield a high similarity score (e.g., ~0.85) due to shared characters, flagging it for review.
In a real query over a table:
WebLogs
| where Url contains "login"
| extend url_chars = to_utf8(Url)
| extend known_malicious_chars = to_utf8("evilphish.com")
| extend similarity = jaccard_index(url_chars, known_malicious_chars)
| where similarity > 0.7
| project Url, similarityThis could detect variants in web crawler data or proxy logs.
Enhancing with N-Grams for Better Accuracy
Character-level Jaccard works for rough matches but ignores order. For more robust URL/IOC similarity (e.g., in phishing detection), use shingling: break strings into overlapping n-grams and compute Jaccard on those sets.
KQL doesn't have a built-in n-gram function, but you can generate them using range and substring:
let url = "microsoft.com";
let n = 3; // trigram
let shingles = pack_array(
substring(url, 0, n),
substring(url, 1, n),
substring(url, 2, n),
// ... extend for full length using mv-apply or a loop
substring(url, strlen(url) - n, n)
);For dynamic generation in a query:
let get_shingles = (s: string, n: int) {
range(0, strlen(s) - n, 1)
| mv-apply start = Column1 to typeof(int) on (
extend shingle = substring(s, start, n)
| summarize shingles = make_set(shingle)
)
| project shingles
};
let suspicious = toscalar(get_shingles("micros0ft.com", 3));
let known = toscalar(get_shingles("microsoft.com", 3));
print similarity = jaccard_index(suspicious, known)This captures sequential similarity better, useful for detecting typo-squatted domains in threat intel feeds.
In practice, apply this in Microsoft Sentinel for hunting similar IOCs across SigninLogs or SecurityEvents.
Extending to Levenshtein Distance and Custom Fuzzy Matching via Plugins
While Jaccard excels at set-based similarity, it doesn't account for edit operations (insertions, deletions, substitutions). That's where Levenshtein distance (edit distance) shines: it quantifies the minimum changes needed to transform one string into another.
KQL lacks a native Levenshtein function, but you can implement it using the Python plugin. This runs sandboxed Python code as a user-defined function (UDF), leveraging libraries like NumPy and Pandas (pre-installed in the sandbox).
Enabling and Using the Python Plugin
First, ensure the Python plugin is enabled in your Azure Data Explorer cluster (it's on by default in many setups). The syntax for a Python UDF:
.create function MyLevenshtein(str1: string, str2: string): (distance: int)
{
python(
nameof(str1, str2), nameof(distance),)
}This defines a pure-Python Levenshtein implementation (no external libs needed).
Example: Entity Resolution with Levenshtein
For resolving similar entities (e.g., user names in logs):
UserLogs
| extend known_user = "john.doe@example.com"
| invoke MyLevenshtein(UserEmail, known_user)
| where distance < 3 // Allow up to 2 edits (typos)
| project UserEmail, distanceThis could match "jhon.doe@example.com" or "john.d0e@example.com" in threat hunting for account compromise.
Custom Fuzzy Matching
For more advanced fuzzy logic (e.g., combining Levenshtein with phonetic matching like Soundex), extend the Python script:
Use fuzzy library? The sandbox doesn't include it, so stick to built-ins or implement.
Or integrate with NumPy for vectorized distances.
Example enhancement:
Add a fuzzy ratio in the Python code using a normalized Levenshtein (1 - distance / max_length).
This plugin approach is powerful for entity resolution in large datasets, like merging IOC lists from different sources.
Practical Applications in Threat Intelligence
Phishing Detection: Use Jaccard on URL shingles to flag similar domains in email logs.
IOC Matching: Compare file hashes or IP ranges with fuzzy thresholds to catch variants.
Entity Resolution: Deduplicate threat actor aliases or user identities across intel feeds.
Remember, high similarity doesn't always mean malice—tune thresholds based on your data to avoid false positives.
TLDR
KQL's jaccard_index with to_utf8 provides a lightweight way to implement set-based similarity for quick threat matching, while Python plugins unlock advanced metrics like Levenshtein for deeper analysis. These tools make KQL a versatile language for cybersecurity workflows.
Experiment in your Azure environment, and share your detections in the comments! For more on KQL functions, check the official docs.
Note: Code examples are illustrative; test in your setup for performance on large datasets.




