Exploring KQL: External Data Integration and Custom IOC Matching for Enhanced Threat Intelligence
How these features work, with practical examples and a look at the hurdles
In the world of data analytics and cybersecurity, Kusto Query Language (KQL) stands out as a powerful tool for querying large datasets in platforms like Azure Data Explorer and Microsoft Sentinel. While KQL excels at handling ingested logs and telemetry, integrating external data sources opens up new possibilities for real-time enrichment—particularly in threat hunting scenarios. This blog post dives into two key features: the externaldata operator for pulling in data from external storage like CSV or JSON files in Azure Blobs, and dynamic lookups using the lookup operator for matching against custom Indicators of Compromise (IOCs). These techniques are especially useful for threat intelligence enrichment but come with challenges around security, data formatting, and performance that make them less commonly explored. We’ll cover how they work, provide practical examples, and discuss the hurdles to watch out for.
The externaldata Operator: Bringing in External Threat Intel
The externaldata operator allows you to treat external files as temporary tables within your KQL queries, pulling data directly from storage artifacts without needing to ingest it first. This is ideal for reference datasets like threat intelligence feeds, which might include lists of malicious IPs, domains, or file hashes stored in CSV or JSON formats on Azure Blob Storage or Azure Data Lake.
Syntax and Basics
The basic syntax defines the schema upfront and points to one or more storage URIs:
externaldata (columnName:columnType [, ...]) [storageConnectionString [, ...]] [with (propertyName=propertyValue [, ...])]

Schema: Specify columns and types (e.g., ip:string for an IP list).
Connection Strings: Use URIs with authentication like Shared Access Signatures (SAS) or managed identities; avoid embedding credentials directly for security reasons.
Properties: Options like format="csv" or ignoreFirstRecord=true to skip headers.
Supported formats include CSV, JSON, TXT, Parquet, and more, with automatic detection based on file extensions where possible. However, it’s limited to small datasets (up to 100 MB) and isn’t suitable for large-scale ingestion—use custom log ingestion for that instead.
Example: Enriching Logs with Threat Intel from Blobs
Imagine you have a CSV file in Azure Blob Storage containing known malicious IPs for threat enrichment. Here’s a query to match successful sign-ins from those IPs:
let BlockList = externaldata(ip:string)
[
h@"https://storageaccount.blob.core.windows.net/container/blocked-ips.csv?SAS_TOKEN"
]
with(format="csv", ignoreFirstRecord=true)
| where ip matches regex @"^(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$";
SigninLogs
| where IPAddress in (BlockList)
| where ResultType == "0"
| project TimeGenerated, UserPrincipalName, IPAddress, Location

This pulls the IP list dynamically and filters sign-in logs for matches, enabling quick IOC detection.
For JSON, you can use ingestion mappings to parse nested structures:
externaldata(Timestamp:datetime, TenantId:guid, MethodName:string)
[
h@"https://storageaccount.blob.core.windows.net/container/events.json?SAS_TOKEN"
]
with(format="multijson", ingestionMapping='[{"Column":"Timestamp","Properties":{"Path":"$.timestamp"}}, ...]')

This is handy for complex threat feeds.
In practice, for threat hunting, you might load domain lists from public GitHub repos or curated blobs to scan inbound emails:
let domainList = externaldata(domain:string)
[@"https://raw.githubusercontent.com/tsirolnik/spam-domains-list/master/spamdomains.txt"]
with(format="txt");
EmailEvents
| where EmailDirection == "Inbound"
| extend EmailDomain = tostring(split(SenderMailFromAddress, '@')[1])
| join kind=inner (domainList) on $left.EmailDomain == $right.domain
| project Timestamp, SenderMailFromAddress, EmailDomain, Subject

Exclude trusted domains with a datatable to reduce false positives.
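As a sketch, the exclusion can be a small inline datatable prepended to the query (the trusted domains below are placeholders, not recommendations):

```kql
// Hypothetical allow-list of sender domains to suppress known-good matches
let TrustedDomains = datatable(domain:string)
[
    "contoso.com",
    "fabrikam.com"
];
EmailEvents
| extend EmailDomain = tostring(split(SenderMailFromAddress, '@')[1])
| where EmailDomain !in (TrustedDomains)
```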
Challenges with externaldata
Security is a big one: Always use read-only SAS tokens or managed identities to avoid exposing sensitive credentials in queries. Format issues arise if the schema doesn’t match the file—mismatched columns or invalid data can cause query failures. Dynamic URLs (e.g., date-appended files) require scripting to normalize and upload to stable blobs. Plus, if the storage is firewalled, the operator won’t work at all. These factors make it tricky for production use, explaining why it’s not as widely adopted.
Dynamic Lookups: Real-Time Matching Against Watchlists
Once you’ve loaded external data, the lookup operator shines for efficient, real-time matching. It extends a “fact” table (your main logs) with columns from a smaller “dimension” table (like a watchlist of IOCs), assuming the dimension is compact for performance.
Syntax and Kinds
LeftTable | lookup [kind=(leftouter|inner)] (RightTable) on Attributes

Kinds: leftouter (default) includes all left rows with nulls for misses; inner only includes matches.
Attributes: Match on shared columns, e.g., IPAddress, or qualified, like $left.IP == $right.maliciousIP.
The right table should be under tens of MBs—check with summarize sum(estimate_data_size(*)).
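For example, a quick sanity check against a candidate dimension table before using it in a lookup:

```kql
// MaliciousHashes stands in for whatever dimension table you plan to use
MaliciousHashes
| summarize TotalBytes = sum(estimate_data_size(*))
```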
Example: Matching IOCs with Lookups
Combine with externaldata for dynamic watchlist matching. Load a hash list and lookup against device files:
let MaliciousHashes = externaldata(sha256:string)
[@"https://storageaccount.blob.core.windows.net/container/emotet-hashes.txt"]
with(format="txt", ignoreFirstRecord=true);
DeviceFileEvents
| lookup kind=inner (MaliciousHashes) on $left.SHA256 == $right.sha256
| project Timestamp, FileName, SHA256, DeviceName

This enriches file events with matches from an external IOC list in real time.
For network log enrichment, you can match traffic against Azure IP ranges stored externally. Note that lookup matches on exact equality, so CIDR subnets call for the ipv4_lookup plugin instead:

let AzureSubnets = externaldata(Subnet:string, Region:string)
[h@"https://storageaccount.blob.core.windows.net/container/azure-ips.csv?SAS_TOKEN"]
with(format="csv");
AzureNetworkAnalytics_CL
| evaluate ipv4_lookup(AzureSubnets, DestinationIP, Subnet, return_unmatched = true)
| where isempty(Region)
| project TimeGenerated, SourceIP, DestinationIP, FlowType

This tags traffic outside Azure's ranges for focused threat analysis.
Challenges with Lookups
Performance drops if the right table is too large: lookup assumes a compact dimension table, so when both sides are large, switch to join (optionally with a shuffle strategy hint). Security ties back to the external source, and format mismatches can break the lookup. In threat scenarios, over-reliance on external feeds risks blind spots if they're incomplete.
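A minimal sketch of the join alternative for two large tables (LargeIOCFeed is a hypothetical ingested IOC table standing in for a feed too big for lookup):

```kql
// Shuffle strategy distributes the join across nodes for large operands
DeviceFileEvents
| join kind=inner hint.strategy=shuffle (LargeIOCFeed) on $left.SHA256 == $right.sha256
```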
TLDR: Why Bother with These Techniques?
Despite the hurdles—credential management, precise formatting, size limits, and the need for scripting to handle dynamic sources—integrating external data via externaldata and lookup unlocks powerful threat intelligence capabilities in KQL. It allows for custom IOC matching without bloating your ingested data, enabling faster detection of anomalies like malicious sign-ins or spam domains. If you’re in cybersecurity, start small with public feeds and secure blobs to experiment. With careful setup, these tools can significantly boost your threat hunting game in Microsoft Sentinel or Azure Data Explorer.