This post is part of an ongoing series to educate about new and known security vulnerabilities against AI.
The full series index (including code, queries, and detections) is located here:
https://aka.ms/MustLearnAISecurity
The book version (pdf) of this series is located here: https://github.com/rod-trent/OpenAISecurity/tree/main/Must_Learn/Book_Version
The book will be updated when each new part in this series is released.
What is a Data Poisoning attack?
A Data Poisoning attack is a type of malicious activity aimed at machine learning models. A successful attack results in incorrect or misleading data being introduced into the model's training set. The objective of this attack is to skew the model's learning process, causing it to make incorrect predictions or classifications.
As you can imagine from the description, data protection is key to defending against this method of attack. While external actors are certainly a threat, more often than not this type of attack is the work of an insider with either legitimate or compromised credentials.
How it works
As noted above, access to the data source used for training is the key element of this attack, which generally follows these steps (a minimal code sketch follows the list):
Model Targeting: The attacker first identifies a target model that they wish to manipulate.
Injecting Poisoned Data: The attacker then injects poisoned data into the training set. This data is carefully crafted to look normal but contains misleading features or labels that are intended to mislead the learning algorithm.
Training on Poisoned Data: The targeted model is trained or retrained using the contaminated training data. The model learns from both the authentic and poisoned data, which can subtly or substantially alter its behavior.
Exploiting the Compromised Model: Once the model has been trained on the poisoned data, it may behave in ways that benefit the attacker. For example, it might systematically misclassify certain types of inputs, or it could leak sensitive information.
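To make steps 2 and 3 concrete, here is a minimal, hypothetical sketch using scikit-learn: a toy dataset stands in for the victim's training data, and a fraction of its labels is flipped before training. The dataset, model choice, and flip fraction are illustrative assumptions, not a recipe drawn from a real incident.

```python
# Minimal sketch of a label-flipping data poisoning attack (illustrative only).
# Assumes scikit-learn and NumPy are installed; the dataset and model are toy stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A toy binary-classification dataset stands in for the victim's training data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

def poison_labels(y, flip_fraction=0.2, seed=0):
    """Flip the labels of a random fraction of samples (the 'injected poisoned data')."""
    rng = np.random.default_rng(seed)
    y_poisoned = y.copy()
    idx = rng.choice(len(y), size=int(len(y) * flip_fraction), replace=False)
    y_poisoned[idx] = 1 - y_poisoned[idx]  # invert the binary labels at the chosen indices
    return y_poisoned

# Train one model on clean labels and one on poisoned labels, then compare behavior.
clean_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
poisoned_model = LogisticRegression(max_iter=1000).fit(X_train, poison_labels(y_train))

print("Clean model accuracy:   ", clean_model.score(X_test, y_test))
print("Poisoned model accuracy:", poisoned_model.score(X_test, y_test))
```

Comparing the two accuracy scores typically shows the poisoned model degrading, which is the kind of subtle or substantial behavioral shift described in step 4.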
Types of Data Poisoning attacks
Data poisoning against AI is an ongoing and evolving area of security. Both the methods used to conduct these attacks and the techniques to defend against them continue to evolve, but understanding the known categories is still essential. Currently, the following types of attacks have been identified and categorized.
Targeted Attacks: These attacks are aimed at specific misclassification or a particular wrong behavior of the model. The attacker may want the model to misclassify images of a certain type or favor one class over another.
Random Attacks: These attacks aren't targeted at any particular misbehavior. Instead, they aim to reduce the overall performance of the model by injecting random noise or incorrect labels into the training data. Both styles are sketched in the example below.
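As a rough, hypothetical illustration of the difference, the helpers below relabel training data in the two styles: one pushes a specific class toward a chosen wrong label, the other scatters random labels across the set. The function names and parameters are assumptions made for this sketch.

```python
# Hypothetical helpers contrasting targeted and random label poisoning (illustrative only).
import numpy as np

def targeted_poison(y, source_class, target_class, fraction, rng):
    """Relabel a fraction of one specific class so the model learns to confuse it."""
    y = y.copy()
    candidates = np.where(y == source_class)[0]
    idx = rng.choice(candidates, size=int(len(candidates) * fraction), replace=False)
    y[idx] = target_class
    return y

def random_poison(y, n_classes, fraction, rng):
    """Assign random labels to a random subset to degrade overall performance."""
    y = y.copy()
    idx = rng.choice(len(y), size=int(len(y) * fraction), replace=False)
    y[idx] = rng.integers(0, n_classes, size=len(idx))  # some may match by chance; that's fine
    return y

rng = np.random.default_rng(1)
labels = rng.integers(0, 3, size=1000)  # toy three-class label vector
print(targeted_poison(labels, source_class=0, target_class=2, fraction=0.3, rng=rng)[:10])
print(random_poison(labels, n_classes=3, fraction=0.1, rng=rng)[:10])
```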
Why it matters
Data poisoning attacks can have serious consequences, such as:
Loss of Integrity: The model may lose its reliability and start making incorrect predictions or decisions.
Loss of Confidentiality: In some cases, attackers may use data poisoning to infer sensitive information about the training data or the individuals involved in the training process.
Reputation Damage: If a poisoned model is widely used, it may lead to the erosion of trust in both the system and the organization responsible for it.
Why it might happen
Beyond feeding a model nefarious or dangerous information, this type of attack is most often considered in a political context, where “fake” information is delivered to alter or steer election results. But imagine if an attacker recategorized “not safe for work” images so that they became viewable, with the goal of getting someone fired.
Real-world Examples
One example of a data poisoning attack against AI is manipulating a model's training data to corrupt its learning process. This can be done by intentionally inserting incorrect, misleading, or manipulated records into the training dataset to skew the model's behavior and outputs. For instance, adding incorrect labels to images in a facial recognition dataset can manipulate the system into misidentifying faces.
Another example is poisoning a model through the data it learns from after deployment. An early case is Tay, Microsoft's Twitter chatbot released in 2016. Microsoft intended Tay to be a friendly bot that Twitter users could interact with. However, within 24 hours of its release, users flooded Tay with toxic content that it learned from, transforming it into a racist and sexist bot. In effect, it was a data poisoning attack carried out through the bot's live training data.
How to mitigate
Defending against data poisoning attacks can be complex, but some general strategies include:
Monitoring Data Access: Using a monitoring mechanism, record user logins and access. Use a Watchlist of trusted users to monitor against.
Monitoring Data Application Activity: Using the same monitoring mechanism, set a baseline for normal activity (time, schedule) and alert on outliers.
Data Validation and Cleaning: Regularly reviewing and cleaning the training data to detect and remove any anomalies or inconsistencies.
Robust Learning Algorithms: Designing algorithms that can detect and mitigate the effects of anomalous data.
Monitoring Model Behavior: Continuously monitoring the model's behavior and performance can help detect unexpected changes that might indicate a poisoning attack. A minimal sketch of this and the data validation step follows this list.
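The sketch below illustrates two of these strategies under simple assumptions: screening an incoming training batch with an off-the-shelf anomaly detector, and comparing model accuracy on a trusted holdout set against a recorded baseline. The thresholds and function names are illustrative, not tuned recommendations.

```python
# Minimal sketch of two strategies above: screening a new training batch with an
# off-the-shelf anomaly detector, and watching model accuracy on a trusted holdout set.
# Thresholds and function names are illustrative assumptions, not tuned recommendations.
from sklearn.ensemble import IsolationForest

def screen_training_batch(X_new, contamination=0.05):
    """Split a new batch (NumPy array of feature rows) into clean rows and rows to review."""
    detector = IsolationForest(contamination=contamination, random_state=0)
    flags = detector.fit_predict(X_new)           # -1 = anomaly, 1 = normal
    return X_new[flags == 1], X_new[flags == -1]

def check_model_drift(model, X_holdout, y_holdout, baseline_accuracy, tolerance=0.05):
    """Alert if accuracy on a trusted holdout set falls well below the recorded baseline."""
    accuracy = model.score(X_holdout, y_holdout)
    if accuracy < baseline_accuracy - tolerance:
        print(f"ALERT: accuracy dropped from {baseline_accuracy:.2f} to {accuracy:.2f}")
    return accuracy
```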
How to monitor
Continuously monitoring and logging data access and data application activity is necessary to detect and respond to potential security incidents quickly. Monitoring should establish a baseline of normal, accurate prompts, and any outliers should be identified and resolved through ongoing mitigation.
Monitoring can be accomplished through a data aggregator that analyzes for outliers. A good example is a modern SIEM, like Microsoft Sentinel, which enables organizations to collect and analyze data and then create custom detections from alerts to notify security teams when prompts are outside norms or organization policies.
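In Microsoft Sentinel this baseline-and-outlier logic would typically be expressed as a KQL query over the ingested logs. The short Python sketch below shows the same idea on a hypothetical exported activity feed; the field names and the three-standard-deviation threshold are assumptions chosen only for illustration.

```python
# Rough sketch of the baseline-and-outlier idea on a hypothetical exported activity feed.
# In Microsoft Sentinel this would typically be a KQL query over ingested logs; the field
# names and the three-standard-deviation threshold here are assumptions for illustration.
from statistics import mean, stdev

# Hypothetical baseline window of normal hourly request counts (e.g., the last week).
baseline_counts = [40, 38, 42, 37, 41, 39, 44, 36]
baseline, spread = mean(baseline_counts), stdev(baseline_counts)

# New activity pulled from the log stream.
new_activity = [
    {"user": "alice", "hour": 10, "request_count": 41},
    {"user": "bob", "hour": 2, "request_count": 500},  # unusual time and volume
]

# Flag anything far outside the learned baseline.
for row in new_activity:
    if abs(row["request_count"] - baseline) > 3 * spread:
        print(f"Outlier: {row['user']} made {row['request_count']} requests at hour {row['hour']}")
```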
For the growing library of queries, detections, and more for Microsoft Sentinel see: OpenAISecurity/Security/Sentinel at main · rod-trent/OpenAISecurity (github.com)
What to capture
It should be noted that “hallucinations” can sometimes be mistaken for Data Poisoning or Prompt Injection attacks. This is why monitoring for activity and outliers is so important to identify an actual attack versus a misconfiguration.
For more on hallucinations, see: Using Azure AI Studio to Reduce Hallucinations
Once you’ve identified the data available in the log stream, you can start to focus on the specific artifacts (evidence) that will be useful in catching potential attackers and creating detections.
Here are a few things to consider capturing (gathered into a hypothetical record after the list):
IP Addresses (internal and external)
Logins: anomalous activity, time elements
Potentially compromised accounts (general access, data application access)
Human and non-human accounts
Geographical data - this is important to match up to known threats (nation state or otherwise)
Data modeling successes AND failures
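To tie these artifacts together, the hypothetical record below shows one way the captured fields might be structured before they are shipped to the SIEM. The field names are illustrative assumptions, not a required or standard schema.

```python
# Hypothetical shape of a captured event record covering the artifacts listed above.
# Field names are illustrative assumptions, not a required or standard schema.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CapturedEvent:
    timestamp: datetime       # time element for login and activity anomalies
    source_ip: str            # internal or external IP address
    account: str              # user or service principal name
    is_human: bool            # human vs. non-human (service) account
    geo_location: str         # geographical data to match against known threats
    accessed_resource: str    # general access vs. data application access
    action: str               # e.g. "login", "dataset_write", "model_retrain"
    succeeded: bool           # capture successes AND failures

event = CapturedEvent(
    timestamp=datetime(2023, 8, 1, 2, 15),
    source_ip="203.0.113.7",
    account="svc-train-pipeline",
    is_human=False,
    geo_location="Unknown",
    accessed_resource="training-dataset",
    action="dataset_write",
    succeeded=True,
)
print(event)
```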
Microsoft Sentinel users, see: Monitor Azure Open AI Deployments with Microsoft Sentinel
[Want to discuss this further? Hit me up on Twitter or LinkedIn]
[Subscribe to the RSS feed for this blog]
[Subscribe to the Weekly Microsoft Sentinel Newsletter]
[Subscribe to the Weekly Microsoft Defender Newsletter]
[Subscribe to the Weekly Azure OpenAI Newsletter]
[Learn KQL with the Must Learn KQL series and book]
[Learn AI Security with the Must Learn AI Security series and book]