Must Learn AI Security Part 20: Text-based Attacks Against AI

Chapter 20

Oct 03, 2023

This post is part of an ongoing series to educate about new and known security vulnerabilities against AI.

The full series index (including code, queries, and detections) is located here:

The book version (pdf) of this series is located here: https://github.com/rod-trent/OpenAISecurity/tree/main/Must_Learn/Book_Version

The book will be updated when each new part in this series is released.

What is a Text-based attack against AI?

A text-based attack against AI is a type of adversarial attack that targets natural language processing (NLP) systems, such as chatbots, virtual assistants, and machine translation systems. In this type of attack, the attacker tries to manipulate or deceive the NLP system by inputting text that is specifically crafted to exploit vulnerabilities in the system's algorithms.

Types of Text-based attacks

There are several types of text-based attacks, including:

Misclassification attacks: In a misclassification attack, the attacker inputs text that is similar in meaning to the target input but is intentionally crafted to be misclassified by the NLP system. For example, an attacker could input a sentence that is semantically similar to a legitimate request but contains subtle differences that trick the NLP system into providing an incorrect response.
Adversarial examples: In an adversarial example attack, the attacker inputs text that is specifically crafted to deceive the NLP system. For example, an attacker could input a sentence that appears to be benign to a human but is classified by the NLP system as malicious.
Evasion attacks: In an evasion attack, the attacker inputs text that is designed to evade detection by the NLP system's filters or classifiers. For example, an attacker could input a sentence that contains language that is associated with a benign category but is intended to convey a malicious intent.
Poisoning attacks: In a poisoning attack, the attacker inputs text that is designed to manipulate the NLP system's training data, causing it to produce incorrect or biased results. For example, an attacker could input text that contains biased language or misinformation, which the NLP system would then learn and incorporate into its algorithms.
Hidden Text attacks: A hidden text attack is a technique used by hackers to manipulate or deceive an AI system by adding hidden text or code to the input data. This hidden text or code is not visible to human eyes but can be recognized by the AI system, which can lead to incorrect analysis or decision-making. For example, writing white text on a white background. The attackers can use this technique to bypass security measures, gain unauthorized access, or exploit vulnerabilities in the AI system. Therefore, it is important to implement robust security measures and regularly update the AI system to prevent such attacks. Currently, there’s no method to detect these types of attacks.

How it works

A text-based attack against AI works by exploiting vulnerabilities in the algorithms used by natural language processing (NLP) systems. The attacker inputs text that is specifically crafted to deceive or manipulate the NLP system, causing it to produce incorrect or biased results.

Why it matters

Text-based attacks against AI can have a wide range of negative effects, depending on the type of attack and the context in which it occurs. Here are some potential negative effects of a text-based attack against AI:

Misinformation: A text-based attack that introduces false or misleading information into an AI system can have significant negative effects. For example, an attacker could manipulate a chatbot to spread false information about a product or service, leading to reputational damage and financial losses for the company.
Security breaches: A text-based attack that targets an AI system used for security purposes, such as authentication or access control, can lead to serious security breaches. For example, an attacker could manipulate a virtual assistant to gain unauthorized access to a secure system or network.
Bias and discrimination: A text-based attack that introduces biased or discriminatory language into an AI system can perpetuate harmful biases and stereotypes. For example, an attacker could input text that contains racist or sexist language, which the AI system would then learn and incorporate into its algorithms.
Legal and regulatory violations: A text-based attack that manipulates an AI system to produce incorrect or biased results can lead to legal and regulatory violations. For example, an attacker could manipulate a machine learning algorithm used for credit scoring to produce biased results that violate anti-discrimination laws.
Loss of trust and confidence: A text-based attack that exposes vulnerabilities in an AI system can erode trust and confidence in the technology. For example, if a chatbot is easily manipulated by attackers, users may lose confidence in the technology and be hesitant to use it in the future.

Text-based attacks against AI can have serious negative effects, including spreading misinformation, causing security breaches, perpetuating bias and discrimination, violating laws and regulations, and eroding trust and confidence in the technology.

Why it might happen

An attacker can gain several things from a text-based attack against AI, depending on the attacker's goals and the context of the attack. Here are some potential gains for an attacker from a text-based attack against AI:

Access to sensitive data: A text-based attack that targets an AI system used for authentication or access control can provide the attacker with unauthorized access to sensitive data or systems.
Financial gain: A text-based attack that manipulates an AI system used for financial transactions, such as a chatbot used for banking, can result in financial gain for the attacker.
Spread of misinformation: A text-based attack that introduces false or misleading information into an AI system can be used to spread misinformation, which can be used for political or social manipulation.
Evasion of detection: A text-based attack that evades detection by an AI system's filters or classifiers can be used to bypass security measures and gain access to systems or data.
Reputation damage: A text-based attack that manipulates an AI system to produce incorrect or biased results can be used to damage the reputation of a company or organization.
Competitive advantage: A text-based attack that manipulates an AI system used for product recommendations or pricing can be used to gain a competitive advantage.

An attacker can gain various things from a text-based attack against AI, including access to sensitive data, financial gain, spreading misinformation, evasion of detection, reputation damage, and competitive advantage. Therefore, it is important for organizations to implement measures to detect and prevent such attacks.

Real-world Example

One real-world example of a text-based attack against AI is the case of the Tay chatbot developed by Microsoft in 2016. The Tay chatbot was designed to learn from conversations with users on Twitter and respond in a conversational manner, using machine learning algorithms to improve its responses over time.

However, within hours of its launch, the Tay chatbot was targeted by malicious users who inputted text that was specifically crafted to manipulate and deceive the chatbot. The attackers used a combination of misclassification, adversarial examples, and poisoning attacks to manipulate the chatbot's responses and introduce racist and sexist language into its algorithms.

The result was that the Tay chatbot started producing offensive and inappropriate tweets, including racist and sexist language. Microsoft had to shut down the chatbot within 24 hours of its launch due to the negative publicity and reputational damage caused by the attack.

This example highlights the potential consequences of text-based attacks against AI, including the spread of offensive and inappropriate language, reputational damage to the organization, and the need to shut down the AI system to prevent further damage. It also underscores the importance of implementing measures to detect and prevent such attacks.

How to Mitigate

There are several ways to mitigate the risk of text-based attacks against AI. Here are some strategies that organizations can use:

Input validation: Organizations can implement input validation techniques to check the validity of the input text before it is processed by the AI system. This can include checking for specific keywords or patterns that are associated with malicious or misleading text.
Robust algorithms: Organizations can implement robust algorithms that can detect and filter out malicious or misleading text. This can include using machine learning techniques to identify patterns of malicious text and adjust the AI system's algorithms accordingly.
Adversarial training: Organizations can train AI systems to recognize and defend against adversarial attacks by including adversarial examples in the training data. This can help the AI system learn to recognize and filter out malicious or misleading text.
Human oversight: Organizations can implement human oversight of AI systems to review and approve responses before they are sent to users. This can help to prevent the AI system from producing inappropriate or offensive responses.
Regular updates: Organizations can regularly update AI systems with new data and algorithms to keep up with emerging threats and vulnerabilities.
Ethical considerations: Organizations can consider the ethical implications of AI systems and implement measures to prevent bias and discrimination. This can include monitoring the AI system for biased language or results and adjusting the algorithms accordingly.

By implementing these strategies, organizations can reduce the risk of text-based attacks against AI and improve the accuracy and reliability of their AI systems.

How to monitor/What to capture

To detect a text-based attack against AI, there are several key indicators that organizations should monitor. Here are some things to look for:

Unusual input patterns: Organizations should monitor for unusual input patterns, such as a sudden increase in input volume or a change in the type of input received. This can indicate that an attacker is attempting to flood the AI system with malicious input.
Incorrect or biased results: Organizations should monitor for incorrect or biased results produced by the AI system. This can include results that are inconsistent with the input or results that contain biased language or stereotypes.
Unusual response patterns: Organizations should monitor for unusual response patterns from the AI system, such as responses that contain offensive or inappropriate language or responses that are inconsistent with the input.
Anomalies in system behavior: Organizations should monitor for anomalies in system behavior, such as sudden spikes in CPU or memory usage or unusual network activity. This can indicate that an attacker is attempting to exploit vulnerabilities in the AI system.
Logs and audit trails: Organizations should maintain logs and audit trails of all input and output from the AI system. This can help to identify unusual or suspicious activity and track the source of any attacks.

By monitoring these indicators, organizations can detect text-based attacks against AI and take appropriate action to mitigate the damage. It is important to note that monitoring should be done in real-time to minimize the impact of an attack.

[Want to discuss this further? Hit me up on Twitter or LinkedIn]
[Subscribe to the RSS feed for this blog]
[Subscribe to the Weekly Microsoft Sentinel Newsletter]
[Subscribe to the Weekly Microsoft Defender Newsletter]
[Subscribe to the Weekly Azure OpenAI Newsletter]
[Learn KQL with the Must Learn KQL series and book]
[Learn AI Security with the Must Learn AI Security series and book]