In this part of the AI series, we will cover the topic of adversarial machine learning and attempt to create a taxonomy of AI attacks. We will also discuss mitigation strategies in Part #4.
Q1. What is Adversarial Machine Learning?
Adversarial machine learning is a technique employed by threat actors to manipulate machine learning models. Various attack types and methods exist within machine learning systems, which organizations should be aware of and consider how to prevent and detect them.
Let’s begin with the taxonomy – the types of adversarial attacks. There are various taxonomies available, but in this classification, I have primarily based it on ENISA and NIST‘s taxonomies, segmenting the threats into two categories: Predictive AI (PredAI) and Generative AI (GenAI).
Q2. What is PredAI?
PredAI systems are designed to make predictions or forecasts based on input data. They excel at tasks such as classification, regression, and pattern recognition. They typically take existing data as input and use it to make predictions about future data points or outcomes.
Examples of predictive AI applications include spam email detection, recommendation systems (e.g., suggesting products or content based on user behavior), and medical diagnosis.
Q3. What is GenAI?
Generative AI focuses on the creation of new content, generating outputs that are “original” and “novel”. Generative AI models have the ability to generate realistic images, compose music, write text, and even design virtual world. It leverages different techniques such as Generative Adversarial Networks (GANs), Variational Auto encoders (VAEs), and Autoregressive Models to learn patterns and distributions from existing data and generate new samples.
PredAI Adversarial Attacks:
1. Poisoning Attacks [Attack Stage= Training Time]
Poisoning attacks are among the earliest threats that can occur during training phase of a machine learning model’s lifecycle. In this type of attack, malicious data is inserted into the training dataset, subtly altering the learning process. Consequently, the model learns incorrect associations, leading to erroneous predictions when deployed.
The inaccuracies resulting from this tainted data could have significant real-world consequences for a machine learning model. For instance, if a model trained on poisoned data were used for medical diagnosis, it could potentially have severe implications for both patients and businesses.
Artists, writers, and content creators have started to use data poisoning as a means to defend against or stop the unauthorized use of their data.
A new tool called “Nightshade empowers artists to subtly alter pixels in their artwork, making changes that are virtually invisible to the human eye. This confuses AI programs, causing them to interpret the image as something entirely different from what viewers see. For example, a picture of a dog could appear to AI as a cat, or a car might seem like a cow, as explained in the MIT Technology Review‘s preview of Nightshade.
- Real Life Attacks:
Some real-life cases of include VirusTotal poisoning. In the case investigation, it was revealed that many samples of a particular ransomware family were submitted through a popular virus-sharing platform within a short amount of time. Based on string similarity, the samples were all equivalent, and based on code similarity, they were between 98% and 74% similar. Interestingly enough, the compile time was the same for all the samples. After more digging, researchers discovered that someone used ‘metame,’ a metamorphic code manipulating tool, to manipulate the original file into mutant variants. The variants would not always be executable, but they are still classified as the same ransomware family.
2. Evasion Attacks [Attack Stage= Deployment Time]
In an evasion attack, the adversary’s goal is to generate adversarial examples. These examples are defined as testing samples whose classification can be changed to an arbitrary class of the attacker’s choice at deployment time with only minimal perturbations. In some cases, the attacker has access to information, such as the model and its parameters, that enables them to directly create adversarial examples. In this white-boxing setup, the adversary can utilize the model’s gradient to determine the most effective perturbation to apply to the input data in order to evade the model
In the following famous research example, researchers were able to change GoogLeNet’s classification of an image by adding an imperceptibly small vector. The vector consisted of elements equal to the sign of the elements of the gradient of the cost function with respect to the input.
- Real Life Attacks:
Some real-life cases include Proof Pudding (CVE-2019-20634) case where ML researchers evaded ProofPoint’s email protection system by first building a copy-cat email protection ML model, and using the insights to bypass the live system. More specifically, the insights allowed researchers to craft malicious emails that received preferable scores, going undetected by the system.
3. Privacy Attacks (Data, Membership and Model Disclosure) [Attack Stage= All Lifecycle]
Privacy attacks related to data and model disclosure typically refer to situations where an individual’s or an organization’s sensitive information, data, or proprietary machine learning models are exposed or compromised in some way.
Membership inference attacks also expose private information about an individual, similar to reconstruction or memorization attacks. These attacks remain a significant concern when releasing aggregate information or machine learning models trained on user data. In certain situations, determining that an individual is part of the training set already carries privacy implications, such as in a medical study involving patients with a rare disease. Furthermore, membership inference can serve as a building block for launching data extraction attacks.
- Real Life Attacks:
Some real cases include ChatGPT Plugin Privacy Leak where researchers uncovered an indirect prompt injection vulnerability within ChatGPT, where an attacker can feed malicious websites through ChatGPT plugins to take control of a chat session and exfiltrate the history of the conversation. As a result of this attack, users may be vulnerable to Personal Identifiable Information (PII) leakage from the extracted chat session.
4. AI Supply Chain Attacks [Attack Stage= All Lifecycle]
Given that AI is fundamentally software, it inherits numerous vulnerabilities commonly associated with traditional software supply chains. AI supply chain threat usually refers to the compromise of a component or developing tool of the ML application including the software, data, and model supply chains, as well as network and storage systems.
- Real Life Attacks:
In some real-life cases, a malicious binary was uploaded to the Python Package Index (PyPI) code repository in late 2022. This malicious binary had the same name as a PyTorch dependency, and the PyPI package manager (pip) inadvertently installed this malicious package instead of the legitimate one.
This type of supply chain attack, commonly referred to as “dependency confusion,” led to the exposure of sensitive information on Linux machines that had installed the affected versions of PyTorch-nightly through pip.
Generative AI (GenAI) Attacks:
While many attack types in the PredAI taxonomy are applicable to General Artificial Intelligence (GenAI), such as model and data poisoning, evasion, model extraction, and more, it is crucial to emphasize that a significant portion of recent research in GenAI security has been devoted to addressing novel security issues in LLMs (Large Language Models).
- LLM Prompt Injection [Attack Stage: Deployment]
An adversary can create malicious prompts as inputs to a Language Model (LLM) in order to manipulate the LLM into behaving in unintended ways. These manipulations, often referred to as ‘prompt injections,’ are typically designed to make the model disregard some of its original instructions and instead follow the adversary’s commands.
Prompt injections can serve as the initial access point to the LLM, providing the adversary with a foothold to execute further steps in their operation. They are often engineered to bypass the LLM’s defenses or enable the adversary to issue privileged commands. The effects of a prompt injection can persist throughout an interactive session with the LLM.
Malicious prompts may be injected directly by the adversary (Direct) with the intention of either using the LLM to generate harmful content or gaining a foothold on the system for further actions. Alternatively, prompts can be injected indirectly (Indirect) as part of the LLM’s normal operation, where it ingests the malicious prompt from another data source. This type of injection can also be exploited by the adversary to establish a foothold on the system or target the LLM’s user.
- Real Life Attacks:
Indirect Prompt Injection: Turning Bing Chat into a Data Pirate case: The researchers demonstrated practical viability against both real-world systems, such as Bing’s GPT-4-powered Chat and code completion engines, as well as synthetic applications built on GPT-4. They showed how processing retrieved prompts can function as arbitrary code execution, manipulate the application’s functionality, and control whether and how other APIs are called.
Defences against Adversarial Attacks:
The challenge of developing defenses against adversarial attacks persists due to our limited understanding of why these attacks are so effective, despite considerable research efforts. Many traditional regularization techniques intended to enhance the robustness of neural networks, such as dropout and weight decay, have proven ineffective in this context. In fact, only a handful of methods have shown substantial benefits thus far. In our upcoming AI Series #4, we will delve into the topic of defending against adversarial attacks.