Securing AI Models - Gradient Divergence

In our previous AI series, we delved into the realm of adversarial machine learning and introduced a system for classifying AI attacks. In this fourth installment of our AI Series, our focus will shift towards methods to mitigate these attacks.

Let’s follow the same structure..

Previously, we categorized AI-related attacks into two main categories: Generative AI (GenAI) and Predictive AI (PredAI). Let’s follow the same structure and begin by revisiting potential threats in PredAI and GenAI and then proceed to discuss proactive methods for preventing threats and model’s misuse.

Article content — https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-2e2023.pdf

PREDAI-V01 – Data Poisoning Attacks:

The aim of data poisoning attack is to manipulate the model’s learning process subtly, causing it to associate certain inputs –> incorrect outputs. Data poisoning attacks occur when an attacker introduces misleading or false data into a system’s training set, causing the model to make inaccurate predictions or decisions.

Defensive Measures- [No easy fix]:

The main problem with data poisoning is that it’s not easy to fix. Models are retrained with newly collected data at certain intervals, depending on their intended use and the owner/developer’s preference. Since poisoning usually happens over time, and over some number of training cycles, it can be hard to tell when prediction accuracy starts to shift.

Given the challenges associated with fixing poisoned models, model developers still need to prioritize measures that can either block/prevent attack attempts or detect malicious inputs before the next training cycle occurs. There are several commonly used techniques against data poisoning, including:

Data sanitization and validation: Ensuring that training data undergoes thorough validation and verification before being used to train the model. This can be achieved by implementing (i) data validation checks and (ii) employing multiple data labelers to validate the accuracy of data labeling.
Diverse Data Sources: Utilizing multiple, diverse sources of data can dilute the impact of poisoned data, thereby reducing the effectiveness of potential attacks.
Provenance Tracking: Maintaining a transparent and traceable record of data sources, modifications, and access patterns can assist in post-hoc analysis in cases of suspected poisoning.
Anomaly detection: Employing anomaly detection techniques to identify any abnormal behavior in the training data, such as sudden changes in data distribution or data labeling. These techniques can help detect data poisoning attacks at an early stage.
Robust Learning: Implementing robust model training techniques to make AI models more resilient against data poisoning attacks. Methods such as:

(i) Regularization,

(ii) Ensemble learning,

(iii) Adversarial training

can improve model generalization and aid in the detection or rejection of poisoned samples.

PREDAI-V02 -Evasion Attacks:

In a model evasion attack, an adversary deliberately manipulates input data in a way that is often imperceptible to humans but is designed to deceive the model.

Defensive Measures- [3 main classes]:

From the wide range of proposed defenses against adversarial evasion attacks, three main classes have proved resilient and have the potential to provide mitigation against evasion attacks:

Adversarial training: Adversarial training is a general technique that enhances the training process by incorporating adversarial examples. These adversarial examples are generated iteratively during training while using their correct labels. The more potent the adversarial attacks used to create these examples, the greater the resilience of the trained model becomes. It’s worth noting that adversarial training often results in models with richer semantic meaning compared to standard models. However, this benefit typically comes at the expense of reduced accuracy when dealing with clean data. Furthermore, adversarial training is computationally expensive due to the iterative generation of these adversarial examples during training.
Randomized smoothing: The main idea behind randomized smoothing is to add random noise to the input data and then evaluate the model’s predictions over multiple noisy samples. By aggregating these predictions, randomized smoothing aims to make the model more robust to perturbations in the input data.
Formal verification: Another method for certifying the adversarial robustness of a neural network is based on techniques from FORMAL METHODS. Normal verification techniques have significant potential for certifying neural network robustness, but their main limitations are their lack of scalability, computational cost, and restriction in the type of supported operations.

PREDAI-V03-Privacy Attacks:

Attackers may have an interest in acquiring information related to the training data, resulting in data privacy attacks, or obtaining information about the machine learning model, leading to model privacy attacks. These attackers could have different objectives when compromising the privacy of either the training data or the model, which include:

(i) data reconstruction, involving the inference of content or features of the training data,

(ii) membership-inference attacks, which entail inferring the presence of specific data within the training set, and

(iii) model extraction attacks, where attackers seek to extract information about the model itself.

Defensive Measures-[No Silver Bullet]

Differential privacy provides a rigorous notion of privacy, protecting against membership inference and data reconstruction attacks. To achieve the best balance between privacy and utility, empirical privacy auditing is recommended to complement the theoretical analysis of private training algorithms.

Other mitigation techniques against model extraction, such as limiting user queries to the model, detecting suspicious queries to the model, or creating more robust architectures to prevent side channel attacks exist in the literature. However, these techniques can be circumvented by motivated and well-resourced attackers and should be used with caution.

PREDAI-V04-AI Supply Chain Attacks:

Since AI is software, it inherits many of the vulnerabilities of the traditional software supply chain. However, many practical GenAI and PredAI tasks begin with open-source models or data that have typically been out of scope for traditional cybersecurity.

Defensive Measures [Software Supply Chain is complicated]:

AI Supply Chain Attacks involve unauthorized modifications or replacements of machine learning libraries or models, including associated data, within a system. These attacks pose a significant security risk to AI-based systems. To prevent them, there exists series of essential measures.

Package Verification: The package verification process involves verifying the digital signatures of packages before installation to ensure their integrity, selecting secure package repositories like Anaconda that adhere to strict security standards and vet packages for safety, regularly updating all packages to patch known vulnerabilities, thus reducing potential entry points for attackers, and utilizing package verification tools such as PEP 476 and Secure Package Install to enhance the security of package installations.
Isolation: Employing virtual environments to isolate packages and ease the detection and removal of malicious ones.
Code Review: Consistently conducting code reviews on all used packages and libraries to identify and mitigate malicious code.
Educating Developers: Lastly, it is crucial to educate developers about the risks associated with AI Supply Chain Attacks and emphasize the significance of package verification before installation, in order to uphold the integrity and security of AI systems.

GENAI/LLM Attacks

LLM specific, The OWASP has a living document that’s constantly updated with new insights from a community of nearly 500 international security experts, practitioners, and organizations in the field. It has gone through a few iterations, reflecting the rapidly evolving landscape of AI security and LLM vulnerabilities.

While numerous attack techniques within the PredAI classification are applicable to General Artificial Intelligence (GenAI), such as model and data poisoning, evasion, model extraction, and more, it is essential to underscore that a substantial portion of recent research in GenAI security has been dedicated to addressing emerging security concerns specifically associated with Large Language Models (LLMs).

GEN-AI-V01- LLM Prompt Injection

An adversary possesses the capability to create malicious prompts as inputs to an LLM with the intention of coercing the LLM into behaving in unintended ways. These manipulative actions, commonly referred to as ‘prompt injections,’ are typically crafted to divert the model from its original instructions, compelling it to follow the adversary’s directives.

A “Prompt Injection Vulnerability” arises when an attacker adeptly manipulates a large language model (LLM) through meticulously crafted inputs, leading the LLM to unknowingly execute the attacker’s objectives. Such manipulation can manifest either directly, by compromising the system prompt, or indirectly, through the manipulation of external inputs.

How to Prevent:

Prompt injection vulnerabilities can occur because Large Language Models (LLMs) don’t separate instructions and external data clearly. Since LLMs understand natural language, they treat both types of input as if they come from users. This makes it challenging to completely prevent prompt injections in LLMs, but there are measures that can help reduce their impact:

Enforce Privilege Control: To bolster security, implement stringent privilege control measures for the LLM, including the allocation of distinct API tokens for functions like plugins, data access, and fine-grained permission control at the function level. Adhere to the principle of least privilege, granting the LLM access only to the resources necessary for its designated tasks while tightly regulating its access to backend systems.
Include Human Oversight (If Necessary): Include human oversight, when deemed necessary, for critical actions such as sending emails or deleting records. This practice involves engaging a human in the process to review and approve the action. By doing so, it mitigates the risk of indirect prompt injections leading to unauthorized actions being taken on behalf of users without their consent or awareness.
Segregate External Content: To mitigate risks, it is essential to clearly distinguish and label untrusted content when incorporating it into user prompts, thereby minimizing its impact. For instance, when making OpenAI API calls, employ ChatML to specify the origin of the prompt input provided to the LLM.
Establish Trust Boundaries: Establish well-defined trust boundaries between the LLM, external sources, and extensible functionalities such as plugins or downstream processes. Regard the LLM as an untrusted entity and consistently prioritize user control over decision-making procedures. Recognize the potential for a compromised LLM to act as an intermediary, potentially altering or concealing information before presenting it to the user. Implement visual cues to flag potentially untrustworthy responses for user awareness and vigilance.
Periodically Monitor Input and Output: Continuously check that the input and output of the LLM align with expectations. Although this doesn’t directly prevent prompt injections, it helps in detecting weaknesses and addressing them.

GEN-AI-V02- Insecure Plugin Design

LLM plugins serve as extensions that, when activated, are automatically invoked by the model during user interactions. The execution of these plugins is primarily controlled by the model integration platform, often leaving the application with limited control, especially in scenarios where the model is hosted by a third party. Additionally, plugins frequently lack proper input validation or type-checking for handling context-size limitations when dealing with free-text inputs from the model.

This deficiency opens the door for potential attackers to craft malicious requests targeting the plugin, potentially leading to a wide range of adverse consequences, including but not limited to remote code execution.

Defensive Measures:

Insecure plugin design can be prevented by ensuring that plugins:

Enforce strict parameterized input
Use appropriate authentication and authorization mechanisms
Require manual user intervention and approval for sensitive actions
Are thoroughly and continuously tested for security vulnerabilities

Discussion and Remaining Challenges

The Scale Challenge: Data plays a crucial role in the training of models. As models increase in complexity, the volume of training data also increases proportionally. This trend is evident in the development of large language models (LLMs). While most of today’s LLMs do not disclose comprehensive information about their data sources, those that do reveal the vast amount of data consumed during training. The latest multi-modal GenAI systems further intensify this demand by requiring substantial data for each modality.

It’s important to note that there is no single organization or even a nation that possesses all the data used to train a specific LLM. Data repositories consist of labels and links to other servers that house the actual data samples. This dynamic nature challenges the traditional concept of corporate cybersecurity perimeters and introduces new, challenging risks. Recently released open-source data manipulation tools, initially created with good intentions to protect artists’ copyright, can become detrimental if misused by individuals with malicious intent.

Another issue related to scale is the ability to generate synthetic content on a large scale on the internet. While watermarking can help address this issue to some extent, the existence of powerful, open, or ungoverned models presents opportunities for generating significant volumes of unmarked synthetic content. This can adversely affect subsequent LLMs, potentially leading to model collapse.

Addressing these scale-related challenges is of utmost importance for the future development of foundational models, particularly if we aim to ensure their alignment with human values.