Machine learning models have become so good at tasks like computer vision and natural language processing that they exceed human performance on many of them. Yet these models can be fooled by small perturbations to their input data, causing them to, for example, misclassify an image or fail to detect an object that is plainly there. Let’s find out how these “adversarial examples” work and why they are so effective against current state-of-the-art models.
Adversarial Example Generation
Let’s start with a task where adversarial examples are easy to generate: image classification, where a model tries to pick the category that best describes an input image. While the model gets it right most of the time on a normal dataset, once you add some specially constructed noise, it rarely returns the correct classification. That noise could take the form of a small patch, a one-pixel perturbation, or even an imperceptible pattern applied across the entire image. Yet these small changes can cause an image classification model to perform worse than if it had just guessed randomly.
How are these perturbations generated? Adversarial attacks fall into two categories: white-box attacks and black-box attacks. In a white-box attack, the adversary has complete knowledge of the model’s architecture and weights; in a black-box attack, the adversary knows nothing beyond the outputs the model returns.
White-Box Attacks
The simplest and most common introduction to adversarial attacks is the FGSM (Fast Gradient Sign Method), a white-box attack. After computing the loss on a model’s classification of an image, FGSM backpropagates that loss to the input, producing a gradient the same size as the image, and takes the sign of each value in that gradient. Multiplying this sign matrix by a small epsilon keeps the perturbation tiny, yet because the step moves in the direction that increases the loss, it still fools the model most of the time. Here’s the formula for the FGSM:
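adv_x = x + ε · sign(∇_x J(θ, x, y))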
where adv_x is the adversarial image, x is the original image, ε is the perturbation size, and ∇_x J(θ, x, y) is the gradient, with respect to the input, of the loss J between the model’s output and the true label y.
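Here is a minimal sketch of FGSM in PyTorch, assuming a standard image classifier with inputs scaled to [0, 1] (the function name and the default epsilon are illustrative, not from a specific library):

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=8 / 255):
    # x: batch of images with pixel values in [0, 1]; y: true class labels.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)    # J(theta, x, y)
    loss.backward()                        # gradient of the loss w.r.t. the input pixels
    adv_x = x + epsilon * x.grad.sign()    # step in the direction that increases the loss
    return adv_x.clamp(0, 1).detach()      # keep the result a valid image
```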
The FGSM is powerful for such a simple attack, but there are white-box attacks that are even stronger. PGD (Projected Gradient Descent) updates the noise over multiple iterations before returning an adversarial image that causes misclassification. You can think of it as an iterative FGSM: each step takes a small gradient-sign step and then projects the accumulated perturbation back into a small budget around the original image, and the more iterations it runs, the more powerful the attack (and the more likely the model under attack misclassifies).
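Building on the FGSM sketch above, PGD might look something like this (inputs again assumed to be in [0, 1]; the step size and iteration count are illustrative defaults):

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon=8 / 255, alpha=2 / 255, steps=10):
    x_orig = x.clone().detach()
    # Start from a random point inside the epsilon-ball around the original image.
    adv_x = (x_orig + torch.empty_like(x_orig).uniform_(-epsilon, epsilon)).clamp(0, 1)
    for _ in range(steps):
        adv_x = adv_x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(adv_x), y)
        loss.backward()
        adv_x = adv_x + alpha * adv_x.grad.sign()                   # FGSM-like step
        adv_x = x_orig + (adv_x - x_orig).clamp(-epsilon, epsilon)  # project back into the ball
        adv_x = adv_x.clamp(0, 1)                                   # keep pixels valid
    return adv_x.detach()
```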
Other, more sophisticated attacks, such as the C&W attack, formulate the creation of adversarial samples as an optimization problem.
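Roughly speaking, the L2 variant of C&W searches for the smallest perturbation δ that still changes the prediction, by solving something like:

minimize ||δ||² + c · f(x + δ)

where c trades off the size of the perturbation against the strength of the attack, and f is a term that only becomes small once x + δ is classified the way the attacker wants.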
Black-Box Attacks
After reading all this about white-box attacks, you might be thinking that a natural defense would be to prevent the model’s architecture and gradients from being accessed, leaving these attacks with no information to work from. But adversarial examples can still be generated even if the model is a black box, where you know nothing about it other than its inputs and outputs.
One form of black-box attack exploits a property of adversarial attacks known as transferability: adversarial images that work on one model often work well on similar models. By repeatedly querying the original model, a new model can be trained to mimic its behavior. This new model, known as the surrogate model, can then be attacked with white-box methods, and the resulting adversarial samples often transfer, causing the original model to misclassify as well.
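A minimal sketch of that idea, assuming query access to the target model (black_box_predict, surrogate, and queries are illustrative names rather than a specific library’s API):

```python
import torch
import torch.nn.functional as F

def train_surrogate(black_box_predict, surrogate, queries, epochs=10, lr=1e-3):
    # black_box_predict: the attacker's query access to the target model (inputs in, labels out).
    # surrogate: any local model the attacker controls; queries: inputs spent on the target.
    opt = torch.optim.Adam(surrogate.parameters(), lr=lr)
    labels = black_box_predict(queries)  # only the target's outputs are used
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(surrogate(queries), labels)  # imitate the target's behavior
        loss.backward()
        opt.step()
    return surrogate  # attack this copy with FGSM/PGD, then transfer the images to the target
```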
Other types of black-box attacks do not use surrogate models at all. For example, the one-pixel attack uses differential evolution to find which pixel to change, and what to change it to, in order to fool the model. Start with a random population of one-pixel perturbations. For each perturbation, generate a new candidate close to the original one. If the new candidate outperforms the old one (for example, by reducing the model’s confidence in the correct class), it replaces the old perturbation in the population; otherwise, the new candidate is discarded. Surprisingly often, a single-pixel change can be found this way that causes the model to misclassify.
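Here is a simplified sketch of that evolutionary search, following the description above rather than the exact differential-evolution update from the one-pixel paper (predict_prob is assumed to return class probabilities for a single H×W×3 image with values in [0, 1]):

```python
import numpy as np

def one_pixel_attack(predict_prob, image, true_class, pop_size=50, iters=100):
    h, w, _ = image.shape

    def fitness(cand):
        # Lower confidence in the true class means a better perturbation.
        r, c = int(cand[0]) % h, int(cand[1]) % w
        perturbed = image.copy()
        perturbed[r, c] = np.clip(cand[2:], 0, 1)
        return predict_prob(perturbed)[true_class]

    # Start with a random population of one-pixel perturbations: (row, col, r, g, b).
    scale = np.array([h, w, 1, 1, 1])
    pop = np.random.rand(pop_size, 5) * scale
    scores = np.array([fitness(p) for p in pop])
    for _ in range(iters):
        for i in range(pop_size):
            child = pop[i] + np.random.normal(scale=0.1, size=5) * scale  # candidate near the parent
            child_score = fitness(child)
            if child_score < scores[i]:            # the child fools the model more: keep it
                pop[i], scores[i] = child, child_score
    return pop[scores.argmin()]                    # best single-pixel perturbation found
```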
Targeted Attacks and Untargeted Attacks
Other than white-box and black-box attacks, adversarial attacks can also be divided into targeted and untargeted attacks. Targeted attacks cause a model to return a specific class (often by maximizing the confidence of the target class or minimizing the loss between the model output and the target class). On the other hand, untargeted attacks just aim to confuse the model by maximizing the loss value between the model output and the correct class or minimizing the confidence of the correct class.
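In code, the difference often comes down to the sign of a single gradient step. A minimal sketch, assuming the same kind of PyTorch setup as above:

```python
import torch
import torch.nn.functional as F

def attack_step(model, adv_x, labels, alpha=2 / 255, targeted=False):
    # labels is the true class when targeted=False, and the attacker-chosen class when targeted=True.
    adv_x = adv_x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(adv_x), labels)
    loss.backward()
    direction = -1 if targeted else 1    # descend toward the target vs. ascend away from the truth
    return (adv_x + direction * alpha * adv_x.grad.sign()).clamp(0, 1).detach()
```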
Real-World Threats of Adversarial Examples
You might think that adding noise or flipping pixels in an image is infeasible in practice, since an attacker would need access to the inside of a system to inject the perturbation, and at that point it would be easier to simply feed the system a different image. But these digital attacks, which showcase the fragility of neural networks, are not the only kind. Physical adversarial attacks exist for real-world object detection systems, such as those used in self-driving cars. By pasting specially crafted stickers or patches onto signs or objects, an attacker can fool an object detection system even under changes in lighting and viewing angle that would defeat a pixel-level noise attack in the real world.
Beyond computer vision systems, adversarial examples can also be generated against NLP (natural language processing) systems. For example, attacks can be constructed so that the adversarial sentence is still grammatically valid and carries the same meaning as the original, yet causes the model to misinterpret it. This could be used to slip spam past filters in social-engineering cyberattacks. Moreover, with specially crafted prompts to an LLM (large language model), one could make the model return content it isn’t supposed to return.
In some sense, these attacks on language systems could be even more dangerous than those made on computer vision systems; while adversarial stickers or patches can be detected and removed by humans, there is often no one other than the vulnerable AI system monitoring what it is outputting.
Defenses Against Adversarial Attacks
With adversarial attacks threatening to make machine learning models misbehave, how do we harden these models against adversarial perturbations? A natural approach is adversarial training: in addition to the “clean” samples taken from datasets, we also train the model on adversarial examples, so that it learns to return correct predictions even when the input contains adversarial noise. But there is often a trade-off between clean accuracy and adversarial accuracy, meaning that adversarial training tends to decrease the model’s accuracy on clean data. This simple approach remains one of the best defenses we have against adversarial samples, but adversarially trained models can still be vulnerable to adaptive attacks.
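A minimal sketch of one adversarial-training epoch, reusing the fgsm_attack function sketched earlier (in practice, PGD is the more common choice of attack during training, and many variants train only on the adversarial batch):

```python
import torch
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, epsilon=8 / 255):
    model.train()
    for x, y in loader:
        # Craft adversarial versions of the batch with the fgsm_attack sketch from earlier.
        adv_x = fgsm_attack(model, x, y, epsilon)
        optimizer.zero_grad()
        # Train on both the clean and the adversarial images.
        loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(adv_x), y)
        loss.backward()
        optimizer.step()
```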
Other defenses have been proposed, such as gradient masking or simply keeping the model weights private. These appear to work well against white-box attacks like FGSM and PGD at first, but they do not defend against black-box attacks: an attacker can run an off-the-shelf black-box attack against the model or, with enough effort, build a surrogate model and transfer white-box attacks from the surrogate to the original.
Defenses that flag adversarial examples, rejecting adversarial inputs outright, have also been proposed. These typically place a classifier on top of an off-the-shelf model, but such classifiers are not perfect and are themselves vulnerable to adversarial examples. In this context, even one successful attack can be enough to cause significant harm: an attacker can mount an adaptive attack, or simply try many adversarial perturbations, until the classifier is bypassed and the vulnerable main model generates unsafe content or takes harmful actions.
How to make models more robust is still a very active area of research. As new defenses are proposed, new attacks are found that break them. This raises the question: what is the theoretical reason that adversarial samples like these exist?
What Adversarial Attacks Tell Us About Machine Learning
What does this phenomenon tell us about how machine learning models work? One theory is that they are learning features that are not meaningful to humans but are genuinely predictive for the model. These models are, after all, simply learning statistical patterns in the data (such as the correlations between images and their classification labels), so these non-robust features, as the authors of “Adversarial Examples Are Not Bugs, They Are Features” call them, might just be a mathematical artifact, an emergent phenomenon playing out in the training process.
A more fundamental reason for the existence of adversarial samples could be the curse of dimensionality. Machine learning models can fit a curve easily when the data points have just one dimension: you can construct a model that fits a sine wave, for example, and it fits well with minimal training.
But most of the tasks we tackle with machine learning involve very high-dimensional data. Every pixel in an image adds three new dimensions (its red, green, and blue values), and a language model takes in even more, with each token contributing thousands of dimensions. The points representing the training data are so far apart that fitting the data in the general case is very hard. It just happens that most of the training data, and most of the data we encounter in the real world, lies on a small region of this high-dimensional space, often called the data manifold, which is what enables models to make accurate predictions.
And if you zoom out of that sine-wave fit, you will see that the model offers no predictive power outside the range of training data it was given. It may simply be that the manifold is so thin and unstable (these models sometimes make incorrect predictions even on clean samples) that it is easy to nudge an image or a piece of text just enough for the model to misclassify it.
[Figure: blue is sin(x); orange is the model’s predictions]
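A minimal sketch of this kind of experiment (not the exact setup behind the plot above): fit a small network to sin(x) on a narrow range, then evaluate it far outside that range.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# Fit a small network to sin(x) on [-2*pi, 2*pi].
x_train = torch.linspace(-2 * math.pi, 2 * math.pi, 200).unsqueeze(1)
y_train = torch.sin(x_train)

model = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                      nn.Linear(64, 64), nn.Tanh(),
                      nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(2000):
    optimizer.zero_grad()
    F.mse_loss(model(x_train), y_train).backward()
    optimizer.step()

# Far outside the training range, the fit has no predictive power at all.
x_far = torch.linspace(-8 * math.pi, 8 * math.pi, 400).unsqueeze(1)
with torch.no_grad():
    print(F.mse_loss(model(x_train), y_train).item())         # small: interpolation works
    print(F.mse_loss(model(x_far), torch.sin(x_far)).item())  # large: extrapolation fails
```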
What this suggests is that models perceive information in a different way than humans do. Even humans are vulnerable to adversarial samples, but the perturbations that fool us are nowhere near as subtle as changing a few pixels of a clean image. To build a truly robust model for any task, we might need one that excels across all aspects of daily life, from processing images to handling sound and text, and that builds common sense about the world we live in. But by that time, we might be more worried about the much larger stakes brought on by artificial general intelligence.
References
- Heaven, D. (2019, October 9). Why deep-learning AIs are so easy to fool. Retrieved December 31, 2024, from https://tallinzen.net/media/readings/fooling_deep_learning.pdf
- Zhang, H. et al. (2019, January 15). The Limitations of Adversarial Training and the Blind-Spot Attack. Retrieved December 31, 2024, from https://arxiv.org/pdf/1901.04684
- Ilyas, A. et al. (2019, August 12). Adversarial Examples Are Not Bugs, They Are Features. Retrieved December 31, 2024, from https://arxiv.org/pdf/1905.02175