Adversarial Training

Deep neural networks are known to be vulnerable to adversarial examples: small, often imperceptible perturbations to the input that cause the model to make incorrect predictions, frequently with high confidence. Adversarial training improves robustness by generating such perturbed examples during training and fitting the model on them.

The Goal: Robust Optimization

Standard training minimizes the expected loss:

$$\min_\theta \mathbb{E}_{(x,y) \sim \mathcal{D}} [L(f_\theta(x), y)]$$

Adversarial training instead solves a min-max problem: find parameters $\theta$ that minimize the loss under the worst-case perturbation $\delta$ whose norm is at most $\epsilon$:

$$\min_\theta \mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ \max_{\|\delta\| \le \epsilon} L(f_\theta(x + \delta), y) \right]$$
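For a linear model the inner maximization can be solved exactly, which makes the objective concrete. The sketch below assumes an $L_\infty$ constraint, a binary logistic loss with labels in $\{-1, +1\}$, and made-up values for `w`, `b`, `x`, and `epsilon`; none of these come from the text above. Because the loss only grows as the margin shrinks, the worst-case perturbation is $\delta = -\epsilon \, y \, \text{sign}(w)$.

```python
import jax
import jax.numpy as jnp

def logistic_loss(w, b, x, y):
    # y is +1 or -1; the loss decreases as the margin y * (w.x + b) grows
    margin = y * (jnp.dot(w, x) + b)
    return jnp.logaddexp(0.0, -margin)

def worst_case_loss(w, b, x, y, epsilon):
    # Exact inner max over ||delta||_inf <= epsilon for a linear model:
    # the margin is smallest at delta = -epsilon * y * sign(w)
    delta = -epsilon * y * jnp.sign(w)
    return logistic_loss(w, b, x + delta, y), delta

# Illustrative values (assumptions, not from the text)
w = jnp.array([1.5, -2.0, 0.5])
b = 0.1
x = jnp.array([0.2, 0.4, 0.9])
y = 1.0
epsilon = 0.1

clean = logistic_loss(w, b, x, y)
robust, delta = worst_case_loss(w, b, x, y, epsilon)

# Sanity check: random perturbations inside the same ball never beat the closed form
key = jax.random.PRNGKey(0)
random_deltas = jax.random.uniform(key, (1000, 3), minval=-epsilon, maxval=epsilon)
random_losses = jax.vmap(lambda d: logistic_loss(w, b, x + d, y))(random_deltas)
```

For deep networks no such closed form is available, so the inner maximum has to be approximated.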

Fast Gradient Sign Method (FGSM)

Finding the exact worst-case perturbation is generally intractable for deep networks. The Fast Gradient Sign Method (FGSM) approximates it with a single step of size $\epsilon$ in the direction of the sign of the gradient of the loss with respect to the input.

The adversarial perturbation $\delta$ is given by:

$$\delta = \epsilon \cdot \text{sign}(\nabla_x J(\theta, x, y))$$

where:

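$\epsilon$ is the magnitude of the perturbation, i.e., the maximum change applied to any single input element (e.g., 0.01).
$\nabla_x J(\theta, x, y)$ is the gradient of the loss with respect to the input image $x$. Here $J(\theta, x, y)$ denotes the same loss written as $L(f_\theta(x), y)$ above.
sign takes the element-wise sign (-1 or +1, and 0 where the gradient is exactly zero).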

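The adversarial example is then $x_{adv} = x + \delta$. Since every element of $\delta$ has magnitude at most $\epsilon$, the perturbation satisfies $\|\delta\|_\infty \le \epsilon$.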
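One way to see where the sign comes from, assuming the constraint is the $L_\infty$ norm and writing $g$ as shorthand for the input gradient (a symbol introduced here for brevity): linearize the loss around $x$ and maximize the linear term over the $\epsilon$-ball.

```latex
J(\theta, x + \delta, y) \approx J(\theta, x, y) + \delta^\top g,
\qquad g = \nabla_x J(\theta, x, y)

\max_{\|\delta\|_\infty \le \epsilon} \delta^\top g
  = \sum_i \epsilon \, |g_i|
  = \epsilon \, \|g\|_1,
\qquad \text{attained at } \delta = \epsilon \cdot \text{sign}(g)
```

Each coordinate of $\delta$ can be chosen independently in $[-\epsilon, \epsilon]$, so the sum is largest when every $\delta_i$ takes the value $\epsilon$ with the same sign as $g_i$. FGSM is therefore exact for the linearized loss and only an approximation for the true one.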

Implementation in JAX

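In JAX, `jax.grad` differentiates a function with respect to whichever argument we choose, so gradients with respect to inputs are as easy to compute as gradients with respect to parameters.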
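As a minimal illustration, the sketch below uses a toy squared-error loss (the `loss` function and all values are made up for this example) and selects the differentiated argument with `argnums`.

```python
import jax
import jax.numpy as jnp

def loss(params, x, y):
    # Toy linear regression loss; params is a dict with 'w' and 'b'
    pred = jnp.dot(x, params['w']) + params['b']
    return (pred - y) ** 2

params = {'w': jnp.array([1.0, -2.0]), 'b': jnp.array(0.5)}
x = jnp.array([3.0, 1.0])
y = 1.0

# argnums=0 (the default) differentiates with respect to the parameters
grad_params = jax.grad(loss, argnums=0)(params, x, y)

# argnums=1 differentiates with respect to the input instead
grad_x = jax.grad(loss, argnums=1)(params, x, y)
```

Here the prediction is 1.5, so the residual is 0.5 and `grad_x` equals `2 * 0.5 * w = [1.0, -2.0]`. An equivalent pattern is to close over the parameters and differentiate a function of `x` alone, which is what the implementation below does.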

import jax
import jax.numpy as jnp
import optax
from flax import linen as nn

def fgsm_attack(state, batch, epsilon=0.1):
    """
    Generate adversarial examples using FGSM.
    
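    Args:
        state: TrainState containing model parameters.
        batch: Dictionary with 'image' and 'label'.
        epsilon: Perturbation magnitude (maximum per-pixel change).

    Returns:
        Adversarial images with the same shape as batch['image'], clipped to [0, 1].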
    """
    
    def loss_fn(x):
        # We only need the loss value to compute gradients wrt x
        logits = state.apply_fn({'params': state.params}, x)
        loss = optax.softmax_cross_entropy_with_integer_labels(
            logits=logits,
            labels=batch['label']
        ).mean()
        return loss
    
    # Compute gradient of loss wrt input image
    grad_fn = jax.grad(loss_fn)
    grads = grad_fn(batch['image'])
    
    # Create perturbation: epsilon * sign(gradient)
    perturbation = epsilon * jnp.sign(grads)
    
    # Add perturbation to original image
    adversarial_images = batch['image'] + perturbation
    
    # Clip to valid value range (e.g., 0 to 1 for images)
    adversarial_images = jnp.clip(adversarial_images, 0.0, 1.0)
    
    return adversarial_images
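
The attack is only half of adversarial training; the other half is updating the parameters on the perturbed batch. The following is a self-contained sketch of one such training step, not a definitive recipe. To keep it runnable with JAX alone, it makes several assumptions: a hypothetical single-layer linear model (`predict`), a hand-written cross-entropy, plain SGD instead of an optax optimizer, and random data. The `fgsm` helper mirrors `fgsm_attack` above but takes the parameters directly.

```python
import jax
import jax.numpy as jnp

def predict(params, images):
    # Hypothetical stand-in model: one linear layer over flattened images
    flat = images.reshape((images.shape[0], -1))
    return flat @ params['w'] + params['b']

def cross_entropy(params, images, labels):
    log_probs = jax.nn.log_softmax(predict(params, images))
    picked = jnp.take_along_axis(log_probs, labels[:, None], axis=1)
    return -picked.mean()

def fgsm(params, images, labels, epsilon):
    # Same recipe as fgsm_attack: sign of the input gradient, then clip to [0, 1]
    grads = jax.grad(cross_entropy, argnums=1)(params, images, labels)
    return jnp.clip(images + epsilon * jnp.sign(grads), 0.0, 1.0)

@jax.jit
def adversarial_train_step(params, images, labels, epsilon=0.1, lr=0.1):
    # Inner step: craft adversarial images with the current parameters.
    # stop_gradient is not strictly required here, but it makes explicit
    # that the adversarial images are treated as fixed inputs.
    adv_images = jax.lax.stop_gradient(fgsm(params, images, labels, epsilon))
    # Outer step: one SGD update on the loss at the adversarial images
    loss, grads = jax.value_and_grad(cross_entropy)(params, adv_images, labels)
    new_params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return new_params, loss

# Random stand-in data: 32 images of shape 8x8x1 with 10 classes (assumptions)
key = jax.random.PRNGKey(0)
k_img, k_lbl, k_w = jax.random.split(key, 3)
images = jax.random.uniform(k_img, (32, 8, 8, 1))
labels = jax.random.randint(k_lbl, (32,), 0, 10)
params = {'w': 0.01 * jax.random.normal(k_w, (64, 10)), 'b': jnp.zeros(10)}

losses = []
for _ in range(5):
    params, loss = adversarial_train_step(params, images, labels)
    losses.append(float(loss))
```

The adversarial batch is regenerated at every step because the worst-case perturbation depends on the current parameters. A common variant, also an assumption here rather than something the text prescribes, mixes clean and adversarial losses in the outer step.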