
I got obsessed with those AI art videos where every frame looks like a painting, and somehow the video doesn’t flicker like a strobe light. As soon as I tried to build one myself, I found out why: keeping consecutive frames consistent with each other turns out to be much harder than just running style transfer on individual images. We built a video style transfer system at NTNU and ran into pretty much every problem in the literature.
Why it’s harder than it looks
If you apply style transfer frame by frame, each frame gets processed independently and the resulting video flickers. The actual challenge is to make consecutive frames look consistent with one another while still applying the artistic style.
The math
Style transfer separates content and style using features from a convolutional network. For any frame \(I_t\), you extract content features \(F_c\) and style features \(F_s\). The full video loss has three terms:
\[L_{total} = \alpha L_c + \beta L_s + \gamma L_{temporal}\]
where \(L_c\) is content loss, \(L_s\) is style loss, and \(L_{temporal}\) is what keeps things smooth between frames. The Greek letters are weights that balance how much you care about each piece.
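In code, the total objective is literally that weighted sum. Here is a minimal sketch, assuming the three terms have already been computed as scalar tensors (the weight values below are placeholders, not the settings we actually tuned):

```python
def total_loss(content_loss, style_loss, temporal_loss,
               alpha=1.0, beta=1e4, gamma=1e2):
    # Weighted sum of the three terms. The default weights here are
    # illustrative placeholders, not the values we trained with.
    return alpha * content_loss + beta * style_loss + gamma * temporal_loss
```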
Making it smooth over time
The temporal loss \(L_{temporal}\) is the term that prevents flickering. Given two consecutive stylized frames \(S(I_{t-1})\) and \(S(I_t)\), you compute:
\[L_{temporal} = \sum_{i,j} \left\| S(I_t)_{i,j} - S(I_{t-1})_{i,j} \right\|^2\]
You’re penalizing the algorithm whenever consecutive frames look too different. The version we ended up using is more nuanced, since you want changes that come from real motion in the scene to pass through, just not changes that come from the network being inconsistent. Optical flow tells you how pixels actually moved, and you compare against the warped previous frame instead of the raw previous frame:
```python
import torch
import torch.nn.functional as F

def temporal_loss(stylized_t, stylized_prev, flow, mask):
    # stylized_t, stylized_prev: (1, C, H, W) stylized frames at t and t-1.
    # flow: (1, 2, H, W) optical flow mapping each pixel in frame t back to
    #       its position in frame t-1 (precomputed by any flow estimator).
    # mask: (1, 1, H, W) occlusion mask, 1 where the flow is reliable.
    _, _, h, w = stylized_t.shape
    # Base pixel grid plus the flow gives, for each pixel in frame t,
    # the location to sample from in the previous frame.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)  # (1, 2, H, W)
    coords = base + flow
    # Normalize coordinates to [-1, 1] for grid_sample.
    x_norm = 2.0 * coords[:, 0] / (w - 1) - 1.0
    y_norm = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((x_norm, y_norm), dim=-1)  # (1, H, W, 2)
    warped_prev = F.grid_sample(stylized_prev, grid, align_corners=True)
    # Penalize differences from the warped previous frame, but only where
    # the point was actually visible in both frames.
    return (mask * (stylized_t - warped_prev) ** 2).mean()
```
That gives you a soft version of “the same point should look the same in both frames,” which is exactly what you want for stable video.
Approaches we tried
Frame-by-frame Gatys
We started with the simplest thing, applying Gatys-style optimization-based style transfer to each frame independently. Easy to implement, looks like a strobe when you string the frames together. We kept it as a baseline and a quick way to confirm any consistency improvements were actually helping.
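For reference, the heart of the Gatys baseline is matching Gram matrices of VGG features layer by layer. A rough sketch of that per-frame style loss follows; the function names and the layer bookkeeping are ours for illustration:

```python
import torch

def gram_matrix(features):
    # Gram matrix of a (1, C, H, W) feature map: channel-to-channel
    # correlations, normalized by the number of entries.
    _, c, h, w = features.shape
    f = features.view(c, h * w)
    return f @ f.t() / (c * h * w)

def gatys_style_loss(stylized_features, style_features):
    # Sum of squared Gram-matrix differences across the chosen VGG layers.
    # Both arguments are lists of feature maps from the same layers.
    return sum((gram_matrix(a) - gram_matrix(b)).pow(2).sum()
               for a, b in zip(stylized_features, style_features))
```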
Optical flow constraints (Ruder)
Ruder’s approach uses optical flow to track pixel motion between frames, then penalizes the network when the stylized output doesn’t follow the same motion. Much better consistency, much more complex code, much slower because of the flow estimation step.
Feedforward networks with instance normalization
Optimization-based methods are slow because they re-optimize for every frame. Johnson et al. showed that you can train a feedforward network that produces stylized output in a single pass, and instance normalization is a key ingredient that makes that work well:
```python
import torch
import torch.nn as nn

class InstanceNorm(nn.Module):
    # Normalize each sample's feature maps over the spatial dimensions,
    # per channel, then apply a learnable scale and shift.
    def __init__(self, num_channels, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x):
        # x: (batch, channels, height, width)
        mean = x.mean(dim=(2, 3), keepdim=True)
        var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return x_hat * self.weight[None, :, None, None] + self.bias[None, :, None, None]
```
The intuition is that instance normalization removes per-sample contrast and brightness, which makes style transfer much more stable. Once that’s in place you can train a single network that handles arbitrary content frames in real time.
Wasserstein distance for style matching
To measure how well the style transferred, we used the Wasserstein distance between feature distributions. The notation looks intimidating, but it’s really just measuring how much “work” you’d need to do to transform one distribution into the other:
\[W(P, Q) = \inf_{\gamma \in \Pi(P, Q)} \mathbb{E}_{(x, y) \sim \gamma} [\| x - y \|]\]
In practice we computed it on Gram matrices of feature activations. The transport-based perspective is sometimes more stable than the older Frobenius-norm-on-Gram-matrices approach, especially when the style image and content image have very different feature statistics.
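If you want to play with this yourself, a cheap stand-in is the sliced Wasserstein distance, which averages one-dimensional Wasserstein distances along random projections. The sketch below is that approximation rather than our exact computation, and it assumes both feature sets have the same number of rows:

```python
import torch

def sliced_wasserstein(feats_a, feats_b, num_projections=64):
    # feats_a, feats_b: (N, D) feature activations; equal N assumed here.
    d = feats_a.shape[1]
    # Random unit directions to project the features onto.
    directions = torch.randn(d, num_projections)
    directions = directions / directions.norm(dim=0, keepdim=True)
    proj_a, _ = torch.sort(feats_a @ directions, dim=0)
    proj_b, _ = torch.sort(feats_b @ directions, dim=0)
    # In 1-D, the Wasserstein distance between equal-sized samples is the
    # mean absolute difference of their sorted values.
    return (proj_a - proj_b).abs().mean()
```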
What worked best
The combination that gave us the best balance between speed and quality was Johnson’s feedforward network with instance normalization, plus Ruder’s temporal constraints used during training. The feedforward part keeps inference fast. The temporal constraints during training teach the network to be consistent with itself over time, so we don’t have to do anything special at inference.
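Put together, a training step stylizes two consecutive frames with the feedforward network and backpropagates through all three loss terms. This is a hedged sketch of that step: `temporal_loss` refers to the earlier sketch, while `content_loss` and `style_loss` are placeholder callables standing in for the usual VGG-feature content term and the style term, and the weights are illustrative:

```python
def train_step(transform_net, frame_prev, frame_t, flow, mask, optimizer,
               content_loss, style_loss, alpha=1.0, beta=1e4, gamma=1e2):
    # content_loss and style_loss are hypothetical helpers here, standing in
    # for the VGG-based content and style terms described above.
    stylized_prev = transform_net(frame_prev)
    stylized_t = transform_net(frame_t)
    loss = (alpha * content_loss(stylized_t, frame_t)
            + beta * style_loss(stylized_t)
            + gamma * temporal_loss(stylized_t, stylized_prev.detach(), flow, mask))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```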
The trade-offs become a lot clearer once you’ve actually tried each piece. Frame-by-frame methods are simple but flicker. Optical flow methods are smooth but slow. Feedforward methods are fast but rigid about the style they can apply. Each approach has a regime where it’s the right answer.
Our implementation is on GitHub with the different approaches in separate modules if you want to experiment with them. The thing that surprised me most about the project was how much of the difficulty lives in temporal consistency rather than in the style transfer itself. Once you start thinking of the video as a single object with motion in it, the math you need to write down changes shape entirely.