Video Style Transfer: Coherence vs Quality

Shrek styled with La Muse

I got obsessed with those AI art videos where every frame looks like a painting but the video doesn’t flicker like crazy. Turns out keeping things smooth between frames is way harder than just applying style transfer to individual images. Here’s what we learned building our own video style transfer system at NTNU.

The Core Problem

Video style transfer isn’t just “apply style transfer to each frame and call it done.” If you do that, you get horrible flickering because each frame gets processed independently. The real challenge is making frames look consistent with each other while still maintaining the artistic style.

How the Math Works

The basic idea is separating content and style using convolutional neural networks. For any frame $I_t$, you extract content features $F_c$ and style features $F_s$. The loss function combines three things:

$$L_{total} = \alpha L_c + \beta L_s + \gamma L_{temporal}$$

where $L_c$ is content loss, $L_s$ is style loss, and $L_{temporal}$ keeps things smooth between frames. The Greek letters are just weights to balance how much you care about each part.
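
To make that weighting concrete, here's a minimal sketch of how the three terms could be combined in PyTorch, assuming you already have feature maps from something like VGG for the stylized frame, the content frame, and the style image. The single-layer losses, helper names, and default weights are illustrative placeholders rather than our exact code; real implementations usually sum the content and style terms over several layers.

import torch
import torch.nn.functional as F
from torch import Tensor

def gram_matrix(features: Tensor) -> Tensor:
    # Channel-wise feature correlations, the usual stand-in for "style"
    b, c, h, w = features.shape
    flat = features.view(b, c, h * w)
    return flat @ flat.transpose(1, 2) / (c * h * w)

def combined_loss(
    stylized_feats: Tensor,  # features of the stylized frame
    content_feats: Tensor,   # features of the original frame (same shape)
    style_feats: Tensor,     # features of the style image
    temporal_term: Tensor,   # e.g. the output of temporal_loss() shown below
    alpha: float = 1.0,      # placeholder weights, tuned per style in practice
    beta: float = 1e5,
    gamma: float = 1e2,
) -> Tensor:
    content_loss = F.mse_loss(stylized_feats, content_feats)
    style_loss = F.mse_loss(gram_matrix(stylized_feats), gram_matrix(style_feats))
    return alpha * content_loss + beta * style_loss + gamma * temporal_term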

Keeping Things Smooth Over Time

The temporal loss $L_{temporal}$ is what prevents the flickering. Given two consecutive frames $I_{t-1}$ and $I_t$, you compute:

$$L_{temporal}=\sum_{i,j}\left\|S(I_t)_{i,j} - W\!\left(S(I_{t-1})\right)_{i,j}\right\|$$

where $S(I)$ is the stylized output and $W(\cdot)$ warps the previous stylized frame along the optical flow between $I_{t-1}$ and $I_t$. Basically, you penalize the algorithm when corresponding pixels in consecutive frames look too different.

Here’s how we implemented the temporal loss:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor

def temporal_loss(
    prev_frame: Tensor,
    curr_frame: Tensor,
    flow: Tensor,
    reduction: str = 'mean'
) -> Tensor:
    """
    Compute temporal consistency loss between consecutive frames.

    Args:
        prev_frame: Previous stylized frame tensor of shape (B, C, H, W)
        curr_frame: Current stylized frame tensor of shape (B, C, H, W)
        flow: Optical flow tensor of shape (B, 2, H, W)
        reduction: Reduction method for loss computation ('mean' or 'sum')

    Returns:
        Scalar tensor containing temporal consistency loss

    Raises:
        ValueError: If reduction method is not 'mean' or 'sum'
    """
    # Warp the previous stylized frame along the flow, then penalize
    # per-pixel differences against the current stylized frame
    warped_prev = warp_frame(prev_frame, flow)
    diff = torch.abs(curr_frame - warped_prev)

    if reduction == 'mean':
        return torch.mean(diff)
    elif reduction == 'sum':
        return torch.sum(diff)
    else:
        raise ValueError(f"Unsupported reduction method: {reduction}")

def warp_frame(
    frame: Tensor,
    flow: Tensor
) -> Tensor:
    """
    Warp frame according to optical flow using grid sampling.

    Args:
        frame: Input frame tensor of shape (B, C, H, W)
        flow: Optical flow tensor of shape (B, 2, H, W)

    Returns:
        Warped frame tensor of same shape as input frame
    """
    B, C, H, W = frame.shape

    # Create sampling grid
    grid_y, grid_x = torch.meshgrid(
        torch.arange(H, device=frame.device),
        torch.arange(W, device=frame.device),
        indexing='ij'
    )

    # Apply flow to the grid and normalize coordinates to [-1, 1],
    # which is what grid_sample expects; result has shape (B, H, W, 2)
    flow_grid = torch.stack([
        2 * (grid_x + flow[:, 0]) / (W - 1) - 1,
        2 * (grid_y + flow[:, 1]) / (H - 1) - 1
    ], dim=-1)

    # Perform grid sampling; flow_grid is already in (B, H, W, 2) layout
    return F.grid_sample(
        frame,
        flow_grid,
        mode='bilinear',
        padding_mode='border',
        align_corners=True
    )
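
A quick shape check with dummy tensors (the random flow here is just a stand-in for a real optical flow estimate):

prev = torch.rand(1, 3, 256, 256)   # previous stylized frame
curr = torch.rand(1, 3, 256, 256)   # current stylized frame
flow = torch.randn(1, 2, 256, 256)  # placeholder flow field, in pixels
loss = temporal_loss(prev, curr, flow)
print(loss.shape)  # torch.Size([]) -- a single scalar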

What We Tried

The Naive Approach (Gatys Method)

We started by just applying style transfer frame by frame. Simple to implement, but the results flicker like crazy. Not usable for actual videos.

Adding Temporal Constraints (Ruder’s Method)

This uses optical flow to track how pixels move between frames, then penalizes the algorithm when the stylized output doesn’t follow the same motion. Way better results, but much more complex.
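
One detail worth spelling out: optical flow is only trustworthy where nothing gets occluded or disoccluded between frames, so the temporal penalty should be masked out in unreliable regions. The sketch below builds such a mask with a forward-backward consistency check, reusing warp_frame from earlier; the threshold is arbitrary, and Ruder et al.'s actual occlusion criterion is somewhat more involved than this.

def flow_consistency_mask(
    forward_flow: Tensor,   # flow from frame t-1 to t, shape (B, 2, H, W)
    backward_flow: Tensor,  # flow from frame t to t-1, shape (B, 2, H, W)
    threshold: float = 1.0, # pixels; arbitrary cut-off for this sketch
) -> Tensor:
    # Sample the backward flow at the positions the forward flow points to
    warped_backward = warp_frame(backward_flow, forward_flow)
    # Where the two flows agree, forward + warped-backward should roughly cancel
    error = torch.norm(forward_flow + warped_backward, dim=1, keepdim=True)
    # 1 = reliable pixel, 0 = ignore in the temporal loss
    return (error < threshold).float()

The mask would then multiply the per-pixel difference inside the temporal loss before the reduction, so occluded regions don't drag the penalty around.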

Making It Fast with Instance Normalization

The original optimization-based methods are painfully slow because they run an iterative optimization for every single frame. Instance normalization lets you train a feedforward network that can process frames in real time:

class InstanceNorm(nn.Module):
    """
    Instance Normalization layer for style transfer.

    This layer normalizes feature maps independently across spatial dimensions
    and applies learnable affine transformation parameters.

    Attributes:
        scale: Learnable scaling parameter
        shift: Learnable shifting parameter
        eps: Small constant for numerical stability
    """
    def __init__(self, dim: int, eps: float = 1e-8) -> None:
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))
        self.shift = nn.Parameter(torch.zeros(dim))
        self.eps = eps

    def forward(self, x: Tensor) -> Tensor:
        """
        Apply instance normalization to input tensor.

        Args:
            x: Input tensor of shape (batch_size, channels, height, width)

        Returns:
            Normalized and transformed output tensor of same shape as input
        """
        mean = x.mean(dim=(2, 3), keepdim=True)
        std = x.std(dim=(2, 3), keepdim=True) + self.eps
        return self.scale[None, :, None, None] * (x - mean) / std + self.shift[None, :, None, None]
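
To show where this layer actually sits, here's a rough sketch of the kind of conv block a Johnson-style feedforward transform network stacks together; the layer sizes and the example network at the bottom are illustrative, not our trained architecture.

class ConvInBlock(nn.Module):
    """Conv -> InstanceNorm -> ReLU, the basic unit of a feedforward stylizer."""
    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3, stride: int = 1) -> None:
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel, stride, padding=kernel // 2)
        self.norm = InstanceNorm(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: Tensor) -> Tensor:
        return self.act(self.norm(self.conv(x)))

# e.g. the downsampling front end of a transform network might look like
# nn.Sequential(ConvInBlock(3, 32, kernel=9), ConvInBlock(32, 64, stride=2), ...)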

Using Wasserstein Distance for Style Matching

For measuring how well the style transferred, we used the Wasserstein distance between feature distributions. The math looks scary but it’s just a way to measure how different two distributions are:

$$W(P,Q) = \inf_{\gamma\in\Pi(P,Q)}\mathbb{E}_{(x,y)\sim\gamma}\left[\|x-y\|\right]$$

where $\Pi(P,Q)$ is the set of all joint distributions with marginals $P$ and $Q$, i.e. all the possible ways to match points between the two distributions.
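
That infimum isn't something you can compute directly for high-dimensional feature distributions, so in practice it gets approximated. One cheap and common proxy is the sliced Wasserstein distance, which projects the features onto random directions and compares the sorted one-dimensional projections; the sketch below is that approximation (with an arbitrary number of projections), not the exact formula above.

def sliced_wasserstein(
    feats_a: Tensor,            # (N, C): N feature vectors of dimension C
    feats_b: Tensor,            # (N, C): must contain the same number of vectors
    num_projections: int = 64,  # arbitrary; more projections = better estimate
) -> Tensor:
    c = feats_a.shape[1]
    # Random unit directions to project onto
    directions = torch.randn(c, num_projections, device=feats_a.device)
    directions = directions / directions.norm(dim=0, keepdim=True)
    # Project both feature sets and sort along each direction
    proj_a, _ = torch.sort(feats_a @ directions, dim=0)
    proj_b, _ = torch.sort(feats_b @ directions, dim=0)
    # In 1-D, the Wasserstein distance reduces to comparing sorted samples
    return torch.abs(proj_a - proj_b).mean()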

What Actually Worked

The best approach ended up being Johnson’s feedforward network with instance normalization, combined with Ruder’s temporal constraints. This gave us the right balance between speed and quality.

The key insight was that you can’t just focus on making individual frames look good. You have to think about the video as a whole and make sure the motion stays consistent.
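
Tying the pieces together, one training step for that combination looks roughly like the sketch below. The stylizer is any feedforward transform network (for example, one built from blocks like ConvInBlock above), perceptual_losses is a hypothetical stand-in for the content and style terms from the earlier loss sketch, and the flow is assumed to be precomputed for each frame pair; none of these names are our exact code.

def train_step(
    stylizer: nn.Module,               # feedforward transform network
    prev_frame: Tensor,                # raw frame t-1, shape (B, 3, H, W)
    curr_frame: Tensor,                # raw frame t, shape (B, 3, H, W)
    flow: Tensor,                      # optical flow from t-1 to t, (B, 2, H, W)
    optimizer: torch.optim.Optimizer,
    gamma: float = 1e2,                # placeholder temporal weight
) -> float:
    styled_prev = stylizer(prev_frame)
    styled_curr = stylizer(curr_frame)

    # Hypothetical helper: content + style terms as in the earlier loss sketch
    content_loss, style_loss = perceptual_losses(styled_curr, curr_frame)
    # Temporal term ties the two stylized frames together along the flow
    loss = content_loss + style_loss + gamma * temporal_loss(styled_prev, styled_curr, flow)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()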

What We Learned

Building this taught us that video style transfer is fundamentally different from image style transfer. The temporal consistency problem is really hard, and there’s always a tradeoff between quality and speed.

The optical flow approach works well but adds a lot of complexity. Instance normalization was a game changer for speed, but you lose some of the flexibility of the optimization-based methods.

If you want to check out our implementation, it’s all on GitHub. We organized the different approaches in separate modules so you can experiment with each one.

The project was a good reminder that making something work in theory is very different from making it work well in practice. Video style transfer looks simple until you actually try to build it.