
I got obsessed with those AI art videos where every frame looks like a painting but the video doesn’t flicker like crazy. Turns out keeping things smooth between frames is way harder than just applying style transfer to individual images. We built our own video style transfer system at NTNU and ran into all sorts of problems.
Why it’s harder than it looks
If you just apply style transfer to each frame separately, you get horrible flickering: the optimization has no memory of the previous frame, so small changes in the input (noise, motion, lighting) land on very different stylized outputs from one frame to the next. The real challenge is making frames look consistent with each other while still maintaining the artistic style.
The math behind it
The basic idea is separating content and style using convolutional neural networks. For any frame $I_t$, you extract content features $F_c$ and style features $F_s$. The loss function has three parts:
$$L_{total} = \alpha L_c + \beta L_s + \gamma L_{temporal}$$
where $L_c$ is content loss, $L_s$ is style loss, and $L_{temporal}$ keeps things smooth between frames. The Greek letters are just weights to balance how much you care about each part.
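In code the combination is just a weighted sum. Here's a minimal sketch, assuming the three terms are already computed as PyTorch scalars (the weight values below are placeholders, not the ones we actually tuned):

```python
import torch

# Placeholder weights -- the right balance depends heavily on the style image
# and has to be tuned by hand.
ALPHA, BETA, GAMMA = 1.0, 1e3, 1e2

def total_loss(l_content: torch.Tensor,
               l_style: torch.Tensor,
               l_temporal: torch.Tensor) -> torch.Tensor:
    """Weighted combination of the three terms in the equation above."""
    return ALPHA * l_content + BETA * l_style + GAMMA * l_temporal
```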
Making it smooth over time
The temporal loss $L_{temporal}$ is what prevents the flickering. Given two consecutive frames $I_{t-1}$ and $I_t$, you compute:
$$L_{temporal}=\sum_{i,j}\left\|S(I_t)_{i,j}-S(I_{t-1})_{i,j}\right\|$$
where $S(I)$ is the stylized output and the sum runs over pixel positions $(i,j)$. Basically, you penalize the algorithm when consecutive frames look too different.
Here’s how we implemented the temporal loss:
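The snippet below is a minimal PyTorch sketch of the idea rather than our exact code (the function name `temporal_loss` is illustrative):

```python
import torch

def temporal_loss(stylized_t: torch.Tensor,
                  stylized_prev: torch.Tensor) -> torch.Tensor:
    """Naive temporal loss: per-pixel difference between consecutive stylized
    frames of shape (N, C, H, W). We use the mean squared difference, which is
    the formula above up to the choice of norm and normalization. No motion
    compensation yet -- that's what the optical-flow variant adds later."""
    return (stylized_t - stylized_prev).pow(2).mean()
```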
Different approaches we tried
Frame-by-frame (Gatys' Method)
We started by just applying style transfer frame by frame. Simple to implement, but the results flicker like crazy. Not usable for actual videos.
Optical flow constraints (Ruder’s Method)
This uses optical flow to track how pixels move between frames, then penalizes the algorithm when the stylized output doesn’t follow the same motion. Way better results, but much more complex.
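Roughly, you warp the previous stylized frame into the current one with the flow and penalize differences wherever the flow is reliable. Here's a sketch of that idea; the helpers `warp` and `flow_temporal_loss` are illustrative, and the flow field plus occlusion mask come from a separate optical-flow estimator that isn't shown:

```python
import torch
import torch.nn.functional as F

def warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp `frame` (N, C, H, W) with a dense flow field (N, 2, H, W) given in pixels."""
    _, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device, dtype=frame.dtype),
        torch.arange(w, device=frame.device, dtype=frame.dtype),
        indexing="ij",
    )
    # Shift the base sampling grid by the flow, then normalize to [-1, 1]
    # as grid_sample expects.
    grid_x = xs.unsqueeze(0) + flow[:, 0]
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(frame, grid, align_corners=True)

def flow_temporal_loss(stylized_t: torch.Tensor,
                       stylized_prev: torch.Tensor,
                       backward_flow: torch.Tensor,
                       valid_mask: torch.Tensor) -> torch.Tensor:
    """Penalize the current stylized frame where it deviates from the
    flow-warped previous one. `valid_mask` is 1 where the flow is reliable
    and 0 at occlusions and motion boundaries."""
    warped_prev = warp(stylized_prev, backward_flow)
    return (valid_mask * (stylized_t - warped_prev) ** 2).mean()
```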
Speeding things up with Instance Normalization
The original methods are painfully slow because they optimize for each frame. Instance normalization lets you train a feedforward network that can process frames in real time:
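Below is a minimal sketch of the `InstanceNorm` module, simplified relative to the version in the repo (PyTorch's built-in `nn.InstanceNorm2d` does the same job if you don't want to write it yourself):

```python
import torch
import torch.nn as nn

class InstanceNorm(nn.Module):
    """Normalize each channel of each sample over its spatial dimensions,
    then apply a learned per-channel scale and shift."""

    def __init__(self, num_channels: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W) -- statistics are computed per sample and per
        # channel, so the normalization doesn't depend on the batch contents.
        mean = x.mean(dim=(2, 3), keepdim=True)
        var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return x_hat * self.weight.view(1, -1, 1, 1) + self.bias.view(1, -1, 1, 1)
```

We dropped this into a Johnson-style feedforward transformer network in place of batch normalization.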
Using Wasserstein Distance for Style Matching
For measuring how well the style transferred, we used the Wasserstein distance between feature distributions. The math looks scary but it’s just a way to measure how different two distributions are:
$$W(P,Q) = \inf_{\gamma\in\Pi(P,Q)}\mathbb{E}_{(x,y)\sim\gamma}\left[\|x-y\|\right]$$
where $\Pi(P,Q)$ represents all possible ways to match points between the two distributions.
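Computing the full multivariate Wasserstein distance between deep feature distributions is expensive, but in 1-D it has a simple closed form for sorted samples. Here's a sketch of a cheap per-channel approximation in that spirit (one possible way to do it, not necessarily what's in our repo; the function name is illustrative):

```python
import torch

def channelwise_w1(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Approximate style distance: treat each feature channel as a 1-D
    distribution over spatial positions and average the 1-D Wasserstein-1
    distances. Assumes both feature maps are (C, H, W) with the same shape;
    otherwise you'd subsample to equal counts first."""
    c = feat_a.shape[0]
    a_sorted, _ = torch.sort(feat_a.reshape(c, -1), dim=1)
    b_sorted, _ = torch.sort(feat_b.reshape(c, -1), dim=1)
    # For two equal-sized 1-D samples, W1 is the mean absolute difference
    # of the sorted values.
    return (a_sorted - b_sorted).abs().mean()
```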
What worked best
Johnson’s feedforward network with instance normalization, combined with Ruder’s temporal constraints, gave us the right balance between speed and quality. You can’t just focus on making individual frames look good — you have to think about the video as a whole and make sure the motion stays consistent.
Video style transfer is fundamentally different from image style transfer. The temporal consistency problem is really hard, and there’s always a tradeoff between quality and speed. The optical flow approach works well but adds a lot of complexity. Instance normalization was a game changer for speed, but you lose some of the flexibility of the optimization based methods.
Our implementation is on GitHub with different approaches in separate modules if you want to experiment.
Making something work in theory is very different from making it work well in practice. Video style transfer looks simple until you actually try to build it.