
I got obsessed with those AI art videos where every frame looks like a painting but the video doesn’t flicker like crazy. Turns out keeping things smooth between frames is way harder than just applying style transfer to individual images. Here’s what we learned building our own video style transfer system at NTNU.
The Core Problem
Video style transfer isn’t just “apply style transfer to each frame and call it done.” If you do that, you get horrible flickering because each frame gets processed independently. The real challenge is making frames look consistent with each other while still maintaining the artistic style.
How the Math Works
The basic idea is separating content and style using convolutional neural networks. For any frame $I_t$, you extract content features $F_c$ and style features $F_s$. The loss function combines three things:
$$L_{total} = \alpha L_c + \beta L_s + \gamma L_{temporal}$$
where $L_c$ is content loss, $L_s$ is style loss, and $L_{temporal}$ keeps things smooth between frames. The Greek letters are just weights to balance how much you care about each part.
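In PyTorch terms, the content and style pieces look roughly like this. This is a sketch rather than our exact code: the feature maps are assumed to come from a pretrained CNN such as VGG, and the Gram-matrix style loss follows the usual Gatys formulation.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feats: torch.Tensor) -> torch.Tensor:
    # feats: (batch, channels, height, width) feature maps from a pretrained CNN
    b, c, h, w = feats.shape
    flat = feats.view(b, c, h * w)
    # Channel-to-channel correlations, normalized by feature map size
    return flat @ flat.transpose(1, 2) / (c * h * w)

def content_loss(gen_feats: torch.Tensor, content_feats: torch.Tensor) -> torch.Tensor:
    # L_c: the stylized frame should keep the content of the input frame
    return F.mse_loss(gen_feats, content_feats)

def style_loss(gen_feats: torch.Tensor, style_feats: torch.Tensor) -> torch.Tensor:
    # L_s: the stylized frame should match the feature correlations of the style image
    return F.mse_loss(gram_matrix(gen_feats), gram_matrix(style_feats))
```

The weighted sum of these two terms, plus the temporal term described next, gives $L_{total}$.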
Keeping Things Smooth Over Time
The temporal loss $L_{temporal}$ is what prevents the flickering. Given two consecutive frames $I_{t-1}$ and $I_t$, you compute:
$$L_{temporal} = \sum_{i,j} \big\| S(I_t)_{i,j} - S(I_{t-1})_{i,j} \big\|$$
where $S(I)$ is the stylized output and the sum runs over pixel positions $(i, j)$. Basically, you penalize the algorithm when consecutive frames look too different.
Here’s how we implemented the temporal loss:
```python
from typing import Tuple
```
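Picking up from that import, here's a minimal sketch of the frame-difference version of the loss. The mean absolute difference is one reasonable reduction of the sum in the formula, not the only choice:

```python
import torch

def temporal_loss(stylized: Tuple[torch.Tensor, torch.Tensor]) -> torch.Tensor:
    # stylized = (S(I_{t-1}), S(I_t)), each of shape (batch, channels, height, width)
    prev_frame, curr_frame = stylized
    # Penalize per-pixel change between consecutive stylized frames
    return torch.mean(torch.abs(curr_frame - prev_frame))
```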
What We Tried
The Naive Approach (Gatys Method)
We started by just applying style transfer frame by frame. It's simple to implement, but because nothing ties consecutive frames together, the output flickers badly. Not usable for actual videos.
Adding Temporal Constraints (Ruder’s Method)
This uses optical flow to track how pixels move between frames, then penalizes the algorithm when the stylized output doesn’t follow the same motion. Way better results, but much more complex.
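In rough terms: warp the previous stylized frame along the estimated flow, then penalize disagreement with the current stylized frame only where the flow is trustworthy. The following is a sketch of that idea rather than Ruder's exact formulation; `flow` and `occlusion_mask` are assumed to come from an external optical flow estimator, with flow channel 0 holding the horizontal displacement.

```python
import torch
import torch.nn.functional as F

def warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    # Sample `frame` (B, C, H, W) at positions displaced by `flow` (B, 2, H, W)
    # using bilinear interpolation.
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device, dtype=frame.dtype),
        torch.arange(w, device=frame.device, dtype=frame.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]  # displaced x coordinates
    grid_y = ys.unsqueeze(0) + flow[:, 1]  # displaced y coordinates
    # Normalize coordinates to [-1, 1] as expected by grid_sample
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(frame, grid, align_corners=True)

def flow_temporal_loss(stylized_prev, stylized_curr, flow, occlusion_mask):
    # Only penalize deviation from the warped previous frame where the flow is valid.
    warped_prev = warp(stylized_prev, flow)
    return torch.mean(occlusion_mask * (stylized_curr - warped_prev) ** 2)
```

The mask matters: at occlusions and frame borders the flow is meaningless, so those pixels shouldn't be penalized.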
Making It Fast with Instance Normalization
The optimization-based methods are painfully slow because they run a fresh optimization for every frame. Swapping in a feedforward network with instance normalization lets you stylize frames in real time:
```python
class InstanceNorm(nn.Module):
```
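Filled in, a minimal version of that layer might look like the following (in practice you'd just use PyTorch's built-in `nn.InstanceNorm2d`; writing it out shows what it does: normalize every channel of every sample independently, then rescale):

```python
import torch
import torch.nn as nn

class InstanceNorm(nn.Module):
    """Normalize each (sample, channel) feature map to zero mean and unit variance,
    then apply a learned per-channel scale and shift."""

    def __init__(self, num_channels: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        mean = x.mean(dim=(2, 3), keepdim=True)
        var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return x_hat * self.weight.view(1, -1, 1, 1) + self.bias.view(1, -1, 1, 1)
```

The statistics come from a single image rather than a whole batch, so each frame's contrast gets normalized on its own, which is what Ulyanov et al. observed matters for stylization quality.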
Using Wasserstein Distance for Style Matching
For measuring how well the style transferred, we used the Wasserstein distance between feature distributions. The math looks scary but it’s just a way to measure how different two distributions are:
$$W(P,Q) = \inf_{\gamma\in\Pi(P,Q)}\mathbb{E}_{(x,y)\sim\gamma}\big[\|x-y\|\big]$$
where $\Pi(P,Q)$ represents all possible ways to match points between the two distributions.
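Computing that infimum directly is intractable for high-dimensional feature distributions, so in practice you approximate it. One common trick is the sliced Wasserstein distance: project the features onto random directions, where the 1D Wasserstein distance reduces to comparing sorted samples. Here's a sketch of that approximation (not necessarily the exact estimator we used):

```python
import torch

def sliced_wasserstein(feats_a: torch.Tensor, feats_b: torch.Tensor,
                       num_projections: int = 64) -> torch.Tensor:
    # feats_a, feats_b: (num_samples, feature_dim) sets of feature vectors.
    # Assumes both sets contain the same number of samples.
    dim = feats_a.shape[1]
    # Random unit directions to project onto
    directions = torch.randn(dim, num_projections, device=feats_a.device)
    directions = directions / directions.norm(dim=0, keepdim=True)
    proj_a = feats_a @ directions  # (num_samples, num_projections)
    proj_b = feats_b @ directions
    # In 1D, the Wasserstein-1 distance is the mean gap between sorted samples
    sorted_a, _ = torch.sort(proj_a, dim=0)
    sorted_b, _ = torch.sort(proj_b, dim=0)
    return torch.mean(torch.abs(sorted_a - sorted_b))
```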
What Actually Worked
The best approach ended up being Johnson’s feedforward network with instance normalization, combined with Ruder’s temporal constraints. This gave us the right balance between speed and quality.
The key insight was that you can’t just focus on making individual frames look good. You have to think about the video as a whole and make sure the motion stays consistent.
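To make that concrete, here's a rough sketch of one training step on a pair of consecutive frames once the pieces above are combined. It reuses the loss helpers sketched earlier, and `transform_net`, `vgg`, and the loss weights are illustrative placeholders rather than our actual configuration.

```python
def train_step(transform_net, vgg, optimizer, frame_prev, frame_curr,
               style_feats, flow, mask, alpha=1.0, beta=10.0, gamma=100.0):
    # transform_net: Johnson-style feedforward network with instance norm layers
    # vgg: frozen feature extractor; style_feats: features of the style image
    stylized_prev = transform_net(frame_prev)
    stylized_curr = transform_net(frame_curr)

    feats_curr = vgg(stylized_curr)
    loss = (alpha * content_loss(feats_curr, vgg(frame_curr))
            + beta * style_loss(feats_curr, style_feats)
            + gamma * flow_temporal_loss(stylized_prev, stylized_curr, flow, mask))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```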
What We Learned
Building this taught us that video style transfer is fundamentally different from image style transfer. The temporal consistency problem is really hard, and there’s always a tradeoff between quality and speed.
The optical flow approach works well but adds a lot of complexity. The feedforward network with instance normalization was a game changer for speed, but you lose some of the flexibility of the optimization-based methods.
If you want to check out our implementation, it’s all on GitHub. We organized the different approaches in separate modules so you can experiment with each one.
The project was a good reminder that making something work in theory is very different from making it work well in practice. Video style transfer looks simple until you actually try to build it.