
I built a face recognition system during COVID lockdown and learned way more about ResNet than I expected. Your phone unlocking when it sees your face seems like magic, but it’s actually just some clever math and neural networks doing their thing. Here’s what I figured out while building my own version (check out the code).
Why Face Recognition?
Face recognition is everywhere now. Your phone, security systems, even some coffee shops use it. The cool thing about modern systems is they handle all the messy real world stuff like different lighting, weird angles, and even when part of your face is covered. That’s where ResNet comes in handy.
How Convolutions Work
Let’s start with the basics. Your input image is just a bunch of numbers arranged in a 3D grid: height × width × channels (like RGB). A convolution is basically sliding a small filter across this image and doing some math:
$$(\mathbf{X} * \mathbf{K})(i,j) = \sum_{p=0}^{k_h-1} \sum_{q=0}^{k_w-1} \sum_{c=0}^{C-1} \mathbf{X}(i+p,j+q,c) \cdot \mathbf{K}(p,q,c)$$
This looks scary, but it’s just saying: take your filter, multiply it with a patch of the image, add everything up, and that’s your output for that spot. Early layers learn simple stuff like edges, and deeper layers combine these into more complex features like eyes and noses.
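If the notation doesn’t click, here’s that same triple sum as a minimal NumPy sketch (the image and filter sizes are made up for illustration):

```python
import numpy as np

def conv2d_single(X, K):
    """Slide one filter K over image X, exactly the triple sum above.

    X: (H, W, C) image, K: (kh, kw, C) filter -> (H-kh+1, W-kw+1) map.
    """
    H, W, C = X.shape
    kh, kw, _ = K.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # multiply the filter with one image patch and add everything up
            out[i, j] = np.sum(X[i:i + kh, j:j + kw, :] * K)
    return out

X = np.random.rand(32, 32, 3)  # toy 32x32 RGB image
K = np.random.rand(3, 3, 3)    # one 3x3 filter
print(conv2d_single(X, K).shape)  # (30, 30)
```

Real frameworks do exactly this, just with heavily optimized kernels instead of Python loops.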
Making Things Nonlinear
After each convolution, we usually apply ReLU (Rectified Linear Unit):
$$\text{ReLU}(z) = \max(0,z)$$
This just means “if it’s negative, make it zero.” Without it, stacking layers would be pointless: a chain of matrix multiplications collapses into one big matrix multiplication, which can’t learn complex patterns.
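In code, ReLU really is one line (NumPy version):

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(np.maximum(0, z))  # [0.  0.  0.  1.5 3. ]  negatives become zero
```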
Multiple Channels
Real networks have many filters per layer. So instead of one output map, you get a stack of them, one per filter:
$$\mathbf{Y} = \big[\,\mathbf{X} * \mathbf{K}_1,\; \mathbf{X} * \mathbf{K}_2,\; \ldots,\; \mathbf{X} * \mathbf{K}_N\,\big]$$
where the $N$ results are stacked along the channel dimension.
Each filter learns to detect different features, and stacking them gives the network its power.
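In Keras this is just the `filters` argument; here’s a toy example (the 64 filters and 32×32 input are arbitrary choices):

```python
import tensorflow as tf

# 64 independent 3x3 filters over an RGB input: one output channel each
layer = tf.keras.layers.Conv2D(filters=64, kernel_size=3, activation="relu")
x = tf.random.normal((1, 32, 32, 3))  # batch of one 32x32 RGB image
print(layer(x).shape)                 # (1, 30, 30, 64) with 'valid' padding
```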
Why ResNet Is Clever
The problem with deep networks used to be that gradients would shrink toward nothing (the vanishing gradient problem) as they traveled back through all those layers during training. ResNet solved this with skip connections. Instead of just passing data through layers sequentially, it adds shortcuts:
$$\mathbf{X} = \mathbf{Z} + F(\mathbf{Z})$$
Here, $F$ does all the convolution work, but we also add the original input $\mathbf{Z}$ directly to the output. This gives gradients a direct path back to earlier layers.
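A one-line derivative shows why. Differentiating the skip connection with respect to its input gives
$$\frac{\partial \mathbf{X}}{\partial \mathbf{Z}} = \mathbf{I} + \frac{\partial F(\mathbf{Z})}{\partial \mathbf{Z}}$$
so even when the $F$ term shrinks toward zero across many stacked layers, the identity term keeps the gradient from vanishing.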
A typical ResNet block looks like:
$$F(\mathbf{X}) = W_2\sigma(\text{BN}(W_1\mathbf{X}))$$
$$\mathbf{X}_{\text{out}} = \mathbf{X} + F(\mathbf{X})$$
The math is just saying: do some convolutions and batch normalization, then add the result to what you started with. That’s exactly what the `residual_block` function does in the code further down.
Building a Face Recognition System
Getting the Data Ready
First, you need to find faces in images. There are lots of ways to do this, from old-school Haar cascades to modern CNN detectors. Once you find a face, you crop it out and maybe align it so all the faces are oriented the same way.
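If you want the old-school route, OpenCV ships a Haar cascade that works in a few lines (this assumes `opencv-python` is installed, and `my_photo.jpg` is a placeholder path):

```python
import cv2

# Load OpenCV's bundled frontal-face Haar cascade
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("my_photo.jpg")              # placeholder image path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # the detector wants grayscale
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Crop each detected face to a fixed size for the network
crops = [cv2.resize(img[y:y + h, x:x + w], (112, 112))
         for (x, y, w, h) in faces]
```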
Turning Faces Into Numbers
The real magic happens when you pass cropped faces through your ResNet. Instead of trying to classify faces directly, you usually take the output from a deep layer as a compact representation of that face. This gives you a vector of numbers that captures what makes that face unique.
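To see the plumbing, here’s one way to get embeddings with a headless ResNet50 from `tf.keras` (untrained and with arbitrary sizes here; a real system would load weights trained on faces):

```python
import tensorflow as tf

# include_top=False drops the classification head; pooling="avg" collapses
# the final feature map into one vector per image.
embedder = tf.keras.applications.ResNet50(
    weights=None, include_top=False, pooling="avg",
    input_shape=(112, 112, 3))

face = tf.random.normal((1, 112, 112, 3))  # stand-in for a cropped face
embedding = embedder(face)
print(embedding.shape)  # (1, 2048): the face as a vector of numbers
```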
Making Decisions
For identifying specific people, you can add a classification layer on top:
$$\hat{y}_i = \text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$$
Train it with cross entropy loss:
$$\mathcal{L} = -\sum_{i=1}^K y_i \log(\hat{y}_i)$$
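Both formulas fit in a few lines of NumPy; the scores and labels here are made up for illustration:

```python
import numpy as np

z = np.array([2.0, 0.5, -1.0])          # raw scores for K = 3 people
y_hat = np.exp(z) / np.exp(z).sum()     # softmax: scores -> probabilities
print(y_hat.sum())                      # 1.0

y = np.array([1.0, 0.0, 0.0])           # one-hot label: this is person 0
loss = -np.sum(y * np.log(y_hat))       # cross-entropy, ~0.24 here
print(loss)
```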
Or you can compare face embeddings directly using distance metrics for more flexible matching.
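A minimal sketch of that matching step with cosine similarity (the embeddings are random stand-ins and the 0.6 threshold is invented; you’d tune it on a validation set):

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means same direction; two photos of one person should score high
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

enrolled = np.random.rand(2048)  # embedding stored at enrollment
probe = np.random.rand(2048)     # embedding from the camera right now

THRESHOLD = 0.6  # hypothetical cutoff; tune on real data
print("match" if cosine_similarity(enrolled, probe) > THRESHOLD
      else "no match")
```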
Some Actual Code
Here’s a simplified ResNet for face recognition. Real frameworks like TensorFlow already have ResNet50, but building a smaller version helps you understand what’s happening:
```python
import tensorflow as tf
```

The `residual_block` function is where the skip connection magic happens. The `build_resnet` function stacks these blocks together with some pooling and classification layers on top.
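Here’s a minimal sketch of what those two functions can look like in `tf.keras`. The filter counts, block depth, and 112×112 input size are my own illustrative choices, not a canonical ResNet:

```python
from tensorflow.keras import layers, models

def residual_block(x, filters, stride=1):
    """Conv -> BN -> ReLU -> Conv -> BN, then add the shortcut back in."""
    shortcut = x
    y = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)

    # If the block changes spatial size or channel count, project the
    # shortcut with a 1x1 conv so the shapes match before the add.
    if stride != 1 or shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride)(shortcut)
        shortcut = layers.BatchNormalization()(shortcut)

    return layers.ReLU()(layers.Add()([y, shortcut]))


def build_resnet(input_shape=(112, 112, 3), num_classes=10):
    inputs = layers.Input(shape=input_shape)

    # Stem: one big convolution plus pooling to shrink the image early
    x = layers.Conv2D(64, 7, strides=2, padding="same")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.MaxPooling2D(3, strides=2, padding="same")(x)

    # A small stack of residual blocks; real ResNets use many more
    for filters, stride in [(64, 1), (64, 1), (128, 2), (128, 1)]:
        x = residual_block(x, filters, stride)

    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)


model = build_resnet()
model.summary()
```

The 1×1 convolution on the shortcut path is the one subtlety: the add in $\mathbf{X} + F(\mathbf{X})$ only works when both tensors have the same shape.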
The Mask Problem
COVID threw everyone a curveball when suddenly half your face was covered. Models trained on full faces suddenly couldn’t recognize people wearing masks. The solution? Train on masked faces too, or focus the model on the parts around the eyes and forehead that are still visible.
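One crude but workable way to get masked training data is to paint fake masks onto the crops you already have. A toy version in NumPy (a real pipeline would overlay actual mask images aligned to facial landmarks):

```python
import numpy as np

def add_fake_mask(face, color=(200, 200, 200)):
    """Paint over the lower half of a face crop to mimic a mask.

    face: (H, W, 3) uint8 array. Blunt, but it still pushes the model
    to rely on the eye and forehead region that stays visible.
    """
    masked = face.copy()
    h = face.shape[0]
    masked[h // 2:, :, :] = color  # cover nose, mouth, and chin
    return masked
```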
What I Learned
Building this face recognition system taught me that the math behind these systems isn’t as scary as it looks. ResNet’s skip connections are actually a pretty simple idea that solved a big problem. And sometimes the biggest challenge isn’t the algorithm itself, but dealing with real world changes like everyone suddenly wearing masks.
The key is understanding that each piece builds on the last: convolutions detect features, skip connections help with training deep networks, and the whole system learns to turn faces into unique fingerprints that can be compared mathematically.