
I built a face recognition system during the COVID lockdown and ended up learning a lot more about ResNet than I planned to. The way your phone unlocks the moment it sees your face looks like magic until you trace through what’s actually happening. Underneath, it’s some clever math, a few well-chosen architectural ideas, and a lot of careful preprocessing. The code I ended up with is on GitHub.
Why face recognition
Face recognition is everywhere now. Phones, security systems, photo apps, and the occasional coffee shop. The interesting part isn’t that it works at all. It’s that modern systems are robust enough to deal with messy real-world conditions: weird lighting, awkward angles, partial occlusions, even people aging over time. ResNet is one of the architectures that made that level of robustness practical.
Convolutions are simpler than they look
An image is a 3D grid of numbers: height, width, and channels (RGB for color images). A convolution slides a small filter across the image and computes a weighted sum at each location:
\[(\mathbf{X} * \mathbf{K})(i,j) = \sum_{p=0}^{k_{h}-1} \sum_{q=0}^{k_{w}-1} \sum_{c=0}^{C-1} \mathbf{X}(i+p,j+q,c) \cdot \mathbf{K}(p,q,c)\]
Behind the notation it’s a small dot product computed at every position in the image. Early layers tend to learn low-level features like edges and textures. Deeper layers combine those into more complex features like eyes, noses, and eventually whole face structures.
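To make the formula concrete, here is a direct (and deliberately slow) NumPy translation of the triple sum above. Real frameworks replace the Python loops with heavily optimized kernels, but the arithmetic is the same:

```python
import numpy as np

def conv2d(X, K):
    """Valid convolution of an H x W x C image with a kh x kw x C filter,
    computing the weighted sum from the formula at every position."""
    kh, kw, _ = K.shape
    H, W, _ = X.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Small dot product between the filter and the patch under it.
            out[i, j] = np.sum(X[i:i + kh, j:j + kw, :] * K)
    return out

X = np.ones((5, 5, 1))          # toy 5x5 single-channel image
K = np.ones((3, 3, 1))          # 3x3 averaging-style filter
out = conv2d(X, K)              # 3x3 output of "valid" positions
```

Note the output shrinks to \((H - k_h + 1) \times (W - k_w + 1)\) because the filter only visits positions where it fits entirely inside the image.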
ReLU
After each convolution we apply ReLU:
\[\text{ReLU}(z) = \max(0, z)\]
Negative values become zero, and that’s the whole nonlinearity. Without something like ReLU, stacking layers wouldn’t actually buy you any expressive power, since composing linear maps just gives you another linear map.
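That collapse is easy to verify numerically: two stacked linear maps are exactly equivalent to one linear map, and only the nonlinearity in between breaks the equivalence. A quick NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4, 4))
x = rng.normal(size=4)

# Two linear layers collapse into a single matrix: W2 @ (W1 @ x) == (W2 @ W1) @ x.
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)

# Insert ReLU between them and the composition is no longer linear,
# so the two-layer network can represent things one matrix cannot.
relu = lambda z: np.maximum(0, z)
deep = W2 @ relu(W1 @ x)
```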
Multiple filters per layer
A single layer typically has many filters, each one looking for a different feature:
\[\mathbf{Y} = f(\mathbf{X} * \mathbf{K}_1, \mathbf{X} * \mathbf{K}_2, \ldots, \mathbf{X} * \mathbf{K}_N)\]
Stacking those filtered outputs is what gives the network its capacity to detect complex patterns higher up.
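In TensorFlow, all \(N\) filters live in a single Conv2D layer, and the filtered outputs are stacked along the channel axis, so a 3-channel input comes out with one channel per filter:

```python
import tensorflow as tf

# One layer with 16 filters maps an H x W x 3 input to H x W x 16:
# each output channel is one filter's response map.
layer = tf.keras.layers.Conv2D(16, kernel_size=3, padding="same", activation="relu")

x = tf.random.normal((1, 32, 32, 3))   # batch of one 32x32 RGB image
y = layer(x)                           # shape (1, 32, 32, 16)
```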
The ResNet trick
The reason ResNet matters is that very deep networks used to be hard to train. As you stacked more layers, gradients would shrink as they propagated back, and the early layers would barely learn anything. ResNet fixes that with skip connections:
\[\mathbf{X}_{\text{out}} = \mathbf{X} + F(\mathbf{X})\]
The block \(F\) does its convolutions and batch norm as usual, and then you add the original input \(\mathbf{X}\) back in. Gradients now have a direct path back through the skip, so the optimization stays well-conditioned even at 50+ layers.
A typical ResNet block looks like:
\[F(\mathbf{X}) = W_2 \sigma(\text{BN}(W_1 \mathbf{X}))\]
\[\mathbf{X}_{\text{out}} = \mathbf{X} + F(\mathbf{X})\]
In words: do some convolutions and batch normalization, then add the result to the input. That’s the whole idea, and it’s responsible for most of the depth we get to use today.
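One way to see why the skip helps with gradients is to differentiate the block output with respect to its input:
\[\frac{\partial \mathbf{X}_{\text{out}}}{\partial \mathbf{X}} = I + \frac{\partial F(\mathbf{X})}{\partial \mathbf{X}}\]
The identity term means backpropagation always has a path whose gradient is exactly 1, no matter how small the contribution from \(F\) becomes, which is why very deep stacks of these blocks stay trainable.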
Building a face recognition system
Finding and aligning faces
The first thing you need is a way to detect faces in an arbitrary image. Older systems used Haar cascades; modern ones use CNN-based detectors. Once you have a face, you crop it out and align it so that the eyes and nose sit in roughly consistent positions across all your inputs. Alignment matters a lot for downstream accuracy: by removing translation and rotation variation up front, you spare the network from having to learn invariance to it.
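The core of alignment is a small geometry problem. As an illustration (the landmark coordinates and the helper name here are hypothetical, not from any particular detector), you can compute the rotation that levels the detected eye centers:

```python
import numpy as np

def eye_alignment_angle(left_eye, right_eye):
    """Angle (in degrees) by which the crop must be rotated so the line
    between the two eye centers becomes horizontal."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return np.degrees(np.arctan2(dy, dx))

# (x, y) landmarks as a face detector might report them (made-up values):
angle = eye_alignment_angle(left_eye=(30, 42), right_eye=(70, 50))
# Rotating the crop by -angle levels the eyes before feeding the network.
```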
Faces to embeddings
The interesting bit is that you don’t usually try to classify “this is Alice” with the network directly. You take the activations from a deep layer and treat them as an embedding vector that captures what makes that particular face distinct. Two faces that belong to the same person should produce embeddings that are close in vector space.
Recognition
For closed-set identification (a fixed list of people), you stick a softmax layer on top:
\[\hat{y}_{i} = \text{softmax}(\mathbf{z})_{i} = \frac{e^{z_{i}}}{\sum_{j=1}^{K} e^{z_{j}}}\]
and train with cross entropy:
\[\mathcal{L} = -\sum_{i=1}^K y_{i} \log(\hat{y}_{i})\]
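Both formulas are a few lines of NumPy, which makes for a useful sanity check (the logits below are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(y_true, y_hat):
    return -float(np.sum(y_true * np.log(y_hat)))

z = np.array([2.0, 0.5, -1.0])     # logits for a 3-person closed set
y_hat = softmax(z)                 # probabilities, summing to 1
loss = cross_entropy(np.array([1.0, 0.0, 0.0]), y_hat)
```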
For open-set verification, where the person might not be in your training set, you compare embeddings directly using a distance metric (cosine similarity or L2) and threshold the result.
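A minimal sketch of that comparison using cosine similarity; the 0.6 threshold is purely illustrative and in practice must be tuned on a validation set of genuine and impostor pairs:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_person(emb1, emb2, threshold=0.6):
    """Open-set verification: accept if the embeddings are close enough.
    The threshold here is illustrative, not a tuned value."""
    return cosine_similarity(emb1, emb2) >= threshold
```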
Implementation
TensorFlow ships ResNet50, but I wanted to write my own miniature version to understand what was happening inside:
```python
import tensorflow as tf
```
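A minimal sketch of a residual_block and build_resnet pair, assuming the Keras functional API; the filter counts and input size are illustrative, not the values from the original project:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Two 3x3 convs with batch norm; the input is added back via the skip."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    # A 1x1 conv matches channel counts when the block changes width.
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    y = layers.Add()([shortcut, y])     # the skip connection
    return layers.ReLU()(y)

def build_resnet(input_shape=(112, 112, 3), num_classes=10):
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 7, strides=2, padding="same")(inputs)   # conv stem
    x = layers.MaxPooling2D(3, strides=2, padding="same")(x)
    for filters in (32, 64, 128):
        x = residual_block(x, filters)
        x = layers.MaxPooling2D(2)(x)
    x = layers.GlobalAveragePooling2D()(x)                        # spatial -> vector
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```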
The residual_block function is where the skip connection happens. The build_resnet function stacks a few of those together with a small initial conv stem and a global average pool at the end. It's nowhere near ResNet50 in capacity, but it's enough to learn a small set of faces well.
Masks broke everything
A few weeks into the project, COVID happened and everyone started wearing masks. That turned out to be a real test for the model. A network trained on full faces does not generalize well to faces with the bottom half covered, since most of the discriminative features around the mouth and chin are gone. There are two main ways to fix this: include masked faces in your training set, or train the network to focus on regions that stay visible (eyes and forehead). I tried both, and the data augmentation approach worked better in my case.
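The augmentation itself can be as crude as covering the lower portion of each aligned crop during training; the cutoff fraction below is a guess, not a tuned value, and real masks vary in shape and texture:

```python
import numpy as np

def add_synthetic_mask(face, value=0.5):
    """Crude mask augmentation: blank out the lower part of an aligned face
    crop (H x W x C, values in [0, 1]) so the network learns to rely on the
    eye and forehead region."""
    h = face.shape[0]
    masked = face.copy()
    masked[int(h * 0.55):, :, :] = value   # cover roughly the lower 45%
    return masked
```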
What I took from this
The math behind face recognition is less mysterious than it looks once you sit with it. Convolutions detect features. Skip connections solve the gradient problem in deep networks. Embeddings turn faces into points in a vector space where geometric distance corresponds to identity similarity. Each piece is small on its own, and they compose into something that genuinely works in production-grade systems.
The other thing I took away is how much of the work isn’t the model itself. Detection, alignment, dataset balance, evaluation under realistic conditions, and edge cases like masks all matter at least as much as the architecture. Most of the engineering effort lives in those pieces.