Building Neural Networks From Scratch

Dilawar Mahmood

2018-11-20

Machine Learning, Mathematics, Neural Networks

I’d been using TensorFlow for a while without really understanding what was happening behind all the convenient function calls. So I built a small neural network from scratch using only NumPy. The point was to work through the math myself and see if any of it stuck. Michael Nielsen’s book and 3Blue1Brown’s neural network videos helped a lot.

The setup

A neural network is a stack of layers, where each layer is a bunch of neurons doing matrix math with a nonlinearity at the end. Each neuron takes some inputs, multiplies them by weights, and adds a bias to get a pre-activation value z, which then goes through the nonlinearity:

\[z = \sum_{i=1}^{n} w_i x_i + b\]
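In NumPy that sum is just a dot product. Here is a minimal sketch for a single neuron and then a two-neuron layer; the inputs, weights, and bias are made-up values chosen only for illustration.

```python
import numpy as np

# Illustrative values, not taken from any dataset.
x = np.array([0.5, -1.0, 2.0])   # inputs x_1..x_n
w = np.array([0.1, 0.4, -0.3])   # weights w_1..w_n
b = 0.2                          # bias

# z = sum_i w_i * x_i + b for one neuron
z = float(np.dot(w, x) + b)

# A layer stacks one weight row per neuron, so the same sum becomes W @ x + b.
W = np.vstack([w, 2 * w])            # 2 neurons, 3 inputs each
b_vec = np.array([[b], [b]])         # one bias per neuron, shape (2, 1)
Z = W @ x.reshape(-1, 1) + b_vec     # pre-activations, shape (2, 1)
```

The matrix form is the one the rest of the code relies on: one row of W per neuron, one column of the input per example.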

I went with three layers, sigmoid activations, and binary cross entropy loss. Nothing fancy.

import numpy as np

class NeuralNetwork:
    def __init__(self, layer_dims: list[int]) -> None:
        self.layer_dims = layer_dims
        self.parameters = self._initialize_parameters()

Initialization

How you initialize the weights matters more than I expected. Make them too big and the network blows up; too small and it barely learns. Xavier initialization keeps the variance reasonable as activations propagate through layers:

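def _initialize_parameters(self) -> dict[str, np.ndarray]:
    parameters: dict[str, np.ndarray] = {}
    for l in range(1, len(self.layer_dims)):
        # W{l} has shape (units in layer l, units in layer l-1).
        # Scale by the fan-in (inputs per neuron) so the variance of
        # each pre-activation stays close to the variance of its inputs.
        parameters[f"W{l}"] = np.random.randn(
            self.layer_dims[l],
            self.layer_dims[l-1]
        ) * np.sqrt(1./self.layer_dims[l-1])
        parameters[f"b{l}"] = np.zeros((self.layer_dims[l], 1))
    return parameters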
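To see why the scale matters, here is a small experiment. It is a sketch under assumptions: the layer sizes and batch size are hypothetical, the inputs are unit-variance Gaussians, and the scale is the fan-in form sqrt(1/n_in). The original Glorot/Xavier scheme uses a variance of 2/(n_in + n_out), so treat this as a simplified variant.

```python
import numpy as np

rng = np.random.default_rng(0)     # fixed seed so the numbers are repeatable
n_in, n_out, m = 512, 256, 1000    # hypothetical layer sizes and batch size

A_prev = rng.standard_normal((n_in, m))      # unit-variance inputs

W_big = rng.standard_normal((n_out, n_in))   # unscaled weights
W_scaled = W_big * np.sqrt(1.0 / n_in)       # scaled by fan-in

# Variance of the pre-activations Z = W @ A_prev in each case.
var_big = float(np.var(W_big @ A_prev))        # grows with n_in
var_scaled = float(np.var(W_scaled @ A_prev))  # stays near 1
```

With unscaled weights the pre-activation variance is on the order of n_in, which pushes a sigmoid deep into its flat regions. With the scaling it stays close to 1.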

Forward pass

This is where you actually run data through the network. The sigmoid squashes the output between 0 and 1:

\[\sigma(z) = \frac{1}{1 + e^{-z}}\]

def forward_propagation(self, X: np.ndarray):
    cache: dict[str, np.ndarray] = {}
    A = X
    for l in range(1, len(self.layer_dims)):
        Z = np.dot(self.parameters[f"W{l}"], A) + self.parameters[f"b{l}"]
        A = 1 / (1 + np.exp(-Z))
        cache[f"A{l}"] = A
        cache[f"Z{l}"] = Z
    return A, cache

I cached the intermediate Z and A values because backprop needs them, and recomputing the forward pass every time would be wasteful.
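It helps to look at the shapes once. The sketch below is a standalone mirror of the forward pass (the `forward` helper, the layer sizes, and the random data are all hypothetical) that prints nothing and only records what ends up in the cache.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def forward(parameters: dict[str, np.ndarray], X: np.ndarray, n_layers: int):
    """Standalone mirror of the forward pass; returns the output and the cache."""
    cache: dict[str, np.ndarray] = {}
    A = X
    for l in range(1, n_layers + 1):
        Z = parameters[f"W{l}"] @ A + parameters[f"b{l}"]  # bias broadcasts over columns
        A = sigmoid(Z)
        cache[f"Z{l}"], cache[f"A{l}"] = Z, A
    return A, cache

rng = np.random.default_rng(0)
layer_dims = [2, 3, 1]   # hypothetical: 2 features, 3 hidden units, 1 output
parameters = {}
for l in range(1, len(layer_dims)):
    parameters[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1]))
    parameters[f"b{l}"] = np.zeros((layer_dims[l], 1))

X = rng.standard_normal((2, 5))   # 5 examples stored as columns
A_out, cache = forward(parameters, X, n_layers=len(layer_dims) - 1)

# Each cached array has shape (units in that layer, number of examples).
shapes = {name: value.shape for name, value in cache.items()}
```

Every cached array keeps the examples as columns, which is why the backward pass can later multiply by `A_prev.T` without any reshaping.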

Loss

For binary classification I used binary cross entropy. It penalizes confident wrong answers more than uncertain ones, which is the behavior you want from a loss:

\[E = -\frac{1}{m}\sum_{i=1}^m \left[ y_i \log(a_i) + (1 - y_i) \log(1 - a_i) \right]\]

The loss should drop as predictions get better. If it doesn’t, something is wrong upstream.
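A quick numerical check of the "confident wrong answers" claim, as a minimal sketch. The `binary_cross_entropy` helper and its `eps` clipping are assumptions added here to keep `log` away from zero; they are not part of the network code.

```python
import numpy as np

def binary_cross_entropy(a: np.ndarray, y: np.ndarray, eps: float = 1e-12) -> float:
    # Clip predictions so log() never sees exactly 0 or 1 (an added safeguard).
    a = np.clip(a, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(a) + (1 - y) * np.log(1 - a)))

y = np.array([1.0])  # the true label is 1

loss_confident_right = binary_cross_entropy(np.array([0.99]), y)  # about 0.01
loss_unsure = binary_cross_entropy(np.array([0.5]), y)            # about 0.69
loss_confident_wrong = binary_cross_entropy(np.array([0.01]), y)  # about 4.6
```

Going from unsure to confidently wrong costs far more than going from unsure to confidently right saves, which is the asymmetry the loss is designed to have.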

Backprop

The network learns by figuring out how much each weight contributed to the error and nudging it in the right direction. The chain rule does most of the work:

\[\frac{\partial E}{\partial w_{i}} = \frac{\partial E}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w_{i}}\]

Then you update the weights:

\[w_{i} := w_{i} - \alpha \frac{\partial E}{\partial w_{i}}\]
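The update rule is easier to trust after watching it on a problem with a known answer. This is a toy sketch, not the network's loss: it assumes a made-up one-parameter loss E(w) = (w - 3)^2, whose minimum is at w = 3.

```python
def grad(w: float) -> float:
    # Derivative of the toy loss E(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

alpha = 0.1   # learning rate
w = 0.0       # arbitrary starting point
history = [w]
for _ in range(50):
    w = w - alpha * grad(w)   # w := w - alpha * dE/dw
    history.append(w)
```

Each step shrinks the distance to the minimum by a factor of 0.8 here, so `w` ends up very close to 3. A learning rate above 1.0 would make the same loop diverge on this toy loss.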

The training loop ties everything together:

def train(self, X, Y, learning_rate: float = 0.1, epochs: int = 1000):
    losses: list[float] = []
    m = X.shape[1]  # number of examples (stored as columns)
    eps = 1e-12     # keeps log() away from 0 if the sigmoid saturates
    for _ in range(epochs):
        A, cache = self.forward_propagation(X)
        A_safe = np.clip(A, eps, 1 - eps)
        loss = -1/m * np.sum(Y * np.log(A_safe) + (1 - Y) * np.log(1 - A_safe))
        losses.append(float(loss))
        self._backward_propagation(X, Y, cache, learning_rate)
    return losses

For the output layer (layer L, the last one) the gradient of the loss E is straightforward:

\[\frac{\partial E}{\partial z^{[L]}} = A^{[L]} - Y\]

For hidden layers you propagate backwards using the chain rule:

\[\frac{\partial E}{\partial z^{[l]}} = \left( W^{[l+1]T} \frac{\partial E}{\partial z^{[l+1]}} \right) \odot \sigma'(z^{[l]})\]
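The output-layer gradient \(A - Y\) comes from combining the loss derivative with the sigmoid derivative. As a short worked derivation for a single example, with \(a = \sigma(z)\) and \(E\) the per-example cross entropy from the Loss section:

\[\frac{\partial E}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a} = \frac{a - y}{a(1 - a)}\]

\[\frac{\partial a}{\partial z} = a(1 - a)\]

\[\frac{\partial E}{\partial z} = \frac{\partial E}{\partial a} \cdot \frac{\partial a}{\partial z} = \frac{a - y}{a(1 - a)} \cdot a(1 - a) = a - y\]

The \(a(1 - a)\) factors cancel, which is why sigmoid and cross entropy pair so well.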

The implementation walks the layers in reverse:

def _backward_propagation(self, X, Y, cache, learning_rate):
    m = X.shape[1]
    L = len(self.layer_dims) - 1

    # dE/dA for the output layer (derivative of binary cross entropy)
    dA = -(np.divide(Y, cache[f"A{L}"]) -
           np.divide(1 - Y, 1 - cache[f"A{L}"]))

    for l in reversed(range(1, L + 1)):
        A_prev = cache[f"A{l-1}"] if l > 1 else X

        # sigma'(z) = A * (1 - A), reusing the cached activation
        dZ = dA * (cache[f"A{l}"] * (1 - cache[f"A{l}"]))
        dW = 1/m * np.dot(dZ, A_prev.T)
        db = 1/m * np.sum(dZ, axis=1, keepdims=True)

        # Propagate to the previous layer before W{l} is updated, so the
        # gradient uses the same weights the forward pass used
        if l > 1:
            dA = np.dot(self.parameters[f"W{l}"].T, dZ)

        self.parameters[f"W{l}"] -= learning_rate * dW
        self.parameters[f"b{l}"] -= learning_rate * db

The sigmoid derivative has a nice closed form, which makes the implementation cleaner: \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\). You already have \(\sigma(z)\) from the forward pass, so the derivative is basically free.
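One way to catch mistakes in a hand-written backward pass is to compare it against finite differences. The sketch below is a standalone check, not part of the class: the helper names (`forward`, `loss_fn`, `gradients`), the layer sizes, and the random data are all assumptions for illustration. It computes the gradients analytically without updating any weights, then nudges each parameter by a small `h` and measures how the loss actually changes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X, L):
    cache, A = {}, X
    for l in range(1, L + 1):
        A = sigmoid(params[f"W{l}"] @ A + params[f"b{l}"])
        cache[f"A{l}"] = A
    return A, cache

def loss_fn(params, X, Y, L):
    A, _ = forward(params, X, L)
    return float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

def gradients(params, X, Y, L):
    """Analytic gradients via backprop; does not modify params."""
    m = X.shape[1]
    A, cache = forward(params, X, L)
    grads = {}
    dZ = A - Y                                    # output layer: sigmoid + cross entropy
    for l in reversed(range(1, L + 1)):
        A_prev = cache[f"A{l-1}"] if l > 1 else X
        grads[f"W{l}"] = dZ @ A_prev.T / m
        grads[f"b{l}"] = dZ.sum(axis=1, keepdims=True) / m
        if l > 1:
            dA_prev = params[f"W{l}"].T @ dZ
            dZ = dA_prev * A_prev * (1 - A_prev)  # sigma'(z) from the cached activation
    return grads

rng = np.random.default_rng(0)
layer_dims = [3, 4, 1]                            # hypothetical small network
L = len(layer_dims) - 1
params = {}
for l in range(1, L + 1):
    params[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.5
    params[f"b{l}"] = np.zeros((layer_dims[l], 1))
X = rng.standard_normal((3, 8))
Y = (rng.random((1, 8)) > 0.5).astype(float)

grads = gradients(params, X, Y, L)
h = 1e-5
max_diff = 0.0
for name, value in params.items():
    for idx in np.ndindex(value.shape):
        original = value[idx]
        value[idx] = original + h
        loss_plus = loss_fn(params, X, Y, L)
        value[idx] = original - h
        loss_minus = loss_fn(params, X, Y, L)
        value[idx] = original                     # restore the parameter
        numeric = (loss_plus - loss_minus) / (2 * h)
        max_diff = max(max_diff, abs(numeric - grads[name][idx]))
```

If `max_diff` is tiny, the analytic and numerical gradients agree. A large value usually points at a sign error, a missing 1/m, or a transposed matrix.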

What I got out of it

After this exercise I had way more intuition for what TensorFlow was doing under the hood. Things like “why is my loss exploding to NaN” or “why is my network not learning” stopped feeling like mysteries. They turned into pretty mechanical questions about gradients, initialization, and learning rates.

If you’ve been using ML libraries without ever opening them up, I’d recommend doing this once. It’s tedious, but the intuition you walk away with sticks in a way that no tutorial gives you.