
I used to think I could skip the statistics and jump straight into deep learning. Then I took TMA4268 at NTNU and the gaps in my understanding became hard to ignore. Most of the things I’d been doing on instinct had names, theorems behind them, and (more usefully) failure modes I could now anticipate.
The math underneath the methods
Statistical learning is the math that tells you why machine learning works. Neural networks get most of the attention, but the moment you understand maximum likelihood estimation, hypothesis testing, and the bias-variance decomposition, you stop running models blind and start running them with intent.
The framing is general. You have a probability space \((X \times Y, P)\) where \(X\) is your input space, \(Y\) is your output space, and \(P\) describes how the data is distributed. You want a function \(f: X \to Y\) that minimizes the expected risk:
\[R(f) = \int_{X \times Y} L(f(x), y) \, dP(x, y)\]
where \(L\) is your loss. Linear regression is a special case. So is logistic regression. So is a Transformer. The whole field is variations on this theme.
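To make the "special case" claim concrete, here is the standard reduction (nothing course-specific about it): take squared-error loss \(L(f(x), y) = (y - f(x))^2\) and restrict \(f\) to linear functions \(f_\beta(x) = x^\top \beta\). Replacing the unknown \(P\) with the empirical distribution of a sample \((x_1, y_1), \dots, (x_n, y_n)\) turns the expected risk into the empirical risk
\[\hat{R}(f_\beta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2\]
and minimizing that over \(\beta\) is exactly ordinary least squares.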
Why bother with statistics
Statistics gives you tools that machine learning by itself doesn’t: a way to test whether a feature actually contributes signal, confidence intervals on predictions, principled ways to choose between models, and a vocabulary for talking about uncertainty. If you skip it you can still get models to fit, but you’ll have a hard time saying anything trustworthy about them.
Here’s what a proper linear regression with statistical analysis looks like, instead of just fitting and hoping:
```python
import numpy as np
```
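A minimal sketch of that kind of fit, using statsmodels for the inference; the simulated data and the choice of library are mine for illustration, not necessarily the course's exact recipe:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: y depends on x1 but not on x2 (a pure noise feature)
rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 + 1.5 * x1 + rng.normal(scale=2.0, size=n)

# Design matrix with an intercept column
X = sm.add_constant(np.column_stack([x1, x2]))

# Ordinary least squares with the full inferential machinery
model = sm.OLS(y, X).fit()

# Coefficients, standard errors, t-statistics, p-values, and 95% CIs
print(model.summary())

# Confidence intervals on their own, if that's all you need
print(model.conf_int(alpha=0.05))
```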
The output isn’t just a set of coefficients. It tells you how confident you should be in each one, which is the part that actually matters when you have to decide whether a coefficient is real signal or noise.
The bias-variance decomposition
If I had to pick one idea from statistical learning to carry forward, it would be this one:
\[E[(Y - \hat{f}(X))^2] = \text{Var}(\hat{f}(X)) + [\text{Bias}(\hat{f}(X))]^2 + \sigma^2\]
The expected error of a model decomposes into three pieces: how much your predictions vary across different training samples (variance), how systematically off they are on average (bias), and the noise inherent in the data that no model can ever beat (\(\sigma^2\)). Once you understand this decomposition, the entire conversation about model complexity, regularization, and overfitting fits onto a single page.
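You can also watch the decomposition happen numerically. Here is a rough simulation (entirely my own illustration, with arbitrary choices of true function, polynomial degrees, and sample size): fit the same model class on many fresh training sets, then measure, at a single test point, how far the average prediction sits from the truth (bias) and how much individual predictions scatter around that average (variance).

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

def bias_variance_at_point(degree, x0=0.25, n_train=50, n_reps=2000, sigma=0.3):
    """Fit a polynomial of the given degree on many independent training
    sets and collect its prediction at a single test point x0."""
    preds = np.empty(n_reps)
    for r in range(n_reps):
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(scale=sigma, size=n_train)
        coefs = np.polyfit(x, y, degree)
        preds[r] = np.polyval(coefs, x0)
    bias = preds.mean() - true_f(x0)
    variance = preds.var()
    return bias**2, variance

for degree in (1, 3, 7):
    b2, var = bias_variance_at_point(degree)
    print(f"degree {degree}: bias^2 = {b2:.4f}, variance = {var:.4f}")
```

The low-degree fit comes out biased but stable, the high-degree fit nearly unbiased but noisier, and no degree touches the \(\sigma^2\) term.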
Combining ML with statistical rigor
Cross-validation produces a number, but the number alone doesn’t tell you how reliable it is. Wrap it with confidence intervals and the picture gets clearer:
```python
from sklearn.model_selection import KFold
```
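A sketch of that wrapping, with a normal-approximation interval over per-fold accuracies; the dataset and classifier are placeholders, and with only ten folds a t-interval would arguably be more honest:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Per-fold accuracy estimates from 10-fold CV
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

# Normal-approximation 95% CI on the mean fold accuracy
mean = scores.mean()
se = scores.std(ddof=1) / np.sqrt(len(scores))
low, high = mean - 1.96 * se, mean + 1.96 * se
print(f"accuracy: {mean:.3f}, 95% CI: [{low:.3f}, {high:.3f}]")
```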
Now the report is something like “85% accuracy with a 95% CI of [82%, 88%],” which is a lot more useful than “85%” on its own. It also makes A/B comparisons of models honest, since you can see whether the difference between two models actually clears the noise floor.
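For the A/B part, one rough way to check whether a gap clears the noise floor is a paired test on the per-fold differences, scoring both models on the same folds. Folds share training data, so treat this as a sanity check rather than an exact test; the models and dataset below are again just placeholders:

```python
from scipy.stats import ttest_rel
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Score both candidates on exactly the same folds so the comparison is paired
model_a = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model_b = RandomForestClassifier(random_state=0)
scores_a = cross_val_score(model_a, X, y, cv=cv, scoring="accuracy")
scores_b = cross_val_score(model_b, X, y, cv=cv, scoring="accuracy")

# Paired t-test on the per-fold differences
stat, p_value = ttest_rel(scores_a, scores_b)
print(f"mean difference: {(scores_a - scores_b).mean():+.3f}, p = {p_value:.3f}")
```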
What I took from the course
For a lot of problems, a thoughtful linear or generalized linear model with proper diagnostics will outperform a sloppy neural network and be far easier to interpret. For the problems that genuinely need a neural network, the statistical lens still helps, since the same questions about variance, calibration, and significance still apply.
Next time you build a model, try reaching for confidence intervals or hypothesis tests at least once before declaring victory. Once you've seen your own work through that lens, it's hard to go back.