Statistical Learning: Essential Mathematics

I used to be one of those people who wanted to jump straight into deep learning without bothering with the math. Then I took a statistical learning course at NTNU and realized I was missing a huge piece of the puzzle. Turns out understanding the fundamentals actually makes you better at machine learning, not worse.

The math behind machine learning

Statistical learning is basically the math that explains how machine learning works. Neural networks get all the attention, but if you understand maximum likelihood estimation and hypothesis testing, you’ll actually know what’s happening when your model learns.

What we’re actually doing

Statistical learning is about finding functions that make good predictions from data. You have a probability space $(X \times Y, P)$ where $X$ is your input space and $Y$ is your output space. $P$ describes how your data is distributed. You’re trying to find a function $f : X \to Y$ that minimizes the expected risk:

$$R(f) = \int_{X \times Y} L(f(x), y) \, dP(x, y)$$

where $L$ is your loss function. This framework works for everything from simple linear regression to complex neural networks.
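In practice you never know $P$, so you approximate the integral with an average over a finite sample, the empirical risk $\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i)$. Here's a minimal sketch with squared loss; the toy data and the candidate predictor are made up purely for illustration:

import numpy as np

def squared_loss_risk(predict, X: np.ndarray, y: np.ndarray) -> float:
    """Empirical risk: average of L(f(x), y) = (f(x) - y)^2 over the sample."""
    return float(np.mean((predict(X) - y) ** 2))

# Toy data, just for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

def predict(X: np.ndarray) -> np.ndarray:
    return 2.0 * X[:, 0]  # one candidate f: X -> Y

print(squared_loss_risk(predict, X, y))  # approximates R(f) under squared loss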

Why bother with statistics?

Here’s what statistical learning actually gives you: you can test whether your features matter, get confidence intervals for predictions, and choose between models properly. You’ll understand how your data was generated, spot outliers, and quantify how uncertain your predictions are. That’s not just nice to have; it’s essential if you want models that work.

An example

Let me show you linear regression with proper statistical analysis, not just the “fit a line and hope for the best” approach:

import numpy as np
from scipy import stats

def fit_linear_model(X: np.ndarray, y: np.ndarray) -> dict[str, np.ndarray]:
    """
    Fit a linear regression model with statistical analysis.

    Args:
        X: Input features of shape (n_samples, n_features)
        y: Target values of shape (n_samples,)

    Returns:
        dict: Contains model coefficients, standard errors, and p-values
    """

    # Add intercept term
    X = np.column_stack([np.ones(X.shape[0]), X])

    # Calculate coefficients using normal equation
    beta = np.linalg.inv(X.T @ X) @ X.T @ y

    # Calculate standard errors
    n = X.shape[0]
    y_pred = X @ beta
    residuals = y - y_pred
    mse = np.sum(residuals**2) / (n - X.shape[1])
    var_beta = mse * np.linalg.inv(X.T @ X)
    se = np.sqrt(np.diag(var_beta))

    # Calculate t-statistics and p-values
    t_stats = beta / se
    p_values = 2 * (1 - stats.t.cdf(np.abs(t_stats), n - X.shape[1]))

    return {
        'coefficients': beta,
        'std_errors': se,
        'p_values': p_values
    }

This doesn’t just give you predictions. It tells you how reliable your model’s coefficients are, which is incredibly useful.
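For example, on synthetic data where only the first feature actually matters, the p-values should make that obvious. This snippet just calls the fit_linear_model function above; the data is invented for illustration:

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=200)  # only feature 0 matters

results = fit_linear_model(X, y)
print(results['coefficients'])  # roughly [0, 3, 0] (intercept, feature 0, feature 1)
print(results['p_values'])      # tiny for feature 0, large for feature 1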

Why I like this approach

Statistical learning lets you understand what your models learned, not just trust that they work. You get tools to check if your model found real patterns or just fit to noise. And honestly, many statistical methods are simpler and work just as well as complex deep learning for a lot of problems.

The bias-variance tradeoff

One of the most important concepts in statistical learning is the bias-variance decomposition:

$$E[(Y - \hat{f}(X))^2] = \text{Var}(\hat{f}(X)) + [\text{Bias}(\hat{f}(X))]^2 + \sigma^2$$

Here $Y$ is the true value you’re trying to predict, $\hat{f}$ is your model, $\text{Var}(\hat{f}(X))$ is how much your predictions vary across different training sets, $\text{Bias}(\hat{f}(X))$ is how far off your predictions are on average, and $\sigma^2$ is the irreducible noise. This explains the fundamental tradeoff between model complexity and generalization.
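One way to make this concrete is to simulate it: fit models of different complexity on many resampled training sets and compare how much the predictions at a fixed point spread out (variance) with how far their average lands from the truth (bias). Here's a rough sketch; the sine data-generating function and the polynomial degrees are arbitrary choices, not anything canonical:

import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)   # assumed data-generating function
x0, sigma, n_train, n_repeats = 0.3, 0.3, 30, 500

for degree in (1, 3, 7):
    preds_at_x0 = []
    for _ in range(n_repeats):
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(scale=sigma, size=n_train)
        coeffs = np.polyfit(x, y, degree)           # fit polynomial of given degree
        preds_at_x0.append(np.polyval(coeffs, x0))  # prediction at a fixed point
    preds = np.array(preds_at_x0)
    bias_sq = (preds.mean() - true_f(x0)) ** 2
    variance = preds.var()
    print(f"degree {degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}")

Low-degree fits tend to show high bias and low variance, high-degree fits the opposite, which is exactly the tradeoff the formula describes.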

Combining ML with statistical rigor

Here’s a practical example:

from sklearn.model_selection import KFold
import numpy as np
from scipy import stats
from typing import Any, Union

def statistical_cross_validate(
    model: Any,
    X: np.ndarray,
    y: np.ndarray,
    n_splits: int = 5
) -> dict[str, Union[float, tuple[float, float]]]:
    """
    Perform cross validation with statistical analysis.

    Args:
        model: Scikit-learn compatible model
        X: Features matrix
        y: Target vector
        n_splits: Number of cross validation folds

    Returns:
        dict: Cross validation metrics with confidence intervals
    """

    kf = KFold(n_splits=n_splits, shuffle=True)
    scores: list[float] = []

    for train_idx, test_idx in kf.split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]

        model.fit(X_train, y_train)
        score = model.score(X_test, y_test)
        scores.append(score)

    # Calculate confidence interval
    mean_score = np.mean(scores)
    ci = stats.t.interval(0.95, len(scores) - 1,
                          loc=mean_score,
                          scale=stats.sem(scores))

    return {
        'mean_score': mean_score,
        'confidence_interval': ci,
        'std_score': np.std(scores)
    }

This gives you not just a performance score, but confidence intervals so you know how reliable that score actually is.
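Using it looks something like this; the ridge regression model and the synthetic data below are just placeholders for whatever you're actually working with:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=200)

results = statistical_cross_validate(Ridge(alpha=1.0), X, y)
print(f"R^2: {results['mean_score']:.3f} "
      f"(95% CI: {results['confidence_interval']})")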

Final thoughts

Statistical learning isn’t just academic stuff you have to sit through before the “real” machine learning. It’s what makes you better at building models that actually work and that you can trust.

The math looks intimidating at first. But once you see how it connects to real problems, it clicks. Next time you build a model, throw in some statistical analysis. You’ll see what your model is actually doing instead of just hoping it works.