Statistical Learning: Essential Mathematics

I used to be one of those people who wanted to jump straight into deep learning without bothering with the math. Then I took a statistical learning course at NTNU and realized I was missing a huge piece of the puzzle. Turns out understanding the fundamentals actually makes you better at machine learning, not worse.

The math behind machine learning

Statistical learning is basically the math that explains how machine learning works. Neural networks get all the attention, but if you understand maximum likelihood estimation and hypothesis testing, you’ll actually know what’s happening when your model learns.

What we’re actually doing

Statistical learning is about finding functions that make good predictions from data. You have a probability space $(X \times Y, P)$ where $X$ is your input space and $Y$ is your output space. $P$ describes how your data is distributed. You’re trying to find a function $f : X \to Y$ that minimizes the expected risk:

$$R(f) = \int_{X \times Y} L(f(x), y) \, dP(x, y)$$

where $L$ is your loss function. This framework works for everything from simple linear regression to complex neural networks.
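In practice you never know $P$, so you approximate the integral with an average over a finite sample, the empirical risk $\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i)$. Here's a minimal sketch with squared loss; the toy data and the candidate predictor are made up purely for illustration:

import numpy as np

def squared_loss_risk(predict, X: np.ndarray, y: np.ndarray) -> float:
    """Empirical risk: average of L(f(x), y) = (f(x) - y)^2 over the sample."""
    return float(np.mean((predict(X) - y) ** 2))

# Toy data, just for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

def predict(X: np.ndarray) -> np.ndarray:
    return 2.0 * X[:, 0]  # one candidate f: X -> Y

print(squared_loss_risk(predict, X, y))  # approximates R(f) under squared loss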

Why bother with statistics?

Here’s what statistical learning actually gives you: you can test whether your features matter, get confidence intervals for predictions, and choose between models properly. You’ll understand how your data was generated, spot outliers, and quantify how uncertain your predictions are. That’s not just nice to have; it’s essential if you want models that work.

An example

Let me show you linear regression with proper statistical analysis, not just the “fit a line and hope for the best” approach:

import numpy as np
from scipy import stats

def fit_linear_model(X: np.ndarray, y: np.ndarray) -> dict[str, np.ndarray]:
    """
    Fit a linear regression model with statistical analysis.

    Args:
        X: Input features of shape (n_samples, n_features)
        y: Target values of shape (n_samples,)

    Returns:
        dict: Contains model coefficients, standard errors, and p-values
    """

    # Add intercept term
    X = np.column_stack([np.ones(X.shape[0]), X])

    # Calculate coefficients using normal equation
    beta = np.linalg.inv(X.T @ X) @ X.T @ y

    # Calculate standard errors
    n = X.shape[0]
    y_pred = X @ beta
    residuals = y - y_pred
    mse = np.sum(residuals**2) / (n - X.shape[1])
    var_beta = mse * np.linalg.inv(X.T @ X)
    se = np.sqrt(np.diag(var_beta))

    # Calculate t-statistics and p-values
    t_stats = beta / se
    p_values = 2 * (1 - stats.t.cdf(np.abs(t_stats), n - X.shape[1]))

    return {
        'coefficients': beta,
        'std_errors': se,
        'p_values': p_values
    }

This doesn’t just give you predictions. It tells you how reliable your model’s coefficients are, which is incredibly useful.
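For example, on synthetic data where only the first feature actually matters, the p-values should make that obvious. This snippet just calls the fit_linear_model function above; the data is invented for illustration:

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=200)  # only feature 0 matters

results = fit_linear_model(X, y)
print(results['coefficients'])  # roughly [0, 3, 0] (intercept, feature 0, feature 1)
print(results['p_values'])      # tiny for feature 0, large for feature 1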

Why I like this approach

Statistical learning lets you understand what your models learned, not just trust that they work. You get tools to check if your model found real patterns or just fit to noise. And honestly, many statistical methods are simpler and work just as well as complex deep learning for a lot of problems.

The bias-variance tradeoff

One of the most important concepts in statistical learning is the bias-variance decomposition:

$$E[(Y - \hat{f}(X))^2] = \text{Var}(\hat{f}(X)) + [\text{Bias}(\hat{f}(X))]^2 + \sigma^2$$

Here $Y$ is the true value you’re trying to predict, $\hat{f}$ is your model, $\text{Var}(\hat{f}(X))$ is how much your predictions vary across different training sets, $\text{Bias}(\hat{f}(X))$ is how far off your predictions are on average, and $\sigma^2$ is the irreducible noise. This explains the fundamental tradeoff between model complexity and generalization.
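One way to make this concrete is to simulate it: fit models of different complexity on many resampled training sets and compare how much the predictions at a fixed point spread out (variance) with how far their average lands from the truth (bias). Here's a rough sketch; the sine data-generating function and the polynomial degrees are arbitrary choices, not anything canonical:

import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)   # assumed data-generating function
x0, sigma, n_train, n_repeats = 0.3, 0.3, 30, 500

for degree in (1, 3, 7):
    preds_at_x0 = []
    for _ in range(n_repeats):
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(scale=sigma, size=n_train)
        coeffs = np.polyfit(x, y, degree)           # fit polynomial of given degree
        preds_at_x0.append(np.polyval(coeffs, x0))  # prediction at a fixed point
    preds = np.array(preds_at_x0)
    bias_sq = (preds.mean() - true_f(x0)) ** 2
    variance = preds.var()
    print(f"degree {degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}")

Low-degree fits tend to show high bias and low variance, high-degree fits the opposite, which is exactly the tradeoff the formula describes.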

Combining ML with statistical rigor

Here’s a practical example:

from sklearn.model_selection import KFold
import numpy as np
from scipy import stats
from typing import Any, Union

def statistical_cross_validate(
    model: Any,
    X: np.ndarray,
    y: np.ndarray,
    n_splits: int = 5
) -> dict[str, Union[float, tuple[float, float]]]:
    """
    Perform cross validation with statistical analysis.

    Args:
        model: Scikit-learn compatible model
        X: Features matrix
        y: Target vector
        n_splits: Number of cross validation folds

    Returns:
        dict: Cross validation metrics with confidence intervals
    """

    kf = KFold(n_splits=n_splits, shuffle=True)
    scores: list[float] = []

    for train_idx, test_idx in kf.split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]

        model.fit(X_train, y_train)
        score = model.score(X_test, y_test)
        scores.append(score)

    # Calculate confidence interval
    mean_score = np.mean(scores)
    ci = stats.t.interval(0.95, len(scores) - 1,
                          loc=mean_score,
                          scale=stats.sem(scores))

    return {
        'mean_score': mean_score,
        'confidence_interval': ci,
        'std_score': np.std(scores)
    }

This gives you not just a performance score, but confidence intervals so you know how reliable that score actually is.
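Using it looks something like this; the ridge regression model and the synthetic data below are just placeholders for whatever you're actually working with:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=200)

results = statistical_cross_validate(Ridge(alpha=1.0), X, y)
print(f"R^2: {results['mean_score']:.3f} "
      f"(95% CI: {results['confidence_interval']})")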

Final thoughts

Statistical learning isn’t just academic stuff you have to sit through before the “real” machine learning. It’s what makes you better at building models that actually work and that you can trust.

The math looks intimidating at first. But once you see how it connects to real problems, it clicks. Next time you build a model, throw in some statistical analysis. You’ll see what your model is actually doing instead of just hoping it works.