Data Science Applications to Astronomy
Week 8: Model Building III:
Regularization
Recall the model for general least squares linear regression \(y = A \theta + \epsilon,\) where:
the vector of measurement values is \(y_{\mathrm{obs}}\),
the design matrix is \(A\),
the coefficients to be inferred are \(\theta\),
the measurement errors \(\epsilon \sim \mathrm{Normal}(0,\Sigma)\), and
the covariance matrix for the measurement errors is \(\Sigma\).
We can find the maximum likelihood estimator (also known as the best-fit for general least squares) analytically via
$$\hat{\theta}_{MLE} = (A^{T} \Sigma^{-1} A)^{-1} (A^{T} \Sigma^{-1} \ y_\mathrm{obs} ),$$
or by maximizing a loss function equal to the log likelihood plus a constant:
$$\mathrm{loss}_{MLE}(\theta) = -\frac{1}{2}\left(y-A \theta\right)^T \Sigma^{-1} \left(y-A \theta\right) + \mathrm{const}.$$
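To make the formula concrete, here is a minimal Julia sketch (the toy data and variable names are made up for illustration) that recovers known line coefficients from noise-free measurements:

```julia
using LinearAlgebra

# Weighted least-squares fit of a line y = θ₁ + θ₂ x (hypothetical toy data).
x = [0.0, 1.0, 2.0, 3.0]
A = hcat(ones(length(x)), x)           # design matrix with columns [1, x]
θ_true = [0.5, 2.0]
y_obs = A * θ_true                     # noise-free data, so the fit should recover θ_true
Σ = Diagonal(fill(0.05^2, length(x)))  # independent errors with σ = 0.05

# θ̂_MLE = (Aᵀ Σ⁻¹ A)⁻¹ (Aᵀ Σ⁻¹ y_obs); solve the linear system rather than
# forming the inverse explicitly.
θ_hat = (A' * (Σ \ A)) \ (A' * (Σ \ y_obs))  # ≈ [0.5, 2.0]
```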
Penalized Regression
Start with the standard loss function.
Add a penalty term that penalizes coefficients when they do something we don't like (e.g., grow large in magnitude, or have too many non-zero entries).
Some very common forms of regularization:
L2 penalty: "Ridge regression"
L1 penalty: "Lasso"
Both L1 and L2 penalties: "Elastic Net"
L2 Regularization
$$\mathrm{loss}_{L2,\lambda}(\theta) = -\frac{1}{2}\left(y-A \theta\right)^T \Sigma^{-1} \left(y-A \theta\right) -\frac{\lambda}{2} \sum_i \theta_i^2 + \mathrm{const}$$
With some algebra, one can show that
$$\hat{\theta}_{L2} = (A^{T} \Sigma^{-1} A + \lambda I)^{-1} (A^{T} \Sigma^{-1} \ y_\mathrm{obs} ),$$
i.e., there is still an analytic solution for any regularization coefficient \(\lambda\).
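A direct implementation of this closed form might look like the sketch below (illustrative only; the notebook's actual fit_L2 takes a log₁₀λ keyword and may be implemented differently):

```julia
using LinearAlgebra

# Ridge estimate: θ̂_L2 = (Aᵀ Σ⁻¹ A + λI)⁻¹ (Aᵀ Σ⁻¹ y); λ = 0 recovers the MLE.
function ridge_fit(A, Σ, y, λ)
    p = size(A, 2)
    (A' * (Σ \ A) + λ * I(p)) \ (A' * (Σ \ y))
end

A = [1.0 0.0; 0.0 1.0; 1.0 1.0]
Σ = Diagonal(ones(3))
y = [1.0, 1.0, 2.0]
θ_mle   = ridge_fit(A, Σ, y, 0.0)   # [1.0, 1.0]
θ_ridge = ridge_fit(A, Σ, y, 10.0)  # shrunk toward zero
```

Note that the only change from the unregularized solver is adding \(\lambda I\) to the normal-equations matrix, which also makes the system better conditioned.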
L1 Regularization
$$\mathrm{loss}_{L1,\lambda}(\theta) = -\frac{1}{2}\left(y-A \theta\right)^T \Sigma^{-1} \left(y-A \theta\right) -\frac{\lambda}{2} \sum_i \left|\theta_i\right| + \mathrm{const}$$
Note that here \(\hat{\theta}_{L1,1} = 0\), i.e., we can find fits that have fewer non-zero parameters!
However, if we had chosen a smaller value of \(\lambda\), then \(\hat{\theta}_{L1,1} \neq 0\).
As we increase \(\lambda\), we find estimates \(\hat{\theta}_{L1}\) with fewer and fewer non-zero values.
Unfortunately, there is no analytic solution for \(\hat{\theta}_{L1}\). Instead, there are iterative algorithms for estimating \(\hat{\theta}_{L1}\).
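One common iterative approach is ISTA (iterative soft-thresholding); below is a minimal sketch assuming \(\Sigma = I\) (the notebook's fit_L1 may use a different algorithm and parameterization):

```julia
using LinearAlgebra

# Soft-thresholding operator: the proximal map of the scaled L1 penalty.
soft(z, t) = sign(z) * max(abs(z) - t, 0.0)

# Minimize ½‖y − Aθ‖² + λ‖θ‖₁ by gradient steps followed by soft-thresholding.
function lasso_ista(A, y, λ; iters = 5000)
    θ = zeros(size(A, 2))
    t = 1 / opnorm(A)^2              # step size 1/L with L = ‖A‖₂²
    for _ in 1:iters
        g = A' * (A * θ - y)         # gradient of the quadratic term
        θ = soft.(θ - t * g, t * λ)
    end
    θ
end

# With an orthonormal design matrix the solution soft-thresholds y directly.
θ_hat = lasso_ista([1.0 0.0; 0.0 1.0], [3.0, 0.5], 1.0)  # [2.0, 0.0]
```

Note how the second coefficient is driven exactly to zero; this is how the L1 penalty produces sparse fits.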
Bayesian Interpretation
In a Bayesian setting, we would want to work with the posterior distribution for \(\theta\). If we assume a uniform prior for \(\theta\), then the posterior distribution becomes \(\theta \sim \mathrm{Normal}(\hat{\theta}_{MLE}, \Sigma_{\theta_{MLE}})\), where \(\Sigma_{\theta_{MLE}} = (A^{T} \Sigma^{-1} A)^{-1}\).
Since
$$\mathrm{Posterior} \propto \mathrm{Prior} \times \mathrm{Likelihood},$$
we can take the log of both sides to get
$$\ln(\mathrm{Posterior}) = \ln(\mathrm{Prior}) + \ln(\mathrm{Likelihood}).$$
If we replace the uniform prior for \(\theta\) with a Gaussian prior, \(\theta \sim \mathrm{Normal}(\theta_0, \Sigma_{pr})\), then the log posterior becomes
$$\mathrm{loss}_{MAP}(\theta) = -\frac{1}{2}\left(y-A \theta\right)^T \Sigma^{-1} \left(y-A \theta\right) -\frac{1}{2}\left(\theta-\theta_0\right)^T \Sigma_{pr}^{-1} \left(\theta-\theta_0\right) + \mathrm{const}.$$
If we set \(\theta_0 = 0\) and \(\Sigma_{pr}^{-1} = \lambda I\), then this reduces to
$$\mathrm{loss}_{MAP}(\theta) = -\frac{1}{2}\left(y-A \theta\right)^T \Sigma^{-1} \left(y-A \theta\right) -\frac{\lambda}{2}\left\|\theta\right\|^2 + \mathrm{const}$$
Therefore, the best-fit value of \(\theta\) with an L2 regularization term is equivalent to the mean of the posterior distribution for \(\theta\) under a Gaussian prior with zero mean and inverse covariance matrix \(\lambda I\).
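A quick numerical check of this equivalence on made-up data (the helper name map_fit is an assumption; here \(\Sigma = \sigma^2 I\)):

```julia
using LinearAlgebra

# Maximizer of loss_MAP for a general Gaussian prior Normal(θ₀, Σ_pr).
map_fit(A, Σinv, y, θ0, Σpr_inv) =
    (A' * Σinv * A + Σpr_inv) \ (A' * Σinv * y + Σpr_inv * θ0)

A = [1.0 0.0; 1.0 1.0; 1.0 2.0]
y = [1.1, 2.0, 2.9]
σ², λ = 0.01, 3.0
Σinv = I(3) / σ²

θ_map   = map_fit(A, Σinv, y, zeros(2), λ * I(2))       # θ₀ = 0, Σ_pr⁻¹ = λI
θ_ridge = (A' * Σinv * A + λ * I(2)) \ (A' * Σinv * y)  # L2 formula from above
θ_map ≈ θ_ridge  # true
```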
Comparison of Regularization methods
| No Regularization | L2 | L1 |
|---|---|---|
| \(\ell(\theta)\) | \(\ell(\theta) + \lambda \sum_i \theta_i^2\) | \(\ell(\theta) + \lambda \sum_i \left|\theta_i\right|\) |
| Not sparse | Not sparse | Sparse |
| Unique solution | Unique solution | Uniqueness not guaranteed |
| Closed-form solution | Closed-form solution | Iterative algorithm |
| Sensitive to outliers | Sensitive to outliers | Robust to outliers |
| Bayesian interpretation (uniform prior) | Bayesian interpretation (Gaussian prior) | Bayesian interpretation (Laplace prior) |
Example Application to Fitting a Spectrum
Function to be approximated
f(x) = x - exp(-(x^2)/0.0025);
begin
# Generate the (x, y) data
Random.seed!(42)
num_fit = 1000 # Number of points used to create full model
max_order = 120 # Maximum polynomial order allowed
end;
begin
# Compute function to be approximated on dense grid
x_all = range(-0.5, stop=0.5, length=num_fit)
y_all = f.(x_all)
# Create a sparse polynomial to be approximated
A_all = poly_design_matrix(x_all, max_order)
w_true = fit_L1(A_all, y_all, logλ=-1.0, tol=1e-8)
end;
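The definition of poly_design_matrix lives in the hidden "Functions Used" section; a plausible stdlib-only sketch is below (using the Bonnet recurrence in place of the LegendrePolynomials.jl calls, and assuming x is rescaled from [-0.5, 0.5] to [-1, 1] — both are assumptions about the actual implementation):

```julia
# Column j holds the Legendre polynomial of degree j − 1 evaluated at 2x.
function poly_design_matrix_sketch(x, n_cols)
    A = ones(length(x), n_cols)
    t = 2 .* x                 # rescale assumed domain [-0.5, 0.5] to [-1, 1]
    n_cols >= 2 && (A[:, 2] .= t)
    for l in 2:n_cols-1        # Bonnet recurrence: l Pₗ = (2l−1) t Pₗ₋₁ − (l−1) Pₗ₋₂
        A[:, l+1] .= ((2l - 1) .* t .* A[:, l] .- (l - 1) .* A[:, l-1]) ./ l
    end
    A
end
```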
Visualize basis functions
[Interactive plot of the Legendre polynomial basis functions; a slider selects which degree to display]
Generate Simulated Data
begin # Parameters for simulated observations
num_obs = 50
σ_obs = 0.05
end;
begin
regen_data  # reactive dependency: rerun this cell whenever regen_data changes
x_obs = range(-0.5, stop=0.5, length=num_obs)
A_obs = poly_design_matrix(x_obs, max_order)
y_obs = A_obs * w_true
y_obs += σ_obs .* randn(length(x_obs))
end;
Split observations into training and test data
(df_train = 25×2 DataFrame
Row │ x y
│ Float64 Float64
─────┼────────────────────────
1 │ -0.0918367 -0.0145368
2 │ -0.336735 -0.331575
3 │ -0.0102041 -0.969364
4 │ 0.418367 0.427908
5 │ 0.0918367 0.0736437
⋮ │ ⋮ ⋮
22 │ -0.255102 -0.300075
23 │ 0.0714286 -0.0452896
24 │ -0.0306122 -0.586727
25 │ 0.193878 0.301036
16 rows omitted, df_test = 25×2 DataFrame
Row │ x y
│ Float64 Float64
─────┼───────────────────────
1 │ -0.0510204 -0.419848
2 │ -0.234694 -0.222782
3 │ 0.153061 0.157991
4 │ 0.214286 0.245105
5 │ -0.377551 -0.395318
⋮ │ ⋮ ⋮
22 │ -0.193878 -0.160878
23 │ -0.397959 -0.433741
24 │ 0.479592 0.446315
25 │ -0.357143 -0.304554
16 rows omitted)
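One plausible way to produce the 25/25 split shown above (the notebook lists MLUtils among its dependencies, so it likely uses something like a shuffled splitobs; this stdlib-only version, returning named tuples rather than DataFrames, is an assumption):

```julia
using Random

# Randomly permute the indices, then take the first frac_train for training.
function train_test_split(x, y; frac_train = 0.5, rng = Random.default_rng())
    perm = randperm(rng, length(x))
    n_train = round(Int, frac_train * length(x))
    tr, te = perm[1:n_train], perm[n_train+1:end]
    ((x = x[tr], y = y[tr]), (x = x[te], y = y[te]))
end

df_train, df_test = train_test_split(collect(range(-0.5, 0.5, length = 50)), randn(50))
```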
Regularly spaced data:
Fit models with & without regularization
[Interactive plot: simulated data with the fitted models; checkboxes toggle the No Regularization (all terms), No Regularization (truncated), L2, and L1 fits]
(num_obs_train = 25, num_obs_test = 25)
| Regularization | χ²_train | χ²_test | num_coeffs |
|---|---|---|---|
| None | 2.63392e-27 | 50638.6 | 120 |
| None (truncated) | 9.88413e-24 | 1.64656e6 | 100 |
| L2 (log₁₀λ = -5.0) | 0.0193778 | 15352.0 | 100 |
| L1 (log₁₀λ = -5.0) | 0.012138 | 18562.6 | 30 |
| L1 (refit; log₁₀λ = -5.0) | 1.91157e-26 | 22477.6 | 30 |
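For reference, the χ² columns above are presumably computed as the sum of squared, σ-normalized residuals of each model's predictions on the training or test set; the helper name chisq and the use of a scalar σ are assumptions:

```julia
# χ² of predictions A*w against measurements y with homoscedastic error σ.
chisq(A, w, y, σ) = sum(((y .- A * w) ./ σ) .^ 2)

# A perfect prediction gives χ² = 0; each 1σ residual contributes 1.
chisq([1.0 0.0; 0.0 1.0], [1.0, 1.0], [1.1, 0.9], 0.1)  # ≈ 2.0
```

e.g., the χ²_train column would be chisq(A_train, w, df_train.y, σ_obs) for each fitted coefficient vector w.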
Setup & Functions Used
Perform Fitting
begin
A_train = poly_design_matrix(df_train.x, max_order);
A_test = poly_design_matrix(df_test.x, max_order);
end;
begin
w_no_reg_full = fit_no_reg(A_train, df_train.y)
pred_no_reg_full = A_all * w_no_reg_full
end;
begin
A_all_trunc = view(A_all,:,1:max_order_trunc)
A_train_trunc = view(A_train,:,1:max_order_trunc)
A_test_trunc = view(A_test,:,1:max_order_trunc)
w_no_reg_trunc = fit_no_reg(A_train_trunc, df_train.y)
pred_no_reg_trunc = A_all_trunc * w_no_reg_trunc
end;
begin
w_L2 = fit_L2(A_train_trunc, df_train.y; logλ=logλL2)
pred_L2 = A_all_trunc * w_L2
end;
begin
w_L1 = fit_L1(A_train_trunc, df_train.y; logλ=logλL1)
pred_L1 = A_all_trunc * w_L1
end;
begin
idx_w_L1_keep = abs.(w_L1) .> 1e-6
A_L1_refit_train = view(A_train_trunc,:,idx_w_L1_keep)
w_L1_refit = fit_no_reg(A_L1_refit_train, df_train.y)
A_L1_refit_test = view(A_test_trunc,:,idx_w_L1_keep)
A_L1_all_refit_all = view(A_all_trunc,:,idx_w_L1_keep)
pred_L1_refit = A_L1_all_refit_all * w_L1_refit
end;
Functions Used
Built with Julia 1.11.5 and
DataFrames 1.7.0
LaTeXStrings 1.4.0
LegendrePolynomials 0.4.5
LinearAlgebra 1.11.0
MLUtils 0.4.7
Plots 1.40.9
PlutoUI 0.7.61
Random 1.11.0
SpecialFunctions 2.5.0
To run this tutorial locally, download this file and open it with Pluto.jl.