Data Science Applications to Astronomy

Week 8: Model Building III:

Regularization

Recall the model for general least squares linear regression \(y = A \theta + \epsilon,\) where:

  • the vector of measurement values is \(y\) (written \(y_{\mathrm{obs}}\) when referring to the observed data),

  • the design matrix is \(A\),

  • the coefficients to be inferred are \(\theta\),

  • the measurement errors \(\epsilon \sim \mathrm{Normal}(0,\Sigma)\), and

  • the covariance matrix for measurement errors is \(\Sigma\) (so \(\Sigma^{-1}\) is its inverse).

We can find the maximum likelihood estimator (also known as the best-fit for general least squares) analytically via

$$\hat{\theta}_{MLE} = (A^{T} \Sigma^{-1} A)^{-1} (A^{T} \Sigma^{-1} \ y_\mathrm{obs} ),$$

or by maximizing an objective function (referred to below as a "loss", with the sign convention that larger is better) equal to the log likelihood plus a constant:

$$\mathrm{loss}_{MLE}(\theta) = -\frac{1}{2}\left(y-A \theta\right)^T \Sigma^{-1} \left(y-A \theta\right) + \mathrm{const}.$$
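As a concrete sketch (the function and variable names here are illustrative, not the notebook's), the closed-form MLE can be computed in a few lines of Julia:

```julia
using LinearAlgebra

# Sketch: generalized least-squares MLE, θ̂ = (AᵀΣ⁻¹A)⁻¹ AᵀΣ⁻¹ y.
# Using `\` solves the normal equations without forming an explicit inverse.
function fit_mle(A, y_obs, Σ)
    Σinv = inv(Σ)
    return (A' * Σinv * A) \ (A' * Σinv * y_obs)
end

# Example: recover a line y = θ₁ + θ₂ x from noiseless data.
x = range(0.0, 1.0, length=5)
A = [ones(5) collect(x)]
θ_true = [1.0, 2.0]
y_obs = A * θ_true
Σ = Matrix(0.01I, 5, 5)    # homoscedastic measurement errors
θ̂ = fit_mle(A, y_obs, Σ)   # ≈ [1.0, 2.0]
```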

Penalized Regression

  • Start with standard loss function

  • Add a penalty term that disfavors coefficient values with properties we don't want (e.g., large magnitudes).

  • Some very common forms of regularization:

    • L2 penalty: "Ridge regression"

    • L1 penalty: "Lasso"

    • Both L1 and L2 penalties: "Elastic Net"

L2 Regularization

$$\mathrm{loss}_{L2,\lambda}(\theta) = -\frac{1}{2}\left(y-A \theta\right)^T \Sigma^{-1} \left(y-A \theta\right) -\frac{\lambda}{2} \sum_i \theta_i^2 + \mathrm{const}$$

With some algebra, one can show that

$$\hat{\theta}_{L2} = (A^{T} \Sigma^{-1} A + \lambda I)^{-1} (A^{T} \Sigma^{-1} \ y_\mathrm{obs} ),$$

i.e., there is still an analytic solution for any regularization coefficient \(\lambda\).
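A hedged sketch of a ridge solver implementing this closed form (the notebook's own `fit_L2` is defined in a collapsed cell; this version assumes \(\Sigma = I\) and parameterizes the penalty by log₁₀λ, as the notebook's calls suggest):

```julia
using LinearAlgebra

# Ridge estimate θ̂ = (AᵀΣ⁻¹A + λI)⁻¹ AᵀΣ⁻¹ y, here with Σ = I.
function ridge_fit(A, y; logλ=-5.0)
    λ = 10.0^logλ
    return (A' * A + λ * I) \ (A' * y)
end

A = [1.0 0.0; 0.0 1.0; 1.0 1.0]
y = [1.0, 2.0, 3.0]
ridge_fit(A, y, logλ=6.0)   # a huge λ shrinks all coefficients toward zero
```

As λ → 0 this recovers the ordinary least-squares solution; as λ grows, the estimate shrinks toward the zero vector.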

L1 Regularization

$$\mathrm{loss}_{L1,\lambda}(\theta) = -\frac{1}{2}\left(y-A \theta\right)^T \Sigma^{-1} \left(y-A \theta\right) -\frac{\lambda}{2} \sum_i \left|\theta_i\right| + \mathrm{const}$$

Note that for a sufficiently large regularization coefficient, the L1 penalty drives some (or even all) components of \(\hat{\theta}_{L1}\) exactly to zero; i.e., we can find fits that have fewer non-zero parameters!

Had we chosen a smaller value of \(\lambda\), more of those components would have remained non-zero.

As we increase λ, we find estimates \(\hat{\theta}_{L1}\) with fewer and fewer non-zero values.

Unfortunately, there is no analytic solution for \(\hat{\theta}_{L1}\). Instead, iterative algorithms are used to estimate \(\hat{\theta}_{L1}\).
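One such iterative method is ISTA (iterative soft-thresholding); the sketch below is a generic implementation for the simplified case \(\Sigma = I\), not the notebook's `fit_L1`:

```julia
using LinearAlgebra

# Soft-thresholding is the proximal operator of the L1 penalty.
soft_threshold(x, t) = sign(x) * max(abs(x) - t, zero(x))

function lasso_ista(A, y; λ=0.1, iters=5_000)
    L = opnorm(A)^2                      # Lipschitz constant of the gradient
    θ = zeros(size(A, 2))
    for _ in 1:iters
        g = A' * (A * θ - y)             # gradient of ½‖y - Aθ‖²
        θ = soft_threshold.(θ .- g ./ L, λ / L)
    end
    return θ
end
```

With a large enough λ, some entries of the returned θ are exactly zero, which is the sparsity property discussed above.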

Bayesian Interpretation

In a Bayesian setting, we would want to work with the posterior distribution for \(\theta\). If we assume a uniform prior for \(\theta\), then the posterior distribution becomes \(\theta \sim \mathrm{Normal}(\hat{\theta}_{MLE}, \Sigma_{\theta_{MLE}})\), where \(\Sigma_{\theta_{MLE}} = (A^{T} \Sigma^{-1} A)^{-1}\).

Since

$$\mathrm{Posterior} \propto \mathrm{Prior} \times \mathrm{Likelihood},$$

we can take the log of both sides to get

$$\ln(\mathrm{Posterior}) = \ln(\mathrm{Prior}) + \ln(\mathrm{Likelihood}).$$

If we replace the uniform prior for \(\theta\) with a Gaussian prior, \(\theta \sim \mathrm{Normal}(\theta_0, \Sigma_{pr})\), then the log posterior becomes

$$\mathrm{loss}_{MAP}(\theta) = -\frac{1}{2}\left(y-A \theta\right)^T \Sigma^{-1} \left(y-A \theta\right) -\frac{1}{2}\left(\theta-\theta_0\right)^T \Sigma_{pr}^{-1} \left(\theta-\theta_0\right) + \mathrm{const}.$$

If we set \(\theta_0 = 0\) and \(\Sigma_{pr}^{-1} = \lambda I\), then this reduces to

$$\mathrm{loss}_{MAP}(\theta) = -\frac{1}{2}\left(y-A \theta\right)^T \Sigma^{-1} \left(y-A \theta\right) -\frac{\lambda}{2}\left|\theta\right|^2 + \mathrm{const}$$

Therefore, the best-fit value of \(\theta\) with an L2 regularization term is equivalent to the mean of the posterior distribution for \(\theta\) under a Gaussian prior with zero mean and inverse covariance \(\lambda I\) (a scaled identity matrix).
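A quick numerical check of this equivalence (the function name and the matrices below are illustrative):

```julia
using LinearAlgebra

# General MAP estimate for a Gaussian prior θ ~ Normal(θ₀, Σpr),
# following the loss_MAP expression above.
map_fit(A, y, Σinv, θ0, Σpr_inv) =
    (A' * Σinv * A + Σpr_inv) \ (A' * Σinv * y + Σpr_inv * θ0)

A = [1.0 0.5; 0.2 1.0; 0.4 0.3]
y = [1.0, 2.0, 0.5]
Σinv = Matrix(1.0I, 3, 3)
λ = 0.1

# With θ₀ = 0 and Σpr⁻¹ = λI, the MAP estimate is the ridge estimate.
θ_map   = map_fit(A, y, Σinv, zeros(2), λ * Matrix(1.0I, 2, 2))
θ_ridge = (A' * Σinv * A + λ * I) \ (A' * Σinv * y)
θ_map ≈ θ_ridge   # true
```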

Comparison of Regularization methods

| | No Regularization | L2 | L1 |
|---|---|---|---|
| Loss | \(\ell(\theta)\) | \(\ell(\theta) + \lambda \sum_i \theta_i^2\) | \(\ell(\theta) + \lambda \sum_i \lvert\theta_i\rvert\) |
| Sparsity | Not sparse | Not sparse | Sparse |
| Unique solution | Yes | Yes | Not guaranteed |
| Solution method | Closed form | Closed form | Iterative algorithm |
| Outliers | Sensitive | Sensitive | Robust |
| Bayesian interpretation | Yes | Yes | N/A |

Example Application to Fitting a Spectrum

Function to be approximated

f(x) = x - exp(-x^2 / 0.0025);
begin
    # Generate the (x, y) data
    Random.seed!(42)
    num_fit = 1000    # Number of points used to create full model
    max_order = 120   # Maximum polynomial order allowed
end;
begin  
    # Compute function to be approximated on dense grid
    x_all = range(-0.5, stop=0.5, length=num_fit)
    y_all = f.(x_all) 
    # Create a sparse polynomial to be approximated
    A_all = poly_design_matrix(x_all, max_order)
    w_true = fit_L1(A_all, y_all, logλ=-1.0, tol=1e-8)
end;

Visualize basis functions

[Interactive plot: select which Legendre polynomial basis function to display.]

Generate Simulated Data

begin   # Parameters for simulated observations
    num_obs = 50
    σ_obs = 0.05
end;
begin
    regen_data   # referencing regen_data makes Pluto re-run this cell when it changes
    x_obs = range(-0.5, stop=0.5, length=num_obs)
    A_obs = poly_design_matrix(x_obs, max_order)
    y_obs = A_obs * w_true
    y_obs += σ_obs .* randn(length(x_obs))
end;

Split observations into training and test data

(df_train = 25×2 DataFrame
 Row │ x           y
     │ Float64     Float64
─────┼────────────────────────
   1 │ -0.0918367  -0.0145368
   2 │ -0.336735   -0.331575
   3 │ -0.0102041  -0.969364
   4 │  0.418367    0.427908
   5 │  0.0918367   0.0736437
  ⋮  │     ⋮           ⋮
  22 │ -0.255102   -0.300075
  23 │  0.0714286  -0.0452896
  24 │ -0.0306122  -0.586727
  25 │  0.193878    0.301036
               16 rows omitted, df_test = 25×2 DataFrame
 Row │ x           y
     │ Float64     Float64
─────┼───────────────────────
   1 │ -0.0510204  -0.419848
   2 │ -0.234694   -0.222782
   3 │  0.153061    0.157991
   4 │  0.214286    0.245105
   5 │ -0.377551   -0.395318
  ⋮  │     ⋮           ⋮
  22 │ -0.193878   -0.160878
  23 │ -0.397959   -0.433741
  24 │  0.479592    0.446315
  25 │ -0.357143   -0.304554
              16 rows omitted)
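The split shown above can be produced with MLUtils, which the notebook loads; the sketch below is an assumption about the collapsed helper code (the seed and the stand-in data are illustrative):

```julia
using MLUtils, Random, DataFrames

# Hedged sketch: a shuffled 50/50 train/test split via MLUtils.splitobs.
num_obs = 50
x_obs = collect(range(-0.5, stop=0.5, length=num_obs))
y_obs = x_obs .- exp.(-x_obs .^ 2 ./ 0.0025)    # stand-in for the simulated data
idx = shuffle(Xoshiro(42), 1:num_obs)           # seed is illustrative
idx_train, idx_test = splitobs(idx, at=0.5)
df_train = DataFrame(x=x_obs[idx_train], y=y_obs[idx_train])
df_test  = DataFrame(x=x_obs[idx_test],  y=y_obs[idx_test])
```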

Regularly spaced data:

Fit models with & without regularization

Models fit below (toggled via controls in the interactive notebook):

  • No regularization, all 120 terms

  • No regularization, truncated at order 100

  • L2 with log₁₀λ = -5.0

  • L1 with log₁₀λ = -5.0, plus an unregularized refit restricted to the L1-selected terms

(num_obs_train = 25, num_obs_test = 25)
| | Regularization | χ²_train | χ²_test | num_coeffs |
|---|---|---|---|---|
| 1 | "None" | 2.63392e-27 | 50638.6 | 120 |
| 2 | "None (truncated)" | 9.88413e-24 | 1.64656e6 | 100 |
| 3 | "L2 (log₁₀λ = -5.0)" | 0.0193778 | 15352.0 | 100 |
| 4 | "L1 (log₁₀λ = -5.0)" | 0.0121381 | 8562.6 | 30 |
| 5 | "L1 (refit; log₁₀λ = -5.0)" | 1.91157e-26 | 22477.6 | 30 |
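The χ² values reported above are presumably the usual error-weighted sums of squared residuals; a minimal sketch:

```julia
# χ² = Σᵢ ((yᵢ - ŷᵢ)/σᵢ)², assuming independent errors with common scale σ.
chi2(pred, y, σ) = sum(((y .- pred) ./ σ) .^ 2)

# e.g., chi2(A_train * w_L2, df_train.y, σ_obs) for the L2 row (names from the notebook).
```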

Setup & Functions Used

Perform Fitting

begin
    A_train = poly_design_matrix(df_train.x, max_order)
    A_test = poly_design_matrix(df_test.x, max_order)
end;
begin	
    w_no_reg_full = fit_no_reg(A_train, df_train.y)
    pred_no_reg_full = A_all * w_no_reg_full
end;
begin	
    A_all_trunc = view(A_all,:,1:max_order_trunc)
    A_train_trunc = view(A_train,:,1:max_order_trunc)
    A_test_trunc = view(A_test,:,1:max_order_trunc)
    w_no_reg_trunc = fit_no_reg(A_train_trunc, df_train.y)
    pred_no_reg_trunc = A_all_trunc * w_no_reg_trunc
end;
begin
    w_L2 = fit_L2(A_train_trunc, df_train.y; logλ=logλL2)
    pred_L2 = A_all_trunc * w_L2
end;
begin
    w_L1 = fit_L1(A_train_trunc, df_train.y; logλ=logλL1)
    pred_L1 = A_all_trunc * w_L1
end;
begin
    idx_w_L1_keep = abs.(w_L1) .> 1e-6
    A_L1_refit_train = view(A_train_trunc,:,idx_w_L1_keep)
    w_L1_refit = fit_no_reg(A_L1_refit_train, df_train.y)
    A_L1_refit_test = view(A_test_trunc,:,idx_w_L1_keep)
    A_L1_all_refit_all = view(A_all_trunc,:,idx_w_L1_keep)
    pred_L1_refit = A_L1_all_refit_all * w_L1_refit 
end;

Functions Used
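The notebook's helper functions are defined in collapsed cells. As a hedged sketch, `poly_design_matrix` likely builds a Legendre basis via LegendrePolynomials.jl, with x rescaled from [-0.5, 0.5] to the polynomials' natural domain [-1, 1]; the definition below is an assumption, not a copy of the notebook's:

```julia
using LegendrePolynomials

# Column j+1 holds the Legendre polynomial Pⱼ evaluated at 2x,
# mapping x ∈ [-0.5, 0.5] onto [-1, 1]. (Assumed reconstruction.)
function poly_design_matrix(x, max_order)
    return [Pl(2 * x[i], j) for i in eachindex(x), j in 0:(max_order - 1)]
end
```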

Built with Julia 1.11.5 and

DataFrames 1.7.0
LaTeXStrings 1.4.0
LegendrePolynomials 0.4.5
LinearAlgebra 1.11.0
MLUtils 0.4.7
Plots 1.40.9
PlutoUI 0.7.61
Random 1.11.0
SpecialFunctions 2.5.0

To run this tutorial locally, download this file and open it with Pluto.jl.