Data Science Applications to Astronomy

Week 8: Model Building III:

Regularization

Announcements/Logistics

Penn State AI Week

  • Events: April 14-17, 2025

  • Student poster submission deadline: Monday, March 31, 2025

MSEEQ

What's most useful?

  • Labs (Unanimous!)

  • Mondays (mostly positive; 1 found it hard to focus when the relevance wasn't clear)

  • Friday Q&A got mixed reviews (1 helpful, 1 didn't see relevance)

  • No one mentioned projects... Hopefully that will change by the end of the semester.

What's been difficult? / Suggestions

  • Feels like asking arbitrary questions for Fridays (1 response) →

    • I get it, but...

    • that's ok! There are still metacognitive benefits.

  • Reasons for learning things sometimes unclear (2 responses) →

    • I can provide more context for Monday discussions.

  • Didn't like labs that required a specific answer (1 response) →

    • If the interactive feedback is more annoying than helpful, then ignore it.

    • I haven't added interactive feedback to upcoming labs.

  • Want to write your own code (2 responses) →

    • Labs aim to provide scaffolding to ease students into independent coding.

    • Enjoy the project 😄

  • More independence in labs (1 response) →

    • In other classes, I've found that without lots of structure, it's really easy for students to get distracted by coding details.

    • Increasing independence in Lab 8 (esp ex2) & Lab 9.

  • More class time for labs/project coding (1 response) →

    • I'm on the fence.

    • How often would you like a day to work on project in class?

What practices have you personally adopted that have improved your learning?

  • Work to go beyond reading labs to understand code.

  • Incorporate lessons from labs into research projects.

  • Borrow ideas/code from labs for class project.

Q&A

Regularization

Question:

What exactly is the model doing to “penalize” the values it chooses to penalize?

$$\mathrm{loss}_{L2,\lambda}(\theta) = -\frac{1}{2}\left(y-A \theta\right)^T \Sigma^{-1} \left(y-A \theta\right) -\frac{\lambda}{2} \sum_i \theta_i^2 + \mathrm{const}$$

$$\mathrm{loss}_{L1,\lambda}(\theta) = -\frac{1}{2}\left(y-A \theta\right)^T \Sigma^{-1} \left(y-A \theta\right) -\frac{\lambda}{2} \sum_i \left|\theta_i\right| + \mathrm{const}$$
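To make the penalty concrete, here is a minimal Julia sketch of the two losses above, using hypothetical example data and assuming unit measurement uncertainties (so \(\Sigma^{-1}\) is the identity):

```julia
# Hypothetical example: 3 observations, 2 model parameters.
A = [1.0 0.0; 0.0 1.0; 1.0 1.0]   # design matrix
y = [1.0, 2.0, 3.0]               # observations
θ = [1.0, 2.0]                    # trial parameter values
λ = 0.5                           # regularization strength

chi2(θ) = sum(abs2, y .- A*θ)                       # (y-Aθ)ᵀ Σ⁻¹ (y-Aθ) with Σ = I
loss_L2(θ, λ) = -0.5*chi2(θ) - 0.5*λ*sum(abs2, θ)   # quadratic (ridge) penalty
loss_L1(θ, λ) = -0.5*chi2(θ) - 0.5*λ*sum(abs, θ)    # absolute-value (lasso) penalty

loss_L2(θ, λ), loss_L1(θ, λ)   # → (-1.25, -0.75)
```

At this θ the fit is exact (χ² = 0), so the two losses differ only through the penalty term: L2 charges each coefficient quadratically, L1 proportionally to its absolute value. That subtraction is all the "penalizing" is: parameter values farther from zero reduce the objective being maximized.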

Question:

In L2 or L1 regularization, does the log10(lambda) specify how small the permitted error is?

How to choose penalty term?

Question:

Is one regularization better than the other? ...

Is there a certain type of data that would be better at being regularized than others?

Question:

How do you determine when to use L1, L2, or a combination of the two regularizations?

Question:

When working with big datasets like galaxy surveys or star catalogs, how do L2 and L1 regularization compare when it comes to finding important features or cutting down on noise in models used for things like star classification or galaxy shape recognition?

Question:

For L1 and L2 regularization, are there specific situations when one is more common or effective? Special cases where that changes?

Things to consider:

  • What am I trying to prevent?

  • Do I expect that most of my model parameters should be zero?

  • Do I expect that some of my model parameters should be zero?

Often just try and see
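"Try and see" can be made systematic: scan λ over a grid and keep whichever value minimizes the error on held-out validation data. A minimal sketch with hypothetical simulated data and a closed-form ridge (L2) fit:

```julia
using LinearAlgebra, Random
Random.seed!(42)

# Hypothetical data: 40 observations, 5 parameters, most of them zero.
n, p = 40, 5
A = randn(n, p)
θ_true = [3.0, 0.0, 0.0, -2.0, 0.0]
y = A*θ_true .+ 0.5 .* randn(n)

train, valid = 1:30, 31:40                     # simple train/validation split
ridge_fit(A, y, λ) = (A'A + λ*I) \ (A'y)       # closed-form L2 solution
val_err(λ) = sum(abs2, y[valid] .- A[valid, :]*ridge_fit(A[train, :], y[train], λ))

λ_grid = 10.0 .^ (-3:0.5:3)                    # log-spaced grid (hence "log10(lambda)")
λ_best = λ_grid[argmin(val_err.(λ_grid))]
```

The log-spaced grid is why λ is usually quoted as log10(λ): the useful values span orders of magnitude, and the validation curve, not λ itself, tells you how much error is tolerated.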

Choosing Training/Validation/Test set sizes

Question:

Why not always choose 50% training and 50% test data with every other point sampled?
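Two reasons, sketched below with hypothetical data: a fixed 50% test set leaves half the data unused for training, and deterministic every-other-point sampling inherits whatever ordering the catalog happens to have, so any sorting or periodicity leaks into the split. With a sorted catalog, the two halves become nearly identical, every test point sits between two training neighbors, and the test error never probes extrapolation:

```julia
using Random, Statistics
Random.seed!(1)

x = sort(randn(10_000))            # hypothetical catalog sorted by some quantity
train = x[1:2:end]                 # every other point → training set
test  = x[2:2:end]                 # remaining points → test set

# The two halves are statistically interchangeable:
abs(mean(train) - mean(test))      # tiny, ≪ the scatter in x
```

A shuffled split with a smaller test fraction (and a validation set for tuning) is the usual compromise.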

Project questions

Question:

Is there a way to change a query using a slider? For example, maybe the user wants stars of M>M_sun. Do you have to ingest all data then fit that constraint? Or can you do it in the query?

@bind max_mag Slider(6:0.1:20; default = 15.0)
@bind max_ang_sep Slider(0:0.1:5; default = 1.0)
adql_query_sloppy = 
"""SELECT *, DISTANCE(81.28, -69.78, ra, dec) AS ang_sep
FROM gaiadr3.gaia_source
WHERE DISTANCE(81.28, -69.78, ra, dec) < $(max_ang_sep)/60.
AND phot_g_mean_mag < $(max_mag)""";
simulate_query(adql_query_sloppy)
"SELECT *, DISTANCE(81.28, -69.78, ra, dec) AS ang_sep\nFROM gaiadr3.gaia_source\nWHERE DISTANCE(81.28, -69.78, ra, dec) < 1.0/60.\nAND phot_g_mean_mag < 15.0"
Warning:

What's the problem with that approach?

Adding submit

@bind user_input confirm(PlutoUI.combine() do Child
md"""
Max Magnitude: $(Child("max_mag",Slider(6:0.1:20; default = 15.0, show_value=true)))
$nbsp 
$nbsp 
Max Angular Separation: $(Child("max_ang_sep",Slider(0:0.1:5; default = 1.0, show_value=true)))
"""
end)

Max Magnitude: 15.0     Max Angular Separation: 1.0

typeof(user_input)
@NamedTuple{max_mag::Float64, max_ang_sep::Float64}
user_input.max_mag
15.0
adql_query_good = 
"""SELECT *, DISTANCE(81.28, -69.78, ra, dec) AS ang_sep
FROM gaiadr3.gaia_source
WHERE DISTANCE(81.28, -69.78, ra, dec) < $(user_input.max_ang_sep)/60.
AND phot_g_mean_mag < $(user_input.max_mag)"""
"SELECT *, DISTANCE(81.28, -69.78, ra, dec) AS ang_sep\nFROM gaiadr3.gaia_source\nWHERE DISTANCE(81.28, -69.78, ra, dec) < 1.0/60.\nAND phot_g_mean_mag < 15.0"
simulate_query(adql_query_good)
"SELECT *, DISTANCE(81.28, -69.78, ra, dec) AS ang_sep\nFROM gaiadr3.gaia_source\nWHERE DISTANCE(81.28, -69.78, ra, dec) < 1.0/60.\nAND phot_g_mean_mag < 15.0"
simulate_query (generic function with 1 method)

Context for Week 9: Classification

Why would an astronomer want to classify things?

Binary classification

  • Input Data: \(x_i\)

  • Label for data: \(Y_i\) (0 or 1)

  • Predicted category: \(\hat{Y}_i\)

  • Object id: \(i = 1 \ldots N_{\mathrm{obj}}\)

What is suboptimal about the following loss function?

$$\mathrm{loss}_{\mathrm{class}}(\theta) = \sum_{i=1}^{N_{\mathrm{targ}}} \left[1-\delta(Y_i,\hat{Y}_i(\theta))\right]$$
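The problem is that this 0-1 loss is piecewise constant in θ: a small parameter change usually flips no predicted label, so the gradient is zero almost everywhere and gradient-based optimizers have nothing to follow. A hypothetical 1-D threshold classifier illustrates this:

```julia
# Hypothetical data: classify points by whether x exceeds a threshold θ.
x = [0.1, 0.4, 0.6, 0.9]
Y = [0, 0, 1, 1]
predict(θ) = Int.(x .> θ)                # ŷ_i(θ) for the threshold classifier
loss01(θ) = sum(Y .!= predict(θ))        # count of misclassifications

loss01(0.50), loss01(0.51)   # identical: moving θ slightly gives no gradient signal
```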

Relaxing the outputs

  • General: \(\hat{y}_i(\beta) = f(x_i; \beta)\), with \(\hat{y}_i \in (0,1)\) interpreted as \(\mathrm{Pr}(Y_i = 1)\)

  • Logistic Regression: \(\hat{y}_i(\beta) = f(\beta \cdot x_i)\), where \(f(z) = 1/(1+e^{-z})\) is the logistic function

Logistic Regression Likelihood

$$L(\beta) = \prod_{i:\; Y_i=1} \hat{y}_i(\beta) \prod_{i:\; Y_i=0} (1-\hat{y}_i(\beta))$$

$$\mathrm{loss}(\beta) = \sum_{i=1}^{N_{obj}} \left[ Y_i \ln(\hat{y}_i(\beta)) + (1-Y_i) \ln(1-\hat{y}_i(\beta)) \right]$$
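A minimal Julia sketch of this log-likelihood for hypothetical 1-D data, where each row of X is \([1, x_i]\) so that β holds an intercept and a slope:

```julia
σ(z) = inv(1 + exp(-z))                       # logistic function
X = [1.0 -2.0; 1.0 -1.0; 1.0 1.0; 1.0 2.0]    # rows: [1, x_i] (intercept + feature)
Y = [0, 0, 1, 1]                              # labels: 1 for positive x here

ŷ(β) = σ.(X*β)                                # predicted probabilities Pr(Y_i = 1)
loglike(β) = (p = ŷ(β); sum(Y .* log.(p) .+ (1 .- Y) .* log.(1 .- p)))

# A positive slope matches the labels, so its log-likelihood is higher:
loglike([0.0, 1.0]) > loglike([0.0, -1.0])    # → true
```

Maximizing `loglike` over β (e.g. with a gradient-based optimizer) fits the model; the regularization penalties from earlier can be subtracted here exactly as in the linear case.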

Common activation functions

f(x) = inv(1+exp(-x))                                  # logistic (sigmoid)
f (generic function with 1 method)
relu(x) = x > zero(x) ? x : zero(x)                    # rectified linear unit
relu (generic function with 1 method)
leaky_relu(x; α::Real = 1e-3) = x > zero(x) ? x : α*x  # small slope α for x ≤ 0
leaky_relu (generic function with 1 method)

[Plot: logistic, tanh, erf, ReLU, and leaky ReLU activation functions]

Multi-category classification

  • Input Data: \(x_i\)

  • Label for data: \(Y_i\) (integer)

  • Object id: \(i = 1 \ldots N_{\mathrm{obj}}\)

  • Category ids: \(k = 1 \ldots K\)

$$\mathrm{Pr}(Y_i=k) = \frac{e^{f(i,k)}}{1+\sum_{j=1}^K e^{f(i,j)}}$$

  • Predicted category: \(\hat{Y}_i = k\) such that \(\mathrm{Pr}(Y_i=k) > \mathrm{Pr}(Y_i=k')\) for all \(k' \neq k\)

Generalized linear model for multi-category classification:

$$f(i,k) = \beta_k \cdot x_i$$
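A sketch of the probability model above for a single object, with hypothetical scores. The "1 +" in the denominator plays the role of a reference category (k = 0, with \(f \equiv 0\)), which is why the K explicit probabilities sum to less than one on their own:

```julia
f = [0.5, 1.5]                      # hypothetical f(i, k) for one object, k = 1..K
denom = 1 + sum(exp, f)             # 1 accounts for the reference category k = 0
Pr = exp.(f) ./ denom               # Pr(Y_i = k) for k = 1..K
Pr0 = 1 / denom                     # Pr(Y_i = 0), the reference category
Ŷ = argmax(Pr)                      # predicted category among k ≥ 1

Pr0 + sum(Pr)                       # → 1.0 (probabilities are normalized)
```

With \(f(i,k) = \beta_k \cdot x_i\), each category gets its own coefficient vector \(\beta_k\), and the binary logistic regression above is recovered when K = 1.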

Setup

Built with Julia 1.11.5 and

Plots 1.40.9
PlutoTeachingTools 0.3.1
PlutoUI 0.7.61
SpecialFunctions 2.5.0

To run this tutorial locally, download this file and open it with Pluto.jl.