Tip

Remember to start planning class projects.

Data Science Applications to Astronomy

Week 4: Model Building II:

Model Assessment

Overview of Workflow for Assessing Models

  1. Validate methods using simulated data

  2. Validate computation

  3. Compare Likelihood to expected distribution

  4. Sensitivity to individual data points

  5. Sensitivity to assumptions (e.g., prior distributions)

  6. Check Predictive distribution

  7. Cross validation

Step 1: Validate methods using simulated data

  • Generate simulated data sets that satisfy all model assumptions.

  • Apply model and computational methods to simulated data

  • Compare results to true parameter values.

Checks to perform based on simulated data:

  • Are estimated model parameters close to true values?

  • Do estimated parameter uncertainties accurately characterize the dispersion from the true values?

  • Is the distribution of parameter estimates centered on their true values?

  • How many iterations are required for algorithms to converge?

  • What can we learn reliably?

  • Are there regions of parameter space that result in low quality inference?
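The workflow above can be sketched end-to-end for a toy linear model. This is an illustrative sketch (the coefficient values, sample size, and noise level are assumptions, not taken from the lab code):

```julia
# Step 1 sketch: simulate data that satisfies all model assumptions,
# fit, and compare the estimates to the known true parameters.
using Random, Statistics

Random.seed!(42)                      # reproducible simulated data
a_true = [1.0, 0.5]                   # true parameters: y = a₀ + a₁ x
n, σ = 100, 0.2                       # sample size and (known) Gaussian noise
x = rand(n)
y = a_true[1] .+ a_true[2] .* x .+ σ .* randn(n)

A = hcat(ones(n), x)                  # design matrix for the linear model
a_hat = A \ y                         # least squares (= MLE for Gaussian errors)

# Check: are the estimates close to the true values?
println("true: ", a_true, "  estimated: ", round.(a_hat, digits=3))
```

Repeating this for many simulated datasets lets you check whether the distribution of estimates is centered on the true values and whether the reported uncertainties match the actual scatter.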

Validating methods on simulated data is a key step
(too often overlooked)

This approach is very useful for diagnosing many of the most common problems:

  • Not enough data

  • Measurement uncertainties too large

  • Spatial or temporal sampling leads to degeneracies

  • Model has too many unknown parameters

  • Iterative algorithms don't converge reliably

  • Iterative algorithms take longer than expected to converge

Real data is (almost always) more complicated than our idealized model, but...

  • If our method doesn't work reliably for idealized data, then we shouldn't trust its outputs when it is applied to real data.

  • If our method requires \(10^6\) iterations to converge for idealized data, then we know we'll need to run it for at least that long for real data (and potentially longer).

Step 2: Validate the computation

(This step applies regardless of whether analyzing simulated or real data.)

  • Are the results plausible?

    • E.g., Magnitude of best-fit value for each model parameter

  • For algorithms that have a stochastic element, are results robust to:

    • Different (reasonable) initial guesses?

    • Different draws from the pseudo-random number generator?

  • For iterative algorithms:

    • Why did algorithm stop?

    • Any warning signs of non-convergence?

  • For algorithms that return an uncertainty estimate, is the reported precision:

    • Robust?

    • Scientifically useful?
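For the stochastic and iterative checks above, one concrete pattern is to rerun the fit from several starting points and confirm the answers agree. A minimal sketch using a toy objective and hand-rolled gradient descent (in a real analysis you would rerun your actual fitting code, e.g., with different seeds or initial guesses):

```julia
# Step 2 sketch: check robustness of an iterative fit to the initial guess,
# and record *why* the algorithm stopped. The objective is a toy example.
f(θ)  = (θ - 2.0)^2 + 1.0          # toy negative log-likelihood (minimum at θ = 2)
df(θ) = 2.0 * (θ - 2.0)            # its gradient

function fit_from(θ0; η=0.1, tol=1e-10, maxiter=10_000)
    θ = θ0
    for i in 1:maxiter
        θnew = θ - η * df(θ)
        abs(θnew - θ) < tol && return (θ=θnew, iters=i, converged=true)
        θ = θnew
    end
    return (θ=θ, iters=maxiter, converged=false)   # warning sign: hit maxiter
end

# Rerun from several (reasonable) initial guesses
results = [fit_from(θ0) for θ0 in (-10.0, 0.0, 7.5)]
println([(r.θ, r.iters, r.converged) for r in results])
```

If the runs stop for different reasons, require wildly different iteration counts, or disagree on the answer, that is a red flag before any scientific interpretation.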

Step 3: Compare Likelihood to expected distribution

Likelihood:

$$\begin{eqnarray} \mathcal{L} & = & p( \mathrm{data} \; | \; \mathrm{model\; parameters}, \; \mathrm{model} ) \\ & = & p (y|θ,M) = p(y|\theta) \end{eqnarray}$$

Often, we have a complex physical model but a relatively simple model of the measurement process. In these cases, if we knew that the full model was correct and knew the model parameters (\(\theta\)), then we could compute the distribution of the likelihood.

For simple measurement models, this can often be done analytically (e.g., for independent Gaussian measurement errors with known uncertainties, the χ² statistic follows a chi-squared distribution).

For more complex measurement models, this is often done via simulation.
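As a sketch of the simulation approach, assume independent Gaussian measurement errors with known σ and known true parameters; then χ² = Σᵢ((yᵢ − μᵢ)/σᵢ)² should follow a chi-squared distribution with n degrees of freedom (mean n, variance 2n). The values below are illustrative:

```julia
# Step 3 sketch: simulate the distribution of the χ² statistic under the
# assumed model and check it matches the expected chi-squared distribution.
using Random, Statistics

Random.seed!(1)
n, σ = 50, 0.3
μ_true = 1.0                       # "true" model prediction (constant, for simplicity)

chisq(y) = sum(((y .- μ_true) ./ σ).^2)

# Draw many simulated datasets that satisfy all model assumptions
samples = [chisq(μ_true .+ σ .* randn(n)) for _ in 1:10_000]

println("mean χ² ≈ ", round(mean(samples), digits=1), " (expect ≈ ", n, ")")
```

If the χ² of your real data falls far out in the tail of this distribution, the model (or the quoted uncertainties) deserves scrutiny.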

Step 4: Inspect Sensitivity to individual data points

In practice, there are usually complications omitted from the model being used for inference.

  • Could be missing astrophysics (e.g., the sample of targets is contaminated with a small fraction of objects that differ from the intended sample)

  • Could be an incomplete model for the measurement process (e.g., cosmic ray hits, observer points telescope at the wrong object, ...)

To protect against making poor decisions/scientific inferences:

  • Ask: would our decisions/inferences change if a small fraction of measurements had been omitted from the analysis?
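One simple implementation of this check is a leave-one-out refit: drop each data point in turn, refit, and see how much the estimates move. A sketch with an injected outlier (all values illustrative):

```julia
# Step 4 sketch: leave-one-out sensitivity for a least-squares linear fit.
using Random, Statistics

Random.seed!(3)
n = 60
x = rand(n)
y = 1.0 .+ 0.5 .* x .+ 0.2 .* randn(n)
y[1] = 10.0                            # inject one bad point (e.g., a cosmic-ray hit)

fit(idx) = hcat(ones(length(idx)), x[idx]) \ y[idx]
full = fit(1:n)

# How much do the fitted parameters shift when each point is dropped?
shifts = [maximum(abs.(fit(setdiff(1:n, i)) .- full)) for i in 1:n]
imax = argmax(shifts)
println("most influential point: ", imax,
        " (parameter shift ", round(shifts[imax], digits=3), ")")
```

If one or a few points dominate the inference, investigate them before drawing conclusions.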

Step 5: Sensitivity to prior distribution/assumptions

  • In a Bayesian analysis, we specify prior assumptions quantitatively by specifying a prior distribution for unknown model parameters.

  • There are assumptions in any model (even if they're hard to find, e.g., in a frequentist test).

  • Usually, reasonable scientists could make different reasonable assumptions.

  • The precise quantitative results will differ depending on the assumptions.

  • The differences in the results under different assumptions might be:

    • So small they are scientifically uninteresting,

    • So large that more detailed analysis isn't a good use of your time,

    • Somewhere in the middle.

  • We need to know where this analysis falls along that spectrum before we start drawing scientific conclusions.
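A toy example of quantifying prior sensitivity, using the conjugate Normal–Normal model (data \(y_i \sim N(\theta, \sigma^2)\) with σ known, prior \(\theta \sim N(\mu_0, \tau_0^2)\)), where the posterior mean is available in closed form. All numerical values are illustrative:

```julia
# Step 5 sketch: compare posterior means under two reasonable priors.
using Random, Statistics

Random.seed!(7)
n, σ = 25, 0.5
y = 2.0 .+ σ .* randn(n)

# Posterior mean of θ for a Normal(μ0, τ0²) prior (standard conjugate result):
# a precision-weighted average of the prior mean and the sample mean.
function posterior_mean(y, σ, μ0, τ0)
    w_data  = length(y) / σ^2          # precision contributed by the data
    w_prior = 1 / τ0^2                 # precision contributed by the prior
    (w_data * mean(y) + w_prior * μ0) / (w_data + w_prior)
end

broad  = posterior_mean(y, σ, 0.0, 10.0)   # weakly informative prior
narrow = posterior_mean(y, σ, 0.0, 1.0)    # tighter prior centered at zero
println("posterior means: broad prior ", round(broad, digits=3),
        ", tighter prior ", round(narrow, digits=3))
```

If the difference between `broad` and `narrow` is much smaller than the scientific question requires, the analysis is robust to the prior; if not, the prior choice needs explicit justification.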

Step 6: Check posterior predictive distribution

  • Does the model make good "predictions" for the observed data?
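A minimal sketch of a posterior predictive check: simulate replicated datasets from the fitted model and ask whether the observed value of a test statistic is typical of the replicates. (For simplicity this plugs in a point estimate; a full Bayesian version would draw parameters from the posterior for each replicate. All values are illustrative.)

```julia
# Step 6 sketch: predictive check using the largest residual as test statistic.
using Random, Statistics

Random.seed!(11)
n, σ = 40, 0.3
y_obs = 1.0 .+ σ .* randn(n)
θ̂ = mean(y_obs)                              # fitted parameter (constant model)

T(y) = maximum(abs.(y .- θ̂))                 # test statistic: largest residual

# Distribution of T over replicated data drawn from the fitted model
T_rep = [T(θ̂ .+ σ .* randn(n)) for _ in 1:5_000]
pval = mean(T_rep .>= T(y_obs))              # predictive p-value
println("predictive p-value ≈ ", round(pval, digits=2))
```

A predictive p-value very close to 0 or 1 signals that the model fails to reproduce this feature of the observed data.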

Numerical experiments

Simulated dataset & best-fit model

$$y_{true} = \sum_{i=0}^{1} a_i x^i$$

[Interactive controls: order of polynomial to generate data; coefficients a₀, a₁, a₂, a₃; number of observations; number of outliers; measurement uncertainty]

Now let's repeat that analysis many times and compare the MLE estimates for each parameter with the true values.


[Interactive control: number of simulated datasets]

Fallacy of Greek Letters

Question:

What happens if we fit the data with a simpler model than used to generate the data?

Question:

What happens if we fit the data with a more complex model than used to generate the data?

Step 7: Perform Cross validation

  • Previously, we fit a model based on all of our observations and then asked if the resulting model made good "postdictions". This is a relatively weak test.

  • A stronger test would be to see if the model can make good predictions for new data.

  • Often, acquiring new data is very time consuming/expensive.

  • This motivates a more powerful method for assessing models: cross validation (CV).

Cross validation of a single model

  • Divide data into two subsets (e.g., 75% "training", 25% "validation")

  • Fit data ("train model") using only the training set

  • Use resulting parameters to make predictions for validation set

  • How well did the model do? (evaluate "loss function")
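The recipe above, sketched for a linear model with a mean-squared-error loss (the split fractions and data values are illustrative; in practice you might also average over several random splits or use k-fold CV):

```julia
# Single-model CV sketch: 75%/25% train/validation split, MSE loss.
using Random, Statistics

Random.seed!(5)
n = 80
x = rand(n)
y = 1.0 .+ 0.5 .* x .+ 0.2 .* randn(n)

perm  = randperm(n)                    # shuffle before splitting
train = perm[1:60]                     # 75% "training"
valid = perm[61:end]                   # 25% "validation"

# Fit ("train model") using only the training set
coef = hcat(ones(length(train)), x[train]) \ y[train]

# Use resulting parameters to make predictions for the validation set
predict(idx) = coef[1] .+ coef[2] .* x[idx]

# How well did the model do? (evaluate the loss function)
val_loss = mean((y[valid] .- predict(valid)).^2)
println("validation MSE ≈ ", round(val_loss, digits=4))
```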

Cross validation when considering multiple models

  • Divide data into three subsets (e.g., 60% "training", 20% "validation", 20% "test")

  • For each model you want to consider:

    • Fit/Train using only data from training set

    • Validate using only data from validation set

  • Pick which model you'll use for decisions/publication.

  • Evaluate predictions for test set.

  • Report results on test data.

  • Avoid the temptation to make further "improvements" at this point.
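A sketch of the three-way-split workflow: fit polynomials of several degrees on the training set, pick the degree with the lowest validation loss, and only then evaluate (once) on the test set. Data are simulated from a degree-1 polynomial; all values are illustrative:

```julia
# Multi-model CV sketch: 60/20/20 train/validation/test split.
using Random, Statistics

Random.seed!(8)
n = 150
x = rand(n)
y = 1.0 .+ 0.5 .* x .+ 0.2 .* randn(n)

perm = randperm(n)
train, valid, test = perm[1:90], perm[91:120], perm[121:150]

design(idx, d) = hcat((x[idx].^p for p in 0:d)...)        # polynomial design matrix
mse(idx, coef, d) = mean((y[idx] .- design(idx, d) * coef).^2)

# Fit/train each candidate model on the training set; validate on the validation set
candidates = map(0:5) do d
    coef = design(train, d) \ y[train]
    (d=d, coef=coef, loss=mse(valid, coef, d))
end
best = argmin(r -> r.loss, candidates)

# Report results on the test set; no further "improvements" after this point
println("chosen degree: ", best.d,
        ", test MSE ≈ ", round(mse(test, best.coef, best.d), digits=4))
```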

[Interactive control: order of polynomial to use for cross validation]

Question:

Any Questions?

Your Questions

Priors

Question:

Are the prior and the posterior the same thing?

Both are PDFs. But in any given analysis (with non-zero data), they are different functions:

$$\mathrm{(Posterior)} = \mathrm{(Prior)} \times \mathrm{(Likelihood)} / \mathrm{(Evidence)}$$

$$p( \theta| M, D) = \frac{p(\theta | M) p(D | \theta, M )}{p(D | M )}$$
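As a numerical illustration of this formula, we can evaluate the posterior on a grid for a toy Gaussian model with a flat prior; here the "evidence" is just the discretized normalizing integral. All values are illustrative:

```julia
# Grid-based Bayes' theorem: posterior ∝ prior × likelihood.
using Random, Statistics

Random.seed!(2)
σ, n = 0.5, 20
y = 1.0 .+ σ .* randn(n)

θ_grid = range(-2, 4, length=2001)
prior = fill(1 / 6, length(θ_grid))        # flat prior density on [-2, 4]
loglike(θ) = -sum((y .- θ).^2) / (2σ^2)    # Gaussian log-likelihood (up to a constant)
ll = loglike.(θ_grid)
like = exp.(ll .- maximum(ll))             # rescaled for numerical stability

Δθ = step(θ_grid)
evidence = sum(prior .* like) * Δθ         # p(D|M), up to the rescaling constant
posterior = prior .* like ./ evidence      # Bayes' theorem, as above

θ_map = θ_grid[argmax(posterior)]
println("posterior mode ≈ ", round(θ_map, digits=2),
        " (sample mean ", round(mean(y), digits=2), ")")
```

With a flat prior, the posterior mode coincides with the maximum-likelihood estimate, making the prior/posterior distinction concrete: they are different functions related by the likelihood.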

Question:

If a model is sensitive to priors, what are ways to justify the physicality of one prior over another?

Model fitting

Question:

Is there a specific technique for making good initial predictions? I noticed in this lab that [a good initial guess] helps [the optimizer find good] results. Is there a good way of making optimal ones?

Question:

How often do we use optimization techniques to estimate parameters vs. statistical modeling techniques like regression to solve problems?

Uncertainty quantification

Question:

There are uncertainties in the data itself for collection and uncertainties when making a fit to a model. Do the uncertainties in the data itself have a major effect on the uncertainty on the fit of the model and how do we account for that error?

begin
    n = 40                  # number of simulated observations
    x = rand(n)
    y = 1 .+ 0.5 .* x       # true model: linear with a₀ = 1, a₁ = 0.5
    eps = randn(n)          # standard normal draws for the measurement noise
end
40-element Vector{Float64}:
 -0.3325205543973895
 -0.5778501201338591
 -0.46780306827212526
  0.303847724410151
 -1.0055364711381838
  0.5629800977940854
  0.06804241793226581
  ⋮
 -0.14004967748176717
  0.4053714294968463
  0.0023727799381460133
 -1.007036041947695
 -1.1642977360075129
  0.7025095678480187
begin
    σ_y = fill(0.2, n)             # homoscedastic option (overwritten below)
    σ_y = 0.1 .* x .+ 0.1          # heteroscedastic measurement uncertainties
    y_measured = y .+ σ_y .* eps
end;
let 
    A = hcat( ones(n), x )
    Σ = PDiagMat(σ_y.^2)
    fit_coef = calc_mle_linear_model(A, y_measured, Σ)
    #calc_sigma_mle_linear_model(A, Σ)
end
2-element Vector{Float64}:
 1.0215610016034429
 0.4546503200815888
Question:

In what situations would a parametric model... be insufficient, and why might a non-parametric or semi-parametric approach be more appropriate for modeling them?

Question:

Can you provide such an example of where the measurement uncertainties are not well modeled by a Gaussian distribution? How do we deal with that?

Model comparison

Question:

When creating a model, how do you know when it is important to add a more flexible polynomial? Since this adds more parameters, how do we know when we are trying to overfit data?

Question:

How do different goodness-of-fit tests (like AIC, BIC, or likelihood ratio tests etc) compare in terms of model selection when using MLE for time-series or other data in astronomy? I just found AIC while googling and didn't really get it mathematically and wanted to see it with a known reference point.

Computing Hardware

Question:

What exactly are the nodes on ROAR and how does this setup allow for faster computing?

  • Basic Memory Node: 

    • 4 GB/core

    • Ethernet connections between nodes

    • For minimal memory and high throughput jobs.

  • Standard Node:

    • 8 GB/core,

    • Low-latency Infiniband networking for multi-node jobs

    • For moderate memory jobs and multicore processing.

  • High Memory nodes: 

    • For large memory jobs and multicore processing.

    • 20 GB/core, >= 1 TB RAM per node

    • Low-latency Infiniband networking for multi-node jobs. 

  • GPU nodes:

    • A100 or P100 GPUs

    • For highly parallel, high-compute density calculations that can run on a GPU

Project Questions

Question:

When picking a topic for the project are we limited to the resources that are provided or will we be able to find others? How do we go about finding data?

  • You're welcome to pick a project that uses a dataset we won't be using in class.

  • It's up to you to find a dataset that's appropriate and publicly available for your project.

  • If that's intimidating, then you can use a data source from one of the labs (e.g., Gaia, Exoplanet Archive, Kepler/K2/TESS light curves, Kepler TTV measurements, California Planet Survey RV).

Question:

What sort of information is expected to be on our project dashboard?

Question:

Is it a hub where anybody can upload data of a certain type, feed the constraints, and receive some analysis?

That wasn't the intent.  If you think this makes sense for your purpose, then let's discuss.

Question:

Or should it store its own data and have a way for users to select what source they want to know about?

That's one good option.

Another good option is for the user to specify a target and have your dashboard query a database, so that it doesn't have to store all the data from a large survey.

Random Julia Questions

Question:

How do I write fstrings in Julia?

begin 
    a = 42
    b = π
end
π = 3.1415926535897...

The value of a is $a.

The value of a is 42.

The value of a+2 is $(a+2).

The value of a+2 is 44.

The value of b is $(string(round(b,digits=2))).

The value of b is 3.14.  

If you're already familiar with printf, then Printf.@sprintf is useful.

using Printf
formatted_string = @sprintf "The value of a is %d.  The value of b is %1.2f." a b
"The value of a is 42.  The value of b is 3.14."
"You can also embed inside a string or markdown text like y = $(@sprintf "%1.2f" b)."
"You can also embed inside a string or markdown text like y = 3.14."

Lab logistics

Question:

After we first clone the repository, we change into its directory through the terminal. The instructions on the class website say to then run the command 

cd REPO_DIR
julia --project -e 'using Pkg; Pkg.instantiate(); '

but the Lab 4 Canvas assignment description says to run 

cd REPO_DIR
cd deps ; julia --project=.. build.jl ; cd ..

What is the difference between these two, and should we run both of them?


Let's look at Project.toml and build.jl.

name = "Lab4"
uuid = "3f08caa0-4af4-4e06-87fc-90af5fc7315e"
authors = ["Eric Ford <ebf11@psu.edu>"]
version = "0.1.0"

[deps]
Pkg = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f"
Pluto = "c3e4b0f8-55cb-11ea-2926-15256bba5781"

[compat]
Pkg = "1.11.0"
Pluto = "0.20.4"

and build.jl

println("Making sure Pluto is installed")
import Pkg; 
Pkg.add(name="Pluto", version="0.20.4");
import Pluto; 

println("Installing ex1.jl")
Pluto.activate_notebook_environment("../ex1.jl"); 
Pkg.instantiate(); 

println("Installing ex2.jl")
Pluto.activate_notebook_environment("../ex2.jl"); 
Pkg.instantiate(); 

One cell inside ex1.jl

PLUTO_PROJECT_TOML_CONTENTS = """
[deps]
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Distributions = "31c24e10-a181-5473-b8eb-7969acd0382f"
Downloads = "f43a241f-c20a-4ad4-852c-f6b1247861c6"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
MCMCChains = "c7f686f2-ff18-58e9-bc7b-31028e88f75d"
Optim = "429524aa-4258-5aef-a3af-852621145aeb"
PDMats = "90014a1f-27ba-587c-ab20-58faa44d9150"
Plots = "91a5bcdd-55d7-5caf-9e0b-520d859cae80"
PlutoTeachingTools = "661c6b06-c737-4d37-b85c-46df65de6f69"
PlutoUI = "7f904dfe-b85e-4ff6-b463-dae2292396a8"
QuadGK = "1fd47b50-473d-5c70-9696-f719f8f3bcdc"
Turing = "fce5fe82-541a-59a6-adf8-730c64b5f9a0"

[compat]
CSV = "~0.10.15"
DataFrames = "~1.7.0"
Distributions = "~0.25.116"
MCMCChains = "~6.0.7"
Optim = "~1.10.0"
PDMats = "~0.11.31"
Plots = "~1.40.9"
PlutoTeachingTools = "~0.3.1"
PlutoUI = "~0.7.60"
QuadGK = "~2.11.1"
Turing = "~0.35.5"
"""

Helper Code

generate_data (generic function with 1 method)
generate_θ_fit_distribution (generic function with 1 method)
plot_parameter_histograms (generic function with 1 method)

Built with Julia 1.11.5 and

LaTeXStrings 1.4.0
LinearAlgebra 1.11.0
MLUtils 0.4.5
PDMats 0.11.32
Plots 1.40.9
PlutoTeachingTools 0.3.1
PlutoUI 0.7.61
Printf 1.11.0
Statistics 1.11.1

To run this tutorial locally, download this file and open it with Pluto.jl.