Data Science Applications to Astronomy
Week 3: Model Building I
Probability
Terminology
We'll use words that sound familiar, but have technical meanings distinct from their everyday usage. E.g.,
Model
Likelihood
Model
Scientific Model vs Statistical Model
What are the differences?
Scientific Model
Describes what's happening, irrespective of our measurements
Can be extremely well tested or highly uncertain
Statistical Model
Describes process for generating data
Inevitably includes assumptions
Recognizes that uncertainty is inevitable
Think of example sources of uncertainty
pause
Example Scientific Models
Newton's equation of motion & gravitation
Keplerian motion of planets
Salpeter initial mass function for stars
ΛCDM model for growth of large scale structure
Example Statistical Models
Gaussian or Normal distribution for measurement uncertainties
Poisson distribution for number of photons hitting detector given expected photon rate
Binomial distribution for number of "successes" given number of trials and probability of success
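Each of these statistical models is easy to simulate. A quick sketch in Python using scipy/numpy (the flux, noise level, photon rate, trial count, and success probability below are made-up illustrative values):

```python
import numpy as np

rng = np.random.default_rng(42)

# Gaussian measurement noise: a true flux plus zero-mean noise (illustrative values)
true_flux = 100.0
sigma = 5.0
measured = true_flux + sigma * rng.standard_normal(1000)

# Poisson photon counts, given an expected rate of 20 photons per exposure
counts = rng.poisson(lam=20.0, size=1000)

# Binomial "successes": e.g., 50 trials, each succeeding with probability 0.3
detections = rng.binomial(n=50, p=0.3, size=1000)

# Sample means should be near 100, 20, and 50*0.3 = 15, respectively
print(np.mean(measured), np.mean(counts), np.mean(detections))
```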
"All models are wrong, but some are useful." – George Box
What is a (good) model useful for?
Making predictions
Incorporating new data to refine one's knowledge
Planning future science
Why are all models wrong?
pause
Probability
Multiple definitions and use cases
Surprisingly deep philosophical question
We'll adopt a Bayesian interpretation of probability
We'll start with the Objective Bayesian interpretation of probability
Later we'll discuss the Subjective Bayesian interpretation
Objective Bayesian
Formalization of quantitative logic
Results of statistical inference depend only on the model and the data
Probability Function
Assigns a weight to each possible event/outcome of an experiment or observation.
Probability Mass Function
For discrete events/outcomes
Examples: (pause to ask class)
Examples of Probability Mass Functions
Flip coin: heads (H) or tails (T). P(H) = P(T) = ½
Roll an n-sided die: \(P(v) = \frac{1}{n}\) for each face \(v\)
Inspect locations on a globe, asking land or water:
$$p(\mathrm{Water}) = w$$
$$p(\mathrm{Land}) = 1-w$$
Axioms of Probability
Non-negativity: \(p(E)\ge0\)
Sum to unity: \(\sum_i p(E_i) = 1\) for complete list of possible events
Sum of probabilities of mutually exclusive events: \(p(\bigcup_i E_i) = \sum_i p(E_i)\)
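These axioms can be checked directly for a simple probability mass function. A small Python sketch for a fair six-sided die (exact arithmetic via `fractions`):

```python
from fractions import Fraction

# PMF for a fair six-sided die: p(v) = 1/6 for each face v
n = 6
pmf = {v: Fraction(1, n) for v in range(1, n + 1)}

# Non-negativity: every probability is >= 0
assert all(p >= 0 for p in pmf.values())

# Probabilities sum to unity over the complete list of possible outcomes
assert sum(pmf.values()) == 1

# Additivity for mutually exclusive events:
# p(roll is even) = p(2) + p(4) + p(6)
p_even = sum(pmf[v] for v in pmf if v % 2 == 0)
print(p_even)  # 1/2
```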
Probability Density Function (PDF)
Allows for continuous events/outcomes
Examples: (ask class)
Examples of Probability Density Functions
Fraction of a planet covered with water: \(p(w)\)
Mass of a star: \(p(M_\star)\)
Luminosity of a star: \(p(L_\star)\)
Value of future measurement of flux from a star: \(p(f_{\star,obs})\)
Probability function is about the state of knowledge.
It is not an intrinsic property of the system.
Likelihood function (\(\mathcal{L}(\theta)\))
Probability distribution for the outcome of an experiment/observation that was made in the past, ignoring any current knowledge about the actual outcome.
A likelihood is a probability density function.
Not all PDFs are likelihoods.
Likelihood function assumes a statistical model (commonly written \(\mathcal{M}\)).
Models often have parameters (commonly written as \(\theta\))
Likelihood function often written: \(\mathcal{L}(\theta)\) or \(\mathcal{L}_{\mathcal{M}}(\theta)\)
Scientists often perform inference based on \(\mathcal{L}(\theta)\)
E.g., Maximum Likelihood Estimate
$$\hat{\theta}_{\mathrm{MLE}} = \mathrm{argmax}_\theta \, \mathcal{L}(\theta)$$
Confidence Interval
$$[\theta_l,\theta_u] = \{\theta \;\; \mathrm{s.t.} \;\; \log\mathcal{L}(\theta) \ge \log\mathcal{L}(\hat{\theta}_{\mathrm{MLE}}) - \Delta\log\mathcal{L}_{\mathrm{thresh}}\}$$
The MLE is a useful tool for your toolbox.
For some models there are very efficient methods for computing \(\hat{\theta}_{\mathrm{MLE}}\).
However, it is just a point estimate.
Be careful when there are many parameters, few observations, or large uncertainties.
We'll use the MLE as a shortcut or starting point for more rigorous analyses.
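As a concrete sketch, consider the globe example above: the MLE for the water fraction \(w\) given a binomial likelihood can be found numerically (the data, 6 "water" results in 9 inspections, are hypothetical). In Python with scipy:

```python
from scipy import stats, optimize

# Hypothetical globe-inspection data: 6 "water" results out of 9 locations
k, n = 6, 9

def neg_log_likelihood(w):
    # Negative log of the binomial likelihood for k successes in n trials
    # with success probability w; we minimize this to maximize L(w)
    return -stats.binom.logpmf(k, n, w)

res = optimize.minimize_scalar(neg_log_likelihood,
                               bounds=(1e-6, 1 - 1e-6), method="bounded")
w_mle = res.x
print(w_mle)  # should be close to the analytic MLE, k/n = 2/3
```

For the binomial model the MLE has the closed form \(\hat{w} = k/n\), so the numerical optimizer is overkill here; the same pattern applies, though, when no closed form exists.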
In an ideal world, we'd use a Bayesian approach to uncertainty quantification. But confidence intervals are used commonly enough that we at least need to understand what they are (and what they aren't).
Conditional Probability
Consider two events: \(A\) & \(B\)
Let each event have its own probability function: \(p(A)\) & \(p(B)\)
Joint probability of both events: \(p(A,B)\)
Conditional probability of \(A\) given \(B\): \(p(A|B)\)
Axiom of Conditional Probability: \(p(A,B) = p(A) \, p(B|A)\)
Alternatively, we could condition on \(B\) first: \(p(A,B) = p(B) \, p(A|B)\)
– Image Credit: Wikipedia, CC BY-SA 3.0
Bayes Theorem
Start with Law of Conditional Probability:
$$p(A,B) = p(A) p(B|A) = p(B) p(A|B)$$
OR we might explicitly write out the assumption of using model \(M\):
$$p(A,B|M) = p(A|M) p(B|A,M) = p(B|M) p(A|B,M)$$
Can compute the marginal probabilities using law of total (conditional) probability:
Discrete case:
$$p(A) = \sum_{i} p(B_i) p(A|B_i)$$
$$p(B) = \sum_{i} p(A_i) p(B|A_i)$$
Continuous case:
$$p(A ) = \int dB \, p(B) p(A|B)$$
$$p(B ) = \int dA \, p(A) p(B|A)$$
To derive Bayes' Theorem, substitute:
A → \(\theta\): model parameters
B → \(D\): observed data
Prior PDF for model parameters \(p(\theta|M)\) for \(p(A)\)
Likelihood for the observed data given the model, \(\mathcal{L} = p(D|\theta, M)\), for \(p(B|A)\)
into the law of conditional probability for \(p(\theta, D \,|\, M)\), the joint probability of the model parameters and the observed data given the model \(M\):
$$p(\theta, D \,|\, M) = p(\theta | M)\, p(D | \theta, M ) = p(D | M )\, p( \theta| M, D)$$
Solve for \(p( \theta| M, D)\):
$$p( \theta| M, D) = \frac{p(\theta | M) p(D | \theta, M )}{p(D | M )}$$
Interpretation of Bayes Theorem:
$$\mathrm{(Posterior)} = \mathrm{(Prior)} \times \mathrm{(Likelihood)} / \mathrm{(Evidence)}$$
Evidence term is often hard to calculate
Evidence is sometimes referred to as the "fully marginalized likelihood"
Discrete case:
$$p(D | M ) = \sum_{i} p(\theta_i|M) p(D|\theta_i,M)$$
Continuous case:
$$p(D | M ) = \int d\theta \, p(\theta|M) p(D|\theta,M)$$
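For low-dimensional problems, the posterior and evidence can be computed on a grid. A Python sketch for the globe example (flat prior on the water fraction \(w\), hypothetical data of 6 "water" results in 9 inspections):

```python
import numpy as np
from scipy import stats

# Grid approximation to Bayes' theorem for the globe example
k, n = 6, 9
w = np.linspace(0.0, 1.0, 10001)
dw = w[1] - w[0]

prior = np.ones_like(w)                # p(w|M): uniform on [0, 1]
likelihood = stats.binom.pmf(k, n, w)  # p(D|w, M)

# Evidence p(D|M) = ∫ p(w|M) p(D|w,M) dw, here by a Riemann sum;
# for a flat prior and binomial likelihood it equals 1/(n+1) = 0.1
evidence = np.sum(prior * likelihood) * dw
posterior = prior * likelihood / evidence

# Posterior integrates to one; with a flat prior its mode is the MLE, k/n
print(evidence, w[np.argmax(posterior)])
```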
Bayesian Inference
Bayesian inference about model parameters is based on the posterior probability distribution.
Ideally, one uses the full posterior distribution.
Often, it can be helpful to summarize the posterior:
Approximate posterior distribution with a Normal Distribution
Maximum a posteriori estimate (MAP): \(\hat{\theta}_{\mathrm{MAP}} = \mathrm{argmax}_\theta \, p(\theta|M) \, \mathcal{L}(\theta)\)
Fisher Information Matrix: curvature of \(\log p(\theta|D,M)\) evaluated at the MAP
$$I_{i,j} = -\left. \frac{\partial^2}{\partial \theta_i \, \partial \theta_j} \log p(\theta|D,M) \right|_{\hat{\theta}_{\mathrm{MAP}}}$$
In many cases, if you have a large, high-quality dataset and a model with few parameters, then the posterior distribution may be well approximated by a Gaussian distribution, and \(\hat{\theta}_{\mathrm{MAP}} \approx \hat{\theta}_{\mathrm{MLE}}\).
However, there are also plenty of times when:
MAP & MLE differ
Both the MAP and MLE throw away important information about uncertainties.
Posterior might not be well approximated by a Gaussian (e.g., skewed or multimodal).
We want to build a toolbox that can work in both regimes.
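The Gaussian approximation above can be sketched numerically for the globe example (flat prior, hypothetical data of 6 "water" results in 9 inspections): find the MAP, then estimate the curvature of the log posterior by finite differences to get an approximate posterior width. In Python:

```python
from scipy import stats

# Gaussian (Laplace) approximation to the globe-example posterior
k, n = 6, 9

def log_post(w):
    # Log posterior up to a constant: flat prior contributes 0,
    # so only the binomial log likelihood remains
    return stats.binom.logpmf(k, n, w)

w_map = k / n  # analytic MAP for a flat prior (coincides with the MLE)

# Curvature of the log posterior at the MAP via a central finite difference
h = 1e-5
curv = (log_post(w_map + h) - 2 * log_post(w_map) + log_post(w_map - h)) / h**2

info = -curv               # (observed) information: minus the curvature at the MAP
sigma_approx = info**-0.5  # width of the approximating Gaussian

print(w_map, sigma_approx)  # ≈ 0.667 and ≈ 0.157
```

Here the analytic curvature is \(-k/\hat{w}^2 - (n-k)/(1-\hat{w})^2 = -40.5\), so the finite-difference estimate can be checked directly.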
Simplified Bayesian Model Building
Steps
Develop a statistical model describing how your data are generated. This dictates the likelihood function and model parameters.
Choose prior distributions
Perform computations to estimate (or sample from) the posterior distribution.
Make decisions based on posterior distribution.
See Example Applications in Labs
Setup & Helper Functions
Axioms of Conditional Probability
$$p(A|B)\ge0, \; \; \mathrm{if} \, p(B)>0$$
$$\sum_i p(A_i|B) = 1 \; \mathrm{for\ a\ complete\ list\ of\ possible\ events}$$
$$p(\bigcup_i A_i | B) = \sum_i p(A_i | B)$$
Built with Julia 1.11.5 and
PlutoTeachingTools 0.3.1, PlutoUI 0.7.60
To run this tutorial locally, download this file and open it with Pluto.jl.