Data Science Applications to Astronomy
Week 3: Model Building I
Probability
Terminology
We'll use words that sound familiar, but have technical meanings distinct from their everyday usage. E.g.,
Model
Likelihood
Model
Scientific Model vs Statistical Model
What are the differences?
Scientific Model
Describes what's happening, irrespective of our measurements
Can be extremely well tested or highly uncertain
Statistical Model
Describes process for generating data
Inevitably includes assumptions
Recognizes that uncertainty is inevitable
Think of example sources of uncertainty
pause
Example Scientific Models
Newton's equation of motion & gravitation
Keplerian motion of planets
Salpeter initial mass function for stars
ΛCDM model for growth of large scale structure
Example Statistical Models
Gaussian or Normal distribution for measurement uncertainties
Poisson distribution for number of photons hitting detector given expected photon rate
Binomial distribution for number of "successes" given number of trials and probability of success
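Each of these statistical models is easy to simulate. A quick sketch in Python using scipy/numpy (the flux, noise level, photon rate, trial count, and success probability below are made-up illustrative values):

```python
import numpy as np

rng = np.random.default_rng(42)

# Gaussian measurement noise: a true flux plus zero-mean noise (illustrative values)
true_flux = 100.0
sigma = 5.0
measured = true_flux + sigma * rng.standard_normal(1000)

# Poisson photon counts, given an expected rate of 20 photons per exposure
counts = rng.poisson(lam=20.0, size=1000)

# Binomial "successes": e.g., 50 trials, each succeeding with probability 0.3
detections = rng.binomial(n=50, p=0.3, size=1000)

# Sample means should be near 100, 20, and 50*0.3 = 15, respectively
print(np.mean(measured), np.mean(counts), np.mean(detections))
```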
"All models are wrong, but some are useful." – George Box
What is a (good) model useful for?
Making predictions
Incorporating new data to refine one's knowledge
Planning future science
Why are all models wrong?
pause
Probability
Multiple definitions and use cases
Surprisingly deep philosophical question
We'll adopt a Bayesian interpretation of probability
We'll start with the Objective Bayesian interpretation of probability
Later we'll discuss the Subjective Bayesian interpretation
Objective Bayesian
Formalization of quantitative logic
Results of statistical inference depend only on the model and the data
Probability Function
Assigns a weight to each possible event/outcome of an experiment or observation.
Probability Mass Function
For discrete events/outcomes
Examples: (pause to ask class)
Examples of Probability Mass Functions
Flip coin: heads (H) or tails (T). P(H) = P(T) = ½
Roll an n-sided die: \(P(v) = \frac{1}{n}\) for each face \(v\)
Inspect locations on a globe, asking land or water:
$$p(\mathrm{Water}) = w$$
$$p(\mathrm{Land}) = 1-w$$
Axioms of Probability
Non-negativity: \(p(E)\ge0\)
Sum to unity: \(\sum_i p(E_i) = 1\) for complete list of possible events
Sum of probabilities of mutually exclusive events: \(p(\bigcup_i E_i) = \sum_i p(E_i)\)
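These axioms can be checked directly for a simple probability mass function. A small Python sketch for a fair six-sided die (exact arithmetic via `fractions`):

```python
from fractions import Fraction

# PMF for a fair six-sided die: p(v) = 1/6 for each face v
n = 6
pmf = {v: Fraction(1, n) for v in range(1, n + 1)}

# Non-negativity: every probability is >= 0
assert all(p >= 0 for p in pmf.values())

# Probabilities sum to unity over the complete list of possible outcomes
assert sum(pmf.values()) == 1

# Additivity for mutually exclusive events:
# p(roll is even) = p(2) + p(4) + p(6)
p_even = sum(pmf[v] for v in pmf if v % 2 == 0)
print(p_even)  # 1/2
```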
Probability Density Function (PDF)
Allows for continuous events/outcomes
Examples: (ask class)
Examples of Probability Density Functions
Fraction of a planet covered with water: \(p(w)\)
Mass of a star: \(p(M_\star)\)
Luminosity of a star: \(p(L_\star)\)
Value of future measurement of flux from a star: \(p(f_{\star,obs})\)
Probability function is about the state of knowledge.
It is not an intrinsic property of the system.
Likelihood function (\(\mathcal{L}(\theta)\))
Probability distribution for the outcome of an experiment/observation that was made in the past, ignoring any current knowledge about the actual outcome.
A likelihood is a probability density function.
Not all PDFs are likelihoods.
Likelihood function assumes a statistical model (commonly written \(\mathcal{M}\)).
Models often have parameters (commonly written as \(\theta\))
Likelihood function often written: \(\mathcal{L}(\theta)\) or \(\mathcal{L}_{\mathcal{M}}(\theta)\)
Scientists often perform inference based on \(\mathcal{L}(\theta)\)
E.g., Maximum Likelihood Estimate
$$\hat{\theta}_{\mathrm{MLE}} = \mathrm{argmax}_\theta \, \mathcal{L}(\theta)$$
Confidence Interval
$$[\theta_l,\theta_u] = \{\theta \;\; \mathrm{s.t.} \;\; \log\mathcal{L}(\theta) \ge \log\mathcal{L}(\hat{\theta}_{\mathrm{MLE}}) - \Delta\log\mathcal{L}_{\mathrm{thresh}}\}$$
The MLE is a useful tool for your toolbox.
For some models there are very efficient methods for computing \(\hat{\theta}_{\mathrm{MLE}}\).
However, it is just a point estimate.
Be careful when there are many parameters, few observations, or large uncertainties.
We'll use the MLE as a shortcut or starting point for more rigorous analyses.
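As a concrete sketch, consider the globe example above: the MLE for the water fraction \(w\) given a binomial likelihood can be found numerically (the data, 6 "water" results in 9 inspections, are hypothetical). In Python with scipy:

```python
from scipy import stats, optimize

# Hypothetical globe-inspection data: 6 "water" results out of 9 locations
k, n = 6, 9

def neg_log_likelihood(w):
    # Negative log of the binomial likelihood for k successes in n trials
    # with success probability w; we minimize this to maximize L(w)
    return -stats.binom.logpmf(k, n, w)

res = optimize.minimize_scalar(neg_log_likelihood,
                               bounds=(1e-6, 1 - 1e-6), method="bounded")
w_mle = res.x
print(w_mle)  # should be close to the analytic MLE, k/n = 2/3
```

For the binomial model the MLE has the closed form \(\hat{w} = k/n\), so the numerical optimizer is overkill here; the same pattern applies, though, when no closed form exists.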
In an ideal world, we'd use a Bayesian approach to uncertainty quantification. But confidence intervals are used commonly enough that we at least need to understand what they are (and what they aren't).
Conditional Probability
Consider two events: \(A\) & \(B\)
Let each event have its own probability function: \(p(A)\) & \(p(B)\)
Joint probability of both events: \(p(A,B)\)
Conditional probability of \(A\) given \(B\): \(p(A|B)\)
Axiom of Conditional Probability: \(p(A,B) = p(A) \, p(B|A)\)
Alternatively, we could condition on \(B\) first: \(p(A,B) = p(B) \, p(A|B)\)
– Image Credit: Wikipedia, CC BY-SA 3.0
Bayes Theorem
Start with Law of Conditional Probability:
$$p(A,B) = p(A) p(B|A) = p(B) p(A|B)$$
OR we might explicitly write out the assumption of using model \(M\):
$$p(A,B|M) = p(A|M) p(B|A,M) = p(B|M) p(A|B,M)$$
Can compute the marginal probabilities using law of total (conditional) probability:
Discrete case:
$$p(A) = \sum_{i} p(B_i) p(A|B_i)$$
$$p(B) = \sum_{i} p(A_i) p(B|A_i)$$
Continuous case:
$$p(A ) = \int dB \, p(B) p(A|B)$$
$$p(B ) = \int dA \, p(A) p(B|A)$$
To derive Bayes' Theorem, substitute:
A → \(\theta\): model parameters
B → \(D\): observed data
Prior PDF for model parameters \(p(\theta|M)\) for \(p(A)\)
Likelihood for the observed data given the model, \(\mathcal{L} = p(D|\theta, M)\), for \(p(B|A)\)
into the law of conditional probability for \(p(\theta, D \,|\, M)\), the joint probability of the model parameters and the observed data given the model \(M\):
$$p(\theta, D \,|\, M) = p(\theta | M)\, p(D | \theta, M ) = p(D | M )\, p( \theta| M, D)$$
Solve for \(p( \theta| M, D)\):
$$p( \theta| M, D) = \frac{p(\theta | M) p(D | \theta, M )}{p(D | M )}$$
Interpretation of Bayes Theorem:
$$\mathrm{(Posterior)} = \mathrm{(Prior)} \times \mathrm{(Likelihood)} / \mathrm{(Evidence)}$$
Evidence term is often hard to calculate
Evidence is sometimes referred to as the "fully marginalized likelihood"
Discrete case:
$$p(D | M ) = \sum_{i} p(\theta_i|M) p(D|\theta_i,M)$$
Continuous case:
$$p(D | M ) = \int d\theta \, p(\theta|M) p(D|\theta,M)$$
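For low-dimensional problems, the posterior and evidence can be computed on a grid. A Python sketch for the globe example (flat prior on the water fraction \(w\), hypothetical data of 6 "water" results in 9 inspections):

```python
import numpy as np
from scipy import stats

# Grid approximation to Bayes' theorem for the globe example
k, n = 6, 9
w = np.linspace(0.0, 1.0, 10001)
dw = w[1] - w[0]

prior = np.ones_like(w)                # p(w|M): uniform on [0, 1]
likelihood = stats.binom.pmf(k, n, w)  # p(D|w, M)

# Evidence p(D|M) = ∫ p(w|M) p(D|w,M) dw, here by a Riemann sum;
# for a flat prior and binomial likelihood it equals 1/(n+1) = 0.1
evidence = np.sum(prior * likelihood) * dw
posterior = prior * likelihood / evidence

# Posterior integrates to one; with a flat prior its mode is the MLE, k/n
print(evidence, w[np.argmax(posterior)])
```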
Bayesian Inference
Bayesian inference about model parameters is based on the posterior probability distribution.
Ideally, one uses the full posterior distribution.
Often, it can be helpful to summarize the posterior:
Approximate posterior distribution with a Normal Distribution
Maximum a posteriori estimate (MAP): \(\hat{\theta}_{\mathrm{MAP}} = \mathrm{argmax}_\theta \, p(\theta|M) \, \mathcal{L}(\theta)\)
Fisher Information Matrix: curvature of \(\log p(\theta|D,M)\) evaluated at the MAP
$$I_{i,j} = -\left. \frac{\partial^2}{\partial \theta_i \, \partial \theta_j} \log p(\theta|D,M) \right|_{\hat{\theta}_{\mathrm{MAP}}}$$
In many cases, if you have a large, high-quality dataset and a model with few parameters, then the posterior distribution may be well approximated by a Gaussian distribution, and \(\hat{\theta}_{\mathrm{MAP}} \approx \hat{\theta}_{\mathrm{MLE}}\).
However, there are also plenty of times when:
MAP & MLE differ
Both the MAP and MLE throw away important information about uncertainties.
Posterior might not be well approximated by a Gaussian (e.g., skewed or multimodal).
We want to build a toolbox that can work in both regimes.
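The Gaussian approximation above can be sketched numerically for the globe example (flat prior, hypothetical data of 6 "water" results in 9 inspections): find the MAP, then estimate the curvature of the log posterior by finite differences to get an approximate posterior width. In Python:

```python
from scipy import stats

# Gaussian (Laplace) approximation to the globe-example posterior
k, n = 6, 9

def log_post(w):
    # Log posterior up to a constant: flat prior contributes 0,
    # so only the binomial log likelihood remains
    return stats.binom.logpmf(k, n, w)

w_map = k / n  # analytic MAP for a flat prior (coincides with the MLE)

# Curvature of the log posterior at the MAP via a central finite difference
h = 1e-5
curv = (log_post(w_map + h) - 2 * log_post(w_map) + log_post(w_map - h)) / h**2

info = -curv               # (observed) information: minus the curvature at the MAP
sigma_approx = info**-0.5  # width of the approximating Gaussian

print(w_map, sigma_approx)  # ≈ 0.667 and ≈ 0.157
```

Here the analytic curvature is \(-k/\hat{w}^2 - (n-k)/(1-\hat{w})^2 = -40.5\), so the finite-difference estimate can be checked directly.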
Simplified Bayesian Model Building
Steps
Develop a statistical model describing how your data are generated. This dictates the likelihood function and model parameters.
Choose prior distributions
Perform computations to estimate (or sample from) the posterior distribution.
Make decisions based on posterior distribution.
See Example Applications in Labs
Setup & Helper Functions
Axioms of Conditional Probability
$$p(A|B)\ge0, \; \; \mathrm{if} \, p(B)>0$$
$$\sum_i p(A_i|B) = 1 \; \mathrm{for\ a\ complete\ list\ of\ possible\ events}$$
$$p(\bigcup_i A_i | B) = \sum_i p(A_i | B)$$
Built with Julia 1.11.5 and
PlutoTeachingTools 0.3.1, PlutoUI 0.7.60
To run this tutorial locally, download this file and open it with Pluto.jl.