Data Science Applications to Astronomy

Week 2: Exploratory Data Analysis

Exploratory Data Analysis

Overview

  1. Choose data to explore

  2. Ingest data

  3. Validate data

  4. Clean data

  5. Describe/Visualize data

  6. Identify potential relationships in data

  7. Make a plan for investigating potential relationships quantitatively

Choose data to Explore

Classical Astronomy approach:

  1. Identify a scientific problem

  2. Decide what data is needed

  3. Request telescope time

  4. Keep revising and resubmitting until your proposal is selected

  5. Conduct observations

  6. Ingest data you collect

Can you think of another approach to astronomy research?

  • pause

Other styles of Astronomy Research

Classical archival science approach:

  1. Identify a scientific problem

  2. Decide what data is needed

  3. Learn about/query multiple surveys/datasets that might have data to address your question.

  4. Prioritize which to consider first

  5. Query archive(s) to ingest data others collected.

Survey-science key-project approach:

  1. Identify a scientific problem

  2. Decide what data is needed

  3. Obtain funding

  4. Build observatory, telescope, detector, software pipeline, archive, etc. to meet your specifications

  5. Conduct survey (observations, calibration, data reduction, archiving, etc.)

  6. Query database(s) to ingest data from survey

  7. Release data to public

Survey-science ancillary science approach:

  1. Identify exciting dataset(s)

  2. Learn about how they were collected, limitations, uncertainties, biases, etc.

  3. Decide if they have the potential to address your science question

  4. Query database(s) to ingest data being collected for other reasons

Many variations

  • Spectrum of approaches for how to identify questions/datasets

  • Combine survey, archival and targeted approaches to address a common question.

Stages of Exploratory Data Analysis

Ingest Data

  • Construct a query

  • Download the results of that query

  • Store the data locally

  • Read the data into memory.
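
The steps above can be sketched in Julia using the CSV and DataFrames packages. The file contents below are a toy example; in practice, `Downloads.download(query_url, local_path)` would fetch the results of an archive query.

```julia
using CSV, DataFrames

# Simulated query results; in practice these would come from
# Downloads.download(query_url, local_path)
csv_data = """
target,ra,dec,mag
A,10.1,-5.2,14.3
B,11.7,2.8,15.1
"""
local_path = joinpath(mktempdir(), "targets.csv")  # store the data locally
write(local_path, csv_data)
df = CSV.read(local_path, DataFrame)               # read the data into memory
```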

Tip

Options for storing/organizing your data

  • Vectors, Matrices and higher-dimensional arrays

    • Storing many entries (e.g., targets, observation times) that are of the same type and have similar meaning that you'll want to keep together.

  • DataFrames & Tables:

    • Store multiple types of data for a common set of entries (i.e., same length).

    • Allow efficiently adding/removing columns of data during your analysis.

    • Reduce risk of bookkeeping errors when entries are added, removed, or reordered.

  • Databases

    • Contain multiple tables/dataframes of different lengths
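
A small sketch contrasting the first two options, using toy values:

```julia
using DataFrames

# Arrays: many entries of the same type and meaning
obs_times = [2459000.5, 2459001.5, 2459002.5]   # e.g., Julian dates
fluxes    = [1.02, 0.98, 1.01]

# DataFrame: multiple types of data for a common set of entries
df = DataFrame(time = obs_times, flux = fluxes)
df.flux_err = [0.01, 0.02, 0.01]    # efficiently add a column...
select!(df, Not(:flux_err))         # ...or remove one
```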

Validate Data

  • What is the size and shape of the data?

  • What are the types of data?

  • What are the ranges of values?

  • Is there missing data?

  • Check if a representative subset of the data is consistent with expectations.

  • Are some entries suspiciously discrepant from expectations/other data?

  • What is the approximate empirical distribution of values?

  • Are values self-consistent?
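
Several of these checks are one-liners in Julia. The toy catalog below includes a missing value and a sentinel entry (99.9) to illustrate:

```julia
using DataFrames, Statistics

# Toy catalog: one missing magnitude and one sentinel value (99.9)
df = DataFrame(mag = [14.2, 15.1, missing, 99.9], color = [0.5, 0.7, 0.6, 0.8])

size(df)                      # size and shape of the data
eltype.(eachcol(df))          # types of data in each column
describe(df)                  # ranges of values, counts of missing entries
any(ismissing, df.mag)        # is there missing data?
extrema(skipmissing(df.mag))  # 99.9 stands out as suspiciously discrepant
```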

Clean Data

Are some data values:

  • missing?

  • clearly erroneous?

  • suspiciously discrepant from expectations?

  • suspiciously discrepant from other data?

Tip

Any large dataset is likely to have some suspicious data!

  • Could these issues affect my analysis?

  • Could these values interfere with even exploratory data analysis?

  • Should I try to understand my data source better before I proceed?

  • Should I fix the issues now or proceed with caution?

    • 80%/20% rule

  • If proceeding, how will I make sure that I (and my team) don't forget these concerns?
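
A minimal cleaning sketch for the toy catalog, showing both strategies (dropping versus imputing):

```julia
using DataFrames, Statistics

df = DataFrame(mag = [14.2, 15.1, missing, 99.9])

clean = dropmissing(df)                      # drop rows with missing values
clean = filter(:mag => m -> m < 90, clean)   # remove sentinel/erroneous values

# Or impute missing values to keep every row (proceed with caution!)
med = median(skipmissing(df.mag))
imputed = coalesce.(df.mag, med)
```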

Describe/Visualize Data

  • Location: mean, median, mode

  • Scale: standard deviation, quantiles, bounds

  • Higher-order moments: skewness, kurtosis, behavior of tails

  • Transformations

    • Linear transformations (shift, scale, rotate)

    • Non-linear transformations for visualization (e.g., log, sqrt)

    • Power transforms to standardize distributions (e.g., Box-Cox transform)

  • Other strategies

    • Clamping data to limit effects of outliers

    • Imputing missing data to allow for fast exploratory analysis

  • Statistical tests

    • Test for normality
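
Most of these summaries come from the Statistics standard library and StatsBase.jl, sketched here on toy measurements with one large value:

```julia
using Statistics, StatsBase

x = [0.5, 1.2, 2.0, 3.1, 4.8, 12.0]    # toy measurements with one outlier

mean(x), median(x)                      # location
std(x), quantile(x, [0.25, 0.5, 0.75]) # scale
skewness(x), kurtosis(x)                # higher-order moments

logx = log10.(x)                        # non-linear transform for visualization
clamped = clamp.(x, 0.0, 5.0)           # clamping limits the effect of outliers
```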

Tip

  • Do you see the qualitative patterns that you're expecting?

  • Are there additional patterns that you didn't anticipate?

  • Do you really understand the data you're about to analyze?

Identify potential relationships in Data

Look for relationships between values:

  • For each object

  • Across objects

  • In space

  • In time

Statistics

  • Correlation coefficients

  • Rank correlation coefficient

  • Dangers of statistics

Visualizations

  • Scatter plot

  • 2-d histograms or density estimates

  • Limitations of visualizations
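
Both correlation statistics above are available in Julia; a toy example with a roughly linear relationship:

```julia
using Statistics, StatsBase

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = 2 .* x .+ [0.1, -0.2, 0.0, 0.3, -0.1]   # roughly linear relationship

cor(x, y)          # Pearson correlation (sensitive to outliers)
corspearman(x, y)  # Spearman rank correlation (robust to monotone trends)
```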

Make a Plan

  • Is this question/dataset combination worthy of more of my time?

  • Should I consider combining with other dataset(s) to fill gaps?

  • What needs to be done before beginning quantitative analysis?

  • What apparent relationships should be evaluated quantitatively?

  • What potential concerns should be kept in mind?

Big Data

  • What counts as "big"?

What are examples of “Big Data” in Astronomy?

  • pause for ideas from class

Forms of “Big Data” in Astronomy

  • Many observations of your target

    • (e.g., flux measured every minute for years)

  • Many targets in your survey

    • (e.g., 5 band photometry of \(\sim10^7\) galaxies)

  • Many types of measurement for each target

    • (e.g., modest number of spectra)

  • Computationally expensive physical model

    • (e.g., CMB, cosmic structure)

  • Many parameters in models

    • (e.g., neural network)

Finding Middle Ground

This course aims to prepare you to combine the power of:

  • Astronomical surveys

  • Astrophysical knowledge, and

  • Modern data science tools

Questions

Git

Question:

I understand that git commit is a type of save command, but does it save locally to my computer? Or in some sort of “middle-ground” computer space? Or does it send it to github to store without publishing until git push is used?

Question:

What exactly is a local repository?

Question:

What is the difference between saving to a local repository and pushing to github?

Question:

Why are there two separate terminal commands to commit and push to GitHub? I don't quite understand the nuances between them.

Julia

Question:

Why was Julia in particular chosen as the language for this class?

Question:

In python, I can look up functions that are part of libraries pretty easily - there are websites dedicated to numpy, matplotlib, etc. Does this same thing exist with Julia libraries or is it a little more underground?

Now there's also an AI package (https://github.com/svilupp/AIHelpMe.jl) trained on the Julia manual and Discourse content

  • Can access without installing via JuliaHub

  • But first need to create an account there

  • Or can install and add an API key from a LLM provider (see https://github.com/svilupp/AIHelpMe.jl for details)

  • Or can setup your own local LLM server (e.g., Ollama), but this is only for the brave/people with time to mess around

Please read the lists of goals and objectives from students in the attached document, Astro 416 Lab 1 Goals.docx . Then summarize the most common goals.

I reviewed the goals and objectives listed by students in the document. Here are the most common goals:

Learning Data Science Techniques:

Many students aim to develop strong foundational skills in data science, including methodologies and techniques commonly used in the field.

Programming Proficiency:

A significant number of students want to become proficient in programming languages such as Julia and Python, and improve their coding skills.

GitHub and Version Control:

Several students mentioned the goal of becoming more familiar with GitHub, learning its commands, and using it effectively for version control.

Data Handling and Analysis:

Students expressed a desire to handle large datasets, improve data modeling skills, and apply statistical tests for data analysis.

Visualization and Presentation:

Improving the ability to create effective data visualizations and making plots that are clear and compelling is a common objective.

Application to Research:

Many students want to apply the skills learned in the course to their own research projects, bridging the gap between coursework and practical applications.

Efficiency and Best Practices:

Goals include learning best practices for data documentation, reproducible research, and writing cleaner, more efficient code.

Confidence and Independence:

Increasing confidence in coding, especially in new languages, and learning to solve problems independently were also frequently mentioned.

These goals reflect a strong focus on practical skills and their application in both academic and professional contexts. Is there anything specific you would like to delve deeper into?

Lab

What’s hard about Big Data?

  • Pause for input from class

Examples of challenges applying Big Data to Astronomy

Domain Scientist (e.g., Astronomer):

  • Collecting large & high-quality dataset

  • Understanding what processes affect data significantly

Computer Scientist/IT Professional:

  • Storing large volume of data

  • Accessing data efficiently

  • Computing resources to process large dataset

  • Harnessing modern computing capabilities

Statistician/Data Scientist:

  • Usually, simple models are not adequate.

  • Correlation does not imply causation (even if it can be useful for prediction)

  • Is training data representative of production data?

  • Many model parameters

  • Potential sensitivity of results to choices for prior, regularization, features, loss function, etc.

Everyone:

  • Data-driven models can be hard to interpret (& explain)

  • Communicating results (and their limitations)

  • Rapidly evolving toolkits

Setup & Helper Code

Built with Julia 1.11.5 and

PlutoTeachingTools 0.3.1
PlutoUI 0.7.60

To run this tutorial locally, download this file and open it with Pluto.jl.