Data Science Applications to Astronomy
Week 2: Exploratory Data Analysis
Exploratory Data Analysis
Overview
Choose data to explore
Ingest data
Validate data
Clean data
Describe/Visualize data
Identify potential relationships in data
Make a plan for investigating potential relationships quantitatively
Choose data to Explore
Classical Astronomy approach:
Identify a scientific problem
Decide what data is needed
Request telescope time
Keep revising and resubmitting until your proposal is selected
Conduct observations
Ingest data you collect
Can you think of another approach to astronomy research?
pause
Other styles of Astronomy Research
Classical archival science approach:
Identify a scientific problem
Decide what data is needed
Learn about/query multiple surveys/datasets that might have data to address your question.
Prioritize which to consider first
Query archive(s) to ingest data others collected.
Survey-science key-project approach:
Identify a scientific problem
Decide what data is needed
Obtain funding
Build observatory, telescope, detector, software pipeline, archive, etc. to meet your specifications
Conduct survey (observations, calibration, data reduction, archiving, etc.)
Query database(s) to ingest data from survey
Release data to public
Survey-science ancillary science approach:
Identify exciting dataset(s)
Learn about how they were collected, limitations, uncertainties, biases, etc.
Decide if they have the potential to address your science question
Query database(s) to ingest data being collected for other reasons
Many variations
Spectrum of approaches for how to identify questions/datasets
Combine survey, archival and targeted approaches to address a common question.
Stages of Exploratory Data Analysis
Ingest Data
Construct a query
Download the results of that query
Store the data locally
Read the data into memory.
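The last two steps above (store locally, read into memory) can be sketched with Julia's Base library alone. This is a minimal illustration, not a recommended pipeline: the catalog text, column names, and file name are made up, and in practice the text would come from an archive query and would usually be parsed with a dedicated CSV package.

```julia
# Hypothetical query result: a tiny catalog in CSV form (all values are made up).
query_result = """
name,period_days,radius_earth
planet_a,3.5,1.2
planet_b,10.1,2.4
planet_c,365.0,
"""

# Store the data locally (here, in a temporary directory).
path = joinpath(mktempdir(), "catalog.csv")
write(path, query_result)

# Read the data into memory, turning empty fields into `missing`.
parse_or_missing(s) = isempty(s) ? missing : parse(Float64, s)

lines = readlines(path)
header = split(lines[1], ',')
rows = [split(line, ',') for line in lines[2:end]]

planet_names = [String(r[1]) for r in rows]
period_days  = [parse_or_missing(r[2]) for r in rows]
radius_earth = [parse_or_missing(r[3]) for r in rows]
```

Keeping a local copy of the downloaded file means the analysis can be rerun without repeating the query.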
Options for storing/organizing your data
Vectors, Matrices and higher-dimensional arrays
Storing many entries (e.g., targets, observation times) that are of the same type and have similar meaning that you'll want to keep together.
DataFrames & Tables:
Store multiple types of data for a common set of entries (i.e., same length).
Allow efficiently adding/removing columns of data during your analysis.
Reduce risk of bookkeeping errors
Databases
Contain multiple tables/dataframes of different lengths
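As a rough illustration of the first two options, the sketch below uses only Base Julia. A NamedTuple of equal-length vectors stands in for a DataFrame here; that is an assumption made to keep the example free of packages, and all names and values are hypothetical. Databases are not covered by this sketch.

```julia
# Vector: many entries of the same type and meaning (e.g., observation times).
obs_times = [2458001.5, 2458002.5, 2458003.5]      # hypothetical Julian dates

# Matrix: same type, two dimensions (rows = targets, columns = bands).
fluxes = [1.0 2.0 3.0;
          4.0 5.0 6.0]

# Table: several columns of different types sharing one length.
# A NamedTuple of vectors is a minimal stand-in for a DataFrame.
catalog = (name = ["star_a", "star_b"],
           mag  = [12.3, 14.1],
           is_variable = [false, true])

# Adding a column builds a new table; rows stay aligned by index.
catalog_with_distance = merge(catalog, (distance_pc = [10.0, 250.0],))

# Applying one row mask to every column keeps the columns in sync.
mask = catalog_with_distance.mag .< 13
bright = map(col -> col[mask], catalog_with_distance)
```

Filtering all columns through a single mask is the kind of operation where a table type reduces bookkeeping errors compared with juggling separate arrays.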
Validate Data
What is the size and shape of the data?
What are the types of data?
What are the ranges of values?
Is there missing data?
Check if a representative subset of the data is consistent with expectations.
Are some entries suspiciously discrepant from expectations/other data?
What is the approximate empirical distribution of values?
Are values self-consistent?
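Several of these questions can be answered with a few Base Julia calls. The sketch below assumes a single hypothetical column of magnitudes; the placeholder value -99.0 and the plausible range of 0 to 30 are assumptions chosen for illustration.

```julia
# Hypothetical column of magnitudes with one missing entry and one
# placeholder value (-99.0) of the kind a catalog might use for "no data".
mag = [12.3, 14.1, missing, -99.0, 13.7]

n_entries    = length(mag)                  # size of the data
element_type = eltype(mag)                  # type(s) of the data
n_missing    = count(ismissing, mag)        # is there missing data?
lo, hi       = extrema(skipmissing(mag))    # range of the non-missing values

# Flag entries that are missing or fall outside the assumed plausible range.
plausible(m) = !ismissing(m) && 0 <= m <= 30
suspicious = findall(!plausible, mag)
```

Here the minimum of -99.0 immediately reveals the placeholder value, which is why checking ranges early is worthwhile.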
Clean Data
Are some data values:
missing?
clearly erroneous?
suspiciously discrepant from expectations?
suspiciously discrepant from other data?
Any large dataset is likely to have some suspicious data!
Could these issues affect my analysis?
Could these values interfere with even exploratory data analysis?
Should I try to understand my data source better before I proceed?
Should I fix the issues now or proceed with caution?
80%/20% rule
If proceeding, how will I make sure that I (and my team) don't forget these concerns?
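One possible way to proceed with caution is to mark suspicious values as missing and keep a record of what was changed. The sketch below is only an illustration: the sentinel value -99.0 and the upper threshold of 30 are assumptions, and whether a value is truly erroneous depends on the data source.

```julia
# Hypothetical raw magnitudes; -99.0 stands in for a catalog's "no data" sentinel.
raw_mag = [12.3, 14.1, missing, -99.0, 13.7, 45.0]

# Replace clearly erroneous values with `missing` instead of deleting entries,
# so that other columns describing the same objects stay aligned.
is_bad(m) = !ismissing(m) && (m == -99.0 || m > 30)   # thresholds are assumptions
clean_mag = [is_bad(m) ? missing : m for m in raw_mag]

# Record what was changed so the concern is not forgotten later.
changed = findall(is_bad, raw_mag)
println("Set ", length(changed), " suspicious values to missing at indices ", changed)
```

Keeping `raw_mag` untouched and logging the changed indices gives teammates a way to revisit the decision later.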
Describe/Visualize Data
Location: mean, median, mode
Scale: standard deviation, quantiles, bounds
Higher-order moments: skewness, kurtosis, behavior of tails
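The summary statistics above can be computed with Julia's Statistics standard library. In this sketch the sample is hypothetical, and the skewness and excess kurtosis use simple moment estimators written out by hand; the mode is omitted.

```julia
using Statistics

# Hypothetical sample with one large value in the tail.
x = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 50.0]

# Location
location = (mean = mean(x), median = median(x))

# Scale
scale = (std = std(x), quartiles = quantile(x, [0.25, 0.75]), bounds = extrema(x))

# Higher-order moments from standardized values (simple moment estimators).
z = (x .- mean(x)) ./ std(x; corrected = false)
skewness = mean(z .^ 3)
excess_kurtosis = mean(z .^ 4) - 3
```

With this sample the mean is pulled well above the median by the single large value, and the skewness is positive, which is a quick hint that the tail deserves a closer look.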
Transformations
Linear transformations (shift, scale, rotate)
Non-linear transformations for visualization (e.g., log, sqrt)
Power transforms to standardize distributions (e.g., Box-Cox transform)
Other strategies
Clamping data to limit effects of outliers
Imputing missing data to allow for fast exploratory analysis
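A few of these transformations and strategies are sketched below with Base Julia and the Statistics standard library. The flux values, the clamping bounds, and the choice of the median for imputation are all assumptions made for illustration. The Box-Cox transform is written in its usual one-parameter form for positive data, \(y^{(\lambda)} = (y^\lambda - 1)/\lambda\) for \(\lambda \neq 0\) and \(\log y\) for \(\lambda = 0\).

```julia
using Statistics

# Hypothetical positive-valued fluxes with one missing entry and one outlier.
flux = [1.0, 10.0, 100.0, missing, 1.0e6]

# Non-linear transformation for visualization.
log_flux = [ismissing(f) ? missing : log10(f) for f in flux]

# Box-Cox power transform for positive data; λ = 0 reduces to the natural log.
boxcox(y, λ) = λ == 0 ? log(y) : (y^λ - 1) / λ

# Clamping limits the influence of outliers (the bounds are an assumption).
clamped = [ismissing(f) ? missing : clamp(f, 1.0, 1.0e3) for f in flux]

# Imputing missing values with the median allows quick exploratory work,
# but imputed entries should not be mistaken for measurements.
imputed = coalesce.(flux, median(skipmissing(flux)))
```

Clamping and imputation both alter the data, so it is safer to keep the original column alongside the modified one.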
Statistical tests
Test for normality
Do you see the qualitative patterns that you're expecting?
Are there additional patterns that you didn't anticipate?
Do you really understand the data you're about to analyze?
Identify potential relationships in Data
Look for relationships between values:
For each object
Across objects
In space
In time
Statistics
Correlation coefficients
Rank correlation coefficient
Dangers of statistics
Visualizations
Scatter plot
2-d histograms or density estimates
Limitations of visualizations
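The two correlation statistics listed above can be compared on a small example. This sketch uses hypothetical data in which y grows monotonically but non-linearly with x. The `ranks` helper is written for this illustration and assumes there are no tied values (ties would need averaged ranks).

```julia
using Statistics

# Hypothetical measurements: y increases monotonically but non-linearly with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = exp.(x)

# Pearson correlation coefficient measures linear association.
pearson = cor(x, y)

# Rank of each entry (assumes no tied values).
ranks(v) = invperm(sortperm(v))

# Spearman rank correlation: the Pearson correlation of the ranks.
spearman = cor(ranks(x), ranks(y))
```

The rank correlation is 1 because the relationship is perfectly monotonic, while the Pearson coefficient is lower because the relationship is not linear. A single number can hide this kind of difference, which is one reason to pair statistics with a scatter plot.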
Make a Plan
Is this question/dataset combination worthy of more of my time?
Should I consider combining with other dataset(s) to fill gaps?
What needs to be done before beginning quantitative analysis?
What apparent relationships should be evaluated quantitatively?
What potential concerns should be kept in mind?
Big Data
What counts as "big"?
What are examples of “Big Data” in Astronomy?
pause for ideas from class
Forms of “Big Data” in Astronomy
Many observations of your target
(e.g., flux measured every minute for years)
Many targets in your survey
(e.g., 5 band photometry of \(\sim10^7\) galaxies)
Many types of measurement for each target
(e.g., modest number of spectra)
Computationally expensive physical model
(e.g., CMB, cosmic structure)
Many parameters in models
(e.g., neural network)
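A quick back-of-the-envelope calculation can help decide what counts as "big" for the first two examples. The 4-year duration and the choice of one 8-byte Float64 per measurement are assumptions for illustration.

```julia
# One flux measurement per minute for 4 years (the duration is an assumption).
minutes_per_year = 60 * 24 * 365.25
n_samples = 4 * minutes_per_year

# 5-band photometry of 10^7 galaxies, stored as 8-byte Float64 values.
n_values = 5 * 10^7
size_gigabytes = n_values * sizeof(Float64) / 1e9

println("Time series samples per target: ", n_samples)
println("Photometry catalog size: ", size_gigabytes, " GB")
```

Under these assumptions the raw photometry values fit comfortably in memory on a laptop, which suggests that "big" often refers to the cost of the analysis or the model rather than to storage alone.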
Finding Middle Ground
This course aims to prepare you to combine the power of:
Astronomical surveys
Astrophysical knowledge, and
Modern data science tools
Questions
Git
I understand that git commit is a type of save command, but does it save locally to my computer? Or in some sort of “middle-ground” computer space? Or does it send it to github to store without publishing until git push is used?
What exactly is a local repository?
What is the difference between saving to a local repository and pushing to github?
Why are there two separate terminal commands to commit and push to GitHub? I don't quite understand the nuances between them.
Julia
Why was Julia in particular chosen as the language for this class?
In Python, I can look up functions that are part of libraries pretty easily - there are websites dedicated to numpy, matplotlib, etc. Does this same thing exist with Julia libraries or is it a little more underground?
Julia Resources
Think Julia (Lauwens & Downey; online version available for free)
Now there's also an AI package (https://github.com/svilupp/AIHelpMe.jl) trained on Julia manual and Discourse content
Can access without installing via JuliaHub
But first need to create an account there
Or can install and add an API key from an LLM provider (see https://github.com/svilupp/AIHelpMe.jl for details)
Or can set up your own local LLM server (e.g., Ollama), but this is only for the brave/people with time to mess around
Please read the lists of goals and objectives from students in the attached document, Astro 416 Lab 1 Goals.docx . Then summarize the most common goals.
I reviewed the goals and objectives listed by students in the document. Here are the most common goals:
Learning Data Science Techniques:
Many students aim to develop strong foundational skills in data science, including methodologies and techniques commonly used in the field.
Programming Proficiency:
A significant number of students want to become proficient in programming languages such as Julia and Python, and improve their coding skills.
GitHub and Version Control:
Several students mentioned the goal of becoming more familiar with GitHub, learning its commands, and using it effectively for version control.
Data Handling and Analysis:
Students expressed a desire to handle large datasets, improve data modeling skills, and apply statistical tests for data analysis.
Visualization and Presentation:
Improving the ability to create effective data visualizations and making plots that are clear and compelling is a common objective.
Application to Research:
Many students want to apply the skills learned in the course to their own research projects, bridging the gap between coursework and practical applications.
Efficiency and Best Practices:
Goals include learning best practices for data documentation, reproducible research, and writing cleaner, more efficient code.
Confidence and Independence:
Increasing confidence in coding, especially in new languages, and learning to solve problems independently were also frequently mentioned.
These goals reflect a strong focus on practical skills and their application in both academic and professional contexts. Is there anything specific you would like to delve deeper into?
Lab
What’s hard about Big Data?
Pause for input from class
Examples of challenges applying Big Data to Astronomy
Domain Scientist (e.g., Astronomer):
Collecting large & high-quality dataset
Understanding what processes affect data significantly
Computer Scientist/IT Professional:
Storing large volume of data
Accessing data efficiently
Computing resources to process large dataset
Harnessing modern computing capabilities
Statistician/Data Scientist:
Usually, simple models are not adequate.
Correlation does not imply causation (even if it could have been used for prediction)
Is training data representative of production data?
Many model parameters
Potential sensitivity of results to choices for prior, regularization, features, loss function, etc.
Everyone:
Data-driven models can be hard to interpret (& explain)
Communicating results (and their limitations)
Rapidly evolving toolkits
Setup & Helper Code
Built with Julia 1.11.5 and
PlutoTeachingTools 0.3.1
PlutoUI 0.7.60
To run this tutorial locally, download this file and open it with Pluto.jl.