Data Science Applications to Astronomy

Week 2: Exploratory Data Analysis

Exploratory Data Analysis

Overview

  1. Choose data to explore

  2. Ingest data

  3. Validate data

  4. Clean data

  5. Describe/Visualize data

  6. Identify potential relationships in data

  7. Make a plan for investigating potential relationships quantitatively

Choose data to Explore

Classical Astronomy approach:

  1. Identify a scientific problem

  2. Decide what data is needed

  3. Request telescope time

  4. Keep revising and resubmitting until your proposal is selected

  5. Conduct observations

  6. Ingest data you collect

Can you think of another approach to astronomy research?

  • pause

Other styles of Astronomy Research

Classical archival science approach:

  1. Identify a scientific problem

  2. Decide what data is needed

  3. Learn about/query multiple surveys/datasets that might have data to address your question.

  4. Prioritize which to consider first

  5. Query archive(s) to ingest data others collected.

Survey-science key-project approach:

  1. Identify a scientific problem

  2. Decide what data is needed

  3. Obtain funding

  4. Build observatory, telescope, detector, software pipeline, archive, etc. to meet your specifications

  5. Conduct survey (observations, calibration, data reduction, archiving, etc.)

  6. Query database(s) to ingest data from survey

  7. Release data to public

Survey-science ancillary science approach:

  1. Identify exciting dataset(s)

  2. Learn about how they were collected, limitations, uncertainties, biases, etc.

  3. Decide if they have the potential to address your science question

  4. Query database(s) to ingest data being collected for other reasons

Many variations

  • Spectrum of approaches for how to identify questions/datasets

  • Combine survey, archival and targeted approaches to address a common question.

Stages of Exploratory Data Analysis

Ingest Data

  • Construct a query

  • Download the results of that query

  • Store the data locally

  • Read the data into memory.
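
The steps above can be sketched in Julia using the CSV and DataFrames packages. The file contents below are a toy example; in practice, `Downloads.download(query_url, local_path)` would fetch the results of an archive query.

```julia
using CSV, DataFrames

# Simulated query results; in practice these would come from
# Downloads.download(query_url, local_path)
csv_data = """
target,ra,dec,mag
A,10.1,-5.2,14.3
B,11.7,2.8,15.1
"""
local_path = joinpath(mktempdir(), "targets.csv")  # store the data locally
write(local_path, csv_data)
df = CSV.read(local_path, DataFrame)               # read the data into memory
```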

Tip

Options for storing/organizing your data

  • Vectors, Matrices and higher-dimensional arrays

    • Storing many entries (e.g., targets, observation times) that are of the same type and have similar meaning that you'll want to keep together.

  • DataFrames & Tables:

    • Store multiple types of data for a common set of entries (i.e., same length).

    • Allow efficiently adding/removing columns of data during your analysis.

    • Reduce risk of bookkeeping errors when entries are added, removed, or reordered.

  • Databases

    • Contain multiple tables/dataframes of different lengths
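
A small sketch contrasting the first two options, using toy values:

```julia
using DataFrames

# Arrays: many entries of the same type and meaning
obs_times = [2459000.5, 2459001.5, 2459002.5]   # e.g., Julian dates
fluxes    = [1.02, 0.98, 1.01]

# DataFrame: multiple types of data for a common set of entries
df = DataFrame(time = obs_times, flux = fluxes)
df.flux_err = [0.01, 0.02, 0.01]    # efficiently add a column...
select!(df, Not(:flux_err))         # ...or remove one
```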

Validate Data

  • What is the size and shape of the data?

  • What are the types of data?

  • What are the ranges of values?

  • Is there missing data?

  • Check if a representative subset of the data is consistent with expectations.

  • Are some entries suspiciously discrepant from expectations/other data?

  • What is the approximate empirical distribution of values?

  • Are values self-consistent?
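
Several of these checks are one-liners in Julia. The toy catalog below includes a missing value and a sentinel entry (99.9) to illustrate:

```julia
using DataFrames, Statistics

# Toy catalog: one missing magnitude and one sentinel value (99.9)
df = DataFrame(mag = [14.2, 15.1, missing, 99.9], color = [0.5, 0.7, 0.6, 0.8])

size(df)                      # size and shape of the data
eltype.(eachcol(df))          # types of data in each column
describe(df)                  # ranges of values, counts of missing entries
any(ismissing, df.mag)        # is there missing data?
extrema(skipmissing(df.mag))  # 99.9 stands out as suspiciously discrepant
```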

Clean Data

Are some data values:

  • missing?

  • clearly erroneous?

  • suspiciously discrepant from expectations?

  • suspiciously discrepant from other data?

Tip

Any large dataset is likely to have some suspicious data!

  • Could these issues affect my analysis?

  • Could these values interfere with even exploratory data analysis?

  • Should I try to understand my data source better before I proceed?

  • Should I fix the issues now or proceed with caution?

    • 80%/20% rule

  • If proceeding, how will I make sure that I (and my team) don't forget these concerns?
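
A minimal cleaning sketch for the toy catalog, showing both strategies (dropping versus imputing):

```julia
using DataFrames, Statistics

df = DataFrame(mag = [14.2, 15.1, missing, 99.9])

clean = dropmissing(df)                      # drop rows with missing values
clean = filter(:mag => m -> m < 90, clean)   # remove sentinel/erroneous values

# Or impute missing values to keep every row (proceed with caution!)
med = median(skipmissing(df.mag))
imputed = coalesce.(df.mag, med)
```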

Describe/Visualize Data

  • Location: mean, median, mode

  • Scale: standard deviation, quantiles, bounds

  • Higher-order moments: skewness, kurtosis, behavior of tails

  • Transformations

    • Linear transformations (shift, scale, rotate)

    • Non-linear transformations for visualization (e.g., log, sqrt)

    • Power transforms to standardize distributions (e.g., Box-Cox transform)

  • Other strategies

    • Clamping data to limit effects of outliers

    • Imputing missing data to allow for fast exploratory analysis

  • Statistical tests

    • Test for normality
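
Most of these summaries come from the Statistics standard library and StatsBase.jl, sketched here on toy measurements with one large value:

```julia
using Statistics, StatsBase

x = [0.5, 1.2, 2.0, 3.1, 4.8, 12.0]    # toy measurements with one outlier

mean(x), median(x)                      # location
std(x), quantile(x, [0.25, 0.5, 0.75]) # scale
skewness(x), kurtosis(x)                # higher-order moments

logx = log10.(x)                        # non-linear transform for visualization
clamped = clamp.(x, 0.0, 5.0)           # clamping limits the effect of outliers
```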

Tip

  • Do you see the qualitative patterns that you're expecting?

  • Are there additional patterns that you didn't anticipate?

  • Do you really understand the data you're about to analyze?

Identify potential relationships in Data

Look for relationships between values:

  • For each object

  • Across objects

  • In space

  • In time

Statistics

  • Correlation coefficients

  • Rank correlation coefficient

  • Dangers of statistics

Visualizations

  • Scatter plot

  • 2-d histograms or density estimates

  • Limitations of visualizations
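
Both correlation statistics above are available in Julia; a toy example with a roughly linear relationship:

```julia
using Statistics, StatsBase

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = 2 .* x .+ [0.1, -0.2, 0.0, 0.3, -0.1]   # roughly linear relationship

cor(x, y)          # Pearson correlation (sensitive to outliers)
corspearman(x, y)  # Spearman rank correlation (robust to monotone trends)
```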

Make a Plan

  • Is this question/dataset combination worthy of more of my time?

  • Should I consider combining with other dataset(s) to fill gaps?

  • What needs to be done before beginning quantitative analysis?

  • What apparent relationships should be evaluated quantitatively?

  • What potential concerns should be kept in mind?

Big Data

  • What counts as "big"?

What are examples of “Big Data” in Astronomy?

  • pause for ideas from class

Forms of “Big Data” in Astronomy

  • Many observations of your target

    • (e.g., flux measured every minute for years)

  • Many targets in your survey

    • (e.g., 5 band photometry of \(\sim10^7\) galaxies)

  • Many types of measurement for each target

    • (e.g., modest number of spectra)

  • Computationally expensive physical model

    • (e.g., CMB, cosmic structure)

  • Many parameters in models

    • (e.g., neural network)

Finding Middle Ground

This course aims to prepare you to combine the power of:

  • Astronomical surveys

  • Astrophysical knowledge, and

  • Modern data science tools

Questions

Git

Question:

I understand that git commit is a type of save command, but does it save locally to my computer? Or in some sort of “middle-ground” computer space? Or does it send it to github to store without publishing until git push is used?

Question:

What exactly is a local repository?

Question:

What is the difference between saving to a local repository and pushing to github?

Question:

Why are there two separate terminal commands to commit and push to GitHub? I don't quite understand the nuances between them.

Julia

Question:

Why was Julia in particular chosen as the language for this class?

Question:

In python, I can look up functions that are part of libraries pretty easily - there are websites dedicated to numpy, matplotlib, etc. Does this same thing exist with Julia libraries or is it a little more underground?

Now there's also an AI package (https://github.com/svilupp/AIHelpMe.jl) trained on the Julia manual and Discourse content

  • Can access without installing via JuliaHub

  • But first need to create an account there

  • Or can install and add an API key from a LLM provider (see https://github.com/svilupp/AIHelpMe.jl for details)

  • Or can setup your own local LLM server (e.g., Ollama), but this is only for the brave/people with time to mess around

Please read the lists of goals and objectives from students in the attached document, Astro 416 Lab 1 Goals.docx . Then summarize the most common goals.

I reviewed the goals and objectives listed by students in the document. Here are the most common goals:

Learning Data Science Techniques:

Many students aim to develop strong foundational skills in data science, including methodologies and techniques commonly used in the field.

Programming Proficiency:

A significant number of students want to become proficient in programming languages such as Julia and Python, and improve their coding skills.

GitHub and Version Control:

Several students mentioned the goal of becoming more familiar with GitHub, learning its commands, and using it effectively for version control.

Data Handling and Analysis:

Students expressed a desire to handle large datasets, improve data modeling skills, and apply statistical tests for data analysis.

Visualization and Presentation:

Improving the ability to create effective data visualizations and making plots that are clear and compelling is a common objective.

Application to Research:

Many students want to apply the skills learned in the course to their own research projects, bridging the gap between coursework and practical applications.

Efficiency and Best Practices:

Goals include learning best practices for data documentation, reproducible research, and writing cleaner, more efficient code.

Confidence and Independence:

Increasing confidence in coding, especially in new languages, and learning to solve problems independently were also frequently mentioned.

These goals reflect a strong focus on practical skills and their application in both academic and professional contexts. Is there anything specific you would like to delve deeper into?

Lab

What’s hard about Big Data?

  • Pause for input from class

Examples of challenges applying Big Data to Astronomy

Domain Scientist (e.g., Astronomer):

  • Collecting large & high-quality dataset

  • Understanding what processes affect data significantly

Computer Scientist/IT Professional:

  • Storing large volume of data

  • Accessing data efficiently

  • Computing resources to process large dataset

  • Harnessing modern computing capabilities

Statistician/Data Scientist:

  • Usually, simple models are not adequate.

  • Correlation does not imply causation (even if it can be useful for prediction)

  • Is training data representative of production data?

  • Many model parameters

  • Potential sensitivity of results to choices for prior, regularization, features, loss function, etc.

Everyone:

  • Data-driven models can be hard to interpret (& explain)

  • Communicating results (and their limitations)

  • Rapidly evolving toolkits

Setup & Helper Code

Built with Julia 1.11.5 and

PlutoTeachingTools 0.3.1
PlutoUI 0.7.60

To run this tutorial locally, download this file and open it with Pluto.jl.