Data Science Applications for Astronomy

Week 12: Data Lifecycle

Data Science Workflow

Data Science Lifecycle

Example of a Data Science Lifecycle

(This is just one of many.)

  1. Ask an interesting question

    • What is the scientific goal?

    • What would you do if you had all the data?

    • What do you want to predict or estimate?

  2. Get the data

    • How were the data sampled?

    • Which data are relevant?

    • Are there privacy issues?

  3. Explore the data

    • Plot the data.

    • Are there anomalies?

    • Are there patterns?

  4. Model the data

    • Build a model.

    • Fit the model.

    • Validate the model.

  5. Communicate and visualize the results

    • What did we learn?

    • Do the results make sense?

    • Can we tell a story?

-- Blitzstein & Pfister for Harvard CS109
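As a rough illustration of steps 2 through 5, the sketch below runs a toy version of the cycle using only Julia's standard libraries. Everything in it is an assumption made for illustration: the data are simulated (a straight line with slope 1.5, intercept 2.0, and Gaussian noise), the model is an ordinary least-squares line, and validation is a single train/test split.

```julia
using LinearAlgebra, Random, Statistics

rng = MersenneTwister(42)

# 2. Get the data (simulated here as a stand-in for a real survey)
n = 200
x = 10 .* rand(rng, n)
y = 1.5 .* x .+ 2.0 .+ 0.5 .* randn(rng, n)

# 3. Explore the data (a plot would normally go here)
println("mean(y) = ", mean(y), ", cor(x, y) = ", cor(x, y))

# 4. Model the data: fit on a training subset, validate on the held-out rest
idx = randperm(rng, n)
train, test = idx[1:150], idx[151:end]
design(xs) = [ones(length(xs)) xs]        # design matrix for a straight line
β = design(x[train]) \ y[train]           # least-squares fit
resid = y[test] .- design(x[test]) * β
rmse = sqrt(mean(abs2, resid))            # validation on data not used in the fit

# 5. Communicate the results
println("intercept = ", β[1], ", slope = ", β[2], ", held-out RMSE = ", rmse)
```

A held-out RMSE close to the injected noise level (0.5 here) suggests the fit is not badly over- or under-fitting; a much larger value would send you back to step 3 or 4.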

What's missing?

Hint
  • Making iterative process/loops explicit

  • Interpreting results for oneself

  • Deploying model to work for future data

Some workflows common in industry

OSEMN (pronounced awesome)

  • Obtain

  • Scrub

  • Explore

  • Model

  • iNterpret

CRoss-Industry Standard Process for Data Mining (CRISP-DM)

  • Business Understanding

  • Data Understanding

  • Data Preparation

  • Modeling

  • Evaluation

  • Deployment

Emphasizes loops and deployment

Scrum

Three pillars:

  • Transparency: Make emergent work visible.

  • Inspection: Look out for variances.

  • Adaptation: Adapt your processes to minimize adverse variances and maximize beneficial opportunities.

Scrum builds on sprints: Divide the larger project into a series of sprints, each consisting of:

  • Sprint Planning

  • Daily Scrum (Standup)

  • Sprint Review

  • Sprint Retrospective

Team Data Science Process (TDSP)

Combines a workflow with project templates and recommendations for infrastructure and tools. Favors MS products.

Domino’s data science life cycle is founded on three guiding principles that emphasize frequent iteration, collaboration, and reproducibility.

  1. Ideation

  2. Data Acquisition and Exploration

  3. Research & Development

  4. Validation

  5. Delivery

  6. Monitoring

Adapting Data Science Workflows from Industry to Scientific Setting

  • Reinterpret terms like "business case" and "customer"

  • Often don't know how to quantify success when we start a project

  • Generally, place more value on interpretability

  • Can accommodate projects requiring longer timescales

  • Increasingly, plan to make data & codes public

  • In academia, communication is often primarily with other scientists

Collaborating

Question:

How do you currently collaborate on projects requiring coding?

Asynchronous

  • Write separate files/functions/modules

  • Maintain independent repositories

  • Merge changes via git

  • Create branches for new features, so main branch is always usable

Synchronous

  • Like asynchronous, but ask questions as you go

  • Pair Coding: Driver & Navigator

  • Debugging: Explainer & Audience

  • Beware of using shared filesystem

Question:

What tools do you use for collaborating on coding projects?

Tools

Q&A

Project

Question:

Should the presentation be more focused on a demo of the dashboard or on detailed explanations of the fitting equations/models?

Presentation Rubric

  • Clarity of explanation of purpose of the dashboard and data set(s) used (1 point)

  • Clarity of explanation of the models fit to data and their motivation (1 point)

  • Effective demonstration of the dashboard in action (1 point)

  • Clarity of explanation of how dashboard performs model assessment and of any potential failure modes that are not reliably recognized by the dashboard (1 point)

  • Thoughtful discussion of challenges encountered during project and lessons learned (1 point)

Question:

When is the individual write up due?

  • April 21 (Dashboard itself)

  • May 2 (Report & Reflection)

Report/Reflection Rubric

  • their contributions to the dashboard project and the contributions of their teammates (1 point),

  • what the next steps would be if there were more time to make improvements to the dashboard (1 point),

  • reflecting on what they learned from the experience (2 points),

  • offering any suggestions for how to make a similar project more valuable in future semesters (1 point), and

  • offering any suggestions for how to make the course more valuable in future semesters (optional, 0 points).

Question:

I would love a small rundown of the best packages you recommend [for interactivity].

Question:

What are some ways that we can make them look cleaner and more user friendly?

Question:

What is a good process for trying to speed up code run time?

Big picture steps to efficiency:

  • Use a compiled language

  • Use a strongly-typed language

  • Choose algorithms wisely

  • Choose data types wisely

  • Avoid unnecessary memory allocations

  • Arrange memory accesses to reduce cache misses
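The last two points can be sketched in a few lines of Julia. This is a minimal illustration, not a benchmark: the function names (`sum_row_order`, `sum_col_order`, `scale!`) and the array sizes are made up for the example. Julia stores arrays in column-major order, so the loop whose inner index runs down a column touches memory contiguously.

```julia
# Sum a matrix two ways. Julia arrays are column-major, so the first index
# should vary fastest in the innermost loop.
function sum_row_order(A::Matrix{Float64})
    total = 0.0
    for i in axes(A, 1)          # rows in the outer loop
        for j in axes(A, 2)      # columns in the inner loop: strided access
            total += A[i, j]
        end
    end
    return total
end

function sum_col_order(A::Matrix{Float64})
    total = 0.0
    for j in axes(A, 2)          # columns in the outer loop
        for i in axes(A, 1)      # rows in the inner loop: contiguous access
            total += A[i, j]
        end
    end
    return total
end

# Writing into a preallocated output avoids allocating a new array per call.
function scale!(out::Vector{Float64}, x::Vector{Float64}, a::Float64)
    out .= a .* x                # fused broadcast, written in place
    return out
end

A = rand(1000, 1000)
sum_row_order(A); sum_col_order(A)   # warm up so compilation is not timed
t_row = @elapsed sum_row_order(A)
t_col = @elapsed sum_col_order(A)
println("row-order loop: ", t_row, " s; column-order loop: ", t_col, " s")

x = rand(1000)
out = similar(x)
scale!(out, x, 2.0)
```

The actual timings depend on the machine and matrix size, so measure on your own data before restructuring code.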

Implementation details for code efficiency (assuming a JIT-compiled language like Julia or JAX)

  • Organize code into small functions

  • Avoid type instability

    • Untyped global variables

    • Containers (e.g., arrays) of abstract types

    • structs with abstractly typed fields

  • Avoid unnecessary memory allocations

    • Not taking advantage of fusing and broadcasting

    • Making copies instead of using a view (array[1:5,:] instead of view(array,1:5,:))

    • Many small allocations on heap (instead use StaticArrays.jl)

  • Organize functions into a package (so it only needs to be precompiled once)

  • Add annotations that allow for compiler optimizations (e.g., @inbounds, @fastmath, @simd, @turbo), but only when appropriate

  • Avoid unnecessary use of strings or string interpolation

  • Write code so that it can be parallelized in the future

  • See Performance Tips for more details.
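A few of the items above can be made concrete with a short sketch. The type names (`PointAbstract`, `PointConcrete`) and the helper `norm2` are hypothetical, chosen only for illustration. The example contrasts abstract and concrete field and element types, shows that slicing copies while `view` does not, and uses `@.` to fuse a broadcast into a single in-place loop.

```julia
# Abstractly typed fields: the compiler cannot infer the type of `p.x`.
struct PointAbstract
    x::Real
    y::Real
end

# Parametric struct: field types are concrete for each instance.
struct PointConcrete{T<:Real}
    x::T
    y::T
end

norm2(p) = p.x^2 + p.y^2   # same source code, but only one case is type-stable
println(norm2(PointAbstract(3.0, 4.0)), " ", norm2(PointConcrete(3.0, 4.0)))

# Containers of abstract vs. concrete element types
v_abstract = Real[1.0, 2.0, 3.0]
v_concrete = [1.0, 2.0, 3.0]          # Vector{Float64}
println(isconcretetype(eltype(v_abstract)))                      # false
println(isconcretetype(eltype(v_concrete)))                      # true
println(isconcretetype(fieldtype(PointAbstract, :x)))            # false
println(isconcretetype(fieldtype(PointConcrete{Float64}, :x)))   # true

# Slicing makes a copy; a view refers to the parent's memory.
M = collect(reshape(1.0:20.0, 5, 4))
s = M[1:2, :]            # copy
w = view(M, 1:2, :)      # no copy (equivalently: @view M[1:2, :])
M[1, 1] = -1.0
println(s[1, 1], " ", w[1, 1])   # prints "1.0 -1.0"

# Fused broadcast: one loop, no temporary arrays, result written into y.
x = rand(5)
y = similar(x)
@. y = 2x^2 + 3x + 1
```

In an interactive session, `@code_warntype norm2(PointAbstract(3.0, 4.0))` is one way to see where inference falls back to abstract types.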

Question:

Is there a good way to verify the effectiveness of a classification model aside from checking which points it identifies correctly?
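One common complement to raw accuracy is the confusion matrix and the metrics derived from it (precision, recall, F1). The sketch below uses made-up labels purely for illustration, and `confusion_counts` is a hypothetical helper, not a function from any package.

```julia
# Hypothetical labels for illustration: true marks the positive class.
y_true = Bool[1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = Bool[1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

# Count the four cells of the confusion matrix.
function confusion_counts(y_true, y_pred)
    pairs = zip(y_true, y_pred)
    tp = count(t && p for (t, p) in pairs)       # true positives
    fp = count(!t && p for (t, p) in pairs)      # false positives
    fn = count(t && !p for (t, p) in pairs)      # false negatives
    tn = count(!t && !p for (t, p) in pairs)     # true negatives
    return (; tp, fp, fn, tn)
end

c = confusion_counts(y_true, y_pred)
accuracy  = (c.tp + c.tn) / length(y_true)
precision = c.tp / (c.tp + c.fp)    # of predicted positives, fraction correct
recall    = c.tp / (c.tp + c.fn)    # of true positives, fraction recovered
f1        = 2 * precision * recall / (precision + recall)

println(c)
println("accuracy = ", accuracy, ", precision = ", precision,
        ", recall = ", recall, ", F1 = ", f1)
```

With these labels, a classifier that always predicted the negative class would still score 70% accuracy while recovering none of the positives, which is why precision and recall are worth reporting alongside accuracy when classes are imbalanced.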

Random

Question:

Does Scrum stand for anything?

No

Setup

Built with Julia 1.11.5 and

DataFrames 1.7.0
HypertextLiteral 0.9.5
PlutoTeachingTools 0.3.1
PlutoUI 0.7.61

To run this tutorial locally, download this file and open it with Pluto.jl.