Data Science Applications for Astronomy
Week 12: Data Lifecycle
Data Science Workflow
Data Science Lifecycle
Example of a Data Science Lifecycle
(This is just one of many.)
Ask an interesting question
What is the scientific goal?
What would you do if you had all the data?
What do you want to predict or estimate?
Get the data
How were the data sampled?
Which data are relevant?
Are there privacy issues?
Explore the data
Plot the data.
Are there anomalies?
Are there patterns?
Model the data
Build a model.
Fit the model.
Validate the model.
Communicate and visualize the results
What did we learn?
Do the results make sense?
Can we tell a story?
– Blitzstein & Pfister, Harvard CS109
What's missing?
Making the iterative nature of the process (loops) explicit
Interpreting results for oneself
Deploying model to work for future data
Some workflows common in industry
OSEMN (pronounced awesome)
Obtain
Scrub
Explore
Model
iNterpret
CRoss-Industry Standard Process for Data Mining (CRISP-DM)
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Emphasizes loops and deployment
Scrum
Three pillars:
Transparency: Make emergent work visible.
Inspection: Look out for variances.
Adaptation: Adapt your processes to minimize adverse variances and maximize beneficial opportunities.
Scrum is built on sprints: divide the larger project into a series of sprints, each consisting of:
Sprint Planning
Daily Scrum (Standup)
Sprint Review
Sprint Retrospective
Team Data Science Process (TDSP)
Combines a workflow with project templates and recommendations for infrastructure and tools; favors Microsoft products.
Domino’s data science lifecycle is founded on three guiding principles
that emphasize frequent iteration, collaboration, and reproducibility.
Ideation
Data Acquisition and Exploration
Research & Development
Validation
Delivery
Monitoring
Adapting Data Science Workflows from Industry to Scientific Setting
Reinterpret terms like "business case" and "customer"
Often we don't know how to quantify success when we start a project
Generally, place more value on interpretability
Can accommodate projects requiring longer timescales
Increasingly, plan to make data & codes public
In academia communication is often primarily with other scientists
Collaborating
How do you currently collaborate on projects requiring coding?
Asynchronous
Write separate files/functions/modules
Maintain independent repositories
Merge changes via git
Create branches for new features, so main branch is always usable
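A minimal sketch of the branch-based workflow above, using a throwaway local repository (repository, branch, and file names are made up for illustration; a real project would push the branch to a shared remote and merge via a pull request):

```shell
# Create a throwaway repository to demonstrate the workflow
# (git init -b requires Git >= 2.28)
git init -q -b main demo && cd demo
git -c user.email=you@example.com -c user.name=You commit -q --allow-empty -m "Initial commit"

# Create a feature branch, so the main branch stays usable
git checkout -q -b add-lightcurve-fit

# ... edit files, then record the change ...
echo "fit_lightcurve() = nothing" > lightcurve.jl
git add lightcurve.jl
git -c user.email=you@example.com -c user.name=You commit -q -m "Add light curve fitting stub"

# Merge back into main once the feature works
git checkout -q main
git merge -q add-lightcurve-fit
```

With a shared remote, you would instead run `git push -u origin add-lightcurve-fit` and let a teammate review the changes before merging.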
Synchronous
Like asynchronous, but ask questions as you go
Pair Coding: Driver & Navigator
Debugging: Explainer & Audience
Beware of editing the same files over a shared filesystem
What tools do you use for collaborating on coding projects?
Tools
Merging updates: Git
Sharing screen: Zoom, Teams, Virtual Desktop
Collaborative coding: VS Code/VS Codium, Google Colab, Julia Hub, Repl.it,...
Collaborative writing: Overleaf, Google Docs, MS Office 365, ...
Q&A
Project
Should the presentation be more focused on a demo of the dashboard or on detailed explanations of the fitting equations/models?
Presentation Rubric
Clarity of explanation of purpose of the dashboard and data set(s) used (1 point)
Clarity of explanation of the models fit to data and their motivation (1 point)
Effective demonstration of the dashboard in action (1 point)
Clarity of explanation of how dashboard performs model assessment and of any potential failure modes that are not reliably recognized by the dashboard (1 point)
Thoughtful discussion of challenges encountered during project and lessons learned (1 point)
When is the individual write up due?
April 21 (Dashboard itself)
May 2 (Report & Reflection)
Report/Reflection Rubric
describing their contributions to the dashboard project and the contributions of their teammates (1 point),
describing what the next steps would be if there were more time to make improvements to the dashboard (1 point),
reflecting on what they learned from the experience (2 points),
offering any suggestions for how to make a similar project more valuable in future semesters (1 point), and
offering any suggestions for how to make the course more valuable in future semesters (optional, 0 points).
I would love a small rundown of the best packages you recommend [for interactivity].
PlutoTeachingTools.jl: Formatting like labs
What are some ways that we can make them look cleaner and more user-friendly?
What is a good process for trying to speed up code run time?
Big picture steps to efficiency:
Use a compiled language
Use a strongly-typed language
Choose algorithms wisely
Choose data types wisely
Avoid unnecessary memory allocations
Arrange memory accesses to reduce cache misses
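As an illustration of the last two points, here is a sketch comparing two memory-access orders when summing a matrix (function names are made up). Julia stores arrays column-major, so iterating down columns touches contiguous memory:

```julia
# Row-major traversal: each step jumps size(A,1) elements in memory
function sum_rowmajor(A)
    s = zero(eltype(A))
    for i in 1:size(A, 1), j in 1:size(A, 2)
        s += A[i, j]
    end
    return s
end

# Column-major traversal: contiguous accesses, far fewer cache misses
function sum_colmajor(A)
    s = zero(eltype(A))
    for j in 1:size(A, 2), i in 1:size(A, 1)
        s += A[i, j]
    end
    return s
end

A = rand(2000, 2000)
sum_rowmajor(A) ≈ sum_colmajor(A)   # same answer; the column-major loop is faster
```

Benchmarking both (e.g., with BenchmarkTools.jl) on a large matrix makes the cache effect visible directly.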
Implementation details for code efficiency (assuming JIT language like Julia or JAX)
Organize code into small functions
Avoid type instability
Untyped global variables
Containers (e.g., arrays) of abstract types
structs with fields of abstract types
Avoid unnecessary memory allocations
Not taking advantage of fusing and broadcasting
Making copies instead of using a view (e.g., array[1:5,:] instead of view(array,1:5,:))
Many small allocations on the heap (instead use StaticArrays.jl)
Organize functions into a package (so it only needs to be precompiled once)
Adding annotations that allow for compiler optimizations (e.g., @inbounds, @fastmath, @simd, @turbo) but only when appropriate
Avoid unnecessary use of strings or string interpolation
Write code so that it can be parallelized in the future
See Performance Tips for more details.
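A sketch of three of the pitfalls listed above, in one place (all names are illustrative):

```julia
# 1. Type instability: the accumulator changes type mid-loop.
function mean_unstable(x)
    total = 0                  # starts as an Int, even if x holds Float64s
    for v in x
        total += v             # total may switch type here
    end
    return total / length(x)
end

function mean_stable(x)
    total = zero(eltype(x))    # accumulator matches x's element type
    for v in x
        total += v
    end
    return total / length(x)
end

# 2. Copies vs. views: slicing with [] allocates; view() does not.
A = rand(1000, 1000)
copy_sum = sum(A[1:5, :])          # allocates a 5x1000 copy of the slice
view_sum = sum(view(A, 1:5, :))    # reuses A's memory; same result

# 3. Fusing broadcasts: dotted operations combine into a single loop with
#    one output allocation, instead of one temporary array per operation.
x = rand(10_000)
y = @. 2x^2 + 3x + 1               # one fused loop, one allocation
```

Tools like `@code_warntype` (for type instability) and `@time` or BenchmarkTools.jl (for allocations) help confirm whether a given function actually suffers from these issues.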
Is there a good way to verify the effectiveness of a classification model aside from checking which points it identifies correctly?
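Yes: common practice is to look beyond raw accuracy at a confusion matrix and the metrics derived from it, computed on held-out (e.g., cross-validation) data rather than the training set. A minimal sketch with made-up labels:

```julia
# Hypothetical true labels and model predictions (1 = positive class)
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]

# Confusion-matrix counts
tp = sum((y_pred .== 1) .& (y_true .== 1))   # true positives
fp = sum((y_pred .== 1) .& (y_true .== 0))   # false positives
fn = sum((y_pred .== 0) .& (y_true .== 1))   # false negatives
tn = sum((y_pred .== 0) .& (y_true .== 0))   # true negatives

prec = tp / (tp + fp)    # precision: of predicted positives, how many are real?
rec  = tp / (tp + fn)    # recall: of real positives, how many were found?
f1   = 2 * prec * rec / (prec + rec)
# Here tp = 3, fp = 1, fn = 1, tn = 3, so precision = recall = f1 = 0.75
```

Precision and recall matter especially for imbalanced classes (e.g., rare transients), where accuracy alone can look deceptively high.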
Random
Does Scrum stand for anything?
No. Scrum is not an acronym; the name comes from the scrum formation in rugby.
Setup
Built with Julia 1.11.5 and
DataFrames 1.7.0
HypertextLiteral 0.9.5
PlutoTeachingTools 0.3.1
PlutoUI 0.7.61
To run this tutorial locally, download this file and open it with Pluto.jl.