Data Science Applications for Astronomy

Week 13: Reproducible Research

What should we expect of science?

  • Reproducible

  • Replicable

  • Valid

Historically, different fields of science have used these terms in different ways. As their importance became more widely recognized, the National Academies produced a report that attempts to standardize language.

Reproducibility

"obtaining consistent results using the same input data, computational steps, methods, and code, and conditions of analysis."

-- Reproducibility & Replicability in Science (2019)

  • Focuses on the reliability of the computations and their implementation

  • If a study isn't reproducible, then there are likely errors that should be corrected.

  • (Some subtleties arise in the context of stochastic algorithms; see the sketch after this list)

  • Minimal requirement for a study to be trusted.
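
Below is a minimal Julia sketch of that stochastic-algorithm subtlety; the toy analysis and the seed value are hypothetical. A Monte Carlo result is only exactly reproducible if the random number generator and its seed are recorded along with the code and data.

    using Random

    # Toy stochastic "analysis": a Monte Carlo estimate of pi.
    function estimate_pi(rng, n)
        inside = count(_ -> hypot(rand(rng), rand(rng)) <= 1, 1:n)
        return 4 * inside / n
    end

    # Two runs agree exactly only because they use the same RNG algorithm and seed;
    # rerunning without recording these gives a slightly different answer.
    rng1 = MersenneTwister(2019)
    rng2 = MersenneTwister(2019)
    estimate_pi(rng1, 10^6) == estimate_pi(rng2, 10^6)   # true

    # Note: the default random number stream can change between Julia versions,
    # so packages such as StableRNGs.jl are often used when cross-version
    # reproducibility matters.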

Replicability

"obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data”

–- Reproducibility & Replicability in Science (2019)

  • Robustness of a scientific conclusion, given the researcher's choices (e.g., definition of the sample, analysis method), but allowing for natural variations in the data.

  • Even if a study isn't replicable, the results could still be of high quality and the study could still be a valuable contribution to the scientific literature. E.g.,

    • Collecting data is hard/expensive. An initial study based on a small sample size may hint at a finding that is not supported once a larger dataset is collected.

    • If two studies initially appear to be in disagreement, then a detailed reading of their methods could help someone figure out what choices led to the difference.

Validity

"obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data”

–- Reproducibility & Replicability in Science (2019)

  • Robustness of a scientific conclusion

Making research replicable & valid is very hard!

"when a researcher transparently reports a study and makes available the underlying digital artifacts, such as data and code, the results should be computationally reproducible. In contrast, even when a study was rigorously conducted according to best practices, correctly analyzed, and transparently reported, it may fail to be replicated."

-- Reproducibility & Replicability in Science (2019)

Common Barriers to Reproducibility

  • Inadequate recordkeeping (e.g., failing to archive data & metadata)

  • Limited availability of data & metadata (e.g., not sharing data)

  • Obsolescence of data (e.g., glass plates, digital media, file formats,...)

  • Obsolescence of code (e.g., programs, libraries, compilers, OS,...)

  • Flaws in attempt to replicate (e.g., lack of expertise, failure to follow protocols)

  • Barriers in the culture of research: resources & incentives

How is astronomy doing?

Good

  • Federally funded observatories (and many larger private ones) have archives for their data.

  • Institutional & discipline-specific services for archiving data products:

    • ScholarSphere & Data Commons (Penn State)

    • Zenodo (CERN)

    • Dataverse (Harvard)

    • SciServer (JHU)

  • The FITS format has been around since the 1970s and has been standardized since 1981.

  • Programming languages used for Data Science (e.g., Julia, Python, R) incorporate package managers (see the sketch after this list)

  • Funding agencies & AAS journals increasingly encourage archiving data and providing the "data behind the figures".
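
For example, Julia's built-in package manager can record the exact versions of every dependency; a minimal sketch (the project name MyAnalysis is hypothetical):

    using Pkg

    Pkg.activate("MyAnalysis")        # project-specific environment; writes Project.toml there
    Pkg.add(["CSV", "DataFrames"])    # exact versions of these packages (and all of their
                                      # dependencies) are recorded in Manifest.toml

    # Committing Project.toml and Manifest.toml alongside the analysis code lets
    # someone else recreate the same environment with:
    #   Pkg.activate("MyAnalysis"); Pkg.instantiate()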

Not-so-good

  • Smaller, private observatories are less likely to have funding to archive their data

  • Sharing and archiving of higher-level data products, and of the metadata and documentation necessary to make use of them, depend largely on the effort and goodwill of individual research groups.

  • Large datasets often need performant file formats that have yet to prove their longevity (e.g., HDF5 only became popular around 2000)

  • Package managers are great when everything works smoothly, but can be painful when different dependencies have conflicting requirements.

    • Most packages are maintained by a single research group. If one person is busy, graduates, or changes fields, then the package may no longer be maintained.

    • Most computational R and Python packages rely on underlying C/C++ or Fortran code, which in turn relies on Makefiles customized by hand for different architectures.

  • Making research reproducible takes serious time and funding. When there are finite resources, difficult choices have to be made.

Challenges

  • When funding gets tight, following "best practices" is often the first thing to go.

  • Open-source software was built in a culture of trust. An environment of constant security threats discourages sharing.

Common Barriers to Replicability

  • Human error (typically unintentional)

  • Misuse of statistical methods

  • Publication bias

  • Inadequate experimental design

  • Inadequate reporting of study protocols

  • Incentive system that encourages "significant" results

Failure to Replicate can lead to Scientific Progress!

  • Different research groups can make different, reasonable choices

  • One (or more) choices affect results

  • Subsequent investigation identifies which choice(s) were responsible for the different outcomes

  • Only works if both groups precisely document their choices.

Who is reproducing & replicating research?

  • Original investigator(s) reproducing their own results to convince themselves (most common)

  • Original investigator(s) reproducing their own results to convince others (e.g., collaborators, other scientists in the field, or industry/government), particularly if a result is highly surprising or has significant ramifications

  • Different investigators (potentially from the same or a different lab) may attempt to replicate a study using a dataset they are collecting as a stepping stone in their research process.

  • Different investigators may try to build on a previous study, not succeed, and then decide to try to replicate the previous study to identify why they didn't succeed.

  • "Adversarial Collaborations" can be extremely effective.

  • Maybe no one.

Publication bias

  • Publishing practices vary substantially by field

  • Astronomy is unusual:

    • Relatively few well-respected astronomy journals

    • Each of those has a high acceptance rate.

    • Sometimes astronomers publish in multi-disciplinary journals (Nature, Science, Proceedings of the National Academy of Sciences) with lower acceptance rates.

  • Many other fields (e.g., biology, economics) have many respected journals, most with low acceptance rates

Question:

How do differences in acceptance rates affect incentives and lead to publication bias?

Strategies to make your work reproducible

  • Make input data publicly available (when allowed & ethical)

  • Use open-source software for analysis

  • Use a package manager to completely specify the languages, libraries & packages used.

  • Provide a container image

  • Version control source code and scripts

  • Publish results, tables & figures that were generated by scripts (see the sketch after this list)

  • For complex calculations, use workflow management software

  • Make code used to generate results public

  • Archive code & data

  • Provide sufficient documentation for others to reproduce calculations.

  • Encourage a team to replicate your results from the documentation you've provided.
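
As a small illustration of the scripted-figures and "data behind the figures" strategies above, here is a hypothetical Julia sketch (the file names and the analysis itself are placeholders) that writes a figure and its underlying table from one script, so both can be archived and regenerated:

    using DataFrames, CSV, Plots

    # Hypothetical analysis result: a small table of derived values.
    results = DataFrame(x = 1:10, y = (1:10) .^ 2)

    # Save the data behind the figure alongside the figure itself,
    # so both can be regenerated (and archived) from this one script.
    CSV.write("fig1_data.csv", results)
    savefig(plot(results.x, results.y; xlabel = "x", ylabel = "y"), "fig1.png")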

Tools to automate workflows

  • Build tools: make, cmake (see the sketch after this list)

  • Package managers

  • Scientific workflows: Snakemake, Galaxy, Nextflow, BigDataScript, ...

  • Containers: Docker, Apptainer,...

  • Example scripts for code/data behind figures in AAS journals: Showyourwork
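
The core idea behind build tools like make can be sketched in a few lines of Julia (file names are hypothetical, continuing the example above): rerun a step only when its output is missing or older than its inputs.

    # Rebuild a target only if it is missing or older than any of its dependencies.
    outdated(target, deps) = !isfile(target) || any(d -> mtime(d) > mtime(target), deps)

    if outdated("fig1.png", ["make_fig1.jl", "fig1_data.csv"])
        run(`julia --project make_fig1.jl`)   # regenerate the figure from its script
    end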

Dangers of Big Data

  • Multiple testing: Perform many possible tests (explicitly or "by eye") and then report the one that appears to be significant in isolation (see the simulation after this list)

  • \(p\)-hacking: "the practice of collecting, selecting, or analyzing data until a result of statistical significance is found" (RRiS 2019)

  • Overfitting: Overconfidence in model performance, especially when applied to out-of-sample data

  • Machine learning models: Overreliance on optimizing predictive performance using complex models, rather than prioritizing interpretability and explainability
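
A minimal Julia simulation of the multiple-testing danger above (the number of tests and the sample size are arbitrary): even when every dataset is pure noise, some tests will appear "significant" if enough of them are performed.

    using Random, Statistics, Distributions

    rng = MersenneTwister(42)
    n_tests, n_obs = 1000, 20

    # Each "analysis" tests H0: mean = 0 on data that really is mean-zero noise.
    pvals = map(1:n_tests) do _
        x = randn(rng, n_obs)
        t = mean(x) / (std(x) / sqrt(n_obs))      # one-sample t-statistic
        2 * ccdf(TDist(n_obs - 1), abs(t))        # two-sided p-value
    end

    # Roughly 5% of tests fall below 0.05 by chance alone; reporting only the
    # smallest p-value in isolation would be p-hacking.
    println("fraction with p < 0.05: ", count(<(0.05), pvals) / n_tests)
    println("smallest p-value: ", minimum(pvals))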

Other relevant terms:

Rigor

"the strict application of the scientific method to ensure robust and unbiased experimental design"

-- NIH 2018, via Reproducibility & Replicability in Science (2019)

Reliability

"A predominant focus on the replicability of individual studies is an inefficient way to assure the reliability of scientific knowledge. Rather, reviews of cumulative evidence on a subject, to assess both the overall effect size and generalizability, is often a more useful way to gain confidence in the state of scientific knowledge."

-- Reproducibility & Replicability in Science (2019)

Generalizability

"Generalizability, another term frequently used in science, refers to the extent that results of a study apply in other contexts or populations that differ from the original one."

-- Reproducibility & Replicability in Science (2019)

Transparency

Q&A

Reproducibility

Question:

How many astronomers focus on replicating results? Most astronomy papers I see published are on novel research.

Should more astronomers focus on replicating results?

Or do they just not achieve the front page of sites often?

Question:

How can a researcher assess the validity of a study that has not been replicated yet?

Question:

For reproducibility, replicability, and validity of research, is it the same standards across all of astronomy/astrophysics? Do you know if it extends into physics too?

Question:

Does a more complex model mean more parameters or a firmer scientific model?

Question:

Does knowing two parameters for a basic model serve as a good prior to use when analyzing our data with a more complex model?

Project

Question:

What are some ways that we can make our dashboard look cleaner and more user friendly?

Question:

How long have written reports typically been, historically?

The report/reflection will be graded based on:

  • their contributions to the dashboard project and the contributions of their teammates (1 point),

  • what the next steps would be if there were more time to make improvements to the dashboard (1 point),

  • reflecting on what they learned from the experience (2 points),

  • offering any suggestions for how to make a similar project more valuable in future semesters (1 point), and

  • offering any suggestions for how to make the course more valuable in future semesters (optional, 0 points).

Setup/Helper Code

Built with Julia 1.11.5 and

PlutoTeachingTools 0.3.1
PlutoUI 0.7.61

To run this tutorial locally, download this file and open it with Pluto.jl.