Why Reproducibility Matters


Last updated on 2026-03-10

Motivation: Why should you care about reproducible research?


Most astronomers agree that reproducible research is “a good thing”.

Few astronomers change how they work because of it.

This lesson starts by being honest about why that is — and why, despite this, reproducible practices are often worth it for you personally, especially if you are a PhD student or early‑career researcher (ECR).

We will talk briefly about benefits to science, but we will focus mainly on short‑term, selfish reasons that tend to matter more day‑to‑day.

Reproducibility is not (just) about being virtuous

You will often hear that reproducible research is important because it:

  • improves trust in science
  • allows others to verify your results
  • makes research more reliable in the long term

All of this is true — but for many researchers, these benefits feel:

  • distant
  • abstract
  • misaligned with immediate career pressures

Sarah Wild, in an article for Physics Today, describes the concerns many astronomers have about reproducibility and a potential erosion of trust in science. One of the issues she points out is that our current publication systems are based on “paper and letter” communication rather than being designed to include the publication of data, methodology as code, and results.

PhDs and ECRs are usually evaluated on:

  • papers
  • citations
  • finishing projects on time
  • surviving supervisor or project changes

So let’s reframe the question: What does reproducibility do for you, right now?

Selfish reason #1: Reproducibility saves you time

Many researchers first encounter reproducibility as a burden.

In practice, the opposite is often true.

Reproducible workflows make it easier to:

  • pause and restart work after months away
  • recover from broken laptops or lost files
  • return to a project after a supervisor, postdoc, or collaborator leaves
  • debug your own results

Additionally, it is easier to incorporate new or different data and new analysis techniques into a reproducible workflow than into a non‑reproducible one. This means that any future project which has some commonality with your previous work will have a head start and a lower barrier to entry.

For PhD students in particular, this matters because:

  • projects routinely span multiple years
  • interruptions are common (teaching, observing, writing, life)
  • memory is unreliable, documentation is not
  • future projects will likely build on your thesis work

A recurring finding in studies of early‑career researchers is that reproducible practices reduce re‑work and dead ends, even when they add a small amount of effort up front.

Remember: You are the first and most frequent reuser of your own code.

Discussion

Reflection

Have you ever failed to reproduce your own result after a few months?

Selfish reason #2: Reproducible work is easier to defend

ML and data‑driven results are increasingly scrutinised after publication. When your work is questioned, reproducibility acts as protection. If you can point to:

  • versioned code
  • documented data splits
  • clearly stated assumptions and limitations

then criticism becomes:

  • technical, not personal
  • something you can respond to, not panic about

For PhDs and ECRs, this matters because:

  • you often have less institutional protection
  • you are more exposed to reviewer and community criticism
  • you may no longer be around when questions arise

Result: Reproducibility shifts risk from “who did this?” to “what does the evidence show?”

Selfish reason #3: Reproducible work is cited more

This is one of the few incentives with quantitative evidence.

Colavizza et al. 2024 found that:

  • the early release of a publication as a preprint correlates with a significant positive citation advantage of about 20.2%,
  • sharing data in an online repository correlates with a smaller yet still positive citation advantage of 4.3%.

They did not see a significant citation advantage for papers which shared code, but note that “Further research is needed on additional or alternative measures of impact beyond citations”.

This effect has been shown even when controlling for:

  • journal impact
  • field differences
  • publication year

The takeaway is not “do this to game citations”, but: Visibility and reuse tend to follow clarity and accessibility.

It is already normal practice within astronomy to post preprints to the arXiv, and these findings should give us confidence that this is a good practice that should be continued.

In a study by Allen et al. 2018, papers from 2015 were scanned for citations and links to code. Of the 285 unique codes that were used, only 58% offered source code for download - not a great success rate. However, 90% of the hyperlinks to code were found to still be working at the time of the study (three years post publication). The lead author, Alice Allen, oversees the Astrophysics Source Code Library, which you can think of as “an arXiv for code”; it provides a DOI and permalink for people to cite code.

What happens when reproducibility is missing?

Reproducibility failures are rarely dramatic at first. More often, they look like:

  • results that “can’t quite be repeated”
  • models that work once but never again
  • performance that disappears when reused elsewhere

In astronomy and ML‑based science, documented failure modes include:

  • random samples that cannot be regenerated
  • machine‑learning models that rely on subtle data leakage
  • results that vanish when preprocessing is done correctly

In many published cases:

  • no fraud was involved
  • no one acted in bad faith
  • the problem was simply undocumented decisions

For ECRs, the risk is asymmetric:

  • the cost of failure is personal and immediate
  • the benefit of cutting corners is often short‑lived

Reproducibility as career insurance

It is reasonable to think of reproducibility as a form of insurance. You invest a small amount of effort:

  • documenting choices
  • fixing randomness
  • structuring workflows

In return, you reduce the chance of:

  • losing months of work
  • being unable to answer basic questions about your own results
  • inheriting an unfixable mess (or becoming one)

Reproducibility is insurance you pay for up front — instead of with stress later.

Discussion

Take one minute to think (no sharing required):

  • What is one thing in your current workflow that only you understand?
  • How confident are you that you could rerun your main result in a year?

Take-away: “Reproducible research is not about being perfect, it’s about making your future life easier.”




What do we mean by reproducibility?


The word “reproducibility” is used in many ways, often imprecisely. In practice, misunderstandings about what kind of reproducibility is being claimed are a major source of confusion, frustration, and irreproducible results in computational astronomy and ML‑based research. This section introduces a small set of definitions that are widely used across computational sciences and astronomy, and shows how they apply in practice.

Many communities now distinguish between three ideas:

Repeatability
The same researchers can rerun the same code, on the same data, in the same environment, and get the same result.
Reproducibility
A different researcher (or you, later) can rerun the same code, on the same data, and get the same result.
Replicability
A different analysis, dataset, or method leads to a consistent scientific conclusion.

In astronomy, most failures happen at the repeatability level - before we even reach reproducibility.

Remember: If you cannot repeat your own result, no one else can reproduce it.
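To make the repeatability level concrete, here is a minimal sketch using only Python's standard library (the function name and arguments are illustrative, not taken from any particular pipeline). The split can be regenerated exactly because the seed is an explicit, recorded argument rather than hidden global state:

```python
import random

def split_train_test(n_samples, test_fraction=0.2, seed=42):
    """Deterministically split sample indices into train and test sets.

    Because the seed is an explicit, recorded argument, rerunning this
    function with the same inputs always yields the same split.
    """
    rng = random.Random(seed)  # local generator; avoids hidden global state
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

# Rerunning with the same seed gives the identical split.
train_a, test_a = split_train_test(1000, seed=42)
train_b, test_b = split_train_test(1000, seed=42)
assert train_a == train_b and test_a == test_b
```

If the seed were left implicit (for example, taken from the system clock), even the original author could not repeat the split, and repeatability fails before anyone else is involved.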

Why this matters in ML‑based astronomy

Machine learning makes reproducibility harder than many traditional analyses because it introduces:

  • randomness (initialisation, data splits, stochastic optimisation)
  • long, implicit preprocessing chains
  • complex software stacks
  • performance claims that depend on subtle choices

Common astronomy‑specific examples include:

  • a transient classifier whose accuracy depends on an undocumented random seed
  • a photometric redshift model trained on one survey and reused on another without retraining
  • a simulation‑trained model whose domain of validity is unclear

These issues are discussed explicitly in astronomy‑adjacent reproducibility training materials such as the MPI‑Astronomy reproducibility workshop.

Discussion

Classifying reproducibility claims

Read the following statements and decide which category they belong to: repeatable, reproducible, or replicable.

No discussion yet; just decide for yourself.

  1. “I reran my notebook on my laptop and got the same plot.”
  2. “My collaborator ran my code from GitHub and reproduced all figures.”
  3. “Another group used a different survey and found the same astrophysical trend.”

After one minute, discuss briefly with a neighbour.

Reproducibility is a spectrum, not a switch

In real research, reproducibility is rarely all‑or‑nothing.

For example:

  • You might share code but not data
  • You might share data but not preprocessing
  • You might document assumptions but not environments

This is normal. The goal is not perfection, but making explicit what can and cannot be reproduced. The Turing Way emphasises this framing, especially for computational and ML‑based research.

Remember: Partial reproducibility is better than implicit irreproducibility.

What reproducibility does not require

A common misconception is that reproducibility means:

  • your code must be beautiful
  • your results must be flawless
  • everything must be public forever

None of these are true. Reproducibility requires clarity, not elegance. You can be reproducible while:

  • using imperfect code
  • reporting negative or null results
  • restricting access to sensitive data (with clear conditions)

Examples of good reproducible practice in astronomy

The following are widely cited examples of astronomy communities explicitly designing for reproducibility:

  • Astropy Collaboration
    Astropy provides a reproducible, community‑maintained software ecosystem with explicit versioning, testing, and citation practices.

  • IVOA standards and Virtual Observatory workflows
    The International Virtual Observatory Alliance (IVOA) has implemented interoperability and metadata standards that strongly align with FAIR and reproducibility principles.

  • Reproducible ML workflows in astronomy training
    MPI‑Astronomy and similar groups explicitly teach reproducible ML pipelines, including environment capture and workflow provenance.

These examples show that reproducibility is not hypothetical; it is already embedded in successful astronomy infrastructure.
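As a sketch of what “environment capture” can mean in practice (the function name and package list below are illustrative assumptions, not part of any specific workshop's materials), Python's standard library alone is enough to record interpreter and package versions alongside a result:

```python
import platform
import sys
import importlib.metadata

def capture_environment(packages=("numpy", "astropy", "scipy")):
    """Record interpreter, OS, and package versions for provenance.

    Packages that are not installed are recorded as None rather than
    raising, so the snapshot itself can always be taken.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            versions[name] = None
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }
```

Saving this dictionary as JSON next to each figure or trained model is a lightweight step towards provenance, short of full lock files or container images.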

Discussion

Identifying reproducibility gaps

Think about your own current project. Silently answer the following:

  • Could you rerun your main result today?
  • Could someone else rerun it using your description?
  • Which part would fail first?

Write down one concrete gap (e.g. “train/test split not saved”, “data cleaning undocumented”).

No sharing required.

Reproducibility versus performance in ML papers

In ML‑based science, performance claims are often treated as evidence. That is to say, model performance (correlation of output to “known” results) is treated as a proxy for scientific validity (the ability to understand or explain a phenomenon). This is the classic confusion of correlation with causation.

However, large‑scale reviews have shown that:

  • subtle methodological errors such as data leakage are widespread, leading to an over-reporting of model performance
  • claimed improvements often disappear when analyses are reproduced correctly (e.g. Kapoor and Narayanan, 2023)

Performance does not imply physical understanding. A classifier that distinguishes galaxies from stars with high accuracy does not necessarily:

  • encode meaningful morphology
  • generalise across surveys
  • respect physical invariants

It may instead exploit:

  • PSF differences
  • survey depth artefacts
  • preprocessing quirks

This is not an argument against ML. It is an argument for clear, reproducible evidence when ML models are used to support scientific claims. Reproducible results are easier to interrogate, and lead to a higher confidence in the reported outcomes.

Key takeaways

  • Reproducibility, repeatability, and replicability are related but distinct
  • Most failures occur at the repeatability stage
  • ML increases the need for explicit documentation
  • Reproducibility is about clarity, not perfection
  • If you do not state what can be reproduced, readers will assume nothing can.




Data, models, and FAIR practices in ML astronomy


Machine learning workflows in astronomy produce more than papers. They produce:

  • training datasets
  • labels and annotations
  • trained models
  • derived catalogues
  • complex preprocessing pipelines

If these outputs are not shared in a FAIR way, ML results become difficult or impossible to reuse, validate, or build upon.

This section focuses specifically on FAIR data and model sharing practices, and how they apply to ML‑based astronomy.

What does FAIR mean in astronomy?

FAIR stands for:

  • Findable
  • Accessible
  • Interoperable
  • Reusable

The FAIR principles were introduced by Wilkinson et al. (2016) and are now widely adopted across scientific domains. In astronomy, FAIR has a concrete and well‑established interpretation through:

  • Virtual Observatory infrastructure
  • IVOA standards
  • long‑lived data archives

An overview of how FAIR maps onto astronomical practice is given by O’Toole and Tocknell (2022).

Summary: FAIR is about making research outputs usable by humans and machines.

Astronomy is close to FAIR by default

Astronomy is often described as “world‑leading” in data stewardship. This is largely true because:

  • most survey data are archived
  • metadata standards are well developed
  • access is usually long‑term

IVOA standards were, in practice, implementing FAIR‑like ideas before the FAIR principles were formally articulated. As a result:

  • raw observational data are often FAIR
  • catalogues linked to publications are often FAIR
  • archive‑level metadata is usually strong

However, ML workflows introduce new outputs that are often not FAIR.

Where ML workflows break FAIRness

In ML‑based astronomy, FAIR failures usually occur downstream of the archive. Common examples include:

  • private training sets derived from public data
  • labels created by individuals and never shared
  • preprocessing scripts that are undocumented
  • trained models released without context
  • catalogues published without provenance

These issues are explicitly identified in FAIR guidance for astronomy, which emphasises that provenance, processing history, and metadata are essential for reuse (O’Toole and Tocknell, 2022).

Remember: FAIR does not stop at the telescope. It extends through the full analysis pipeline.

FAIR applies to models, not just data

A common misconception is that FAIR applies only to raw data. In ML astronomy, FAIR should apply to:

  • training datasets
  • labels and annotations
  • feature representations
  • trained models
  • evaluation datasets

If a trained model is shared without:

  • its training data description
  • preprocessing steps
  • intended scope

then it is not reusable, even if the model file itself is available.

This point is emphasised in astronomy‑focused FAIR discussions, which note that machine‑actionable reuse requires rich metadata and provenance, not just file access (Berriman, 2022).

Discussion

Is your ML output FAIR?

Think about one ML artefact you have produced or used. For that artefact, ask:

  • Can someone find it?
  • Can they access it?
  • Can they understand what it expects as input?
  • Can they reuse it without asking you questions?

Write down one missing piece of information.

No sharing required.

FAIR does not mean open at all costs

Another common misconception is that FAIR means “everything must be open”. This is not true. FAIR explicitly allows for:

  • access controls
  • embargo periods
  • restricted data

The requirement is not openness, but clarity:

  • how the data can be accessed
  • under what conditions
  • with what limitations

FAIR guidance for astronomy explicitly separates:

  • openness (a policy choice)
  • FAIRness (a technical and metadata choice)

Data can be FAIR without being fully open.

Why FAIR matters for ML reuse

ML outputs are frequently reused:

  • by collaborators
  • by downstream projects
  • by future surveys

If ML artefacts are not FAIR:

  • reuse requires personal communication
  • validation becomes difficult
  • results quietly decay over time

FAIR practices reduce this fragility by:

  • making assumptions explicit
  • preserving provenance
  • supporting long‑term reuse

This is especially important in astronomy, where datasets often outlive individual projects and researchers.

Discussion

One FAIR improvement

Identify one concrete change you could make to improve FAIRness:

  • a README
  • a data dictionary
  • a provenance note
  • a model description

Write it down.

That is enough for now.

Key takeaways

  • FAIR applies to data, models, and derived products
  • Astronomy infrastructure already supports FAIR principles
  • ML workflows often fall outside existing FAIR practices
  • Small documentation steps dramatically improve reuse

Thought: “If your ML result cannot be reused without emailing you, it is not FAIR yet.”




Ethics of ML and AI in astronomy research


Ethics in ML is often discussed in the context of social or human impacts. In astronomy, ethical issues usually look quieter and more technical. They show up as:

  • overconfident claims
  • silent misuse of models
  • results applied beyond their valid scope
  • downstream users trusting outputs more than they should

This section focuses on research ethics in ML‑based astronomy, not policy or regulation.

Ethics in astronomy is mostly about scope and responsibility

Astronomy ML models are rarely deployed directly in high‑stakes decisions. Instead, they are used to:

  • classify objects
  • generate catalogues
  • estimate physical parameters
  • prioritise follow‑up observations

The ethical risk is not immediate harm. The risk is that scientific conclusions become detached from their assumptions.

In ML‑based research, ethical problems usually arise when:

  • models are reused outside their intended domain
  • limitations are undocumented or forgotten
  • outputs are treated as ground truth

Good news: Most ethical failures in astronomy ML are failures of scope, not intent.

Automation bias in scientific workflows

Automation bias is the tendency to over‑trust automated outputs, even when they are wrong. In astronomy, this appears when:

  • ML classifications are accepted without inspection
  • catalogue flags are treated as definitive
  • human judgement is quietly removed from the loop

This is especially common when:

  • datasets are too large for manual checking
  • models perform well on average
  • uncertainty is not visible in outputs

Automation bias does not require negligence. It emerges naturally when tools scale faster than scrutiny.

Example: When a model outlives its context

A common pattern in astronomy ML is:

  • a model is trained for a specific survey
  • it performs well and is published
  • it is reused years later on a different dataset

If the original paper does not clearly state:

  • training data assumptions
  • preprocessing steps
  • intended domain of validity

then reuse becomes guesswork. This is an ethical issue because:

  • downstream users are misled by omission
  • incorrect results propagate silently
  • responsibility becomes unclear

Remember: “Silence about limitations is not neutral.”

Discussion

Who is your model for?

Think about one ML model you have built or used.

Silently answer:

  • What dataset was it trained on?
  • What dataset was it validated on?
  • Where would you hesitate to apply it?

Write down one boundary you would not cross.

No sharing required.

Over-claiming is an ethical failure

In ML‑based science, performance numbers are often used to justify scientific claims. Problems arise when:

  • predictive accuracy is treated as physical understanding
  • benchmark improvements are interpreted as theoretical progress
  • limitations are buried in appendices or not stated at all

Overclaiming does not require exaggeration. It can occur simply by:

  • failing to state uncertainty
  • omitting evaluation caveats
  • allowing readers to assume too much

In astronomy, this is particularly risky because ML outputs are often reused far beyond their original context.

Transparency is an ethical safeguard

Ethical ML practice in astronomy is mostly about making assumptions visible. This includes:

  • documenting training data provenance
  • stating what the model was designed to do
  • stating what it was not tested on
  • clarifying how outputs should be interpreted

This aligns closely with reproducibility and FAIR practices:

  • ethics, reproducibility, and reuse reinforce each other
  • none of them work in isolation

Best practice: Transparency is the ethical minimum, not an optional extra.

Ethics without blame

It is important to be explicit about what this section is not saying. Ethical issues in ML astronomy usually:

  • are not caused by bad actors
  • are not the result of incompetence
  • arise from complex systems and incentives

Most problems emerge because:

  • ML scales faster than documentation
  • models are reused in good faith
  • assumptions fade over time

Framing ethics as a systems issue, not a personal failure, is essential for changing practice.

Discussion

One ethical improvement

Think of one small change you could make when reporting results that rely on ML/AI:

  • a clearer limitations paragraph
  • a model scope statement
  • a warning in documentation
  • an explicit “not tested on” list

Write it down.

That is enough.

Key takeaways

  • Ethical issues in astronomy ML are usually about scope and reuse
  • Automation bias can occur even with high‑performing models
  • Overclaiming often happens through omission, not exaggeration
  • Transparency is the most effective ethical safeguard

Most ethical failures in ML astronomy happen when tools are trusted more than they are understood.




Practical takeaways for astronomers


Reproducibility, FAIR practices, and ethical ML use can feel abstract until they are translated into concrete actions. This section focuses on practical minimum standards that astronomers can apply immediately, without rewriting their entire workflow or becoming software engineers. The goal is not perfection. The goal is for our work to be clear, honest, and reusable enough.

A minimum reproducibility standard for ML projects

For most astronomy ML projects, a reasonable minimum standard is that:

  • you can rerun your own main result
  • someone else could rerun it with effort, but without guessing
  • limitations are stated explicitly

In practice, this usually means having:

  • versioned code
  • fixed or recorded randomness
  • explicit data splits
  • documented preprocessing
  • a short description of scope and limitations

Anything beyond this is a bonus.
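Most of the minimum standard above can be met with a few lines of code. As an illustrative sketch (the helper and field names are assumptions, not a prescribed format), a small JSON record stored next to each result captures the decisions that most often block reruns:

```python
import json
from datetime import datetime, timezone

def write_run_record(path, seed, data_release, scope_note):
    """Save the choices needed to rerun (and to defend) a result."""
    record = {
        "created": datetime.now(timezone.utc).isoformat(),
        "random_seed": seed,
        "data_release": data_release,         # e.g. survey name and data release
        "scope_and_limitations": scope_note,  # what this result does NOT cover
    }
    with open(path, "w") as fh:
        json.dump(record, fh, indent=2)
    return record
```

A record like this answers most “how did you get this?” questions without relying on memory, and takes seconds to write at the time the result is produced.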

What to document (even if you share nothing else)

If you only document a few things, make them these:

  • Data provenance
    • Where the data came from, including archive, release, and selection criteria.
  • Preprocessing steps
    • What was done to the data before training, especially filtering, scaling, and feature construction.
  • Randomness
    • Whether results depend on random seeds, and whether those seeds were fixed.
  • Model scope
    • What the model was designed to do, and where it was not tested.

This information is often more important than hyperparameter details.

Rule of thumb: Documentation that answers questions is more useful than documentation that looks complete.

FAIR does not require new infrastructure

Many astronomers assume FAIR practices require:

  • specialised repositories
  • complex metadata schemas
  • institutional support

In reality, small steps already help a lot. For example:

  • a README.md in a Git repository
  • a data dictionary for derived features
  • a short “how to reproduce Figure 3” note
  • a model description paragraph in the paper

FAIRness improves through clarity, not tooling.

Discussion

The five‑minute README

Imagine someone opens your project folder in a year. Write down the headings of a README that would help them.

For example:

  • What this project does
  • Where the data came from
  • How to rerun the main result
  • Known limitations

You do not need to write the content now.

Just the headings are enough.

How to be ethically careful without being defensive

Ethical ML practice in astronomy does not require disclaimers everywhere.

It requires:

  • stating assumptions
  • stating limits
  • avoiding implied generality

Helpful phrases include:

  • “This model was trained on…”
  • “Performance was evaluated only on…”
  • “We did not test…”
  • “Results should not be interpreted as…”

These are not weaknesses. They are signals of careful science.

When full openness is not possible

Sometimes you cannot share:

  • proprietary data
  • sensitive observations
  • intermediate products
  • large files

This does not prevent reproducibility or FAIR alignment. You can still:

  • describe access conditions
  • share code without data
  • provide synthetic examples
  • document the full workflow

Silence is the only irreproducible option.

Reuse is where most harm (and benefit) happens

Most problems arise after publication, when:

  • models are reused on new surveys
  • catalogues are taken as ground truth
  • assumptions are forgotten

You cannot control all reuse.

You can influence it by:

  • writing clear limitations
  • choosing careful language
  • making uncertainty visible

This protects both downstream users and your future self.

Discussion

One concrete improvement

Think about your current or next project.

Identify one thing you could improve:

  • a clearer scope statement
  • a fixed random seed
  • a short README
  • a note on reuse limitations

Write it down. Do not optimise it. Just commit to doing that one thing.

A sustainable mindset

Good practice accumulates.

Most reproducible and ethical ML workflows were not built all at once.

They evolved because:

  • small habits stuck
  • mistakes were documented
  • clarity was rewarded

Progress is incremental.

Key takeaways

  • Aim for a clear minimum standard, not perfection
  • Document decisions that affect results
  • Make scope and limitations explicit
  • Small improvements compound over time

Remember: Good (ML) practice is not about doing everything right. It is about making fewer things mysterious.




One‑page checklist: Reproducible, FAIR, and ethical ML in astronomy


Use this as a quick self‑check for your next ML project. You do not need to tick every box to start. Ticking a few is already progress.

1. Reproducibility (can someone rerun this?)

Minimum goal: someone else could rerun this without guessing.

2. Data provenance (where did this come from?)

Minimum goal: a reader understands what data went in, and why.

3. FAIR practices (is this reusable?)

  • Findable
  • Accessible
  • Interoperable
  • Reusable

Minimum goal: reuse does not require emailing you.

4. Models and outputs (what exactly is being shared?)

Minimum goal: users know what the model was built for.

5. Ethical safeguards (are assumptions visible?)

Minimum goal: downstream users are not misled by omission.

6. The five‑minute test

If you stopped working on this today:

If not, write one sentence to fix that.

Remember

  • Clarity beats completeness
  • Partial reproducibility beats silence
  • Small improvements compound

You are not aiming for perfect practice.
You are aiming for fewer mysteries.

Wrap‑up and next steps


This workshop has focused on a simple idea:

Machine learning does not change what good astronomy looks like.
It changes how easy it is to lose track of assumptions.

Across the lessons, we have seen that:

  • reproducibility is a practical skill, not a moral stance
  • FAIR practices extend beyond raw data to models and workflows
  • ethical issues in astronomy ML are usually about scope and reuse
  • small, explicit choices prevent large downstream problems

None of these require perfect code or ideal infrastructure. They require clarity.

What you should take away

If you remember only a few things:

  • You are the primary user of your own ML work
  • Performance numbers are not self‑explanatory
  • Models outlive their context unless you stop them
  • Silence about limitations is never neutral

Good practice is mostly about writing things down.

What to do next

After this workshop, consider doing just one of the following:

  • Add a README to an existing project
  • Write a scope and limitations paragraph for a paper
  • Fix and record random seeds
  • Document a preprocessing step you currently do implicitly
  • Add a “not tested on” note to a model description

Choose one thing. Do it once. Let it stick.

How this fits into your career

For PhD students and ECRs especially:

  • reproducible work is easier to defend
  • clear documentation saves time
  • careful scope statements protect you from overclaiming
  • FAIR practices increase the lifespan of your work

These benefits show up quickly.

Discussion

Final reflection (optional)

Take one minute and think about:

  • One assumption in your current work that could be made explicit
  • One future reader you could help with a single sentence

That is enough for today.

Final Thought: Reproducible and ethical ML is not about doing more work. It is about making your work easier to trust and reuse.