Content from Why Reproducibility Matters
Last updated on 2026-03-10
Estimated time: 25 minutes
Motivation: Why should you care about reproducible research?
Most astronomers agree that reproducible research is “a good thing”.
Few astronomers change how they work because of it.
This lesson starts by being honest about why that is — and why, despite this, reproducible practices are often worth it for you personally, especially if you are a PhD student or early‑career researcher (ECR).
We will talk briefly about benefits to science, but we will focus mainly on short‑term, selfish reasons that tend to matter more day‑to‑day.
Reproducibility is not (just) about being virtuous
You will often hear that reproducible research is important because it:
- improves trust in science
- allows others to verify your results
- makes research more reliable in the long term
All of this is true — but for many researchers, these benefits feel:
- distant
- abstract
- misaligned with immediate career pressures
Sarah Wild, in an article for Physics Today, describes the concern many astronomers have about reproducibility and a potential erosion of trust in science. One of the issues she points out is that our current publication systems are built around “paper and letter” communication, rather than being designed to include the publication of data, methodology as code, and results.
PhDs and ECRs are usually evaluated on:
- papers
- citations
- finishing projects on time
- surviving supervisor or project changes
So let’s reframe the question: What does reproducibility do for you, right now?
Selfish reason #1: Reproducibility saves you time
Many researchers first encounter reproducibility as a burden.
In practice, the opposite is often true.
Reproducible workflows make it easier to:
- pause and restart work after months away
- recover from broken laptops or lost files
- return to a project after a supervisor, postdoc, or collaborator leaves
- debug your own results
Additionally, a reproducible workflow makes it easier to incorporate new data or new analysis techniques than a non-reproducible one does. This means that any future project which shares some commonality with your previous projects starts with a head-start and a lower barrier to entry.
For PhD students in particular, this matters because:
- projects routinely span multiple years
- interruptions are common (teaching, observing, writing, life)
- memory is unreliable, documentation is not
- future projects will likely build on your thesis work
A recurring finding in studies of early‑career researchers is that reproducible practices reduce re‑work and dead ends, even when they add a small amount of effort up front.
Remember: You are the first and most frequent reuser of your own code.
Reflection
Have you ever failed to reproduce your own result after a few months?
This is about normalising failure, not blaming individuals.
Selfish reason #2: Reproducible work is easier to defend
ML and data‑driven results are increasingly scrutinised after publication. When your work is questioned, reproducibility acts as protection. If you can point to:
- versioned code
- documented data splits
- clearly stated assumptions and limitations
then criticism becomes:
- technical, not personal
- something you can respond to, not panic about
For PhDs and ECRs, this matters because:
- you often have less institutional protection
- you are more exposed to reviewer and community criticism
- you may no longer be around when questions arise
Result: Reproducibility shifts risk from “who did this?” to “what does the evidence show?”
Selfish reason #3: Reproducible work is cited more
This is one of the few incentives with quantitative evidence.
Colavizza et al. (2024) found that:
- the early release of a publication as a preprint correlates with a significant positive citation advantage of about 20.2%,
- sharing data in an online repository correlates with a smaller yet still positive citation advantage of 4.3%.
They did not see a significant citation advantage for papers which shared code, but note that “Further research is needed on additional or alternative measures of impact beyond citations”.
This effect has been shown even when controlling for:
- journal impact
- field differences
- publication year
The takeaway is not “do this to game citations”, but: Visibility and reuse tend to follow clarity and accessibility.
It is already normal practice within astronomy to post preprints to the arXiv, and these findings should give us confidence that this is a good practice worth continuing.
In a study by Allen et al. 2018, papers from 2015 were scanned for citations and links to code. Of the 285 unique codes that were used, only 58% offered source code for download - not a great success rate. However, 90% of the hyperlinks to code were still working at the time of the study (three years after publication). The lead author, Alice Allen, oversees the Astrophysics Source Code Library, which you can think of as “an arXiv for code”; it provides a DOI and permalink so that people can cite code.
What happens when reproducibility is missing?
Reproducibility failures are rarely dramatic at first. More often, they look like:
- results that “can’t quite be repeated”
- models that work once but never again
- performance that disappears when reused elsewhere
In astronomy and ML‑based science, documented failure modes include:
- random samples that cannot be regenerated
- machine‑learning models that rely on subtle data leakage
- results that vanish when preprocessing is done correctly
In many published cases:
- no fraud was involved
- no one acted in bad faith
- the problem was simply undocumented decisions
For ECRs, the risk is asymmetric:
- the cost of failure is personal and immediate
- the benefit of cutting corners is often short‑lived
Reproducibility as career insurance
It is reasonable to think of reproducibility as a form of insurance. You invest a small amount of effort:
- documenting choices
- fixing randomness
- structuring workflows
In return, you reduce the chance of:
- losing months of work
- being unable to answer basic questions about your own results
- inheriting an unfixable mess (or becoming one)
Reproducibility is insurance you pay for up front — instead of with stress later.
Take one minute to think (no sharing required):
- What is one thing in your current workflow that only you understand?
- How confident are you that you could rerun your main result in a year?
Take-away: “Reproducible research is not about being perfect, it’s about making your future life easier.”
Content from What do we mean by reproducibility?
Estimated time: 25 minutes
What do we mean by reproducibility?
The word “reproducibility” is used in many ways, often imprecisely. In practice, misunderstandings about what kind of reproducibility is being claimed are a major source of confusion, frustration, and irreproducible results in computational astronomy and ML‑based research. This section introduces a small set of definitions that are widely used across computational sciences and astronomy, and shows how they apply in practice.
Three related but distinct concepts
Many communities now distinguish between three ideas:
- Repeatability
- The same researchers can rerun the same code, on the same data, in the same environment, and get the same result.
- Reproducibility
- A different researcher (or you, later) can rerun the same code, on the same data, and get the same result.
- Replicability
- A different analysis, dataset, or method leads to a consistent scientific conclusion.
In astronomy, most failures happen at the repeatability level - before we even reach reproducibility.
Remember: If you cannot repeat your own result, no one else can reproduce it.
Why this matters in ML‑based astronomy
Machine learning makes reproducibility harder than many traditional analyses because it introduces:
- randomness (initialisation, data splits, stochastic optimisation)
- long, implicit preprocessing chains
- complex software stacks
- performance claims that depend on subtle choices
Common astronomy‑specific examples include:
- a transient classifier whose accuracy depends on an undocumented random seed
- a photometric redshift model trained on one survey and reused on another without retraining
- a simulation‑trained model whose domain of validity is unclear
These issues are discussed explicitly in astronomy‑adjacent reproducibility training materials such as the MPI‑Astronomy reproducibility workshop.
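As a concrete illustration (not taken from these workshop materials), here is a minimal sketch of making a random train/test split repeatable by fixing and recording the seed. The catalogue size and file names are placeholders:

```python
import numpy as np

SEED = 42  # record this alongside your results, not just inside the script

rng = np.random.default_rng(SEED)
indices = rng.permutation(100)            # e.g. 100 objects in a catalogue
train_idx, test_idx = indices[:80], indices[80:]

# Saving the split itself removes any doubt about whether it can be regenerated
np.save("train_idx.npy", train_idx)
np.save("test_idx.npy", test_idx)
```

Rerunning with the same seed always regenerates the same split; saving the index arrays makes the split recoverable even if the code later changes.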
Classifying reproducibility claims
Read the following statements and decide which category they belong to: repeatable, reproducible, or replicable.
No discussion yet; just decide for yourself.
- “I reran my notebook on my laptop and got the same plot.”
- “My collaborator ran my code from GitHub and reproduced all figures.”
- “Another group used a different survey and found the same astrophysical trend.”
After one minute, discuss briefly with a neighbour.
This activity usually reveals that many people conflate reproducibility and replicability.
Reproducibility is a spectrum, not a switch
In real research, reproducibility is rarely all‑or‑nothing.
For example:
- You might share code but not data
- You might share data but not preprocessing
- You might document assumptions but not environments
This is normal. The goal is not perfection, but making explicit what can and cannot be reproduced. The Turing Way emphasises this framing, especially for computational and ML‑based research.
Remember: Partial reproducibility is better than implicit irreproducibility.
What reproducibility does not require
A common misconception is that reproducibility means:
- your code must be beautiful
- your results must be flawless
- everything must be public forever
None of these are true. Reproducibility requires clarity, not elegance. You can be reproducible while:
- using imperfect code
- reporting negative or null results
- restricting access to sensitive data (with clear conditions)
Examples of good reproducible practice in astronomy
The following are widely cited examples of astronomy communities explicitly designing for reproducibility:
Astropy Collaboration
Astropy provides a reproducible, community‑maintained software ecosystem with explicit versioning, testing, and citation practices.

IVOA standards and Virtual Observatory workflows
The International Virtual Observatory Alliance (IVOA) has implemented interoperability and metadata standards that strongly align with FAIR and reproducibility principles.

Reproducible ML workflows in astronomy training
MPI‑Astronomy and similar groups explicitly teach reproducible ML pipelines, including environment capture and workflow provenance.
These examples show that reproducibility is not hypothetical; it is already embedded in successful astronomy infrastructure.
Identifying reproducibility gaps
Think about your own current project. Silently answer the following:
- Could you rerun your main result today?
- Could someone else rerun it using your description?
- Which part would fail first?
Write down one concrete gap (e.g. “train/test split not saved”, “data cleaning undocumented”).
No sharing required.
This activity is intentionally non‑judgmental and reflective.
Reproducibility versus performance in ML papers
In ML‑based science, performance claims are often treated as evidence. That is to say, model performance (the correlation of outputs with “known” results) is treated as a proxy for scientific validity (the ability to understand or explain a phenomenon). This is the classic correlation‑versus‑causation confusion.
However, large‑scale reviews have shown that:
- subtle methodological errors such as data leakage are widespread, leading to an over-reporting of model performance
- claimed improvements often disappear when analyses are reproduced correctly (e.g. Kapoor and Narayanan, 2023)
Performance does not imply physical understanding. A classifier that distinguishes galaxies from stars with high accuracy does not necessarily:
- encode meaningful morphology
- generalize across surveys
- respect physical invariants
It may instead exploit:
- PSF differences
- survey depth artefacts
- preprocessing quirks
This is not an argument against ML. It is an argument for clear, reproducible evidence when ML models are used to support scientific claims. Reproducible results are easier to interrogate, and lead to a higher confidence in the reported outcomes.
Key takeaways
- Reproducibility, repeatability, and replicability are related but distinct
- Most failures occur at the repeatability stage
- ML increases the need for explicit documentation
- Reproducibility is about clarity, not perfection
- If you do not state what can be reproduced, readers will assume nothing can.
Content from Data, models, and FAIRness in ML astronomy
Estimated time: 25 minutes
Data, models, and FAIR practices in ML astronomy
Machine learning workflows in astronomy produce more than papers. They produce:
- training datasets
- labels and annotations
- trained models
- derived catalogues
- complex preprocessing pipelines
If these outputs are not shared in a FAIR way, ML results become difficult or impossible to reuse, validate, or build upon.
This section focuses specifically on FAIR data and model sharing practices, and how they apply to ML‑based astronomy.
What does FAIR mean in astronomy?
FAIR stands for:
- Findable
- Accessible
- Interoperable
- Reusable
The FAIR principles were introduced by Wilkinson et al. (2016) and are now widely adopted across scientific domains. In astronomy, FAIR has a concrete and well‑established interpretation through:
- Virtual Observatory infrastructure
- IVOA standards
- long‑lived data archives
An overview of how FAIR maps onto astronomical practice is given by O’Toole and Tocknell (2022).
Summary: FAIR is about making research outputs usable by humans and machines.
Astronomy is close to FAIR by default
Astronomy is often described as “world‑leading” in data stewardship. This is largely true because:
- most survey data are archived
- metadata standards are well developed
- access is usually long‑term
IVOA standards were, in practice, implementing FAIR‑like ideas before the FAIR principles were formally articulated. As a result:
- raw observational data are often FAIR
- catalogues linked to publications are often FAIR
- archive‑level metadata is usually strong
However, ML workflows introduce new outputs that are often not FAIR.
Where ML workflows break FAIRness
In ML‑based astronomy, FAIR failures usually occur downstream of the archive. Common examples include:
- private training sets derived from public data
- labels created by individuals and never shared
- preprocessing scripts that are undocumented
- trained models released without context
- catalogues published without provenance
These issues are explicitly identified in FAIR guidance for astronomy, which emphasises that provenance, processing history, and metadata are essential for reuse (O’Toole and Tocknell, 2022).
Remember: FAIR does not stop at the telescope. It extends through the full analysis pipeline.
FAIR applies to models, not just data
A common misconception is that FAIR applies only to raw data. In ML astronomy, FAIR should apply to:
- training datasets
- labels and annotations
- feature representations
- trained models
- evaluation datasets
If a trained model is shared without:
- its training data description
- preprocessing steps
- intended scope
then it is not reusable, even if the model file itself is available.
This point is emphasised in astronomy‑focused FAIR discussions, which note that machine‑actionable reuse requires rich metadata and provenance, not just file access (Berriman, 2022).
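One lightweight way to attach this context is a metadata “sidecar” file shipped next to the model. The sketch below is purely illustrative: the field names and values are hypothetical examples, not a community standard.

```python
import json

# Illustrative metadata for a hypothetical trained classifier
metadata = {
    "model": "transient_classifier_v1",
    "training_data": "ZTF public alerts, 2018-2020 (hypothetical example)",
    "preprocessing": "63x63 px difference-image cutouts, min-max scaled",
    "intended_scope": "ZTF-like cadence and depth only",
    "not_tested_on": ["LSST simulations", "shallow surveys"],
    "random_seed": 42,
}

# Write the sidecar next to the model file (e.g. model.pkl + model_metadata.json)
with open("model_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

A plain JSON file like this costs minutes to write, travels with the model file, and answers the questions a reuser would otherwise have to email you about.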
Is your ML output FAIR?
Think about one ML artefact you have produced or used. For that artefact, ask:
- Can someone find it?
- Can they access it?
- Can they understand what it expects as input?
- Can they reuse it without asking you questions?
Write down one missing piece of information.
No sharing required.
FAIR does not mean open at all costs
Another common misconception is that FAIR means “everything must be open”. This is not true. FAIR explicitly allows for:
- access controls
- embargo periods
- restricted data
The requirement is not openness, but clarity:
- how the data can be accessed
- under what conditions
- with what limitations
FAIR guidance for astronomy explicitly separates:
- openness (a policy choice)
- FAIRness (a technical and metadata choice)
Data can be FAIR without being fully open.
Why FAIR matters for ML reuse
ML outputs are frequently reused:
- by collaborators
- by downstream projects
- by future surveys
If ML artefacts are not FAIR:
- reuse requires personal communication
- validation becomes difficult
- results quietly decay over time
FAIR practices reduce this fragility by:
- making assumptions explicit
- preserving provenance
- supporting long‑term reuse
This is especially important in astronomy, where datasets often outlive individual projects and researchers.
One FAIR improvement
Identify one concrete change you could make to improve FAIRness:
- a README
- a data dictionary
- a provenance note
- a model description
Write it down.
That is enough for now.
Key takeaways
- FAIR applies to data, models, and derived products
- Astronomy infrastructure already supports FAIR principles
- ML workflows often fall outside existing FAIR practices
- Small documentation steps dramatically improve reuse
Thought: “If your ML result cannot be reused without emailing you, it is not FAIR yet.”
Content from Ethics of ML & AI in astronomy research
Estimated time: 12 minutes
Ethics of ML and AI in astronomy research
Ethics in ML is often discussed in the context of social or human impacts. In astronomy, ethical issues usually look quieter and more technical. They show up as:
- overconfident claims
- silent misuse of models
- results applied beyond their valid scope
- downstream users trusting outputs more than they should
This section focuses on research ethics in ML‑based astronomy, not policy or regulation.
Ethics in astronomy is mostly about scope and responsibility
Astronomy ML models are rarely deployed directly in high‑stakes decisions. Instead, they are used to:
- classify objects
- generate catalogues
- estimate physical parameters
- prioritise follow‑up observations
The ethical risk is not immediate harm. The risk is that scientific conclusions become detached from their assumptions.
In ML‑based research, ethical problems usually arise when:
- models are reused outside their intended domain
- limitations are undocumented or forgotten
- outputs are treated as ground truth
Good news: Most ethical failures in astronomy ML are failures of scope, not intent.
Automation bias in scientific workflows
Automation bias is the tendency to over‑trust automated outputs, even when they are wrong. In astronomy, this appears when:
- ML classifications are accepted without inspection
- catalogue flags are treated as definitive
- human judgement is quietly removed from the loop
This is especially common when:
- datasets are too large for manual checking
- models perform well on average
- uncertainty is not visible in outputs
Automation bias does not require negligence. It emerges naturally when tools scale faster than scrutiny.
Example: When a model outlives its context
A common pattern in astronomy ML is:
- a model is trained for a specific survey
- it performs well and is published
- it is reused years later on a different dataset
If the original paper does not clearly state:
- training data assumptions
- preprocessing steps
- intended domain of validity
then reuse becomes guesswork. This is an ethical issue because:
- downstream users are misled by omission
- incorrect results propagate silently
- responsibility becomes unclear
Remember: “Silence about limitations is not neutral.”
Who is your model for?
Think about one ML model you have built or used.
Silently answer:
- What dataset was it trained on?
- What dataset was it validated on?
- Where would you hesitate to apply it?
Write down one boundary you would not cross.
No sharing required.
Over-claiming is an ethical failure
In ML‑based science, performance numbers are often used to justify scientific claims. Problems arise when:
- predictive accuracy is treated as physical understanding
- benchmark improvements are interpreted as theoretical progress
- limitations are buried in appendices or not stated at all
Overclaiming does not require exaggeration. It can occur simply by:
- failing to state uncertainty
- omitting evaluation caveats
- allowing readers to assume too much
In astronomy, this is particularly risky because ML outputs are often reused far beyond their original context.
Transparency is an ethical safeguard
Ethical ML practice in astronomy is mostly about making assumptions visible. This includes:
- documenting training data provenance
- stating what the model was designed to do
- stating what it was not tested on
- clarifying how outputs should be interpreted
This aligns closely with reproducibility and FAIR practices:
- ethics, reproducibility, and reuse reinforce each other
- none of them work in isolation
Best practice: Transparency is the ethical minimum, not an optional extra.
Ethics without blame
It is important to be explicit about what this section is not saying. Ethical issues in ML astronomy usually:
- are not caused by bad actors
- are not the result of incompetence
- arise from complex systems and incentives
Most problems emerge because:
- ML scales faster than documentation
- models are reused in good faith
- assumptions fade over time
Framing ethics as a systems issue, not a personal failure, is essential for changing practice.
One ethical improvement
Think of one small change you could make when reporting results that rely on ML/AI:
- a clearer limitations paragraph
- a model scope statement
- a warning in documentation
- an explicit “not tested on” list
Write it down.
That is enough.
Key takeaways
- Ethical issues in astronomy ML are usually about scope and reuse
- Automation bias can occur even with high‑performing models
- Overclaiming often happens through omission, not exaggeration
- Transparency is the most effective ethical safeguard
Most ethical failures in ML astronomy happen when tools are trusted more than they are understood.
Content from Practical takeaways for astronomers
Estimated time: 12 minutes
Practical takeaways for astronomers
Reproducibility, FAIR practices, and ethical ML use can feel abstract until they are translated into concrete actions. This section focuses on practical minimum standards that astronomers can apply immediately, without rewriting their entire workflow or becoming software engineers. The goal is not perfection. The goal is for our work to be clear, honest, and reusable enough.
A minimum reproducibility standard for ML projects
For most astronomy ML projects, a reasonable minimum standard is that:
- you can rerun your own main result
- someone else could rerun it with effort, but without guessing
- limitations are stated explicitly
In practice, this usually means having:
- versioned code
- fixed or recorded randomness
- explicit data splits
- documented preprocessing
- a short description of scope and limitations
Anything beyond this is a bonus.
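Much of this minimum standard can be captured in a small “run manifest” written alongside each main result. This is a sketch, assuming NumPy is the only third-party dependency worth recording; extend the dictionary with whatever your own pipeline depends on:

```python
import json
import platform
import sys

import numpy as np

manifest = {
    "python": platform.python_version(),
    "numpy": np.__version__,
    "platform": platform.platform(),
    "random_seed": 42,              # the seed actually used for the run
    "command": " ".join(sys.argv),  # how the script was invoked
}

with open("run_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```

Committing this file next to the result means that, a year later, answering “what versions did you use?” does not require guessing.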
What to document (even if you share nothing else)
If you only document a few things, make them these:
- Data provenance: where the data came from, including archive, release, and selection criteria.
- Preprocessing steps: what was done to the data before training, especially filtering, scaling, and feature construction.
- Randomness: whether results depend on random seeds, and whether those seeds were fixed.
- Model scope: what the model was designed to do, and where it was not tested.
This information is often more important than hyperparameter details.
Rule of thumb: Documentation that answers questions is more useful than documentation that looks complete.
FAIR does not require new infrastructure
Many astronomers assume FAIR practices require:
- specialised repositories
- complex metadata schemas
- institutional support
In reality, small steps already help a lot. For example:
- a README.md in a Git repository
- a data dictionary for derived features
- a short “how to reproduce Figure 3” note
- a model description paragraph in the paper
FAIRness improves through clarity, not tooling.
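A data dictionary can be as simple as a three-column CSV checked into the repository. The sketch below uses illustrative column names for a derived catalogue; substitute your own:

```python
import csv

# Illustrative entries for a derived catalogue; names and units are placeholders
rows = [
    ("colour_gr", "mag", "g - r colour, extinction-corrected"),
    ("z_phot", "", "photometric redshift, point estimate"),
    ("z_phot_err", "", "1-sigma uncertainty on z_phot"),
]

with open("data_dictionary.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["column", "unit", "description"])
    writer.writerows(rows)
```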
The five‑minute README
Imagine someone opens your project folder in a year. Write down the headings of a README that would help them.
For example:
- What this project does
- Where the data came from
- How to rerun the main result
- Known limitations
You do not need to write the content now.
Just the headings are enough.
How to be ethically careful without being defensive
Ethical ML practice in astronomy does not require disclaimers everywhere.
It requires:
- stating assumptions
- stating limits
- avoiding implied generality
Helpful phrases include:
- “This model was trained on…”
- “Performance was evaluated only on…”
- “We did not test…”
- “Results should not be interpreted as…”
These are not weaknesses. They are signals of careful science.
When full openness is not possible
Sometimes you cannot share:
- proprietary data
- sensitive observations
- intermediate products
- large files
This does not prevent reproducibility or FAIR alignment. You can still:
- describe access conditions
- share code without data
- provide synthetic examples
- document the full workflow
Silence is the only irreproducible option.
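For instance, when the real catalogue cannot be shared, a small synthetic stand-in with the same column layout lets others run your pipeline end-to-end. A minimal sketch, with illustrative column names:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50  # small enough to ship in the repository

# Random values with the same columns and dtypes as the real (private) catalogue
synthetic = {
    "ra_deg": rng.uniform(0.0, 360.0, n),
    "dec_deg": rng.uniform(-90.0, 90.0, n),
    "mag_r": rng.normal(20.0, 1.5, n),
    "label": rng.integers(0, 2, n),  # stand-in for real classifications
}

np.savez("synthetic_example.npz", **synthetic)
```

The synthetic file will not reproduce your science results, but it makes the workflow itself runnable and testable by anyone.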
Reuse is where most harm (and benefit) happens
Most problems arise after publication, when:
- models are reused on new surveys
- catalogues are taken as ground truth
- assumptions are forgotten
You cannot control all reuse.
You can influence it by:
- writing clear limitations
- choosing careful language
- making uncertainty visible
This protects both downstream users and your future self.
One concrete improvement
Think about your current or next project.
Identify one thing you could improve:
- a clearer scope statement
- a fixed random seed
- a short README
- a note on reuse limitations
Write it down. Do not optimise it. Just commit to doing that one thing.
Content from Wrap‑up and next steps
Estimated time: 20 minutes
One‑page checklist: Reproducible, FAIR, and ethical ML in astronomy
Use this as a quick self‑check for your next ML project. You do not need to tick every box to start. Ticking a few is already progress.
1. Reproducibility (can someone rerun this?)
Minimum goal: someone else could rerun this without guessing.
2. Data provenance (where did this come from?)
Minimum goal: a reader understands what data went in, and why.
3. FAIR practices (is this reusable?)
Findable
Accessible
Interoperable
Reusable
Minimum goal: reuse does not require emailing you.
4. Models and outputs (what exactly is being shared?)
Minimum goal: users know what the model was built for.
5. Ethical safeguards (are assumptions visible?)
Minimum goal: downstream users are not misled by omission.
Wrap‑up and next steps
This workshop has focused on a simple idea:
Machine learning does not change what good astronomy looks like. It changes how easy it is to lose track of assumptions.
Across the lessons, we have seen that:
- reproducibility is a practical skill, not a moral stance
- FAIR practices extend beyond raw data to models and workflows
- ethical issues in astronomy ML are usually about scope and reuse
- small, explicit choices prevent large downstream problems
None of these require perfect code or ideal infrastructure. They require clarity.
What you should take away
If you remember only a few things:
- You are the primary user of your own ML work
- Performance numbers are not self‑explanatory
- Models outlive their context unless you stop them
- Silence about limitations is never neutral
Good practice is mostly about writing things down.
What to do next
After this workshop, consider doing just one of the following:
- Add a README to an existing project
- Write a scope and limitations paragraph for a paper
- Fix and record random seeds
- Document a preprocessing step you currently do implicitly
- Add a “not tested on” note to a model description
Choose one thing. Do it once. Let it stick.
How this fits into your career
For PhD students and ECRs especially:
- reproducible work is easier to defend
- clear documentation saves time
- careful scope statements protect you from overclaiming
- FAIR practices increase the lifespan of your work
These benefits show up quickly.
Final reflection (optional)
Take one minute and think about:
- One assumption in your current work that could be made explicit
- One future reader you could help with a single sentence
That is enough for today.
Final Thought: Reproducible and ethical ML is not about doing more work. It is about making your work easier to trust and reuse.
