Why Reproducibility Matters


Last updated on 2026-03-10

Motivation: Why should you care about reproducible research?


Most astronomers agree that reproducible research is “a good thing”.

Few astronomers change how they work because of it.

This lesson starts by being honest about why that is — and why, despite this, reproducible practices are often worth it for you personally, especially if you are a PhD student or early‑career researcher (ECR).

We will talk briefly about benefits to science, but we will focus mainly on short‑term, selfish reasons that tend to matter more day‑to‑day.

Reproducibility is not (just) about being virtuous

You will often hear that reproducible research is important because it:

  • improves trust in science
  • allows others to verify your results
  • makes research more reliable in the long term

All of this is true — but for many researchers, these benefits feel:

  • distant
  • abstract
  • misaligned with immediate career pressures

Sarah Wild, in an article for Physics Today, describes the concerns many astronomers have about reproducibility and a potential erosion of trust in science. One of the issues she points out is that our current publication systems are based on “paper and letter” communication rather than being designed to include the publication of data, methodology as code, and results.

PhDs and ECRs are usually evaluated on:

  • papers
  • citations
  • finishing projects on time
  • surviving supervisor or project changes

So let’s reframe the question: What does reproducibility do for you, right now?

Selfish reason #1: Reproducibility saves you time

Many researchers first encounter reproducibility as a burden.

In practice, the opposite is often true.

Reproducible workflows make it easier to:

  • pause and restart work after months away
  • recover from broken laptops or lost files
  • return to a project after a supervisor, postdoc, or collaborator leaves
  • debug your own results

Additionally, it is easier to incorporate new or different data and new analysis techniques into a reproducible workflow than into a non‑reproducible one. This means that any future project which has some commonality with your previous work will have a head start and a lower barrier to entry.

For PhD students in particular, this matters because:

  • projects routinely span multiple years
  • interruptions are common (teaching, observing, writing, life)
  • memory is unreliable, documentation is not
  • future projects will likely build on your thesis work

A recurring finding in studies of early‑career researchers is that reproducible practices reduce re‑work and dead ends, even when they add a small amount of effort up front.

Remember: You are the first and most frequent reuser of your own code.

Discussion

Reflection

Have you ever failed to reproduce your own result after a few months?

Selfish reason #2: Reproducible work is easier to defend

ML and data‑driven results are increasingly scrutinised after publication. When your work is questioned, reproducibility acts as protection. If you can point to:

  • versioned code
  • documented data splits
  • clearly stated assumptions and limitations

then criticism becomes:

  • technical, not personal
  • something you can respond to, not panic about

For PhDs and ECRs, this matters because:

  • you often have less institutional protection
  • you are more exposed to reviewer and community criticism
  • you may no longer be around when questions arise

Result: Reproducibility shifts risk from “who did this?” to “what does the evidence show?”

Selfish reason #3: Reproducible work is cited more

This is one of the few incentives with quantitative evidence.

Colavizza et al. 2024 found that:

  • the early release of a publication as a preprint correlates with a significant positive citation advantage of about 20.2%,
  • sharing data in an online repository correlates with a smaller yet still positive citation advantage of 4.3%.

They did not see a significant citation advantage for papers which shared code, but note that “Further research is needed on additional or alternative measures of impact beyond citations”.

This effect has been shown even when controlling for:

  • journal impact
  • field differences
  • publication year

The takeaway is not “do this to game citations”, but: Visibility and reuse tend to follow clarity and accessibility.

It is already normal practice within astronomy to post preprints to the arXiv, and these findings should give us confidence that this is a good practice that should be continued.

In a study by Allen et al. 2018, papers from 2015 were scanned for citations and links to code. Of the 285 unique codes that were used, only 58% offered source code for download - not a great success rate. However, 90% of the hyperlinks to code were found to still be working at the time of the study (three years post publication). The lead author, Alice Allen, oversees the Astrophysics Source Code Library, which you can think of as “an arXiv for code”; it provides a DOI and permalink for people to cite code.

What happens when reproducibility is missing?

Reproducibility failures are rarely dramatic at first. More often, they look like:

  • results that “can’t quite be repeated”
  • models that work once but never again
  • performance that disappears when reused elsewhere

In astronomy and ML‑based science, documented failure modes include:

  • random samples that cannot be regenerated
  • machine‑learning models that rely on subtle data leakage
  • results that vanish when preprocessing is done correctly

In many published cases:

  • no fraud was involved
  • no one acted in bad faith
  • the problem was simply undocumented decisions

For ECRs, the risk is asymmetric:

  • the cost of failure is personal and immediate
  • the benefit of cutting corners is often short‑lived

Reproducibility as career insurance

It is reasonable to think of reproducibility as a form of insurance. You invest a small amount of effort:

  • documenting choices
  • fixing randomness
  • structuring workflows

In return, you reduce the chance of:

  • losing months of work
  • being unable to answer basic questions about your own results
  • inheriting an unfixable mess (or becoming one)

Reproducibility is insurance you pay for up front — instead of with stress later.

Discussion

Take one minute to think (no sharing required):

  • What is one thing in your current workflow that only you understand?
  • How confident are you that you could rerun your main result in a year?

Take-away: “Reproducible research is not about being perfect, it’s about making your future life easier.”




What do we mean by reproducibility?


The word “reproducibility” is used in many ways, often imprecisely. In practice, misunderstandings about what kind of reproducibility is being claimed are a major source of confusion, frustration, and irreproducible results in computational astronomy and ML‑based research. This section introduces a small set of definitions that are widely used across computational sciences and astronomy, and shows how they apply in practice.

Many communities now distinguish between three ideas:

Repeatability
The same researchers can rerun the same code, on the same data, in the same environment, and get the same result.
Reproducibility
A different researcher (or you, later) can rerun the same code, on the same data, and get the same result.
Replicability
A different analysis, dataset, or method leads to a consistent scientific conclusion.

In astronomy, most failures happen at the repeatability level - before we even reach reproducibility.

Remember: If you cannot repeat your own result, no one else can reproduce it.
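To make the repeatability level concrete, here is a minimal sketch using only Python's standard library (the function name and arguments are illustrative, not taken from any particular pipeline). The split can be regenerated exactly because the seed is an explicit, recorded argument rather than hidden global state:

```python
import random

def split_train_test(n_samples, test_fraction=0.2, seed=42):
    """Deterministically split sample indices into train and test sets.

    Because the seed is an explicit, recorded argument, rerunning this
    function with the same inputs always yields the same split.
    """
    rng = random.Random(seed)  # local generator; avoids hidden global state
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

# Rerunning with the same seed gives the identical split.
train_a, test_a = split_train_test(1000, seed=42)
train_b, test_b = split_train_test(1000, seed=42)
assert train_a == train_b and test_a == test_b
```

If the seed were left implicit (for example, taken from the system clock), even the original author could not repeat the split, and repeatability fails before anyone else is involved.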

Why this matters in ML‑based astronomy

Machine learning makes reproducibility harder than many traditional analyses because it introduces:

  • randomness (initialisation, data splits, stochastic optimisation)
  • long, implicit preprocessing chains
  • complex software stacks
  • performance claims that depend on subtle choices

Common astronomy‑specific examples include:

  • a transient classifier whose accuracy depends on an undocumented random seed
  • a photometric redshift model trained on one survey and reused on another without retraining
  • a simulation‑trained model whose domain of validity is unclear

These issues are discussed explicitly in astronomy‑adjacent reproducibility training materials such as the MPI‑Astronomy reproducibility workshop.

Discussion

Classifying reproducibility claims

Read the following statements and decide which category they belong to: repeatable, reproducible, or replicable.

No discussion yet; just decide for yourself.

  1. “I reran my notebook on my laptop and got the same plot.”
  2. “My collaborator ran my code from GitHub and reproduced all figures.”
  3. “Another group used a different survey and found the same astrophysical trend.”

After one minute, discuss briefly with a neighbour.

Reproducibility is a spectrum, not a switch

In real research, reproducibility is rarely all‑or‑nothing.

For example:

  • You might share code but not data
  • You might share data but not preprocessing
  • You might document assumptions but not environments

This is normal. The goal is not perfection, but making explicit what can and cannot be reproduced. The Turing Way emphasises this framing, especially for computational and ML‑based research.

Remember: Partial reproducibility is better than implicit irreproducibility.

What reproducibility does not require

A common misconception is that reproducibility means:

  • your code must be beautiful
  • your results must be flawless
  • everything must be public forever

None of these are true. Reproducibility requires clarity, not elegance. You can be reproducible while:

  • using imperfect code
  • reporting negative or null results
  • restricting access to sensitive data (with clear conditions)

Examples of good reproducible practice in astronomy

The following are widely cited examples of astronomy communities explicitly designing for reproducibility:

  • Astropy Collaboration
    Astropy provides a reproducible, community‑maintained software ecosystem with explicit versioning, testing, and citation practices.

  • IVOA standards and Virtual Observatory workflows
    The International Virtual Observatory Alliance (IVOA) has implemented interoperability and metadata standards that strongly align with FAIR and reproducibility principles.

  • Reproducible ML workflows in astronomy training
    MPI‑Astronomy and similar groups explicitly teach reproducible ML pipelines, including environment capture and workflow provenance.

These examples show that reproducibility is not hypothetical; it is already embedded in successful astronomy infrastructure.
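As a sketch of what “environment capture” can mean in practice (the function name and package list below are illustrative assumptions, not part of any specific workshop's materials), Python's standard library alone is enough to record interpreter and package versions alongside a result:

```python
import platform
import sys
import importlib.metadata

def capture_environment(packages=("numpy", "astropy", "scipy")):
    """Record interpreter, OS, and package versions for provenance.

    Packages that are not installed are recorded as None rather than
    raising, so the snapshot itself can always be taken.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            versions[name] = None
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }
```

Saving this dictionary as JSON next to each figure or trained model is a lightweight step towards provenance, short of full lock files or container images.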

Discussion

Identifying reproducibility gaps

Think about your own current project. Silently answer the following:

  • Could you rerun your main result today?
  • Could someone else rerun it using your description?
  • Which part would fail first?

Write down one concrete gap (e.g. “train/test split not saved”, “data cleaning undocumented”).

No sharing required.

Reproducibility versus performance in ML papers

In ML‑based science, performance claims are often treated as evidence. That is to say, model performance (correlation of output to “known” results) is treated as a proxy for scientific validity (the ability to understand or explain a phenomenon). This is the classic confusion of correlation with causation.

However, large‑scale reviews have shown that:

  • subtle methodological errors such as data leakage are widespread, leading to an over-reporting of model performance
  • claimed improvements often disappear when analyses are reproduced correctly (e.g. Kapoor and Narayanan, 2023)

Performance does not imply physical understanding. A classifier that distinguishes galaxies from stars with high accuracy does not necessarily:

  • encode meaningful morphology
  • generalise across surveys
  • respect physical invariants

It may instead exploit:

  • PSF differences
  • survey depth artefacts
  • preprocessing quirks

This is not an argument against ML. It is an argument for clear, reproducible evidence when ML models are used to support scientific claims. Reproducible results are easier to interrogate, and lead to a higher confidence in the reported outcomes.

Key takeaways

  • Reproducibility, repeatability, and replicability are related but distinct
  • Most failures occur at the repeatability stage
  • ML increases the need for explicit documentation
  • Reproducibility is about clarity, not perfection
  • If you do not state what can be reproduced, readers will assume nothing can.




Data, models, and FAIR practices in ML astronomy


Machine learning workflows in astronomy produce more than papers. They produce:

  • training datasets
  • labels and annotations
  • trained models
  • derived catalogues
  • complex preprocessing pipelines

If these outputs are not shared in a FAIR way, ML results become difficult or impossible to reuse, validate, or build upon.

This section focuses specifically on FAIR data and model sharing practices, and how they apply to ML‑based astronomy.

What does FAIR mean in astronomy?

FAIR stands for:

  • Findable
  • Accessible
  • Interoperable
  • Reusable

The FAIR principles were introduced by Wilkinson et al. (2016) and are now widely adopted across scientific domains. In astronomy, FAIR has a concrete and well‑established interpretation through:

  • Virtual Observatory infrastructure
  • IVOA standards
  • long‑lived data archives

An overview of how FAIR maps onto astronomical practice is given by O’Toole and Tocknell (2022).

Summary: FAIR is about making research outputs usable by humans and machines.

Astronomy is close to FAIR by default

Astronomy is often described as “world‑leading” in data stewardship. This is largely true because:

  • most survey data are archived
  • metadata standards are well developed
  • access is usually long‑term

IVOA standards were, in practice, implementing FAIR‑like ideas before the FAIR principles were formally articulated. As a result:

  • raw observational data are often FAIR
  • catalogues linked to publications are often FAIR
  • archive‑level metadata is usually strong

However, ML workflows introduce new outputs that are often not FAIR.

Where ML workflows break FAIRness

In ML‑based astronomy, FAIR failures usually occur downstream of the archive. Common examples include:

  • private training sets derived from public data
  • labels created by individuals and never shared
  • preprocessing scripts that are undocumented
  • trained models released without context
  • catalogues published without provenance

These issues are explicitly identified in FAIR guidance for astronomy, which emphasises that provenance, processing history, and metadata are essential for reuse (O’Toole and Tocknell, 2022).

Remember: FAIR does not stop at the telescope. It extends through the full analysis pipeline.

FAIR applies to models, not just data

A common misconception is that FAIR applies only to raw data. In ML astronomy, FAIR should apply to:

  • training datasets
  • labels and annotations
  • feature representations
  • trained models
  • evaluation datasets

If a trained model is shared without:

  • its training data description
  • preprocessing steps
  • intended scope

then it is not reusable, even if the model file itself is available.

This point is emphasised in astronomy‑focused FAIR discussions, which note that machine‑actionable reuse requires rich metadata and provenance, not just file access (Berriman, 2022).

Discussion

Is your ML output FAIR?

Think about one ML artefact you have produced or used. For that artefact, ask:

  • Can someone find it?
  • Can they access it?
  • Can they understand what it expects as input?
  • Can they reuse it without asking you questions?

Write down one missing piece of information.

No sharing required.

FAIR does not mean open at all costs

Another common misconception is that FAIR means “everything must be open”. This is not true. FAIR explicitly allows for:

  • access controls
  • embargo periods
  • restricted data

The requirement is not openness, but clarity:

  • how the data can be accessed
  • under what conditions
  • with what limitations

FAIR guidance for astronomy explicitly separates:

  • openness (a policy choice)
  • FAIRness (a technical and metadata choice)

Data can be FAIR without being fully open.

Why FAIR matters for ML reuse

ML outputs are frequently reused:

  • by collaborators
  • by downstream projects
  • by future surveys

If ML artefacts are not FAIR:

  • reuse requires personal communication
  • validation becomes difficult
  • results quietly decay over time

FAIR practices reduce this fragility by:

  • making assumptions explicit
  • preserving provenance
  • supporting long‑term reuse

This is especially important in astronomy, where datasets often outlive individual projects and researchers.

Discussion

One FAIR improvement

Identify one concrete change you could make to improve FAIRness:

  • a README
  • a data dictionary
  • a provenance note
  • a model description

Write it down.

That is enough for now.

Key takeaways

  • FAIR applies to data, models, and derived products
  • Astronomy infrastructure already supports FAIR principles
  • ML workflows often fall outside existing FAIR practices
  • Small documentation steps dramatically improve reuse

Thought: “If your ML result cannot be reused without emailing you, it is not FAIR yet.”




Ethics of ML and AI in astronomy research


Ethics in ML is often discussed in the context of social or human impacts. In astronomy, ethical issues usually look quieter and more technical. They show up as:

  • overconfident claims
  • silent misuse of models
  • results applied beyond their valid scope
  • downstream users trusting outputs more than they should

This section focuses on research ethics in ML‑based astronomy, not policy or regulation.

Ethics in astronomy is mostly about scope and responsibility

Astronomy ML models are rarely deployed directly in high‑stakes decisions. Instead, they are used to:

  • classify objects
  • generate catalogues
  • estimate physical parameters
  • prioritise follow‑up observations

The ethical risk is not immediate harm. The risk is that scientific conclusions become detached from their assumptions.

In ML‑based research, ethical problems usually arise when:

  • models are reused outside their intended domain
  • limitations are undocumented or forgotten
  • outputs are treated as ground truth

Good news: Most ethical failures in astronomy ML are failures of scope, not intent.

Automation bias in scientific workflows

Automation bias is the tendency to over‑trust automated outputs, even when they are wrong. In astronomy, this appears when:

  • ML classifications are accepted without inspection
  • catalogue flags are treated as definitive
  • human judgement is quietly removed from the loop

This is especially common when:

  • datasets are too large for manual checking
  • models perform well on average
  • uncertainty is not visible in outputs

Automation bias does not require negligence. It emerges naturally when tools scale faster than scrutiny.

Example: When a model outlives its context

A common pattern in astronomy ML is:

  • a model is trained for a specific survey
  • it performs well and is published
  • it is reused years later on a different dataset

If the original paper does not clearly state:

  • training data assumptions
  • preprocessing steps
  • intended domain of validity

then reuse becomes guesswork. This is an ethical issue because:

  • downstream users are misled by omission
  • incorrect results propagate silently
  • responsibility becomes unclear

Remember: “Silence about limitations is not neutral.”

Discussion

Who is your model for?

Think about one ML model you have built or used.

Silently answer:

  • What dataset was it trained on?
  • What dataset was it validated on?
  • Where would you hesitate to apply it?

Write down one boundary you would not cross.

No sharing required.

Over-claiming is an ethical failure

In ML‑based science, performance numbers are often used to justify scientific claims. Problems arise when:

  • predictive accuracy is treated as physical understanding
  • benchmark improvements are interpreted as theoretical progress
  • limitations are buried in appendices or not stated at all

Overclaiming does not require exaggeration. It can occur simply by:

  • failing to state uncertainty
  • omitting evaluation caveats
  • allowing readers to assume too much

In astronomy, this is particularly risky because ML outputs are often reused far beyond their original context.

Transparency is an ethical safeguard

Ethical ML practice in astronomy is mostly about making assumptions visible. This includes:

  • documenting training data provenance
  • stating what the model was designed to do
  • stating what it was not tested on
  • clarifying how outputs should be interpreted

This aligns closely with reproducibility and FAIR practices:

  • ethics, reproducibility, and reuse reinforce each other
  • none of them work in isolation

Best practice: Transparency is the ethical minimum, not an optional extra.

Ethics without blame

It is important to be explicit about what this section is not saying. Ethical issues in ML astronomy usually:

  • are not caused by bad actors
  • are not the result of incompetence
  • arise from complex systems and incentives

Most problems emerge because:

  • ML scales faster than documentation
  • models are reused in good faith
  • assumptions fade over time

Framing ethics as a systems issue, not a personal failure, is essential for changing practice.

Discussion

One ethical improvement

Think of one small change you could make when reporting results that rely on ML/AI:

  • a clearer limitations paragraph
  • a model scope statement
  • a warning in documentation
  • an explicit “not tested on” list

Write it down.

That is enough.

Key takeaways

  • Ethical issues in astronomy ML are usually about scope and reuse
  • Automation bias can occur even with high‑performing models
  • Overclaiming often happens through omission, not exaggeration
  • Transparency is the most effective ethical safeguard

Most ethical failures in ML astronomy happen when tools are trusted more than they are understood.




Practical takeaways for astronomers


Reproducibility, FAIR practices, and ethical ML use can feel abstract until they are translated into concrete actions. This section focuses on practical minimum standards that astronomers can apply immediately, without rewriting their entire workflow or becoming software engineers. The goal is not perfection. The goal is for our work to be clear, honest, and reusable enough.

A minimum reproducibility standard for ML projects

For most astronomy ML projects, a reasonable minimum standard is that:

  • you can rerun your own main result
  • someone else could rerun it with effort, but without guessing
  • limitations are stated explicitly

In practice, this usually means having:

  • versioned code
  • fixed or recorded randomness
  • explicit data splits
  • documented preprocessing
  • a short description of scope and limitations

Anything beyond this is a bonus.
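Most of the minimum standard above can be met with a few lines of code. As an illustrative sketch (the helper and field names are assumptions, not a prescribed format), a small JSON record stored next to each result captures the decisions that most often block reruns:

```python
import json
from datetime import datetime, timezone

def write_run_record(path, seed, data_release, scope_note):
    """Save the choices needed to rerun (and to defend) a result."""
    record = {
        "created": datetime.now(timezone.utc).isoformat(),
        "random_seed": seed,
        "data_release": data_release,         # e.g. survey name and data release
        "scope_and_limitations": scope_note,  # what this result does NOT cover
    }
    with open(path, "w") as fh:
        json.dump(record, fh, indent=2)
    return record
```

A record like this answers most “how did you get this?” questions without relying on memory, and takes seconds to write at the time the result is produced.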

What to document (even if you share nothing else)

If you only document a few things, make them these:

  • Data provenance
    • Where the data came from, including archive, release, and selection criteria.
  • Preprocessing steps
    • What was done to the data before training, especially filtering, scaling, and feature construction.
  • Randomness
    • Whether results depend on random seeds, and whether those seeds were fixed.
  • Model scope
    • What the model was designed to do, and where it was not tested.

This information is often more important than hyperparameter details.

Rule of thumb: Documentation that answers questions is more useful than documentation that looks complete.

FAIR does not require new infrastructure

Many astronomers assume FAIR practices require:

  • specialised repositories
  • complex metadata schemas
  • institutional support

In reality, small steps already help a lot. For example:

  • a README.md in a Git repository
  • a data dictionary for derived features
  • a short “how to reproduce Figure 3” note
  • a model description paragraph in the paper

FAIRness improves through clarity, not tooling.

Discussion

The five‑minute README

Imagine someone opens your project folder in a year. Write down the headings of a README that would help them.

For example:

  • What this project does
  • Where the data came from
  • How to rerun the main result
  • Known limitations

You do not need to write the content now.

Just the headings are enough.

How to be ethically careful without being defensive

Ethical ML practice in astronomy does not require disclaimers everywhere.

It requires:

  • stating assumptions
  • stating limits
  • avoiding implied generality

Helpful phrases include:

  • “This model was trained on…”
  • “Performance was evaluated only on…”
  • “We did not test…”
  • “Results should not be interpreted as…”

These are not weaknesses. They are signals of careful science.

When full openness is not possible

Sometimes you cannot share:

  • proprietary data
  • sensitive observations
  • intermediate products
  • large files

This does not prevent reproducibility or FAIR alignment. You can still:

  • describe access conditions
  • share code without data
  • provide synthetic examples
  • document the full workflow

Silence is the only irreproducible option.

Reuse is where most harm (and benefit) happens

Most problems arise after publication, when:

  • models are reused on new surveys
  • catalogues are taken as ground truth
  • assumptions are forgotten

You cannot control all reuse.

You can influence it by:

  • writing clear limitations
  • choosing careful language
  • making uncertainty visible

This protects both downstream users and your future self.

Discussion

One concrete improvement

Think about your current or next project.

Identify one thing you could improve:

  • a clearer scope statement
  • a fixed random seed
  • a short README
  • a note on reuse limitations

Write it down. Do not optimise it. Just commit to doing that one thing.

A sustainable mindset

Good practice accumulates.

Most reproducible and ethical ML workflows were not built all at once.

They evolved because:

  • small habits stuck
  • mistakes were documented
  • clarity was rewarded

Progress is incremental.

Key takeaways

  • Aim for a clear minimum standard, not perfection
  • Document decisions that affect results
  • Make scope and limitations explicit
  • Small improvements compound over time

Remember: Good (ML) practice is not about doing everything right. It is about making fewer things mysterious.




One‑page checklist: Reproducible, FAIR, and ethical ML in astronomy


Use this as a quick self‑check for your next ML project. You do not need to tick every box to start. Ticking a few is already progress.

1. Reproducibility (can someone rerun this?)

Minimum goal: someone else could rerun this without guessing.

2. Data provenance (where did this come from?)

Minimum goal: a reader understands what data went in, and why.

3. FAIR practices (is this reusable?)

  • Findable
  • Accessible
  • Interoperable
  • Reusable

Minimum goal: reuse does not require emailing you.

4. Models and outputs (what exactly is being shared?)

Minimum goal: users know what the model was built for.

5. Ethical safeguards (are assumptions visible?)

Minimum goal: downstream users are not misled by omission.

6. The five‑minute test

If you stopped working on this today:

If not, write one sentence to fix that.

Remember

  • Clarity beats completeness
  • Partial reproducibility beats silence
  • Small improvements compound

You are not aiming for perfect practice.
You are aiming for fewer mysteries.

Wrap‑up and next steps


This workshop has focused on a simple idea:

Machine learning does not change what good astronomy looks like.
It changes how easy it is to lose track of assumptions.

Across the lessons, we have seen that:

  • reproducibility is a practical skill, not a moral stance
  • FAIR practices extend beyond raw data to models and workflows
  • ethical issues in astronomy ML are usually about scope and reuse
  • small, explicit choices prevent large downstream problems

None of these require perfect code or ideal infrastructure. They require clarity.

What you should take away

If you remember only a few things:

  • You are the primary user of your own ML work
  • Performance numbers are not self‑explanatory
  • Models outlive their context unless you stop them
  • Silence about limitations is never neutral

Good practice is mostly about writing things down.

What to do next

After this workshop, consider doing just one of the following:

  • Add a README to an existing project
  • Write a scope and limitations paragraph for a paper
  • Fix and record random seeds
  • Document a preprocessing step you currently do implicitly
  • Add a “not tested on” note to a model description

Choose one thing. Do it once. Let it stick.

How this fits into your career

For PhD students and ECRs especially:

  • reproducible work is easier to defend
  • clear documentation saves time
  • careful scope statements protect you from overclaiming
  • FAIR practices increase the lifespan of your work

These benefits show up quickly.

Discussion

Final reflection (optional)

Take one minute and think about:

  • One assumption in your current work that could be made explicit
  • One future reader you could help with a single sentence

That is enough for today.

Final Thought: Reproducible and ethical ML is not about doing more work. It is about making your work easier to trust and reuse.