Identifying and Overcoming Threats to Reproducibility, Replicability, Robustness, and Generalizability in Microbiome Research
ABSTRACT
PERSPECTIVE
THREATS TO REPRODUCIBILITY
Developing a framework.
Methods | Same experimental system | Different experimental system |
---|---|---|
Same methods | Reproducibility | Replicability |
Different methods | Robustness | Generalizability |
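
The two-by-two grid above can also be read as a simple lookup. The sketch below is only an illustration of that reading; the function name `classify_comparison` and its boolean arguments are hypothetical and do not come from the article.

```python
# Hypothetical helper: maps the two axes of the framework table
# (methods, experimental system) onto the term used for each cell.
FRAMEWORK = {
    (True, True): "Reproducibility",     # same methods, same experimental system
    (True, False): "Replicability",      # same methods, different experimental system
    (False, True): "Robustness",         # different methods, same experimental system
    (False, False): "Generalizability",  # different methods, different experimental system
}


def classify_comparison(same_methods: bool, same_system: bool) -> str:
    """Return the framework term for a comparison between two studies."""
    return FRAMEWORK[(same_methods, same_system)]


if __name__ == "__main__":
    for same_methods in (True, False):
        for same_system in (True, False):
            term = classify_comparison(same_methods, same_system)
            print(f"same_methods={same_methods}, same_system={same_system}: {term}")
```

For instance, reanalyzing the same samples with a different pipeline corresponds to `classify_comparison(same_methods=False, same_system=True)`, which returns `"Robustness"`.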
An example.
Reproducibility.
Replicability.
Robustness.
Generalizability.
FOSTERING A CULTURE OF GREATER REPRODUCIBILITY AND REPLICABILITY
Training.
Exercises.
Practice | Good | Better | Best |
---|---|---|---|
Handling of confounding variables | Prior to generating data, did we identify a list of possible confounding variables—biological and technical—that may obscure the interpretation of our results? | Do we indicate the level of randomization and experimental blocking that we performed to minimize the effect of the confounding variables? | Does the interpretation of our results limit itself to only those variables that are not obviously confounded? |
Sex/gender as confounding variables | Do we indicate the sex/gender of research animals/participants? | Do we provide a justification for the lack of even representation? | Is there equitable representation of sexes/genders? Do we account for them as a variable? |
Experimental design considerations | Do we have an active collaboration with a statistician who helps with experimental design and analysis? | Do we indicate the number of hypothesis tests that we performed and have we corrected any P values for multiple comparisons? | For our primary research questions, have we run a power analysis to determine the necessary sample size? |
Data analysis plan | Before starting an analysis, have we articulated a set of primary and secondary research questions? | Has someone else reviewed our data analysis plan prior to analyzing the data? | Have we registered our data analysis plan with a third party before starting the project? |
Provenance of reagents | Is there a table of reagents such as cell lines, strains, and primer sequences that were used? | Where possible, have we obtained reagents from certified entities like the American Type Culture Collection (ATCC)? | Is there a statement indicating how we know the provenance and purity of each cell line and strain? |
Controlling for initial microbiota | Are mice obtained from a breeding facility that allows us to track their pedigree? | Where possible, are mice from different treatment groups cohoused to control for differences in initial microbiota? | Are comparisons between mice with different genotypes made using mice that are the result of matings between animals that are heterozygous for that genotype? |
Clarity of software descriptions | Are all methods, databases, and software tools cited? Do we follow the relevant licensing requirements of each tool? | Do we indicate dates and version numbers of websites that were used to obtain data, code, and other third-party resources? | Are detailed methods registered on a website like protocols.io or GitHub? |
DNA contamination | Did we quantify the background DNA concentration in our reagents? Did we sequence an extraction control? | Are we taking steps to minimize reagent contamination? | What steps do we take to confirm a sequencing result that may be clouded by contaminating DNA? |
Availability of data products | Are all of the raw data publicly available? | Are intermediate and final data files publicly available? | Are tools like Amazon Machine Images (AMIs) used to make a snapshot of our working directory? |
Availability of metadata | Are all of the metadata necessary to repeat any analyses that we performed publicly available? | Have we adhered to standards in releasing the minimum amount of metadata about our samples? | Did we go beyond the minimum to incorporate other pieces of metadata that will inform future studies? |
Data analysis organization | Are all data, code, results, and documentation housed within a monophyletic folder structure on our computer? | Is this project contained within a single directory on our computer, and does it separate our raw and processed data, code, documentation, and results? | Is this folder structure under version control? Is the project’s repository publicly available? Are there assurances that this repository will remain accessible? |
Availability of data analysis tools | Are free and open tools used in preference to proprietary commercial tools? | Is the computer code required to run analyses available through a service like GitHub? | Are Amazon Machine Images or Docker containers used to allow recreation of our work environment? |
Documentation of data analysis workflow | Is our code well documented? Do we use a self-commenting coding practice? | Does each of our scripts have a header indicating the inputs, outputs, and dependencies? Is it documented how files relate to each other? | Are automated workflow tools like GNU Make and Common Workflow Language used to convert raw data into final tables, figures, and summary statistics? |
Use of random number generator | Do we know whether any of the steps in our data analysis workflow depend on the use of a random number generator? | For analyses that utilize a random number generator, have we noted the underlying random seed? | Have we repeated our analysis with multiple seeds to show that the results are insensitive to the choice of the seed? |
Defensive data analysis | Is our data analysis pipeline flexible enough to add new data? | Does our code include tests to confirm that it does what we think it does? | Did we make use of automated tests and continuous integration tools to ensure internal reproducibility? |
Ensuring short- and long-term reproducibility | Did we release the underlying code and new data at the time of submitting a paper with their DOIs and accession numbers? | Did we include a reproducibility statement or declaration at the end of the manuscript? Are ORCID identifiers provided for all authors? | What mechanisms are in place to ensure that our analysis remains accessible and reproducible in 5 years? |
Open science to foster reproducibility | Have we released any embargoes on our code repository and raw data prior to submitting the manuscript? | Did we post a preprint version of our manuscript prior to submission? | Have we published under a Creative Commons license? Is a permissive reuse license posted with our code? |
Transparency of data analysis | Is it clear where one would go to find the data and processing steps behind any of our figures? | Are electronic notebooks publicly accessible, and do they accompany the manuscript? | Were literate programming tools used to generate summary statistics, tables, and figures? |
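
As one way to act on the "Use of random number generator" row, the sketch below records an explicit seed for a subsampling step and repeats the step across several seeds to see how much the result moves. It is a minimal illustration under stated assumptions, not a method taken from the article: the read counts, the taxon names, the subsampling depth, and the function names `rarefied_richness` and `richness_across_seeds` are all hypothetical.

```python
import random
import statistics

# Hypothetical read counts for a single sample (illustrative values only).
COUNTS = {"taxon_a": 500, "taxon_b": 300, "taxon_c": 150, "taxon_d": 40, "taxon_e": 10}


def rarefied_richness(counts, depth, seed):
    """Subsample `depth` reads without replacement and count the taxa observed."""
    rng = random.Random(seed)  # local generator, so the seed is explicit and can be reported
    reads = [taxon for taxon, n in counts.items() for _ in range(n)]
    subsample = rng.sample(reads, depth)
    return len(set(subsample))


def richness_across_seeds(counts, depth, seeds):
    """Repeat the subsampling once per seed and return {seed: richness}."""
    return {seed: rarefied_richness(counts, depth, seed) for seed in seeds}


if __name__ == "__main__":
    seeds = [1, 2, 3, 4, 5]  # noted alongside the results so the run can be repeated
    results = richness_across_seeds(COUNTS, depth=100, seeds=seeds)
    for seed, richness in results.items():
        print(f"seed={seed}\trichness={richness}")
    values = list(results.values())
    print(f"mean={statistics.mean(values):.2f}\trange={min(values)}-{max(values)}")
```

Passing the seed as an argument, instead of relying on global random state, keeps the seed visible in the call itself and makes it easy to report. If the spread across seeds is large relative to the effect being claimed, that is a signal that the conclusion depends on the choice of seed.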
CONCLUSION
ACKNOWLEDGMENTS
REFERENCES