Throughout the workshop discussions, two cross-cutting areas for improvement related to microbiome data and standards emerged: (i) encourage a culture that shares microbiome data, and (ii) understand and reduce barriers to (meta)data submission. We present a summary of the workshop discussions in the context of these two key themes.
Encourage a culture that shares microbiome data.
Success in science is often measured by high-impact publications (
7), creating pressure to be the first to make important discoveries and receive credit for the published contribution. Waiting until findings are published before making data available to others is not uncommon and remains a significant barrier to providing data to the broader community (
8,
9). Even after publication, data sharing remains challenging because of a lack of time to prepare data for sharing and reuse, legal or privacy constraints, and concerns about misinterpretation or misuse of data (
8,
10). As a result, researchers often cannot find data (
11) or spend 50 to 80% of their time wrangling data into a more usable form (
12). The current data revolution highlights the need to explore other measures of success (
13–15), as researchers are producing massive quantities of data that could provide valuable context for questions far beyond their original intent. While funding agencies are discussing ways to mandate data sharing (
16), the sharing of high-quality, well-curated data should also be driven by incentives. Other considerations include a mechanism for requesting the data owners' permission to use data sets prior to publication, as scientists are more willing to share data when they can set conditions on its use (
8).
To encourage a culture that shares microbiome data, it is critical to develop incentives and other ways to reward data stewardship. The workshop participants brainstormed several approaches to encourage such a culture, which the NMDC team is working to support.
(i) Establish digital object identifiers (DOIs) to enable data set citations. It has been widely reported that receiving credit through data set citations is important for data sharing (
8,
17). Providing a method for citing data sets in published articles opens the door for data set reuse to be quantified and, therefore, easily incorporated as a new metric in the research incentives structure. Journals that publish data set papers, such as
Nature Scientific Data,
GigaScience, and
Microbiology Resource Announcements, are an important start, and other publishers have started these discussions (
18). Several organizations can issue and register DOIs for data sets, but determining the appropriate granularity of DOI assignment (at the individual data set or project level) and establishing tracking mechanisms remain challenging. Further coordination with funders and additional publishers will be critical for defining, establishing, and promoting data citations and accurate citation metrics.
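As a minimal illustration of how data set DOIs could feed into citation metrics, the sketch below retrieves a DOI record from the public DataCite REST API; the placeholder DOI, and the assumption that usage fields such as citation counts are exposed for a given record, are ours for illustration rather than part of the workshop discussion.

```python
"""Minimal sketch: resolve a data set DOI via the DataCite REST API.

Assumptions: the DOI passed on the command line is a DataCite-registered data
set, and the response follows DataCite's documented JSON:API layout
(data.attributes). This is an illustration, not a prescribed workflow.
"""
import json
import sys
import urllib.request


def fetch_doi_record(doi: str) -> dict:
    """Return the parsed DataCite record for a DOI."""
    url = f"https://api.datacite.org/dois/{doi}"
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    doi = sys.argv[1] if len(sys.argv) > 1 else "10.1234/example-dataset"  # placeholder DOI
    attrs = fetch_doi_record(doi)["data"]["attributes"]
    print(attrs.get("titles"), attrs.get("publicationYear"))
    # If usage metrics (e.g., a citation count field) are exposed for this DOI,
    # they could feed the kind of data set reuse metric discussed above.
```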
(ii) Host data analysis competitions to support training on FAIR data for early career researchers. Early career researchers, including graduate students, are seen as critically important for catalyzing the cultural shift toward sharing well-curated microbiome data. While they often lack the authority to decide when their data are shared, early career researchers are frequently responsible for the experiments, data collection, data management, data formatting, and other efforts needed to make experimental data reusable and publicly accessible. Because of the inherent data access and transparency challenges (
19), meta-analyses can serve as important training for early career researchers to (i) understand the challenges in finding, accessing, and preparing data sets for analysis; (ii) recognize and appreciate data sets that are well curated and accessible; and (iii) be motivated, in turn, to prepare and share their own data. Hosting data competitions (e.g., DREAM challenges,
http://dreamchallenges.org/) to encourage meta-analyses can showcase data sharing and reproducible science, while also providing benefits for participants (training, professional development, funding) and making important contributions to science (
20–23). Further, data competitions can showcase how aggregating multiple standardized, well-curated microbiome data sets can enable new discoveries (
24) and, more importantly, forge new paths for optimizing data collection and applying data standards earlier in the research workflow.
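As a small illustration of such aggregation, the following sketch stacks two taxon-abundance tables that already share a standardized layout (samples as rows, taxa as columns); the file names, the tab-separated layout, and the use of pandas are all assumptions made for this example.

```python
"""Minimal sketch: aggregate two standardized taxon-abundance tables.

Assumes tab-separated files with sample identifiers in the first column and
taxa as the remaining columns; file names and layout are illustrative only.
"""
import pandas as pd


def load_table(path: str, study_id: str) -> pd.DataFrame:
    """Read one study's table and tag each sample with its study of origin."""
    table = pd.read_csv(path, sep="\t", index_col=0)
    table["study_id"] = study_id
    return table


if __name__ == "__main__":
    studies = [("study_a_taxa.tsv", "A"), ("study_b_taxa.tsv", "B")]
    combined = pd.concat(
        [load_table(path, study_id) for path, study_id in studies],
        join="inner",  # keep only the taxa (columns) observed in every study
    )
    print(combined.shape)  # pooled samples x shared taxa (plus the study_id column)
```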
(iii) Celebrate the value added by impactful meta-analyses. In addressing the current grand challenges in microbiome science, novel approaches using large-scale data science are no longer a goal but a necessity (
25). For example, the increased application of machine learning to biological problems (
26) has begun to expand how we think about data and data sharing (
27). Researchers who published work using someone else's published data were once dismissed as “data parasites” (
28,
29). Now, the Pacific Symposium on Biocomputing celebrates impactful meta-analyses through its annual Research Parasite Awards (
https://researchparasite.com/), which highlight important contributions of secondary analyses. Well-curated and FAIR microbiome data sets will be necessary for our field to explore applications of machine learning, automation, and secondary analyses (
30,
31).
While making data accessible is an important first step, data sets with missing information, erroneous values, or inconsistent formats hinder reuse. The workshop participants also discussed ways to incentivize efforts for sharing reusable data.
(iv) Establish comprehensive and coordinated data management plan(s) in collaboration with funders, publishers, and research service centers. While funders and publishers have moved toward encouraging open access to data (
32), the details of their data sharing policies vary (
33,
34), and there are insufficient resources for enforcement (
35). Data access remains a challenge for reproducible science (
11,
34,
36,
37). A comprehensive data management plan that incorporates community standards, supported by both funders and publishers, would provide structure and guidelines for data management best practices throughout the research process (
38). In addition, partnerships with research service centers, such as sequencing and other omics centers, offer an effective strategy for revisiting data management plans earlier in the data life cycle, before experimental data are generated.
(v) Provide training for a variety of learning styles. Data management best practices, data standards, and ontologies are powerful tools in support of the FAIR data principles. However, even seasoned scientists are often overwhelmed by guidelines and intimidated by ontologies. It is not enough to create a comprehensive data management plan; making this material accessible to the diversity of individuals who participate in the research process will be critical for effective adoption. A “quick start” guide is often a more approachable entry point for a data management novice, whereas extensive, searchable documentation is key for veterans who need only a refresher. Interfaces that support both programmatic access and visual exploration of these data can serve researchers with and without computational expertise. Further, using a variety of formats, such as tutorial videos, interactive webinars, and in-person events, supports a diversity of learning styles and enables bidirectional communication, which is critical for improving and updating training materials.
(vi) Establish a certification of “compliance.” Despite the significant efforts already invested in defining minimum standards for microbiome data, such as the Minimum Information about any (x) Sequence (MIxS) packages (
39), important work remains to ensure that the various standards and ontologies are interoperable and easily accessible to the research community. This entails working with researchers to identify metadata attributes that are valuable for data reuse within their respective communities and to define community-specific benchmarks. Establishing a “certification of compliance” based on these benchmarks would designate data sets as ready for reuse, encouraging their inclusion in follow-up studies and enhancing their citation metrics (see section i above).
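To make the idea of a compliance benchmark concrete, here is a minimal sketch that checks whether a sample metadata record populates a set of MIxS-style fields; the specific field list and the pass/fail rule are hypothetical placeholders, not an established certification.

```python
"""Minimal sketch of a 'certification of compliance' check for sample metadata.

The required fields below are common MIxS core attributes, but the benchmark
itself (which fields, what threshold) is a hypothetical placeholder.
"""

# Hypothetical benchmark: fields a community might require for reuse-ready data.
REQUIRED_FIELDS = [
    "collection_date",
    "geo_loc_name",
    "lat_lon",
    "env_broad_scale",
    "env_local_scale",
    "env_medium",
]


def check_compliance(sample: dict) -> tuple[bool, list[str]]:
    """Return (compliant, missing_fields) for one sample metadata record."""
    missing = [f for f in REQUIRED_FIELDS if not str(sample.get(f, "")).strip()]
    return not missing, missing


if __name__ == "__main__":
    sample = {
        "collection_date": "2021-06-15",
        "geo_loc_name": "USA: Colorado",
        "lat_lon": "40.01 N 105.27 W",
        "env_broad_scale": "grassland biome",
        "env_medium": "soil",
    }
    compliant, missing = check_compliance(sample)
    print("reuse-ready" if compliant else f"missing fields: {missing}")
```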
Understand and reduce barriers to data submission.
In addition to encouraging a culture that shares microbiome data, the workshop participants also discussed infrastructure challenges that impede sharing. Current data submission processes to primary data repositories or analytic platforms can be difficult to navigate, creating barriers even for good data stewards. The workshop participants suggested the following as a starting point to understand and reduce barriers to data/metadata submission.
(i) Understand how communities are currently using MIxS packages. MIxS packages are available for a variety of sample types and environments, but comparing their usage across data repositories is challenging. Are certain domains using them more or less often than others? For example, identifying research areas (e.g., domain, geographic location) that rarely use MIxS packages, submit data with the minimal required fields, or use null values to represent more than one meaning (e.g., missing versus not collected) enables a more targeted approach to training and outreach.
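As one way to carry out this kind of usage audit, the sketch below tallies how often different null-value spellings appear in a tab-separated metadata table; the file name, layout, and the particular list of null variants are assumptions for demonstration.

```python
"""Minimal sketch: tally null-value variants in a tab-separated metadata table.

The file name, column layout, and the set of null spellings are illustrative
assumptions; INSDC-style vocabularies distinguish values such as 'missing'
and 'not collected', which carry different meanings.
"""
import csv
from collections import Counter

NULL_VARIANTS = {"", "na", "n/a", "none", "null", "missing", "not collected", "not applicable"}


def tally_nulls(path: str) -> Counter:
    """Count occurrences of each null-value spelling across all columns."""
    counts = Counter()
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            for value in row.values():
                cleaned = (value or "").strip().lower()
                if cleaned in NULL_VARIANTS:
                    counts[cleaned or "<empty>"] += 1
    return counts


if __name__ == "__main__":
    for variant, count in tally_nulls("sample_metadata.tsv").most_common():
        print(f"{variant}\t{count}")
```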
(ii) Explore ways to harmonize data submission processes across platforms. Data submission portals, such as those involved in the International Nucleotide Sequence Database Collaboration (INSDC) (
40), each have unique requirements and interfaces, with some offering more robust manuals or training documentation than others. Enabling coordination through community standards and appropriate training materials will greatly enhance the availability of FAIR microbiome data.
(iii) Validate sample metadata with immediate, informative feedback. Using ontologies or MIxS packages requires specific formats for sample metadata attributes, yet most communities manage data in spreadsheets without controlled vocabularies or data standards, and reformatting entries is error-prone. Sample metadata validators that provide immediate, informative, and targeted feedback (41) can reduce the burden of reformatting these spreadsheets. Efficient and effective data submission makes researchers far more likely to share well-curated data.
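To illustrate what immediate, targeted feedback can look like, the sketch below validates two common MIxS-style attributes (collection_date and lat_lon) and reports row-level messages; the expected formats and wording are illustrative assumptions rather than the rules of any specific validator cited above.

```python
"""Minimal sketch of a sample metadata validator with targeted feedback.

The expected formats (ISO 8601 dates, 'DD.DD N|S DD.DD E|W' coordinates) and
the error messages are illustrative assumptions, not the exact rules of a
published standard or any specific tool's behavior.
"""
import re

DATE_PATTERN = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")                  # e.g., 2021-06-15
LAT_LON_PATTERN = re.compile(r"^\d+(\.\d+)? [NS] \d+(\.\d+)? [EW]$")     # e.g., 40.01 N 105.27 W


def validate_sample(row_number: int, sample: dict) -> list[str]:
    """Return human-readable error messages for one metadata row."""
    errors = []
    date = sample.get("collection_date", "").strip()
    if not DATE_PATTERN.match(date):
        errors.append(f"row {row_number}: collection_date '{date}' is not ISO 8601 (expected YYYY-MM-DD)")
    coords = sample.get("lat_lon", "").strip()
    if not LAT_LON_PATTERN.match(coords):
        errors.append(f"row {row_number}: lat_lon '{coords}' should look like '40.01 N 105.27 W'")
    return errors


if __name__ == "__main__":
    rows = [
        {"collection_date": "2021-06-15", "lat_lon": "40.01 N 105.27 W"},
        {"collection_date": "June 2021", "lat_lon": "Boulder, CO"},
    ]
    for i, row in enumerate(rows, start=1):
        for message in validate_sample(i, row):
            print(message)
```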