From Data to Discovery: How C-SPIRIT Is Transforming Metabolite Annotation

By Ashley Stender | June 23, 2025

Over 95% of compounds found in plant and microbial studies remain structurally and functionally undefined. Techniques like liquid chromatography–mass spectrometry (LC-MS) produce enormous volumes of chemical data, but most of it goes uninterpreted. As a result, the majority of these natural products remain unanalyzed and unused, creating a significant bottleneck in discovering novel bioactive products from plant and microbial sources.

C-SPIRIT’s Aim 2, Metabolite Annotation and Database Development, seeks to change that. The initiative focuses on three tightly connected goals: 1) building the first comprehensive annotation framework for plant and microbial metabolites, 2) designing high-throughput pipelines to identify chemical patterns and stress responses, and 3) integrating annotated data into public, intuitive databases that empower researchers across disciplines. Together, these efforts aim to raise annotation rates dramatically, turning unreadable chemical data into actionable biological insight.

One key insight behind Aim 2 is that pairing chemical data with contextual information, such as species identity or environmental conditions, can significantly improve annotation outcomes.

“If we attach the metadata, like what conditions it came from, [or] what species it came from, more than 90% of the peaks can be annotated,” explains Gaurav Moghe, the lead of Aim 2.

How Aim 2 Took Root

Gaurav Moghe is a plant biologist and associate professor in the School of Integrative Plant Science at Cornell University. Moghe’s lab has focused on metabolomics and computational biology for many years, particularly in the context of plant biology. Before C-SPIRIT, he was already collaborating with Sue Rhee, C-SPIRIT’s director, as part of the Plant Metabolic Network. Moghe and Rhee, along with Ola Skirycz and Karine Prado at Michigan State University, Mingxun Wang at UC Riverside, and Jazz Dickinson at UC San Diego, worked to improve methods for metabolite annotation. These early efforts included identifying biological signals in publicly available datasets and developing frameworks for organizing this information.

“I feel like [C-SPIRIT] was a natural transition to think about how we can do this on a significantly larger scale, impacting the global plant science community,” Moghe says.

When Rhee began planning the proposal that would become C-SPIRIT, Moghe’s leadership and expertise became foundational. His group at Cornell was already familiar with the computational challenges and had a deep appreciation for the need to make metabolomics data more accessible and interpretable.

For Moghe, the metabolite annotation challenge is not just a technical bottleneck; it’s a gateway to deeper biological understanding.

“We think that illuminating the 95% of the ‘dark metabolome’ is an essential step in understanding how crops respond to stress, thereby making agriculture more resilient to extreme weather events,” Moghe says. “[Accessible metabolite data] will not only help in applied fields, but also in more foundational areas of plant science.” Today, Aim 2 researchers based at universities in the US, Canada, the UK, Japan, and South Korea are working together to achieve this goal.

Building Flexible Tools for a Complex Field

One of Aim 2’s central goals is building the first comprehensive annotation framework for plant and microbial metabolites. The Aim 2 team is designing a system that will organize LC-MS/MS data into distinct ontological categories. The goal is to create a metabolite “map” that works across species, ecosystems, and experimental conditions. The framework extends existing ontologies such as Gene Ontology and ChemOnt, the latter developed by C-SPIRIT’s Canadian team member David Wishart, but adds the specificity needed to make sense of understudied metabolites in real-world datasets.

However, defining the metabolite data in its biological context is only part of the solution. Aim 2 also focuses on making that annotation process scalable. The Aim 2 team is building flexible pipelines that combine tools like MZmine, which processes LC-MS data to detect and quantify metabolite peaks, and the Global Natural Products Social Molecular Networking platform (GNPS) developed by C-SPIRIT collaborator Mingxun Wang at UC Riverside, which helps identify structural relationships between compounds through molecular networking. These tools are paired with custom clustering workflows and machine learning approaches to analyze large and variable datasets.

These pipelines can normalize, filter, and group large metabolomic datasets into meaningful clusters, revealing chemical patterns tied to stress responses, developmental stages, or organ specificity.
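To make the normalize, filter, and group steps concrete, here is a minimal sketch in plain Python. It is not the Aim 2 pipeline, which the article describes as built on MZmine, GNPS, and custom workflows. The peak table, sample labels, thresholds, and function names below are all hypothetical, and the grouping is a simple greedy correlation rule chosen only for illustration.

```python
from math import sqrt

# Hypothetical toy peak table (not real C-SPIRIT data): feature -> intensities.
# Assumed sample order: control_1, control_2, drought_1, drought_2.
PEAKS = {
    "F1": [100.0, 120.0, 400.0, 480.0],
    "F2": [50.0, 60.0, 210.0, 230.0],
    "F3": [300.0, 310.0, 90.0, 120.0],
    "F4": [0.0, 0.0, 5.0, 0.0],
}


def normalize(table):
    """Divide each intensity by its sample's total so every sample sums to 1."""
    n_samples = len(next(iter(table.values())))
    totals = [sum(row[j] for row in table.values()) for j in range(n_samples)]
    return {
        feat: [v / t if t else 0.0 for v, t in zip(row, totals)]
        for feat, row in table.items()
    }


def filter_features(table, min_present=2):
    """Keep features detected (intensity > 0) in at least min_present samples."""
    return {
        feat: row
        for feat, row in table.items()
        if sum(v > 0 for v in row) >= min_present
    }


def pearson(a, b):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0


def group_features(table, threshold=0.9):
    """Greedy grouping: a feature joins the first cluster that contains a
    member whose profile correlates with it at or above the threshold."""
    clusters = []
    for feat, row in table.items():
        for members in clusters:
            if any(pearson(row, table[m]) >= threshold for m in members):
                members.append(feat)
                break
        else:
            clusters.append([feat])
    return clusters


clusters = group_features(filter_features(normalize(PEAKS)))
print(clusters)
```

On this toy table, the sparsely detected F4 is filtered out, the two features that rise under the assumed drought samples (F1 and F2) land in one group, and F3, which falls under drought, forms its own. Real datasets would call for the dataset-specific choices Moghe describes.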

One major advantage of this approach is adaptability. “We’re not trying to create a one-size-fits-all solution. We’re building an adaptable toolbox,” Moghe says.

That philosophy is essential given the variety of instrumentation and protocols across labs. “People generate this data using very different protocols and very different instruments,” he says. “Every dataset needs to be looked at differently.”

Moghe has applied the pipeline to sweet potato datasets generated under regular water, drought, and high-water stress. His team at Cornell University examined how different varieties responded to each condition, identifying distinct metabolic signatures tied to environmental challenges. “Growers and breeders already know that even under the same stress, like drought, different varieties of the same crop behave differently. Using such high-throughput techniques like RNA-seq and LC-MS, we can pinpoint the specific genes, metabolites and networks associated with these differences, pointing us to solutions for making more resilient varieties,” Moghe says. Other C-SPIRIT teams are generating similar datasets under stress conditions in crops like rice, cassava, and potato, and even in microbial samples, enabling identification of species-specific and conserved metabolic signatures. These case studies will help the team translate raw LC-MS data into biological insights.

Turning Data into Discovery

The final piece of Aim 2’s vision is making annotated metabolomic data not only usable, but discoverable. To do that, the team is working to integrate their results into public databases and build visualization tools that invite exploration across species, tissues, and stresses.

These tools will be grounded in the FAIR data principles: Findable, Accessible, Interoperable, and Reusable. Aim 2 plans to integrate its outputs into platforms like GNPS, where researchers can explore molecular networks, and the Bio-Analytic Resource for Plant Biology (BAR) developed by C-SPIRIT’s Canada lead Nicholas Provart, which connects metabolite activity with gene expression and other omics datasets.

“We want scientists anywhere to be able to use these tools,” Moghe says. “Whether they’re studying crop resilience, metabolic engineering, or basic biochemistry.”

Aim 2’s integrative approach also helps surface new research questions. When a metabolite appears only in a specific tissue across multiple species, or shows a unique response to flooding, it can point to conserved biological functions or adaptation strategies. “We might be able to use that information… for breeding or engineering new, more resilient crop varieties,” Moghe explains.
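As a rough illustration of the first kind of pattern, the sketch below flags metabolites detected in exactly one tissue across several species. The detection records, identifiers, and the `min_species` cutoff are hypothetical, and the article does not describe how Aim 2 actually performs this comparison.

```python
# Hypothetical detections (not real data): (metabolite, species, tissue).
DETECTIONS = [
    ("M1", "rice", "root"),
    ("M1", "cassava", "root"),
    ("M1", "potato", "root"),
    ("M2", "rice", "root"),
    ("M2", "rice", "leaf"),
    ("M3", "potato", "leaf"),
]


def tissue_specific(detections, min_species=2):
    """Map each metabolite seen in exactly one tissue, and in at least
    min_species species, to that tissue."""
    tissues, species = {}, {}
    for metabolite, sp, tissue in detections:
        tissues.setdefault(metabolite, set()).add(tissue)
        species.setdefault(metabolite, set()).add(sp)
    return {
        m: next(iter(ts))
        for m, ts in tissues.items()
        if len(ts) == 1 and len(species[m]) >= min_species
    }


print(tissue_specific(DETECTIONS))
```

Here only M1 qualifies: it is root-only in three species, while M2 appears in two tissues and M3 in a single species.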

By pairing public access with meaningful context, Aim 2 is turning static datasets into living resources. This work is helping to fuel innovation in agriculture, AI, ecology, biotechnology, and beyond.

Linking Across the C-SPIRIT Pipeline

Aim 2 operates within a flexible discovery system that spans all six of C-SPIRIT’s research Aims. Each Aim contributes to identifying, analyzing, or testing candidate bioactive compounds, but the process is not strictly linear. Instead, compounds move through different tracks depending on their origin and how much is already known.

Most candidates begin in Aim 1, which sources molecules from biodiversity fieldwork, stress-responsive metabolomics, and commercial libraries. Some of these compounds are well-characterized; others are entirely novel. Aim 2 plays a central role in interpreting the unknowns. It annotates their structure, function, and biological context, often enabling further analysis in Aim 3, where researchers investigate the genetic pathways that produce them.

However, not every compound follows that route. “We didn’t want to put all our eggs in one basket… so we have three different streams of discovery of bioactive metabolites,” Moghe says.

In some cases, a compound may bypass some steps altogether and proceed directly to testing.

When Aim 2 is applied, it deepens understanding. Its frameworks support comparative analysis, reveal conserved patterns, and inform decisions across the pipeline. Even when not used for every candidate, the insights it generates have lasting value.

As C-SPIRIT works to transform how bioactive compounds are discovered and deployed, Aim 2 offers more than just a technical contribution. It builds the shared language, tools, and insight that enable a global team to translate complex data into real-world biological discovery.