MediaIngredientMech: LLM-Assisted Media Ingredient Curation

Overview

MediaIngredientMech leverages Large Language Models (LLMs) to curate and standardize media ingredient ontology mappings for microbial cultivation research. It addresses the challenge of inconsistent ingredient naming and ambiguous chemical identifiers in cultivation protocols through AI-assisted semantic matching and human-in-the-loop validation workflows.

The Challenge: Media ingredients are described using inconsistent terminology across scientific literature, culture collections, and laboratory protocols. The same chemical compound might be referred to by common names, trade names, systematic IUPAC names, or ambiguous abbreviations, making automated data integration difficult.

The Solution: MediaIngredientMech uses LLMs to map ingredient names to standardized terms from ChEBI, PubChem, and METPO. It adds confidence scoring, batch processing, and curation workflows for ambiguous cases.


Key Features

🤖 LLM-Powered Semantic Matching

📋 Curation Workflows

📊 Quality Metrics

🔗 Ontology Integration

💾 Standardized Outputs


Technical Architecture

LLM-Assisted Mapping Pipeline

Raw Ingredient Names
    ↓
Text Preprocessing & Normalization
    ↓
LLM Semantic Analysis
    ↓
Ontology Candidate Retrieval
    ↓
Confidence Scoring
    ↓
├─ High Confidence → Automated Mapping
└─ Low Confidence → Human Curation Queue
    ↓
Validated Mappings
    ↓
Export to Knowledge Graph
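
The confidence branch in the pipeline above can be sketched as a simple threshold rule. This is a minimal illustration, not the project's implementation: the `Candidate` type, the `route_candidates` function, and the `ontology_id` key are hypothetical names, and the 0.8 default only mirrors the `confidence_threshold=0.8` value used in the examples in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One ontology candidate for an ingredient name (hypothetical type)."""
    term: str
    ontology_id: str
    confidence: float


def route_candidates(name: str, candidates: list[Candidate], threshold: float = 0.8) -> dict:
    """Return an automated mapping, or a curation-queue entry, for `name`."""
    # Rank candidates from most to least confident.
    ranked = sorted(candidates, key=lambda c: c.confidence, reverse=True)
    if ranked and ranked[0].confidence >= threshold:
        best = ranked[0]
        return {
            "input": name,
            "mapped_term": best.term,
            "ontology_id": best.ontology_id,
            "confidence": best.confidence,
            "status": "automated",
        }
    # No candidate is confident enough: keep all of them for a human curator.
    return {
        "input": name,
        "candidates": [
            {"term": c.term, "ontology_id": c.ontology_id, "confidence": c.confidence}
            for c in ranked
        ],
        "status": "requires_curation",
    }


high = route_candidates("NaCl", [Candidate("sodium chloride", "CHEBI:26710", 0.99)])
low = route_candidates(
    "salt",
    [
        Candidate("salt", "CHEBI:24866", 0.65),
        Candidate("sodium chloride", "CHEBI:26710", 0.75),
    ],
)
print(high["status"])  # automated
print(low["status"])   # requires_curation
```

An input with no candidates at all also falls through to the curation queue, which keeps the automated path conservative.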

Integration with CultureBotAI Ecosystem

MediaIngredientMech enhances the AI curation pipeline by providing semantic standardization:


LLM-Assisted Workflows

Workflow 1: Automated High-Confidence Mapping

For ingredient names with clear, unambiguous mappings:

# Example: High-confidence automated mapping
# `mapper` is an IngredientMapper instance (see Basic Usage)
ingredient = "glucose"
result = mapper.map_ingredient(ingredient)
# → {
#     "input": "glucose",
#     "mapped_term": "glucose",
#     "chebi_id": "CHEBI:17234",
#     "confidence": 0.99,
#     "status": "automated"
# }

Workflow 2: Ambiguous Case Resolution

For ingredients with multiple possible interpretations:

# Example: Ambiguous ingredient requiring curation
# `mapper` is an IngredientMapper instance (see Basic Usage)
ingredient = "peptone"
result = mapper.map_ingredient(ingredient)
# → {
#     "input": "peptone",
#     "candidates": [
#         {"term": "peptone", "chebi_id": "CHEBI:8429", "confidence": 0.65},
#         {"term": "proteose peptone", "source": "common_name", "confidence": 0.55},
#         {"term": "casein peptone", "source": "common_name", "confidence": 0.50}
#     ],
#     "status": "requires_curation",
#     "rationale": "Multiple peptone types exist; context needed"
# }
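
Once a curator has reviewed a queued entry, their choice has to be recorded. The sketch below shows one way that step might look, assuming entries shaped like the output above. The function `apply_curation_decision`, the `"curated"` status, and the `curated_by` / `curated_at` fields are illustrative assumptions, not part of the documented output format.

```python
from datetime import datetime, timezone


def apply_curation_decision(entry: dict, chosen_index: int, curator: str) -> dict:
    """Turn a queued entry into a validated mapping using the curator's choice."""
    if entry.get("status") != "requires_curation":
        raise ValueError("entry is not awaiting curation")
    chosen = entry["candidates"][chosen_index]  # IndexError if the index is invalid
    return {
        "input": entry["input"],
        "mapped_term": chosen["term"],
        "chebi_id": chosen.get("chebi_id"),  # None when the candidate has no ChEBI ID
        "status": "curated",                 # hypothetical status value
        "curated_by": curator,
        "curated_at": datetime.now(timezone.utc).isoformat(),
    }


queued = {
    "input": "peptone",
    "candidates": [
        {"term": "peptone", "chebi_id": "CHEBI:8429", "confidence": 0.65},
        {"term": "proteose peptone", "source": "common_name", "confidence": 0.55},
    ],
    "status": "requires_curation",
}
decision = apply_curation_decision(queued, chosen_index=1, curator="curator@example.org")
print(decision["mapped_term"])  # proteose peptone
```

Storing who made the decision and when keeps each curated mapping traceable to a human reviewer.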

Workflow 3: Batch Processing

Process large datasets efficiently:

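# Example: Batch processing of ingredients
# `mapper` is an IngredientMapper instance (see Basic Usage)
import csv

# Read the ingredient_name column from the input CSV
with open("media_ingredients.csv", newline="") as f:
    ingredients = [row["ingredient_name"] for row in csv.DictReader(f)]

results = mapper.batch_map(ingredients, confidence_threshold=0.8)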

# Separate by curation status
automated = results.filter(status="automated")
needs_review = results.filter(status="requires_curation")

# Export results
automated.export("mapped_ingredients.json")
needs_review.export("curation_queue.json")
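
For readers who want to see the split-and-export step without the library, here is a minimal standard-library sketch that works on plain dictionaries. The helper names `partition_by_status` and `export_json` are hypothetical, and the record layout is assumed to follow the workflow outputs shown earlier.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def partition_by_status(results: list[dict]) -> dict[str, list[dict]]:
    """Group result records by their "status" field, preserving input order."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for record in results:
        groups[record["status"]].append(record)
    return dict(groups)


def export_json(records: list[dict], path: Path) -> None:
    """Write records to `path` as pretty-printed JSON."""
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")


results = [
    {"input": "glucose", "status": "automated", "confidence": 0.99},
    {"input": "peptone", "status": "requires_curation"},
    {"input": "NaCl", "status": "automated", "confidence": 0.99},
]
groups = partition_by_status(results)

# Write the curation queue to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    queue_path = Path(tmp) / "curation_queue.json"
    export_json(groups.get("requires_curation", []), queue_path)
    reloaded = json.loads(queue_path.read_text(encoding="utf-8"))

print([r["input"] for r in groups["automated"]])  # ['glucose', 'NaCl']
print(reloaded)  # [{'input': 'peptone', 'status': 'requires_curation'}]
```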

Use Cases

1. Legacy Data Standardization

Standardize ingredient names from historical culture collection records to enable modern computational analysis.

2. Literature Mining Enhancement

Improve the quality of ingredient extraction from scientific publications by resolving ambiguous names to specific chemical entities.

3. Cross-Database Integration

Harmonize ingredient vocabularies across different culture collections (ATCC, DSMZ, JCM) for unified querying and analysis.

4. AI Training Data Preparation

Create high-quality, ontology-grounded training datasets for machine learning models predicting growth requirements.

5. Real-Time Laboratory Support

Provide ingredient standardization as a service for laboratory information management systems (LIMS) during data entry.


Ontology Integration

ChEBI (Chemical Entities of Biological Interest)

Primary ontology for chemical compound identification:

PubChem

Complementary chemical database integration:

METPO (Microbial Ecophysiological Trait and Phenotype Ontology)

Domain-specific ontology for microbial cultivation:


Quality Control Features

Confidence Scoring

Validation Checks

Provenance Tracking


Example: Ingredient Mapping

Input Data

ingredient_name,source,context
"NaCl","ATCC Medium 1","Marine bacterium medium"
"table salt","Lab protocol","General bacteriology"
"sodium chloride","Literature","Halophile cultivation"
"salt","DSMZ 514","Seawater-based medium"

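Before any semantic matching, names like these usually need light normalization. The sketch below parses the CSV above with the standard library and applies one possible normalization (Unicode NFKC, whitespace collapsing, case folding). The `normalize_name` function is a hypothetical helper, and the exact rules the project applies are not documented here.

```python
import csv
import io
import unicodedata

RAW_CSV = '''ingredient_name,source,context
"NaCl","ATCC Medium 1","Marine bacterium medium"
"table salt","Lab protocol","General bacteriology"
"sodium chloride","Literature","Halophile cultivation"
"salt","DSMZ 514","Seawater-based medium"
'''


def normalize_name(name: str) -> str:
    """Unicode-normalize, collapse whitespace, and case-fold an ingredient name."""
    text = unicodedata.normalize("NFKC", name)  # e.g. subscript digits become plain digits
    return " ".join(text.split()).casefold()


rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
normalized = [normalize_name(row["ingredient_name"]) for row in rows]
print(normalized)  # ['nacl', 'table salt', 'sodium chloride', 'salt']
```

Case folding discards information that matters for chemical formulas (for example, "Co" versus "CO"), so a real pipeline would likely keep the original string alongside the normalized key.
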
MediaIngredientMech Processing

{
  "mappings": [
    {
      "input": "NaCl",
      "standardized_name": "sodium chloride",
      "chebi_id": "CHEBI:26710",
      "chebi_name": "sodium chloride",
      "confidence": 0.99,
      "status": "automated",
      "synonyms": ["NaCl", "table salt", "salt", "halite"]
    },
    {
      "input": "table salt",
      "standardized_name": "sodium chloride",
      "chebi_id": "CHEBI:26710",
      "confidence": 0.95,
      "status": "automated",
      "note": "Common name mapped to systematic term"
    },
    {
      "input": "salt",
      "candidates": [
        {"name": "sodium chloride", "chebi_id": "CHEBI:26710", "confidence": 0.75},
        {"name": "salt", "chebi_id": "CHEBI:24866", "confidence": 0.65}
      ],
      "status": "requires_curation",
      "rationale": "Ambiguous: could refer to specific salt (NaCl) or general salt category"
    }
  ]
}
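
Output in this shape is straightforward to check and to index. The following sketch assumes the field names shown in the JSON above. It validates each record (ChEBI ID format, confidence range, known status) and groups automated mappings by ChEBI ID so that different spellings of the same compound end up together. Both functions are hypothetical helpers, not part of the tool.

```python
import re
from collections import defaultdict

CHEBI_CURIE = re.compile(r"CHEBI:\d+")


def validate_mapping(mapping: dict) -> list[str]:
    """Return a list of problems found in one mapping record (empty if none)."""
    problems = []
    # Curation records carry their IDs and scores inside "candidates".
    entries = mapping.get("candidates") or [mapping]
    for entry in entries:
        chebi_id = entry.get("chebi_id")
        if chebi_id is not None and not CHEBI_CURIE.fullmatch(chebi_id):
            problems.append(f"malformed ChEBI ID: {chebi_id!r}")
        confidence = entry.get("confidence")
        if confidence is None or not 0.0 <= confidence <= 1.0:
            problems.append(f"confidence out of range: {confidence!r}")
    if mapping.get("status") not in {"automated", "requires_curation"}:
        problems.append(f"unknown status: {mapping.get('status')!r}")
    return problems


def index_by_chebi(mappings: list[dict]) -> dict[str, list[str]]:
    """Group the input names of automated mappings by their ChEBI ID."""
    index = defaultdict(list)
    for m in mappings:
        if m.get("status") == "automated" and "chebi_id" in m:
            index[m["chebi_id"]].append(m["input"])
    return dict(index)


mappings = [
    {"input": "NaCl", "chebi_id": "CHEBI:26710", "confidence": 0.99, "status": "automated"},
    {"input": "table salt", "chebi_id": "CHEBI:26710", "confidence": 0.95, "status": "automated"},
    {
        "input": "salt",
        "candidates": [
            {"chebi_id": "CHEBI:26710", "confidence": 0.75},
            {"chebi_id": "CHEBI:24866", "confidence": 0.65},
        ],
        "status": "requires_curation",
    },
]
all_problems = [p for m in mappings for p in validate_mapping(m)]
index = index_by_chebi(mappings)
print(all_problems)  # []
print(index)         # {'CHEBI:26710': ['NaCl', 'table salt']}
```

The ambiguous "salt" record is deliberately left out of the index until a curator resolves it.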

Repository & Documentation


Getting Started

Installation

# Clone the repository
git clone https://github.com/CultureBotAI/MediaIngredientMech.git
cd MediaIngredientMech

# Install dependencies
pip install -r requirements.txt

# Configure LLM API keys (if using external models)
export OPENAI_API_KEY="your-api-key"
# or
export ANTHROPIC_API_KEY="your-api-key"

Basic Usage

from mediaingredientmech import IngredientMapper

# Initialize mapper
mapper = IngredientMapper(
    ontology="chebi",
    confidence_threshold=0.8
)

# Map a single ingredient
result = mapper.map_ingredient("yeast extract")
print(f"Mapped to: {result.chebi_name} ({result.chebi_id})")

# Batch processing
ingredients = ["peptone", "glucose", "NaCl", "agar"]
results = mapper.batch_map(ingredients)

# Export mappings
results.to_json("ingredient_mappings.json")

Research Impact

MediaIngredientMech improves data quality throughout the CultureBotAI ecosystem by providing:

It is part of the KG-Microbe knowledge graph project at Lawrence Berkeley National Laboratory.



Future Directions

Planned Enhancements

Community Contributions

We welcome contributions for:


Contact & Collaboration

For questions about MediaIngredientMech or to contribute:


Citation

If you use MediaIngredientMech in your research, please cite the KG-Microbe preprint and reference this tool:

MediaIngredientMech: LLM-Assisted Media Ingredient Curation
CultureBotAI Organization
https://github.com/CultureBotAI/MediaIngredientMech