What is CultureBotAI?

CultureBotAI is a research initiative at Lawrence Berkeley National Laboratory in Berkeley, California, that develops AI-driven tools and knowledge graphs for microbial cultivation and computational biology.

Who leads CultureBotAI?

CultureBotAI is led by Dr. Marcin P. Joachimiak, a scientist specializing in microbiology, knowledge graph development, and computational biology at Lawrence Berkeley National Laboratory.

Where is CultureBotAI based?

CultureBotAI is based at Lawrence Berkeley National Laboratory in Berkeley, California, within the Environmental Genomics and Systems Biology Division.

What is KG-Microbe?

KG-Microbe is a comprehensive modular knowledge graph developed by Dr. Marcin P. Joachimiak that integrates diverse microbial data sources to enable AI-driven insights for growth prediction and culture optimization.

How can I access KG-Microbe?

KG-Microbe is available on GitHub at https://github.com/Knowledge-Graph-Hub/kg-microbe under the BSD-3-Clause license. The preprint is available at https://doi.org/10.1101/2025.02.24.639989

What does CultureBotAI work on?

CultureBotAI focuses on three main areas: (1) cultivation of isolated and novel organisms, (2) culture optimization through data-driven approaches, and (3) growth preference prediction using machine learning and AI methods.

MediaIngredientMech: LLM-Assisted Media Ingredient Curation

Overview

MediaIngredientMech leverages Large Language Models (LLMs) to curate and standardize media ingredient ontology mappings for microbial cultivation research. It addresses the challenge of inconsistent ingredient naming and ambiguous chemical identifiers in cultivation protocols through AI-assisted semantic matching and human-in-the-loop validation workflows.

The Challenge: Media ingredients are described using inconsistent terminology across scientific literature, culture collections, and laboratory protocols. The same chemical compound might be referred to by common names, trade names, systematic IUPAC names, or ambiguous abbreviations, making automated data integration difficult.

The Solution: MediaIngredientMech uses LLMs to map ingredient names to standardized ontology terms (ChEBI, PubChem, METPO). Each mapping carries a confidence score, large inputs can be processed in batches, and ambiguous cases are routed to a curation workflow.

Key Features

🤖 LLM-Powered Semantic Matching

Leverages foundation models for ingredient name normalization
Context-aware mapping considering cultivation domain knowledge
Handles synonyms, abbreviations, and trade names
Multi-ontology alignment (ChEBI, PubChem, custom vocabularies)

📋 Curation Workflows

Automated processing for high-confidence matches
Human-in-the-loop validation for ambiguous cases
Batch processing of large datasets
Quality control and consistency checking

📊 Quality Metrics

Confidence scores for each mapping
Provenance tracking for curation decisions
Validation status and review history
Inter-annotator agreement metrics
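
As one illustration of an inter-annotator agreement metric, the sketch below computes Cohen's kappa for two curators who labeled the same ingredients. This is the generic textbook formula, not necessarily the metric MediaIngredientMech reports. The function name and the sample labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both curators agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement expected from each curator's label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both curators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical ChEBI assignments from two curators for four ingredients
curator_1 = ["CHEBI:26710", "CHEBI:17234", "CHEBI:26710", "CHEBI:24866"]
curator_2 = ["CHEBI:26710", "CHEBI:17234", "CHEBI:24866", "CHEBI:24866"]
kappa = cohens_kappa(curator_1, curator_2)
```

A kappa near 1 indicates agreement well beyond chance, while values near 0 suggest the curators agree about as often as random labeling would.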

🔗 Ontology Integration

ChEBI (Chemical Entities of Biological Interest)
PubChem compound database
METPO (Microbial Ecophysiological Trait and Phenotype Ontology)
Custom microbial cultivation vocabularies

💾 Standardized Outputs

Export to JSON, TSV, RDF formats
Integration with LinkML schemas
Compatible with kg-microbe knowledge graph
API-ready structured data
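
A minimal sketch of the JSON and TSV exports, using only the Python standard library. The record fields follow the examples elsewhere on this page. The helper names `to_json` and `to_tsv` are hypothetical and are not part of the MediaIngredientMech API.

```python
import csv
import io
import json

# Hypothetical mapping records with the field names used in this page's examples
mappings = [
    {"input": "NaCl", "chebi_id": "CHEBI:26710", "confidence": 0.99, "status": "automated"},
    {"input": "table salt", "chebi_id": "CHEBI:26710", "confidence": 0.95, "status": "automated"},
]

def to_json(records):
    """Serialize records under a top-level 'mappings' key."""
    return json.dumps({"mappings": records}, indent=2)

def to_tsv(records, fields=("input", "chebi_id", "confidence", "status")):
    """Serialize selected fields as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, delimiter="\t",
                            lineterminator="\n", extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

json_text = to_json(mappings)
tsv_text = to_tsv(mappings)
```

Using `extrasaction="ignore"` lets records carry extra keys (for example `synonyms`) without breaking the flat TSV layout.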

Technical Architecture

LLM-Assisted Mapping Pipeline

Raw Ingredient Names
    ↓
Text Preprocessing & Normalization
    ↓
LLM Semantic Analysis
    ↓
Ontology Candidate Retrieval
    ↓
Confidence Scoring
    ↓
├─ High Confidence → Automated Mapping
└─ Low Confidence → Human Curation Queue
    ↓
Validated Mappings
    ↓
Export to Knowledge Graph
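
The stages above can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's implementation: plain string similarity from `difflib` stands in for the LLM semantic analysis, a two-entry dictionary stands in for ontology retrieval, and all function names are hypothetical.

```python
from difflib import SequenceMatcher

# Tiny stand-in vocabulary; a real run would query ChEBI or another ontology
VOCABULARY = {
    "sodium chloride": "CHEBI:26710",
    "glucose": "CHEBI:17234",
}

def normalize(name):
    """Text preprocessing: trim, lowercase, collapse internal whitespace."""
    return " ".join(name.strip().lower().split())

def retrieve_candidates(name):
    """Score each vocabulary term by string similarity (placeholder for the LLM step)."""
    scored = [
        {"term": term, "chebi_id": chebi_id,
         "confidence": SequenceMatcher(None, name, term).ratio()}
        for term, chebi_id in VOCABULARY.items()
    ]
    return sorted(scored, key=lambda c: c["confidence"], reverse=True)

def map_name(raw_name, threshold=0.8):
    """Route the best candidate to automated mapping or the curation queue."""
    candidates = retrieve_candidates(normalize(raw_name))
    best = candidates[0]
    if best["confidence"] >= threshold:
        return {"input": raw_name, **best, "status": "automated"}
    return {"input": raw_name, "candidates": candidates, "status": "requires_curation"}

exact = map_name("  Sodium   Chloride ")  # exact match after normalization
unclear = map_name("salt")                # weak string match, goes to the queue
```

String similarity cannot resolve true synonyms such as "table salt", which is the gap the LLM step is meant to close.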

Integration with CultureBotAI Ecosystem

MediaIngredientMech enhances the AI curation pipeline by providing semantic standardization:

Receives Input From:
- CultureMech - Extracted chemical entities from media recipes
- Manual curation efforts
- Legacy database imports
- Literature mining outputs

Feeds Into:
- MicroMediaParam - Enhanced ingredient standardization
- kg-microbe - Ontology-grounded knowledge graph
- MicroGrowAgents - Evidence-based media design

Complements:
- CultureMech - Chemical entity extraction
- CommunityMech - Community-level curation

LLM-Assisted Workflows

Workflow 1: Automated High-Confidence Mapping

For ingredient names with clear, unambiguous mappings:

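# Example: high-confidence automated mapping
# (class and method names follow the Basic Usage section)
from mediaingredientmech import IngredientMapper

mapper = IngredientMapper(ontology="chebi", confidence_threshold=0.8)
result = mapper.map_ingredient("glucose")
# → {
#     "input": "glucose",
#     "mapped_term": "glucose",
#     "chebi_id": "CHEBI:17234",
#     "confidence": 0.99,
#     "status": "automated"
# }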

Workflow 2: Ambiguous Case Resolution

For ingredients with multiple possible interpretations:

# Example: ambiguous ingredient requiring curation
# (class and method names follow the Basic Usage section)
from mediaingredientmech import IngredientMapper

mapper = IngredientMapper(ontology="chebi", confidence_threshold=0.8)
result = mapper.map_ingredient("peptone")
# → {
#     "input": "peptone",
#     "candidates": [
#         {"term": "peptone", "chebi_id": "CHEBI:8429", "confidence": 0.65},
#         {"term": "proteose peptone", "source": "common_name", "confidence": 0.55},
#         {"term": "casein peptone", "source": "common_name", "confidence": 0.50}
#     ],
#     "status": "requires_curation",
#     "rationale": "Multiple peptone types exist; context needed"
# }

Workflow 3: Batch Processing

Process large datasets efficiently:

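# Example: batch processing of ingredients
import csv

from mediaingredientmech import IngredientMapper

# Read names from the ingredient_name column of the input CSV
with open("media_ingredients.csv", newline="") as handle:
    ingredients = [row["ingredient_name"] for row in csv.DictReader(handle)]

# Mappings that score below the threshold go to the curation queue
mapper = IngredientMapper(ontology="chebi", confidence_threshold=0.8)
results = mapper.batch_map(ingredients)

# Separate by curation status
automated = results.filter(status="automated")
needs_review = results.filter(status="requires_curation")

# Export results
automated.export("mapped_ingredients.json")
needs_review.export("curation_queue.json")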

Use Cases

1. Legacy Data Standardization

Standardize ingredient names from historical culture collection records to enable modern computational analysis.

2. Literature Mining Enhancement

Improve the quality of ingredient extraction from scientific publications by resolving ambiguous names to specific chemical entities.

3. Cross-Database Integration

Harmonize ingredient vocabularies across different culture collections (ATCC, DSMZ, JCM) for unified querying and analysis.

4. AI Training Data Preparation

Create high-quality, ontology-grounded training datasets for machine learning models predicting growth requirements.

5. Real-Time Laboratory Support

Provide ingredient standardization as a service for laboratory information management systems (LIMS) during data entry.

Ontology Integration

ChEBI (Chemical Entities of Biological Interest)

Primary ontology for chemical compound identification:

Systematic chemical classification
Hierarchical relationships (is_a, has_part)
Molecular formulas and structures
Cross-references to other databases

PubChem

Complementary chemical database integration:

Compound identifiers (CID)
Chemical structure search
Bioassay data links
Literature references

METPO (Microbial Ecophysiological Trait and Phenotype Ontology)

Domain-specific ontology for microbial cultivation:

Growth condition terms
Media component vocabulary
Phenotypic trait descriptions
Integration with kg-microbe

Quality Control Features

Confidence Scoring

0.90-1.00: High confidence (automated approval)
0.70-0.89: Medium confidence (optional review)
0.00-0.69: Low confidence (requires curation)
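
The bands above translate directly into a small classifier. This is a sketch: the function name and the band labels are illustrative and are not taken from the MediaIngredientMech codebase.

```python
def confidence_band(score):
    """Map a confidence score in [0, 1] to the review band listed above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence must be between 0 and 1, got {score}")
    if score >= 0.90:
        return "high"    # automated approval
    if score >= 0.70:
        return "medium"  # optional review
    return "low"         # requires curation

# Scores taken from the examples on this page
bands = {s: confidence_band(s) for s in (0.99, 0.75, 0.65)}
```

Comparing against the lower bound of each band means a score with more than two decimals, such as 0.895, falls into the lower of the two adjacent bands instead of into a gap.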

Validation Checks

Taxonomic appropriateness (e.g., plant extracts for phototrophs)
Chemical compatibility (e.g., pH stability)
Concentration reasonableness
Cross-reference consistency

Provenance Tracking

Mapping algorithm version
LLM model and parameters used
Human curator identity (if applicable)
Timestamp and review history
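
One way to hold these provenance fields is a small dataclass. The sketch below is an assumption about shape only: the class name, field names, and placeholder values are hypothetical, and the actual schema may differ (for example, it may be defined in LinkML).

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Optional

def _now():
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()

@dataclass
class ProvenanceRecord:
    algorithm_version: str
    llm_model: str
    llm_parameters: dict
    curator: Optional[str] = None  # None for fully automated mappings
    timestamp: str = field(default_factory=_now)
    review_history: list = field(default_factory=list)

    def add_review(self, curator, decision):
        """Append a review event and record who made the latest decision."""
        self.curator = curator
        self.review_history.append(
            {"curator": curator, "decision": decision, "timestamp": _now()}
        )

record = ProvenanceRecord(
    algorithm_version="0.1.0",            # placeholder version string
    llm_model="example-model",            # placeholder model name
    llm_parameters={"temperature": 0.0},  # placeholder parameters
)
record.add_review("curator_a", "accepted")
record_dict = asdict(record)  # plain dict, ready for JSON export
```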

Example: Ingredient Mapping

Input Data

ingredient_name,source,context
"NaCl","ATCC Medium 1","Marine bacterium medium"
"table salt","Lab protocol","General bacteriology"
"sodium chloride","Literature","Halophile cultivation"
"salt","DSMZ 514","Seawater-based medium"

MediaIngredientMech Processing

{
  "mappings": [
    {
      "input": "NaCl",
      "standardized_name": "sodium chloride",
      "chebi_id": "CHEBI:26710",
      "chebi_name": "sodium chloride",
      "confidence": 0.99,
      "status": "automated",
      "synonyms": ["NaCl", "table salt", "salt", "halite"]
    },
    {
      "input": "table salt",
      "standardized_name": "sodium chloride",
      "chebi_id": "CHEBI:26710",
      "confidence": 0.95,
      "status": "automated",
      "note": "Common name mapped to systematic term"
    },
    {
      "input": "salt",
      "candidates": [
        {"name": "sodium chloride", "chebi_id": "CHEBI:26710", "confidence": 0.75},
        {"name": "salts", "chebi_id": "CHEBI:24866", "confidence": 0.65}
      ],
      "status": "requires_curation",
      "rationale": "Ambiguous: could refer to specific salt (NaCl) or general salt category"
    }
  ]
}
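
Output in this shape can be post-processed with the standard library alone. The sketch below uses an abbreviated copy of the JSON above, splits it by status, and ranks the candidates of each queued item. The `top_candidate` and `margin` fields are added here for illustration and are not part of the tool's output.

```python
import json

# Abbreviated copy of the output shown above
OUTPUT = """
{
  "mappings": [
    {"input": "NaCl", "chebi_id": "CHEBI:26710", "confidence": 0.99, "status": "automated"},
    {"input": "table salt", "chebi_id": "CHEBI:26710", "confidence": 0.95, "status": "automated"},
    {"input": "salt", "status": "requires_curation",
     "candidates": [
       {"name": "sodium chloride", "chebi_id": "CHEBI:26710", "confidence": 0.75},
       {"name": "salts", "chebi_id": "CHEBI:24866", "confidence": 0.65}
     ]}
  ]
}
"""

mappings = json.loads(OUTPUT)["mappings"]
automated = [m for m in mappings if m["status"] == "automated"]
queue = [m for m in mappings if m["status"] == "requires_curation"]

# For each queued item, surface the top candidate and its lead over the runner-up
for item in queue:
    ranked = sorted(item["candidates"], key=lambda c: c["confidence"], reverse=True)
    item["top_candidate"] = ranked[0]["name"]
    item["margin"] = (
        round(ranked[0]["confidence"] - ranked[1]["confidence"], 2)
        if len(ranked) > 1 else None
    )
```

A small margin between the top two candidates is a reasonable signal that a curator should look at the item first.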

Repository & Documentation

GitHub: https://github.com/CultureBotAI/MediaIngredientMech
License: To be determined
Language: Python
Status: Active development (Public as of March 2026)

Getting Started

Installation

# Clone the repository
git clone https://github.com/CultureBotAI/MediaIngredientMech.git
cd MediaIngredientMech

# Install dependencies
pip install -r requirements.txt

# Configure LLM API keys (if using external models)
export OPENAI_API_KEY="your-api-key"
# or
export ANTHROPIC_API_KEY="your-api-key"

Basic Usage

from mediaingredientmech import IngredientMapper

# Initialize mapper
mapper = IngredientMapper(
    ontology="chebi",
    confidence_threshold=0.8
)

# Map a single ingredient
result = mapper.map_ingredient("yeast extract")
print(f"Mapped to: {result.chebi_name} ({result.chebi_id})")

# Batch processing
ingredients = ["peptone", "glucose", "NaCl", "agar"]
results = mapper.batch_map(ingredients)

# Export mappings
results.to_json("ingredient_mappings.json")

Research Impact

MediaIngredientMech improves data quality throughout the CultureBotAI ecosystem by providing:

Semantic consistency across heterogeneous data sources
Reduced manual curation burden through intelligent automation
Enhanced AI training data with ontology-grounded ingredient terms
Interoperability with broader biological knowledge graphs

It is part of the KG-Microbe knowledge graph project at Lawrence Berkeley National Laboratory.

Related Projects

CultureMech - Chemical entity extraction from media recipes (10,000+ recipes)
CommunityMech - Microbial community interaction modeling
MicroMediaParam - Chemical compound standardization (78% ChEBI coverage)
kg-microbe - Central knowledge graph for microbial cultivation

Future Directions

Planned Enhancements

Multi-language ingredient name support
Integration with additional ontologies (FoodOn, NCIT)
Real-time curation web interface
Federated learning for cross-institutional curation
Active learning to prioritize human curation efforts

Community Contributions

We welcome contributions for:

Additional ingredient vocabularies
Ontology mapping rules
Validation datasets
Integration with laboratory systems

Contact & Collaboration

For questions about MediaIngredientMech or to contribute:

Principal Investigator: Dr. Marcin P. Joachimiak
Email: mjoachimiak@lbl.gov
Organization: CultureBotAI
Laboratory: Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory

Citation

If you use MediaIngredientMech in your research, please cite the KG-Microbe preprint and reference this tool:

MediaIngredientMech: LLM-Assisted Media Ingredient Curation
CultureBotAI Organization
https://github.com/CultureBotAI/MediaIngredientMech