Hugging Face Models Integration

Lavoisier integrates pretrained models from the Hugging Face Hub to extend its mass spectrometry analysis, covering spectrum-to-structure reconstruction, spectrum and molecule embedding, chemical language modeling, and biomedical text processing.

Model Architecture Overview

Note: Model architecture diagram will be added in future updates

Core Spectrometry Models

SpecTUS Model

  • Model: MS-ML/SpecTUS_pretrained_only
  • Purpose: Structure reconstruction from EI-MS spectra
  • Features:
    • Direct conversion of mass spectra to SMILES
    • Beam search for multiple structure candidates
    • High accuracy on known compounds
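
A minimal loading sketch, assuming the checkpoint exposes the standard transformers sequence-to-sequence interface; the peak serialization format shown is an illustrative assumption, not the model's documented input spec:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_ID = "MS-ML/SpecTUS_pretrained_only"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# Example EI-MS peaks (m/z, intensity); the text serialization below is assumed
mz_values = [41.0, 43.0, 57.0, 71.0, 85.0]
intensity_values = [120.0, 999.0, 310.0, 250.0, 90.0]
spectrum_text = " ".join(f"{mz:.1f}:{i:.0f}" for mz, i in zip(mz_values, intensity_values))

inputs = tokenizer(spectrum_text, return_tensors="pt")
# Beam search returns several candidate structures ranked by model likelihood
outputs = model.generate(**inputs, num_beams=5, num_return_sequences=5, max_length=128)
candidate_smiles = tokenizer.batch_decode(outputs, skip_special_tokens=True)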

CMSSP Model

  • Model: OliXio/CMSSP
  • Purpose: Joint embedding of MS/MS spectra and molecules
  • Features:
    • 768-dimensional embeddings
    • Batch processing support
    • Efficient spectrum preprocessing
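
A sketch of batch embedding with CMSSP, assuming the checkpoint loads through AutoModel; the peak-list serialization and the mean pooling used to obtain the 768-dimensional vectors are illustrative assumptions:

import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "OliXio/CMSSP"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

# Batch of serialized MS/MS spectra (peak-list text format assumed for illustration)
batch = ["83.05:100 96.04:45 138.06:999", "55.02:60 77.04:999 105.03:250"]
inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state          # (batch, seq_len, 768)

# Mean-pool over non-padding tokens to obtain one embedding per spectrum
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (batch, 768)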

Chemical Language Models

ChemBERTa

  • Model: DeepChem/ChemBERTa-77M-MLM
  • Purpose: Chemical property prediction
  • Features:
    • Multiple pooling strategies (CLS, mean, max)
    • SMILES encoding
    • Property prediction
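
The sketch below encodes SMILES with ChemBERTa through the standard transformers API and illustrates the three pooling strategies:

import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "DeepChem/ChemBERTa-77M-MLM"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

smiles = ["CCO", "CC(=O)Oc1ccccc1C(=O)O"]   # ethanol, aspirin
inputs = tokenizer(smiles, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state           # (batch, seq_len, dim)

mask = inputs["attention_mask"].unsqueeze(-1)
cls_emb = hidden[:, 0, :]                                 # CLS pooling
mean_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling
max_emb = hidden.masked_fill(mask == 0, float("-inf")).max(dim=1).values  # max pooling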

MoLFormer

  • Model: ibm-research/MoLFormer-XL-both-10pct
  • Purpose: SMILES generation and embedding
  • Features:
    • Linear attention mechanism
    • Fast processing
    • Large-scale molecule handling
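
A loading sketch for MoLFormer embeddings; trust_remote_code is assumed to be required for the custom linear-attention implementation, and mean pooling is used here as a simple embedding strategy:

import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "ibm-research/MoLFormer-XL-both-10pct"
# trust_remote_code is assumed to be needed for the custom model code
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)

smiles = ["c1ccccc1", "CC(C)Cc1ccc(cc1)C(C)C(=O)O"]   # benzene, ibuprofen
inputs = tokenizer(smiles, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state

# Mean-pooled hidden states as molecule embeddings (pooling choice is illustrative)
embeddings = hidden.mean(dim=1)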

Biomedical Models

BioMedLM

  • Model: stanford-crfm/BioMedLM
  • Purpose: Biomedical language modeling
  • Features:
    • Context-aware analysis
    • Natural language generation
    • Domain-specific knowledge
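
A text-generation sketch with BioMedLM; the model is GPU-class (see the GPU requirements table below), so it is loaded in half precision here:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "stanford-crfm/BioMedLM"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Half precision keeps the 2.7B-parameter model within a 16 GB GPU budget
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16).to("cuda")

prompt = "Elevated plasma citrate in this sample most likely indicates"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=60, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))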

SciBERT

  • Model: allenai/scibert_scivocab_uncased
  • Purpose: Scientific text encoding
  • Features:
    • Scientific vocabulary
    • Multiple pooling strategies
    • Efficient text embedding
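
Encoding scientific text with SciBERT uses the standard transformers API; the sketch below shows CLS and mean pooling:

import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

texts = ["Tandem mass spectrometry identified elevated acylcarnitines.",
         "The metabolite was annotated by accurate mass and retention time."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state

cls_emb = hidden[:, 0, :]                                 # CLS pooling
mask = inputs["attention_mask"].unsqueeze(-1)
mean_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling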

Chemical NER

  • Model: pruas/BENT-PubMedBERT-NER-Chemical
  • Purpose: Chemical entity recognition
  • Features:
    • Chemical name extraction
    • Entity normalization
    • High precision recognition
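
Chemical entity extraction can be run through the transformers NER pipeline; aggregation merges sub-word tokens into complete entity spans:

from transformers import pipeline

ner = pipeline(
    "ner",
    model="pruas/BENT-PubMedBERT-NER-Chemical",
    aggregation_strategy="simple",
)

text = "Caffeine and its metabolite paraxanthine were quantified by LC-MS/MS."
for entity in ner(text):
    # Each entity carries the matched text, confidence score, and character offsets
    print(entity["word"], entity["score"], entity["start"], entity["end"])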

Implementation Details

Base Classes

import torch

class BaseHuggingFaceModel:
    """Base class for all Hugging Face models."""
    def __init__(self, model_id, revision="main", device=None, use_cache=True):
        self.model_id = model_id
        self.revision = revision
        self.use_cache = use_cache
        # Prefer GPU when available, otherwise fall back to CPU
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
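
A hypothetical subclass sketch (not the project's actual implementation) showing how a concrete model wrapper can build on this base class:

from transformers import AutoTokenizer, AutoModel

class ChemBERTaEncoder(BaseHuggingFaceModel):
    """Illustrative wrapper around DeepChem/ChemBERTa-77M-MLM."""
    def __init__(self, **kwargs):
        super().__init__("DeepChem/ChemBERTa-77M-MLM", **kwargs)
        # Load tokenizer and weights for the pinned revision, then move to the chosen device
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_id, revision=self.revision)
        self.model = AutoModel.from_pretrained(self.model_id, revision=self.revision).to(self.device)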

Model Registry

from huggingface_hub import snapshot_download

class ModelRegistry:
    """Manages model downloading and versioning."""
    def download_model(self, model_id, revision="main", force_download=False):
        # Fetch (or reuse the locally cached copy of) the requested revision
        # and return the local path to the downloaded snapshot.
        return snapshot_download(repo_id=model_id, revision=revision,
                                 force_download=force_download)
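
Example registry usage (the checkpoint ID is one of the models listed above):

registry = ModelRegistry()
local_path = registry.download_model("allenai/scibert_scivocab_uncased", revision="main")
print(f"Model files cached at {local_path}")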

Example Usage

# Initialize a model
model = SpecTUSModel()

# Process a spectrum
smiles = model.process_spectrum(mz_values, intensity_values)

# Generate embeddings
embeddings = model.encode_smiles(smiles)

GPU Requirements

Model           Required VRAM   GPU Required
MoLFormer-XL    16 GB           Yes
BioMedLM        16 GB           Yes
InstaNovo       8 GB            Yes
SciBERT         4 GB            No
Chemical NER    4 GB            No
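
A small helper (illustrative, not part of the library) for checking whether the local GPU meets a model's VRAM requirement before loading it:

import torch

def gpu_meets_requirement(required_gb: float) -> bool:
    """Return True if a CUDA device with at least required_gb of VRAM is available."""
    if not torch.cuda.is_available():
        return False
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    return total_bytes / (1024 ** 3) >= required_gb

# MoLFormer-XL and BioMedLM need roughly 16 GB of VRAM (see table above)
if not gpu_meets_requirement(16):
    print("Insufficient GPU memory; consider SciBERT or the Chemical NER model instead.")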

Performance Metrics

Model       Task                   Accuracy   Speed
SpecTUS     Structure Prediction   0.89       100 ms/spectrum
CMSSP       Embedding              0.92       50 ms/spectrum
ChemBERTa   Property Prediction    0.85       20 ms/SMILES
MoLFormer   SMILES Generation      0.88       30 ms/molecule

Future Extensions

  1. Proteomics Support
    • Integration of InstaDeepAI/InstaNovo
    • De novo peptide sequencing
    • Cross-analysis with metabolomics
  2. Model Distillation
    • Knowledge transfer to smaller models
    • Reduced resource requirements
    • Faster inference
  3. Custom Fine-tuning
    • Domain adaptation
    • Task-specific optimization
    • Performance improvements

Copyright © 2024 Lavoisier Project. Distributed under the MIT License.