Hugging Face Model Integration Plan for Lavoisier
1. Overview
This document outlines the plan to integrate specialized machine learning models from Hugging Face into the Lavoisier project to enhance its capabilities in mass spectrometry analysis, chemical structure prediction, and biomedical knowledge integration.
2. Model Integration Priorities
Phase 1: Core Spectrometry & Chemical Analysis Models
- MS-ML/SpecTUS_pretrained_only
  - Purpose: Structure reconstruction from EI-MS spectra
  - Integration: Create wrapper in lavoisier/models/spectral_transformers.py
  - Priority: High - Foundational for molecular structure prediction
- OliXio/CMSSP
  - Purpose: Joint embedding of MS/MS spectra and molecules
  - Integration: Implement in lavoisier/models/embedding_models.py
  - Priority: High - Critical for multi-database search and confidence scoring
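To illustrate how a joint spectrum/molecule embedding could support database search and confidence scoring, here is a minimal sketch. It assumes the query spectrum and the candidate molecules have already been embedded into the same vector space, as a CMSSP-style model would do. The function names, the toy vectors, and the use of the top-two similarity margin as a confidence proxy are illustrative assumptions, not part of CMSSP's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as matching nothing
    return dot / (norm_a * norm_b)


def rank_candidates(spectrum_embedding, candidate_embeddings):
    """Return (name, similarity) pairs sorted by descending similarity."""
    scored = [
        (name, cosine_similarity(spectrum_embedding, vector))
        for name, vector in candidate_embeddings.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy vectors standing in for real model embeddings (hypothetical values).
spectrum = [0.9, 0.1, 0.0]
candidates = {
    "caffeine": [1.0, 0.0, 0.0],
    "glucose": [0.0, 1.0, 0.0],
    "alanine": [0.0, 0.0, 1.0],
}

ranking = rank_candidates(spectrum, candidates)
# Assumed confidence proxy: gap between the best and second-best match.
margin = ranking[0][1] - ranking[1][1]
print(ranking[0][0], round(margin, 3))
```

The same ranking could be run against embeddings from several compound databases, with the margin feeding into whatever confidence score the pipeline reports.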
Phase 2: Chemical Language Models
- DeepChem/ChemBERTa-77M-MLM and DeepChem/ChemBERTa-77M-MTR
  - Purpose: Chemical language modeling for property prediction
  - Integration: Implement in lavoisier/models/chemical_language_models.py
  - Priority: Medium - Enables better property and fragmentation prediction
- ibm-research/MoLFormer-XL-both-10pct
  - Purpose: SMILES generation and embedding
  - Integration: Add to lavoisier/models/chemical_language_models.py
  - Priority: Medium - Useful for data augmentation
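As a rough sketch of what the ChemBERTa wrapper in this phase might do, the function below embeds SMILES strings by mean-pooling the encoder's last hidden state over non-padding tokens. The function name and the pooling choice are assumptions made for illustration. The imports are deferred so that importing the module does not require PyTorch or trigger a download.

```python
CHEMBERTA_MLM_ID = "DeepChem/ChemBERTa-77M-MLM"


def embed_smiles(smiles_list, model_id=CHEMBERTA_MLM_ID):
    """Return one mean-pooled embedding per SMILES string (hypothetical helper)."""
    # Deferred imports: heavy dependencies are loaded only when embedding is requested.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    batch = tokenizer(smiles_list, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, tokens, hidden_size)

    # Average over real tokens only, so padding does not dilute the embedding.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
```

Calling embed_smiles(["CCO", "c1ccccc1"]) would download the model on first use and return a tensor with one row per input. A property-prediction head could then be trained on top of these vectors.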
Phase 3: Biomedical Knowledge Integration
- stanford-crfm/BioMedLM (2.7B)
  - Purpose: Domain-general biomedical LLM
  - Integration: Implement in lavoisier/llm/specialized_llm.py
  - Priority: Medium - Enhances analytical assistance capabilities
- allenai/scibert_scivocab_uncased
  - Purpose: Scientific text encoding
  - Integration: Add to lavoisier/llm/text_encoders.py
  - Priority: Low - Supports pathway database integration
- pruas/BENT-PubMedBERT-NER-Chemical
  - Purpose: Chemical entity recognition
  - Integration: Create lavoisier/llm/chemical_ner.py
  - Priority: Low - Improves handling of compound names
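For the chemical NER entry, a minimal sketch of lavoisier/llm/chemical_ner.py might pair a Transformers token-classification pipeline with a small post-processing step. The helper names, the score threshold, and the sample entities are assumptions made for illustration. The sample list is hand-written in the shape the pipeline returns with aggregation_strategy="simple", not real model output, and the entity_group label shown is a placeholder.

```python
NER_MODEL_ID = "pruas/BENT-PubMedBERT-NER-Chemical"


def build_chemical_ner(model_id=NER_MODEL_ID):
    """Create the NER pipeline (downloads the model on first use)."""
    from transformers import pipeline  # deferred: heavy import

    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # merge word pieces into whole entities
    )


def extract_compound_names(entities, min_score=0.5):
    """Keep confident entity mentions, de-duplicated in order of appearance."""
    names = []
    for entity in entities:
        name = entity["word"].strip()
        if entity["score"] >= min_score and name and name not in names:
            names.append(name)
    return names


# Hand-written example in the pipeline's output format (not real predictions).
sample_entities = [
    {"entity_group": "LABEL", "score": 0.98, "word": "caffeine", "start": 0, "end": 8},
    {"entity_group": "LABEL", "score": 0.31, "word": "the", "start": 9, "end": 12},
    {"entity_group": "LABEL", "score": 0.95, "word": "theobromine", "start": 17, "end": 28},
    {"entity_group": "LABEL", "score": 0.97, "word": "caffeine", "start": 40, "end": 48},
]

print(extract_compound_names(sample_entities))
```

In real use, the entities would come from build_chemical_ner()(text), and the threshold would need tuning against annotated examples.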
Phase 4: Advanced Applications (Future)
- InstaDeepAI/InstaNovo
  - Purpose: Proteomics support
  - Integration: Create new module lavoisier/proteomics/
  - Priority: Low - Extension beyond current scope
3. Implementation Plan
Step 1: Infrastructure Setup
- Update requirements.txt to include dependencies:
  transformers>=4.30.0
  torch>=2.0.0
  datasets>=2.14.0
  huggingface_hub>=0.17.0
- Create model registry module (lavoisier/models/registry.py):
  - Implement model caching and version management
  - Support local and remote model loading
  - Add progress tracking for downloads
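One possible shape for the registry is sketched below. ModelSpec, ModelRegistry, and their methods are hypothetical names. The sketch assumes that version management means pinning a Hub revision, and that remote loading goes through huggingface_hub.snapshot_download, which reuses its on-disk cache and shows download progress bars by default.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Optional


@dataclass(frozen=True)
class ModelSpec:
    repo_id: str
    revision: Optional[str] = None    # branch, tag, or commit hash to pin
    local_path: Optional[str] = None  # optional pre-downloaded copy


class ModelRegistry:
    def __init__(self, cache_dir: Optional[str] = None):
        self.cache_dir = cache_dir
        self._specs: Dict[str, ModelSpec] = {}

    def register(self, name: str, spec: ModelSpec) -> None:
        self._specs[name] = spec

    def get(self, name: str) -> ModelSpec:
        if name not in self._specs:
            raise KeyError(f"Unknown model: {name!r}")
        return self._specs[name]

    def resolve(self, name: str, offline: bool = False) -> Path:
        """Return a local directory holding the model files."""
        spec = self.get(name)
        # Prefer an explicit local copy when one exists on disk.
        if spec.local_path and Path(spec.local_path).is_dir():
            return Path(spec.local_path)
        # Otherwise use the Hub cache; with offline=True no network call is made.
        from huggingface_hub import snapshot_download

        return Path(
            snapshot_download(
                repo_id=spec.repo_id,
                revision=spec.revision,
                cache_dir=self.cache_dir,
                local_files_only=offline,
            )
        )


registry = ModelRegistry()
registry.register("cmssp", ModelSpec(repo_id="OliXio/CMSSP"))
registry.register("chemberta-mlm", ModelSpec(repo_id="DeepChem/ChemBERTa-77M-MLM"))
```

Wrappers would then ask the registry for a path by short name instead of hard-coding repository IDs.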
Step 2: Core Model Integration Framework
- Create base model wrapper classes:
  - BaseHuggingFaceModel: Common functionality for all models
  - SpectralModel: Specific to MS analysis models
  - ChemicalLanguageModel: For chemical-specific language models
  - BiomedicalTextModel: For text-based models
- Implement model loading utilities:
  - Automatic caching
  - Offline mode support
  - Configuration management
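The class hierarchy and loading utilities above could be sketched as follows. The class names come from this plan, but every method and attribute is an assumption. Offline mode is read from the HF_HUB_OFFLINE environment variable when not passed explicitly, and is forwarded to from_pretrained as local_files_only.

```python
import os
from abc import ABC, abstractmethod

_TRUE_VALUES = {"1", "ON", "YES", "TRUE"}


class BaseHuggingFaceModel(ABC):
    """Common loading behaviour shared by all wrappers (hypothetical design)."""

    def __init__(self, model_id, revision=None, offline=None):
        self.model_id = model_id
        self.revision = revision
        if offline is None:
            offline = os.environ.get("HF_HUB_OFFLINE", "").upper() in _TRUE_VALUES
        self.offline = offline
        self._model = None

    def load_kwargs(self):
        """Keyword arguments to forward to from_pretrained."""
        kwargs = {"local_files_only": self.offline}
        if self.revision is not None:
            kwargs["revision"] = self.revision
        return kwargs

    @abstractmethod
    def _load(self):
        """Load and return the underlying model object."""

    @property
    def model(self):
        # Lazy loading: weights are read on first access, then kept in memory.
        if self._model is None:
            self._model = self._load()
        return self._model


class ChemicalLanguageModel(BaseHuggingFaceModel):
    def _load(self):
        from transformers import AutoModel  # deferred: heavy import

        return AutoModel.from_pretrained(self.model_id, **self.load_kwargs())
```

SpectralModel and BiomedicalTextModel would follow the same pattern with their own _load implementations. Note that huggingface_hub reads HF_HUB_OFFLINE when it is first imported, so the variable should be set before any Hugging Face import.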
Step 3: Model-Specific Implementations
- Implement each model wrapper according to priority
- Create integration tests for each model
- Add documentation and example workflows
Step 4: Pipeline Integration
- Update the Orchestrator to support model selection
- Create new pipeline types that leverage these models
- Implement fallback mechanisms for when models are unavailable
4. Implementation Timeline
- Phase 1: 1-2 weeks
- Phase 2: 1-2 weeks
- Phase 3: 2-3 weeks
- Phase 4: Future work
5. Dependencies
- PyTorch (torch)
- Transformers (transformers)
- Datasets (datasets)
- Hugging Face Hub (huggingface_hub)
- Additional model-specific dependencies
6. Testing Strategy
- Unit tests for each model wrapper
- Integration tests with sample data
- Performance benchmarks
- Offline functionality tests
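As an illustration of the unit and offline tests, the sketch below uses unittest with a stand-in model so that no weights are downloaded during testing. FakeEmbeddingModel, is_offline, and the test names are hypothetical. Real tests would exercise the actual wrappers, with integration tests on sample data kept separate.

```python
import os
import unittest
from unittest import mock


def is_offline():
    """True when HF_HUB_OFFLINE is set to a truthy value."""
    return os.environ.get("HF_HUB_OFFLINE", "").upper() in {"1", "ON", "YES", "TRUE"}


class FakeEmbeddingModel:
    """Stand-in for a model wrapper; returns fixed-size vectors without any weights."""

    dim = 4

    def embed(self, smiles_list):
        return [[float(len(smiles))] * self.dim for smiles in smiles_list]


class TestEmbeddingWrapper(unittest.TestCase):
    def setUp(self):
        self.model = FakeEmbeddingModel()

    def test_one_vector_per_input(self):
        vectors = self.model.embed(["CCO", "c1ccccc1"])
        self.assertEqual(len(vectors), 2)

    def test_vector_dimension(self):
        vectors = self.model.embed(["CCO"])
        self.assertEqual(len(vectors[0]), FakeEmbeddingModel.dim)


class TestOfflineMode(unittest.TestCase):
    def test_offline_flag_detected(self):
        with mock.patch.dict(os.environ, {"HF_HUB_OFFLINE": "1"}):
            self.assertTrue(is_offline())

    def test_online_by_default(self):
        with mock.patch.dict(os.environ, {}, clear=True):
            self.assertFalse(is_offline())
```

Saved as a test module, these cases can be run with python -m unittest.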