| Purpose | Model | Why it helps / integration idea |
| --- | --- | --- |
| EI-MS → structure reconstruction (GC-MS) | MS-ML/SpecTUS_pretrained_only | Transformer that decodes raw EI fragmentation spectra directly into canonical SMILES. Use it as a starting point, then fine-tune on your in-house library before the “Comprehensive MS2 Annotation” step. |
| MS/MS ↔ molecule joint embedding | OliXio/CMSSP | Contrastive pre-training aligns spectra and molecular graphs in one latent space, which suits your multi-database search, re-ranking, and “confidence-scoring” modules. |
| Chemical language modelling for property / RT / fragmentation prediction | DeepChem/ChemBERTa-77M-MLM & …-MTR | RoBERTa variants trained on ~77 M SMILES; they transfer well to retention-time and intensity-prediction regressors when paired with graph features. |
| Low-data property prediction & zero-shot assay transfer | mschuh/PubChemDeBERTa | DeBERTa pre-trained on PubChem assay data; handy for imputing missing physicochemical/KI values that feed into your “multi-component confidence score”. |
| Large-scale SMILES generator / embedder | ibm-research/MoLFormer-XL-both-10pct | Fast linear-attention XL model for molecule enumeration or fingerprint replacement; useful for data augmentation before synthetic-spectra generation. |
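
The re-ranking idea in the joint-embedding row can be sketched as follows. This is a minimal illustration, assuming you already have one embedding vector for the query spectrum and one per candidate structure from a shared latent space (for example from a CMSSP-style encoder). How those vectors are produced is not shown, and the function names `cosine` and `rerank_candidates`, as well as the toy vectors, are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank_candidates(
    spectrum_vec: Sequence[float],
    candidate_vecs: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Sort candidate SMILES by similarity to the spectrum embedding, best first."""
    scored = [(smiles, cosine(spectrum_vec, vec)) for smiles, vec in candidate_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-D vectors standing in for real encoder outputs (assumption).
spectrum = [1.0, 0.0]
candidates = {"CCO": [0.9, 0.1], "CC(=O)O": [0.0, 1.0]}
ranked = rerank_candidates(spectrum, candidates)
```

The similarity returned for each candidate could then serve as one input to a confidence score, alongside whatever database-match evidence the pipeline already collects.
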

| Role | Model | Notes |
| --- | --- | --- |
| Domain-general biomedical LLM | stanford-crfm/BioMedLM (2.7 B) | Lightweight enough for local inference (4× A100 ≈ real-time). Well suited to “context-aware analytical assistance” and report drafting. (The RAIL licence forbids medical diagnosis, but metabolomics use is fine.) |
| Scientific text encoder | allenai/scibert_scivocab_uncased | Use for rapid embedding of pathway-database abstracts in your “LLM-powered knowledge” sub-module. |
| Chemical NER for literature & user prompts | pruas/BENT-PubMedBERT-NER-Chemical | Drop-in spaCy/transformers pipeline to normalise compound names before they hit your model repository or LLM prompt. |
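
As a rough sketch of the chemical-NER row, the snippet below collects compound mentions from free text before they are passed on. It assumes the model loads through the standard `transformers` token-classification pipeline with `aggregation_strategy="simple"`, which returns dicts carrying `score`, `start`, and `end`. The helper names `load_chemical_ner` and `extract_mentions` and the `min_score` threshold are hypothetical, and the sketch only deduplicates surface forms; mapping names to database identifiers is not covered.

```python
from typing import Callable, Dict, List


def load_chemical_ner(model_id: str = "pruas/BENT-PubMedBERT-NER-Chemical"):
    """Build a token-classification pipeline (downloads weights on first use)."""
    from transformers import pipeline  # lazy import: heavy optional dependency

    return pipeline("token-classification", model=model_id, aggregation_strategy="simple")


def extract_mentions(
    text: str,
    ner: Callable[[str], List[Dict]],
    min_score: float = 0.5,  # assumed cut-off, tune on your own data
) -> List[str]:
    """Return unique entity mentions in order of first appearance (case-insensitive)."""
    seen = set()
    mentions: List[str] = []
    for entity in ner(text):
        if entity["score"] < min_score:
            continue
        # Slice the original text so casing and spacing match the source.
        surface = text[entity["start"]:entity["end"]].strip()
        key = surface.lower()
        if surface and key not in seen:
            seen.add(key)
            mentions.append(surface)
    return mentions
```

In use, `extract_mentions(prompt, load_chemical_ner())` would run the real model; passing any callable with the same output shape keeps the helper testable offline.
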

| Model | Relevance |
| --- | --- |
| InstaDeepAI/InstaNovo (Space with model weights) | If you eventually add a proteomics branch, this transformer does de novo peptide sequencing directly from MS/MS. It could be orchestrated alongside your metabolite annotator for mixed-omics runs. |
| Wilhelm Lab datasets & tasks | Their PROSPECT-PTM datasets (retention time, detectability) pair nicely with the ChemBERTa/MoLFormer stack for transfer learning and benchmarking. |