| Purpose | Model | Why it helps / integration idea |
| --- | --- | --- |
| EI-MS → structure reconstruction (GC-MS) | `MS-ML/SpecTUS_pretrained_only` | Transformer that decodes raw EI fragmentation spectra directly into canonical SMILES. Use it as a starting point, then fine-tune on your in-house library before the “Comprehensive MS2 Annotation” step (decoding sketch below). |
| MS/MS ↔ molecule joint embedding | `OliXio/CMSSP` | Contrastive pre-training aligns spectra and molecular graphs in one latent space, which suits the multi-database search, re-ranking, and confidence-scoring modules (re-ranking sketch below). |
| Chemical language modelling for property / RT / fragmentation prediction | `DeepChem/ChemBERTa-77M-MLM` & `…-MTR` | RoBERTa variants trained on ~77 M SMILES; strong transfer for retention-time and intensity-prediction regressors when paired with graph features (RT-regression sketch below). |
| Low-data property prediction & zero-shot assay transfer | `mschuh/PubChemDeBERTa` | DeBERTa pre-trained on PubChem assays; handy for imputing missing phys-chem/Ki values that feed the multi-component confidence score (same embed-and-regress pattern as the ChemBERTa sketch). |
| Large-scale SMILES generator / embedder | `ibm-research/MoLFormer-XL-both-10pct` | Fast linear-attention XL model for molecule enumeration or fingerprint replacement; useful for data augmentation before synthetic-spectra generation (embedding sketch below). |
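A minimal decoding sketch for SpecTUS, assuming the checkpoint follows the standard Hugging Face seq2seq interface; the `m/z:intensity` text serialisation of peaks is a placeholder, so check the model card for the exact preprocessing it expects:

```python
# Sketch: decoding an EI spectrum into candidate SMILES with SpecTUS.
# Assumes a standard seq2seq interface; the peak serialisation below is a
# placeholder -- follow the model card's preprocessing exactly.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_ID = "MS-ML/SpecTUS_pretrained_only"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# Hypothetical serialisation of (m/z, relative intensity) peak pairs.
peaks = [(41, 27), (43, 100), (58, 85), (71, 12)]
spectrum_text = " ".join(f"{mz}:{inten}" for mz, inten in peaks)

inputs = tokenizer(spectrum_text, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128, num_beams=5)
print(tokenizer.decode(out[0], skip_special_tokens=True))  # candidate SMILES
```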
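For CMSSP-style re-ranking, the useful primitive is cosine similarity between a spectrum embedding and candidate molecule embeddings in the shared latent space. `encode_spectrum` and `encode_molecule` below are hypothetical stand-ins for the model's two encoders (see the `OliXio/CMSSP` card for the real loading code); only the ranking logic is shown:

```python
# Sketch: re-rank candidate structures against a query MS/MS spectrum using
# cosine similarity in a joint contrastive embedding space.
import torch
import torch.nn.functional as F

def rerank(spec_emb: torch.Tensor, cand_embs: torch.Tensor, smiles: list[str]):
    """Order candidate SMILES by cosine similarity to the spectrum embedding."""
    sims = F.cosine_similarity(spec_emb.unsqueeze(0), cand_embs, dim=-1)  # (n,)
    order = sims.argsort(descending=True)
    return [(smiles[i], sims[i].item()) for i in order]

# spec_emb  = encode_spectrum(query_spectrum)      # hypothetical, shape (d,)
# cand_embs = encode_molecule(candidate_smiles)    # hypothetical, shape (n, d)
# ranked = rerank(spec_emb, cand_embs, candidate_smiles)
```

The similarity scores can feed the multi-component confidence score directly, since they are already on a bounded [-1, 1] scale.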
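The ChemBERTa transfer recipe is to mean-pool the encoder's token states into one vector per SMILES, then fit a light regressor on top. A minimal sketch, using scikit-learn's `Ridge` and toy retention times purely for illustration:

```python
# Mean-pooled ChemBERTa embeddings feeding a ridge regressor for RT prediction.
import torch
from sklearn.linear_model import Ridge
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "DeepChem/ChemBERTa-77M-MLM"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
encoder = AutoModel.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def embed(smiles: list[str]) -> torch.Tensor:
    toks = tokenizer(smiles, padding=True, return_tensors="pt")
    hidden = encoder(**toks).last_hidden_state      # (n, L, d)
    mask = toks.attention_mask.unsqueeze(-1)        # (n, L, 1)
    return (hidden * mask).sum(1) / mask.sum(1)     # mean over real tokens only

train_smiles = ["CCO", "CC(=O)O", "c1ccccc1O"]      # toy examples
train_rt = [1.2, 1.8, 4.5]                          # retention times (min), toy
reg = Ridge().fit(embed(train_smiles).numpy(), train_rt)
print(reg.predict(embed(["CCN"]).numpy()))
```

The same embed-and-regress pattern applies to `mschuh/PubChemDeBERTa` when imputing missing assay or phys-chem values.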
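MoLFormer embeddings can replace fingerprints wherever a fixed-length vector is expected. The checkpoint ships custom modelling code, so `trust_remote_code=True` is required; whether the pooled vector is exposed as `pooler_output` should be verified against the model card:

```python
# Sketch: fingerprint replacement with MoLFormer embeddings.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "ibm-research/MoLFormer-XL-both-10pct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).eval()

with torch.no_grad():
    toks = tokenizer(["CCO", "CC(=O)Nc1ccc(O)cc1"],
                     padding=True, return_tensors="pt")
    emb = model(**toks).pooler_output  # per-molecule vectors (verify on card)
print(emb.shape)
```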
| Role | Model | Notes |
| --- | --- | --- |
| Domain-general biomedical LLM | `stanford-crfm/BioMedLM` (2.7 B) | Lightweight enough for local inference (4× A100 ≈ real-time). Excellent for “context-aware analytical assistance” and report drafting; its RAIL licence forbids medical diagnosis, but metabolomics use is fine (generation sketch below). |
| Scientific text encoder | `allenai/scibert_scivocab_uncased` | Use for rapid embedding of pathway-database abstracts in the “LLM-powered knowledge” sub-module (embedding sketch below). |
| Chemical NER for literature & user prompts | `pruas/BENT-PubMedBERT-NER-Chemical` | Drop-in spaCy/transformers pipeline that normalises compound names before they reach the model repository or an LLM prompt (NER sketch below). |
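A report-drafting sketch with BioMedLM, assuming the checkpoint loads through the generic causal-LM Auto classes (the prompt is illustrative):

```python
# Sketch: drafting a non-clinical report paragraph with BioMedLM (2.7B).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "stanford-crfm/BioMedLM"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

prompt = ("Summarise the metabolomic relevance of elevated plasma "
          "acylcarnitines for a non-clinical QC report:")
out = model.generate(**tokenizer(prompt, return_tensors="pt"),
                     max_new_tokens=120, do_sample=True, top_p=0.9)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```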
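Embedding abstracts with SciBERT needs no custom code: the stock `feature-extraction` pipeline returns per-token vectors, which can be mean-pooled into one vector per abstract before indexing:

```python
# One 768-d vector per pathway-database abstract via SciBERT.
import numpy as np
from transformers import pipeline

embedder = pipeline("feature-extraction",
                    model="allenai/scibert_scivocab_uncased")

abstracts = ["Glycolysis intermediates were quantified by LC-MS/MS ..."]
vectors = [np.mean(np.asarray(embedder(text)[0]), axis=0) for text in abstracts]
print(vectors[0].shape)  # (768,)
```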
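The BENT chemical-NER checkpoint works with the stock token-classification pipeline; `aggregation_strategy="simple"` merges word pieces back into whole compound mentions, ready for normalisation against the model repository:

```python
# Extract chemical mentions before they hit the LLM prompt.
from transformers import pipeline

ner = pipeline("token-classification",
               model="pruas/BENT-PubMedBERT-NER-Chemical",
               aggregation_strategy="simple")

text = "Samples were spiked with caffeine and 13C-labelled glucose."
for ent in ner(text):
    print(ent["word"], ent["entity_group"], round(ent["score"], 3))
```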
| Model | Relevance |
| --- | --- |
| `InstaDeepAI/InstaNovo` (Space with model weights) | If a proteomics branch is added later, this transformer performs de-novo peptide sequencing directly from MS/MS and could be orchestrated alongside the metabolite annotator for mixed-omics runs. |
| Wilhelm Lab datasets & tasks | Their PROSPECT-PTM datasets (retention time, detectability) pair nicely with the ChemBERTa/MoLFormer stack for transfer learning and benchmarking (loading sketch below). |
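A loading sketch for the Wilhelm Lab benchmarks via the `datasets` library; the repository id below is illustrative, not confirmed, so browse the Wilhelmlab organisation on the Hub for the actual PROSPECT-PTM repositories:

```python
# Sketch: pulling a retention-time benchmark for transfer-learning experiments.
from datasets import load_dataset

ds = load_dataset("Wilhelmlab/prospect-ptms-irt")  # hypothetical dataset id
print(ds["train"][0])  # e.g. sequence plus indexed retention time (check card)
```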

Copyright © 2024 Lavoisier Project. Distributed under the MIT License.