| Purpose | Model | Why it helps / integration idea |
| --- | --- | --- |
| EI-MS → structure reconstruction (GC-MS) | `MS-ML/SpecTUS_pretrained_only` | Transformer that decodes raw EI fragmentation spectra directly into canonical SMILES. Use it as a starting point, then fine-tune on your in-house library before the “Comprehensive MS2 Annotation” step (decoding sketch below). |
| MS/MS ↔ molecule joint embedding | `OliXio/CMSSP` | Contrastive pre-training aligns spectra and molecular graphs in one latent space, which suits the multi-database search, re-ranking, and confidence-scoring modules (re-ranking sketch below). |
| Chemical language modelling for property / RT / fragmentation prediction | `DeepChem/ChemBERTa-77M-MLM` & `…-MTR` | RoBERTa variants trained on ~77 M SMILES; strong transfer for retention-time and intensity-prediction regressors when paired with graph features (RT-regression sketch below). |
| Low-data property prediction & zero-shot assay transfer | `mschuh/PubChemDeBERTa` | DeBERTa pre-trained on PubChem assays; handy for imputing missing phys-chem/Ki values that feed the multi-component confidence score (same embed-and-regress pattern as the ChemBERTa sketch). |
| Large-scale SMILES generator / embedder | `ibm-research/MoLFormer-XL-both-10pct` | Fast linear-attention XL model for molecule enumeration or fingerprint replacement; useful for data augmentation before synthetic-spectra generation (embedding sketch below). |
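A minimal decoding sketch for SpecTUS, assuming the checkpoint follows the standard Hugging Face seq2seq interface; the `m/z:intensity` text serialisation of peaks is a placeholder, so check the model card for the exact preprocessing it expects:

```python
# Sketch: decoding an EI spectrum into candidate SMILES with SpecTUS.
# Assumes a standard seq2seq interface; the peak serialisation below is a
# placeholder -- follow the model card's preprocessing exactly.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_ID = "MS-ML/SpecTUS_pretrained_only"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# Hypothetical serialisation of (m/z, relative intensity) peak pairs.
peaks = [(41, 27), (43, 100), (58, 85), (71, 12)]
spectrum_text = " ".join(f"{mz}:{inten}" for mz, inten in peaks)

inputs = tokenizer(spectrum_text, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128, num_beams=5)
print(tokenizer.decode(out[0], skip_special_tokens=True))  # candidate SMILES
```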
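For CMSSP-style re-ranking, the useful primitive is cosine similarity between a spectrum embedding and candidate molecule embeddings in the shared latent space. `encode_spectrum` and `encode_molecule` below are hypothetical stand-ins for the model's two encoders (see the `OliXio/CMSSP` card for the real loading code); only the ranking logic is shown:

```python
# Sketch: re-rank candidate structures against a query MS/MS spectrum using
# cosine similarity in a joint contrastive embedding space.
import torch
import torch.nn.functional as F

def rerank(spec_emb: torch.Tensor, cand_embs: torch.Tensor, smiles: list[str]):
    """Order candidate SMILES by cosine similarity to the spectrum embedding."""
    sims = F.cosine_similarity(spec_emb.unsqueeze(0), cand_embs, dim=-1)  # (n,)
    order = sims.argsort(descending=True)
    return [(smiles[i], sims[i].item()) for i in order]

# spec_emb  = encode_spectrum(query_spectrum)      # hypothetical, shape (d,)
# cand_embs = encode_molecule(candidate_smiles)    # hypothetical, shape (n, d)
# ranked = rerank(spec_emb, cand_embs, candidate_smiles)
```

The similarity scores can feed the multi-component confidence score directly, since they are already on a bounded [-1, 1] scale.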
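The ChemBERTa transfer recipe is to mean-pool the encoder's token states into one vector per SMILES, then fit a light regressor on top. A minimal sketch, using scikit-learn's `Ridge` and toy retention times purely for illustration:

```python
# Mean-pooled ChemBERTa embeddings feeding a ridge regressor for RT prediction.
import torch
from sklearn.linear_model import Ridge
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "DeepChem/ChemBERTa-77M-MLM"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
encoder = AutoModel.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def embed(smiles: list[str]) -> torch.Tensor:
    toks = tokenizer(smiles, padding=True, return_tensors="pt")
    hidden = encoder(**toks).last_hidden_state      # (n, L, d)
    mask = toks.attention_mask.unsqueeze(-1)        # (n, L, 1)
    return (hidden * mask).sum(1) / mask.sum(1)     # mean over real tokens only

train_smiles = ["CCO", "CC(=O)O", "c1ccccc1O"]      # toy examples
train_rt = [1.2, 1.8, 4.5]                          # retention times (min), toy
reg = Ridge().fit(embed(train_smiles).numpy(), train_rt)
print(reg.predict(embed(["CCN"]).numpy()))
```

The same embed-and-regress pattern applies to `mschuh/PubChemDeBERTa` when imputing missing assay or phys-chem values.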
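MoLFormer embeddings can replace fingerprints wherever a fixed-length vector is expected. The checkpoint ships custom modelling code, so `trust_remote_code=True` is required; whether the pooled vector is exposed as `pooler_output` should be verified against the model card:

```python
# Sketch: fingerprint replacement with MoLFormer embeddings.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "ibm-research/MoLFormer-XL-both-10pct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).eval()

with torch.no_grad():
    toks = tokenizer(["CCO", "CC(=O)Nc1ccc(O)cc1"],
                     padding=True, return_tensors="pt")
    emb = model(**toks).pooler_output  # per-molecule vectors (verify on card)
print(emb.shape)
```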
| Role | Model | Notes |
| --- | --- | --- |
| Domain-general biomedical LLM | `stanford-crfm/BioMedLM` (2.7 B) | Lightweight enough for local inference (4× A100 ≈ real-time). Excellent for “context-aware analytical assistance” and report drafting; its RAIL licence forbids medical diagnosis, but metabolomics use is fine (generation sketch below). |
| Scientific text encoder | `allenai/scibert_scivocab_uncased` | Use for rapid embedding of pathway-database abstracts in the “LLM-powered knowledge” sub-module (embedding sketch below). |
| Chemical NER for literature & user prompts | `pruas/BENT-PubMedBERT-NER-Chemical` | Drop-in spaCy/transformers pipeline that normalises compound names before they reach the model repository or an LLM prompt (NER sketch below). |
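A report-drafting sketch with BioMedLM, assuming the checkpoint loads through the generic causal-LM Auto classes (the prompt is illustrative):

```python
# Sketch: drafting a non-clinical report paragraph with BioMedLM (2.7B).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "stanford-crfm/BioMedLM"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

prompt = ("Summarise the metabolomic relevance of elevated plasma "
          "acylcarnitines for a non-clinical QC report:")
out = model.generate(**tokenizer(prompt, return_tensors="pt"),
                     max_new_tokens=120, do_sample=True, top_p=0.9)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```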
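Embedding abstracts with SciBERT needs no custom code: the stock `feature-extraction` pipeline returns per-token vectors, which can be mean-pooled into one vector per abstract before indexing:

```python
# One 768-d vector per pathway-database abstract via SciBERT.
import numpy as np
from transformers import pipeline

embedder = pipeline("feature-extraction",
                    model="allenai/scibert_scivocab_uncased")

abstracts = ["Glycolysis intermediates were quantified by LC-MS/MS ..."]
vectors = [np.mean(np.asarray(embedder(text)[0]), axis=0) for text in abstracts]
print(vectors[0].shape)  # (768,)
```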
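The BENT chemical-NER checkpoint works with the stock token-classification pipeline; `aggregation_strategy="simple"` merges word pieces back into whole compound mentions, ready for normalisation against the model repository:

```python
# Extract chemical mentions before they hit the LLM prompt.
from transformers import pipeline

ner = pipeline("token-classification",
               model="pruas/BENT-PubMedBERT-NER-Chemical",
               aggregation_strategy="simple")

text = "Samples were spiked with caffeine and 13C-labelled glucose."
for ent in ner(text):
    print(ent["word"], ent["entity_group"], round(ent["score"], 3))
```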
| Model | Relevance |
| --- | --- |
| `InstaDeepAI/InstaNovo` (Space with model weights) | If a proteomics branch is added later, this transformer performs de-novo peptide sequencing directly from MS/MS and could be orchestrated alongside the metabolite annotator for mixed-omics runs. |
| Wilhelm Lab datasets & tasks | Their PROSPECT-PTM datasets (retention time, detectability) pair nicely with the ChemBERTa/MoLFormer stack for transfer learning and benchmarking (loading sketch below). |
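A loading sketch for the Wilhelm Lab benchmarks via the `datasets` library; the repository id below is illustrative, not confirmed, so browse the Wilhelmlab organisation on the Hub for the actual PROSPECT-PTM repositories:

```python
# Sketch: pulling a retention-time benchmark for transfer-learning experiments.
from datasets import load_dataset

ds = load_dataset("Wilhelmlab/prospect-ptms-irt")  # hypothetical dataset id
print(ds["train"][0])  # e.g. sequence plus indexed retention time (check card)
```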

Copyright © 2024 Lavoisier Project. Distributed under the MIT License.