Lavoisier Rust Extensions
High-performance mass spectrometry data processing using Rust for computationally intensive operations.
Overview
The Rust extensions provide significant performance improvements for large-scale MS data processing:
- 100-1000x speedup for large dataset processing (>100GB)
- Memory-mapped file handling for massive mzML files
- Parallel processing with SIMD optimization
- Zero-copy data structures for maximum efficiency
- PyO3 bindings for seamless Python integration
Architecture
```text
lavoisier-rust/
├── Cargo.toml              # Workspace configuration
├── lavoisier-core/         # Core data structures and algorithms
│   ├── src/
│   │   ├── lib.rs          # Main library with Python bindings
│   │   ├── spectrum.rs     # Spectrum processing utilities
│   │   ├── peak.rs         # Peak detection algorithms
│   │   ├── processing.rs   # Batch processing pipelines
│   │   ├── memory.rs       # Memory management
│   │   └── errors.rs       # Error handling
│   └── Cargo.toml
├── lavoisier-io/           # High-performance I/O operations
│   ├── src/
│   │   ├── lib.rs          # I/O library with mzML support
│   │   ├── mzml.rs         # mzML format handling
│   │   ├── compression.rs  # Compression algorithms
│   │   └── indexing.rs     # Fast indexing for random access
│   └── Cargo.toml
└── setup_rust.py           # Python setup script
```
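Both crates are compiled into a single Python extension module, `lavoisier.rust_extensions`, via PyO3. As a quick sanity check, a sketch like the following (using only names that appear in the usage examples later in this document) can probe which bindings are available in your build:

```python
# Quick availability probe for the PyO3 bindings described above.
# All names listed here appear in the usage examples below; whether each is
# present depends on how the extension module was built.
import importlib

ext = importlib.import_module("lavoisier.rust_extensions")

bindings = {
    "lavoisier-core": ["PySpectrum", "PyPeakDetector",
                       "batch_filter_intensity", "batch_normalize_spectra"],
    "lavoisier-io": ["PyMzMLReader"],
}

for crate, names in bindings.items():
    for name in names:
        status = "available" if hasattr(ext, name) else "missing"
        print(f"{crate}: {name} -> {status}")
```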
Installation
Prerequisites
- Install Rust:

  ```bash
  curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  source ~/.cargo/env
  ```

- Install Python dependencies:

  ```bash
  pip install maturin numpy
  ```
Build and Install
```bash
# Automated setup
python setup_rust.py

# Or manual build
cargo build --release
maturin develop --release
```
Performance Benchmarks
| Operation | Python (ms) | Rust (ms) | Speedup |
|---|---|---|---|
| Spectrum filtering | 245 | 2.1 | 117x |
| Peak detection | 892 | 8.3 | 107x |
| mzML parsing | 15,420 | 156 | 99x |
| Batch normalization | 1,340 | 12 | 112x |

Benchmarks on the MTBLS1707 dataset (2.1 GB, 15,000 spectra).
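Actual timings vary with hardware, data characteristics, and build flags. The sketch below shows one way to time the Rust-backed filtering path against a stand-in baseline; `python_filter` is a hypothetical reference implementation for illustration, not part of Lavoisier:

```python
# Minimal timing sketch, assuming the PySpectrum API documented below.
# python_filter is a hypothetical baseline used only for comparison.
import time
import numpy as np
from lavoisier.rust_extensions import PySpectrum

def python_filter(mz, intensity, threshold):
    # Naive baseline: keep peaks at or above the threshold
    keep = intensity >= threshold
    return mz[keep], intensity[keep]

mz = np.random.uniform(100, 1000, 1_000_000)
intensity = np.random.exponential(1000, 1_000_000)

t0 = time.perf_counter()
python_filter(mz, intensity, 500.0)
t_py = time.perf_counter() - t0

spectrum = PySpectrum(mz, intensity, 0.0, 1, "bench")
t0 = time.perf_counter()
spectrum.filter_intensity(500.0)
t_rs = time.perf_counter() - t0

print(f"Baseline: {t_py * 1000:.1f} ms, Rust: {t_rs * 1000:.1f} ms")
```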
Usage Examples
Basic Spectrum Processing
```python
from lavoisier.rust_extensions import PySpectrum, PyPeakDetector
import numpy as np

# Create spectrum
mz = np.array([100.0, 200.0, 300.0, 400.0])
intensity = np.array([1000.0, 2000.0, 1500.0, 800.0])
spectrum = PySpectrum(mz, intensity, 1.25, 1, "scan_001")

# Filter by intensity
spectrum.filter_intensity(900.0)

# Normalize
spectrum.normalize_intensity("max")

# Peak detection
detector = PyPeakDetector(100.0, 3.0, 5)
peaks = detector.detect_peaks(spectrum.mz.tolist(), spectrum.intensity.tolist(), 1.25)

print(f"Found {len(peaks)} peaks")
for peak in peaks:
    print(f"Peak: m/z={peak.mz:.4f}, intensity={peak.intensity:.2f}")
```
High-Performance mzML Reading
```python
from lavoisier.rust_extensions import PyMzMLReader

# Open large mzML file
reader = PyMzMLReader("large_dataset.mzML")

# Build index for fast random access
reader.build_index()

# Read all spectra efficiently
spectra = reader.read_spectra()
print(f"Loaded {spectra.len()} spectra")

# Get specific spectrum by ID
spectrum = reader.get_spectrum("scan=1000")
if spectrum:
    print(f"Spectrum RT: {spectrum.retention_time:.2f} min")

# Get metadata
metadata = reader.get_metadata()
print(f"File size: {metadata.get('file_size', 'unknown')} bytes")
```
Batch Processing
```python
import numpy as np
from lavoisier.rust_extensions import (
    PySpectrum,
    batch_filter_intensity,
    batch_normalize_spectra,
)

# Load multiple spectra
spectra = []
for i in range(1000):
    mz = np.random.uniform(100, 1000, 500)
    intensity = np.random.exponential(1000, 500)
    spec = PySpectrum(mz, intensity, i * 0.1, 1, f"scan_{i}")
    spectra.append(spec)

# Batch operations (parallel processing)
filtered_spectra = batch_filter_intensity(spectra, 100.0)
normalized_spectra = batch_normalize_spectra(filtered_spectra, "tic")

print(f"Processed {len(normalized_spectra)} spectra")
```
Memory-Efficient Streaming
```python
from lavoisier.rust_extensions import PyMzMLReader

def process_large_file(file_path):
    """Process large mzML files without loading everything into memory."""
    reader = PyMzMLReader(file_path)
    reader.build_index()

    total_spectra = 0

    # Process spectra in 5-minute retention time windows
    for rt_start in range(0, 60, 5):
        rt_end = rt_start + 5
        chunk_spectra = reader.get_spectra_in_rt_range(rt_start, rt_end)

        # Process chunk
        for spectrum in chunk_spectra:
            spectrum.filter_intensity(1000.0)
            spectrum.normalize_intensity("max")

        total_spectra += len(chunk_spectra)
        print(f"Processed RT {rt_start}-{rt_end}: {len(chunk_spectra)} spectra")

    return total_spectra

# Process 50GB file efficiently
total = process_large_file("massive_dataset.mzML")
print(f"Total processed: {total} spectra")
```
API Reference
PySpectrum
Core spectrum data structure with high-performance operations.
Constructor:
```python
PySpectrum(mz: np.ndarray, intensity: np.ndarray, retention_time: float,
           ms_level: int, scan_id: str)
```
Properties:
- `mz: np.ndarray` - Mass-to-charge values
- `intensity: np.ndarray` - Intensity values
- `retention_time: float` - Retention time in minutes
- `ms_level: int` - MS level (1 for MS1, 2 for MS2, etc.)
- `scan_id: str` - Unique scan identifier
Methods:
- `filter_intensity(threshold: float)` - Remove peaks below threshold
- `filter_mz_range(min_mz: float, max_mz: float)` - Filter by m/z range
- `normalize_intensity(method: str)` - Normalize intensities ("max", "tic", "zscore")
- `find_peaks(min_intensity: float, window_size: int) -> List[PyPeak]` - Detect peaks
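As a worked example, the sketch below exercises each of these methods in sequence; the m/z and intensity values are arbitrary, and the peak attributes (`peak.mz`, `peak.intensity`) follow the peak-detection example earlier in this document:

```python
# Sketch exercising the PySpectrum methods listed above; the values are
# arbitrary and only meant to illustrate the call sequence.
import numpy as np
from lavoisier.rust_extensions import PySpectrum

mz = np.linspace(150.0, 850.0, 8)
intensity = np.array([500.0, 4000.0, 1200.0, 300.0, 2500.0, 900.0, 3100.0, 700.0])
spectrum = PySpectrum(mz, intensity, 3.2, 1, "scan_042")

spectrum.filter_mz_range(200.0, 800.0)   # keep peaks with m/z in [200, 800]
spectrum.filter_intensity(800.0)         # drop peaks below 800
spectrum.normalize_intensity("tic")      # scale to total ion current

# find_peaks(min_intensity, window_size) -> List[PyPeak]
peaks = spectrum.find_peaks(0.05, 3)
for peak in peaks:
    print(f"m/z={peak.mz:.4f}, intensity={peak.intensity:.4f}")
```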
PyMzMLReader
High-performance mzML file reader with memory mapping.
Constructor:
```python
PyMzMLReader(file_path: str)
```
Methods:
- `build_index()` - Build index for fast random access
- `read_spectra() -> PySpectrumCollection` - Read all spectra
- `get_spectrum(scan_id: str) -> Optional[PySpectrum]` - Get specific spectrum
- `get_spectra_in_rt_range(rt_start: float, rt_end: float)` - Get spectra within a retention time window (see the streaming example above)
- `get_metadata() -> Dict[str, str]` - Get file metadata
PyPeakDetector
Advanced peak detection with noise estimation.
Constructor:
```python
PyPeakDetector(min_intensity: float, min_signal_to_noise: float, window_size: int)
```
Methods:
- `detect_peaks(mz: List[float], intensity: List[float], retention_time: float) -> List[PyPeak]` - Detect peaks in a single spectrum
Performance Optimization
Memory Management
The Rust extensions use several optimization techniques:
- Memory Mapping: Large files are memory-mapped for efficient access
- Zero-Copy Operations: Data is processed without unnecessary copying
- SIMD Instructions: Vectorized operations for numerical computations
- Parallel Processing: Multi-threaded execution using Rayon
Compilation Optimizations
The release build uses aggressive optimizations:
```toml
[profile.release]
lto = true             # Link-time optimization
codegen-units = 1      # Single codegen unit for better optimization
panic = "abort"        # Smaller binary size
opt-level = 3          # Maximum optimization
```
Memory Usage
Monitor memory usage during processing:
```python
from lavoisier.rust_extensions import get_memory_stats

# Process data
# ... your processing code ...

# Check memory statistics
stats = get_memory_stats()
print(f"Peak memory usage: {stats.peak_allocated / 1024**2:.1f} MB")
print(f"Current memory usage: {stats.current_allocated / 1024**2:.1f} MB")
```
Integration with Python Lavoisier
The Rust extensions integrate seamlessly with the Python codebase:
```python
# In your Python code
try:
    from lavoisier.rust_extensions import PyMzMLReader, PySpectrum
    USE_RUST = True
except ImportError:
    from lavoisier.io.MZMLReader import MZMLReader as PyMzMLReader
    USE_RUST = False

def load_spectra(file_path):
    if USE_RUST:
        # Use high-performance Rust implementation
        reader = PyMzMLReader(file_path)
        reader.build_index()
        return reader.read_spectra()
    else:
        # Fallback to Python implementation
        reader = PyMzMLReader(file_path)
        return reader.read_all_spectra()
```
Testing
Run the test suite:
```bash
# Rust tests
cargo test --workspace

# Python integration tests
python -m pytest tests/test_rust_extensions.py

# Benchmarks
cargo bench
```
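For orientation, a Python integration test might look something like the sketch below; the test body is illustrative and not taken from `tests/test_rust_extensions.py`:

```python
# Illustrative sketch of an integration test; not the actual contents of
# tests/test_rust_extensions.py.
import numpy as np
import pytest

rust = pytest.importorskip("lavoisier.rust_extensions")

def test_filter_intensity_removes_low_peaks():
    mz = np.array([100.0, 200.0, 300.0])
    intensity = np.array([50.0, 500.0, 5000.0])
    spectrum = rust.PySpectrum(mz, intensity, 1.0, 1, "scan_test")

    spectrum.filter_intensity(100.0)

    # Only the peaks above the 100.0 threshold should remain
    assert len(spectrum.mz) == 2
```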
Contributing
- Code Style: Use `cargo fmt` for formatting
- Linting: Run `cargo clippy` for lints
- Testing: Add tests for new functionality
- Documentation: Update docs for API changes
Troubleshooting
Common Issues
- Compilation Errors:

  ```bash
  # Update Rust toolchain
  rustup update

  # Clean build
  cargo clean
  cargo build --release
  ```

- Python Import Errors:

  ```bash
  # Reinstall extensions
  python setup_rust.py

  # Check installation
  python -c "import lavoisier.rust_extensions; print('OK')"
  ```

- Performance Issues:
  - Ensure you're using the release build (`--release`)
  - Check that SIMD instructions are available on your CPU
  - Monitor memory usage to avoid swapping
Platform-Specific Notes
Windows:

- Install Visual Studio Build Tools
- Use `maturin develop --release` for development

macOS:

- Install Xcode command line tools: `xcode-select --install`

Linux:

- Install build essentials: `sudo apt-get install build-essential`
License
Same as main Lavoisier project - see LICENSE file.