OSOpt is a drop-in PyTorch optimizer that cuts optimizer-state memory by 30-50% with the same convergence as Adam. No changes to your training code beyond swapping in the optimizer.
The Adam optimizer keeps two state buffers per parameter: the first-moment (momentum) and second-moment (variance) estimates. That's roughly 2x your model size in extra VRAM.
| Model | Parameters | Adam State | OSOpt State | Savings |
|---|---|---|---|---|
| BERT-base | 110M | 840 MB | 420 MB | 50% |
| GPT-2 | 774M | 5.9 GB | 2.9 GB | 50% |
| LLaMA-7B | 7B | 53 GB | 26 GB | 50% |
| Fine-tuning | Any | 2x params | ~0.1x params | ~95% |
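The Adam column follows directly from the parameter count: stock `torch.optim.Adam` allocates two fp32 buffers (`exp_avg` and `exp_avg_sq`) per trainable parameter, i.e. 8 extra bytes per weight. A quick back-of-the-envelope estimate, using an arbitrary stand-in model:

```python
import torch
from torch import nn

# Any model works here; this small MLP is just a stand-in.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

# Stock torch.optim.Adam keeps two fp32 buffers per trainable parameter
# (exp_avg and exp_avg_sq), i.e. 2 * 4 bytes of optimizer state per weight.
adam_state_mb = 2 * 4 * n_params / 2**20
print(f"{n_params / 1e6:.1f}M params -> ~{adam_state_mb:.0f} MB of Adam state")
```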
OSOpt classifies parameters as hot or cold based on gradient activity, then uses the right algorithm for each.
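The exact classification rule isn't spelled out here; as a rough illustration of the idea, the sketch below tracks an exponential moving average of each parameter's mean absolute gradient and labels a parameter hot once that average crosses a threshold. The class name, the EMA rule, and the threshold are assumptions, not OSOpt's actual logic:

```python
import torch

class GradientActivityTracker:
    """Illustrative hot/cold classifier: keeps an exponential moving average
    (EMA) of each parameter's mean absolute gradient and compares it to a
    fixed threshold. Names and constants are assumptions, not OSOpt's
    real implementation."""

    def __init__(self, params, beta: float = 0.99, hot_threshold: float = 1e-4):
        self.beta = beta
        self.hot_threshold = hot_threshold
        self.activity = {p: 0.0 for p in params}   # per-parameter EMA

    @torch.no_grad()
    def update(self) -> None:
        """Call once per step, after backward() and before optimizer.step()."""
        for p, ema in self.activity.items():
            if p.grad is None:
                continue
            g = p.grad.abs().mean().item()          # cheap activity proxy
            self.activity[p] = self.beta * ema + (1 - self.beta) * g

    def is_hot(self, p: torch.Tensor) -> bool:
        return self.activity.get(p, 0.0) > self.hot_threshold
```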
## Built for production ML workflows
- Cold parameters store no momentum buffers; buffers are deallocated automatically when a parameter is reclassified as cold.
- Hot parameters are updated with Adam, cold parameters with plain SGD: the right tool for each (see the sketch after this list).
- Gradient activity is monitored per parameter, so classification is data-driven, not heuristic.
- Works with your existing training code; no architecture changes needed.
- Ready-made integrations for HuggingFace Transformers and PyTorch Lightning.
- Automatically detects stalled training and intervenes to recover.
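The first three points describe a hybrid update rule. The sketch below shows one way such a rule could look, again as an illustration rather than OSOpt's actual code: a standard Adam update with lazily allocated state for parameters a user-supplied `is_hot` callback marks as hot, and a stateless SGD update for everything else, dropping any leftover buffers when a parameter goes cold. The class name and the `is_hot` hook are assumptions; the callback could be backed by something like the activity tracker sketched earlier.

```python
import torch
from torch.optim import Optimizer

class HybridAdamSGD(Optimizer):
    """Illustrative hot/cold optimizer, not OSOpt's implementation:
    Adam for hot parameters, stateless SGD for cold ones, freeing
    Adam buffers as soon as a parameter is reclassified as cold."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 is_hot=lambda p: True):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))
        self.is_hot = is_hot  # hypothetical classifier hook, e.g. an EMA tracker

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            lr, (beta1, beta2), eps = group["lr"], group["betas"], group["eps"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                if self.is_hot(p):
                    state = self.state[p]
                    if not state:  # lazily allocate Adam buffers for hot params
                        state["step"] = 0
                        state["exp_avg"] = torch.zeros_like(p)
                        state["exp_avg_sq"] = torch.zeros_like(p)
                    state["step"] += 1
                    t = state["step"]
                    m, v = state["exp_avg"], state["exp_avg_sq"]
                    m.mul_(beta1).add_(p.grad, alpha=1 - beta1)
                    v.mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)
                    m_hat = m / (1 - beta1 ** t)          # bias correction
                    v_hat = v / (1 - beta2 ** t)
                    p.addcdiv_(m_hat, v_hat.sqrt_().add_(eps), value=-lr)
                else:
                    # Cold parameters carry no optimizer state: free any
                    # buffers left over from when they were hot, then do SGD.
                    self.state.pop(p, None)
                    p.add_(p.grad, alpha=-lr)
        return loss
```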
## Scenarios with natural parameter sparsity
- Fine-tuning: most backbone parameters stay cold; only the head is hot.
- Recommendation systems with millions of items and power-law access patterns.
- Mixture-of-experts models: only K of N experts are active per sample, so unused experts stay cold.
- Memory-constrained environments where every MB matters.
## Real numbers from real workloads
## Install with pip and replace your optimizer
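A minimal sketch of what the swap could look like. The `osopt` package name, the import path, and the Adam-like constructor signature are assumptions here; check the project's own install instructions for the real ones.

```python
# pip install osopt        <- assumed package name; see the project's install docs

import torch
from osopt import OSOpt     # import path and class name are assumptions

model = torch.nn.Linear(512, 512)

# Before: optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
optimizer = OSOpt(model.parameters(), lr=3e-4)   # Adam-like signature assumed

for _ in range(100):
    x = torch.randn(32, 512)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```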