OSOpt is a drop-in PyTorch optimizer that cuts optimizer-state memory by 30-50% with the same convergence as Adam. No changes to your training code beyond swapping in the optimizer.
The Adam optimizer keeps two state buffers per parameter: the first-moment (momentum) and second-moment (variance) estimates. That's roughly 2x your model size in extra VRAM.
| Model | Parameters | Adam State | OSOpt State | Savings |
|---|---|---|---|---|
| BERT-base | 110M | 840 MB | 420 MB | 50% |
| GPT-2 | 774M | 5.9 GB | 2.9 GB | 50% |
| LLaMA-7B | 7B | 53 GB | 26 GB | 50% |
| Fine-tuning | Any | 2x params | ~0.1x params | ~95% |
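The Adam column follows directly from the parameter count: stock `torch.optim.Adam` allocates two fp32 buffers (`exp_avg` and `exp_avg_sq`) per trainable parameter, i.e. 8 extra bytes per weight. A quick back-of-the-envelope estimate, using an arbitrary stand-in model:

```python
import torch
from torch import nn

# Any model works here; this small MLP is just a stand-in.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

# Stock torch.optim.Adam keeps two fp32 buffers per trainable parameter
# (exp_avg and exp_avg_sq), i.e. 2 * 4 bytes of optimizer state per weight.
adam_state_mb = 2 * 4 * n_params / 2**20
print(f"{n_params / 1e6:.1f}M params -> ~{adam_state_mb:.0f} MB of Adam state")
```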
OSOpt classifies parameters as hot or cold based on gradient activity, then uses the right algorithm for each.
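The exact classification rule isn't spelled out here; as a rough illustration of the idea, the sketch below tracks an exponential moving average of each parameter's mean absolute gradient and labels a parameter hot once that average crosses a threshold. The class name, the EMA rule, and the threshold are assumptions, not OSOpt's actual logic:

```python
import torch

class GradientActivityTracker:
    """Illustrative hot/cold classifier: keeps an exponential moving average
    (EMA) of each parameter's mean absolute gradient and compares it to a
    fixed threshold. Names and constants are assumptions, not OSOpt's
    real implementation."""

    def __init__(self, params, beta: float = 0.99, hot_threshold: float = 1e-4):
        self.beta = beta
        self.hot_threshold = hot_threshold
        self.activity = {p: 0.0 for p in params}   # per-parameter EMA

    @torch.no_grad()
    def update(self) -> None:
        """Call once per step, after backward() and before optimizer.step()."""
        for p, ema in self.activity.items():
            if p.grad is None:
                continue
            g = p.grad.abs().mean().item()          # cheap activity proxy
            self.activity[p] = self.beta * ema + (1 - self.beta) * g

    def is_hot(self, p: torch.Tensor) -> bool:
        return self.activity.get(p, 0.0) > self.hot_threshold
```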
## Built for production ML workflows
- Cold parameters store no momentum buffers; buffers are deallocated automatically when a parameter is reclassified as cold.
- Hot parameters are updated with Adam, cold parameters with plain SGD: the right tool for each (see the sketch after this list).
- Gradient activity is monitored per parameter, so classification is data-driven, not heuristic.
- Works with your existing training code; no architecture changes needed.
- Ready-made integrations for HuggingFace Transformers and PyTorch Lightning.
- Automatically detects stalled training and intervenes to recover.
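The first three points describe a hybrid update rule. The sketch below shows one way such a rule could look, again as an illustration rather than OSOpt's actual code: a standard Adam update with lazily allocated state for parameters a user-supplied `is_hot` callback marks as hot, and a stateless SGD update for everything else, dropping any leftover buffers when a parameter goes cold. The class name and the `is_hot` hook are assumptions; the callback could be backed by something like the activity tracker sketched earlier.

```python
import torch
from torch.optim import Optimizer

class HybridAdamSGD(Optimizer):
    """Illustrative hot/cold optimizer, not OSOpt's implementation:
    Adam for hot parameters, stateless SGD for cold ones, freeing
    Adam buffers as soon as a parameter is reclassified as cold."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 is_hot=lambda p: True):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))
        self.is_hot = is_hot  # hypothetical classifier hook, e.g. an EMA tracker

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            lr, (beta1, beta2), eps = group["lr"], group["betas"], group["eps"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                if self.is_hot(p):
                    state = self.state[p]
                    if not state:  # lazily allocate Adam buffers for hot params
                        state["step"] = 0
                        state["exp_avg"] = torch.zeros_like(p)
                        state["exp_avg_sq"] = torch.zeros_like(p)
                    state["step"] += 1
                    t = state["step"]
                    m, v = state["exp_avg"], state["exp_avg_sq"]
                    m.mul_(beta1).add_(p.grad, alpha=1 - beta1)
                    v.mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)
                    m_hat = m / (1 - beta1 ** t)          # bias correction
                    v_hat = v / (1 - beta2 ** t)
                    p.addcdiv_(m_hat, v_hat.sqrt_().add_(eps), value=-lr)
                else:
                    # Cold parameters carry no optimizer state: free any
                    # buffers left over from when they were hot, then do SGD.
                    self.state.pop(p, None)
                    p.add_(p.grad, alpha=-lr)
        return loss
```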
## Scenarios with natural parameter sparsity
- Fine-tuning: most backbone parameters stay cold; only the head is hot.
- Recommendation systems with millions of items and power-law access patterns.
- Mixture-of-experts models: only K of N experts are active per sample, so unused experts stay cold.
- Memory-constrained environments where every MB matters.
## Real numbers from real workloads
## Install with pip and replace your optimizer
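A minimal sketch of what the swap could look like. The `osopt` package name, the import path, and the Adam-like constructor signature are assumptions here; check the project's own install instructions for the real ones.

```python
# pip install osopt        <- assumed package name; see the project's install docs

import torch
from osopt import OSOpt     # import path and class name are assumptions

model = torch.nn.Linear(512, 512)

# Before: optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
optimizer = OSOpt(model.parameters(), lr=3e-4)   # Adam-like signature assumed

for _ in range(100):
    x = torch.randn(32, 512)
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```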