Train larger models on smaller GPUs

OSOpt is a drop-in PyTorch optimizer that cuts optimizer state memory by 30-50%. Same convergence as Adam. No code changes required.

50% Memory Reduction
0 Code Changes
100% Convergence Parity

The Problem

The Adam optimizer stores two state buffers per parameter (the first- and second-moment estimates). That's 2x your model size eating your VRAM.

Model         Parameters   Adam State   OSOpt State    Savings
BERT-base     110M         840 MB       420 MB         50%
GPT-2         774M         5.9 GB       2.9 GB         50%
LLaMA-7B      7B           53 GB        26 GB          50%
Fine-tuning   Any          2x params    ~0.1x params   96%
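
You can verify the Adam-state footprint on your own model with a few lines of plain PyTorch. The snippet below uses only standard torch APIs; the nn.Linear layer is just a stand-in for your real model:

import torch
from torch import nn
from torch.optim import Adam

model = nn.Linear(4096, 4096)                 # stand-in for your model
optimizer = Adam(model.parameters(), lr=1e-3)

# One step so Adam allocates its exp_avg / exp_avg_sq buffers
model(torch.randn(8, 4096)).sum().backward()
optimizer.step()

state_bytes = sum(
    t.numel() * t.element_size()
    for param_state in optimizer.state.values()
    for t in param_state.values()
    if torch.is_tensor(t)
)
print(f"Adam optimizer state: {state_bytes / 1e6:.1f} MB")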

The Solution

OSOpt classifies parameters as hot or cold based on gradient activity, then uses the right algorithm for each.

# Before: Adam using all your VRAM
from torch.optim import Adam
optimizer = Adam(model.parameters(), lr=1e-3)

# After: OSOpt using half
from osopt import OSOpt2
optimizer = OSOpt2(model.parameters(), lr=1e-3)

# That's it. Same training loop. Less memory.

Features

Built for production ML workflows

💾

Memory-Save Mode

Cold parameters store no momentum buffers. Buffers are freed automatically when a parameter is reclassified as cold.

🔀

Algorithm Switching

Hot params use Adam. Cold params use SGD. Right tool for each parameter.

📊

Access Tracking

Monitors gradient activity per parameter. Data-driven classification, not heuristics.
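
A conceptual sketch of how memory-save mode, algorithm switching, and access tracking can fit together. This is illustrative only, written against plain PyTorch; the class name HotColdSketch and the ema_decay / hot_threshold knobs are assumptions, not OSOpt's real internals:

import torch
from torch.optim import Optimizer

class HotColdSketch(Optimizer):
    # Illustrative only -- not OSOpt's actual implementation.
    # Hot params get Adam-style updates; cold params get plain SGD and no buffers.
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 ema_decay=0.99, hot_threshold=1e-3):
        defaults = dict(lr=lr, betas=betas, eps=eps,
                        ema_decay=ema_decay, hot_threshold=hot_threshold)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            lr, (beta1, beta2), eps = group["lr"], group["betas"], group["eps"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]

                # Access tracking: exponential moving average of gradient magnitude
                activity = group["ema_decay"] * state.get("activity", 0.0) \
                    + (1 - group["ema_decay"]) * p.grad.abs().mean().item()
                state["activity"] = activity

                if activity >= group["hot_threshold"]:
                    # Hot: Adam-style update (bias correction omitted for brevity)
                    if "exp_avg" not in state:
                        state["exp_avg"] = torch.zeros_like(p)
                        state["exp_avg_sq"] = torch.zeros_like(p)
                    state["exp_avg"].mul_(beta1).add_(p.grad, alpha=1 - beta1)
                    state["exp_avg_sq"].mul_(beta2).addcmul_(p.grad, p.grad, value=1 - beta2)
                    p.addcdiv_(state["exp_avg"], state["exp_avg_sq"].sqrt().add_(eps), value=-lr)
                else:
                    # Cold: drop any buffers (memory-save mode) and fall back to SGD
                    state.pop("exp_avg", None)
                    state.pop("exp_avg_sq", None)
                    p.add_(p.grad, alpha=-lr)

OSOpt's real classifier also keeps a Medium tier (visible in the benchmark output below); this two-way version only shows the mechanism.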

🔧

Drop-in Replacement

Works with your existing training code. No architecture changes needed.

🤗

Framework Integration

Ready-made integrations for HuggingFace Transformers and PyTorch Lightning.
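
Even without those ready-made integrations (whose APIs aren't shown here), OSOpt2 plugs into either framework through its standard optimizer hooks. A minimal sketch, assuming only the OSOpt2(params, lr=...) constructor from the snippet above; LitClassifier is a toy module:

import pytorch_lightning as pl
import torch.nn.functional as F
from torch import nn
from osopt import OSOpt2

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self.net(x), y)

    def configure_optimizers(self):
        # Drop-in: return OSOpt2 exactly where you would return Adam
        return OSOpt2(self.parameters(), lr=1e-3)

# HuggingFace Transformers: pass the optimizer straight to Trainer, e.g.
#   Trainer(model=model, args=args, train_dataset=ds,
#           optimizers=(OSOpt2(model.parameters(), lr=1e-3), None))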

🛡️

Deadlock Detection

Automatically detects stuck training and intervenes to recover.

Where OSOpt Shines

Scenarios with natural parameter sparsity

Fine-tuning

Most backbone params stay cold. Only the head is hot.

96.5% saved

Large Embeddings

Recommendation systems with millions of items. Power-law access patterns.

100% saved

Mixture of Experts

Only K of N experts active per sample. Unused experts are cold.

48.5% saved

Edge Deployment

Memory-constrained environments where every MB matters.

30-50% saved
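
For a rough sense of what your own workload might save, here is a back-of-the-envelope sketch. It assumes hot parameters keep full Adam state (two fp32 buffers) and cold parameters keep none; the estimated_savings helper and the 2% hot fraction are illustrative assumptions, not OSOpt's measured behavior:

def estimated_savings(n_params, hot_fraction, bytes_per_value=4):
    # Adam keeps two buffers for every parameter; this sketch assumes
    # OSOpt keeps them only for the hot fraction.
    adam_state = 2 * n_params * bytes_per_value
    osopt_state = 2 * int(hot_fraction * n_params) * bytes_per_value
    return adam_state, osopt_state, 1 - osopt_state / adam_state

# Example: fine-tuning a 110M-parameter model where ~2% of params stay hot
adam, osopt, saved = estimated_savings(110_000_000, hot_fraction=0.02)
print(f"Adam: {adam / 1e6:.0f} MB   OSOpt: {osopt / 1e6:.0f} MB   saved: {saved:.0%}")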

Benchmarks

Real numbers from real workloads

# Mixture of Experts - 8 experts, top-2 routing

Epoch    OSOpt2      Adam
------------------------------------
  1      0.942288    0.942288
 50      0.005483    0.005483
100      0.000767    0.000767
200      0.000000    0.000000
300      0.000000    0.000000

# Identical convergence. Half the memory.
Memory saved: 48.5%
Classification: Hot=11, Medium=15, Cold=10

Get Started in 30 Seconds

Install with pip and replace your optimizer

$ pip install osopt
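
From there the training loop is unchanged. A minimal runnable example with a toy model and random data (placeholders for your own), assuming OSOpt2 exposes the standard torch.optim.Optimizer interface as the drop-in claim implies:

import torch
from torch import nn
from osopt import OSOpt2

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = OSOpt2(model.parameters(), lr=1e-3)   # the only line that changes
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    x = torch.randn(64, 128)                      # placeholder batch
    y = torch.randint(0, 10, (64,))
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()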