Problem

Python ML containers can balloon without anyone making an explicit decision. The default PyTorch install pulls CUDA libraries even for CPU-only workloads, which bloats images, slows installs, and delays serverless or autoscaled deploys, where cold starts have to pull the full image.

Mechanism

The default package path optimizes for broad hardware support. If your workload does not need a GPU, the CUDA baggage is pure deployment weight.

# Before
pip install torch torchvision
# Can pull roughly 2.7 GB of CUDA-related packages

# After
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
# CPU-only wheels, roughly 200 MB
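A quick way to audit an existing environment for this weight (a sketch, assuming pip is available as `python -m pip`): the default torch wheel depends on nvidia-* runtime packages, so their presence flags bundled CUDA libraries.

```shell
# List any CUDA runtime wheels pulled in by the default torch build.
# If none are installed, say so instead of failing.
python -m pip list --format=freeze 2>/dev/null | grep -i '^nvidia-' \
    || echo "No nvidia-* packages installed: no CUDA runtime weight here."
```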

Fix

Install CPU-only PyTorch before the rest of your requirements, and pin versions so a future upstream change does not alter your image unexpectedly.
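One way to make the pin durable is a constraints file, so later edits to requirements.txt cannot silently drift back to the CUDA build. A hedged sketch; the torchvision version pairing here is an assumption, so check the compatibility matrix at pytorch.org for the pair matching your torch release.

```shell
# Pin the CPU wheels in a constraints file that every install must honor.
cat > constraints.txt <<'EOF'
torch==2.6.0+cpu
torchvision==0.21.0+cpu
EOF
# Then install against the pins:
#   pip install -r requirements.txt -c constraints.txt \
#       --extra-index-url https://download.pytorch.org/whl/cpu
```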

RUN pip install --no-cache-dir torch==2.6.0+cpu \
        --extra-index-url https://download.pytorch.org/whl/cpu \
    && pip install --no-cache-dir -r requirements.txt

Combine that with normal Docker hygiene: a slim base image, fewer layers, apt cache cleanup in the same layer that populates it, --no-install-recommends, and pip --no-cache-dir.
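Those practices combine into a Dockerfile along these lines. A sketch, not a drop-in file: the base image tag, the git dependency, the requirements file, and the entrypoint are all assumptions to adapt.

```dockerfile
# Sketch: slim base, CPU-only torch first, pinned requirements after,
# caches cleaned in the same layer that creates them.
FROM python:3.12-slim

# System packages without recommended extras; apt cache removed in-layer
# so it never lands in the image.
RUN apt-get update \
    && apt-get install -y --no-install-recommends git \
    && rm -rf /var/lib/apt/lists/*

# CPU-only PyTorch before everything else, so a transitive dependency
# cannot pull the CUDA build back in.
RUN pip install --no-cache-dir torch==2.6.0+cpu \
        --extra-index-url https://download.pytorch.org/whl/cpu

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . /app
WORKDIR /app
CMD ["python", "main.py"]
```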

What changed in practice

Install path      | Approx package weight   | Deploy effect
Default PyTorch   | ~2.7 GB CUDA libraries  | Large image, slower install
CPU-only PyTorch  | ~200 MB                 | Smaller image, faster deploy

Production lesson

Container size is often dependency selection disguised as infrastructure. If a workload is CPU-only, make that choice explicit in the Dockerfile.