Wednesday, December 10, 2025

I Built A GPU Profiler!

Quick Recap

I wrote about seeing ML engineers waste money on misconfigured training jobs. Coming from performance engineering, it felt weird that there's no standard "pre-flight check" before launching expensive GPU runs.

I said I was working on something. Well, it's done and the results surprised me.

The Test

I trained Llama 2 7B on 16 A100 GPUs ($131/hour). Just a standard multi-node DDP setup, nothing fancy.
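A setup like that is typically kicked off with `torchrun`. A hypothetical launch line for 2 nodes of 8 GPUs each (hostnames, port, and script name are placeholders, not my actual config):

```shell
# Run on each node; node0 acts as the rendezvous host (all names illustrative).
torchrun --nnodes=2 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint=node0:29500 \
  train.py --model llama-2-7b
```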

First run (no optimization):

  • Step time: 6234ms
  • Scaling efficiency: 78.2%
  • Communication overhead: 25.9%

That last number jumped out. Over a quarter of each training step was just GPUs waiting to sync gradients. Not computing, just waiting.

At 78.2% efficiency, I was wasting about $28/hour doing nothing.
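The math behind that number is just the inefficient fraction of the hourly rate:

```python
hourly_rate = 131.0   # $/hour for 16 A100s
efficiency = 0.782    # measured scaling efficiency

# Everything below 100% efficiency is money spent on idle GPUs.
wasted_per_hour = hourly_rate * (1 - efficiency)
print(f"${wasted_per_hour:.2f}/hour")  # → $28.56/hour
```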

The Fixes

Change 1: Increased DDP bucket size from 25MB to 100MB

  • Communication down from 25.9% to 21.6%
  • Throughput increased 5.8%

Better, but still not great.
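For reference, this is a one-argument change on the DDP wrapper (`bucket_cap_mb` defaults to 25). A sketch, assuming `model` and `local_rank` come from the surrounding setup code:

```python
from torch.nn.parallel import DistributedDataParallel as DDP

# Larger buckets mean fewer, bigger all-reduces per backward pass.
model = DDP(model, device_ids=[local_rank], bucket_cap_mb=100)
```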

Change 2: Added gradient accumulation (sync every 4 batches instead of every batch)

  • Communication decreased from 21.6% to 2.7%
  • Throughput shot up 16.5%

This one worked because it let the GPU compute while gradients synced. Instead of "compute, wait, compute, wait," it became "compute compute compute compute, sync while computing the next batch."
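The usual way to do this with DDP is its `no_sync()` context manager, which skips the gradient all-reduce for that backward pass. Here's a stdlib-only sketch of the pattern; `FakeDDPModel` is a stand-in for the real DDP wrapper, just to show which steps actually sync:

```python
import contextlib

ACCUM_STEPS = 4
synced_steps = []

class FakeDDPModel:
    """Stand-in for a DDP-wrapped model (illustrative, not the real API)."""
    @contextlib.contextmanager
    def no_sync(self):
        # Real DDP skips the gradient all-reduce inside this context.
        yield

def backward(step, sync):
    if sync:
        synced_steps.append(step)  # real DDP all-reduces gradients here

model = FakeDDPModel()
for step in range(8):
    sync = (step + 1) % ACCUM_STEPS == 0
    ctx = contextlib.nullcontext() if sync else model.no_sync()
    with ctx:
        backward(step, sync)

print(synced_steps)  # → [3, 7]: one sync per 4 batches, 4x fewer all-reduces
```

In the real training loop you'd also scale the loss by `ACCUM_STEPS` and only call `optimizer.step()` on the syncing batches.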

Final results:

It saved about $50 per 100K samples. Extrapolated to 10 million samples, that's $5,000! There is some value here after all :)
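The extrapolation is linear in sample count:

```python
savings_per_100k = 50.0
samples = 10_000_000

# Savings scale with how many 100K-sample chunks you train on.
projected = savings_per_100k * (samples / 100_000)
print(f"${projected:,.0f}")  # → $5,000
```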

What I Learned

The same approach I used in my previous roles (measure baseline, find bottleneck, test fix, validate) works here. Different domain, same principles.

You iterate based on data, so it took me a few tries to get there.

I kept this basic on purpose: no complex setup, it just works out of the box. I'm not an expert.
