What I Keep Noticing
I've been studying GPU training for the past few months as I transition from financial services to ML infrastructure. One complaint keeps coming up everywhere:
"My training is way slower than I expected and I have no idea why."
Someone on r/MachineLearning mentioned they spent $12K on a training run that took 3 days, only to find out later that a config issue was to blame and it should have taken 1 day. Another person said they just accept that multi-GPU training is "inefficient" and budget extra time.
A Weights & Biases survey found 62% of ML engineers say their training takes 20-50% longer than expected. That's not a small problem.
This Feels Really Familiar
I spent 10 years in performance engineering, where we had strict rules about deploying code:
- Run it through automated performance tests first
- If it doesn't meet SLAs, it doesn't ship
- Catch issues before they hit production, not after
The key was that last part. Find problems early when they're cheap to fix, not after you've burned through resources.
Watching people launch $10K GPU jobs without any validation feels like watching someone deploy to production without testing. You're going to find problems eventually, but it's going to cost you.
The Weird Part
What surprises me is that ML infrastructure doesn't have this. There's no standard "pre-flight check" for training jobs. No simple tool that says "hey, your config is going to waste 30% of your GPU time."
Profilers exist (Nsight, PyTorch Profiler, etc.), but they're complex, and most people don't reach for them until something is already wrong.
What's missing is something really simple:
- Run this for 5 minutes before your real training
- It tells you if you have obvious issues
- You fix them before spending money
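To make the idea concrete, here's a minimal sketch of what such a pre-flight validator could look like. Everything here is hypothetical: the metric names, the thresholds, and the warning text are illustrative defaults I made up, not numbers from any existing tool or published benchmark.

```python
# Sketch of a pre-flight check. The metrics dict would be filled in by a
# short (~5 minute) profiling run; the thresholds are illustrative
# placeholders, not published reference values.

def preflight_check(metrics):
    """Return a list of human-readable warnings for suspicious metrics."""
    warnings = []
    if metrics.get("gpu_utilization", 1.0) < 0.80:
        warnings.append(
            f"GPU utilization is {metrics['gpu_utilization']:.0%}; "
            "check for a dataloader or host-side bottleneck."
        )
    if metrics.get("dataloader_fraction", 0.0) > 0.10:
        warnings.append(
            f"{metrics['dataloader_fraction']:.0%} of step time spent "
            "waiting on data; consider more workers or prefetching."
        )
    if metrics.get("comm_fraction", 0.0) > 0.30:
        warnings.append(
            f"{metrics['comm_fraction']:.0%} of step time in gradient "
            "communication; check overlap and bucket settings."
        )
    return warnings

if __name__ == "__main__":
    # Example: a run that spends much of each step waiting on data.
    sample = {"gpu_utilization": 0.55,
              "dataloader_fraction": 0.40,
              "comm_fraction": 0.05}
    for w in preflight_check(sample):
        print("WARNING:", w)
```

The point isn't the specific thresholds; it's that the checks are cheap, explicit, and run before the expensive job starts.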
What I'm Thinking
I've been thinking about a basic idea:
Profile a training job for a small number of steps, measure where the time goes at each layer, compare against reasonable benchmarks, flag anything that looks off, and give specific suggestions on what to fix.
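The "measure where time goes" part can be sketched as a tiny timing harness. This is a toy illustration under assumptions: the phase names and the sleep calls stand in for real training work, and a real version would use GPU-aware timing rather than wall-clock alone.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class StepTimer:
    """Accumulate wall-clock time per training phase over a few steps."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def fractions(self):
        """Fraction of total measured time spent in each phase."""
        total = sum(self.totals.values()) or 1.0
        return {name: t / total for name, t in self.totals.items()}

# Usage: wrap each part of the training step, run a handful of steps,
# then compare the measured fractions against expectations (e.g. data
# loading should be a small share of each step).
timer = StepTimer()
for _ in range(3):
    with timer.phase("data"):
        time.sleep(0.01)   # stand-in for batch loading
    with timer.phase("compute"):
        time.sleep(0.02)   # stand-in for forward/backward
print(timer.fractions())
```

Once you have those fractions, the "flag anything weird" step is just comparing them to thresholds, which is exactly where the cheap wins should be.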
Not trying to build some complex monitoring system. Just a simple "run this first, save money later" tool, the same pre-flight pattern I know from performance engineering.
I plan to prove it out against a Llama 2 7B setup on 16 GPUs to see whether it actually catches real issues: something like identifying a bottleneck that was eating 25% of resources unnecessarily, or plain underutilization.
Still figuring out the details, but the idea feels right. Will share more once I have something that I can prove actually works.
If you've hit similar problems or have thoughts on what would be useful, let me know. Always good to hear what people actually need vs what I think they need.