Monday, November 24, 2025

Everyone's GPU Training Runs Slow

What I Keep Noticing

I've been studying GPU training for the past few months as I transition from financial services to ML infrastructure. One complaint keeps coming up everywhere:

"My training is way slower than I expected and I have no idea why."

Someone on r/MachineLearning mentioned they spent $12K on a training run that took 3 days, only to find out later they had a config issue and it should've taken 1 day. Another person said they just accept that multi-GPU training is "inefficient" and budget extra time.

A Weights & Biases survey found 62% of ML engineers say their training takes 20-50% longer than expected. That's not a small problem.

This Feels Really Familiar

I spent 10 years in performance engineering, where strict rules governed deploying code:

  • Run it through automated performance tests first
  • If it doesn't meet SLAs, it doesn't ship
  • Catch issues before they hit production, not after

The key was that last part. Find problems early when they're cheap to fix, not after you've burned through resources.

Watching people launch $10K GPU jobs without any validation feels like watching someone deploy to production without testing. You're going to find problems eventually, but it's going to cost you.

The Weird Part

What surprises me is that ML infrastructure doesn't have this. There's no standard "pre-flight check" for training jobs. No simple tool that says "hey, your config is going to waste 30% of your GPU time."

People have profilers (Nsight, PyTorch Profiler, etc.) but they're complex and most people don't use them until something's already wrong.

What's missing is something really simple:

  • Run this for 5 minutes before your real training
  • It tells you if you have obvious issues
  • You fix them before spending money

What I'm Thinking

I've been thinking about a basic idea:

Profile a training job for a small number of steps, measure where time goes at each layer, compare to reasonable benchmarks, flag anything weird, and give specific suggestions on what to fix.
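As a rough sketch of the core idea, here's a minimal stdlib-only Python harness that times a handful of steps and flags outliers against the median. The function names (profile_steps, flag_slow_steps) are hypothetical, and a real tool would hook into an actual profiler rather than wall-clock timing:

```python
import time
import statistics

def flag_slow_steps(durations, threshold=1.5):
    """Return indices of steps slower than threshold x the median duration."""
    median = statistics.median(durations)
    return [i for i, d in enumerate(durations) if d > threshold * median]

def profile_steps(step_fn, n_steps=20, warmup=3):
    """Run a few training steps, timing each one after a short warm-up."""
    for _ in range(warmup):  # warm-up absorbs one-time setup costs
        step_fn()
    durations = []
    for _ in range(n_steps):
        start = time.perf_counter()
        step_fn()
        durations.append(time.perf_counter() - start)
    return durations

if __name__ == "__main__":
    # A dummy "training step" stands in for a real forward/backward pass.
    durations = profile_steps(lambda: sum(range(10_000)), n_steps=10)
    print(flag_slow_steps(durations))
```

Separating the timing loop from the analysis keeps the flagging logic deterministic and easy to test, which matters if the tool is supposed to give trustworthy verdicts in five minutes.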

Not trying to build some complex monitoring system. Just a simple "run this first, save money later" tool, the kind of pre-deployment gate I'm personally familiar with.

I can prove it out against a Llama 2 7B setup on 16 GPUs to see if it actually catches real issues, something like a bottleneck unnecessarily eating 25% of resources, or GPUs sitting underutilized.

Still figuring out the details, but the idea feels right. Will share more once I have something that I can prove actually works.

If you've hit similar problems or have thoughts on what would be useful, let me know. Always good to hear what people actually need vs what I think they need.

Saturday, November 22, 2025

It's Alive! My First Personal Cloud Deployment

I did it. My FastAPI application is running in the cloud, deployed through code, built automatically from commits, and accessible from anywhere. This moment feels surreal after weeks of learning individual pieces that finally came together today.

The Moment It Clicked

When I ran terraform apply and watched AWS resources spin up from my configuration files, something shifted. This wasn't a tutorial anymore. This was real infrastructure I could touch, modify, and tear down at will. The terminal showed each resource being created: cluster, task definition, service, security groups. Then the magic words appeared: "Apply complete!"

I grabbed the public IP from the ECS console, pasted it into my browser, and there it was. My FastAPI docs page, served from a container running on AWS infrastructure I defined in code. No clicking through console menus. No manual configuration. Just code that describes what I want and tools that make it happen.

How It All Fits Together

GitHub Actions watches my repository. When I push code, it builds a Docker image and sends it to ECR. Terraform defines the ECS infrastructure that pulls that image and runs it. CloudWatch captures the logs so I can see what's happening inside the container. Every piece has a purpose, and they work together seamlessly.

The workflow is elegant. I write Python code for my application. I commit it. Minutes later, a fresh container with that code is running in AWS. That's the power of automation.

My Certification Paid Off

Studying for the AWS Solutions Architect Associate certification felt theoretical at times. I memorized service names, learned about networking concepts, and practiced designing architectures on paper. Today, all that theory became practice.

I knew exactly what ECS Fargate meant because I studied it. I understood why security groups needed specific ingress rules. I recognized the relationship between tasks, services, and clusters. The IAM permissions made sense. The CloudWatch integration was obvious. My certification wasn't just a credential. It was a foundation that made this entire project possible.

Without that knowledge, I'd be guessing at every step. Instead, I made informed decisions about architecture, understood the cost implications, and knew which services to use and why.

What I Actually Learned

The technical skills are valuable: Docker, CI/CD, Terraform, AWS services. But the real learning was about systems thinking. Modern cloud applications aren't just code. They're pipelines, infrastructure, security, monitoring, and automation working together.

I learned that breaking complex goals into small steps makes everything achievable. Containerize the app first. Set up CI/CD next. Learn Terraform separately. Then combine them. Trying to do everything at once would have been overwhelming. Taking it piece by piece made it manageable and educational.

I also learned that professional tools aren't as scary as they seem. GitHub Actions looked intimidating until I wrote my first workflow. Terraform seemed complex until I created my first resource. AWS felt massive until I focused on just the services I needed. The key was starting simple and building up.

The Real Goal: AI in Production

This FastAPI app is just a vehicle for learning. The real goal has always been deploying machine learning models to production. I've trained models before. I've built notebooks full of experiments. But I never knew how to take those models from my laptop to a place where real users could interact with them.

Now I know the path. Take the model, wrap it in an API, containerize it, build a CI/CD pipeline, deploy it with infrastructure as code, and suddenly that model is accessible to the world. The framework is in place. The skills are learned. The infrastructure is ready.

What's Coming Next

Here's where things get interesting. I have the deployment pipeline working. I understand the cloud architecture. I can ship code to production automatically. Now I need to decide what to build.

I'm thinking about combining my ML experience with this new deployment knowledge. Maybe a model that does something useful, wrapped in a simple interface, deployed through this exact pipeline. Or perhaps something that solves a problem I've personally encountered. The possibilities are wide open.

I haven't decided yet. I want it to be meaningful, practical, and maybe even a little fun. Something that demonstrates both technical capability and thoughtful application of AI. Something that makes someone's life slightly better or solves a real problem, even if it's a small one.

The Journey Continues

A few weeks ago, I had a FastAPI app running locally. Today, it's deployed to AWS through an automated pipeline defined entirely in code. That progression represents real learning and real capability gained.

But this is just the foundation. The infrastructure is ready. The skills are sharp. The certification knowledge is fresh. Now comes the creative part: deciding what to actually build with all of this.

The only question left is: what will I build? You'll be the first to know.

Friday, November 21, 2025

Learning Infrastructure as Code with Terraform

After automating my container builds with GitHub Actions, I faced a new challenge: how do I deploy these containers to AWS without clicking through endless console menus? The answer was Terraform, a tool that lets you define infrastructure in code files. Today I went from manual AWS console work to managing infrastructure like a developer manages code.

The Mental Shift

I've clicked through the AWS console plenty while studying for my Solutions Architect certification. Create a bucket here, configure some settings there, click save. But this approach doesn't scale and leaves no record of what you did. Infrastructure as code flips this model completely. You describe what you want in a file, and Terraform figures out how to make it happen.

The hardest part wasn't the syntax. It was changing how I think about infrastructure. Instead of "first do this, then do that," I had to think "here's the end state I want" and let Terraform work out the steps. This declarative approach felt strange at first but quickly became natural.

Getting Started

I installed Terraform with Homebrew and created my first project to make a simple S3 bucket. The configuration file was surprisingly readable: define an AWS provider, describe a bucket with some properties, and that's it. Running terraform init downloaded the AWS provider plugin. Running terraform plan showed me what would be created. Running terraform apply actually created it.
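Those ten-ish lines of configuration look roughly like the sketch below. The region and bucket name are placeholders (S3 bucket names must be globally unique), so treat this as the shape of the file rather than my exact project:

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "demo" {
  # Bucket names are globally unique across all AWS accounts
  bucket = "my-first-terraform-bucket-demo"
}
```

With this file in place, terraform init, terraform plan, and terraform apply are the whole workflow.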

That first successful apply was satisfying. I wrote about ten lines of configuration, ran two commands, and infrastructure appeared in AWS. No console clicking required.

The Power of Plan and Apply

The terraform plan command became my favorite feature immediately. It shows exactly what will change before anything actually happens. Resources marked with a plus will be created. A tilde means modified. A minus means destroyed. This preview eliminated the fear of making changes. I could see the impact before committing to it.

I practiced by creating multiple resources, modifying their properties, and watching Terraform show me the precise differences. The tool understood dependencies automatically. When I created a file inside a bucket, Terraform knew to create the bucket first without me specifying the order.

Variables and State

Hardcoding values isn't scalable, so I learned about variables. I created a variables.tf file defining configurable values like region, environment, and resource names. A separate terraform.tfvars file set the actual values. This separation means I can reuse the same infrastructure code across different projects just by changing the variable file.
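As a sketch, the split between declarations and values looks like this. The variable names and values are illustrative, not copied from my project:

```hcl
# variables.tf -- declare the configurable values
variable "region" {
  type    = string
  default = "us-east-1"
}

variable "environment" {
  type = string
}

# terraform.tfvars -- set the actual values for this project
# region      = "us-east-1"
# environment = "dev"
```

Swapping in a different terraform.tfvars is what makes the same infrastructure code reusable across projects.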

The state file was the key to understanding how Terraform works. After creating resources, Terraform writes a JSON file tracking everything it manages. This state lets Terraform compare what exists to what you want and calculate the minimal set of changes needed. It's Terraform's memory, and losing it means Terraform forgets what it created.

How My Certification Helped

My AWS Solutions Architect certification provided crucial context. When Terraform creates an S3 bucket, it's making AWS API calls. I understood those APIs from studying for the exam. I knew about IAM permissions, regions, and resource naming constraints. I understood why some changes require resource replacement while others can be updated in place.

This foundation meant I wasn't learning AWS and Terraform simultaneously. I was applying existing knowledge through a new tool, which made the learning curve much gentler.

What Changed

I went from clicking through the AWS console to defining infrastructure in version-controlled files. My infrastructure is now documented, repeatable, and shareable. I can destroy everything with one command and recreate it identically minutes later. Changes are reviewable through Git diffs just like application code.

More importantly, I understand why professionals work this way. Manual infrastructure management doesn't scale. Infrastructure as code does.

Next: Bringing It All Together

I now have every piece needed for deployment. GitHub Actions builds my container images automatically. ECR stores them. Terraform can define AWS infrastructure. Tomorrow I combine these skills to deploy my FastAPI application to ECS using Terraform.

The configuration will be more complex than an S3 bucket, but the workflow is identical: define resources in code, plan the changes, review them, and apply. I'll describe an ECS cluster, task definition pointing to my ECR image, and a service to run the task. Terraform will create everything, and my application will be running in the cloud.

This is the milestone I've been working toward. From local development to automated deployment in AWS, defined entirely in code. Every skill I've learned contributes to this moment. The pieces are ready. Tomorrow I put them together.

Thursday, November 20, 2025

Building My CI/CD Pipeline: From Local Builds to Automated Deployments

Today marked a significant milestone in my cloud journey. I moved from manually building Docker images on my laptop to having a fully automated CI/CD pipeline that builds, tests, and deploys container images to AWS. This is the kind of automation that separates hobby projects from production-grade systems.

What I Needed to Learn

Before today, I understood containers conceptually and could build Docker images locally. But I had never set up a real deployment pipeline. I needed to learn how GitHub Actions works, how to securely store and use AWS credentials, how to configure Amazon ECR, and how to implement security controls that prevent bad code from reaching production.

The authentication piece seemed particularly daunting. GitHub Actions needed to authenticate with my AWS account, access my private container registry, and push images. All of this had to happen securely without exposing credentials. I also needed to understand the difference between building images for testing versus building them for deployment.

Setting Up GitHub Actions

I created my first GitHub Actions workflow file in the .github/workflows directory of my repository. The workflow is defined in YAML and describes exactly what should happen when code is pushed. It checks out the latest code, authenticates with AWS, logs into ECR, builds the Docker image, and pushes it to the registry.

The workflow uses actions from the GitHub marketplace, which are pre-built components that handle complex tasks. The aws-actions/configure-aws-credentials action manages AWS authentication, and aws-actions/amazon-ecr-login handles the ECR-specific login process. I didn't have to figure out the low-level AWS API calls myself.
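A workflow along those lines might look like the following sketch. The branch, image name, and region are placeholders, not the exact file from my repo, though the two marketplace actions are the real ones mentioned above:

```yaml
# .github/workflows/build.yml (sketch; names and region are placeholders)
name: build-and-push
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2
      - run: |
          docker build -t ${{ steps.login-ecr.outputs.registry }}/hello-world-api:latest .
          docker push ${{ steps.login-ecr.outputs.registry }}/hello-world-api:latest
```

The login step exposes the registry URL as an output, which is why the build step can reference it without hardcoding an account ID.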

What impressed me most was seeing it work for the first time. I pushed code to GitHub, switched to the Actions tab, and watched the workflow execute in real time. Each step showed green checkmarks as it progressed. Within two minutes, my new Docker image appeared in ECR, built entirely from code without any manual intervention.

Implementing Security Controls

With automation comes responsibility. At an enterprise level, you cannot let just any code push and deploy directly to production. Teams need security, and that means controls.

The GitHub Actions workflow can distinguish between pull requests and main branch commits. On PRs, it builds the Docker image to verify everything works without pushing to ECR. I practiced this, and it gave me fast feedback on whether my changes were valid. Only when code merges to main does the workflow push to ECR, ensuring my container registry only contains images from approved code.

I stored AWS access keys as GitHub secrets, which are encrypted and never exposed in logs. The IAM user I created for GitHub Actions has limited permissions, following the principle of least privilege. If those credentials were somehow compromised, they can only push to ECR and nothing else.

Configuring Amazon ECR

Setting up ECR required more than just creating a repository. I had to create an IAM user specifically for automation, generate access keys, authenticate with ECR, and configure everything to work together.

ECR uses a two-step authentication process. First, you authenticate with AWS using IAM credentials. Then you use that authenticated session to get a temporary Docker login token for ECR. This token expires after 12 hours, so the workflow has to generate a fresh one on every run.

I configured the same credentials locally using the AWS CLI so I could test the process manually before automating it. Understanding the authentication flow can make debugging much easier when things don't work the first time.

The repository itself is straightforward. I gave it a name, chose private visibility, and left most settings at their defaults. The interesting part was seeing how the workflow tags images. I experimented with different tagging strategies: one labeled "latest" for easy reference and another using the Git commit SHA for precise version tracking. With my app at such an early stage I can probably stick with "latest", but it's good to know multiple tagging strategies are available.

Adding Automated Testing

One of the key benefits of CI/CD is catching problems early. I configured the workflow to run on every pull request, building the Docker image to verify it works. This seems simple but catches many issues automatically.

If I add a Python package to my code without updating requirements.txt, the Docker build fails immediately. If there's a syntax error in the Dockerfile, I find out right in the PR before merging. If I accidentally break an import, the build catches it. All of this happens automatically within minutes of pushing code.

The workflow provides immediate feedback directly in the GitHub interface. Green checkmarks mean the code is ready. Red indicators mean something needs fixing. I can see the full logs of what went wrong without having to reproduce the issue locally.

This is the continuous integration part of CI/CD. Every integration of new code into the main branch is automatically verified. As my project grows, I can add more sophisticated tests like unit tests, integration tests, and security scans, all running automatically.

How My AWS Certification Helped

Having the AWS Solutions Architect Associate certification made this entire process significantly smoother. I already understood what ECR and IAM were and how they fit into the AWS ecosystem. The certification covered IAM best practices extensively, which made it obvious why I needed a dedicated user for automation rather than using root credentials.

I understood the security model behind AWS credentials, why access keys should be rotated, and how to scope IAM policies to minimum necessary permissions. This theoretical knowledge translated directly into practical decisions about how to set up my pipeline securely.

The certification also taught me about AWS service limits, regions, and cost optimization. I knew to check which region offered the best pricing, how ECR storage costs work, and when to clean up old images to save money. These might seem like small details, but they add up in real projects.

What I Accomplished

I now have a complete CI/CD pipeline from code to container registry. When I push code to GitHub, it automatically builds a Docker image and pushes it to ECR within minutes. The process is secured with proper IAM policies and GitHub secrets. Branch protection ensures only tested code reaches production.

My current costs are minimal. ECR charges $0.10 per GB per month for storage, and my Docker image is under 200MB. I'm paying pennies. GitHub Actions is free for public repositories and includes generous free minutes for private repositories.

Moving Toward My End Goal

This pipeline is a critical piece of the larger puzzle. My end goal is deploying AI workloads to AWS, which requires reliable container deployment. I now have half of that equation solved: reliable container building and storage.

The next phase is using these containers. I need to deploy them to ECS using Terraform, which means learning infrastructure as code. Once I can define my infrastructure in code and deploy containers automatically, I'll have a complete system that takes me from a code commit to running containers in the cloud.

Every step builds on the previous one. The momentum is building, and I can see the path forward clearly. Tomorrow I'll tackle infrastructure as code with Terraform, defining the AWS resources needed to actually run these containers. The pieces are coming together and I'm genuinely excited about what I'm building.

Wednesday, November 19, 2025

My First Steps into Containerization and Deployment

I have completed something I've been wanting to learn for a while: building a web application from scratch and preparing it for cloud deployment. This post documents what I learned and the steps I took to get there.

Starting Point

I knew I wanted to eventually deploy an AI workload to AWS, but I realized I was missing fundamental knowledge about containerization and infrastructure. Rather than jumping straight into the complex stuff, I decided to start with the basics and build up from there.

Building the Application

I chose Python and FastAPI for this project. FastAPI is a modern web framework that's simple enough for beginners but powerful enough for production use. I started by creating a basic "Hello World" application with a few essential endpoints.

The initial app had just two routes: a root endpoint that returned a simple greeting and a health check endpoint. The health check is important because load balancers and container orchestration systems need a way to verify that your application is running properly.

Once I had the basics working, I expanded the application to include more realistic API patterns. I added endpoints that demonstrate path parameters, query parameters, and POST requests with JSON bodies. This helped me understand how modern APIs handle different types of data and requests.

Version Control

Before going any further, I set up Git and pushed my code to GitHub. This was an important step because it meant I had a backup of my work and a foundation for automated deployments later. I created a proper .gitignore file to exclude things like virtual environments and temporary files that shouldn't be in version control.

Containerization with Docker

The next step was containerizing the application. I installed Docker Desktop and learned how to create a Dockerfile. A Dockerfile is basically a recipe that tells Docker how to build your application into a container image.

My Dockerfile started with a Python base image, installed the required dependencies from a requirements.txt file, copied my application code, and specified the command to run when the container starts. The whole process of building and running the container locally took just a few commands.
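A Dockerfile matching that description might look like the sketch below. The base image tag, the uvicorn command, and the main:app module name are assumptions on my part:

```dockerfile
FROM python:3.12-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code
COPY . .

# Command to run when the container starts
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Ordering the dependency install before the code copy means small code changes don't force a full reinstall on every build.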

What impressed me most about containers is how they solve the "it works on my machine" problem. The container includes everything the application needs to run, so it behaves the same way whether it's running on my laptop or in the cloud.

Pushing to a Container Registry

The final step was pushing my Docker image to GitHub Container Registry. This involved creating a personal access token for authentication, tagging my image with the proper format, and uploading it to GitHub's registry.

Now my container image is stored in a central location where it can be pulled from anywhere. This is exactly what I'll need when I deploy to AWS, the registry will be the source for my container images.

What I Learned

The most valuable lesson from today was understanding how all these pieces fit together. A modern application deployment involves several layers: the application code itself, the dependencies it needs, the container that packages everything together, version control to track changes, and a registry to store and distribute container images.

I also learned that taking things one step at a time makes complex topics much more manageable. Instead of trying to learn Docker, Git, FastAPI, and AWS all at once, I focused on getting each piece working before moving to the next.

What's Next

Now that I have a working application containerized and stored in a registry, I'm ready to start using AWS. The next steps will involve applying my AWS-SAA knowledge to run my containers on ECS Fargate, and using Terraform to define my infrastructure as code.

It's imperative to keep costs low while learning, so I'll be focusing on services like ECS Fargate that offer a good balance between simplicity and cost-effectiveness. I am excited to set up proper monitoring and logging so I can see what's happening with my deployed application.

Final Thoughts

If you're thinking about learning containerization and cloud deployment, my advice is to start small and build incrementally. Don't worry about getting everything perfect on the first try. Focus on understanding what each tool does and why you need it.

The journey from a simple Python script to a containerized application ready for cloud deployment taught me more in one day than weeks of reading tutorials. There's no substitute for actually building something and working through the problems as they come up.

I'm looking forward to the AWS portion of this project and seeing my containerized application running in the cloud. Stay tuned for the next post where I'll document that process.

Oh and here is the link to my project: https://github.com/ka-learns/hello-world-api