Case Study

Enabling AI self-improvement at scale through an LLM fine-tuning pipeline

Driven by results:

  • Improved GPU utilization and reduced training time
  • Substantially reduced compute costs through dynamic batch sizing and efficient GPU utilization
  • Abstracted infrastructure management that lets teams focus on model development
Industry
Artificial Intelligence (AI), Machine Learning
Services
Generative AI Infrastructure, LLM Fine-Tuning, Data Engineering
Client
Osmosis

Overview

Osmosis (Gulp.ai) is a San Francisco-based AI company that enables AI self-improvement through real-time reinforcement learning. The Y Combinator-backed startup focuses on unlocking AI agent productivity at production scale by providing the missing piece for truly effective AI systems: the ability to learn from experience. Osmosis addresses the critical gap in deploying AI agents that can continuously improve their performance through real-world interactions.

To improve GPU utilization, Osmosis needed to integrate SGLang as the model-inference backend for fine-tuning runs driven by VeRL, a reinforcement learning framework designed specifically for fine-tuning large language models. The challenge was to build a fully functional, custom Docker image that integrates seamlessly with Amazon SageMaker Hyperpod on EKS while meeting specific CUDA, PyTorch, and Python compatibility requirements.
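
To make the integration concrete, here is a minimal sketch of how a VeRL PPO fine-tuning run might be launched with SGLang as the rollout (inference) backend. The Hydra-style option names follow VeRL's public examples, and the model path, data path, and GPU counts are placeholders; exact names vary between VeRL releases and environments.

```python
import subprocess

# Illustrative VeRL PPO launch with SGLang selected for rollout inference.
# All paths and sizes below are placeholders, not Osmosis' actual settings.
cmd = [
    "python", "-m", "verl.trainer.main_ppo",
    "data.train_files=/fsx/data/train.parquet",               # placeholder dataset on FSx
    "actor_rollout_ref.model.path=/fsx/models/base-llm",      # placeholder base model
    "actor_rollout_ref.rollout.name=sglang",                  # use SGLang for inference phases
    "actor_rollout_ref.rollout.gpu_memory_utilization=0.6",   # leave headroom for training
    "trainer.n_gpus_per_node=8",
    "trainer.nnodes=2",
]
subprocess.run(cmd, check=True)
```

In practice, a command like this runs inside the custom Docker image on the Hyperpod-managed cluster rather than on a developer workstation.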

The existing LLM training infrastructure lacked efficient model inference during training, leading to suboptimal GPU utilization. Traditional training pipelines often left GPU resources underutilized during inference phases, creating bottlenecks in the training workflow. Additionally, the team needed a solution that could leverage VeRL while maintaining compatibility with AWS’s managed training infrastructure.

Beyond the custom Docker image, the project required a scalable, predictable, and managed infrastructure solution built on Amazon SageMaker Hyperpod to support distributed training workloads. This included EKS cluster management, Ray cluster deployment, and integration with AWS services for storage, networking, and monitoring.

AWS architecture

The AWS architecture supporting the Osmosis VeRL integration comprises the following components:

Network layer

  • Amazon VPC: Provides an isolated networking environment with public and private subnets across multiple Availability Zones
  • Security groups: Control access to the EKS cluster and Hyperpod instances with specific rules for VeRL communication
  • EFA support: Enables high-performance networking for distributed LLM training workloads, allowing secure, low-latency communication between nodes
  • VPC endpoints: Provide secure access to AWS services without internet gateway traversal

Compute layer

  • Amazon SageMaker Hyperpod: Central orchestration service that manages the entire training infrastructure lifecycle, including cluster provisioning, scaling, and resource optimization
  • Amazon EKS: Managed Kubernetes service, orchestrated by Hyperpod, that runs containerized VeRL workloads
  • Ray cluster: Distributed computing framework managed within the Hyperpod environment for training and inference jobs (see the job-submission sketch after this list)
  • GPU instance types: Support for high-performance instances optimized for machine learning workloads, managed by Hyperpod
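
Training work ultimately lands on the Ray cluster listed above; the sketch below shows one way to submit a job to it through Ray's job-submission API. The head-node address, entrypoint, and runtime environment are placeholders for the KubeRay-managed service running on the EKS/Hyperpod cluster.

```python
from ray.job_submission import JobSubmissionClient

# Placeholder address for the Ray head service exposed inside the cluster.
client = JobSubmissionClient("http://ray-head.example.internal:8265")

job_id = client.submit_job(
    # Hypothetical entrypoint; in practice this launches the VeRL trainer.
    entrypoint="python -m verl.trainer.main_ppo",
    runtime_env={"working_dir": "./", "pip": ["sglang"]},
)
print("Submitted Ray job:", job_id)
print("Status:", client.get_job_status(job_id))
```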

Storage layer

  • Amazon S3: Stores training data, model artifacts, checkpoints, and Docker images (see the checkpoint sketch after this list)
  • Amazon FSx for Lustre: Provides high-performance file system for training data access
  • Amazon EBS: Persistent block storage for container instances
  • Amazon ECR: Container registry for storing custom Docker images
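
As a small illustration of how the storage layer is typically used, the sketch below copies a checkpoint written to FSx for Lustre into S3 for durable storage. The bucket name and paths are hypothetical.

```python
import boto3

s3 = boto3.client("s3")
bucket = "osmosis-training-artifacts"  # hypothetical bucket name
local_ckpt = "/fsx/checkpoints/step_1000/model.safetensors"

# Persist a training checkpoint from the FSx scratch file system to S3.
s3.upload_file(local_ckpt, bucket, "checkpoints/step_1000/model.safetensors")

# Later, pull it back when warm-starting a new fine-tuning run.
s3.download_file(bucket, "checkpoints/step_1000/model.safetensors",
                 "/tmp/model.safetensors")
```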

Orchestration layer

  • SageMaker Hyperpod: Primary orchestration layer that manages cluster lifecycle, job scheduling, and resource allocation
  • KubeRay operator: Manages Ray cluster lifecycle within Kubernetes under Hyperpod supervision (see the sketch after this list for inspecting the pods it manages)
  • Helm charts: Deployment management for NVIDIA device plugins and EFA drivers
  • Kubernetes jobs: Manages training job execution and resource allocation
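
For a sense of how the KubeRay-managed resources look from the operator's side, the sketch below lists the pods belonging to a Ray cluster using the Kubernetes Python client. The namespace, cluster name, and the `ray.io/cluster` label are assumptions based on KubeRay's conventions.

```python
from kubernetes import client, config

# Assumes kubeconfig for the EKS cluster is already set up
# (e.g. via `aws eks update-kubeconfig`).
config.load_kube_config()

v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod(
    namespace="default",
    label_selector="ray.io/cluster=verl-ray-cluster",  # placeholder cluster name
)
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
```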

Benefits

The VeRL Docker integration with SageMaker Hyperpod delivered significant improvements across multiple dimensions:

Performance Optimization

  • Enhanced GPU utilization: SGLang’s efficient inference capabilities during training improved GPU utilization significantly compared to traditional training pipelines
  • Reduced training time: The integration of SGLang for inference phases reduced overall training time through optimized memory usage and faster token generation
  • Scalable architecture: The SageMaker Hyperpod and Ray cluster integration enables seamless scaling from single-node to multi-node training configurations
  • Amazon SageMaker training plans: Provide the ability to reserve GPU capacity and maximize its use for large-scale AI model training workloads

Operational Excellence

  • Managed infrastructure: SageMaker Hyperpod eliminates the complexity of managing distributed training infrastructure, providing automated scaling and lifecycle management
  • Containerized deployment: Docker-based approach ensures consistent environments across development, staging, and production
  • Infrastructure-as-Code: Terraform-based infrastructure management provides version control, reproducibility, and easy environment provisioning
  • Predictable access: Reserved GPU capacity for machine learning workloads within specified time frames
  • Automated resource management: SageMaker training plans handle the provisioning and management of infrastructure
  • Flexibility: Ability to create training plans for various resources, including SageMaker training jobs and SageMaker Hyperpod clusters
  • Fault tolerance: Automatic recovery from infrastructure failures and workload migration across Availability Zones for SageMaker AI training jobs

Cost Efficiency

  • Optimized resource usage: Dynamic batch sizing and efficient GPU utilization substantially reduce compute costs (see the sketch after this list)
  • Managed services integration: Leveraging AWS managed services including SageMaker Hyperpod reduces operational overhead and maintenance costs
  • Flexible deployment options: Support for both EKS node groups and Hyperpod clusters provides cost optimization flexibility
  • Pay-per-use model: SageMaker Hyperpod’s managed infrastructure eliminates the need for persistent cluster management
  • Cost management: Plan and budget for large-scale training requirements in advance
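
The dynamic batch sizing mentioned above can be as simple as sizing each batch to the GPU memory that is currently free. The sketch below is an illustrative heuristic, not Osmosis' implementation; the per-sample memory estimate and headroom factor are assumptions.

```python
import torch

def pick_batch_size(per_sample_bytes: int, max_batch: int = 256) -> int:
    """Choose a batch size from currently free GPU memory, keeping headroom."""
    free_bytes, _total = torch.cuda.mem_get_info()
    usable = int(free_bytes * 0.8)  # keep ~20% headroom for fragmentation
    return max(1, min(max_batch, usable // per_sample_bytes))

# Example: assume roughly 64 MiB of activations/KV cache per sample.
batch_size = pick_batch_size(per_sample_bytes=64 * 1024 * 1024)
print("Selected batch size:", batch_size)
```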

Developer Experience

  • Simplified deployment: Helm-based deployment process streamlines the setup of complex distributed training environments
  • Comprehensive monitoring: Integration with CloudWatch and the Ray dashboard provides detailed insights into training performance (see the metrics sketch after this list)
  • Flexible configuration: Environment-specific configurations enable proper CI/CD practices
  • Reduced complexity: SageMaker Hyperpod abstracts infrastructure management, allowing teams to focus on model development

The VeRL integration with SageMaker Hyperpod transformed Osmosis’ LLM fine-tuning capabilities, delivering measurable improvements in GPU utilization and training efficiency. By seamlessly integrating SGLang for inference during reinforcement learning fine-tuning, the solution reduced training time while maintaining model quality. Tech 42 was proud to partner with Osmosis to architect this scalable, cost-effective infrastructure that enables them to continuously improve their AI agents through real-world feedback loops at production scale.
