Osmosis (Gulp.ai) is a San Francisco-based AI company that enables AI self-improvement through real-time reinforcement learning. The Y Combinator-backed startup focuses on unlocking AI agent productivity at production scale by providing the missing piece for truly effective AI systems: the ability to learn from experience. Osmosis addresses the critical gap in deploying AI agents that can continuously improve their performance through real-world interactions.
Osmosis needed to integrate SGLang as the model inference backend used during LLM fine-tuning with VeRL, a reinforcement learning framework designed specifically for fine-tuning large language models, in order to improve GPU utilization. The challenge was to create a fully functional, custom Docker image that would integrate seamlessly with Amazon SageMaker HyperPod on EKS while meeting specific compatibility requirements for CUDA, PyTorch, and Python.
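Those compatibility requirements can be verified at image build time. As an illustration only (not the exact check used by the team), a short smoke test along the following lines can be run as a final step of the Docker build to confirm that the Python, PyTorch, and CUDA versions line up and that the key packages resolve:

```python
# Hypothetical smoke test run inside the custom image to confirm the
# CUDA / PyTorch / Python combination the VeRL + SGLang stack expects.
import sys
from importlib.metadata import version, PackageNotFoundError

import torch

print(f"Python      : {sys.version.split()[0]}")
print(f"PyTorch     : {torch.__version__}")
print(f"CUDA (torch): {torch.version.cuda}")
print(f"GPU visible : {torch.cuda.is_available()}")

# Package names are assumptions for illustration; pin exact versions in the image.
for pkg in ("sglang", "verl", "ray"):
    try:
        print(f"{pkg:<12}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg:<12}: not installed")
```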
The existing LLM training infrastructure lacked efficient model inference during training, leading to suboptimal GPU utilization: traditional training pipelines often left GPU resources underutilized during the inference (rollout) phases, creating bottlenecks in the training workflow. Additionally, the team needed a solution that could leverage VeRL while maintaining compatibility with AWS’s managed training infrastructure.
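For context on how SGLang enters the training loop, VeRL selects its rollout (inference) backend through Hydra-style configuration overrides. The sketch below is illustrative only; the override keys follow VeRL's published PPO example scripts and may differ between VeRL versions, and the data paths, model name, and parallelism values are placeholders rather than Osmosis' actual settings.

```python
# Illustrative launcher: runs VeRL PPO fine-tuning with SGLang as the
# rollout (inference) backend. All paths, model names, and sizes are
# placeholders; override keys follow VeRL's example configs.
import subprocess

overrides = [
    "data.train_files=/fsx/data/train.parquet",                 # placeholder path
    "data.val_files=/fsx/data/val.parquet",                     # placeholder path
    "actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct",    # placeholder model
    "actor_rollout_ref.rollout.name=sglang",                    # use SGLang for rollouts
    "actor_rollout_ref.rollout.gpu_memory_utilization=0.6",     # leave headroom for training
    "actor_rollout_ref.rollout.tensor_model_parallel_size=2",
    "trainer.n_gpus_per_node=8",
    "trainer.nnodes=2",
]

subprocess.run(
    ["python3", "-m", "verl.trainer.main_ppo", *overrides],
    check=True,
)
```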
Beyond the custom Docker image, the project also required a scalable, predictable, and managed infrastructure solution built on Amazon SageMaker HyperPod to support distributed training workloads, covering EKS cluster management, Ray cluster deployment, and integration with AWS services for storage, networking, and monitoring.
AWS architecture
The AWS architecture supporting the Osmosis VeRL integration comprises the following components:
Network layer
Amazon VPC: Provides an isolated networking environment with public and private subnets across multiple Availability Zones
Security groups: Controls access to the EKS cluster and HyperPod instances with specific rules for VeRL communication
EFA support: Enables high-performance networking for distributed LLM training workloads, allowing secure, low-latency communication between nodes
VPC endpoints: Provides secure access to AWS services without internet gateway traversal
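As one concrete illustration of the VPC endpoint item above, interface and gateway endpoints can be created with boto3. The sketch below is simplified, and the region, VPC, subnet, security group, and route table IDs are placeholders for the actual environment.

```python
# Illustrative sketch: private access to ECR and S3 from the training VPC
# without traversing an internet gateway. All IDs and the region are placeholders.
import boto3

REGION = "us-west-2"
VPC_ID = "vpc-0123456789abcdef0"
SUBNET_IDS = ["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"]
SG_IDS = ["sg-0123456789abcdef0"]
ROUTE_TABLE_IDS = ["rtb-0123456789abcdef0"]

ec2 = boto3.client("ec2", region_name=REGION)

# Interface endpoints so HyperPod/EKS nodes can pull container images from ECR privately.
for service in ("ecr.api", "ecr.dkr"):
    ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId=VPC_ID,
        ServiceName=f"com.amazonaws.{REGION}.{service}",
        SubnetIds=SUBNET_IDS,
        SecurityGroupIds=SG_IDS,
        PrivateDnsEnabled=True,
    )

# Gateway endpoint for S3 (training data, checkpoints, image layers).
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId=VPC_ID,
    ServiceName=f"com.amazonaws.{REGION}.s3",
    RouteTableIds=ROUTE_TABLE_IDS,
)
```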
Compute layer
Amazon SageMaker HyperPod: Central orchestration service that manages the entire training infrastructure lifecycle, including cluster provisioning, scaling, and resource optimization (see the provisioning sketch after this list)
Amazon EKS: Managed Kubernetes service that acts as the HyperPod orchestrator and runs the containerized VeRL workloads
Ray cluster: Distributed computing framework managed within the HyperPod environment for training and inference jobs
GPU instance types: Support for high-performance GPU instances optimized for machine learning workloads, managed by HyperPod
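To make the compute layer concrete, the sketch below shows how a GPU instance group for a HyperPod cluster orchestrated by EKS might be declared through boto3. It is illustrative only: the field names follow the SageMaker CreateCluster API, while the cluster name, instance type and count, execution role, lifecycle script location, and EKS cluster ARN are all placeholders.

```python
# Illustrative sketch: provision a HyperPod cluster with a GPU worker group,
# orchestrated by an existing EKS cluster. Names, ARNs, instance type/count,
# and the lifecycle script location are placeholders.
import boto3

sm = boto3.client("sagemaker", region_name="us-west-2")

sm.create_cluster(
    ClusterName="osmosis-verl-hyperpod",  # placeholder
    Orchestrator={
        "Eks": {"ClusterArn": "arn:aws:eks:us-west-2:111122223333:cluster/verl-eks"}
    },
    InstanceGroups=[
        {
            "InstanceGroupName": "gpu-workers",
            "InstanceType": "ml.p5.48xlarge",   # placeholder GPU instance type
            "InstanceCount": 2,
            "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole",
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://example-bucket/hyperpod-lifecycle/",
                "OnCreate": "on_create.sh",
            },
            "ThreadsPerCore": 1,
        },
    ],
    VpcConfig={
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
        "Subnets": ["subnet-0123456789abcdef0"],
    },
)
```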
Storage layer
Amazon S3: Stores training data, model artifacts, checkpoints, and Docker images
Amazon FSx for Lustre: Provides a high-performance file system for training data access
Amazon EBS: Persistent block storage for container instances
Amazon ECR: Container registry for storing custom Docker images
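Tying the storage items together, the custom image ultimately lands in ECR so that HyperPod nodes can pull it. The following is a hedged sketch, with repository name, tag, and region as placeholders, of tagging and pushing the locally built image using boto3 and the Docker SDK for Python:

```python
# Illustrative helper that tags the locally built VeRL/SGLang image and
# pushes it to ECR. Repository name, region, and tag are placeholders.
import base64
import boto3
import docker

REGION = "us-west-2"
REPO = "osmosis/verl-sglang"        # assumed repository name
LOCAL_TAG = "verl-sglang:latest"    # image built from the custom Dockerfile

ecr = boto3.client("ecr", region_name=REGION)

# Resolve the registry URI and a short-lived docker login for it.
auth = ecr.get_authorization_token()["authorizationData"][0]
user, password = base64.b64decode(auth["authorizationToken"]).decode().split(":")
registry = auth["proxyEndpoint"].removeprefix("https://")

docker_client = docker.from_env()
image = docker_client.images.get(LOCAL_TAG)
remote = f"{registry}/{REPO}"
image.tag(remote, tag="latest")

# Stream push output so CI logs show layer-by-layer progress.
for line in docker_client.images.push(
    remote,
    tag="latest",
    auth_config={"username": user, "password": password},
    stream=True,
    decode=True,
):
    print(line.get("status", ""))
```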
Orchestration layer
SageMaker HyperPod: Primary orchestration service that manages cluster lifecycle, job scheduling, and resource allocation
KubeRay operator: Manages the Ray cluster lifecycle within Kubernetes under HyperPod supervision (see the job-submission sketch after this list)
Helm charts: Deployment management for NVIDIA device plugins and EFA drivers
Kubernetes jobs: Manages training job execution and resource allocation
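As an illustration of how a training run reaches the Ray cluster that KubeRay manages, the sketch below submits the VeRL entrypoint through Ray's job submission API. The head-service address, entrypoint arguments, and working directory are placeholders; in practice the submission can equally be wrapped in a Kubernetes Job or a RayJob custom resource.

```python
# Illustrative sketch: submit the VeRL fine-tuning entrypoint to the
# KubeRay-managed Ray cluster via the Ray Jobs API. The service address,
# entrypoint arguments, and working directory are placeholders.
import time

from ray.job_submission import JobSubmissionClient, JobStatus

# Dashboard/service address of the Ray head inside the EKS cluster (placeholder).
client = JobSubmissionClient("http://raycluster-verl-head-svc:8265")

job_id = client.submit_job(
    entrypoint=(
        "python3 -m verl.trainer.main_ppo "
        "actor_rollout_ref.rollout.name=sglang "
        "trainer.nnodes=2 trainer.n_gpus_per_node=8"
    ),
    runtime_env={"working_dir": "./"},  # ship local config/scripts with the job
)

# Poll until the job reaches a terminal state, surfacing failures to the caller.
while True:
    status = client.get_job_status(job_id)
    if status in (JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED):
        print(f"Job {job_id} finished with status {status}")
        break
    time.sleep(30)
```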
Benefits
The VeRL Docker integration with SageMaker HyperPod delivered significant improvements across multiple dimensions:
Performance Optimization
Enhanced GPU utilization: SGLang’s efficient inference capabilities during training improved GPU utilization significantly compared to traditional training pipelines
Reduced training time: The integration of SGLang for inference phases reduced overall training time through optimized memory usage and faster token generation
Scalable architecture: The SageMaker HyperPod and Ray cluster integration enables seamless scaling from single-node to multi-node training configurations
Amazon SageMaker training plans: Provides the ability to reserve GPU capacity and maximize its use for large-scale AI model training workloads
Operational Excellence
Managed infrastructure: SageMaker HyperPod eliminates the complexity of managing distributed training infrastructure, providing automated scaling and lifecycle management
Containerized deployment: Docker-based approach ensures consistent environments across development, staging, and production
Infrastructure-as-Code: Terraform-based infrastructure management provides version control, reproducibility, and easy environment provisioning
Predictable access: Reserved GPU capacity for machine learning workloads within specified time frames
Automated resource management: SageMaker training plans handle the provisioning and management of infrastructure
Flexibility: Ability to create training plans for various resources, including SageMaker training jobs and SageMaker HyperPod clusters
Fault tolerance: Automatic recovery from infrastructure failures and workload migration across Availability Zones for SageMaker AI training jobs
Reduced complexity: SageMaker HyperPod abstracts infrastructure management, allowing teams to focus on model development
The VeRL integration with SageMaker HyperPod transformed Osmosis’ LLM fine-tuning capabilities, delivering measurable improvements in GPU utilization and training efficiency. By seamlessly integrating SGLang for inference during reinforcement learning fine-tuning, the solution reduced training time while maintaining model quality. Tech 42 was proud to partner with Osmosis to architect this scalable, cost-effective infrastructure that enables them to continuously improve their AI agents through real-world feedback loops at production scale.