Volcano: Cloud Native Batch System on Kubernetes

Volcano is a Cloud Native Batch System built on Kubernetes, providing a powerful and flexible platform for running high-performance workloads like AI/ML, HPC, and genomics. It extends Kubernetes to support job-centric features such as gang scheduling, fair-share scheduling, and resource management.

Added on June 11, 2025

Stars: 4,730

Forks: 1,098

Project Introduction

Summary

Volcano is the first cloud-native batch system built on Kubernetes. It aims to provide a unified platform for managing all types of compute-intensive workloads, including High-Performance Computing (HPC), Artificial Intelligence (AI), Machine Learning (ML), and data processing.

Problem Solved

Standard Kubernetes is designed primarily for long-running services, and its default scheduler places pods one at a time. That makes it hard to run batch jobs, HPC tasks, and AI/ML training that need specific scheduling semantics (such as gang scheduling) and efficient resource sharing. Volcano fills these gaps with a specialized scheduler and controllers optimized for these workloads.

Core Features

Gang Scheduling

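Ensures that the tasks of a job are scheduled together or not at all (and terminated together), so a partially started job never holds resources while waiting for its peers. This prevents deadlocks and improves resource utilization for tightly coupled workloads.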
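To make the all-or-nothing idea concrete, here is a minimal Python sketch, not Volcano's implementation. It assumes a single CPU resource, first-fit placement, and a hypothetical `min_available` threshold (modeled loosely on the minimum number of members a gang-scheduled job declares). Placement is computed on a copy of the node capacities and committed only if enough tasks fit, so a job that cannot start never holds resources.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu: int  # free CPU cores (simplified: a single resource type)


def gang_schedule(task_cpus, nodes, min_available):
    """Place tasks first-fit, committing only if at least min_available fit.

    Returns {task_index: node_name}, or None if the gang cannot start.
    Tasks that do not fit are left unplaced; if too few fit, the whole
    tentative placement is discarded and the nodes are left untouched.
    """
    free = {n.name: n.free_cpu for n in nodes}  # tentative copy of capacities
    placement = {}
    for i, cpu in enumerate(task_cpus):
        for name in free:
            if free[name] >= cpu:
                free[name] -= cpu
                placement[i] = name
                break
    if len(placement) < min_available:
        return None  # all or nothing: nothing is reserved
    for n in nodes:  # commit the tentative placement
        n.free_cpu = free[n.name]
    return placement


nodes = [Node("n1", 4), Node("n2", 4)]
print(gang_schedule([2] * 5, nodes, min_available=5))  # too big: rejected
print(gang_schedule([2] * 4, nodes, min_available=4))  # fits across both nodes
```

With two 4-core nodes, a five-task job of 2 cores each is rejected and leaves both nodes untouched, while a four-task job is placed across both nodes in a single step.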

Advanced Scheduling Policies

Provides advanced job queuing, prioritization, and resource fairness policies across different tenants and applications.
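One common way to express fairness across tenants is weighted max-min sharing: capacity is split among queues in proportion to their weights, no queue receives more than it asks for, and whatever a satisfied queue leaves over is redistributed among the others. The Python sketch below illustrates that idea for a single resource. The function name and inputs are hypothetical, and it is a simplification rather than the code of any Volcano plugin.

```python
def weighted_fair_share(capacity, weights, demands):
    """Split capacity among queues by weight, capped at each queue's demand.

    Capacity left over by queues whose demand is fully met is redistributed
    among the remaining queues in later rounds (water-filling).
    """
    share = {q: 0.0 for q in weights}
    active = {q for q in weights if demands[q] > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_w = sum(weights[q] for q in active)
        offer = {q: remaining * weights[q] / total_w for q in active}
        remaining = 0.0
        for q in list(active):
            need = demands[q] - share[q]
            if offer[q] >= need:
                share[q] += need                # demand met: cap it
                remaining += offer[q] - need    # hand back the surplus
                active.remove(q)
            else:
                share[q] += offer[q]
    return share


print(weighted_fair_share(12, {"a": 1, "b": 2, "c": 3},
                          {"a": 10, "b": 2, "c": 10}))
```

For example, with 12 units of capacity, weights `a=1, b=2, c=3` and demands `a=10, b=2, c=10`, queue `b` is capped at its demand of 2 and its unused share is split between `a` and `c`, giving `a=2.5, b=2, c=7.5`.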

Heterogeneous Resource Management

Schedules heterogeneous resources such as GPUs and FPGAs for compute-intensive tasks.
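Heterogeneous scheduling means matching a multi-dimensional request (CPU, memory, accelerators) against each node's free resources. The sketch below assumes nodes are plain dictionaries of free amounts; `nvidia.com/gpu` is the standard Kubernetes extended-resource name for NVIDIA GPUs, while `memory_gi` is a hypothetical key used here for brevity. The tie-break, preferring the node that leaves the fewest GPUs idle, is one illustrative bin-packing heuristic, not a description of Volcano's scoring.

```python
GPU = "nvidia.com/gpu"


def fits(request, free):
    """True if every requested resource is available on the node."""
    return all(free.get(r, 0) >= amount for r, amount in request.items())


def pick_node(request, nodes):
    """Return a feasible node name, or None if no node can host the request.

    Among feasible nodes, prefer the one left with the fewest idle GPUs, so
    CPU-only work avoids GPU nodes and big GPU nodes stay free for big jobs.
    """
    feasible = [name for name, free in nodes.items() if fits(request, free)]
    if not feasible:
        return None
    return min(feasible, key=lambda n: nodes[n].get(GPU, 0) - request.get(GPU, 0))


nodes = {
    "cpu-node": {"cpu": 32, "memory_gi": 128},
    "gpu-a": {"cpu": 16, "memory_gi": 64, GPU: 8},
    "gpu-b": {"cpu": 16, "memory_gi": 64, GPU: 2},
}
print(pick_node({"cpu": 4, "memory_gi": 16, GPU: 2}, nodes))
print(pick_node({"cpu": 4, "memory_gi": 16}, nodes))
```

A CPU-only request lands on the CPU node instead of consuming a GPU node, and a 2-GPU request fills the small GPU node, keeping the 8-GPU node free for a larger job.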

Tech Stack

Kubernetes

Docker

etcd

Use Cases

Volcano is designed to efficiently handle a wide range of batch and high-performance workloads, including but not limited to:

Scenario 1: Large-Scale Machine Learning / Deep Learning Training

Details

Running distributed training jobs for deep learning models across multiple GPUs, ensuring efficient resource allocation and gang scheduling.

User Value

Accelerate AI/ML development cycles by efficiently utilizing shared GPU clusters.
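A distributed training run is typically submitted as a Volcano `Job` whose `minAvailable` equals the total number of pods, so training starts only when every worker can be placed. The Python sketch below builds such a manifest as a dictionary using the `batch.volcano.sh/v1alpha1` Job fields `schedulerName`, `queue`, `minAvailable`, and `tasks`. The one-parameter-server layout, the `training_job` helper, and the image name are hypothetical, so check the field set against the Volcano version you run.

```python
import json


def training_job(name, workers, gpus_per_worker, image, queue="default"):
    """Build a Volcano Job manifest (as a dict) with one parameter server and
    `workers` GPU workers, gang-scheduled via minAvailable (hypothetical layout)."""

    def task(task_name, replicas, limits):
        # One Volcano task: `replicas` identical pods from a pod template.
        return {
            "name": task_name,
            "replicas": replicas,
            "template": {
                "spec": {
                    "restartPolicy": "OnFailure",
                    "containers": [{
                        "name": task_name,
                        "image": image,
                        "resources": {"limits": limits},
                    }],
                }
            },
        }

    return {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "schedulerName": "volcano",
            "queue": queue,
            # Gang constraint: no pod starts until all 1 + workers pods can run.
            "minAvailable": 1 + workers,
            "tasks": [
                task("ps", 1, {"cpu": "2"}),
                task("worker", workers, {"nvidia.com/gpu": str(gpus_per_worker)}),
            ],
        },
    }


manifest = training_job("resnet-train", workers=4, gpus_per_worker=2,
                        image="example.com/trainer:latest")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`, since kubectl accepts JSON as well as YAML manifests.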

Scenario 2: Scientific Computing and HPC Workloads

Details

Managing and scheduling complex pipelines for genomic data processing, simulations, and other scientific computing tasks.

User Value

Enable researchers to run demanding computational tasks on scalable Kubernetes infrastructure.
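Scientific pipelines are usually a dependency graph of stages. Independent of Volcano's own API, the Python sketch below uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+) to group a hypothetical genomics pipeline into waves: each wave depends only on earlier ones, so its stages could be submitted together as parallel batch jobs.

```python
from graphlib import TopologicalSorter

# Hypothetical genomics pipeline: each stage maps to the stages it depends on.
pipeline = {
    "align": {"fetch_reads"},
    "call_variants": {"align", "fetch_reference"},
    "annotate": {"call_variants"},
    "qc_report": {"align"},
}


def submission_waves(deps):
    """Group stages into waves; every stage in a wave depends only on stages
    in earlier waves, so a whole wave can run in parallel."""
    ts = TopologicalSorter(deps)
    ts.prepare()  # also raises CycleError if the pipeline has a cycle
    waves = []
    while ts.is_active():
        ready = sorted(ts.get_ready())
        waves.append(ready)
        ts.done(*ready)  # pretend the whole wave finished successfully
    return waves


for i, wave in enumerate(submission_waves(pipeline), start=1):
    print(f"wave {i}: {wave}")
```

For the example graph, `submission_waves(pipeline)` returns `[['fetch_reads', 'fetch_reference'], ['align'], ['call_variants', 'qc_report'], ['annotate']]`.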

Scenario 3: Big Data Processing and CI/CD Batch Jobs

Details

Handling large volumes of data processing tasks (such as Spark or Flink) or CI/CD pipelines that require batch execution and specific resource guarantees.

User Value

Improve efficiency and resource utilization for data processing and automated build/test jobs.
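Resource guarantees for batch and CI work often come down to admission control: a job is let into a queue only when its whole request fits within the queue's limit; otherwise it waits. The sketch below is a hypothetical strict-FIFO admitter for a single resource. It is loosely inspired by the idea of a per-queue capability cap, and the function and job names are illustrative rather than Volcano APIs.

```python
from collections import deque


def admit(pending, capability, in_use=0):
    """Admit (name, request) jobs in FIFO order while the queue's total usage
    stays within `capability`.

    A job that does not fit blocks the jobs behind it (strict FIFO), so a
    large job is not starved by a stream of small ones.
    Returns (admitted names, still-pending names, new usage).
    """
    queue = deque(pending)
    admitted = []
    while queue and in_use + queue[0][1] <= capability:
        name, request = queue.popleft()
        in_use += request
        admitted.append(name)
    return admitted, [name for name, _ in queue], in_use


jobs = [("build-1", 4), ("test-suite", 8), ("build-2", 2)]
print(admit(jobs, capability=10))
```

Strict FIFO means a large job at the head blocks smaller ones behind it, which protects the large job from starvation at the cost of some idle capacity; a scheduler that backfills small jobs makes the opposite trade-off.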

Recommended Projects

You might be interested in these projects

tinygrad/tinygrad

tinygrad is a revolutionary neural network library designed for simplicity and minimalism. Inspired by PyTorch and Micrograd, it aims to provide a clear, concise framework for deep learning research and development, making complex concepts accessible.

Python

usebruno/bruno

Bruno is a fast, open-source API client, designed as a lightweight alternative to tools like Postman and Insomnia. It helps developers explore, test, and document APIs efficiently with a unique text-based collection format.

JavaScript

oxters168/Pluvia

Pluvia is a lightweight unofficial Steam client for Android, offering essential features like chat, library browsing, and store access with optimized performance for mobile devices.
