Apache Iceberg: High-Performance Table Format for Large-Scale Analytic Data

Apache Iceberg is an open-source table format for huge analytic datasets. Iceberg adds high-performance table capabilities to data stored in open file formats such as Parquet and ORC, and lets users query petabytes of data.

Added on June 12, 2025

Stars: 7,566

Forks: 2,605

Language: Java

Project Introduction

Summary

Apache Iceberg is a standard, open table format designed to manage huge collections of data files. It provides reliable, high-performance SQL-like table operations for data lakes.

Problem Solved

Iceberg addresses key challenges of traditional file-based data lakes: slow metadata operations, complex schema evolution, data correctness issues during concurrent writes, and difficulty sharing the same tables across query engines such as Spark, Trino, and Flink.

Core Features

Schema Evolution

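Add, drop, rename, reorder, or widen the type of columns as metadata-only changes. Columns are tracked by unique ID rather than by name, so existing data files do not need to be rewritten and existing queries keep working.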
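To illustrate why tracking columns by ID makes these changes safe, here is a minimal sketch in plain Python. It is a toy model, not Iceberg's API: the `Schema` class and its methods are hypothetical names invented for this example, and rows are stored as dictionaries keyed by field ID.

```python
from dataclasses import dataclass, field


@dataclass
class Schema:
    """Toy schema (hypothetical): maps stable field IDs to current column names."""
    names_by_id: dict[int, str] = field(default_factory=dict)
    next_id: int = 1

    def add_column(self, name: str) -> int:
        field_id = self.next_id
        self.names_by_id[field_id] = name
        self.next_id += 1  # IDs only grow, so a dropped ID is never reused
        return field_id

    def rename_column(self, old: str, new: str) -> None:
        field_id = next(i for i, n in self.names_by_id.items() if n == old)
        self.names_by_id[field_id] = new  # only metadata changes

    def drop_column(self, name: str) -> None:
        field_id = next(i for i, n in self.names_by_id.items() if n == name)
        del self.names_by_id[field_id]

    def read(self, stored_row: dict[int, object]) -> dict[str, object]:
        """Project a row stored by field ID onto the current column names."""
        return {name: stored_row.get(fid) for fid, name in self.names_by_id.items()}


schema = Schema()
user_id = schema.add_column("user_id")
ts = schema.add_column("ts")
old_row = {user_id: 42, ts: "2025-06-12"}  # written before any evolution

schema.rename_column("ts", "event_time")
country = schema.add_column("country")  # missing in old rows, reads as None
schema.drop_column("user_id")

evolved_row = schema.read(old_row)
print(evolved_row)
```

The old row is never rewritten. The rename and drop only touch the ID-to-name mapping, and the newly added column reads as `None` for data written before it existed.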

Time Travel

Access historical versions of a table by snapshot ID or timestamp for reproducible reports or rollbacks.
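The idea can be sketched with a small in-memory model in which every commit records an immutable snapshot of the table's file list. This is an illustration under assumptions, not Iceberg's implementation: `Table`, `Snapshot`, `as_of_snapshot`, and `as_of_time` are hypothetical names, and snapshot IDs are simple counters here for readability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at_ms: int
    data_files: tuple[str, ...]


class Table:
    """Toy table (hypothetical): an append-only log of immutable snapshots."""

    def __init__(self) -> None:
        self.snapshots: list[Snapshot] = []

    def commit(self, committed_at_ms: int, added=(), removed=()) -> Snapshot:
        current = self.snapshots[-1].data_files if self.snapshots else ()
        files = tuple(f for f in current if f not in removed) + tuple(added)
        snap = Snapshot(len(self.snapshots) + 1, committed_at_ms, files)
        self.snapshots.append(snap)  # earlier snapshots are left untouched
        return snap

    def as_of_snapshot(self, snapshot_id: int) -> Snapshot:
        return next(s for s in self.snapshots if s.snapshot_id == snapshot_id)

    def as_of_time(self, ts_ms: int) -> Snapshot:
        # The latest snapshot committed at or before the requested time.
        candidates = [s for s in self.snapshots if s.committed_at_ms <= ts_ms]
        if not candidates:
            raise ValueError("no snapshot exists at or before that time")
        return candidates[-1]


table = Table()
table.commit(1000, added=["a.parquet"])
table.commit(2000, added=["b.parquet"])
table.commit(3000, added=["c.parquet"], removed=["a.parquet"])

first = table.as_of_snapshot(1).data_files
at_2500 = table.as_of_time(2500).data_files
latest = table.snapshots[-1].data_files
print(first, at_2500, latest)
```

Because a commit never mutates an earlier snapshot, a report pinned to snapshot 1 keeps seeing `a.parquet` even after the third commit removes it from the current table state.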

Hidden Partitioning

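Iceberg derives partition values from column data (for example, the day of a timestamp) and prunes partitions from ordinary filters on that column. Queries need no extra partition predicates, which avoids the silent errors and full scans that manually maintained partition columns can cause.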
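As a rough sketch of the mechanism, the toy table below partitions rows by a day transform of a timestamp column and prunes partitions from a plain time-range filter. Everything here is an assumption made for illustration: `day_transform`, `PartitionedTable`, and the `event_time` column are hypothetical names, and timestamps are assumed to be timezone-aware.

```python
from datetime import date, datetime, timezone


def day_transform(ts: datetime) -> date:
    """Derive the partition value (UTC day) from a timezone-aware timestamp."""
    return ts.astimezone(timezone.utc).date()


class PartitionedTable:
    """Toy table (hypothetical): the writer never supplies a partition column."""

    def __init__(self) -> None:
        self.partitions: dict[date, list[dict]] = {}

    def append(self, row: dict) -> None:
        # The partition is computed from the data, not provided by the user.
        self.partitions.setdefault(day_transform(row["event_time"]), []).append(row)

    def scan(self, start: datetime, end: datetime):
        """Return rows with start <= event_time < end and the partitions read."""
        first_day, last_day = day_transform(start), day_transform(end)
        rows, partitions_read = [], 0
        for day, part_rows in self.partitions.items():
            if first_day <= day <= last_day:  # prune using the same transform
                partitions_read += 1
                rows.extend(r for r in part_rows if start <= r["event_time"] < end)
        return rows, partitions_read


utc = timezone.utc
table = PartitionedTable()
for day in (10, 11, 12):
    table.append({"id": day, "event_time": datetime(2025, 6, day, 8, 30, tzinfo=utc)})

rows, partitions_read = table.scan(
    datetime(2025, 6, 11, 0, 0, tzinfo=utc),
    datetime(2025, 6, 11, 18, 0, tzinfo=utc),
)
print([r["id"] for r in rows], partitions_read)
```

The caller filters only on `event_time`. The table applies the same transform to the filter bounds, so only one of the three day partitions is read.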

Data Compaction

Maintenance operations that rewrite many small data files into fewer large ones, rewrite metadata, and expire old snapshots, keeping query performance steady and storage use under control.
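The planning step of compaction can be pictured as bin packing: group small files so that each group is rewritten as one file near a target size. The sketch below uses a generic first-fit-decreasing heuristic chosen for illustration; it is not Iceberg's actual planner, and `plan_compaction` and the 512 MB target are assumptions.

```python
def plan_compaction(file_sizes_mb: list[int], target_mb: int = 512) -> list[list[int]]:
    """Group files smaller than target_mb into bins of at most target_mb each."""
    small = sorted((s for s in file_sizes_mb if s < target_mb), reverse=True)
    bins: list[list[int]] = []
    for size in small:
        for group in bins:
            if sum(group) + size <= target_mb:  # first bin with enough room
                group.append(size)
                break
        else:
            bins.append([size])  # no bin had room, so open a new one
    # A bin holding a single file gains nothing from being rewritten.
    return [group for group in bins if len(group) > 1]


sizes = [600, 300, 200, 120, 60, 40, 10]
plan = plan_compaction(sizes)
print(plan)
```

With these sizes, six small files are planned into two rewrite groups, and the 600 MB file is left alone, so the table would go from seven data files to three.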

Tech Stack

Java

Scala

Python

Spark

Trino

Flink

PrestoDB

Hive

Kafka

HDFS

GCS

Azure Data Lake Storage

Use Cases

Apache Iceberg suits workloads built on large-scale analytical data, where it makes data lake architectures more reliable and performant.

Building Modern Data Lakes

Details

Implement reliable, scalable data lakes on object storage (S3, ADLS, GCS) or HDFS with ACID transactions, schema evolution, and snapshot isolation.

User Value

Provides a foundation for a flexible, maintainable, and performant data lake with data reliability guarantees.

Building Robust ETL/ELT Pipelines

Details

Manage incremental data ingestion and transformations with efficient appends, merges, and deletes, supporting concurrent operations from multiple engines.

User Value

Ensures data consistency and simplifies complex data pipelines by providing atomicity and isolation.
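One common way to get atomic commits from concurrent writers is optimistic concurrency: each writer prepares a new table state and publishes it with a compare-and-swap on the current version, retrying if another writer got there first. The sketch below models that pattern in memory. `Catalog`, `compare_and_swap`, and `append_files` are hypothetical names for this example, and a lock stands in for whatever atomic operation a real catalog provides.

```python
import threading


class Catalog:
    """Toy catalog (hypothetical): holds the current table version and file list."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.version = 0
        self.files: tuple[str, ...] = ()

    def load(self):
        with self._lock:
            return self.version, self.files

    def compare_and_swap(self, expected_version: int, new_files: tuple[str, ...]) -> bool:
        with self._lock:
            if self.version != expected_version:
                return False  # another writer committed first
            self.version += 1
            self.files = new_files
            return True


def append_files(catalog: Catalog, new_files: list[str], max_retries: int = 10) -> int:
    for _ in range(max_retries):
        version, files = catalog.load()  # re-read the latest state on every attempt
        if catalog.compare_and_swap(version, files + tuple(new_files)):
            return version + 1
    raise RuntimeError("commit failed after retries")


catalog = Catalog()
writers = [
    threading.Thread(target=append_files, args=(catalog, [f"file-{i}.parquet"]))
    for i in range(8)
]
for w in writers:
    w.start()
for w in writers:
    w.join()

final_version, final_files = catalog.load()
print(final_version, len(final_files))
```

Each successful swap publishes a complete new state, so readers see either the old file list or the new one, never a partial write, and no writer's append is lost.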

Data Warehousing Alternative/Extension

Details

Leverage data lake economics (cheap storage) with data warehousing performance and governance features (updates, deletes, time travel).

User Value

Lower storage costs and avoid vendor lock-in while maintaining high query performance and ACID compliance.
