LakeSoul

LF AI & Data Sandbox Project


Introduction in Chinese

Beyond Table Formats — A Complete Lakehouse Solution

While Apache Iceberg provides a de facto open table format, LakeSoul aims to deliver a batteries-included, production-ready lakehouse platform. Beyond the table format itself, LakeSoul comes with built-in automated disaggregated multi-level compaction, fine-grained RBAC (including S3 proxy-based access control), high-performance OLAP queries, vector retrieval, and native multimodal data processing powered by Ray and Daft. Instead of assembling and maintaining separate catalogs, compaction services, and auth layers, you get a complete lakehouse out of the box.

Rust-Native Core, Consistent Everywhere

LakeSoul's metadata management and file format IO are implemented entirely in Rust — a single, high-performance core — with idiomatic bindings for Java, Python, and C++. Whether you're querying via Spark, streaming via Flink, or training models via PyTorch, Ray, or Daft, every engine and every language shares the same ACID guarantees, the same upsert semantics, and the same read performance. There are no per-language/per-engine re-implementations of the table format, no subtle behavioral divergences between bindings, and no fragmented compatibility matrix to navigate.

Compute framework support matrix:

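| Engine  | Version       | Read          | Write         | Interface                    |
|---------|---------------|---------------|---------------|------------------------------|
| Spark   | 3.5           | ✓ Batch       | ✓ Batch       | Java / Python / Scala / SQL  |
| Flink   | 1.20          | ✓ Streaming   | ✓ Streaming   | Java / SQL                   |
| Presto  | 0.296 (Velox) | ✓ Batch       | -             | SQL                          |
| Ray     | 2.55          | ✓ Distributed | ✓ Distributed | Python                       |
| Daft    | 0.7+          | ✓ Distributed | ✓ Distributed | Python                       |
| DuckDB  | latest        | ✓ Standalone  |               | Python                       |
| PyArrow | 16+           | ✓ Standalone  | ✓ Standalone  | Python                       |
| Pandas  | 2.0+          | ✓ Standalone  | ✓ Standalone  | Python                       |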

Core Features

LakeSoul is a cloud-native lakehouse framework that supports scalable metadata management, ACID transactions, efficient and flexible upsert operations, schema evolution, and unified streaming and batch processing.

LakeSoul supports multiple computing engines for reading and writing lakehouse table data, including Spark, Flink, Presto, PyTorch, Ray, and Daft. It supports storage systems such as HDFS and S3.

LakeSoul supports two file formats: Parquet (the default) and Vortex. The Vortex file format can be used to store multimodal data and vector embeddings.

Figure: LakeSoul architecture

LakeSoul was originally created by DMetaSoul and was donated to the Linux Foundation AI & Data as a sandbox project in May 2023.

LakeSoul implements incremental upserts at both the row and column level and allows concurrent updates.

LakeSoul uses an LSM-tree-like structure to support updates on hash-partitioned tables with primary keys, achieving very high write throughput while providing optimized merge-on-read performance (refer to Performance Benchmarks). LakeSoul uses PostgreSQL to scale metadata management and to provide ACID control.
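
To make the idea of LSM-style upserts with merge on read concrete, here is a minimal toy sketch in plain Python. It is not LakeSoul's implementation or API: the class `ToyUpsertTable`, the helper `bucket_of`, the bucket count, and the use of `crc32` as the hash are all assumptions made for illustration. Each upsert only appends a new delta "file" to the hash bucket of its primary key, and readers merge the files from oldest to newest so that later values win per column.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical number of hash buckets


def bucket_of(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a primary key to a hash bucket (crc32 chosen only for determinism)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


class ToyUpsertTable:
    """Append-only delta files per bucket, merged lazily at read time."""

    def __init__(self, primary_key: str):
        self.primary_key = primary_key
        # bucket id -> list of "files"; each file is a list of row dicts
        self.files = defaultdict(list)

    def upsert(self, rows):
        """Write one new delta file per touched bucket; old files are never rewritten."""
        per_bucket = defaultdict(list)
        for row in rows:
            per_bucket[bucket_of(row[self.primary_key])].append(dict(row))
        for bucket, batch in per_bucket.items():
            self.files[bucket].append(batch)

    def read(self):
        """Merge on read: replay files oldest to newest; later values win per column."""
        merged = {}
        for bucket in sorted(self.files):
            for batch in self.files[bucket]:
                for row in batch:
                    merged.setdefault(row[self.primary_key], {}).update(row)
        return sorted(merged.values(), key=lambda r: r[self.primary_key])


table = ToyUpsertTable(primary_key="id")
table.upsert([{"id": "a", "name": "Ann", "score": 1},
              {"id": "b", "name": "Bob", "score": 2}])
table.upsert([{"id": "a", "score": 10}])  # column-level upsert: only "score" changes
table.upsert([{"id": "c", "name": "Cy", "score": 3}])
result = table.read()
```

Because a given primary key always hashes to the same bucket, a reader only has to merge files within one bucket to resolve a key. A real engine would also keep each file sorted by key and compact files in the background; both are omitted here.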

LakeSoul implements its native metadata layer and IO layer in Rust and provides C/Java/Python interfaces so that multiple computing frameworks, for both big data and AI workloads, can connect to it.

LakeSoul supports concurrent batch and streaming reads and writes. Both reads and writes support CDC semantics; together with automatic schema evolution and exactly-once guarantees, this makes it easy to build real-time data warehouses.
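
As a rough illustration of what CDC semantics mean for a primary-key table, the sketch below applies an ordered changelog of insert, update, and delete events to an in-memory table. The function `apply_changelog`, the event layout, and the operation names are hypothetical and chosen for readability; they are not LakeSoul's actual CDC encoding.

```python
def apply_changelog(table: dict, events: list, key: str = "id") -> dict:
    """Apply insert/update/delete events, in order, to a dict keyed by primary key."""
    for event in events:
        op, row = event["op"], event["row"]
        if op in ("insert", "update"):
            # Upsert: merge the new column values over any existing row.
            table[row[key]] = {**table.get(row[key], {}), **row}
        elif op == "delete":
            # Deleting a missing key is a no-op in this toy model.
            table.pop(row[key], None)
        else:
            raise ValueError(f"unknown op: {op}")
    return table


events = [
    {"op": "insert", "row": {"id": 1, "name": "Ann", "city": "Oslo"}},
    {"op": "insert", "row": {"id": 2, "name": "Bob", "city": "Rome"}},
    {"op": "update", "row": {"id": 1, "city": "Bergen"}},
    {"op": "delete", "row": {"id": 2}},
]
state = apply_changelog({}, events)
```

After the four events, only row 1 remains, with its `city` column updated and its `name` column preserved.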

LakeSoul supports multi-workspace and RBAC. LakeSoul uses Postgres's RBAC and row-level security policies to implement permission isolation for metadata. Together with the S3 proxy authorization layer, physical data isolation can be achieved. LakeSoul's permission isolation is effective for SQL/Java/Python jobs.

LakeSoul supports automatic disaggregated size-tiered multi-level compaction, automatic table life cycle maintenance, automatic data asset statistics, and automatic redundant data cleaning, reducing operation costs and improving usability.
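
The general idea behind size-tiered compaction can be sketched as follows. This is a generic toy model, not LakeSoul's compaction service: the functions `size_tier` and `plan_compaction`, the 1 MiB base size, the fanout of 4, and the three-file threshold are all assumptions. Files of similar size fall into the same tier, and a tier is compacted once it holds enough files.

```python
from collections import defaultdict

MIB = 1 << 20


def size_tier(size_bytes: int, base: int = MIB, fanout: int = 4) -> int:
    """Tier 0 holds files below base*fanout; each higher tier is fanout times larger."""
    tier, upper = 0, base * fanout
    while size_bytes >= upper:
        tier += 1
        upper *= fanout
    return tier


def plan_compaction(file_sizes: dict, min_files: int = 3) -> list:
    """Return groups of same-tier file names that have enough members to merge."""
    tiers = defaultdict(list)
    for name, size in sorted(file_sizes.items()):
        tiers[size_tier(size)].append(name)
    return [names for _, names in sorted(tiers.items()) if len(names) >= min_files]


files = {"f1": 1 * MIB, "f2": 2 * MIB, "f3": 3 * MIB, "f4": 20 * MIB, "f5": 5 * MIB}
plan = plan_compaction(files)
```

Here the three small files share tier 0 and are selected for merging, while the 5 MiB and 20 MiB files sit alone in higher tiers and are left untouched.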

For more details on these features, please refer to our documentation page: Documentation

Quick Start

Follow the Quick Start to quickly set up a test env.

Tutorials

Please find the tutorials on the doc site.

Usage Documentation

Please find the usage documentation on the doc site: Usage Doc

Quick Start (in Chinese)

Tutorials (in Chinese)

Usage Documentation (in Chinese)

Feature Roadmap

Roadmap 2026

  • Compute Engine Version
    • Spark 4.0+
    • Flink 2.0+
  • Multimodality
    • Vortex file format
    • Daft integration
    • Vector ANN search on lakehouse (on object store), with upserts
  • Performance
    • 2x faster merge-on-read with window-sliding merge (for both full and partial merge)
    • 50% memory usage reduction with spill-sort in primary key table writer
    • Disk LRU cache for object store
    • Apache Gluten integration
    • Velox integration
    • Up to 100x faster partition pruning and partition snapshot queries with a metadata index and query optimizations
    • Optionally route read-only metadata queries to PG standby instances
    • Secondary index
    • Metadata cache
  • Maintenance
    • (auto) Leveled compaction strategy
    • (auto) Async cleanup (vacuum) via Flink CDC on a PG replication slot
  • Security
    • S3 proxy with table RBAC verification

Roadmap history

  • Data Science and AI
    • Native Python Reader (without PySpark)
    • PyTorch Dataset and distributed training
    • Ray/Daft support
  • Meta Management (#23)
    • Multiple Level Partitioning: Multiple range partition and at most one hash partition
    • Concurrent write with auto conflict resolution
    • MVCC with read isolation
    • Write transaction (two-stage commit) through Postgres Transaction
    • Schema Evolution: Column add/delete supported
  • Table operations
    • LSM-Tree style upsert for hash partitioned table
    • Merge on read for hash partition with upsert delta file
    • Copy on write update for non hash partitioned table
    • Automatic Disaggregated Compaction Service
  • Data Warehousing
    • CDC stream ingestion with auto ddl sync
    • Incremental and Snapshot Query
      • Snapshot Query (#103)
      • Incremental Query (#103)
      • Incremental Streaming Source (#130)
      • Flink Stream/Batch Source
    • Multi Workspaces and RBAC
  • Spark Integration
    • Table/Dataframe API
    • SQL support with catalog except upsert
    • Query optimization
      • Shuffle/Join elimination for operations on primary key
    • Merge UDF (Merge operator)
    • Merge Into SQL support
      • Merge Into SQL with match on Primary Key (Merge on read)
  • Flink Integration and CDC Ingestion (#57)
    • Table API
      • Batch/Stream Sink
      • Batch/Stream source
      • Stream Source/Sink for ChangeLog Stream Semantics
      • Exactly Once Source and Sink
    • Flink CDC
      • Auto Schema Change (DDL) Sync
      • Auto Table Creation (depends on #78)
      • Support sink multiple source tables with different schemas (#84)
  • Hive Integration
    • Export to Hive partition after compaction
    • Apache Kyuubi (Hive JDBC) Integration
  • Realtime Data Warehousing
    • CDC ingestion
    • Time Travel (Snapshot read)
    • Snapshot rollback
    • Automatic global compaction service
    • MPP Engine Integration (depends on #66)
      • Presto
      • Compatibility with Presto Native Execution(with Velox)
      • Apache Doris
  • Cloud and Native IO (#66)
    • Object storage IO optimization
    • Native vectorized merge on read
    • Multi-layer storage classes support with local-disk data cache
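
The snapshot query, incremental query, and snapshot rollback items in the list above can be pictured with a small versioned commit log. This is a toy model only: the class `ToyCommitLog`, its methods, and the use of consecutive integer versions are assumptions for illustration and do not reflect how LakeSoul stores or addresses snapshots.

```python
class ToyCommitLog:
    """Ordered list of commits; each commit is a batch of upserted rows."""

    def __init__(self):
        self.commits = []  # list of (version, rows) in commit order

    def commit(self, rows) -> int:
        """Append a new commit and return its (hypothetical) integer version."""
        version = len(self.commits) + 1
        self.commits.append((version, [dict(r) for r in rows]))
        return version

    def snapshot(self, version: int, key: str = "id") -> dict:
        """Table state as of `version`: merge every commit up to and including it."""
        state = {}
        for v, rows in self.commits:
            if v > version:
                break
            for row in rows:
                state.setdefault(row[key], {}).update(row)
        return state

    def incremental(self, start_exclusive: int, end_inclusive: int) -> list:
        """Rows written by commits in the range (start_exclusive, end_inclusive]."""
        return [row
                for v, rows in self.commits
                if start_exclusive < v <= end_inclusive
                for row in rows]


log = ToyCommitLog()
log.commit([{"id": 1, "v": "a"}])
log.commit([{"id": 2, "v": "b"}])
log.commit([{"id": 1, "v": "c"}])

as_of_v2 = log.snapshot(2)      # snapshot (time travel) read
changes = log.incremental(1, 3)  # only what changed after version 1
```

A snapshot read never sees commits newer than the requested version, which is the essence of read isolation under MVCC. A rollback in this model amounts to treating an older snapshot as the current state.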

Community guidelines

Community guidelines

Feedback and Contribution

Please feel free to open an issue or discussion if you have any questions.

Join our Discord server for discussions.

Contact Us

Email us at lakesoul-technical-discuss@lists.lfaidata.foundation.

Open Source License

LakeSoul is open-sourced under the Apache License v2.0.
