Beyond Table Formats — A Complete Lakehouse Solution

While Apache Iceberg provides a de facto open table format, LakeSoul aims to deliver a batteries-included, production-ready lakehouse platform. Beyond the table format itself, LakeSoul comes with built-in automated, disaggregated multi-level compaction, fine-grained RBAC (including S3 proxy-based access control), high-performance OLAP queries, vector retrieval, and native multimodal data processing powered by Ray and Daft. Instead of assembling and maintaining separate catalogs, compaction services, and auth layers, you get a complete lakehouse out of the box.

Rust-Native Core, Consistent Everywhere

LakeSoul's metadata management and file format IO are implemented entirely in Rust — a single, high-performance core — with idiomatic bindings for Java, Python, and C++. Whether you're querying via Spark, streaming via Flink, or training models via PyTorch, Ray, or Daft, every engine and every language shares the same ACID guarantees, the same upsert semantics, and the same read performance. There are no per-language/per-engine re-implementations of the table format, no subtle behavioral divergences between bindings, and no fragmented compatibility matrix to navigate.

Compute framework support matrix:

Engine	Version	Read	Write	Interface
Spark	3.5	✓ Batch	✓ Batch	Java / Python / Scala / SQL
Flink	1.20	✓ Streaming	✓ Streaming	Java / SQL
Presto	0.296 (Velox)	✓ Batch	-	SQL
Ray	2.55	✓ Distributed	✓ Distributed	Python
Daft	0.7+	✓ Distributed	✓ Distributed	Python
DuckDB	latest	✓ Standalone	-	Python
PyArrow	16+	✓ Standalone	✓ Standalone	Python
Pandas	2.0+	✓ Standalone	✓ Standalone	Python

Core Features

LakeSoul is a cloud-native lakehouse framework that supports scalable metadata management, ACID transactions, efficient and flexible upsert operations, schema evolution, and unified streaming and batch processing.

LakeSoul supports multiple computing engines for reading and writing lakehouse table data, including Spark, Flink, Presto, PyTorch, Ray, and Daft. LakeSoul supports storage systems such as HDFS and S3.

LakeSoul supports two file formats: Parquet (default) and Vortex. The Vortex file format can be used to store multimodal data and vector embeddings.

LakeSoul was originally created by DMetaSoul and was donated to the Linux Foundation AI & Data as a sandbox project in May 2023.

LakeSoul implements incremental upserts at both the row and column level and allows concurrent updates.
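
As a rough illustration of what row-level and column-level upsert mean (this is not LakeSoul's implementation or API), the sketch below merges incoming records into an in-memory table keyed by primary key. A record that carries only some columns overwrites just those columns. The function `upsert`, the `id` key, and the sample columns are all hypothetical.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def upsert(table: Dict[Any, Row], records: Iterable[Row], key: str = "id") -> None:
    """Apply records to `table` in place.

    A new key inserts a row; an existing key overwrites only the columns
    present in the incoming record (column-level upsert).
    """
    for record in records:
        pk = record[key]
        # Copy the existing row (if any) and overlay the incoming columns.
        merged = dict(table.get(pk, {}))
        merged.update(record)
        table[pk] = merged


# Hypothetical data: the second batch updates one column of key 1
# and inserts a new row for key 2.
table: Dict[Any, Row] = {}
upsert(table, [{"id": 1, "name": "alice", "city": "Paris"}])
upsert(table, [{"id": 1, "city": "Berlin"}, {"id": 2, "name": "bob"}])
```

After both batches, the row for key 1 keeps its original `name` while `city` reflects the later write.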

LakeSoul uses an LSM-tree-like structure to support updates on hash-partitioned tables with primary keys, achieving very high write throughput while providing optimized merge-on-read performance (refer to Performance Benchmarks). LakeSoul scales metadata management and achieves ACID control by using PostgreSQL.
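
The general idea of merge on read over LSM-style sorted runs can be sketched as follows. This is a minimal, generic illustration, not LakeSoul's Rust implementation: it assumes each run (a base file or a delta file within one hash bucket) is already sorted by primary key with unique keys, and that runs are listed from oldest to newest. The name `merge_on_read` and the row layout are hypothetical.

```python
import heapq
from typing import Any, Dict, Iterator, List, Optional

Row = Dict[str, Any]


def merge_on_read(runs: List[List[Row]], key: str = "id") -> Iterator[Row]:
    """Merge sorted runs (oldest first) so the newest row per key wins."""
    # Tag each row with (primary key, run index) so that rows with equal
    # keys come out ordered from the oldest run to the newest run.
    tagged = [
        [((row[key], idx), row) for row in run] for idx, run in enumerate(runs)
    ]
    pending: Optional[Row] = None
    for _tag, row in heapq.merge(*tagged, key=lambda item: item[0]):
        if pending is not None and pending[key] != row[key]:
            yield pending
        # A row from a newer run replaces an older row with the same key.
        pending = row
    if pending is not None:
        yield pending


# Hypothetical runs: a base file and one newer delta file.
base = [{"id": 1, "v": "a"}, {"id": 3, "v": "c"}]
delta = [{"id": 1, "v": "a2"}, {"id": 2, "v": "b"}]
result = list(merge_on_read([base, delta]))
```

Because every run is sorted, the merge streams through the inputs once and never needs to load a whole run into a hash table, which is the property that makes writes cheap (append a new sorted run) at the cost of a merge at read time.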

LakeSoul implements its native metadata layer and IO layer in Rust and provides C/Java/Python interfaces so that multiple computing frameworks, for both big data and AI workloads, can connect to it.

LakeSoul supports concurrent batch and streaming reads and writes. Both reads and writes support CDC semantics; together with automatic schema evolution and an exactly-once guarantee, this makes it easy to build real-time data warehouses.
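
To make CDC semantics concrete, here is a small, generic sketch of applying an ordered changelog to a table keyed by primary key. It is not LakeSoul's API: the operation names (`insert`, `update`, `delete`), the function `apply_changelog`, and the sample rows are assumptions chosen for illustration.

```python
from typing import Any, Dict, Iterable, Tuple

Row = Dict[str, Any]
Change = Tuple[str, Row]  # (operation, row); operation names are hypothetical


def apply_changelog(
    table: Dict[Any, Row], changes: Iterable[Change], key: str = "id"
) -> None:
    """Apply an ordered stream of change events to `table` in place."""
    for op, row in changes:
        pk = row[key]
        if op in ("insert", "update"):
            # Inserts and updates both behave as an upsert on the key.
            table[pk] = row
        elif op == "delete":
            # Deleting a missing key is treated as a no-op.
            table.pop(pk, None)
        else:
            raise ValueError(f"unknown operation: {op}")


# Hypothetical change events, in commit order.
orders: Dict[Any, Row] = {}
apply_changelog(
    orders,
    [
        ("insert", {"id": 1, "qty": 5}),
        ("update", {"id": 1, "qty": 7}),
        ("insert", {"id": 2, "qty": 1}),
        ("delete", {"id": 2, "qty": 1}),
    ],
)
```

Treating inserts and updates as upserts keeps the apply step simple, since the final state depends only on the last event per key.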

LakeSoul supports multi-workspace and RBAC. LakeSoul uses Postgres's RBAC and row-level security policies to implement permission isolation for metadata. Together with the S3 proxy authorization layer, physical data isolation can be achieved. LakeSoul's permission isolation is effective for SQL/Java/Python jobs.

LakeSoul supports automatic disaggregated size-tiered multi-level compaction, automatic table life cycle maintenance, automatic data asset statistics, and automatic redundant data cleaning, reducing operation costs and improving usability.
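
The size-tiered idea behind such compaction can be sketched generically: files of similar size are grouped into tiers, and a tier becomes a compaction candidate once it holds enough files. The thresholds below (`base`, `factor`, `min_files`) and the function names are assumptions for illustration only; they do not describe LakeSoul's actual strategy or defaults.

```python
from collections import defaultdict
from typing import Dict, List


def tier_of(size_bytes: int, base: int = 1 << 20, factor: int = 4) -> int:
    """Return the tier index for a file size.

    Tier 0 holds files smaller than `base`; each following tier covers
    sizes up to `factor` times larger than the previous one.
    """
    tier = 0
    limit = base
    while size_bytes >= limit:
        tier += 1
        limit *= factor
    return tier


def pick_compactions(file_sizes: List[int], min_files: int = 4) -> Dict[int, List[int]]:
    """Group files by tier and return the tiers with enough files to compact."""
    tiers: Dict[int, List[int]] = defaultdict(list)
    for size in file_sizes:
        tiers[tier_of(size)].append(size)
    return {t: sizes for t, sizes in tiers.items() if len(sizes) >= min_files}


MB = 1 << 20
# Hypothetical file sizes: four small files, two medium files, one large file.
candidates = pick_compactions([MB // 2] * 4 + [2 * MB] * 2 + [20 * MB])
```

Only the tier of small files reaches the threshold here, so the two medium files and the large file are left alone until more files of similar size accumulate.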

For more detailed features, please refer to our documentation page: Documentation

Quick Start

Follow the Quick Start to set up a test environment.

Tutorials

Tutorials are available on the doc site:

Check out Examples of Python Data Processing and AI Model Training on LakeSoul to see how LakeSoul connects AI to the lakehouse to build a unified, modern data infrastructure.
Check out LakeSoul Flink CDC Whole Database Synchronization Tutorial to learn how to sync an entire MySQL database into LakeSoul in real time, with auto table creation, auto DDL sync, and an exactly-once guarantee.
Check out Flink SQL Usage to learn how to use Flink SQL to read or write LakeSoul in both batch and streaming mode, with support for Flink Changelog Stream semantics and row-level upsert and delete.
Check out Multi Stream Merge and Build Wide Table Tutorial to learn how to merge multiple streams with the same primary key (and different other columns) concurrently without a join.
Check out Upsert Data and Merge UDF Tutorial to learn how to upsert data and use Merge UDF to customize merge logic.
Check out Snapshot API Usage to learn how to do snapshot reads (time travel), snapshot rollback, and cleanup.
Check out Incremental Query Tutorial to learn how to run incremental queries in Spark in batch or stream mode.
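
The difference between a snapshot read and an incremental query can be sketched over a simple commit log. This is a conceptual illustration only, not the LakeSoul Snapshot API: it assumes commits are kept sorted by timestamp, ignores compaction (which would replace files), and uses hypothetical names (`Commit`, `snapshot_files`, `incremental_files`) and file names.

```python
from bisect import bisect_right
from typing import List, NamedTuple


class Commit(NamedTuple):
    timestamp: int     # commit time, e.g. epoch seconds
    files: List[str]   # data files added by this commit


def snapshot_files(commits: List[Commit], as_of: int) -> List[str]:
    """Files visible to a snapshot read at time `as_of` (inclusive)."""
    timestamps = [c.timestamp for c in commits]
    # Index of the first commit strictly newer than `as_of`.
    end = bisect_right(timestamps, as_of)
    return [f for c in commits[:end] for f in c.files]


def incremental_files(commits: List[Commit], start: int, end: int) -> List[str]:
    """Files added by commits with start < timestamp <= end."""
    return [f for c in commits if start < c.timestamp <= end for f in c.files]


# Hypothetical commit log, sorted by timestamp.
log = [
    Commit(100, ["a.parquet"]),
    Commit(200, ["b.parquet"]),
    Commit(300, ["c.parquet"]),
]
at_250 = snapshot_files(log, 250)
since_100 = incremental_files(log, 100, 300)
```

A snapshot read sees everything committed up to a point in time, while an incremental query sees only what was added inside a time window, which is what lets downstream jobs process new data without rescanning the table.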

Usage Documentation

Usage documentation is available on the doc site: Usage Doc

Quick Start (Chinese) | Tutorials (Chinese) | Usage Documentation (Chinese)

Feature Roadmap

Roadmap 2026

Compute Engine Version
- Spark 4.0+
- Flink 2.0+
Multimodality
- Vortex file format
- Daft integration
- Vector ANN search on lakehouse (on object store), with upserts
Performance
- 2x faster merge-on-read with window-sliding merge (for both full and partial merge)
- 50% memory usage reduction with spill-sort in primary key table writer
- Disk LRU cache for object store
- Apache Gluten integration
- Velox integration
- Up to 100x faster partition pruning and partition snapshot query with metadata index and query optimizations
- Optionally route read-only metadata queries to PG standby instances
- Secondary index
- Metadata cache
Maintenance
- (auto) Leveled compaction strategy
- (auto) Async cleanup (vacuum) via Flink CDC on PG replication slot
Security
- S3 proxy with table RBAC verification

Roadmap history

Data Science and AI
- Native Python Reader (without PySpark)
- PyTorch Dataset and distributed training
- Ray/Daft support
Meta Management (#23)
- Multiple Level Partitioning: Multiple range partition and at most one hash partition
- Concurrent write with auto conflict resolution
- MVCC with read isolation
- Write transaction (two-stage commit) through Postgres Transaction
- Schema Evolution: Column add/delete supported
Table operations
- LSM-Tree style upsert for hash partitioned table
- Merge on read for hash partition with upsert delta file
- Copy on write update for non hash partitioned table
- Automatic Disaggregated Compaction Service
Data Warehousing
- CDC stream ingestion with auto DDL sync
- Incremental and Snapshot Query
  - Snapshot Query (#103)
  - Incremental Query (#103)
  - Incremental Streaming Source (#130)
  - Flink Stream/Batch Source
- Multi Workspaces and RBAC
Spark Integration
- Table/Dataframe API
- SQL support with catalog except upsert
- Query optimization
  - Shuffle/Join elimination for operations on primary key
- Merge UDF (Merge operator)
- Merge Into SQL support
  - Merge Into SQL with match on Primary Key (Merge on read)
Flink Integration and CDC Ingestion (#57)
- Table API
  - Batch/Stream Sink
  - Batch/Stream source
  - Stream Source/Sink for ChangeLog Stream Semantics
  - Exactly Once Source and Sink
- Flink CDC
  - Auto Schema Change (DDL) Sync
  - Auto Table Creation (depends on #78)
  - Support sink multiple source tables with different schemas (#84)
Hive Integration
- Export to Hive partition after compaction
- Apache Kyuubi (Hive JDBC) Integration
Realtime Data Warehousing
- CDC ingestion
- Time Travel (Snapshot read)
- Snapshot rollback
- Automatic global compaction service
- MPP Engine Integration (depends on #66)
  - Presto
  - Compatibility with Presto Native Execution (with Velox)
  - Apache Doris
Cloud and Native IO (#66)
- Object storage IO optimization
- Native vectorized merge on read
- Multi-layer storage classes support with local-disk data cache

Community guidelines

Feedback and Contribution

Please feel free to open an issue or discussion if you have any questions.

Join our Discord server for discussions.

Contact Us

Email us at lakesoul-technical-discuss@lists.lfaidata.foundation.

Open Source License

LakeSoul is open-sourced under the Apache License v2.0.
