Unified Streaming & Batch

LakeInsight adopts a lakehouse-based real-time data warehouse architecture with native support for unified streaming and batch processing. It covers the full chain, from real-time CDC data synchronization and incremental stream computing to offline batch modeling, and unifies three pairs of capabilities: streaming and batch, lake and warehouse, and AI and BI. The platform also provides data traceability, manageability, observability, and elastic cluster scaling.

Real-Time Data Synchronization

(1) Multi-Source CDC Synchronization

Supports single-database single-table and single-database multi-table synchronization, capturing data changes incrementally via database CDC events
Automatically parses database names, table names, and schema information, enabling fully automated database and table creation
Automatic schema change synchronization: automatically detects DDL changes (adding or dropping columns, and column type changes such as int→long and float→double) and synchronizes them to the target storage platform
Logical column deletion: when a source table drops a column, the platform can retain the dropped column's schema so that historical field information remains queryable
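
The schema change handling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TargetSchema` class, its method names, and the operation names (`add`, `drop`, `alter_type`) are hypothetical and are not LakeInsight APIs. The sketch accepts only the two widening conversions named in the text and treats a dropped column as logically deleted, so its definition is kept.

```python
from dataclasses import dataclass, field
from typing import Optional

# Widening conversions named in the text; anything else is rejected in this sketch.
ALLOWED_WIDENING = {("int", "long"), ("float", "double")}


@dataclass
class TargetSchema:
    """Hypothetical target-side schema that mirrors source DDL changes."""

    columns: dict = field(default_factory=dict)  # column name -> type name
    dropped: set = field(default_factory=set)    # logically deleted columns

    def apply_ddl(self, op: str, column: str, new_type: Optional[str] = None) -> None:
        if op == "add":
            if new_type is None:
                raise ValueError("add requires a column type")
            self.columns[column] = new_type
            self.dropped.discard(column)
        elif op == "drop":
            if column not in self.columns:
                raise KeyError(column)
            # Logical deletion: keep the definition, only mark it as dropped.
            self.dropped.add(column)
        elif op == "alter_type":
            old_type = self.columns[column]
            if (old_type, new_type) not in ALLOWED_WIDENING:
                raise ValueError(f"unsupported type change {old_type} -> {new_type}")
            self.columns[column] = new_type
        else:
            raise ValueError(f"unknown DDL operation: {op}")

    def active_columns(self) -> list:
        """Columns that still exist in the source table."""
        return [name for name in self.columns if name not in self.dropped]


schema = TargetSchema({"id": "int", "price": "float", "note": "string"})
schema.apply_ddl("alter_type", "id", "long")  # int -> long widening
schema.apply_ddl("drop", "note")              # logical deletion
print(schema.active_columns())                # columns still present at the source
print(schema.columns["note"])                 # dropped column's type is still known
```

Keeping dropped columns in `columns` while tracking them in a separate set is one simple way to keep historical fields queryable without exposing them as live columns.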

(2) Rich Data Type Support

Supports synchronization of all common data types including boolean, bit, binary, varbinary, blob, bigint, int, integer, float, double, date, datetime, timestamp, decimal, char, varchar, string, text, and json

(3) Data Accuracy Guarantees

End-to-end exactly-once semantics, ensuring no data loss or duplication during transmission
Data delay detection to prevent anomalies caused by upstream data latency
Multiple checkpoint recovery mechanisms, including timestamp-based consumption and latest-data consumption, for rapid recovery of sync tasks
Data source security mechanisms
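
To make the recovery modes and the no-duplication guarantee more concrete, here is a minimal sketch of one common technique: the sink records the offset of the last applied event, so replaying events after a restart has no effect. The names `ChangeEvent`, `resolve_start_offset`, and `IdempotentSink` are assumptions made for illustration; the source does not describe how LakeInsight implements exactly-once delivery.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    offset: int  # position in the change log
    ts: int      # event timestamp (arbitrary units)
    key: str
    value: str


def resolve_start_offset(log: list, mode: str, timestamp: Optional[int] = None) -> int:
    """Pick the offset at which a restarted sync task resumes consumption."""
    end = log[-1].offset + 1 if log else 0
    if mode == "latest":
        return end  # skip history, consume only new events
    if mode == "timestamp":
        if timestamp is None:
            raise ValueError("timestamp mode requires a timestamp")
        for event in log:
            if event.ts >= timestamp:
                return event.offset  # first event at or after the timestamp
        return end
    raise ValueError(f"unknown start mode: {mode}")


@dataclass
class IdempotentSink:
    """Hypothetical sink that stores data together with the last applied offset."""

    rows: dict = field(default_factory=dict)
    committed_offset: int = -1

    def apply(self, event: ChangeEvent) -> bool:
        if event.offset <= self.committed_offset:
            return False  # already applied: a replay must not duplicate it
        self.rows[event.key] = event.value
        self.committed_offset = event.offset
        return True


log = [
    ChangeEvent(0, 100, "a", "v1"),
    ChangeEvent(1, 200, "b", "v1"),
    ChangeEvent(2, 300, "a", "v2"),
]
sink = IdempotentSink()
for event in log:
    sink.apply(event)

# Simulate a restart that replays from a timestamp before the last commit.
start = resolve_start_offset(log, "timestamp", timestamp=150)
replayed = [sink.apply(event) for event in log[start:]]
print(start, replayed, sink.rows)
```

In a real system the data write and the offset update would need to be committed atomically; the in-memory sketch does not model that.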

Real-Time Compute & Storage

(1) Metadata Management

High-availability, distributed deployment supporting tens of millions of data objects per node
Multi-level management: Domain, Namespace, Table, Partition, and Data File
High-concurrency writes with ACID transactions ensuring read-write consistency
TimeTravel support: data rollback, snapshot reads, and incremental reads
Listen-Trigger-Notify mechanism for automatic compaction and data cleanup
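
The TimeTravel operations listed above (snapshot reads, incremental reads, and rollback) can be illustrated with a toy append-only table in which every commit creates a new snapshot. The `VersionedTable` class and its method names are hypothetical, and the model ignores partitions, data files, and concurrency; it is only meant to show how the three read patterns relate to one another.

```python
class VersionedTable:
    """Toy append-only table: commit i produces snapshot id i (starting at 1)."""

    def __init__(self) -> None:
        self._commits = []  # one list of rows per commit

    @property
    def latest_snapshot(self) -> int:
        return len(self._commits)

    def _check(self, snapshot_id: int) -> None:
        if not 0 <= snapshot_id <= self.latest_snapshot:
            raise ValueError(f"unknown snapshot: {snapshot_id}")

    def commit(self, rows: list) -> int:
        self._commits.append(list(rows))
        return self.latest_snapshot

    def read_snapshot(self, snapshot_id: int) -> list:
        """Snapshot read: the table as it looked at the given snapshot."""
        self._check(snapshot_id)
        return [row for commit in self._commits[:snapshot_id] for row in commit]

    def read_incremental(self, after: int, upto: int) -> list:
        """Incremental read: only rows added after `after`, up to `upto`."""
        self._check(after)
        self._check(upto)
        return [row for commit in self._commits[after:upto] for row in commit]

    def rollback(self, snapshot_id: int) -> None:
        """Data rollback: discard every commit made after the given snapshot."""
        self._check(snapshot_id)
        del self._commits[snapshot_id:]


table = VersionedTable()
table.commit([{"id": 1}])
table.commit([{"id": 2}])
table.commit([{"id": 3}])
print(table.read_snapshot(2))        # state as of snapshot 2
print(table.read_incremental(1, 3))  # rows added by snapshots 2 and 3
table.rollback(1)
print(table.latest_snapshot)
```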

(2) Flexible Data Updates & Reads

Append mode for tables without primary keys
Upsert mode for tables with primary keys: updates are merged by primary key on read so that queries return the latest data
Multiple read modes: MOR (Merge on Read), incremental read, and snapshot read
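
The difference between Append and Upsert reads can be shown with a small merge-on-read sketch. It assumes that each row carries a monotonically increasing `seq` field indicating write order and that the primary key column is called `id`; both are illustrative assumptions, not LakeInsight's storage format.

```python
def append_read(base_rows: list, delta_rows: list) -> list:
    """Append mode (no primary key): every written row is returned as-is."""
    return [*base_rows, *delta_rows]


def merge_on_read(base_rows: list, delta_rows: list, primary_key: str = "id") -> list:
    """Upsert mode: for each primary key, keep only the most recently written row."""
    latest = {}
    # Process rows in write order so later writes overwrite earlier ones.
    for row in sorted([*base_rows, *delta_rows], key=lambda r: r["seq"]):
        latest[row[primary_key]] = row
    return sorted(latest.values(), key=lambda r: r[primary_key])


base = [
    {"id": 1, "seq": 1, "status": "created"},
    {"id": 2, "seq": 2, "status": "created"},
]
delta = [
    {"id": 2, "seq": 3, "status": "paid"},     # update to an existing key
    {"id": 3, "seq": 4, "status": "created"},  # new key
]
print(len(append_read(base, delta)))  # all four written rows
for row in merge_on_read(base, delta):
    print(row)                        # one row per primary key
```

Merging at read time keeps writes cheap, since an update is just another appended row, at the cost of extra work per query until the rows are compacted.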

(3) Multi-Engine Support

Engine Type	Supported Engines
Batch Computing	Spark
Stream Computing	Flink, Spark Streaming
AI Computing	PyTorch, Pandas, Spark MLlib
MPP Analytics	Presto, Doris

A unified API supports integration with a range of open-source engines.

(4) Streaming & Batch Warehouse Modeling

Real-Time Incremental Modeling: Incrementally reads upstream data in streaming mode with Changelog semantics; supports Flink stream-stream Join, LookupJoin, and Aggregate operations; supports CDC output with real-time data persistence and push to downstream data services
Batch Computation Modeling: Periodically scheduled batch execution of modeling tasks; supports Overwrite and Upsert result update modes; supports Spark SQL and Spark DataFrame API development
Unified Storage & Query: Streaming and batch writes share the same lakehouse storage; the query layer automatically merges streaming and batch data, making underlying data sources transparent to users
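
Changelog semantics are what allow an aggregate to be maintained incrementally rather than recomputed in each batch. The sketch below is plain Python, not Flink code: it borrows the short row-kind markers used by Flink (`+I` insert, `-U` update-before, `+U` update-after, `-D` delete) and keeps a running sum per key. The function name `apply_changelog` and the record layout `(kind, key, amount)` are assumptions made for illustration.

```python
from collections import defaultdict


def apply_changelog(totals: dict, changelog: list) -> dict:
    """Incrementally maintain SUM(amount) GROUP BY key from a changelog stream."""
    for kind, key, amount in changelog:
        if kind in ("+I", "+U"):
            totals[key] += amount  # new row, or the new version of an updated row
        elif kind in ("-U", "-D"):
            totals[key] -= amount  # retract the old version or a deleted row
        else:
            raise ValueError(f"unknown row kind: {kind}")
    return totals


totals = defaultdict(int)
apply_changelog(totals, [
    ("+I", "shop_a", 10),
    ("+I", "shop_b", 5),
    ("-U", "shop_a", 10),  # an update arrives as retract-old ...
    ("+U", "shop_a", 12),  # ... followed by emit-new
    ("-D", "shop_b", 5),
])
print(dict(totals))
```

Because every change carries enough information to retract the previous value, the aggregate is updated by processing only new events, which is the core idea behind incremental modeling.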
