Unified Streaming & Batch
LakeInsight adopts a lakehouse-based real-time data warehouse architecture with native support for unified streaming and batch processing. It covers the full pipeline, from real-time CDC synchronization and incremental stream computing to offline batch modeling, unifying streaming with batch, lake with warehouse, and AI with BI. Data remains traceable, manageable, and observable, and clusters scale elastically.
Real-Time Data Synchronization
(1) Multi-Source CDC Synchronization
- Supports single-database single-table and single-database multi-table synchronization, capturing data changes incrementally via database CDC events
- Automatically parses database names, table names, and schema information, enabling fully automated database and table creation
- Automatic schema change synchronization: detects DDL changes (adding or dropping columns, and column type changes such as int→long and float→double) and applies them to the target storage platform
- Logical column deletion: when a source table drops a column, the platform can retain the dropped column's schema so that historical field information remains queryable
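The schema synchronization behavior described above can be illustrated with a minimal sketch. It is not LakeInsight's implementation: the `TableSchema` class, the `apply_ddl` function, and the event dictionary layout are hypothetical, and the widening rules are limited to the two examples given in the text.

```python
from dataclasses import dataclass, field

# Assumed widening rules, taken only from the examples in the text.
WIDENING = {("int", "long"), ("float", "double")}


@dataclass
class Column:
    name: str
    dtype: str
    deleted: bool = False  # logical deletion flag; the definition is retained


@dataclass
class TableSchema:
    columns: dict = field(default_factory=dict)

    def add_column(self, name, dtype):
        if name in self.columns and not self.columns[name].deleted:
            raise ValueError(f"column {name!r} already exists")
        self.columns[name] = Column(name, dtype)

    def drop_column(self, name):
        # Logical delete: keep the definition so historical fields stay queryable.
        self.columns[name].deleted = True

    def change_type(self, name, new_dtype):
        col = self.columns[name]
        if (col.dtype, new_dtype) not in WIDENING:
            raise ValueError(f"unsupported type change {col.dtype} -> {new_dtype}")
        col.dtype = new_dtype

    def active_columns(self):
        return [c.name for c in self.columns.values() if not c.deleted]


def apply_ddl(schema, event):
    """Apply one hypothetical DDL event dict to the target schema."""
    op = event["op"]
    if op == "add":
        schema.add_column(event["column"], event["type"])
    elif op == "drop":
        schema.drop_column(event["column"])
    elif op == "alter_type":
        schema.change_type(event["column"], event["type"])
    else:
        raise ValueError(f"unknown DDL op {op!r}")
```

Rejecting type changes outside the widening set is a design choice of this sketch: narrowing a column (for example long→int) could truncate existing values, so it is treated as an error rather than applied silently.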
(2) Rich Data Type Support
Supports synchronization of all common data types, including boolean, bit, binary, varbinary, blob, bigint, int, integer, float, double, date, datetime, timestamp, decimal, char, varchar, string, text, and json.
(3) Data Accuracy Guarantees
- End-to-end exactly-once semantics, ensuring that no data is lost or duplicated during transmission
- Data delay detection mechanism to prevent anomalies caused by upstream data latency
- Multiple checkpoint recovery options for fast sync task recovery, including resuming consumption from a given timestamp and consuming from the latest data
- Data source security mechanisms
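The source does not say how these guarantees are implemented. The sketch below shows one common approach, under stated assumptions: a restart position resolved from a timestamp or from the latest data, and a sink that records the last committed offset alongside the rows so that replayed events are skipped. The names `resolve_start_offset` and `IdempotentSink` are hypothetical.

```python
import bisect


def resolve_start_offset(log, mode, timestamp=None):
    """Pick the offset at which a restarted sync task resumes.

    log: list of (event_time, payload) tuples sorted by event_time.
    mode: "timestamp" resumes at the first event with event_time >= timestamp;
          "latest" skips the backlog and consumes only new events.
    """
    if mode == "latest":
        return len(log)
    if mode == "timestamp":
        times = [t for t, _ in log]
        return bisect.bisect_left(times, timestamp)
    raise ValueError(f"unknown mode {mode!r}")


class IdempotentSink:
    """Keeps rows together with the last committed offset.

    Committing data and offset as one unit lets replayed events be detected
    and skipped, which gives exactly-once effects on the target.
    """

    def __init__(self):
        self.rows = []
        self.committed_offset = -1

    def write(self, offset, payload):
        if offset <= self.committed_offset:
            return False  # replay after a restart: already applied
        self.rows.append(payload)
        self.committed_offset = offset
        return True
```

In a real system the rows and the offset would be persisted in a single transaction; the in-memory fields here only stand in for that.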
Real-Time Compute & Storage
(1) Metadata Management
- Highly available, distributed deployment supporting tens of millions of data objects per node
- Multi-level management: Domain, Namespace, Table, Partition, and Data File
- High-concurrency writes with ACID transactions ensuring read-write consistency
- Time travel support: data rollback, snapshot reads, and incremental reads
- Listen-Trigger-Notify mechanism for automatic compaction and data cleanup
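To make the three time travel operations concrete, here is a minimal sketch that models a table as an ordered list of snapshots. The `SnapshotTable` class and its method names are hypothetical and are not the LakeInsight API; real metadata services track far more than the list of added files.

```python
class SnapshotTable:
    """Ordered list of snapshots; each records the data files it added."""

    def __init__(self):
        # snapshots[i] holds the files added by the commit with id i + 1
        self.snapshots = []

    def commit(self, added_files):
        self.snapshots.append(list(added_files))
        return len(self.snapshots)  # snapshot ids start at 1

    def snapshot_read(self, snapshot_id):
        # All files visible as of the given snapshot.
        return [f for files in self.snapshots[:snapshot_id] for f in files]

    def incremental_read(self, start_id, end_id):
        # Files added after start_id, up to and including end_id.
        return [f for files in self.snapshots[start_id:end_id] for f in files]

    def rollback(self, snapshot_id):
        # Discard every snapshot newer than snapshot_id.
        del self.snapshots[snapshot_id:]
```

For example, after three commits, `snapshot_read(2)` returns the table as it stood after the second commit, while `incremental_read(1, 3)` returns only what the second and third commits added.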
(2) Flexible Data Updates & Reads
- Append mode for tables without primary keys
- Upsert mode for tables with primary keys: updates are merged by primary key at read time so that queries return the latest data
- Multiple read modes: MOR (Merge on Read), incremental read, and snapshot read
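The upsert and merge-on-read behavior can be sketched as follows. This is an illustration under assumptions, not the platform's storage format: `merge_on_read` is a hypothetical function, rows are plain dictionaries, and the `_deleted` tombstone field is invented here to show how a CDC delete could be represented.

```python
def merge_on_read(base_rows, delta_rows, key="id"):
    """Return the latest version of each row by primary key.

    base_rows: rows from previously compacted files.
    delta_rows: later upserts in commit order; a row carrying
                "_deleted": True (hypothetical tombstone) removes its key.
    """
    merged = {row[key]: row for row in base_rows}
    for row in delta_rows:
        if row.get("_deleted"):
            merged.pop(row[key], None)
        else:
            merged[row[key]] = row  # later writes win
    return sorted(merged.values(), key=lambda r: r[key])
```

Because the merge happens at read time, writes only append delta rows; a background compaction (as mentioned under metadata management) would periodically fold the deltas into the base files.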
(3) Multi-Engine Support
| Engine Type | Supported Engines |
|---|---|
| Batch Computing | Spark |
| Stream Computing | Flink, Spark Streaming |
| AI Computing | PyTorch, Pandas, Spark MLlib |
| MPP Analytics | Presto, Doris |
A unified API supports integration with various open-source engines.
(4) Streaming & Batch Warehouse Modeling
- Real-Time Incremental Modeling: reads upstream data incrementally in streaming mode with changelog semantics; supports Flink stream-stream joins, lookup joins, and aggregations; supports CDC output, persisting data in real time and pushing it to downstream data services
- Batch Computation Modeling: runs modeling tasks as periodically scheduled batch jobs; supports Overwrite and Upsert result update modes; supports development with Spark SQL and the Spark DataFrame API
- Unified Storage & Query: Streaming and batch writes share the same lakehouse storage; the query layer automatically merges streaming and batch data, making underlying data sources transparent to users
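As an illustration of the changelog semantics mentioned under real-time incremental modeling, the sketch below maintains a grouped sum from a stream of change events. The row-kind strings follow Flink's short notation (+I insert, -U update-before, +U update-after, -D delete); the `apply_changelog` function and the event tuple layout are hypothetical.

```python
from collections import defaultdict


def apply_changelog(totals, events):
    """Maintain SUM(amount) GROUP BY key from a changelog stream.

    Each event is a (kind, key, amount) tuple. Insert and update-after
    events add to the running total; update-before and delete events
    retract the old value.
    """
    for kind, key, amount in events:
        if kind in ("+I", "+U"):
            totals[key] += amount
        elif kind in ("-U", "-D"):
            totals[key] -= amount
        else:
            raise ValueError(f"unknown row kind {kind!r}")
    return totals
```

Retracting the old value before adding the new one is what keeps an incremental aggregate consistent with a batch recomputation over the same data.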