Apache Paimon
Apache Paimon is a dynamic data lake storage format that supports high-throughput writes and low-latency queries. It is compatible with Flink-based Ververica Cloud and can be deployed on Hadoop Distributed File System (HDFS) or Ververica Cloud. For full details, see the Apache Paimon documentation.
Functionality Overview
Apache Paimon offers:
- Data lake storage using HDFS or VERA
- Read/write capabilities for large-scale datasets, both in streaming and batch modes
- Fast batch and OLAP queries
- Incremental data handling; suitable for both offline and streaming data warehouses
- Data pre-aggregation, storage optimization, and computation
- Tracing for historical data versions
- Data filtering mechanisms
- Table schema modifications
Restrictions
The Apache Paimon connector is supported only by Ververica Cloud for Apache Flink with Ververica Runtime (VERA) 1.0.3 or later.
Workload Identity
When deploying your applications on Azure, you can configure Workload Identity authentication. Using a managed identity allows your applications, such as Apache Paimon, to securely access external Azure resources like Azure Blob Storage without storing long-lived credentials in your applications.
Read more about configuring Workload Identity on Azure
Usage
When you create an Apache Paimon table in an Apache Paimon catalog, you do not need to set the connector parameter. The following sample code shows the syntax.
CREATE TABLE `<your-paimon-catalog>`.`<your-db>`.paimon_table (
  id BIGINT,
  data STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  ...
);
If you have already created an Apache Paimon table in the Apache Paimon catalog, you can use it directly without recreating it.
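As a minimal sketch of using an existing catalog table directly, the statements below read from and write to the table by its fully qualified name. The placeholders are the same ones used in the sample above, and the inserted values are made up for illustration.

```sql
-- Read from the existing table; no CREATE TABLE statement is needed.
SELECT id, data
FROM `<your-paimon-catalog>`.`<your-db>`.paimon_table;

-- Write to the same table (illustrative values only).
INSERT INTO `<your-paimon-catalog>`.`<your-db>`.paimon_table
VALUES (CAST(1 AS BIGINT), 'example');
```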
To create a temporary Apache Paimon table in a catalog that belongs to a storage system other than Apache Paimon, you must set the connector parameter and specify the storage path of the Apache Paimon table. The following sample code shows the syntax for this case.
CREATE TEMPORARY TABLE paimon_table (
  id BIGINT,
  data STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'paimon',
  'path' = '<path-to-paimon-table-files>',
  'auto-create' = 'true' -- If no Apache Paimon table file exists in the specified path, a file is automatically created.
);
Parameters for the WITH Clause
For more information about the configuration items, see the Apache Paimon documentation.
Feature Overview
Assuring Data Freshness and Consistency
The Apache Paimon result table commits data using the two-phase commit protocol (2PC) every time a Flink deployment triggers a checkpoint. This ensures that the data’s freshness corresponds to the Flink deployment’s checkpoint interval. During each data commit, up to two snapshots may be created.
When two separate Flink deployments write to an Apache Paimon table concurrently but target different buckets, serializable consistency is maintained. If they target the same bucket, only snapshot-level consistency is maintained. In that case the table may reflect a mix of the results from both deployments, but no data is lost.
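Because data is committed at each checkpoint, the checkpoint interval is the setting that determines freshness. The following is a rough sketch, assuming the interval can be set with a SET statement as in the open-source Flink SQL client; on Ververica Cloud the interval may instead be set in the deployment configuration. The datagen source table and the 60-second value are hypothetical and chosen only for illustration.

```sql
-- Assumption: the checkpoint interval is configurable through SET.
-- With this value, new data becomes visible roughly every 60 seconds.
SET 'execution.checkpointing.interval' = '60 s';

-- Hypothetical source that generates random rows.
CREATE TEMPORARY TABLE datagen_source (
  id BIGINT,
  data STRING
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '10'
);

-- Each checkpoint of this deployment commits the rows written since the previous one.
INSERT INTO `<your-paimon-catalog>`.`<your-db>`.paimon_table
SELECT id, data FROM datagen_source;
```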
Data Merging Mechanism
If an Apache Paimon result table receives multiple records with the same primary key, it merges them into a single record to keep the primary key unique. The merge-engine parameter controls the merging strategy. The available strategies are described in a later section; the following example uses the aggregation strategy.
CREATE TABLE MyTable (
  product_id BIGINT,
  price DOUBLE,
  sales BIGINT,
  PRIMARY KEY (product_id) NOT ENFORCED
) WITH (
  'merge-engine' = 'aggregation',
  'fields.price.aggregate-function' = 'max',
  'fields.sales.aggregate-function' = 'sum'
);
In this example, the price column is aggregated with the max function and the sales column is aggregated with the sum function. If the input records are <1, 23.0, 15> and <1, 30.2, 20>, the result is <1, 30.2, 35>. The supported aggregate functions and the data types they accept:
- sum: DECIMAL, TINYINT, SMALLINT, INTEGER, BIGINT, FLOAT, and DOUBLE
- min and max: DECIMAL, TINYINT, SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE, DATE, TIME, TIMESTAMP, and TIMESTAMP_LTZ
- last_value and last_non_null_value: all data types
- listagg: STRING
- bool_and and bool_or: BOOLEAN
Only the sum function supports data retraction and deletion. If you want specific columns to ignore retraction and deletion messages, set 'fields.${field_name}.ignore-retract'='true'. To read the aggregation result from the Apache Paimon result table in streaming mode, you must set the changelog-producer parameter to lookup or full-compaction.
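The two settings described above can be combined in one table definition. The following sketch uses a hypothetical table named product_stats and assumes the current catalog is an Apache Paimon catalog. The price column uses max, which does not support retraction, so it is configured to ignore retraction and deletion messages.

```sql
-- Hypothetical table; the option keys are the ones named in the text above.
CREATE TABLE product_stats (
  product_id BIGINT,
  price DOUBLE,
  sales BIGINT,
  PRIMARY KEY (product_id) NOT ENFORCED
) WITH (
  'merge-engine' = 'aggregation',
  'fields.price.aggregate-function' = 'max',
  'fields.sales.aggregate-function' = 'sum',
  -- max cannot process retractions, so drop them for this column.
  'fields.price.ignore-retract' = 'true',
  -- Required for reading the aggregation result in streaming mode.
  'changelog-producer' = 'lookup'
);
```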
Mechanism for Producing Incremental Data
The changelog-producer parameter determines how incremental data is produced. Apache Paimon can produce complete incremental data from any input stream: every UPDATE_AFTER record has a matching UPDATE_BEFORE record. All mechanisms for generating incremental data are listed below. For details, see the official Apache Paimon documentation.
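As a hedged illustration of consuming incremental data, the sketch below defines a hypothetical table named orders_latest with changelog-producer set to lookup and then reads it with a continuous query. It assumes the current catalog is an Apache Paimon catalog and that the runtime mode can be set with a SET statement as in the open-source Flink SQL client.

```sql
-- Hypothetical table that produces a changelog for downstream readers.
CREATE TABLE orders_latest (
  order_id BIGINT,
  status STRING,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'changelog-producer' = 'lookup'
);

-- Assumption: the runtime mode is configurable through SET.
SET 'execution.runtime-mode' = 'streaming';

-- A continuous query consumes the table's incremental data.
-- Per the description above, an update to an existing order_id is expected
-- to arrive as an UPDATE_BEFORE / UPDATE_AFTER pair.
SELECT order_id, status FROM orders_latest;
```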
Write Mode
The table below outlines the write modes that Apache Paimon tables can accommodate.