Parquet

The Apache Parquet format allows reading and writing Parquet data.

Example of Use

    CREATE TABLE user_behavior (
        user_id BIGINT,
        item_id BIGINT,
        category_id BIGINT,
        behavior STRING,
        ts TIMESTAMP(3),
        dt STRING
    ) PARTITIONED BY (dt) WITH (
        'connector' = 'filesystem',
        'path' = '/tmp/user_behavior',
        'format' = 'parquet'
    )
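
As a minimal sketch of how the table above might be used, the statements below write one partition and read it back. The source table user_behavior_source and the partition value '2024-01-01' are hypothetical and serve only as illustration.

    -- Write into a single static partition. user_behavior_source is a
    -- hypothetical table that provides the non-partition columns.
    INSERT INTO user_behavior PARTITION (dt = '2024-01-01')
    SELECT user_id, item_id, category_id, behavior, ts
    FROM user_behavior_source;

    -- Read the same partition back from the Parquet files.
    SELECT user_id, behavior, ts
    FROM user_behavior
    WHERE dt = '2024-01-01';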

Format Options

Option	Required	Default	Type	Description
format	required	(none)	String	Specifies which format to use; here it should be 'parquet'.
parquet.utc-timezone	optional	false	Boolean	Whether to use UTC or the local time zone when converting between epoch time and LocalDateTime. Hive 0.x/1.x/2.x use the local time zone, whereas Hive 3.x uses UTC.
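
For illustration, the option can be set in the WITH clause next to the format itself. This is a sketch only: the table name user_behavior_utc and its path are hypothetical, and 'true' is chosen on the assumption that the files are shared with a reader that expects UTC, such as Hive 3.x.

    CREATE TABLE user_behavior_utc (
        user_id BIGINT,
        ts TIMESTAMP(3)
    ) WITH (
        'connector' = 'filesystem',
        'path' = '/tmp/user_behavior_utc',   -- hypothetical path
        'format' = 'parquet',
        -- Convert between epoch time and LocalDateTime in UTC
        -- instead of the local time zone (the default is false).
        'parquet.utc-timezone' = 'true'
    )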

Data Type Mapping

Currently, the Parquet format type mapping is compatible with Apache Hive but differs from Apache Spark:

Timestamp: the timestamp type is mapped to int96 regardless of its precision.
Decimal: the decimal type is mapped to a fixed-length byte array whose length depends on the precision.

The following table lists the type mapping from Flink type to Parquet type.

Flink Data Type	Parquet type	Parquet logical type
CHAR / VARCHAR / STRING	BINARY	UTF8
BOOLEAN	BOOLEAN
BINARY / VARBINARY	BINARY
DECIMAL	FIXED_LEN_BYTE_ARRAY	DECIMAL
TINYINT	INT32	INT_8
SMALLINT	INT32	INT_16
INT	INT32
BIGINT	INT64
FLOAT	FLOAT
DOUBLE	DOUBLE
DATE	INT32	DATE
TIME	INT32	TIME_MILLIS
TIMESTAMP	INT96
ARRAY	-	LIST
MAP	-	MAP
ROW	-	STRUCT
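
To make the mapping concrete, the sketch below declares a table covering several of the types above and notes in comments the Parquet type each column would be written as according to the table. The table name, column names, and path are hypothetical.

    CREATE TABLE type_mapping_demo (
        name        STRING,                     -- BINARY (UTF8)
        active      BOOLEAN,                    -- BOOLEAN
        price       DECIMAL(10, 2),             -- FIXED_LEN_BYTE_ARRAY (DECIMAL)
        small_flag  TINYINT,                    -- INT32 (INT_8)
        quantity    INT,                        -- INT32
        total       BIGINT,                     -- INT64
        ratio       DOUBLE,                     -- DOUBLE
        created_on  DATE,                       -- INT32 (DATE)
        created_at  TIMESTAMP(3),               -- INT96, regardless of precision
        tags        ARRAY<STRING>,              -- LIST
        counters    MAP<STRING, BIGINT>,        -- MAP
        address     ROW<city STRING, zip INT>   -- STRUCT
    ) WITH (
        'connector' = 'filesystem',
        'path' = '/tmp/type_mapping_demo',      -- hypothetical path
        'format' = 'parquet'
    )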
