Skip to main content

Parquet

The Apache Parquet format allows to read and write Parquet data.

Example of use

    CREATE TABLE user_behavior (
user_id BIGINT,
item_id BIGINT,
category_id BIGINT,
behavior STRING,
ts TIMESTAMP(3),
dt STRING
) PARTITIONED BY (dt) WITH (
'connector' = 'filesystem',
'path' = '/tmp/user_behavior',
'format' = 'parquet'
)

Format Options

OptionRequiredDefaultType of dataDescription
formatrequired(none)StringSpecify what format to use, here should be ‘parquet’.
parquet.utc-timezoneoptionalfalseBooleanUse UTC timezone or local timezone to the conversion between epoch time and LocalDateTime. Hive 0.x/1.x/2.x use local timezone. But Hive 3.x use UTC timezone.

Data type mapping

Currently, Parquet format type mapping is compatible with Apache Hive, but different with Apache Spark:

  • Timestamp: mapping timestamp type to int96 whatever the precision is.
  • Decimal: mapping decimal type to fixed length byte array according to the precision.

The following table lists the type mapping from Flink type to Parquet type.

Flink Data TypeParquet typeParquet logical type
CHAR / VARCHAR / STRINGBINARYUTF8
BOOLEANBOOLEAN
BINARY / VARBINARYBINARY / VARBINARY
DECIMALFIXED_LEN_BYTE_ARRAYDECIMAL
TINYINTINT32INT_8
SMALLINTINT32INT_16
INTINT32
BIGINTINT64
FLOATFLOAT
DOUBLEDOUBLE
DATEINT32DATE
TIMEINT32TIME_MILLIS
TIMESTAMPINT96
ARRAYLIST
MAPMAP
ROWSTRUCT
note

This page is derived from the official Apache Flink® documentation.

Refer to the Credits page for more information.