Raw

This article describes how to use the Raw format, its configuration options, and its type mapping.

Background information

The Raw format reads and writes raw byte-based values as a single column. The Raw format is built in.

Instructions

Suppose you have the following log data in raw format in Kafka and want to read and analyze it using Flink SQL.

47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] "GET /?p=1 HTTP/2.0" 200 5316 "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36" "2.75"

The following statement uses the Raw format to read from a Kafka topic as an anonymous string value encoded in UTF-8:

    CREATE TABLE nginx_log (
      log STRING
    ) WITH (
      'connector' = 'kafka',
      'topic' = 'nginx_log',
      'properties.bootstrap.servers' = 'localhost:9092',
      'properties.group.id' = 'testGroup',
      'format' = 'raw'
    );

After the statement above reads each record into a plain string, you can use a user-defined function to split the string into multiple fields for further analysis, such as the my_split function in the following SQL statement.

    SELECT t.hostname, t.datetime, t.url, t.browser, ...
    FROM (
      SELECT my_split(log) AS t FROM nginx_log
    );
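
The my_split function is user-defined and its implementation is not shown here. As a rough illustration of the parsing such a function might perform, the following Python sketch extracts the four fields named in the query with a regular expression. The pattern and the sample line are assumptions made for illustration; an actual Flink job would register equivalent logic as a UDF.

```python
import re

# Assumed pattern for the sample log line; adjust it to the real log layout.
LOG_PATTERN = re.compile(
    r'(?P<hostname>\S+) \S+ \S+ '                 # client address, two ignored fields
    r'\[(?P<datetime>[^\]]+)\] '                  # timestamp between square brackets
    r'"\S+ (?P<url>\S+) [^"]+" '                  # request line: method, URL, protocol
    r'\d{3} \d+\s*'                               # status code and response size
    r'"(?P<browser>[^"]*)"'                       # user agent string
)

SAMPLE_LOG = (
    '47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] "GET /?p=1 HTTP/2.0" 200 5316 '
    '"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/72.0.3626.119 Safari/537.36" "2.75"'
)


def my_split(log: str) -> dict:
    """Return the parsed fields, or an empty dict if the line does not match."""
    match = LOG_PATTERN.match(log)
    if match is None:
        return {}
    return {name: match.group(name) for name in ("hostname", "datetime", "url", "browser")}


fields = my_split(SAMPLE_LOG)
print(fields["hostname"], fields["datetime"], fields["url"])
```

Returning an empty result for lines that do not match keeps malformed records from failing the whole query; whether to drop or flag such records is a design choice for the real function.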

Likewise, a column of type STRING can be written to a Kafka topic as an anonymous string value encoded in UTF-8.

Configuration options

| Parameter | Required | Default | Type | Description |
| --- | --- | --- | --- | --- |
| format | yes | (none) | String | The format to use. Set this parameter to raw to use the Raw format. |
| raw.charset | no | UTF-8 | String | The character set used to encode text strings. |
| raw.endianness | no | big-endian | String | The byte order used to encode numeric values. Valid values: big-endian (default) and little-endian. |
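
To make the effect of the two optional parameters concrete, the following Python sketch shows the bytes a producer would need to write for a string and for a 4-byte integer under each setting. The helper names encode_string and encode_int are hypothetical, and Python's struct module stands in for the format's own serializer; this is not Flink's implementation.

```python
import struct


def encode_string(text: str, charset: str = "UTF-8") -> bytes:
    """Encode text the way raw.charset describes: plain bytes in the given charset."""
    return text.encode(charset)


def encode_int(value: int, endianness: str = "big-endian") -> bytes:
    """Encode a 4-byte signed integer in the byte order named by raw.endianness."""
    if endianness not in ("big-endian", "little-endian"):
        raise ValueError(f"unsupported endianness: {endianness}")
    prefix = ">" if endianness == "big-endian" else "<"
    return struct.pack(prefix + "i", value)


# The same character occupies different bytes in different character sets.
print(encode_string("é"), encode_string("é", "ISO-8859-1"))

# The same integer has its bytes reversed between the two byte orders.
print(encode_int(1), encode_int(1, "little-endian"))
```

If the producer and the table definition disagree on either setting, the values still decode without error but come out wrong, so both sides should be configured together.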

Type mapping

The Raw format supports the following Flink SQL types.

| Flink SQL type | Value |
| --- | --- |
| CHAR / VARCHAR / STRING | A UTF-8 (default) encoded text string. Note: The character set can be configured via raw.charset. |
| BINARY / VARBINARY / BYTES | The sequence of bytes itself. |
| BOOLEAN | A single byte that represents the boolean value. |
| TINYINT | A single byte that holds the signed value. |
| SMALLINT | Two bytes with big-endian (default) encoding. Note: Endianness can be configured via raw.endianness. |
| INT | Four bytes with big-endian (default) encoding. Note: Endianness can be configured via raw.endianness. |
| BIGINT | Eight bytes with big-endian (default) encoding. Note: Endianness can be configured via raw.endianness. |
| FLOAT | Four bytes in IEEE 754 format with big-endian (default) encoding. Note: Endianness can be configured via raw.endianness. |
| DOUBLE | Eight bytes in IEEE 754 format with big-endian (default) encoding. Note: Endianness can be configured via raw.endianness. |
| RAW | The sequence of bytes serialized by the underlying TypeSerializer of the RAW type. |
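
The fixed widths listed for the numeric types can be checked with a small decoder. The sketch below maps each numeric type to the corresponding Python struct code and decodes a payload of exactly that width. The STRUCT_CODES table and the decode_raw helper are hypothetical names used for illustration, not part of Flink.

```python
import struct

# struct codes whose standard sizes match the widths in the type mapping.
STRUCT_CODES = {
    "TINYINT": "b",   # 1 byte, signed
    "SMALLINT": "h",  # 2 bytes, signed
    "INT": "i",       # 4 bytes, signed
    "BIGINT": "q",    # 8 bytes, signed
    "FLOAT": "f",     # 4 bytes, IEEE 754
    "DOUBLE": "d",    # 8 bytes, IEEE 754
}


def decode_raw(sql_type: str, payload: bytes, endianness: str = "big-endian"):
    """Decode one numeric value; raises struct.error if the payload width is wrong."""
    if endianness not in ("big-endian", "little-endian"):
        raise ValueError(f"unsupported endianness: {endianness}")
    prefix = ">" if endianness == "big-endian" else "<"
    (value,) = struct.unpack(prefix + STRUCT_CODES[sql_type], payload)
    return value


# The same four bytes yield different INT values depending on the byte order.
payload = b"\x00\x00\x01\x00"
print(decode_raw("INT", payload), decode_raw("INT", payload, "little-endian"))
```

Because each numeric type has a fixed width, a payload of any other length cannot be a valid value of that type, which is why the sketch lets struct reject it instead of padding or truncating.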

Other instructions for use

The Raw format encodes a NULL value as a null byte[]. The Upsert-Kafka connector treats a null value as a tombstone message and deletes the value stored for that key. Therefore, if the field can contain NULL values, avoid using the Raw format as the value.format of an Upsert-Kafka table.
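
The interaction can be pictured with a simplified model of upsert semantics, in which a dictionary stands in for the keyed state and None stands in for a null message value. The apply_upsert helper is hypothetical and is not the connector's implementation; it only shows why a NULL field turns into a delete.

```python
def apply_upsert(state: dict, key: str, value):
    """Apply one keyed record: None is a tombstone, anything else is an upsert."""
    if value is None:
        # A null value carries no payload, so it is read as "delete this key".
        state.pop(key, None)
    else:
        state[key] = value
    return state


state = {}
apply_upsert(state, "user-1", b"first value")   # insert
apply_upsert(state, "user-1", b"second value")  # update
apply_upsert(state, "user-1", None)             # NULL field encoded as a null value
print(state)
```

In this model a record whose only field is NULL is indistinguishable from an intentional delete, which is the ambiguity the recommendation above avoids.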

note

This page is derived from the official Apache Flink® documentation.

Refer to the Credits page for more information.