Raw
This article describes how to use the Raw format, its configuration options, and its type mappings.
Background Information
The Raw format reads and writes raw byte-based values as a single column. Support for the Raw format is built in.
Instructions
For example, suppose you have the following raw log data in Kafka and want to read and analyze it with Flink SQL.
47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] "GET /?p=1 HTTP/2.0" 200 5316 "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36" "2.75"
The following statement uses the Raw format to read from a Kafka topic as an anonymous string value encoded in UTF-8:
CREATE TABLE nginx_log (
  log STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'nginx_log',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'testGroup',
  'format' = 'raw'
);
After this statement reads the raw data as a plain string, you can use a user-defined function to split the string into multiple fields for further analysis. The following statement uses a function named my_split for this purpose:
SELECT t.hostname, t.datetime, t.url, t.browser, ...
FROM (
  SELECT my_split(log) AS t FROM nginx_log
);
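If a user-defined function such as my_split is not available, a rough alternative is to pull out individual fields with the built-in SPLIT_INDEX function. The following is a minimal sketch only: the column aliases are illustrative, and the field positions assume the space-delimited layout of the sample log line shown earlier, so they may need adjusting for other log formats.

```sql
-- Sketch only: field positions assume the space-delimited layout of the sample line.
-- SPLIT_INDEX returns NULL when the requested index does not exist.
SELECT
  SPLIT_INDEX(log, ' ', 0) AS client_ip,                       -- first token: client address
  SPLIT_INDEX(SPLIT_INDEX(log, '[', 1), ']', 0) AS log_time,   -- text between '[' and ']'
  SPLIT_INDEX(log, ' ', 6) AS request_path,                    -- token after the HTTP method
  SPLIT_INDEX(log, ' ', 8) AS http_status                      -- token after the protocol
FROM nginx_log;
```

Quoted fields that contain spaces, such as the user agent, cannot be extracted reliably this way, which is why a dedicated parsing function is usually the better choice.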
Likewise, a column of type STRING can be written to a Kafka topic as an anonymous string value encoded in UTF-8.
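The write direction can be sketched as follows. The sink table name nginx_log_sink and the topic nginx_log_copy are hypothetical names chosen for illustration; the statement assumes the nginx_log source table defined earlier.

```sql
-- Hypothetical sink table; the table and topic names are illustrative.
CREATE TABLE nginx_log_sink (
  log STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'nginx_log_copy',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'raw'
);

-- Each row is written as one Kafka record whose value is the UTF-8 bytes of log.
INSERT INTO nginx_log_sink
SELECT log FROM nginx_log;
```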
Configuration Options
Parameter | Required | Default | Type | Description |
---|---|---|---|---|
format | yes | (none) | String | The format to use. Set this parameter to raw to use the Raw format. |
raw.charset | no | UTF-8 | String | The character set used to encode text strings. |
raw.endianness | no | big-endian | String | The byte order used to encode numeric values. Valid values: big-endian (default) and little-endian. |
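As an illustration of how these options might be set, the following sketch declares two tables. The table names, topic names, and the assumption about how the upstream producers encode their data are all hypothetical.

```sql
-- Hypothetical table: each record value is assumed to be a 4-byte little-endian integer.
CREATE TABLE sensor_reading (
  reading INT
) WITH (
  'connector' = 'kafka',
  'topic' = 'sensor_reading',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'testGroup',
  'format' = 'raw',
  'raw.endianness' = 'little-endian'
);

-- Hypothetical table: each record value is assumed to be text encoded in ISO-8859-1.
CREATE TABLE legacy_log (
  log STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'legacy_log',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'testGroup',
  'format' = 'raw',
  'raw.charset' = 'ISO-8859-1'
);
```

Because the Raw format maps the whole message to a single column, each table declares exactly one column.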
Type Mapping
The Raw format supports the following Flink SQL types.
Flink SQL type | Value |
---|---|
CHAR / VARCHAR / STRING | A text string encoded in UTF-8 (the default). Note: The character set can be configured via raw.charset. |
BINARY / VARBINARY / BYTES | The sequence of bytes itself. |
BOOLEAN | A single byte that indicates the boolean value: 0 means false, 1 means true. |
TINYINT | A single byte holding the signed value. |
SMALLINT | Two bytes encoded in big-endian (the default). Note: Endianness can be configured via raw.endianness. |
INT | Four bytes encoded in big-endian (the default). Note: Endianness can be configured via raw.endianness. |
BIGINT | Eight bytes encoded in big-endian (the default). Note: Endianness can be configured via raw.endianness. |
FLOAT | Four bytes in IEEE 754 format and big-endian (default) encoding. Note: Endianness can be configured via raw.endianness. |
DOUBLE | Eight bytes in IEEE 754 format and big-endian (default) encoding. Note: Endianness can be configured via raw.endianness. |
RAW | Sequence of bytes serialized by the underlying TypeSerializer of the RAW type. |
Other Instructions for Use
The Raw format encodes a NULL value as a null byte[]. The Upsert-Kafka connector treats a null value as a tombstone message and deletes the value for the corresponding key. Therefore, if the field can contain NULL values, avoid using the Raw format as value.format with the Upsert-Kafka connector.
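One possible way to follow this recommendation is to keep the Raw format for the key only and use a different format for the value. The following is a sketch under assumptions: the table, topic, and column names are hypothetical, and JSON is chosen as the value format purely for illustration.

```sql
-- Hypothetical table: the key is written as raw UTF-8 bytes, and the value as JSON.
-- A NULL status is then serialized inside the JSON value instead of producing
-- a null record value, so it is not interpreted as a tombstone.
CREATE TABLE user_status (
  user_id STRING,
  status STRING,
  PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'user_status',
  'properties.bootstrap.servers' = 'localhost:9092',
  'key.format' = 'raw',
  'value.format' = 'json'
);
```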
This page is derived from the official Apache Flink® documentation.
Refer to the Credits page for more information.