Version: 2.5

Universal Blob Storage

Ververica Platform provides centralized configuration of blob storage for its services.

Configuration

To enable universal blob storage, configure a base URI for your blob storage provider by adding the following snippet to your Helm values.yaml file:

    vvp:
      blobStorage:
        baseUri: s3://my-bucket/vvp

The configured base URI is picked up by every service that can make use of blob storage, such as Application Manager and Artifact Management.
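
As a sketch of how these values are applied, the snippet can be installed or upgraded with Helm. The release name, namespace, and chart reference below are assumptions; adjust them to your installation:

    # Assumes the Ververica Platform chart is available as ververica/ververica-platform
    # and that the platform runs in the vvp namespace.
    helm upgrade --install vvp ververica/ververica-platform \
      --namespace vvp \
      --values values.yaml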

Storage Providers

| Storage Provider    | Scheme   | Artifact Management | State Snapshots (Flink 1.13) | State Snapshots (Flink 1.12) | State Snapshots (Flink 1.11) |
| ------------------- | -------- | ------------------- | ---------------------------- | ---------------------------- | ---------------------------- |
| AWS S3              | s3://    | ✓                   | ✓                            | ✓                            | ✓                            |
| Microsoft ABS       | wasbs:// | ✓                   | ✓                            | ✓                            | ✓                            |
| Apache Hadoop® HDFS | hdfs://  | ✓                   | ✓                            | ✓                            | ✓                            |
| Google GCS          | gs://    | ✓                   | (✓)                          | (✓)                          | (✓)                          |
| Alibaba OSS         | oss://   | ✓                   | ✗                            | ✗                            | ✗                            |

(✓): With custom Flink image

Additional Provider Configuration

Some supported storage providers have additional options that can be configured in the blobStorage section of the values.yaml file, scoped by provider.

The following is a complete listing of supported additional storage provider configuration options:


    blobStorage:
      s3:
        endpoint: ""
        region: ""
      oss:
        endpoint: ""
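
For example, an explicit S3 endpoint and region can be combined with the base URI from above; the endpoint and region values below are placeholders:

    vvp:
      blobStorage:
        baseUri: s3://my-bucket/vvp
        s3:
          # Placeholder values; point these at your S3 endpoint and region.
          endpoint: https://s3.eu-central-1.amazonaws.com
          region: eu-central-1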

Credentials

Ververica Platform supports using a single set of credentials to access your configured blob storage, and will automatically distribute these credentials to Flink jobs that require them.

These credentials can either be specified directly in values.yaml or added to a Kubernetes secret out-of-band and referenced in values.yaml by name.

Option 1: values.yaml

The following is a complete listing of the credentials that can be given for each storage provider, with example values:


    blobStorageCredentials:
      azure:
        connectionString: DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=vvpArtifacts;AccountKey=VGhpcyBpcyBub3QgYSB2YWxpZCBBQlMga2V5LiAgVGhhbmtzIGZvciB0aG9yb3VnaGx5IHJlYWRpbmcgdGhlIGRvY3MgOikgIA==;
      s3:
        accessKeyId: AKIAEXAMPLEACCESSKEY
        secretAccessKey: qyRRoU+/4d5yYzOGZVz7P9ay9fAAMrexamplesecretkey
      hdfs:
        # Hadoop configuration files (core-site.xml, hdfs-site.xml)
        # and optional Kerberos configuration files. Note that the keytab
        # has to be base64-encoded.
        core-site.xml: |
          <?xml version="1.0" ?>
          <configuration>
          ...
          </configuration>
        hdfs-site.xml: |
          <?xml version="1.0" ?>
          <configuration>
          ...
          </configuration>
        krb5.conf: |
          [libdefaults]
          ticket_lifetime = 10h
          ...
        keytab: BQIAA...AAAC
        keytab-principal: flink

Option 2: Pre-create Kubernetes Secret

To use a pre-created Kubernetes secret, its keys must match the pattern <provider>.<key>. For example, s3.accessKeyId and s3.secretAccessKey. To configure Ververica Platform to use this secret, add the following snippet to your Helm values.yaml file:


    blobStorageCredentials:
      existingSecret: my-blob-storage-credentials
Important: The values in a Kubernetes secret must be base64-encoded.
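
For illustration, a secret whose keys follow this pattern for S3 could be created with kubectl, which base64-encodes the values for you; the credentials below are placeholders:

    kubectl create secret generic my-blob-storage-credentials \
      --from-literal s3.accessKeyId=AKIAEXAMPLEACCESSKEY \
      --from-literal s3.secretAccessKey=examplesecretkey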

Example: Apache Hadoop HDFS

For UBS with Apache Hadoop® HDFS we recommend pre-creating a Kubernetes secret with the required configuration files, so that they are not duplicated in the Ververica Platform values.yaml file:

    
    kubectl create secret generic my-blob-storage-credentials \
      --from-file hdfs.core-site.xml=core-site.xml \
      --from-file hdfs.hdfs-site.xml=hdfs-site.xml \
      --from-file hdfs.krb5.conf=krb5.conf \
      --from-file hdfs.keytab=keytab \
      --from-file hdfs.keytab-principal=keytab-principal

After you have created the Kubernetes secret, you can reference it in the values.yaml as an existing secret. Note that the Kerberos configuration is optional.

Advanced Configuration

AWS EKS

When running on AWS EKS or AWS ECS, your Kubernetes Pods inherit the IAM roles attached to the underlying EC2 instances. If these roles already grant access to the required S3 resources, you only need to configure vvp.blobStorage.baseUri and can omit blobStorageCredentials entirely.
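
In that case the blob storage configuration in values.yaml stays minimal, for example:

    # No blobStorageCredentials section is needed: the instance role provides S3 access.
    vvp:
      blobStorage:
        baseUri: s3://my-bucket/vvp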

Apache Hadoop® Versions

UBS with Apache Hadoop® HDFS uses a Hadoop 2 client for communication with the HDFS cluster. Hadoop 3 preserves wire compatibility with Hadoop 2 clients, and you are able to use HDFS blob storage with both Hadoop 2 and Hadoop 3 HDFS clusters.

Note that there may be incompatibilities between Hadoop 2 and 3 with respect to the configuration files core-site.xml and hdfs-site.xml. For example, Hadoop 3 allows configuring durations with a unit suffix such as 30s, which results in a configuration parsing error with Hadoop 2 clients. It is generally possible to work around these issues by limiting the configuration to Hadoop 2-compatible keys and values, as sketched below.
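
A minimal sketch of this workaround, assuming the HDFS configuration is passed via values.yaml as in Option 1; the property shown is only an illustration, and the plain number is interpreted in the property's default unit:

    blobStorageCredentials:
      hdfs:
        hdfs-site.xml: |
          <configuration>
            <!-- Hadoop 3 would also accept "30s" here; a plain number keeps the
                 file parseable by the Hadoop 2 client used for UBS. -->
            <property>
              <name>dfs.client.datanode-restart.timeout</name>
              <value>30</value>
            </property>
          </configuration>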

When using HDFS UBS, Ververica Platform dynamically adds the Hadoop dependency flink-shaded-hadoop-2-uber to the classpath. You can use the following annotation to skip this step:


    kind: Deployment
    spec:
      template:
        metadata:
          annotations:
            ubs.hdfs.hadoop-jar-provided: true

This is useful if your Docker image provides a Hadoop dependency. If you use this annotation without a Hadoop dependency on the classpath, your Flink application will fail.

Services

The following services make use of the universal blob storage configuration.

Application Manager

Flink jobs are configured to store blobs at the following locations:

| Blob              | Storage Location                                                          |
| ----------------- | ------------------------------------------------------------------------- |
| Checkpoints       | ${baseUri}/flink-jobs/namespaces/${ns}/jobs/${jobId}/checkpoints          |
| Savepoints        | ${baseUri}/flink-savepoints/namespaces/${ns}/deployments/${deploymentId}  |
| High Availability | ${baseUri}/flink-savepoints/namespaces/${ns}/deployments/${deploymentId}  |

User-provided configuration has precedence over universal blob storage.
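
For instance, a Deployment that sets its own savepoint directory in flinkConfiguration keeps that value instead of the universal blob storage default; the bucket below is a placeholder:

    kind: Deployment
    spec:
      template:
        spec:
          flinkConfiguration:
            # Takes precedence over the ${baseUri}/flink-savepoints/... location above.
            state.savepoints.dir: s3://my-other-bucket/savepoints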

Artifact Management

Artifacts are stored in the following location:

    ${baseUri}/artifacts/namespaces/${ns}

SQL Service

The SQL Service depends on blob storage for storing deployment information and JAR files of user-defined functions.

SQL Deployments

Before a SQL query can be deployed, it needs to be optimized and translated into a Flink job. SQL Service stores the Flink job, together with all JAR files that contain implementations of user-defined functions used by the query, at the following locations:

| Blob          | Storage Location                                               |
| ------------- | -------------------------------------------------------------- |
| Job           | ${baseUri}/flink-jobs/namespaces/${ns}/jobs/${jobId}/jobgraph  |
| UDF JAR Files | ${baseUri}/flink-jobs/namespaces/${ns}/jobs/${jobId}/udfs      |

After a query has been deployed, Application Manager maintains the same blobs as for regular Flink jobs, i.e., checkpoints, savepoints, and high-availability files.

UDF Artifacts

The JAR files of UDF Artifacts that are uploaded via the UI are stored in the following location:

    ${baseUri}/sql-artifacts/namespaces/${ns}/udfs/${udfArtifact}

Connectors and Formats

The JAR files of Custom Connectors and Formats that are uploaded via the UI are stored in the following location:

    ${baseUri}/sql-artifacts/namespaces/${ns}/custom-connectors/