Universal Blob Storage
Ververica Platform provides centralized configuration of blob storage for its services.
Configuration
In order to enable universal blob storage, configure a base URI for your blob storage provider. Add the following snippet to your Helm values.yaml file:
vvp:
  blobStorage:
    baseUri: s3://my-bucket/vvp
The provided base URI will be picked up by all services that can make use of blob storage, for example Application Manager or Artifact Management.
Storage Providers
Storage Provider | Scheme | Artifact Management | Flink 1.19 | Flink 1.18 | Flink 1.17 | Flink 1.16 | Flink 1.15 | Flink 1.14 | Flink 1.13 | Flink 1.12 |
---|---|---|---|---|---|---|---|---|---|---|
AWS S3 | s3:// | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Microsoft ABS | wasbs:// | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Apache Hadoop® HDFS | hdfs:// | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Google GCS | gs:// | ✓ | (✓) | (✓) | (✓) | (✓) | (✓) | (✓) | ✓ | ✓ |
Alibaba OSS | oss:// | ✓ | x | x | x | x | x | x | x | x |
Microsoft ABS Workload Identity | wiaz:// | ✓ | ✓* | ✓* | ✓* | ✓* | ✓* | x | x | x |
(✓): With custom Flink image
* : With VVP Flink image
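For reference, base URIs for the other providers follow the same pattern; the bucket, container, account, and host names below are placeholders:
baseUri: s3://my-bucket/vvp                                          # AWS S3
baseUri: wasbs://my-container@myaccount.blob.core.windows.net/vvp    # Microsoft ABS
baseUri: hdfs://namenode:8020/vvp                                    # Apache Hadoop® HDFS
baseUri: gs://my-bucket/vvp                                          # Google GCS
baseUri: oss://my-bucket/vvp                                         # Alibaba OSS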
Additional Provider Configuration
Some supported storage providers have additional options that can be configured in the blobStorage section of the values.yaml file, scoped by provider. The following options are configurable:
blobStorage:
  s3:
    endpoint: "<if_applicable>"
    region: "<if_applicable>"
  oss:
    endpoint: "<if_applicable>"
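As an illustration, an S3-compatible object store behind a custom endpoint could be configured as follows, assuming the provider options sit next to the baseUri in the blobStorage section shown above; the endpoint and region are placeholders:
vvp:
  blobStorage:
    baseUri: s3://my-bucket/vvp
    s3:
      # address of the S3-compatible service (placeholder)
      endpoint: https://s3.example.com:9000
      region: us-east-1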
Microsoft ABS Workload Identity
For Microsoft ABS Workload Identity, add the following snippet to your Helm values.yaml file:
vvp:
  blobStorage:
    baseUri: wiaz://<blob-container-name>@<your account name>.blob.core.windows.net/<path>
You do not need to provide any credentials to set up access to Azure Blob Storage with Microsoft ABS Workload Identity; you only need to provide your Azure client ID and, optionally, the tenant ID:
workloadIdentity:
  azure:
    clientId: xxxx-xxxx-xxxx-xxxx
    # tenantId: yyyy-yyyy-yyyy-yyyy (optional)
If you want to run Flink jobs in a namespace other than VVP itself (the recommended way), you need to create a Kubernetes service account in that namespace and a federated identity for your Azure principal yourself.
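A minimal sketch of such a service account, assuming Azure Workload Identity is installed in the cluster; the names are placeholders:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: flink-jobs-sa        # hypothetical service account name
  namespace: flink-jobs      # hypothetical namespace in which your Deployments run
  annotations:
    # client ID of your Azure identity, the same value as clientId above
    azure.workload.identity/client-id: xxxx-xxxx-xxxx-xxxx
The federated identity credential on the Azure side must then use system:serviceaccount:<namespace>:<service-account-name> as its subject.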
Before you can run a Deployment, you must assign the service account name to the pods:
spec:
  template:
    spec:
      kubernetes:
        pods:
          labels:
            azure.workload.identity/use: 'true'
          serviceAccountName: ververica-platform-ververica-platform
Alternatively, you can configure the taskManager and jobManager independently, for example:
spec:
  template:
    spec:
      kubernetes:
        jobManagerPodTemplate:
          metadata:
            labels:
              azure.workload.identity/use: 'true'
          spec:
            serviceAccountName: ververica-platform-ververica-platform
        taskManagerPodTemplate:
          metadata:
            labels:
              azure.workload.identity/use: 'true'
          spec:
            serviceAccountName: ververica-platform-ververica-platform
You cannot mix configuration methods: either specify the pods attribute or the jobManagerPodTemplate and taskManagerPodTemplate.
If you have created your own namespace and a related service account dedicated to deployments, replace serviceAccountName: ververica-platform-ververica-platform with your service account name: serviceAccountName: <deployment-related-service-account-name>.
Credentials
Ververica Platform supports using a single set of credentials to access your configured blob storage, and will automatically distribute these credentials to Flink jobs that require them.
These credentials can either be specified directly in values.yaml, or added to a Kubernetes secret out-of-band and referenced in values.yaml by name.
Option 1: values.yaml
The following options are configurable; example values are shown:
blobStorageCredentials:
  azure:
    connectionString: DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=vvpArtifacts;AccountKey=VGhpcyBpcyBub3QgYSB2YWxpZCBBQlMga2V5LiAgVGhhbmtzIGZvciB0aG9yb3VnaGx5IHJlYWRpbmcgdGhlIGRvY3MgOikgIA==;
  s3:
    accessKeyId: AKIAEXAMPLEACCESSKEY
    secretAccessKey: qyRRoU+/4d5yYzOGZVz7P9ay9fAAMrexamplesecretkey
  hdfs:
    # Apache Hadoop® configuration files (core-site.xml, hdfs-site.xml)
    # and optional Kerberos configuration files. Note that the keytab
    # has to be base64 encoded.
    core-site.xml: |
      <?xml version="1.0" ?>
      <configuration>
        ...
      </configuration>
    hdfs-site.xml: |
      <?xml version="1.0" ?>
      <configuration>
        ...
      </configuration>
    krb5.conf: |
      [libdefaults]
      ticket_lifetime = 10h
      ...
    keytab: BQIAA...AAAC
    keytab-principal: flink
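For reference, the keytab value can be produced on the command line before pasting it into values.yaml, for example with GNU coreutils (the file name is illustrative):
# print the keytab as a single line of base64
base64 -w0 flink.keytab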
Option 2: Pre-create Kubernetes Secret
To use a pre-created Kubernetes secret, its keys must match the pattern <provider>.<key>, for example s3.accessKeyId and s3.secretAccessKey. To configure Ververica Platform to use this secret, add the following snippet to your Helm values.yaml file:
blobStorageCredentials:
  existingSecret: my-blob-storage-credentials
The values in a Kubernetes secret must be base64-encoded.
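For example, a matching secret for S3 credentials can be created with kubectl, which base64-encodes the values automatically (the access key and secret are the placeholder values from above):
kubectl create secret generic my-blob-storage-credentials \
  --from-literal=s3.accessKeyId=AKIAEXAMPLEACCESSKEY \
  --from-literal=s3.secretAccessKey=qyRRoU+/4d5yYzOGZVz7P9ay9fAAMrexamplesecretkey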
Example: Apache Hadoop® HDFS
For UBS with Apache Hadoop® HDFS, we recommend pre-creating a Kubernetes secret with the required configuration files to avoid duplicating them in the Ververica Platform values.yaml file.
kubectl create secret generic my-blob-storage-credentials \
--from-file hdfs.core-site.xml=core-site.xml \
--from-file hdfs.hdfs-site.xml=hdfs-site.xml \
--from-file hdfs.krb5.conf=krb5.conf \
--from-file hdfs.keytab=keytab \
--from-file hdfs.keytab-principal=keytab-principal
After you have created the Kubernetes secret, you can reference it in the values.yaml
as an existing secret. Note that the Kerberos configuration is optional.
Advanced Configuration
AWS EKS
When running on AWS EKS or AWS ECS, your Kubernetes pods inherit the roles attached to the underlying EC2 instances. If these roles already grant access to the required S3 resources, you only need to configure vvp.blobStorage.baseUri without configuring any blobStorageCredentials.
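In that case, a minimal values.yaml (the bucket name is illustrative) only sets the base URI and omits the blobStorageCredentials section entirely:
vvp:
  blobStorage:
    baseUri: s3://my-bucket/vvp
    # no blobStorageCredentials needed; the instance role provides S3 access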
Apache Hadoop® Versions
UBS with Apache Hadoop® HDFS uses a Hadoop 2 client for communication with the HDFS cluster. Hadoop 3 preserves wire compatibility with Hadoop 2 clients, so you can use HDFS blob storage with both Hadoop 2 and Hadoop 3 HDFS clusters.
However, note that there may be incompatibilities between Hadoop 2 and 3 with respect to the configuration files core-site.xml and hdfs-site.xml. For example, Hadoop 3 allows configuring durations with a unit suffix such as 30s, which results in a configuration parsing error with Hadoop 2 clients. It is generally possible to work around these issues by limiting configuration to Hadoop 2-compatible keys and values.
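As an illustration, a duration that Hadoop 3 would accept with a unit suffix should be written in plain milliseconds so the Hadoop 2 client can parse it (the property name is hypothetical):
<property>
  <name>dfs.example.client.timeout</name>
  <!-- "30s" would be accepted by Hadoop 3 but fails to parse with a Hadoop 2 client -->
  <value>30000</value>
</property>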
Apache Flink® Hadoop Dependency
When using HDFS UBS, Ververica Platform dynamically adds the Hadoop dependency flink-shaded-hadoop-2-uber to the classpath. You can use the following annotation to skip this step:
kind: Deployment
spec:
  template:
    metadata:
      annotations:
        ubs.hdfs.hadoop-jar-provided: true
This is useful if your Docker image provides a Hadoop dependency. If you use this annotation without a Hadoop dependency on the classpath, your Flink application will fail.
Services
The following services make use of the universal blob storage configuration.
Apache Flink® Jobs
Flink jobs are configured to store blobs at the following locations:
Blob | Storage Location |
---|---|
Checkpoints | ${baseUri}/flink-jobs/namespaces/${ns}/jobs/${jobId}/checkpoints |
Savepoints | ${baseUri}/flink-savepoints/namespaces/${ns}/deployments/${deploymentId} |
High Availability | ${baseUri}/flink-savepoints/namespaces/${ns}/deployments/${deploymentId} |
User-provided configuration takes precedence over the universal blob storage defaults.
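For example, a Deployment that sets its own checkpoint location in flinkConfiguration keeps that location instead of the universal blob storage default (the bucket name is illustrative; state.checkpoints.dir is the standard Flink option):
kind: Deployment
spec:
  template:
    spec:
      flinkConfiguration:
        state.checkpoints.dir: s3://my-other-bucket/checkpoints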
Artifact Management
Artifacts are stored in the following location:
${baseUri}/artifacts/namespaces/${ns}
SQL Service
The SQL Service depends on blob storage for storing deployment information and JAR files of user-defined functions.
SQL Deployments
Before a SQL query can be deployed, it needs to be optimized and translated into a Flink job. SQL Service stores the Flink job and all JAR files containing implementations of user-defined functions used by the query at the following locations:
Blob | Storage Location |
---|---|
Job | ${baseUri}/flink-jobs/namespaces/${ns}/jobs/${jobId}/jobgraph |
UDF JAR Files | ${baseUri}/flink-jobs/namespaces/${ns}/jobs/${jobId}/udfs |
After a query has been deployed, Application Manager maintains the same blobs as for regular Flink jobs, i.e., checkpoints, savepoints, and high-availability files.
UDF Artifacts
The JAR files of UDF Artifacts that are uploaded via the UI are stored in the following location:
${baseUri}/sql-artifacts/namespaces/${ns}/udfs/${udfArtifact}
Connectors, Formats, and Catalogs
The JAR files of Custom Connectors and Formats and Custom Catalogs that are uploaded via the UI are stored in the following location:
${baseUri}/sql-artifacts/namespaces/${ns}/custom-connectors/