Configure Paimon with Azure Workload Identity

This page describes how to integrate Azure Workload Identity with Apache Paimon for secure access to Azure Blob Storage.

To enable this secure access, you must map an Azure managed identity to the specific Kubernetes service accounts used by your Flink workloads. This setup ensures your Flink SQL jobs can seamlessly interact with Paimon catalogs and tables without relying on hardcoded storage keys.

Resource Setup

Before you configure a Workload Identity, create the necessary Azure resources:

  1. Azure BYOC workspace: Ensure you have an active Azure BYOC workspace running the latest Pyxis version.
  2. Storage account: Create an Azure storage account to store your data. For instructions on creating a new storage account, see the Azure Blob Storage documentation.
  3. Managed identity: Create a user-assigned managed identity. This identity establishes trust between your BYOC cluster and the storage account. For more information, see the Azure managed identities documentation.

Establish Trust

To allow your BYOC cluster to interact with the storage account, assign the appropriate roles and configure the identity.

Assign the Storage Blob Data Contributor Role

To allow your managed identity to read and write blob data, you must create a role assignment scoped to your storage account.

  1. In the Azure portal, go to your target storage account.
  2. In the left navigation pane, click Access Control (IAM).
  3. Click Add > Add role assignment.
  4. Select the Storage Blob Data Contributor role and click Next.
  5. Under Assign access to, select Managed identity.
  6. Click Select members, choose the user-assigned managed identity you created, and click Review + assign.
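If you prefer the Azure CLI, the role assignment above can be sketched as follows. All names in angle brackets are placeholders for your own values:

```shell
# Look up the principal ID of the user-assigned managed identity.
PRINCIPAL_ID=$(az identity show \
  --resource-group <resource-group> \
  --name <managed-identity-name> \
  --query principalId -o tsv)

# Grant blob read/write access, scoped to the storage account.
az role assignment create \
  --assignee-object-id "$PRINCIPAL_ID" \
  --assignee-principal-type ServicePrincipal \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
```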

Enable OIDC and Workload Identity in AKS

Your Azure Kubernetes Service (AKS) cluster must have OIDC and Workload Identity enabled.

  1. In the Azure portal, go to your AKS cluster.
  2. In the navigation pane, click Settings > Security Configuration.
  3. Enable OIDC Issuer and Workload Identity.
  4. Retrieve the Issuer URL from the Security Configuration page. You need this URL to configure your federated credentials.
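The same settings can be applied with the Azure CLI; cluster and resource group names below are placeholders:

```shell
# Enable the OIDC issuer and workload identity on the AKS cluster.
az aks update \
  --resource-group <resource-group> \
  --name <aks-cluster-name> \
  --enable-oidc-issuer \
  --enable-workload-identity

# Retrieve the issuer URL needed for the federated credentials.
az aks show \
  --resource-group <resource-group> \
  --name <aks-cluster-name> \
  --query "oidcIssuerProfile.issuerUrl" -o tsv
```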

Configure Federated Credentials

Federated credentials allow applications inside your Kubernetes cluster to act on behalf of the managed identity. Because federated credentials bind to a specific Kubernetes namespace and service account pair, you must configure them after creating the BYOC cluster.

The Flink SQL Gateway pod (vvp-sql-0) and the JobManager (JM) / TaskManager (TM) pods run under different service accounts and potentially different namespaces. Therefore, you must configure three distinct federated credentials for your managed identity.

The required configurations are:

  • SQL Gateway: The namespace where VVC is installed and the service account used by the vvp-sql-0 pod (verify the exact service account name using kubectl).
  • JobManager: The namespace of your Flink workloads and the flink-job-sa-vvc service account.
  • TaskManager: The namespace of your Flink workloads and the vvr-task-manager service account.
important

Ensure that your Kubernetes Service Accounts or Pod templates include the required Azure Workload Identity label (typically azure.workload.identity/use: "true"). This is required for the AKS mutating webhook to inject the Azure identity token file into your pods.
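To confirm the webhook is injecting the token, you can check for labeled pods and the mounted token file. The namespace and pod name below are placeholders:

```shell
# List pods that carry the workload identity label in the Flink namespace.
kubectl get pods -n <namespace> -l azure.workload.identity/use=true

# Spot-check one pod: the webhook should have mounted the token file here.
kubectl exec -n <namespace> <pod-name> -- \
  ls /var/run/secrets/azure/tokens/azure-identity-token
```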

To create the federated credentials:

  1. In the Azure portal, go to your managed identity.
  2. In the navigation pane, click Settings > Federated credentials.
  3. Click Add credential.
  4. In the Federated credential scenario drop-down list, select Kubernetes accessing Azure resources.
  5. Enter the Cluster issuer URL you retrieved from your AKS cluster.
  6. Enter the appropriate Namespace and Service account. Use the Azure CLI (az aks get-credentials) to connect to your cluster, then run kubectl get namespaces or kubectl get serviceaccounts -n <namespace> to find the exact namespace and service account names.
  7. Repeat these steps to create all three required federated credentials.
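The same credentials can also be created with the Azure CLI. The subject must be the system:serviceaccount:<namespace>:<service-account> pair for each workload; for example, for the JobManager (all names in angle brackets are placeholders, and the credential name is arbitrary):

```shell
# Create a federated credential binding the JobManager service account
# to the managed identity. Repeat with the SQL Gateway and TaskManager
# subjects to create all three credentials.
az identity federated-credential create \
  --name <credential-name> \
  --identity-name <managed-identity-name> \
  --resource-group <resource-group> \
  --issuer <cluster-issuer-url> \
  --subject "system:serviceaccount:<namespace>:flink-job-sa-vvc" \
  --audiences api://AzureADTokenExchange
```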

Configure the Paimon Catalog

With trust established, you can set up an Apache Paimon catalog to read and write data. For more information about working with Paimon catalogs, see the Apache Paimon Catalog documentation.

Retrieve Managed Identity Details

You need the tenant ID and client ID for the DDL statement.

  1. In the Azure portal, go to your managed identity.
  2. On the overview page, click JSON View.
  3. Locate and record the clientId and tenantId.
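These values can also be read with the Azure CLI:

```shell
# Print the client ID and tenant ID needed for the catalog DDL.
az identity show \
  --resource-group <resource-group> \
  --name <managed-identity-name> \
  --query "{clientId: clientId, tenantId: tenantId}" -o json
```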

Create the Catalog

Run the following DDL statement in your Flink environment to create the catalog. Replace <container>, <storage-account>, <your-tenant-id>, and <your-client-id> with your specific values.

CREATE CATALOG `my_awesome_paimon_catalog` WITH (
  'type' = 'paimon',
  'warehouse' = 'abfs://<container>@<storage-account>.dfs.core.windows.net/path/to/paimon',
  'fs.azure.account.auth.type' = 'OAuth',
  'fs.azure.account.oauth.provider.type' = 'org.apache.hadoop.fs.azurebfs.oauth2.WorkloadIdentityTokenProvider',
  'fs.azure.account.oauth2.msi.tenant' = '<your-tenant-id>',
  'fs.azure.account.oauth2.client.id' = '<your-client-id>',
  'fs.azure.account.oauth2.token.file' = '/var/run/secrets/azure/tokens/azure-identity-token'
);
note

You must set fs.azure.account.oauth2.token.file to /var/run/secrets/azure/tokens/azure-identity-token. This is the path where the AKS workload identity webhook mounts the projected service account token.

Verify the Configuration

You can run SQL scripts against the catalog to verify that it can successfully read and write data in your storage account.

1. Create a Database and Table

Run the following SQL statements from the Flink SQL Gateway (vvp-sql-0) pod:

-- Create the database
USE CATALOG `my_awesome_paimon_catalog`;
CREATE DATABASE IF NOT EXISTS `demo_db`;

-- Create the sink table
USE CATALOG `my_awesome_paimon_catalog`;
USE `demo_db`;
CREATE TABLE IF NOT EXISTS `paimon_sink` (
  dt STRING,
  id BIGINT,
  data STRING,
  PRIMARY KEY (dt, id) NOT ENFORCED
) PARTITIONED BY (dt)
WITH (
  'bucket' = '4'
);

2. Run a Bounded Job to Write Data

Use the Datagen connector to generate dummy data and insert it into the Paimon table:

-- Flink bounded job writing data using Paimon connector
USE CATALOG `my_awesome_paimon_catalog`;
USE `demo_db`;
CREATE TEMPORARY TABLE `gen` (
  id BIGINT,
  data STRING,
  dt AS DATE_FORMAT(LOCALTIMESTAMP, 'yyyy-MM-dd')
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '1000',
  'number-of-rows' = '10000',
  'fields.id.kind' = 'sequence',
  'fields.id.start' = '1',
  'fields.id.end' = '10000',
  'fields.data.length' = '16'
);

INSERT INTO `paimon_sink`
SELECT dt, id, data
FROM `gen`;

Verify that the files appear in your Azure Blob Storage container.
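One way to check, assuming you can authenticate with the Azure CLI (account and container names are placeholders):

```shell
# List the Paimon warehouse files written by the job.
az storage blob list \
  --account-name <storage-account> \
  --container-name <container> \
  --prefix path/to/paimon \
  --auth-mode login \
  --query "[].name" -o table
```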

3. Run a Streaming Job to Read Data

Read the data back from the Paimon table and output it using the Print connector:

-- Flink streaming job reading back data using Paimon connector
USE CATALOG `my_awesome_paimon_catalog`;
USE `demo_db`;
CREATE TEMPORARY TABLE `print_sink` (
  dt STRING,
  id BIGINT,
  data STRING
) WITH (
  'connector' = 'print'
);

INSERT INTO `print_sink`
SELECT dt, id, data
FROM `paimon_sink`;

Monitor the task manager logs to confirm the data is successfully read and printed.
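For example, with kubectl (namespace and pod name are placeholders). The Print connector emits insert records that appear as +I[...] lines in the TaskManager output:

```shell
# Follow the TaskManager logs and filter for printed insert records.
kubectl logs -n <namespace> <taskmanager-pod-name> -f | grep '+I'
```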