PyFlink development with Apache Flink®
PyFlink is a Python API for Apache Flink®. It provides Python bindings for a subset of the Flink API, so you can write Python code that uses Flink functionality and can be executed on a Flink cluster. Your Python code is executed as a PyFlink job, which runs just like any other Flink job.
Prerequisites
On Ververica Platform 2.10, follow the Getting Started guide to deploy a platform instance and set up the example playground. This will configure a local working environment that includes Python and PyFlink where you can test the examples in this guide.
If you are running an older version of Ververica Platform (2.8 or 2.9) you may need to build a custom Docker image to include Python and PyFlink, depending on your platform/Flink combination. To do so, follow the steps in Building a Platform Image, below.
If you are just getting started with Ververica Platform and its core concepts, then this short video is a good overview of basic deployment and job management.
Concepts
Using PyFlink, you can write Python code that interoperates with core Flink functionality.
Jobs
Jobs are the high-level executable resource abstraction in Flink. At a lower level, pipelines execute sequences of highly parallelised tasks (for example, database queries, map functions, and reduce functions) across source data streams. Jobs bundle tasks with input streams and manage task execution and state, via checkpoints and savepoints.
PyFlink jobs are Flink jobs you create from Python code using PyFlink.
PyFlink UDFs
UDFs are User Defined Functions. Most SQL queries you write use functions to perform logic inside the query, and Flink includes a rich set of built-in functions. When you need to extend these with custom logic, you define and register UDFs.
Using PyFlink, you can write your custom functions in Python.
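As a minimal sketch of what this looks like, the example below defines a plain Python function and wraps it as a PyFlink scalar UDF. The function `to_celsius` is a hypothetical example, not part of this guide's playground; the `pyflink` import is guarded so the plain function remains usable even where PyFlink is not installed.

```python
def to_celsius(fahrenheit: float) -> float:
    """Plain Python logic that the UDF will wrap: convert Fahrenheit to Celsius."""
    return (fahrenheit - 32.0) * 5.0 / 9.0

try:
    # Wrapping the function as a Flink scalar UDF requires pyflink.
    from pyflink.table import DataTypes
    from pyflink.table.udf import udf

    # The wrapped function can be registered with a TableEnvironment
    # and then called from SQL or the Table API.
    to_celsius_udf = udf(to_celsius, result_type=DataTypes.DOUBLE())
except ImportError:
    to_celsius_udf = None  # pyflink not available in this environment
```

The underlying function stays ordinary Python, so you can unit-test it without a Flink cluster and only wrap it with `udf(...)` when deploying.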
Table API
The Flink Table API provides a unified, SQL-like API for Java applications that abstracts the differences between kinds of data source (for example, stream versus batch sources) behind common semantics.
PyFlink allows you to call Python UDFs from Java Table API Flink applications.
Getting Started
In this tutorial we take you through the steps to:
- Write and register a custom Python UDF.
- Configure and run the custom UDF in a PyFlink Job from Ververica Platform.
- Call your Python UDF from a Java Table API application.
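The first two steps above can be sketched as a single PyFlink job script. This is an illustrative deployment sketch, not the tutorial's actual code: the `shout` UDF and the `words` table are hypothetical, and running it requires a Flink runtime (locally or as a deployment on Ververica Platform).

```python
from pyflink.table import DataTypes, EnvironmentSettings, TableEnvironment
from pyflink.table.udf import udf


# Step 1: write a custom Python UDF (hypothetical example).
@udf(result_type=DataTypes.STRING())
def shout(s):
    return s.upper()


# Step 2: register the UDF and use it in a PyFlink job.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t_env.create_temporary_function("shout", shout)

# Hypothetical source table backed by Flink's built-in datagen connector.
t_env.execute_sql("""
    CREATE TABLE words (
        word STRING
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '1'
    )
""")

# Call the registered Python UDF from SQL.
t_env.execute_sql("SELECT shout(word) AS shouted FROM words").print()
```

Once a UDF is registered under a name like `shout`, it can be referenced by that name from SQL in a Java Table API application as well, which is the third step of the tutorial.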
Ververica Platform v2.10 already includes Python and PyFlink. For earlier versions of the platform, or for any version running a Flink version earlier than 1.15.3 (e.g. Flink 1.13/1.14 with platform versions 2.8/2.9), first follow the steps in Building a Platform Image to build and push a custom platform image. Once Python and PyFlink are included in your image, you can follow this tutorial.