Application Lifecycle Management

Ververica Platform continuously runs a reconciliation loop to detect deviations between the desired and the actual state of your Deployments. Such a deviation may be caused by an update of the Deployment resource by the user (e.g. to perform an upgrade) or by a change in the status of the running Deployment. Ververica Platform creates a physical Flink cluster for each Flink job on Kubernetes. Consequently, to align the actual state of a Deployment with its desired state, Ververica Platform usually needs to tear down the existing, running Flink job and set up a new Flink job.

In order to flexibly manage the lifecycle of your applications, you can control the behaviour of Ververica Platform during setup and tear down of a Flink job.

The following sections describe the different Restore and Upgrade Strategies in detail.

Restore Strategy

Ververica Platform will use the restore strategy whenever a Deployment needs to transition to the “RUNNING” state (configuration key spec.restoreStrategy).

The following options are supported:

  • LATEST_STATE - Use the latest successful checkpoint (requires a High Availability setup) or savepoint known to Ververica Platform.
  • LATEST_SAVEPOINT - Use the latest successful savepoint known to Ververica Platform. It may have been previously triggered by a user request or by Ververica Platform (for example during a suspension or a stateful upgrade).
  • NONE - Do not start from any checkpoint or savepoint.
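The selection rules above can be sketched in Java. This is only an illustrative model, not Ververica Platform code: the class RestoreSelectionSketch, the Snapshot type and the select method are hypothetical names, and the sketch assumes that "latest" means the more recently completed of the checkpoint and the savepoint.

```java
import java.time.Instant;
import java.util.Optional;

/** Illustrative model only; none of these types belong to the Ververica Platform API. */
final class RestoreSelectionSketch {

    enum RestoreStrategy { LATEST_STATE, LATEST_SAVEPOINT, NONE }

    /** A completed snapshot with a (hypothetical) location and completion time. */
    static final class Snapshot {
        final String location;
        final Instant completedAt;

        Snapshot(String location, Instant completedAt) {
            this.location = location;
            this.completedAt = completedAt;
        }
    }

    /**
     * Picks the snapshot a new job would start from.
     * An empty result means the job starts with empty state.
     */
    static Optional<Snapshot> select(RestoreStrategy strategy,
                                     Optional<Snapshot> latestCheckpoint,
                                     Optional<Snapshot> latestSavepoint) {
        switch (strategy) {
            case NONE:
                return Optional.empty();
            case LATEST_SAVEPOINT:
                return latestSavepoint;
            case LATEST_STATE:
                // With no known checkpoint this reduces to LATEST_SAVEPOINT.
                if (!latestCheckpoint.isPresent()) {
                    return latestSavepoint;
                }
                if (!latestSavepoint.isPresent()) {
                    return latestCheckpoint;
                }
                // Assumption: prefer whichever snapshot completed more recently.
                boolean checkpointIsNewer = latestCheckpoint.get().completedAt
                        .isAfter(latestSavepoint.get().completedAt);
                return checkpointIsNewer ? latestCheckpoint : latestSavepoint;
            default:
                throw new IllegalArgumentException("Unknown strategy: " + strategy);
        }
    }

    public static void main(String[] args) {
        Optional<Snapshot> checkpoint =
                Optional.of(new Snapshot("checkpoint-7", Instant.ofEpochSecond(200)));
        Optional<Snapshot> savepoint =
                Optional.of(new Snapshot("savepoint-3", Instant.ofEpochSecond(100)));

        for (RestoreStrategy strategy : RestoreStrategy.values()) {
            String chosen = select(strategy, checkpoint, savepoint)
                    .map(s -> s.location)
                    .orElse("(empty state)");
            System.out.println(strategy + " -> " + chosen);
        }
    }
}
```

Modelling the inputs as Optional values makes the fallback visible: when no checkpoint is known to the platform, LATEST_STATE and LATEST_SAVEPOINT select the same snapshot.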

Note

The stateful upgrade strategy as described in Upgrade Strategy only works in conjunction with spec.restoreStrategy set to LATEST_SAVEPOINT or LATEST_STATE. If you instead set restoreStrategy to NONE, your job might unexpectedly start from an empty state after it is upgraded.

Requirements for LATEST_STATE Restore Strategy

To use the “LATEST_STATE” restore strategy, you need to set up Flink Master Failover with ZooKeeper and configure checkpoints to be retained on cancellation in the Flink job:

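import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Keep externalized checkpoints when the job is cancelled so that they can be restored from later.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.getCheckpointConfig().enableExternalizedCheckpoints(
  CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION
);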

This lets your Flink application resume from the latest checkpointed state available in ZooKeeper instead of solely relying on savepoints.

Without proper setup, the effect would be the same as just setting the LATEST_SAVEPOINT restore strategy.

Note

Deployment upgrades still respect the configured upgrade strategy and are independent of the configured restore strategy. A stateful upgrade will still trigger a savepoint.

Allowing non-restored state

A savepoint may not be fully compatible with a job because it contains state for a Flink task that is not present in the new job. To restore from such a savepoint, you can parameterize the restore strategies LATEST_STATE and LATEST_SAVEPOINT with the allowNonRestoredState flag:

kind: Deployment
spec:
  restoreStrategy:
    kind: LATEST_SAVEPOINT
    allowNonRestoredState: true

Upgrade Strategy

Ververica Platform will automatically orchestrate upgrades of running Jobs (if present) depending on the configured upgrade strategy. Currently, there are three supported upgrade strategies:

  • STATELESS: Ververica Platform will terminate any currently running Flink job without taking a savepoint and then will start the new job.
  • STATEFUL: Ververica Platform will first save the state of the currently running Flink job by performing a savepoint, then terminate the currently running job, and finally will start a new job by using the savepoint taken prior to termination. To use this strategy please read the section titled Restore Strategy.
  • NONE: Ververica Platform will not perform an automatic upgrade of a running Flink job. If this strategy is selected, the user is expected to manually cancel or suspend the currently running job (by setting spec.state to cancelled), and start a new job manually (by setting spec.state back to running).

As noted earlier, Ververica Platform will eventually achieve the desired state using the specified upgrade strategy. Upgrade strategies are executed in a fault-tolerant way. For example, if triggering a savepoint fails during a stateful upgrade, Ververica Platform will retry until a savepoint succeeds or maxSavepointCreationAttempts is exhausted. The Flink job will not be terminated before a savepoint succeeds.
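The ordering guarantee of a stateful upgrade can be illustrated with a small Java sketch. It is a simplified model, not the platform's implementation: StatefulUpgradeSketch and its methods are hypothetical, the savepoint trigger is stubbed with a Supplier, and the maxAttempts parameter stands in for maxSavepointCreationAttempts. What the platform does once the attempts are exhausted is not covered here, so the sketch simply reports failure and leaves the old job untouched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Illustrative model only; not part of the Ververica Platform API. */
final class StatefulUpgradeSketch {

    /** Triggers a savepoint up to maxAttempts times and returns its location on success. */
    static Optional<String> savepointWithRetry(Supplier<Optional<String>> triggerSavepoint,
                                               int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Optional<String> location = triggerSavepoint.get();
            if (location.isPresent()) {
                return location;
            }
        }
        return Optional.empty();
    }

    /**
     * Savepoint first, then terminate, then start the new job from that savepoint.
     * Returns false without touching the old job if no savepoint could be taken.
     */
    static boolean upgrade(Supplier<Optional<String>> triggerSavepoint,
                           int maxAttempts,
                           Runnable terminateOldJob,
                           Consumer<String> startNewJob) {
        Optional<String> savepoint = savepointWithRetry(triggerSavepoint, maxAttempts);
        if (!savepoint.isPresent()) {
            return false; // the old job is never terminated without a savepoint
        }
        terminateOldJob.run();
        startNewJob.accept(savepoint.get());
        return true;
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        int[] calls = {0};
        // Stubbed trigger: fails twice, then succeeds on the third call.
        Supplier<Optional<String>> flakyTrigger = () -> {
            calls[0]++;
            return calls[0] < 3 ? Optional.<String>empty() : Optional.of("savepoint-" + calls[0]);
        };

        boolean upgraded = upgrade(flakyTrigger, 5,
                () -> events.add("terminate old job"),
                location -> events.add("start new job from " + location));

        System.out.println("upgraded=" + upgraded + ", events=" + events);
    }
}
```

Passing the terminate and start steps as callbacks keeps the ordering explicit: neither is invoked unless a savepoint location has been obtained first.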

Note

Deployment state transitions from “RUNNING” to “CANCELLED” and “RUNNING” to “SUSPENDED” are not controlled via the Upgrade Strategy although they involve the tear down of the underlying Flink job. Please see Desired State for details.

Summary

Upgrade and Restore Strategy are used to configure the behaviour of Ververica Platform during upgrades and recovery. The table below depicts the most common combinations for stateful and stateless applications.

Restore Strategy \ Upgrade Strategy   NONE   STATELESS               STATEFUL
NONE                                  X      Stateless Application   X
LATEST_SAVEPOINT                      X      X                       Stateful Application
LATEST_STATE                          X      X                       Stateful Application

For stateless applications no savepoint needs to be taken during teardown (“STATELESS”) and the application can start without previous state (“NONE”). For stateful applications a savepoint needs to be taken during teardown (“STATEFUL”) and the application starts from this savepoint (“LATEST_SAVEPOINT”), optionally falling back to the latest checkpoint (“LATEST_STATE”).
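The combinations in the table can also be expressed as a small lookup, which may be handy as a sanity check in tooling that generates Deployment specifications. This is a sketch under assumptions: StrategyCombinationSketch, its enums and commonUseCase are hypothetical names and only mirror the table; they are not a Ververica Platform API.

```java
import java.util.Optional;

/** Illustrative encoding of the summary table; not part of the Ververica Platform API. */
final class StrategyCombinationSketch {

    enum UpgradeStrategy { NONE, STATELESS, STATEFUL }

    enum RestoreStrategy { NONE, LATEST_SAVEPOINT, LATEST_STATE }

    /**
     * Returns the application type a combination is commonly used for,
     * or an empty result for the cells marked "X" in the table.
     */
    static Optional<String> commonUseCase(UpgradeStrategy upgrade, RestoreStrategy restore) {
        if (upgrade == UpgradeStrategy.STATELESS && restore == RestoreStrategy.NONE) {
            return Optional.of("Stateless Application");
        }
        if (upgrade == UpgradeStrategy.STATEFUL && restore != RestoreStrategy.NONE) {
            return Optional.of("Stateful Application");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Print the table row by row: one row per restore strategy.
        for (RestoreStrategy restore : RestoreStrategy.values()) {
            for (UpgradeStrategy upgrade : UpgradeStrategy.values()) {
                System.out.println(upgrade + " + " + restore + " -> "
                        + commonUseCase(upgrade, restore).orElse("X"));
            }
        }
    }
}
```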