Application Lifecycle Management¶
Ververica Platform continuously runs a reconciliation loop to detect deviations between the desired and the actual state of your Deployments. Such a deviation can be caused by a user updating the Deployment resource (e.g. to perform an upgrade) or by a change in the status of the running Deployment. Ververica Platform creates a physical Flink cluster for each Flink job on Kubernetes. Consequently, in order to align the actual state of a Deployment with its desired state, Ververica Platform usually needs to tear down the existing, running Flink job and set up a new Flink job.
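As a rough illustration of the reconciliation idea, the sketch below compares a desired and an actual Deployment and returns the action a single loop iteration might take. All types and names here (ReconciliationSketch, Deployment, Action, specVersion) are hypothetical and are not part of any Ververica Platform API; the real loop handles more states and failure cases.

```java
// Hypothetical model for illustration only; not a Ververica Platform API.
final class ReconciliationSketch {

    enum State { RUNNING, CANCELLED }

    enum Action { NOOP, START_JOB, STOP_JOB, REPLACE_JOB }

    /** Snapshot of a Deployment: its state plus an opaque version of its spec. */
    static final class Deployment {
        final State state;
        final String specVersion;

        Deployment(State state, String specVersion) {
            this.state = state;
            this.specVersion = specVersion;
        }
    }

    /** Compares desired and actual state and returns the action one iteration would take. */
    static Action reconcile(Deployment desired, Deployment actual) {
        boolean shouldRun = desired.state == State.RUNNING;
        boolean isRunning = actual.state == State.RUNNING;

        if (shouldRun && !isRunning) {
            return Action.START_JOB;
        }
        if (!shouldRun && isRunning) {
            return Action.STOP_JOB;
        }
        if (shouldRun && !desired.specVersion.equals(actual.specVersion)) {
            // One cluster per job: a changed spec means tearing down the
            // running job and setting up a new one.
            return Action.REPLACE_JOB;
        }
        return Action.NOOP;
    }

    public static void main(String[] args) {
        Deployment desired = new Deployment(State.RUNNING, "v2");
        Deployment actual = new Deployment(State.RUNNING, "v1");
        System.out.println(reconcile(desired, actual)); // prints REPLACE_JOB
    }
}
```

How the running job is replaced in the REPLACE_JOB case is governed by the upgrade and restore strategies.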
The following sections describe the different Restore and Upgrade Strategies in detail.
Ververica Platform will use the restore strategy whenever a Deployment needs to transition to the “RUNNING” state (configuration key
The following options are supported:
- LATEST_STATE - Use the latest successful checkpoint (requires a High Availability setup) or savepoint known to Ververica Platform.
- LATEST_SAVEPOINT - Use the latest successful savepoint known to Ververica Platform. It may have been previously triggered by a user request or by Ververica Platform (for example during a suspension or a stateful upgrade).
- NONE - Do not start from any checkpoint or savepoint.
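The three options can be read as a selection rule over the snapshots known to the platform. The following is a minimal sketch of that rule, not the platform's implementation: the types RestoreStrategySketch and Snapshot are hypothetical, and the choice of "whichever of checkpoint or savepoint completed most recently" for LATEST_STATE is an assumption made for illustration.

```java
import java.util.Optional;

// Hypothetical model for illustration only; not a Ververica Platform API.
final class RestoreStrategySketch {

    enum RestoreStrategy { LATEST_STATE, LATEST_SAVEPOINT, NONE }

    /** A snapshot location together with its completion time. */
    static final class Snapshot {
        final String path;
        final long completedAtMillis;

        Snapshot(String path, long completedAtMillis) {
            this.path = path;
            this.completedAtMillis = completedAtMillis;
        }
    }

    /**
     * Picks the snapshot a new job would start from.
     * An empty result means the job starts with empty state.
     */
    static Optional<Snapshot> selectRestorePoint(
            RestoreStrategy strategy,
            Optional<Snapshot> latestCheckpoint,
            Optional<Snapshot> latestSavepoint) {
        switch (strategy) {
            case LATEST_SAVEPOINT:
                return latestSavepoint;
            case LATEST_STATE:
                // With no retained checkpoint this behaves like LATEST_SAVEPOINT.
                if (!latestCheckpoint.isPresent()) {
                    return latestSavepoint;
                }
                if (!latestSavepoint.isPresent()) {
                    return latestCheckpoint;
                }
                // Assumption: prefer whichever snapshot completed most recently.
                return latestCheckpoint.get().completedAtMillis
                        >= latestSavepoint.get().completedAtMillis
                        ? latestCheckpoint
                        : latestSavepoint;
            default:
                // NONE: never restore from a checkpoint or savepoint.
                return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Optional<Snapshot> checkpoint = Optional.of(new Snapshot("chk-42", 200L));
        Optional<Snapshot> savepoint = Optional.of(new Snapshot("savepoint-1", 100L));
        Optional<Snapshot> chosen =
                selectRestorePoint(RestoreStrategy.LATEST_STATE, checkpoint, savepoint);
        System.out.println(chosen.get().path); // prints chk-42
    }
}
```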
The stateful upgrade strategy as described in Upgrade Strategy only works in conjunction with spec.restoreStrategy set to LATEST_SAVEPOINT or LATEST_STATE.
If you instead set restoreStrategy to NONE, the savepoint taken during the upgrade is not used, and your job might unexpectedly start from an empty state after it is upgraded.
Requirements for LATEST_STATE Restore Strategy¶
To use the “LATEST_STATE” restore strategy, you need to set up Flink Master Failover with ZooKeeper and configure checkpoints to be retained on cancellation in the Flink job:
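import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Keep externalized checkpoints when the job is cancelled so they remain available for restore.
env.getCheckpointConfig().enableExternalizedCheckpoints(
    CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);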
This lets your Flink application resume from the latest checkpointed state available in ZooKeeper instead of solely relying on savepoints.
Without proper setup, the effect would be the same as just setting the LATEST_SAVEPOINT restore strategy.
Deployment upgrades still respect the configured upgrade strategy and are independent of the configured restore strategy. A stateful upgrade will still trigger a savepoint.
Allowing non-restored state¶
When restoring from a savepoint that is not fully compatible with a job, because the savepoint contains state for a Flink task that is not present in the new job, you can parameterize restore strategies LATEST_STATE and LATEST_SAVEPOINT with the allowNonRestoredState flag:
kind: Deployment
spec:
  restoreStrategy:
    kind: LATEST_SAVEPOINT
    allowNonRestoredState: true
Ververica Platform will automatically orchestrate upgrades of running Jobs (if present) depending on the configured upgrade strategy. Currently, there are three supported upgrade strategies:
- STATELESS: Ververica Platform will terminate any currently running Flink job without taking a savepoint and then will start the new job.
- STATEFUL: Ververica Platform will first save the state of the currently running Flink job by performing a savepoint, then terminate the currently running job, and finally will start a new job by using the savepoint taken prior to termination. To use this strategy please read the section titled Restore Strategy.
- NONE: Ververica Platform will not perform an automatic upgrade of a running Flink job. If this strategy is selected, the user is expected to manually cancel or suspend the currently running job (by setting spec.state to cancelled), and start a new job manually (by setting spec.state back to running).
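The three upgrade strategies differ only in the sequence of steps taken against the running job. The sketch below spells out those sequences as plain data; the class UpgradeStrategySketch and the step names are hypothetical labels chosen for illustration, not identifiers used by Ververica Platform.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model for illustration only; not a Ververica Platform API.
final class UpgradeStrategySketch {

    enum UpgradeStrategy { STATELESS, STATEFUL, NONE }

    /** Returns the ordered steps an automatic upgrade would perform. */
    static List<String> plan(UpgradeStrategy strategy) {
        switch (strategy) {
            case STATELESS:
                // No savepoint is taken before the job is terminated.
                return Arrays.asList("TERMINATE_JOB", "START_NEW_JOB");
            case STATEFUL:
                // The savepoint must succeed before the job is terminated.
                return Arrays.asList(
                        "TRIGGER_SAVEPOINT", "TERMINATE_JOB", "START_NEW_JOB_FROM_SAVEPOINT");
            default:
                // NONE: nothing happens automatically; the user cancels or
                // suspends the job and starts it again manually.
                return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        for (UpgradeStrategy strategy : UpgradeStrategy.values()) {
            System.out.println(strategy + " -> " + plan(strategy));
        }
    }
}
```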
As noted earlier, Ververica Platform will eventually achieve the desired state using the specified upgrade strategy. An upgrade strategy is executed in a fault-tolerant way. For example, if triggering a savepoint fails during a stateful upgrade, Ververica Platform will retry until a savepoint succeeds or maxSavepointCreationAttempts is exhausted. The Flink job will not be terminated before a savepoint succeeds.
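The retry behaviour can be sketched as a bounded loop in which the job is only stopped once a savepoint exists. This is a simplified illustration under stated assumptions: the class SavepointRetrySketch, the Job stand-in, and the Supplier-based trigger are hypothetical, and only the name maxSavepointCreationAttempts is taken from the text above.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical model for illustration only; not a Ververica Platform API.
final class SavepointRetrySketch {

    /** Minimal stand-in for a running Flink job. */
    static final class Job {
        boolean running = true;
    }

    /** Retries the trigger until it yields a savepoint path or attempts run out. */
    static Optional<String> triggerWithRetry(
            Supplier<Optional<String>> triggerSavepoint, int maxSavepointCreationAttempts) {
        for (int attempt = 1; attempt <= maxSavepointCreationAttempts; attempt++) {
            Optional<String> savepointPath = triggerSavepoint.get();
            if (savepointPath.isPresent()) {
                return savepointPath;
            }
        }
        return Optional.empty();
    }

    /** Terminates the job only after a savepoint has succeeded. */
    static Optional<String> statefulTeardown(
            Job job, Supplier<Optional<String>> triggerSavepoint, int maxSavepointCreationAttempts) {
        Optional<String> savepointPath =
                triggerWithRetry(triggerSavepoint, maxSavepointCreationAttempts);
        if (savepointPath.isPresent()) {
            job.running = false;
        }
        return savepointPath;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated trigger that fails twice and succeeds on the third call.
        Supplier<Optional<String>> flaky = () ->
                ++calls[0] < 3 ? Optional.<String>empty() : Optional.of("savepoint-3");

        Job job = new Job();
        Optional<String> result = statefulTeardown(job, flaky, 5);
        System.out.println(result.orElse("no savepoint")); // prints savepoint-3
        System.out.println(job.running);                   // prints false
    }
}
```

If the attempts are exhausted without a savepoint, the sketch returns an empty result and leaves the job running.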
The Upgrade Strategy and the Restore Strategy together configure the behaviour of Ververica Platform during upgrades and recovery. The table below depicts the most common combinations for stateful and stateless applications.
For stateless applications, no savepoint needs to be taken during teardown (“STATELESS”) and the application can start without previous state (“NONE”). For stateful applications, a savepoint needs to be taken during teardown (“STATEFUL”) and the application starts from this savepoint (“LATEST_SAVEPOINT”), optionally falling back to the latest checkpoint (“LATEST_STATE”).