Ververica Platform continuously runs a reconciliation loop to detect deviations between the desired and the actual state of your Deployments.
The reasons for such a deviation might be an update of the Deployment Resources by the user (e.g. to perform an upgrade) or a change in the status of the running Deployment.
In order to align the actual state of a Deployment with the desired state of a Deployment Ververica Platform usually needs to tear down the existing, running Flink job and set up a new Flink job.
In the following sections the different Restore and Upgrades Strategies are described in detail.
Ververica Platform will use the restore strategy (configuration key
spec.restoreStrategy) whenever a Deployment needs to transition to the “RUNNING” state.
The restore strategy determines which Savepoint resource is used for state restore when deploying a Apache Flink® job.
- LATEST_STATE - Use the latest completed Savepoint known to Ververica Platform. This restore strategy will consider Savepoints with any origin for restore, including RETAINED_CHECKPOINT. See the notes below for requirements to use retained Flink checkpoints.
- LATEST_SAVEPOINT - Use the latest completed Savepoint known to Ververica Platform except those with origin RETAINED_CHECKPOINT.
- NONE - Do not restore from any Savepoint.
Note that the web user interface provides a hint about which Savepoint will be used to restore the state of the next Apache Flink® job from.
The “STATEFUL” upgrade strategy as described in Upgrade Strategy only works in conjunction with
spec.restoreStrategy set to
If you instead set restoreStrategy to
NONE, you might run into an an unexpected situation and start from an empty state after your job is upgraded.
To use the “LATEST_STATE” restore strategy you need to setup Flink JobManager Failover with Kubernetes and configure checkpoints to always be retained in your application code
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); env.getCheckpointConfig().enableExternalizedCheckpoints( CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION );
or via the Flink configuration (preferred)
flinkConfiguration: execution.checkpointing.interval: <interval> execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
When you satisfy both requirements, you will see a Savepoint resource with origin
RETAINED_CHECKPOINT for each Flink checkpoint that has not been discarded after your Flink application terminates. Using the
LATEST_STATE restore strategy will restore your Flink job state from such a Savepoint.
If Kubernetes-based JobManager failover or checkpoint retention is not configured, the
LATEST_STATE restore strategy will behave in the same way as the
LATEST_SAVEPOINT restore strategy.
Deployment upgrades still respect the configured upgrade strategy and are independent of the configured restore strategy. A stateful upgrade will still trigger a savepoint.
In session mode, there are limitations in the Kubernetes high-availability service that might result in missed retained checkpoints.
When restoring from a savepoint that is not fully compatible with a job, because the savepoint contains state for a Flink task that is not present in the new job, you can parameterize restore strategies LATEST_STATE and LATEST_SAVEPOINT with the allowNonRestoredState flag:
kind: Deployment spec: restoreStrategy: kind: LATEST_SAVEPOINT allowNonRestoredState: true
Ververica Platform will automatically orchestrate upgrades of running Jobs (if present) depending on the configured upgrade strategy. Currently, there are three supported upgrade strategies:
- STATELESS: Ververica Platform will terminate any running Flink job by cancelling it and then deploy the upgraded job. No state snapshot will be triggered.
- STATEFUL: Ververica Platform will terminate any running Flink job by suspending it and then deploy the upgraded job. Job suspension will trigger a state snapshot.
- NONE: Ververica Platform will not perform an automatic upgrade of a running Flink job. If this strategy is selected, the user is expected to manually cancel or suspend the currently running job (by setting
spec.stateto cancelled), and start a new job manually (by setting
spec.stateback to running).
Please read the section titled Restore Strategy as any newly deployed job during an upgrade will respect the configured restore strategy.
As noted earlier, Ververica Platform will eventually achieve the desired state using the specified upgrade strategy. Executing an upgrade strategy happens in a fault-tolerant way, e.g. if performing a stateful upgrade and triggering a Savepoint fails, Ververica Platform will retry until a Savepoint succeeds or
maxSavepointCreationAttempts is exhausted. The Flink job will not be terminated before a savepoint succeeds.
Upgrade and Restart Strategy are used to configure the behaviour of Ververica Platform during upgrades and recovery. The table below depicts the most common combination for stateful or stateless applications.
For stateless applications no savepoint needs to be taken during teardown (“STATELESS”) and the application can start without previous state (“NONE”). For stateful applications a savepoint needs to be taken during teardown (“STATEFUL”) and the application starts from this savepoint, optionally falling back to the latest checkpoint (“LATEST_STATE”).