Deployment Upgrades
Ververica Platform continuously runs a reconciliation loop to detect deviations between the desired and the actual state of your Deployments.
The reasons for such a deviation might be an update of the Deployment Resources by the user (e.g. to perform an upgrade) or a change in the status of the running Deployment.
In order to align the actual state of a Deployment with the desired state of a Deployment Ververica Platform usually needs to tear down the existing, running Flink job and set up a new Flink job.
In order to flexibly manage the lifecycle of your applications, you can control the behaviour of Ververica Platform during setup and tear down of a Flink job.
In the following sections the different Restore and Upgrades Strategies are described in detail.
Restore Strategy
Ververica Platform will use the restore strategy (configuration key spec.restoreStrategy) whenever a Deployment needs to transition to the "RUNNING" state.
The restore strategy determines which Savepoint resource is used for state restore when deploying an Apache Flink® job.
- LATEST_STATE- Use the latest completed Savepoint known to Ververica Platform. This restore strategy will consider Savepoints with any origin for restore, including- RETAINED_CHECKPOINT. See the notes below for requirements to use retained Flink checkpoints.
- LATEST_SAVEPOINT- Use the latest completed Savepoint known to Ververica Platform except those with origin- RETAINED_CHECKPOINT.
- NONE- Do not restore from any Savepoint.
Note that the web user interface provides a hint about which Savepoint will be used to restore the state of the next Apache Flink® job.
The STATEFUL upgrade strategy as described in Upgrade Strategy only works in conjunction with spec.restoreStrategy set to LATEST_SAVEPOINT or LATEST_STATE.
If you instead set restoreStrategy to NONE, you might run into an an unexpected situation and start from an empty state after your job is upgraded.
Requirements for LATEST_STATE
To use the LATEST_STATE restore strategy you need to setup Flink JobManager Failover with Kubernetes and configure checkpoints to always be retained in your application code
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.getCheckpointConfig().enableExternalizedCheckpoints(
  CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION
);
or via the Flink configuration (preferred)
flinkConfiguration:
  execution.checkpointing.interval: <interval>
  execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
When you satisfy both requirements, you will see a Savepoint resource with origin RETAINED_CHECKPOINT for each Flink checkpoint that has not been discarded after your Flink application terminates. Using the LATEST_STATE restore strategy will restore your Flink job state from such a Savepoint.
If Kubernetes-based JobManager failover or checkpoint retention is not configured, the LATEST_STATE restore strategy will behave in the same way as the LATEST_SAVEPOINT restore strategy.
Deployment upgrades still respect the configured upgrade strategy and are independent of the configured restore strategy. A stateful upgrade will still trigger a savepoint.
In session mode, there are limitations in the Kubernetes high-availability service that might result in missed retained checkpoints.
Allowing non-restored state
When restoring from a savepoint that is not fully compatible with a job, because the savepoint contains state for a Flink task that is not present in the new job, you can parameterize restore strategies LATEST_STATE and LATEST_SAVEPOINT with the allowNonRestoredState flag:
kind: Deployment
spec:
  restoreStrategy:
    kind: LATEST_SAVEPOINT
    allowNonRestoredState: true
Upgrade Strategy
Ververica Platform will automatically orchestrate upgrades of running Jobs (if present) depending on the configured upgrade strategy. Currently, there are three supported upgrade strategies:
- STATELESS: Ververica Platform will terminate any running Flink job by cancelling it and then deploy the upgraded job. No state snapshot will be triggered.
- STATEFUL: Ververica Platform will terminate any running Flink job by suspending it and then deploy the upgraded job. Job suspension will trigger a state snapshot.
- NONE: Ververica Platform will not perform an automatic upgrade of a running Flink job. If this strategy is selected, the user is expected to manually cancel or suspend the currently running job (by setting- spec.stateto cancelled), and start a new job manually (by setting- spec.stateback to running).
Please read the section titled Restore Strategy as any newly deployed job during an upgrade will respect the configured restore strategy.
As noted earlier, Ververica Platform will eventually achieve the desired state using the specified upgrade strategy. Executing an upgrade strategy happens in a fault-tolerant way, e.g. if performing a stateful upgrade and triggering a Savepoint fails, Ververica Platform will retry until a Savepoint succeeds or maxSavepointCreationAttempts is exhausted. The Flink job will not be terminated before a savepoint succeeds.
Deployment state transitions from "RUNNING" to "CANCELLED" and "RUNNING" to "SUSPENDED" are not controlled via the upgrade_strategy although they involve the tear down of the underlying Flink job. Please see Deployment Lifecycle for details.
Summary
Upgrade and Restart Strategy are used to configure the behaviour of Ververica Platform during upgrades and recovery. The table below depicts the most common combination for stateful or stateless applications.
| NONE | STATELESS | STATEFUL | |
|---|---|---|---|
| NONE | X | Stateless Application | X | 
| LATEST_SAVEPOINT | X | X | X | 
| LATEST_STATE | X | X | Stateful Application | 
For stateless applications no savepoint needs to be taken during teardown (STATELESS) and the application can start without previous state (NONE).
For stateful applications a savepoint needs to be taken during teardown (STATEFUL) and the application starts from this savepoint, optionally falling back to the latest checkpoint (LATEST_STATE).