Savepoints
A Savepoint Resource points to a single savepoint in Apache Flink®. A single Flink savepoint can be referenced by multiple Ververica Platform Savepoint resources.
Specification
There are different metadata.origin values for Savepoints:
- USER_REQUEST: The Savepoint has been requested manually by a user through Ververica Platform.
- SUSPEND: The Savepoint has been requested when the corresponding Deployment was suspended.
- COPIED: The Savepoint is a copy of another Savepoint resource. Both Savepoint resources point to the same physical Flink savepoint.
- RETAINED_CHECKPOINT: The Savepoint is a retained Flink checkpoint that was not discarded after the Flink job was shut down.
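As an illustration of how these origin values might be used on the client side, the following minimal sketch groups Savepoint resources by metadata.origin. It assumes the resources have already been fetched and parsed into Python dictionaries. The helper names group_by_origin and unexpected_origins are hypothetical and not part of Ververica Platform.

```python
from collections import defaultdict

# Origin values listed in this section.
KNOWN_ORIGINS = ("USER_REQUEST", "SUSPEND", "COPIED", "RETAINED_CHECKPOINT")


def group_by_origin(savepoints):
    """Group Savepoint resources (parsed into dicts) by metadata.origin.

    Hypothetical helper: resources without an origin end up under None.
    """
    groups = defaultdict(list)
    for savepoint in savepoints:
        origin = savepoint.get("metadata", {}).get("origin")
        groups[origin].append(savepoint)
    return dict(groups)


def unexpected_origins(groups):
    """Return the origin keys that are not in the list documented above."""
    return sorted(str(origin) for origin in groups if origin not in KNOWN_ORIGINS)
```

Grouping first makes it easy, for example, to inspect only the Savepoints that were created when a Deployment was suspended.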
The Restore Strategy of your Deployment resources controls which Savepoint will be used to restore the state of a Flink job.
Ververica Platform does not keep track of Flink savepoints not created through Ververica Platform.
Requirements
Triggering Savepoints requires configuration of a path under which to store savepoints. If Ververica Platform was configured with blob storage, it will preconfigure each Deployment for checkpoints, savepoints and high-availability.
Otherwise, please provide an entry in the flinkConfiguration map with the key state.savepoints.dir:
kind: Deployment
spec:
  template:
    spec:
      flinkConfiguration:
        state.savepoints.dir: s3://flink/savepoints
The provided blob storage location needs to be accessible by all nodes of your cluster. If Ververica Platform was configured with blob storage, the platform will handle the distribution of credentials transparently and no further action is required. Otherwise, you can, for instance, use a custom volume mount or a custom filesystem configuration.
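If Deployment specifications are generated by a script, the state.savepoints.dir entry can be filled in before the Deployment is submitted. The following is a minimal sketch that assumes the Deployment is held as a nested Python dictionary mirroring the structure shown above. The function name with_savepoints_dir is hypothetical.

```python
import copy

SAVEPOINTS_DIR_KEY = "state.savepoints.dir"


def with_savepoints_dir(deployment, savepoints_dir):
    """Return a copy of a Deployment dict with state.savepoints.dir set.

    An existing value is left untouched, so an explicit configuration
    always wins over the default passed in here.
    """
    result = copy.deepcopy(deployment)
    # Walk down to spec.template.spec.flinkConfiguration, creating
    # missing intermediate maps along the way.
    flink_conf = (
        result.setdefault("spec", {})
        .setdefault("template", {})
        .setdefault("spec", {})
        .setdefault("flinkConfiguration", {})
    )
    flink_conf.setdefault(SAVEPOINTS_DIR_KEY, savepoints_dir)
    return result
```

The input is copied rather than modified in place so that the original specification can still be compared against the result.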
Please consult the official Flink documentation on savepoints for more details on savepoints in Flink.
Manually Adding a Savepoint Resource
Savepoints triggered by or through Ververica Platform are automatically added to the Deployment. Yet, in some cases you might want to recover or start your Deployment from a specific Apache Flink state snapshot that is not yet tracked by Ververica Platform. In such a scenario you need to manually add a Savepoint resource to your Deployment.
We assume that you already have a savepoint or (externalized) checkpoint to resume from. The following request creates a Savepoint resource that allows you to resume from your desired snapshot:
POST /api/v1/namespaces/{namespace}/savepoints
metadata:
  deploymentId: ${deploymentId}
  annotations:
    com.dataartisans.appmanager.controller.deployment.spec.version: ${deploymentSpecVersion}
spec:
  savepointLocation: ${savepointLocation}
  flinkSavepointId: 00000000-0000-0000-0000-000000000000
status:
  state: COMPLETED
This will create a Savepoint resource for the Deployment with ID deploymentId, point it to the snapshot at savepointLocation and set its type as type. If the type is not specified, it will default to UNKNOWN. You have to extract the deploymentSpecVersion from Deployment.metadata.annotations."com.dataartisans.appmanager.controller.deployment.spec.version" of the corresponding Deployment and assign it to the posted Savepoint. Afterwards the web user interface for this Deployment will show (in the Snapshots Tab) that the Deployment will be started from this Savepoint. Its origin should be COPIED.
You have to ensure that the provided savepointLocation is valid and accessible by the Apache Flink® pods. If this is not the case, you will notice errors only during runtime of the job(s) that try to restore from this location.
If the com.dataartisans.appmanager.controller.deployment.spec.version annotation is missing, the Savepoint is added to the Deployment but may not be used automatically during restore.
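The request for manually adding a Savepoint resource can be assembled with a short script. The following is a minimal sketch using only the Python standard library. It assumes that the Deployment has already been fetched and parsed into a dictionary, that the API accepts a JSON request body, and that no authentication header is needed; adapt these points to your installation. The function names and the base_url parameter are hypothetical.

```python
import json
import urllib.request

SPEC_VERSION_ANNOTATION = (
    "com.dataartisans.appmanager.controller.deployment.spec.version"
)


def extract_spec_version(deployment):
    """Read the spec version annotation from a Deployment dict.

    Raises KeyError if it is missing, because a Savepoint posted without
    the annotation may not be used automatically during restore.
    """
    annotations = deployment.get("metadata", {}).get("annotations", {})
    if SPEC_VERSION_ANNOTATION not in annotations:
        raise KeyError(f"Deployment lacks annotation {SPEC_VERSION_ANNOTATION}")
    return annotations[SPEC_VERSION_ANNOTATION]


def build_savepoint_body(deployment_id, spec_version, savepoint_location):
    """Build the Savepoint resource shown in the request above."""
    return {
        "metadata": {
            "deploymentId": deployment_id,
            "annotations": {SPEC_VERSION_ANNOTATION: spec_version},
        },
        "spec": {
            "savepointLocation": savepoint_location,
            "flinkSavepointId": "00000000-0000-0000-0000-000000000000",
        },
        "status": {"state": "COMPLETED"},
    }


def build_savepoint_request(base_url, namespace, body):
    """Prepare (but do not send) the POST request for the Savepoint.

    Assumption: the endpoint accepts application/json.
    """
    url = f"{base_url.rstrip('/')}/api/v1/namespaces/{namespace}/savepoints"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The prepared request can then be sent with urllib.request.urlopen. Building the body separately from sending it makes it possible to inspect or log the resource before anything is created.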