Savepoints

A Savepoint resource points to a single savepoint in Apache Flink®. Multiple Ververica Platform Savepoint resources can reference the same Flink savepoint.

Specification

The metadata.origin field of a Savepoint records how the Savepoint was created. It takes one of the following values:

  • USER_REQUEST: The Savepoint has been requested manually by a user through Ververica Platform.
  • SUSPEND: The Savepoint has been requested when the corresponding Deployment was suspended.
  • COPIED: The Savepoint is a copy of another Savepoint resource. Both Savepoint resources point to the same physical Flink savepoint.
  • RETAINED_CHECKPOINT: The Savepoint is a retained Flink checkpoint that was not discarded after the Flink job was shut down.
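
Client code that reads Savepoint resources may want to handle these origins explicitly. The following is a minimal sketch, assuming the Savepoint resource has already been fetched and parsed into a Python dict. The names SavepointOrigin and origin_of are illustrative and not part of Ververica Platform.

```python
from enum import Enum


class SavepointOrigin(Enum):
    """The metadata.origin values listed above (illustrative helper type)."""

    USER_REQUEST = "USER_REQUEST"  # requested manually by a user
    SUSPEND = "SUSPEND"  # requested when the Deployment was suspended
    COPIED = "COPIED"  # copy of another Savepoint resource
    RETAINED_CHECKPOINT = "RETAINED_CHECKPOINT"  # retained Flink checkpoint


def origin_of(savepoint: dict) -> SavepointOrigin:
    """Read metadata.origin from a Savepoint resource parsed into a dict.

    Raises ValueError if the value is not one of the four origins above.
    """
    return SavepointOrigin(savepoint["metadata"]["origin"])
```

Mapping the string to an enum makes an unexpected origin value fail early instead of being silently ignored.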

The Restore Strategy of your Deployment resources controls which Savepoint will be used to restore the state of a Flink job.

Ververica Platform does not track Flink savepoints that were created outside of Ververica Platform.

Requirements

Triggering a Savepoint requires a configured path under which savepoints are stored. If Ververica Platform was configured with blob storage, it preconfigures each Deployment for checkpoints, savepoints, and high availability.

Otherwise, please provide an entry in the flinkConfiguration map with the key state.savepoints.dir:

kind: Deployment
spec:
  template:
    spec:
      flinkConfiguration:
        state.savepoints.dir: s3://flink/savepoints
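
To check whether a Deployment sets this key explicitly, a small helper along the following lines may be useful. It is a sketch that assumes the Deployment resource has already been parsed into a Python dict with the structure shown above. The function name configured_savepoint_dir is hypothetical.

```python
from typing import Optional

SAVEPOINT_DIR_KEY = "state.savepoints.dir"


def configured_savepoint_dir(deployment: dict) -> Optional[str]:
    """Return spec.template.spec.flinkConfiguration["state.savepoints.dir"].

    Returns None if the key, or any level of the path leading to it,
    is missing or empty.
    """
    node = deployment
    for key in ("spec", "template", "spec", "flinkConfiguration"):
        node = (node or {}).get(key)
    return (node or {}).get(SAVEPOINT_DIR_KEY)
```

A result of None only means that the Deployment itself does not set the key. If the platform was configured with blob storage, the savepoint path is preconfigured and no explicit entry is needed.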

The provided blob storage location needs to be accessible by all nodes of your cluster. If Ververica Platform was configured with blob storage, the platform handles the distribution of credentials transparently and no further action is required. Otherwise, you can, for instance, use a custom volume mount or a custom filesystem configuration.

For more details on savepoints in Flink, please consult the official Flink documentation on savepoints.

Manually Adding a Savepoint Resource

Savepoints triggered by or through Ververica Platform are automatically added to the Deployment. In some cases, however, you might want to recover or start your Deployment from a specific Apache Flink state snapshot that is not yet tracked by Ververica Platform. In that case, you need to manually add a Savepoint resource to your Deployment.

The following assumes that you already have a savepoint or (externalized) checkpoint to resume from. To resume from the desired snapshot, post a Savepoint resource that points to it:

POST /api/v1/namespaces/{namespace}/savepoints
metadata:
  deploymentId: ${deploymentId}
  annotations:
    com.dataartisans.appmanager.controller.deployment.spec.version: ${deploymentSpecVersion}
spec:
  savepointLocation: ${savepointLocation}
  flinkSavepointId: 00000000-0000-0000-0000-000000000000
status:
  state: COMPLETED

This creates a Savepoint resource for the Deployment with the ID deploymentId and points it to the snapshot at savepointLocation. You have to extract the deploymentSpecVersion from Deployment.metadata.annotations."com.dataartisans.appmanager.controller.deployment.spec.version" of the corresponding Deployment and assign it to the posted Savepoint. Afterwards, the web user interface for this Deployment shows (in the Snapshots tab) that the Deployment will be started from this Savepoint. Its origin should be “COPIED”.
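
The request above can be assembled programmatically. The following is a minimal sketch using only the Python standard library. It assumes that the Deployment has already been fetched and parsed into a dict, that the API accepts a JSON request body (the example above is shown as YAML), and that base_url is a placeholder for the address of your installation. The function names build_savepoint and build_request are hypothetical.

```python
import json
import urllib.request

SPEC_VERSION_ANNOTATION = (
    "com.dataartisans.appmanager.controller.deployment.spec.version"
)


def build_savepoint(deployment: dict, deployment_id: str,
                    savepoint_location: str) -> dict:
    """Build the Savepoint resource shown above for an existing snapshot."""
    annotations = (deployment.get("metadata") or {}).get("annotations") or {}
    spec_version = annotations.get(SPEC_VERSION_ANNOTATION)
    if spec_version is None:
        # Fail early: the Savepoint needs this annotation from the Deployment.
        raise ValueError(f"Deployment has no {SPEC_VERSION_ANNOTATION} annotation")
    return {
        "metadata": {
            "deploymentId": deployment_id,
            "annotations": {SPEC_VERSION_ANNOTATION: spec_version},
        },
        "spec": {
            "savepointLocation": savepoint_location,
            "flinkSavepointId": "00000000-0000-0000-0000-000000000000",
        },
        "status": {"state": "COMPLETED"},
    }


def build_request(base_url: str, namespace: str,
                  savepoint: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request for the Savepoint.

    Assumption: the endpoint accepts application/json.
    """
    url = f"{base_url.rstrip('/')}/api/v1/namespaces/{namespace}/savepoints"
    return urllib.request.Request(
        url,
        data=json.dumps(savepoint).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The prepared request can then be sent with urllib.request.urlopen. Any authentication headers that your installation requires are not covered by this sketch.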

Note

You have to ensure that the provided savepointLocation is valid and accessible by the Apache Flink pods. If it is not, errors only surface at runtime, when a job tries to restore from this location.

Note

If the com.dataartisans.appmanager.controller.deployment.spec.version annotation is missing, the Savepoint will not be used during restore.