Savepoints

A Savepoint resource points to a single savepoint or retained checkpoint in Apache Flink®. A single Apache Flink® savepoint can be referenced by multiple Ververica Platform Savepoint resources.

For more details on savepoints and checkpoints, please consult the official Apache Flink® documentation.

Specification

The Restore Strategy of your Deployment resources controls which Savepoint will be used to restore the state of an Apache Flink® job.

Ververica Platform only keeps track of Apache Flink® savepoints that are created within the Ververica Platform.

Savepoint Origins

A Savepoint can be created in various ways. Its origin is described by the metadata.origin attribute:

  • USER_REQUEST: The Savepoint was requested manually by a user through Ververica Platform.
  • SUSPEND: The Savepoint was requested when the corresponding Deployment was suspended.
  • COPIED: The Savepoint is either a copy of another Savepoint resource, or was created manually using an existing savepointLocation (see below). Both Savepoint resources point to the same physical Apache Flink® savepoint.
  • RETAINED_CHECKPOINT: The Savepoint is a retained Apache Flink® checkpoint that was not discarded after the Apache Flink® job was cancelled.

Savepoint States

The current state of a Savepoint resource is described by the status.state attribute:

  • STARTED: The Savepoint was started, but is not completed yet.
  • COMPLETED: The Savepoint was completed successfully and can be restored from.
  • FAILED: Creation of the Savepoint failed. Details on the cause of failure can be found in the status.failure field.
  • PENDING_DELETION: The Savepoint was marked for deletion. It will automatically be deleted if it meets all prerequisites. It can no longer be restored from.
  • DELETING: The Savepoint is currently being deleted. It can no longer be restored from.
  • FAILED_DELETION: Deletion of the Savepoint failed. Details on the cause of failure can be found in the status.failure field. It can no longer be restored from. Deletion can be retried.
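The state list above can be turned into a small client-side check. The following is a minimal sketch, not part of Ververica Platform: the function name can_restore_from is hypothetical, and it assumes that only COMPLETED Savepoints are restorable, which the list states explicitly for COMPLETED and the deletion states and which is inferred for STARTED and FAILED.

```python
# All values of status.state listed above.
KNOWN_STATES = {
    "STARTED",
    "COMPLETED",
    "FAILED",
    "PENDING_DELETION",
    "DELETING",
    "FAILED_DELETION",
}


def can_restore_from(state: str) -> bool:
    """Return True if a Savepoint in this state can be restored from.

    Assumption: only COMPLETED is restorable. The documentation says so
    explicitly for the deletion states; for STARTED and FAILED it is inferred.
    """
    if state not in KNOWN_STATES:
        raise ValueError(f"unknown Savepoint state: {state!r}")
    return state == "COMPLETED"
```

A caller could use this to filter a list of Savepoint resources by their status.state before choosing one to restore from.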

Savepoint Types

The metadata.type attribute of a Savepoint resource describes the structure of the underlying savepoint or checkpoint in Apache Flink®. More information on incremental checkpoints can be found in the Apache Flink® documentation.

  • INCREMENTAL: The Savepoint resource references an incremental checkpoint.
  • FULL: The Savepoint resource references a savepoint or a full checkpoint.
  • UNKNOWN: The type of the underlying savepoint or checkpoint is not known.

Savepoint resources created using a version of Ververica Platform prior to 2.5 do not have metadata.type populated. They will be treated as if the type was UNKNOWN.

Requirements

Triggering Savepoints requires configuration of a path under which to store savepoints. If Ververica Platform was configured with blob storage, it will preconfigure each Deployment for checkpoints, savepoints and high-availability.

Otherwise, please provide an entry in the flinkConfiguration map with the key state.savepoints.dir:

kind: Deployment
spec:
  template:
    spec:
      flinkConfiguration:
        state.savepoints.dir: s3://flink/savepoints

The provided blob storage location needs to be accessible by all nodes of your cluster. If Ververica Platform was configured with blob storage, the platform will handle the distribution of credentials transparently and no further action is required. Otherwise, you can, for instance, use a custom volume mount or a custom filesystem configuration.

Manually Adding a Savepoint Resource

Savepoints triggered by or through Ververica Platform are automatically added to the Deployment. Yet, in some cases you might want to recover or start your Deployment from a specific Apache Flink® state snapshot that is not yet tracked by Ververica Platform. In such a scenario you need to manually add a Savepoint resource to your Deployment.

In the following, we assume that you already have a savepoint or checkpoint at hand to resume from. The following request creates a Savepoint resource that allows you to resume from your desired snapshot:

POST /api/v1/namespaces/{namespace}/savepoints
metadata:
  deploymentId: ${deploymentId}
  annotations:
    com.dataartisans.appmanager.controller.deployment.spec.version: ${deploymentSpecVersion}
  type: ${type}
spec:
  savepointLocation: ${savepointLocation}
  flinkSavepointId: 00000000-0000-0000-0000-000000000000
status:
  state: COMPLETED

This will create a Savepoint resource for the Deployment with ID deploymentId, point it to the snapshot at savepointLocation, and set its type to type. If the type is not specified, it will default to UNKNOWN. You have to extract the deploymentSpecVersion from Deployment.metadata.annotations."com.dataartisans.appmanager.controller.deployment.spec.version" of the corresponding Deployment and assign it to the posted Savepoint. Afterwards, the Snapshots tab of the web user interface for this Deployment will show that the Deployment will be started from this Savepoint. Its origin should be COPIED.

Note

You have to ensure that the provided savepointLocation is valid and accessible by the Apache Flink® pods. If this is not the case, errors will only surface at runtime, when a job tries to restore from this location.

Note

If the com.dataartisans.appmanager.controller.deployment.spec.version annotation is missing, the Savepoint will not be used during restore.
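The request body for manually adding a Savepoint can also be assembled programmatically. The sketch below only builds the body as a dictionary and sends nothing over the network. The function name build_savepoint_body is hypothetical, and reading the Deployment ID from deployment["metadata"]["id"] is an assumption about the shape of the Deployment resource; adapt it to how you obtain the ID.

```python
import json
from typing import Optional

# Annotation key named in the request above.
SPEC_VERSION_ANNOTATION = (
    "com.dataartisans.appmanager.controller.deployment.spec.version"
)


def build_savepoint_body(
    deployment: dict,
    savepoint_location: str,
    savepoint_type: Optional[str] = None,
) -> dict:
    """Build the body for POST /api/v1/namespaces/{namespace}/savepoints."""
    metadata = deployment["metadata"]
    # A KeyError here means the annotation is missing; without it the
    # Savepoint would not be used during restore.
    spec_version = metadata["annotations"][SPEC_VERSION_ANNOTATION]
    body = {
        "metadata": {
            # Assumption: the Deployment ID is available as metadata.id.
            "deploymentId": metadata["id"],
            "annotations": {SPEC_VERSION_ANNOTATION: spec_version},
        },
        "spec": {
            "savepointLocation": savepoint_location,
            "flinkSavepointId": "00000000-0000-0000-0000-000000000000",
        },
        "status": {"state": "COMPLETED"},
    }
    # If the type is omitted, the platform defaults it to UNKNOWN.
    if savepoint_type is not None:
        body["metadata"]["type"] = savepoint_type
    return body


if __name__ == "__main__":
    example_deployment = {
        "metadata": {
            "id": "11111111-2222-3333-4444-555555555555",  # placeholder ID
            "annotations": {SPEC_VERSION_ANNOTATION: "7"},  # placeholder value
        }
    }
    body = build_savepoint_body(
        example_deployment, "s3://flink/savepoints/savepoint-1", "FULL"
    )
    print(json.dumps(body, indent=2))
```

Copying the annotation value directly from the fetched Deployment, rather than typing it by hand, reduces the risk of posting a Savepoint that is silently ignored during restore.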

Deleting a Savepoint Resource

Savepoint resources which are no longer needed can be deleted to free up space. The underlying data in the configured blob storage will be deleted automatically.

Both Savepoint resources referencing Apache Flink® savepoints and those referencing Apache Flink® retained checkpoints can be deleted.

When deleting a Savepoint resource, Ververica Platform will also attempt to delete all other Savepoint resources that point to the same physical location in blob storage.

Prerequisites

To delete a Savepoint resource, the following conditions must be true:

  • Universal blob storage is enabled.
  • The user requesting the deletion has the editor or owner role inside the Namespace.

Additionally, the below conditions must be true for all Savepoint resources pointing to the same physical location:

  • The Savepoint resource is in state COMPLETED, FAILED, or FAILED_DELETION.
  • The Savepoint resource references a savepoint or a full checkpoint (precise logic below).
  • If the Savepoint resource is associated with an active Deployment, it must not be the latest snapshot (savepoint or checkpoint) to ensure its deletion will not impact the underlying Job’s failure recovery.

In order to provide better handling of Savepoint resources created using Ververica Platform 2.4 or below, a Savepoint resource passes the second condition if either of the following is true:

  • metadata.type is FULL.
  • metadata.type is UNKNOWN or not set, and metadata.origin is USER_REQUEST or SUSPEND.

Additionally, if multiple Savepoint resources share the same physical location, it is sufficient if one of them passes and none are incremental (according to metadata.type).
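The type-related condition described above can be expressed as a short predicate. This is an illustrative sketch of the documented rules, not the platform's actual implementation. The function names are hypothetical, and each Savepoint is represented as a plain dictionary with optional "type" and "origin" keys mirroring metadata.type and metadata.origin.

```python
from typing import List, Optional


def effective_type(savepoint: dict) -> str:
    """Treat a missing metadata.type as UNKNOWN, as described above."""
    value: Optional[str] = savepoint.get("type")
    return value if value else "UNKNOWN"


def passes_type_condition(savepoint: dict) -> bool:
    """A single resource passes if it is FULL, or UNKNOWN with a
    USER_REQUEST or SUSPEND origin."""
    kind = effective_type(savepoint)
    if kind == "FULL":
        return True
    if kind == "UNKNOWN":
        return savepoint.get("origin") in {"USER_REQUEST", "SUSPEND"}
    return False


def group_passes_type_condition(savepoints: List[dict]) -> bool:
    """For resources sharing one physical location: at least one must
    pass and none may be INCREMENTAL."""
    none_incremental = all(
        effective_type(sp) != "INCREMENTAL" for sp in savepoints
    )
    return none_incremental and any(
        passes_type_condition(sp) for sp in savepoints
    )
```

For example, a COPIED resource of unknown type does not pass on its own, but the group passes if it shares its location with a resource of type FULL and no resource in the group is INCREMENTAL.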

Note

Some limited checks are performed synchronously, and any failure will result in an immediate failure response. Most conditions are checked asynchronously, however, and any failure will result in the Savepoint moving to state FAILED_DELETION. In both cases, nothing will be deleted.

Force Deletion

To skip the above prerequisites and to ensure that the Savepoint resource will always be removed, regardless of any failures while deleting the underlying data, it is possible to “force delete” the resource. Deletion of the underlying data will be attempted, but regardless of its outcome, deletion of the Savepoint resource will proceed.

Warning

When a resource is force deleted, there are no guarantees that the underlying data will be removed, so force deletion should be used with caution. It may also cause running Jobs to be unable to recover in case of failure if the deleted Savepoint resource represented the latest state.

Force deleting a Savepoint resource referencing an incremental checkpoint will only attempt to remove the part of the underlying data that is used exclusively by this checkpoint. Files which may be used by other incremental checkpoints will not be removed.

Force deletion will also be propagated to all other Savepoint resources that point to the same physical location and force delete them as well.

Methods of Deletion

The following call to the REST API marks the affected Savepoint resource for deletion:

DELETE /api/v1/namespaces/{namespace}/savepoints/{savepointId}[?force=true]

To trigger force deletion, specify the force=true query parameter.

Regular responses will exhibit one of the following status codes:

  • 202: The request for deletion was accepted. Both the Savepoint resource and the underlying physical Apache Flink® savepoint or checkpoint are scheduled for deletion.
    • Additionally, all Savepoint resources with the same physical location will be scheduled for deletion.
  • 409: One of the prerequisites was not met. Check the error message for details.
  • 400: A user error occurred, likely a data type issue. Check the error message for details.
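As a rough illustration of the REST call and its status codes, the sketch below builds the request path and maps a response code to a short description. It performs no HTTP request itself. Both function names are hypothetical, and the host, authentication, and HTTP client are left to the caller.

```python
from urllib.parse import quote


def delete_savepoint_path(
    namespace: str, savepoint_id: str, force: bool = False
) -> str:
    """Build the path for the DELETE call, optionally with force=true."""
    path = "/api/v1/namespaces/{}/savepoints/{}".format(
        quote(namespace, safe=""), quote(savepoint_id, safe="")
    )
    if force:
        path += "?force=true"
    return path


def describe_delete_response(status_code: int) -> str:
    """Map the documented status codes to a short explanation."""
    if status_code == 202:
        return "accepted: resource and underlying data scheduled for deletion"
    if status_code == 409:
        return "conflict: a prerequisite was not met, see the error message"
    if status_code == 400:
        return "bad request: likely a data type issue, see the error message"
    # Any other code is not covered by the list above.
    return "unexpected status code {}".format(status_code)
```

Note that a 202 response only means the deletion was scheduled. Because most prerequisites are checked asynchronously, the Savepoint can still end up in state FAILED_DELETION afterwards.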

Savepoint resources can also be deleted using the web user interface. To do so, navigate to the Snapshots tab of the corresponding Deployment and choose the Delete Snapshot or Force delete Snapshot action.

Note

The Delete Snapshot action is only available for Snapshots in state COMPLETED, FAILED, or FAILED_DELETION.

Warning

Copying or manually creating a Savepoint resource with the same spec.savepointLocation as a Savepoint in state PENDING_DELETION will not stop the deletion and can therefore result in a Savepoint resource pointing to an invalid savepoint location.

Limitations

  • Job-specific Apache Flink® configuration is only picked up if set via the Deployment template. If incremental checkpoints are configured directly in the code of the submitted JAR, this will not be recognized.
  • While Savepoint resources referencing incremental checkpoints can be force deleted, this will not remove the part of the underlying data that is shared with other incremental checkpoints.