Savepoints
A Savepoint resource points to a single savepoint or retained checkpoint in Apache Flink®. A single Apache Flink® savepoint can be referenced by multiple Ververica Platform Savepoint resources.
Please consult the official Apache Flink® documentation for more details on savepoints and checkpoints.
Specification
The Restore Strategy of your Deployment resources controls which Savepoint will be used to restore the state of an Apache Flink® job.
Ververica Platform only keeps track of Apache Flink® savepoints that are created within the Ververica Platform.
Savepoint Origins
A Savepoint can be created in various ways. Its origin is described by the metadata.origin attribute:
USER_REQUEST
: The Savepoint was requested manually by a user through Ververica Platform.
SUSPEND
: The Savepoint was requested when the corresponding Deployment was suspended.
COPIED
: The Savepoint is either a copy of another Savepoint resource, or was created manually using an existing savepointLocation (see below). Both Savepoint resources point to the same physical Apache Flink® savepoint.
RETAINED_CHECKPOINT
: The Savepoint is a retained Apache Flink® checkpoint that was not discarded after the Apache Flink® job was cancelled.
Savepoint States
The current state of a Savepoint resource is described by the status.state attribute:
STARTED
: The Savepoint was started, but is not completed yet.
COMPLETED
: The Savepoint was completed successfully and can be restored from.
FAILED
: Creation of the Savepoint failed. Details on the cause of failure can be found in the status.failure field.
PENDING_DELETION
: The Savepoint was marked for deletion. It will automatically be deleted if it meets all prerequisites. It can no longer be restored from.
DELETING
: The Savepoint is currently being deleted. It can no longer be restored from.
FAILED_DELETION
: Deletion of the Savepoint failed. Details on the cause of failure can be found in the status.failure field. It can no longer be restored from. Deletion can be retried.
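As a minimal sketch of how a client might act on these states, the snippet below models status.state as an enum and adds two checks. The helper names and the representation of a Savepoint as a plain dict are assumptions made for illustration. Treating STARTED and FAILED as not restorable is also an assumption, since the list above only states explicitly that COMPLETED can be restored from.

```python
from enum import Enum


class SavepointState(Enum):
    """Values of status.state as listed above."""
    STARTED = "STARTED"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    PENDING_DELETION = "PENDING_DELETION"
    DELETING = "DELETING"
    FAILED_DELETION = "FAILED_DELETION"


def can_restore_from(savepoint: dict) -> bool:
    """Hypothetical helper: True only if status.state is COMPLETED."""
    state = savepoint.get("status", {}).get("state")
    return state == SavepointState.COMPLETED.value


def can_retry_deletion(savepoint: dict) -> bool:
    """Hypothetical helper: True if a previous deletion attempt failed."""
    state = savepoint.get("status", {}).get("state")
    return state == SavepointState.FAILED_DELETION.value
```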
Savepoint Types
The metadata.type attribute of a Savepoint resource describes the structure of the underlying savepoint or checkpoint in Apache Flink®. More information on incremental checkpoints can be found in the Apache Flink® documentation.
INCREMENTAL
: The Savepoint resource references an incremental checkpoint.
FULL
: The Savepoint resource references a savepoint or a full checkpoint.
UNKNOWN
: The type of the underlying savepoint or checkpoint is not known.
Savepoint resources created using a version of Ververica Platform prior to 2.5 do not have metadata.type populated. They will be treated as if the type was UNKNOWN.
Requirements
Triggering Savepoints requires configuration of a path under which to store savepoints. If Ververica Platform was configured with blob storage, it will preconfigure each Deployment for checkpoints, savepoints and high-availability.
Otherwise, please provide an entry in the flinkConfiguration map with the key state.savepoints.dir:
kind: Deployment
spec:
  template:
    spec:
      flinkConfiguration:
        state.savepoints.dir: s3://flink/savepoints
The provided blob storage location needs to be accessible by all nodes of your cluster. If Ververica Platform was configured with blob storage, the platform will handle the distribution of credentials transparently and no further action is required. Otherwise, you can, for instance, use a custom volume mount or a custom filesystem configuration.
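The following sketch shows one way to check for and set this key on a Deployment that has been loaded into a Python dict (for example from YAML). The function names are hypothetical and not part of Ververica Platform; only the nesting spec.template.spec.flinkConfiguration and the key state.savepoints.dir come from the example above.

```python
import copy


def get_savepoint_dir(deployment: dict):
    """Return state.savepoints.dir from the Deployment template, or None."""
    flink_conf = (
        deployment.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("flinkConfiguration", {})
    )
    return flink_conf.get("state.savepoints.dir")


def with_savepoint_dir(deployment: dict, path: str) -> dict:
    """Return a copy of the Deployment with state.savepoints.dir set to path."""
    result = copy.deepcopy(deployment)
    flink_conf = (
        result.setdefault("spec", {})
        .setdefault("template", {})
        .setdefault("spec", {})
        .setdefault("flinkConfiguration", {})
    )
    flink_conf["state.savepoints.dir"] = path
    return result
```

The copy keeps the input untouched, so the caller can compare the Deployment before and after the change.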
Manually Adding a Savepoint Resource
Savepoints triggered by or through Ververica Platform are automatically added to the Deployment. Yet, in some cases you might want to recover or start your Deployment from a specific Apache Flink® state snapshot that is not yet tracked by Ververica Platform. In that case, you need to manually add a Savepoint resource to your Deployment provided that you already have a savepoint or checkpoint at hand to resume from. This can be done in either of the following ways:
Using the Ververica Platform user interface
In the Deployment list view, select the Deployment you want your Savepoint to be added to. In the Snapshots Tab, find the Add Savepoint Manually button and fill out the form that opens.
Using the REST API
Send a request with a body like the example below to the following endpoint, specifying the ID of the Deployment to add the Savepoint to:
POST /api/v1/namespaces/{namespace}/savepoints
metadata:
  deploymentId: ${deploymentId}
  annotations:
    com.dataartisans.appmanager.controller.deployment.spec.version: ${deploymentSpecVersion}
  type: ${type}
spec:
  savepointLocation: ${savepointLocation}
  flinkSavepointId: ${flinkSavepointId}
status:
  state: COMPLETED
Using either method, savepointLocation is required. The flinkSavepointId is optional. If the Deployment annotation com.dataartisans.appmanager.controller.deployment.spec.version is not specified, it will be set to that of the current Deployment. The type of the Savepoint will default to UNKNOWN. The origin of the Savepoint will be COPIED.
Afterwards the web user interface for this Deployment will show (in the Snapshots Tab) that the Deployment will be started from this Savepoint.
The same REST API endpoint can be used to create new Savepoints by sending spec == null.
You have to ensure that the provided savepointLocation is valid and accessible by the Apache Flink® pods. If this is not the case, you will notice errors only during runtime of any jobs that try to restore from this location.
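The request described in this section can be assembled as in the sketch below. It only builds the path and body and sends nothing. The function name is hypothetical, and serializing the body as JSON is an assumption, since the example body above is shown as YAML. Optional fields are left out when not provided, so the platform defaults described above apply.

```python
import json
from urllib.parse import quote

SPEC_VERSION_ANNOTATION = (
    "com.dataartisans.appmanager.controller.deployment.spec.version"
)


def build_add_savepoint_request(namespace, deployment_id, savepoint_location,
                                flink_savepoint_id=None, savepoint_type=None,
                                deployment_spec_version=None):
    """Return (path, body) for manually adding a Savepoint resource."""
    if not savepoint_location:
        raise ValueError("savepointLocation is required")

    metadata = {"deploymentId": deployment_id}
    if deployment_spec_version is not None:
        metadata["annotations"] = {
            SPEC_VERSION_ANNOTATION: str(deployment_spec_version)
        }
    if savepoint_type is not None:
        metadata["type"] = savepoint_type

    spec = {"savepointLocation": savepoint_location}
    if flink_savepoint_id is not None:
        spec["flinkSavepointId"] = flink_savepoint_id

    body = {
        "metadata": metadata,
        "spec": spec,
        "status": {"state": "COMPLETED"},
    }
    path = f"/api/v1/namespaces/{quote(namespace, safe='')}/savepoints"
    return path, json.dumps(body)
```

The returned path and body can be passed to any HTTP client together with the base URL and credentials of your installation.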
Deleting a Savepoint Resource
Savepoint resources which are no longer needed can be deleted to free up space. The underlying data in the configured blob storage will be deleted automatically.
Savepoint resources referencing Apache Flink® savepoints and those referencing Apache Flink® retained checkpoints can both be deleted.
When deleting a Savepoint resource, Ververica Platform will also attempt to delete all other Savepoint resources that point to the same physical location in blob storage.
Prerequisites
To delete a Savepoint resource, the following conditions must be true:
- Universal blob storage is enabled.
- The user requesting the deletion has the editor or owner role inside the Namespace.
Additionally, the below conditions must be true for all Savepoint resources pointing to the same physical location:
- The Savepoint resource is in state COMPLETED, FAILED, or FAILED_DELETION.
- The Savepoint resource references a savepoint or a full checkpoint (precise logic below).
- If the Savepoint resource is associated with an active Deployment, it must not be the latest snapshot (savepoint or checkpoint) to ensure its deletion will not impact the underlying Job's failure recovery.
In order to provide better handling of Savepoint resources created using Ververica Platform 2.4 or below, a Savepoint resource passes the second condition if either of the following is true:
- metadata.type is FULL.
- metadata.type is UNKNOWN or not set, and metadata.origin is USER_REQUEST or SUSPEND.
Additionally, if multiple Savepoint resources share the same physical location, it is sufficient if one of them passes this condition and none are incremental (according to metadata.type).
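The state and type conditions above can be sketched as a check over all Savepoint resources that share a physical location. This illustrates the rules as described here and is not the platform's implementation. The function names are hypothetical, and the role, blob storage, and latest-snapshot conditions are deliberately left out.

```python
DELETABLE_STATES = {"COMPLETED", "FAILED", "FAILED_DELETION"}
FULL_SNAPSHOT_ORIGINS = {"USER_REQUEST", "SUSPEND"}


def passes_type_check(savepoint: dict) -> bool:
    """True if the resource is known or assumed to be a savepoint or full checkpoint."""
    metadata = savepoint.get("metadata", {})
    savepoint_type = metadata.get("type")
    if savepoint_type == "FULL":
        return True
    if savepoint_type in (None, "UNKNOWN"):
        # Resources from older platform versions: fall back to the origin.
        return metadata.get("origin") in FULL_SNAPSHOT_ORIGINS
    return False


def group_is_deletable(savepoints: list) -> bool:
    """Check all Savepoint resources pointing to the same physical location."""
    if not savepoints:
        return False
    for savepoint in savepoints:
        # Every resource must be in a deletable state.
        if savepoint.get("status", {}).get("state") not in DELETABLE_STATES:
            return False
        # No resource in the group may be incremental.
        if savepoint.get("metadata", {}).get("type") == "INCREMENTAL":
            return False
    # It is sufficient if one resource passes the type check.
    return any(passes_type_check(s) for s in savepoints)
```

For example, a COPIED resource of type UNKNOWN fails the check on its own, but passes when grouped with the FULL resource it was copied from.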
Some limited checks are performed synchronously, and any failure will result in an immediate failure response. Most conditions are checked asynchronously, however, and any failure will result in the Savepoint moving to state FAILED_DELETION. In both cases, nothing will be deleted.
Force Deletion
To skip the above prerequisites and to ensure that the Savepoint resource will always be removed, regardless of any failures while deleting the underlying data, it is possible to "force delete" the resource. Deletion of the underlying data will be attempted, but regardless of its outcome, deletion of the Savepoint resource will proceed.
When a resource is force deleted there are no guarantees that the underlying data will be removed, so it should be used with caution. It may also cause running Jobs to be unable to recover in case of failure if the deleted Savepoint resource represented the latest state.
Force deleting a Savepoint resource referencing an incremental checkpoint will only attempt to remove the part of the underlying data that is used exclusively by this checkpoint. Files which may be used by other incremental checkpoints will not be removed.
Force deletion will also be propagated to all other Savepoint resources that point to the same physical location and force delete them as well.
Methods of deletion
The following call to the REST API marks the affected Savepoint resource for deletion:
DELETE /api/v1/namespaces/{namespace}/savepoints/{savepointId}[?force=true]
To trigger force deletion, specify the force=true query parameter.
Regular responses will exhibit one of the following status codes:
- 202: The request for deletion was accepted; both the Savepoint resource and the underlying physical Apache Flink® savepoint or checkpoint are scheduled for deletion. Additionally, all Savepoint resources with the same physical location will be scheduled for deletion.
- 409: One of the prerequisites was not met. Check the error message for details.
- 400: A user error occurred, likely a data type issue. Check the error message for details.
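A client could build the deletion path and interpret the documented status codes roughly as follows. This is a sketch under assumptions: the function names are hypothetical, no request is sent, and any status code not listed above is simply reported as unexpected.

```python
from urllib.parse import quote, urlencode


def build_delete_path(namespace: str, savepoint_id: str, force: bool = False) -> str:
    """Return the path for deleting a Savepoint resource, optionally forced."""
    path = (
        f"/api/v1/namespaces/{quote(namespace, safe='')}"
        f"/savepoints/{quote(savepoint_id, safe='')}"
    )
    if force:
        path += "?" + urlencode({"force": "true"})
    return path


def describe_delete_response(status_code: int) -> str:
    """Map the documented status codes to a short description."""
    meanings = {
        202: "accepted: resource and underlying data scheduled for deletion",
        409: "conflict: a prerequisite was not met, check the error message",
        400: "bad request: user error, check the error message",
    }
    return meanings.get(status_code, "unexpected status code")
```

Note that a 202 response only confirms that deletion was scheduled. Because most prerequisites are checked asynchronously, the Savepoint may still end up in state FAILED_DELETION afterwards.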
Savepoint resources can also be deleted using the web user interface. To do so, navigate to the Snapshots tab of the corresponding Deployment and choose the Delete Snapshot or Force delete Snapshot action.
The Delete Snapshot action is only available for Snapshots in state COMPLETED, FAILED, or FAILED_DELETION.
Copying or manually creating a Savepoint resource with the same spec.savepointLocation as a Savepoint in state PENDING_DELETION will not stop the deletion and can therefore result in a Savepoint resource pointing to an invalid savepoint location.
Limitations
- Job-specific Apache Flink® configuration is only picked up if set via the Deployment template. If incremental checkpoints are configured directly in the code of the submitted JAR, this will not be recognized.
- While Savepoint resources referencing incremental checkpoints can be force deleted, this will not remove the part of the underlying data that is shared with other incremental checkpoints.