Savepoints
A Savepoint resource points to a single savepoint or retained checkpoint in Apache Flink®. Multiple Ververica Platform Savepoint resources can reference a single Apache Flink® savepoint.
For more information, see the official Apache Flink® documentation on savepoints and checkpoints.
Specification
The Restore Strategy of your Deployment resources controls which Savepoint is used to restore the state of an Apache Flink® job.
Ververica Platform tracks only Apache Flink® savepoints created within Ververica Platform.
Savepoint Origins
A Savepoint can be created in various ways. The metadata.origin attribute describes its origin:
- USER_REQUEST: The Savepoint was requested manually by a user through Ververica Platform.
- SUSPEND: The Savepoint was requested when the corresponding Deployment was suspended.
- COPIED: The Savepoint is either a copy of another Savepoint resource, or was created manually using an existing savepointLocation (see below). Both Savepoint resources point to the same physical Apache Flink® savepoint.
- RETAINED_CHECKPOINT: The Savepoint is a retained Apache Flink® checkpoint that was not discarded after the Apache Flink® job was cancelled.
Savepoint States
The status.state attribute describes the current state of a Savepoint resource:
- STARTED: The Savepoint was started but has not completed yet.
- COMPLETED: The Savepoint completed successfully and can be restored from.
- FAILED: Creation of the Savepoint failed. Details on the cause of failure can be found in the status.failure field.
- PENDING_DELETION: The Savepoint was marked for deletion. It is automatically deleted if it meets all prerequisites. It can no longer be restored from.
- DELETING: The Savepoint is currently being deleted. It can no longer be restored from.
- FAILED_DELETION: Deletion of the Savepoint failed. Details on the cause of failure can be found in the status.failure field. It can no longer be restored from. Deletion can be retried.
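As an illustration, the states above can be modelled in client code to decide whether a Savepoint is usable for a restore. This is a minimal sketch, not part of Ververica Platform: the names SavepointState and can_restore_from are hypothetical, and the rule that only COMPLETED is restorable is read from the descriptions above.

```python
from enum import Enum


class SavepointState(Enum):
    """Values of status.state as listed in this section."""
    STARTED = "STARTED"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    PENDING_DELETION = "PENDING_DELETION"
    DELETING = "DELETING"
    FAILED_DELETION = "FAILED_DELETION"


def can_restore_from(state: str) -> bool:
    """Return True only for COMPLETED (hypothetical helper).

    Raises ValueError if the string is not one of the listed states.
    """
    return SavepointState(state) is SavepointState.COMPLETED
```

For example, can_restore_from("COMPLETED") is true, while can_restore_from("PENDING_DELETION") is false.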
Savepoint Types
The metadata.type attribute of a Savepoint resource describes the structure of the underlying savepoint or
checkpoint in Apache Flink®. For more information about incremental checkpoints, see the Apache Flink® documentation.
- INCREMENTAL: The Savepoint resource references an incremental checkpoint.
- FULL: The Savepoint resource references a savepoint or a full checkpoint.
- UNKNOWN: The type of the underlying savepoint or checkpoint is not known.
Savepoint resources created using a version of Ververica Platform prior to 2.5 do not have metadata.type populated.
They are treated as if the type is UNKNOWN.
Savepoint Format Types
When triggering a savepoint through the API, you can specify the format using the appmanager.ververica.com/savepoint.format-type annotation. The supported values are:
- CANONICAL: The default format, compatible with all state backends.
- NATIVE: A format native to the state backend, which might offer improved performance.
To trigger a savepoint with NATIVE format, include the annotation in your request to POST /api/v1/namespaces/{namespace}/savepoints:
metadata:
  annotations:
    appmanager.ververica.com/savepoint.format-type: "NATIVE"
On job suspend, Ververica Platform 2.15.8 still takes CANONICAL savepoints.
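As a rough sketch, the request body for such a trigger could be assembled in Python as follows. The helper name build_savepoint_request is hypothetical, sending the body as JSON is an assumption, and the metadata.deploymentId field is borrowed from the manual-add request shown under "Using the REST API". Only the annotation key and its two values come from this section.

```python
import json

# Annotation key and allowed values as documented in this section.
FORMAT_ANNOTATION = "appmanager.ververica.com/savepoint.format-type"
ALLOWED_FORMATS = ("CANONICAL", "NATIVE")


def build_savepoint_request(deployment_id: str, format_type: str = "CANONICAL") -> str:
    """Return a JSON body for POST /api/v1/namespaces/{namespace}/savepoints.

    Hypothetical helper: it only builds the body and does not send it.
    """
    if format_type not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported savepoint format: {format_type!r}")
    body = {
        "metadata": {
            # Assumption: the target Deployment is identified in metadata.
            "deploymentId": deployment_id,
            "annotations": {FORMAT_ANNOTATION: format_type},
        }
    }
    return json.dumps(body, indent=2)
```

Rejecting unknown format values locally keeps typos such as "native" from reaching the API.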
Requirements
To trigger savepoints, configure a path under which to store them. If Ververica Platform is configured with blob storage, it preconfigures each Deployment for checkpoints, savepoints, and high availability.
Otherwise, add an entry to the flinkConfiguration map with the key state.savepoints.dir:
kind: Deployment
spec:
  template:
    spec:
      flinkConfiguration:
        state.savepoints.dir: s3://flink/savepoints
The blob storage location must be accessible to all nodes of your cluster. If Ververica Platform is configured with blob storage, the platform handles credential distribution transparently and no further action is required. Otherwise, you can use a custom volume mount or a custom filesystem configuration.
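If Deployment specifications are generated by a script, the state.savepoints.dir entry shown above can be added programmatically. The following is a minimal sketch that works on a plain dictionary mirroring the YAML structure. The function name ensure_savepoint_dir is hypothetical, and how the dictionary is loaded from or written back to YAML is left out.

```python
def ensure_savepoint_dir(deployment: dict, path: str) -> dict:
    """Set spec.template.spec.flinkConfiguration["state.savepoints.dir"] if absent.

    Hypothetical helper. It mutates and returns the given dict, creating any
    missing intermediate maps, and never overwrites an existing value.
    """
    flink_conf = (
        deployment.setdefault("spec", {})
        .setdefault("template", {})
        .setdefault("spec", {})
        .setdefault("flinkConfiguration", {})
    )
    flink_conf.setdefault("state.savepoints.dir", path)
    return deployment
```

Calling ensure_savepoint_dir({"kind": "Deployment"}, "s3://flink/savepoints") yields the same structure as the YAML example, and an existing state.savepoints.dir value is left untouched.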
Manually Adding a Savepoint Resource
Savepoints triggered by or through Ververica Platform are automatically added to the Deployment. To restore or start your Deployment from an existing Apache Flink® savepoint or checkpoint that Ververica Platform does not yet track, you must manually add a Savepoint resource to your Deployment. You can do this in either of the following ways:
Using the Ververica Platform User Interface
In the Deployment list view, select the target Deployment. On the Snapshots tab, click Add Savepoint Manually and complete the form.
Using the REST API
Send a request to the following endpoint, specifying the Deployment ID in the request body:
POST /api/v1/namespaces/{namespace}/savepoints
metadata:
  deploymentId: ${deploymentId}
  annotations:
    com.dataartisans.appmanager.controller.deployment.spec.version: ${deploymentSpecVersion}
  type: ${type}
spec:
  savepointLocation: ${savepointLocation}
  flinkSavepointId: ${flinkSavepointId}
status:
  state: COMPLETED
The savepointLocation is required. The flinkSavepointId is optional. If the annotation com.dataartisans.appmanager.controller.deployment.spec.version is not specified, it is set to the spec version of the current Deployment. If type is not specified, it defaults to UNKNOWN. The origin of the Savepoint is always COPIED.
After saving, the Snapshots tab for this Deployment shows that the Deployment starts from this Savepoint.
You can also use the same REST API endpoint to create a new Savepoint by sending a request in which spec is null.
Ensure that the savepointLocation you provide is valid and accessible to the Apache Flink® pods. If the location is invalid or inaccessible, errors appear only during runtime of jobs that try to restore from this location.
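The request body for manually adding a Savepoint can be built with a small helper that enforces the required and optional fields described above. This is a sketch under assumptions: the function name build_manual_savepoint and its parameters are hypothetical, and the result is a plain dictionary that still has to be serialized and sent to the endpoint.

```python
from typing import Optional

SPEC_VERSION_ANNOTATION = (
    "com.dataartisans.appmanager.controller.deployment.spec.version"
)


def build_manual_savepoint(
    deployment_id: str,
    savepoint_location: str,
    flink_savepoint_id: Optional[str] = None,
    savepoint_type: Optional[str] = None,
    spec_version: Optional[str] = None,
) -> dict:
    """Build the body for manually adding a Savepoint (hypothetical helper)."""
    if not savepoint_location:
        raise ValueError("savepointLocation is required")
    metadata = {"deploymentId": deployment_id}
    # Optional fields are omitted so the documented defaults apply.
    if spec_version is not None:
        metadata["annotations"] = {SPEC_VERSION_ANNOTATION: spec_version}
    if savepoint_type is not None:
        metadata["type"] = savepoint_type
    spec = {"savepointLocation": savepoint_location}
    if flink_savepoint_id is not None:
        spec["flinkSavepointId"] = flink_savepoint_id
    return {"metadata": metadata, "spec": spec, "status": {"state": "COMPLETED"}}
```

The helper only checks that a location was given. It cannot verify that the location is reachable from the Apache Flink® pods, so an invalid path still surfaces only when a job tries to restore from it.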
Deleting a Savepoint Resource
You can delete Savepoint resources that are no longer needed to free up space. Ververica Platform automatically deletes the underlying data in blob storage.
You can delete both Savepoint resources that reference Apache Flink® savepoints and those that reference Apache Flink® retained checkpoints.
When you delete a Savepoint resource, Ververica Platform also attempts to delete all other Savepoint resources that point to the same physical location in blob storage.
Prerequisites
To delete a Savepoint resource, the following conditions must be true:
- Universal blob storage is enabled.
- The user requesting the deletion has the editor or owner role inside the Namespace.
Additionally, the following conditions must be true for all Savepoint resources pointing to the same physical location:
- The Savepoint resource is in state COMPLETED, FAILED, or FAILED_DELETION.
- The Savepoint resource references a savepoint or a full checkpoint (precise logic below).
- If the Savepoint resource is associated with an active Deployment, it must not be the latest snapshot (savepoint or checkpoint) to ensure its deletion does not impact the underlying job's failure recovery.
To support Savepoint resources created using Ververica Platform 2.4 or earlier, a Savepoint resource passes the second condition if either of the following is true:
- metadata.type is FULL.
- metadata.type is UNKNOWN or not set, and metadata.origin is USER_REQUEST or SUSPEND.
Additionally, if multiple Savepoint resources share the same physical location, it is sufficient if one of them passes and none are incremental (according to metadata.type).
Ververica Platform performs some of these checks synchronously; if one fails, the request fails immediately with an error response. Most conditions are checked asynchronously; if one fails, the Savepoint moves to state FAILED_DELETION. In both cases, nothing is deleted.
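The type-based prerequisite (the second condition above, including the rule for resources that share a physical location) can be expressed as a short predicate. The sketch below is one reading of the rules on this page, not the platform's implementation. The function names are hypothetical, and each Savepoint resource is reduced to a (metadata.type, metadata.origin) pair, with None standing for an unset type.

```python
from typing import Optional, Sequence, Tuple

# (metadata.type, metadata.origin); type is None when it is not set.
TypeAndOrigin = Tuple[Optional[str], str]


def passes_type_condition(sp_type: Optional[str], origin: str) -> bool:
    """Check one resource against the 'savepoint or full checkpoint' rule."""
    if sp_type == "FULL":
        return True
    if sp_type in (None, "UNKNOWN"):
        return origin in ("USER_REQUEST", "SUSPEND")
    return False  # INCREMENTAL never passes


def group_passes_type_condition(savepoints: Sequence[TypeAndOrigin]) -> bool:
    """Check all resources that share one physical location.

    It is sufficient that one resource passes, provided none is INCREMENTAL.
    """
    if any(sp_type == "INCREMENTAL" for sp_type, _ in savepoints):
        return False
    return any(passes_type_condition(t, o) for t, o in savepoints)
```

For instance, a COPIED resource of type UNKNOWN does not pass on its own, but it passes as part of a group that also contains a FULL resource at the same location.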
Force Deletion
To skip these prerequisites and ensure that the Savepoint resource is always removed regardless of any failures while deleting the underlying data, you can force delete the resource. Ververica Platform attempts to delete the underlying data, but the Savepoint resource is deleted regardless of the outcome.
When a resource is force deleted, Ververica Platform cannot guarantee that the underlying data is removed. Use force deletion with caution. It might also prevent running jobs from recovering in case of failure if the deleted Savepoint resource represented the latest state.
Force deleting a Savepoint resource that references an incremental checkpoint only attempts to remove the part of the underlying data used exclusively by that checkpoint. Files that might be used by other incremental checkpoints are not removed.
Force deletion propagates to all other Savepoint resources that point to the same physical location.
Methods of Deletion
The following REST API call marks the affected Savepoint resource for deletion:
DELETE /api/v1/namespaces/{namespace}/savepoints/{savepointId}[?force=true]
To trigger force deletion, specify the force=true query parameter.
The API returns one of the following status codes:
- 202: The request was accepted. Both the Savepoint resource and the underlying physical Apache Flink® savepoint or checkpoint are scheduled for deletion. All Savepoint resources with the same physical location are also scheduled for deletion.
- 409: A prerequisite was not met. Check the error message for details.
- 400: A user error occurred, likely a data type issue. Check the error message for details.
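A client might build the deletion request path and interpret the documented status codes along these lines. This is a minimal sketch using only the Python standard library: build_delete_path and describe_delete_status are hypothetical names, and the request itself (host, authentication, HTTP client) is deliberately left out.

```python
from urllib.parse import quote


def build_delete_path(namespace: str, savepoint_id: str, force: bool = False) -> str:
    """Return the path for DELETE, adding ?force=true when requested."""
    path = (
        f"/api/v1/namespaces/{quote(namespace, safe='')}"
        f"/savepoints/{quote(savepoint_id, safe='')}"
    )
    if force:
        path += "?force=true"
    return path


# Short summaries of the status codes listed above.
STATUS_MEANINGS = {
    202: "accepted: resource and underlying data scheduled for deletion",
    409: "conflict: a prerequisite was not met",
    400: "bad request: user error, likely a data type issue",
}


def describe_delete_status(code: int) -> str:
    """Map a response status code to its documented meaning."""
    return STATUS_MEANINGS.get(code, "status code not documented here")
```

Because most prerequisites are checked asynchronously, a 202 response only means the deletion was scheduled. The Savepoint can still end up in state FAILED_DELETION afterwards.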
You can also delete Savepoint resources using the web user interface. Navigate to the Snapshots tab for the Deployment and select Delete Snapshot or Force delete Snapshot.
The Delete Snapshot action is only available for Snapshots in state COMPLETED, FAILED, or FAILED_DELETION.
Copying or manually creating a Savepoint resource with the same spec.savepointLocation as a Savepoint in state PENDING_DELETION does not stop the deletion and can result in a Savepoint resource pointing to an invalid savepoint location.
Limitations
- Job-specific Apache Flink® configuration is only applied when set in the Deployment template. If incremental checkpoints are configured directly in the submitted JAR code, Ververica Platform does not recognize this configuration.
- Although you can force delete Savepoint resources that reference incremental checkpoints, force deletion does not remove the part of the underlying data shared with other incremental checkpoints.