Production Readiness Checklist

Applies to: BYOC, Self-Managed v2

This page provides a practical production readiness checklist for Ververica Platform deployments and Flink applications. It is intended to be used before go-live for both self-managed platform deployments and BYOC environments.

Info

How to use this page

Use this checklist as a go/no-go gate before production. Items marked REQUIRED should be considered mandatory for production. Items marked RECOMMENDED reduce operational risk significantly. Items marked OPTIONAL are maturity improvements.

Scope and operating principle

This checklist follows a strict production mindset: some configurations may be acceptable in development, but they should not be tolerated in production. The goal is to make the boundary explicit.

Deployment model	Who manages Kubernetes	Who manages the platform	Typical scope
VVP Self-Managed	Customer	Customer	Platform, infrastructure, networking, security, workloads
VVC BYOC	Customer	Shared between Ververica control plane and customer data plane	Cluster, networking, IAM, storage, workloads, observability

Use the checklist in order: infrastructure first, then connectivity, security, storage, application settings, observability, deployment strategy, and recovery.

Kubernetes and infrastructure

Your Kubernetes cluster is the operational foundation of the platform. Wrong instance families, undersized disks, missing autoscaling, or poor workload isolation typically surface only under load or during recovery.

Check	Priority	Development tolerance	Production expectation	What to validate
Kubernetes version	REQUIRED	Recent supported version	Use a currently supported Kubernetes version for your Ververica release	Cluster version, node pool compatibility, add-on compatibility
Node groups / VM scale sets	REQUIRED	Single shared node pool is acceptable	Use managed node groups or VM scale sets with clear ownership and scaling boundaries	Separate node pools for platform and Flink workloads where possible
Instance family sizing	REQUIRED	General purpose instances are acceptable	Use memory-optimized instances for state-heavy TaskManagers and stable compute for platform services and JobManagers	CPU, memory, ephemeral storage, expected state size, checkpoint profile
Cluster autoscaling	RECOMMENDED	Manual scaling is acceptable	Configure Cluster Autoscaler or equivalent with sane min/max boundaries	Scale-out behavior, drain behavior, interaction with Flink recovery
Storage class	REQUIRED	Default storage class is acceptable	Use performant block storage suitable for sustained checkpointing and platform persistence	Storage class type, IOPS profile, latency consistency
Workload isolation	RECOMMENDED	Shared nodes are acceptable	Isolate Flink workloads from core platform services using taints, tolerations, labels, or dedicated node pools	Noisy-neighbor risk, eviction pressure, capacity contention
Multi-AZ placement	RECOMMENDED	Single zone is acceptable	Distribute nodes across availability zones for resilience	Node spread, anti-affinity, failure domain exposure
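
The workload isolation and multi-AZ rows above can be expressed with standard Kubernetes scheduling fields. The fragment below is a minimal sketch, not a prescribed layout: the `workload-type` label and taint key, the `flink` value, and the `app: flink-taskmanager` pod label are hypothetical names chosen for illustration, and how the fragment attaches to your Flink pods depends on your platform version.

```yaml
# Pod-level scheduling fragment using plain Kubernetes fields.
# "workload-type", "flink", and "flink-taskmanager" are hypothetical names.
nodeSelector:
  workload-type: flink            # schedule only onto the dedicated Flink node pool
tolerations:
  - key: workload-type            # matches a taint applied to the Flink node pool
    operator: Equal
    value: flink
    effect: NoSchedule
topologySpreadConstraints:
  - maxSkew: 1                    # keep zone counts within one pod of each other
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: flink-taskmanager
```

The taint keeps platform services off the Flink nodes, the node selector keeps Flink pods on them, and the spread constraint addresses failure domain exposure without blocking scheduling when a zone is short on capacity.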

Info

Practical rule of thumb

Size TaskManagers for peak load and recovery, not for average utilization. Stateful jobs fail during peaks, backpressure, and recovery events, not during calm periods.

Networking and connectivity

For BYOC, networking is a first-class production topic. The most common deployment issue is blocked or incomplete egress. The Ververica agent uses an outbound-only secure connection model, so the key question is whether your cluster can reliably reach all required endpoints.

Check	Priority	Development tolerance	Production expectation	What to validate
Outbound connectivity to Ververica control plane	REQUIRED		Cluster can reach Ververica endpoints on port 443	Egress rules, proxies, TLS inspection behavior
Container registry access	REQUIRED		Nodes and pods can pull required images	Registry reachability, credentials, rate limits
DNS resolution	REQUIRED		Internal and external names resolve consistently	CoreDNS health, external resolution, private zones
Private connectivity to data sources and sinks	RECOMMENDED		Prefer private links or private endpoints for sensitive systems	PrivateLink, VPC peering, private endpoints, routing
Proxy compatibility	RECOMMENDED		Proxy does not break TLS or WebSocket upgrade behavior	HTTP proxy variables, no-proxy exceptions, WebSocket support
Firewall / NSG rule review	REQUIRED		Only the required outbound paths are allowed; no unnecessary inbound paths are opened	Direction, port, protocol, destination, owner

Recommended networking questions to answer before production

Which Ververica endpoints must be reachable from the cluster?
Which source and sink systems are accessed by Flink jobs, over which ports and protocols?
Does any proxy, TLS inspection layer, or egress gateway alter long-lived secure connections?
Are DNS and routing paths identical across all worker nodes?
Is cross-region access part of the design, and if yes, what is the latency impact?
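
Once the questions above are answered, the allowed outbound paths can be written down as a Kubernetes NetworkPolicy. The following is a sketch under stated assumptions: the policy name, the `flink-jobs` namespace, and the `203.0.113.0/24` range are placeholders (the range is a documentation-only block), and the real destination ranges must come from your own endpoint inventory. Standard NetworkPolicy rules match IP blocks and selectors, not hostnames, so hostname-based egress control needs a proxy or a CNI-specific feature.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: flink-egress-baseline      # hypothetical name
  namespace: flink-jobs            # hypothetical namespace
spec:
  podSelector: {}                  # applies to every pod in the namespace
  policyTypes:
    - Egress                       # egress not listed below is denied
  egress:
    # DNS lookups against the cluster DNS service
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # Outbound HTTPS to verified destinations only
    - to:
        - ipBlock:
            cidr: 203.0.113.0/24   # placeholder range, replace with verified ranges
      ports:
        - protocol: TCP
          port: 443
```

Sources, sinks, object storage, and registries would each need their own rule in the same style.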

Identity, access, and secrets

Production readiness is not only about scale and uptime. It also means reducing operational and security risk: strong authentication, least privilege, short-lived credentials, and safe secret delivery.

Check	Priority	Development tolerance	Production expectation	What to validate
SSO integration	REQUIRED	Local authentication may be acceptable	Use enterprise SSO via OIDC or SAML	Login flow, group mapping, logout behavior, MFA policy
Namespace RBAC	REQUIRED	Broad permissions may be acceptable	Use role separation with clear namespace ownership	Owner/Admin/Editor/Viewer mapping, access reviews
API token governance	RECOMMENDED	Long-lived tokens are acceptable	Scope tokens narrowly, define expiry and rotation procedures	Token inventory, expiry policy, automation users
Cloud IAM integration	REQUIRED	Static credentials may be acceptable	Use IRSA, Workload Identity, or equivalent federated identity patterns	No embedded cloud keys in manifests or Helm values
Secret delivery	REQUIRED	Kubernetes secrets only may be acceptable	Prefer external secret stores or managed secret delivery patterns for sensitive credentials	Vault, Secrets Manager, Key Vault, mounted files, rotation workflow
Encryption in transit	REQUIRED	Plaintext may be acceptable only in isolated local development	Enable TLS for Flink components and exposed REST/UI endpoints	Certificate management, internal RPC settings, ingress or load balancer TLS, REST/UI access paths
Kubernetes NetworkPolicies	RECOMMENDED	Open namespace networking may be acceptable	Restrict pod-to-pod and egress traffic to required platform, source, sink, and observability paths	Default deny posture, allowed namespaces, DNS, object storage, registries, and monitoring endpoints
Non-root execution	RECOMMENDED	Not always enforced	Run containers as non-root wherever supported and align with Pod Security Standards	SecurityContext coverage, admission policies, platform version caveats
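
As an illustration of the Cloud IAM integration row, federated identity on Amazon EKS (IRSA) is typically wired through an annotation on the Kubernetes ServiceAccount that the Flink pods run under. This is a minimal sketch: the ServiceAccount name and namespace are hypothetical, and the account ID and role name are left as placeholders.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: flink-jobs                 # hypothetical ServiceAccount name
  namespace: flink-jobs            # hypothetical namespace
  annotations:
    # IRSA: pods using this ServiceAccount receive short-lived credentials for the role
    eks.amazonaws.com/role-arn: arn:aws:iam::<account-id>:role/<role-name>
```

On GKE, Workload Identity uses the `iam.gke.io/gcp-service-account` annotation in the same position. In both cases no cloud keys appear in manifests or Helm values, which is the property the checklist asks you to validate.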

Note

Important

Do not treat “we will secure it later” as a production strategy. Authentication, IAM, and secret handling are part of production readiness, not post-go-live hardening.

Object storage and artifacts

For Flink, durable object storage is part of the runtime control plane. It stores checkpoints, savepoints, and often deployment artifacts. Poor storage design directly impacts reliability and recovery.

Check	Priority	Development tolerance	Production expectation	What to validate
Durable checkpoint and savepoint storage	REQUIRED		Use S3, ADLS Gen2, or another durable object store	Checkpoint path, savepoint path, durability guarantees
Separation of paths or buckets	RECOMMENDED		Separate artifacts, checkpoints, and savepoints logically	Path structure, bucket ownership, IAM boundaries
Checkpoint retention	REQUIRED		Retain multiple checkpoints, not just the latest one	state.checkpoints.num-retained and storage cleanup rules
Externalized checkpoint retention on cancel	REQUIRED		Retain checkpoints on cancellation	RETAIN_ON_CANCELLATION setting
Encryption at rest	RECOMMENDED		Enable provider-managed or customer-managed encryption	SSE-S3, SSE-KMS, Azure Storage encryption
Storage performance validation	RECOMMENDED		Validate throughput under realistic checkpoint load	Checkpoint size, duration, timeout margin, concurrent job behavior
Artifact versioning	RECOMMENDED		Use immutable or versioned artifact storage	Versioning policy, naming convention, rollback availability

Flink application configuration

This is the heart of production readiness for Flink workloads. Most severe runtime failures are not caused by exotic bugs; they come from a handful of missing or unsafe production settings.

Check	Priority	Development tolerance	Production expectation	What to validate
High availability enabled	REQUIRED		Set high-availability.type: kubernetes or the equivalent supported HA mode	Deployment config, high-availability.storageDir, and recovery test
Checkpointing enabled	REQUIRED		Checkpointing is active with realistic interval and timeout values	Interval, timeout, min pause, tolerated failures
State backend selection	REQUIRED		Use RocksDB or Gemini/VERA backend for large state workloads	State size profile, memory model, checkpoint behavior, and state.backend.type usage
Incremental checkpoints	RECOMMENDED		Enable for large-state workloads when supported	Backend compatibility, checkpoint size evolution
Exactly-once sink semantics	REQUIRED	At-least-once or idempotent-only sinks may be acceptable for non-critical tests	Use exactly-once-capable sinks or transactional connectors where correctness requires it, with transaction timeout safely above checkpoint duration and interval	Connector semantics, transaction timeout, checkpoint interval and timeout, failure recovery, duplicate and loss behavior
State schema and serializer evolution	REQUIRED	Breaking changes may be acceptable before state is durable	Plan compatible state schema and serializer evolution for upgrades and savepoint restores	Serializer snapshots, Avro/Protobuf compatibility rules, savepoint restore test, fallback or migration path
Restart strategy	REQUIRED		Use a bounded automated restart strategy, not “no restart”	Failure rate, delay, escalation path
Stable operator UIDs	REQUIRED		All stateful operators have explicit, stable UIDs	Code review for uid() usage
Explicit maxParallelism	RECOMMENDED		Set intentionally on stateful operators	Code review and scaling plan
State TTL for SQL jobs	RECOMMENDED		Configure TTL where state can otherwise grow without bound	table.exec.state.ttl and expected retention semantics
Restore mode strategy	RECOMMENDED		Choose restore mode according to deployment strategy	LATEST_STATE, NO_CLAIM, upgrade flow
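
Several rows in this table map to Flink configuration keys that are easy to leave at their defaults. The fragment below is a sketch with illustrative values only; the numbers are assumptions to be replaced by your own scaling plan and retention requirements. Operator UIDs are not covered here because they are set in application code through `uid()`.

```yaml
# Checkpointing mode used by exactly-once-capable sinks
execution.checkpointing.mode: EXACTLY_ONCE
# Job-wide default max parallelism; illustrative value, choose it from the scaling plan
pipeline.max-parallelism: 720
# Idle state retention for SQL jobs; illustrative value, the default keeps state indefinitely
table.exec.state.ttl: 24h
```

A job-wide `pipeline.max-parallelism` is a fallback, not a replacement for setting max parallelism deliberately on stateful operators where the scaling plan requires it.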

Info

Production principle

If a job cannot recover cleanly from its latest durable state in a staging test, it is not production ready.

Resource sizing and memory

Memory and CPU sizing should be treated as engineering inputs, not default values left untouched. Production problems usually appear during backpressure, spikes, rescaling, or restore from state. Keep this checklist at go/no-go level; use the companion section Resource sizing guidance (Ververica Platform) below for prescriptive starting points, and validate them through the Load testing item in this table.

Check	Priority	Development tolerance	Production expectation	What to validate
JobManager memory	REQUIRED		Size for job graph complexity, metadata pressure, and recovery operations	Heap usage, GC, failover behavior
TaskManager memory	REQUIRED		Size according to state size, operator footprint, network buffers, and managed memory needs	Heap, managed memory, off-heap, disk spill
Unaligned checkpoints under backpressure	RECOMMENDED	Aligned checkpoints are acceptable for simple or low-pressure pipelines	Enable execution.checkpointing.unaligned.enabled for pipelines where sustained backpressure makes aligned checkpoints unstable or too slow	Backpressure metrics, checkpoint alignment time, checkpoint size increase, recovery behavior
CPU requests and limits	RECOMMENDED		Set deliberately rather than reusing a single default ratio everywhere	Requests, limits, throttling metrics
Slots per TaskManager	RECOMMENDED		Prefer simple slot layouts unless you have measured reasons not to	Slot strategy vs. workload shape
Namespace quotas and limits	RECOMMENDED		Protect the cluster from a single runaway deployment	Quota definitions, observed saturation patterns
Load testing	RECOMMENDED		Run a realistic pre-production benchmark or shadow test	Volume, cardinality, state growth, sink behavior, restore time
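
For the unaligned checkpoints row, one conservative pattern is to keep checkpoints aligned by default and let Flink switch to unaligned only when alignment stalls. The sketch below assumes a Flink version that accepts the `execution.checkpointing.unaligned.enabled` key named in the table; the 30-second timeout is an illustrative value, not a recommendation.

```yaml
# Allow unaligned checkpoints for this job
execution.checkpointing.unaligned.enabled: true
# Start each checkpoint aligned; fall back to unaligned if alignment exceeds this duration
execution.checkpointing.aligned-checkpoint-timeout: 30s
```

Unaligned checkpoints persist in-flight data, so compare checkpoint size and restore time before and after the change.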

Resource sizing guidance (Ververica Platform)

Note

Required framing

All numeric values in this section are starting points to validate by load test, not universal sizing rules. Stateful job sizing depends on workload shape, key cardinality, state growth, checkpoint profile, source and sink behavior, and recovery objectives. Use this section together with the Load testing checklist item above; adjust only from measured signals such as OOM kills, RocksDB/Gemini spill or compaction pressure, checkpoint duration, backpressure, restart time, and CPU throttling.

Deployment Templates and namespace defaults

Check	Priority	Guidance	Signals to observe
Deployment Templates	RECOMMENDED	Use Ververica Platform Deployment Templates to standardize namespace-level defaults for Deployment sizing: JobManager replicas, JobManager and TaskManager resources, parallelism, number of TaskManagers, and slots per TaskManager. Keep defaults conservative and explicit, then require a documented exception when a Deployment deviates from the namespace profile.	Repeated manual overrides, inconsistent resource settings across similar jobs, quota saturation, failed scheduling, and benchmark deltas between template defaults and actual load-test needs.
Exception process	RECOMMENDED	Require the owner to state why the Deployment needs a different profile, which benchmark supports the change, and when the exception will be reviewed. Treat higher memory, higher CPU limits, extra TaskManagers, or unusual slot layouts as operational exceptions rather than silent one-offs.	CPU throttling, OOM kills, checkpoint timeout margin, backpressure, restore duration, and cost or quota impact after the exception is applied.

JobManager sizing

Check	Priority	Starting guidance	Signals to observe
JobManager replicas	REQUIRED	For production Deployments, use more than one JobManager replica where Ververica Platform HA is enabled for the namespace or Deployment. Start with 2 replicas for HA and validate by failover testing; increase only when the platform guidance for the selected VVP version and operating model calls for it.	Leader failover time, job recovery time, failed or slow leadership transitions, restart loops, and whether recovery meets the documented RTO.
JobManager resources	REQUIRED	Size JobManager CPU and memory through the Deployment resources exposed by Ververica Platform, not by unmanaged runtime files. Start small for simple graphs, then increase for large SQL plans, many operators, high checkpoint metadata volume, frequent failover testing, or complex savepoint and upgrade operations.	JobManager heap pressure, GC pauses, metadata growth, slow checkpoint coordination, slow deployment submission, and failover or restore operations that exceed the expected window.

TaskManager resources and memory model exposed by VVP

Check	Priority	Guidance	Signals to observe
TaskManager resources	REQUIRED	Set TaskManager sizing through the Deployment resources fields in Ververica Platform. Treat CPU and memory as Deployment-level capacity inputs that must be benchmarked against expected event rate, key cardinality, state size, checkpoint interval, and recovery target.	Container OOM kills, pod evictions, JVM memory pressure, RocksDB/Gemini memory pressure, disk spill, checkpoint duration, and sustained backpressure.
Memory split	REQUIRED	Review memory in terms of the Flink memory components surfaced through VVP: framework and task heap for JVM objects and user code, managed memory for state backends such as RocksDB/Gemini and for batch or sort workloads, network memory for shuffle buffers, and off-heap memory plus JVM overhead for native allocations and process overhead. For state-heavy jobs, do not size from the Kubernetes memory value alone; validate that the managed-memory and off-heap portions leave enough room for the backend and checkpoint activity.	OOM kills without high Java heap, RocksDB/Gemini native memory pressure, compaction or write stalls, spill growth, checkpoint alignment time, network buffer exhaustion, and GC pauses.
State-size interaction	RECOMMENDED	Use observed state size and growth rate to validate TaskManager memory. Large state does not imply all state is resident in memory, but larger keyed state increases backend working set, checkpoint metadata and I/O pressure, and recovery time. Start with headroom, then reduce only after benchmark evidence.	State growth trend, checkpoint size, checkpoint duration and timeout margin, restore time, local disk utilization, backend cache hit behavior where available, and backpressure during checkpointing.

Parallelism, TaskManagers, and slots

Check	Priority	Guidance	Signals to observe
Capacity relationship	REQUIRED	In VVP Deployment sizing, total available slots come from numberOfTaskManagers multiplied by TaskManager slots. The job parallelism consumes those slots according to the execution graph. Keep the relationship explicit in templates so a change in parallelism is reviewed together with TaskManager count and slot layout.	Insufficient slots, unscheduled Deployments, idle slots, uneven subtask load, backpressure isolated to a few subtasks, and restore or rescale duration.
Slot layout	RECOMMENDED	Prefer simple layouts as starting points: 1 slot per TaskManager for state-heavy or isolation-sensitive jobs; 2 to 4 slots per TaskManager for lighter stateless or throughput-oriented jobs only when benchmarks show good resource sharing. Avoid high slot counts per TaskManager unless measured CPU, memory, and network behavior justify them.	Per-subtask backpressure, CPU saturation, heap and managed-memory contention, network buffer pressure, noisy-neighbor effects between operators, and checkpoint duration changes after changing slot count.
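
The capacity relationship above can be made explicit in a Deployment definition. The fragment below is a sketch of how the relevant fields might sit together in a Ververica Platform Deployment spec; the field paths reflect the Deployment resource as commonly documented and should be verified against the API reference for your VVP version. All values are illustrative and taken from the state-heavy end of the guidance.

```yaml
# Sketch only: verify field names and nesting against your VVP version.
spec:
  template:
    spec:
      parallelism: 8               # the job needs 8 slots
      numberOfTaskManagers: 8      # 8 TaskManagers x 1 slot each = 8 available slots
      resources:
        taskmanager:
          cpu: 4                   # illustrative starting point
          memory: 16G              # illustrative starting point
      flinkConfiguration:
        taskmanager.numberOfTaskSlots: "1"
```

Doubling `parallelism` here without touching `numberOfTaskManagers` or the slot count would leave the job short of slots, which is why the three values should be reviewed as one change.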

Requests, limits, and throttling

Check	Priority	Guidance	Signals to observe
Requests and limits	RECOMMENDED	Use the Ververica Platform resource model and limit factors to derive Kubernetes requests and limits consistently from Deployment resources. Keep limit factors explicit in namespace defaults so teams understand how much burst capacity a Deployment receives and what the scheduler reserves.	Failed scheduling, node overcommit, memory limit OOM kills, CPU throttling, and mismatch between requested capacity and observed steady-state usage.
CPU throttling	REQUIRED	Do not treat CPU limits as harmless guardrails for latency-sensitive streaming jobs. If the limit factor is too tight, CPU throttling can look like application backpressure, checkpoint slowness, or sink lag. Validate CPU limits under peak input rate and during recovery.	Container CPU throttled time, busy time, source lag, checkpoint duration, backpressure, restart duration, and sink latency during load tests.
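
To make the effect of limit factors concrete, the sketch below shows the Kubernetes container resources that would result for a TaskManager declared with 4 CPU and 16 GiB, assuming hypothetical limit factors of 1.5 for CPU and 1.0 for memory. The factors are assumptions for illustration; check how your platform version derives requests and limits before relying on the numbers.

```yaml
# Resulting container resources under the assumed factors (CPU 1.5, memory 1.0).
resources:
  requests:
    cpu: "4"          # what the scheduler reserves
    memory: 16Gi
  limits:
    cpu: "6"          # 4 CPU x 1.5: burst ceiling before throttling starts
    memory: 16Gi      # 16 GiB x 1.0: exceeding this triggers an OOM kill
```

A memory factor of 1.0 keeps the pod's requests and limits equal for memory, which avoids node-level memory overcommit; the CPU factor is the knob that trades burst headroom against throttling risk.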

Reference profiles to benchmark

The profiles below are starting points to validate by benchmark. They are intended for Deployment Templates and exception discussions, not as universal production defaults.

Profile	Starting point to validate by benchmark	When to use	Signals to confirm or adjust
Small / simple	JobManager: 2 replicas, about 0.5 to 1 CPU and 1 to 2 GiB memory. TaskManagers: 1 to 2 TaskManagers, 1 to 2 slots each, about 1 to 2 CPU and 2 to 4 GiB memory per TaskManager.	Low-volume jobs, limited state, simple topology, non-critical throughput requirements.	OOM kills, GC pressure, checkpoint duration, source lag, idle slots, and any sustained backpressure during the Load testing benchmark.
Stateful-heavy	JobManager: 2 replicas, about 1 to 2 CPU and 2 to 4 GiB memory. TaskManagers: start with 1 slot per TaskManager, 4 to 8 CPU and 16 to 64 GiB memory per TaskManager; scale numberOfTaskManagers with required parallelism and state distribution.	Large keyed state, RocksDB/Gemini backend, long retention, expensive checkpoints, or strict recovery objectives.	Container OOM kills, RocksDB/Gemini native memory pressure, spill or compaction pressure, checkpoint size and duration, checkpoint timeout margin, restore time, and backpressure during checkpointing.
High-throughput	JobManager: 2 replicas, about 1 to 2 CPU and 2 to 4 GiB memory. TaskManagers: start with 2 to 4 slots per TaskManager, about 4 to 8 CPU and 8 to 32 GiB memory per TaskManager; increase parallelism and numberOfTaskManagers together after measuring source, network, and sink bottlenecks.	High event rate, lighter per-key state, CPU or network-heavy operators, demanding source and sink throughput.	CPU saturation or throttling, network buffer pressure, backpressure, source lag, sink latency, checkpoint alignment time, and throughput plateau despite added parallelism.

Info

Production principle

A sizing profile is production-ready only after it survives the documented Load testing scenario, including peak traffic, checkpointing, failure and restore, and sink behavior. If the benchmark changes the conclusion, update the Deployment Template rather than relying on tribal knowledge.

Observability and alerting

You should never learn that a streaming application is unhealthy from a downstream business team. Production readiness requires metrics, logs, dashboards, and alerts before the first incident.

Check	Priority	Development tolerance	Production expectation	What to validate
Metrics collection	REQUIRED		Prometheus or equivalent collects JM and TM metrics	Scrape targets, labels, retention, dashboard usability
Dashboards	RECOMMENDED		Grafana or equivalent dashboards exist for platform and jobs	Coverage for platform, jobs, and storage health
Alerting	REQUIRED		Alerts exist for checkpoint failures, sustained backpressure, restart loops, lag growth, and OOM kills	Thresholds, routing, on-call ownership
Log aggregation	RECOMMENDED		Centralized logs for JM, TM, and platform components	Retention, searchability, pod restart coverage
BYOC agent monitoring	REQUIRED		Agent health and control-plane connectivity are monitored	Pod health, reconnect behavior, tunnel status
Audit logging	RECOMMENDED		Enable and retain audit logs where supported	Retention policy, SIEM export, access review

Minimum alert set worth defining before go-live

Checkpoint failures above a low threshold
Checkpoint duration trending upward beyond normal baseline
Sustained backpressure on any critical operator
Source lag growing continuously
Repeated job restarts within a short interval
Container OOM kills or pod evictions
Loss of BYOC agent connectivity
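
Part of the alert set above can be sketched as Prometheus alerting rules. The example assumes Flink's Prometheus reporter with default scope formats, which typically exposes metrics under names like those used below; metric names, thresholds, durations, and the `severity` label are all assumptions to adapt to your own setup and baseline.

```yaml
groups:
  - name: flink-go-live-baseline           # hypothetical group name
    rules:
      - alert: FlinkCheckpointFailures
        # More than 2 failed checkpoints within 15 minutes
        expr: increase(flink_jobmanager_job_numberOfFailedCheckpoints[15m]) > 2
        labels:
          severity: page
        annotations:
          summary: Checkpoints are failing repeatedly
      - alert: FlinkRestartLoop
        # More than 3 job restarts within 15 minutes
        expr: increase(flink_jobmanager_job_numRestarts[15m]) > 3
        labels:
          severity: page
        annotations:
          summary: Job is restarting repeatedly
      - alert: FlinkSustainedBackpressure
        # A task is backpressured more than half of each second for 10 minutes
        expr: flink_taskmanager_job_task_backPressuredTimeMsPerSecond > 500
        for: 10m
        labels:
          severity: ticket
        annotations:
          summary: Sustained backpressure on a task
```

Source lag, OOM kills, pod evictions, and BYOC agent connectivity come from connector, Kubernetes, and agent metrics and need their own rules.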

Upgrade and deployment strategy

Stateful systems need a deployment strategy, not just a deployment tool. Before production, you should know how upgrades work, how rollback works, what state artifacts are retained, and what assumptions must hold for zero-downtime patterns.

Check	Priority	Development tolerance	Production expectation	What to validate
Stateful upgrade path	REQUIRED		Understand how savepoint-based upgrades behave and how long they take	Savepoint creation time, restart duration, rollback artifacts
Rollback plan	REQUIRED		Rollback is documented and tested before production	Previous artifact availability, restore flow, owner
Savepoint retention during deployment windows	RECOMMENDED		Keep enough historical savepoints for safe rollback	Retention rules and manual override ability
Blue/Green prerequisites	RECOMMENDED		If using advanced rollout patterns, verify state compatibility, sink semantics, and restore mode	Operator UIDs, sink idempotence, output equivalence, restore mode
Dynamic parameter update awareness	OPTIONAL		Know which changes can be applied without full redeploy	Platform capability and runbook clarity
Platform upgrade procedure	RECOMMENDED		Platform Helm or operator upgrade steps are documented, including CRD handling and smoke tests	CRD refresh, smoke tests, rollback of platform changes
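
For the stateful upgrade path and restore mode checks, the upgrade and restore behavior is usually declared on the Deployment itself. The fragment below is a sketch based on the Ververica Platform Deployment spec as commonly documented; confirm the field names and allowed values for your version before using it.

```yaml
# Sketch only: verify field names and values against your VVP version.
spec:
  upgradeStrategy:
    kind: STATEFUL                 # take a savepoint before applying spec changes
  restoreStrategy:
    kind: LATEST_STATE             # start from the most recent available state
    allowNonRestoredState: false   # fail the restore if state cannot be mapped to an operator
```

Keeping `allowNonRestoredState` off during normal operation turns a missing or renamed operator UID into a visible failure instead of silent state loss.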

Disaster recovery and business continuity

Recovery should be designed and tested, not assumed. At minimum, you should be able to recover a failed job from retained checkpoints within a known recovery objective.

Check	Priority	Development tolerance	Production expectation	What to validate
Recovery from latest checkpoint	REQUIRED		Jobs recover automatically and predictably from the latest retained checkpoint	Kill-and-recover test in staging
Savepoint policy	RECOMMENDED		Savepoints are retained according to a documented policy and not treated as ad hoc artifacts	Retention ownership and cleanup process
Disaster checkpointing / secondary copy	OPTIONAL		Use a secondary location for checkpoint durability where business criticality justifies it	Secondary path, replication, test procedure
Platform metadata database backup and restore	REQUIRED	Ephemeral or manual exports may be acceptable only in non-production tests	For VVP self-managed deployments, the platform metadata database is backed up and restore is tested	Backup schedule, retention, encryption, restore runbook, recovery test for platform metadata
RTO / RPO definition	RECOMMENDED		Each critical job has explicit recovery targets	Checkpoint interval, restart time, downstream tolerance
Namespace and configuration recovery	RECOMMENDED		Deployment definitions and access mappings can be recreated quickly	Git-backed config, RBAC mapping, runbook completeness

Tip

Good sign

A production-ready team can answer two questions clearly: “What is our latest recoverable point?” and “How long does recovery take under realistic conditions?”

Governance and compliance

As the number of jobs and teams grows, governance becomes an operational necessity. Production readiness includes naming standards, default policies, approved versions, and a clear operating model.

Check	Priority	Development tolerance	Production expectation	What to validate
Namespace taxonomy	RECOMMENDED		Use consistent naming by team, environment, or domain	Naming rules and owner mapping
Deployment defaults	RECOMMENDED		Standardize critical defaults such as checkpointing, HA, restart strategy, and resource bounds	Namespace-level defaults and exception process
Approved engine versions	RECOMMENDED		Production runs on tested and approved engine versions only	Version catalog and upgrade ownership
Connector governance	OPTIONAL		Maintain an approved connector list and version policy	Connector review process
Operational guardrails	RECOMMENDED		Define acceptable bounds for checkpoint intervals, autoscaling, and resource consumption	Guardrail policy and exception approvals
Change management	RECOMMENDED		Production deployment rights and approval steps are explicit	Approver model, auditability, release process
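
Resource bounds from the guardrails above can be enforced at the Kubernetes level with a ResourceQuota per namespace. This is a minimal sketch: the quota name, the `flink-jobs` namespace, and every number are illustrative assumptions, to be derived from your own capacity plan.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: flink-team-quota           # hypothetical name
  namespace: flink-jobs            # hypothetical namespace
spec:
  hard:
    requests.cpu: "64"             # total CPU the namespace may reserve
    requests.memory: 256Gi
    limits.cpu: "96"               # total CPU burst ceiling across all pods
    limits.memory: 256Gi
    pods: "100"                    # caps runaway scale-out
```

A quota on requests and limits requires every pod in the namespace to declare them, which also surfaces Deployments that bypass the namespace defaults.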

Pre-production gate

Use this as a final gate before opening production traffic.

Kubernetes cluster sizing, storage, and scaling model have been reviewed
All required network paths and DNS behavior have been validated
SSO, RBAC, IAM, and secret delivery patterns are in place
Durable object storage is configured for checkpoints, savepoints, and artifacts
HA, checkpointing, state backend, restart strategy, and operator UIDs have been verified
Resource sizing has been tested with realistic workload assumptions
Metrics, dashboards, logs, and alerts are operational
Upgrade and rollback procedures are documented and tested
Recovery from checkpoint has been tested end to end
Governance rules, defaults, and ownership are clear

Minimum production configuration example

```yaml
# High availability
high-availability.type: kubernetes
high-availability.storageDir: s3://<bucket>/ha
# Checkpointing
execution.checkpointing.interval: 60s
execution.checkpointing.timeout: 10min
execution.checkpointing.min-pause: 30s
execution.checkpointing.tolerable-failed-checkpoints: 3
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
# State backend and durable storage
state.backend.type: rocksdb
state.backend.incremental: true
state.checkpoints.dir: s3://<bucket>/checkpoints
state.savepoints.dir: s3://<bucket>/savepoints
state.checkpoints.num-retained: 3
# Bounded automated restarts
restart-strategy.type: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 10min
restart-strategy.failure-rate.delay: 30s
# Slot layout
taskmanager.numberOfTaskSlots: 1
```
