Production Readiness Checklist

This page provides a practical production readiness checklist for Ververica Platform deployments and Flink applications. Use it before go-live for both self-managed platform deployments and BYOC environments.

Scope and operating principle

This checklist follows a strict production mindset: some configurations may be acceptable in development, but they should not be tolerated in production. The goal is to make the boundary explicit.

| Deployment model | Who manages Kubernetes | Who manages the platform | Typical scope |
|---|---|---|---|
| VVP Self-Managed | Customer | Customer | Platform, infrastructure, networking, security, workloads |
| VVC BYOC | Customer | Shared between Ververica control plane and customer data plane | Cluster, networking, IAM, storage, workloads, observability |

Use the checklist in order: infrastructure first, then connectivity, security, storage, application settings, observability, deployment strategy, and recovery.

Kubernetes and infrastructure

Your Kubernetes cluster is the operational foundation of the platform. Wrong instance families, undersized disks, missing autoscaling, or poor workload isolation typically surface only under load or during recovery.

| Check | Priority | Development tolerance | Production expectation | What to validate |
|---|---|---|---|---|
| Kubernetes version | REQUIRED | Recent supported version | Use a currently supported Kubernetes version for your Ververica release | Cluster version, node pool compatibility, add-on compatibility |
| Node groups / VM scale sets | REQUIRED | Single shared node pool is acceptable | Use managed node groups or VM scale sets with clear ownership and scaling boundaries | Separate node pools for platform and Flink workloads where possible |
| Instance family sizing | REQUIRED | General purpose instances are acceptable | Use memory-optimized instances for state-heavy TaskManagers and stable compute for platform services and JobManagers | CPU, memory, ephemeral storage, expected state size, checkpoint profile |
| Cluster autoscaling | RECOMMENDED | Manual scaling is acceptable | Configure Cluster Autoscaler or equivalent with sane min/max boundaries | Scale-out behavior, drain behavior, interaction with Flink recovery |
| Storage class | REQUIRED | Default storage class is acceptable | Use performant block storage suitable for sustained checkpointing and platform persistence | Storage class type, IOPS profile, latency consistency |
| Workload isolation | RECOMMENDED | Shared nodes are acceptable | Isolate Flink workloads from core platform services using taints, tolerations, labels, or dedicated node pools | Noisy-neighbor risk, eviction pressure, capacity contention |
| Multi-AZ placement | RECOMMENDED | Single zone is acceptable | Distribute nodes across availability zones for resilience | Node spread, anti-affinity, failure domain exposure |
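
The workload isolation and multi-AZ items can be sketched as a pod spec fragment. This is a minimal sketch using standard Kubernetes scheduling fields only. The node label and taint `workload=flink` and the pod label `app: flink-taskmanager` are hypothetical names, and where the fragment goes (for example, a pod template) depends on your platform version.

```yaml
# Pod spec fragment (standard Kubernetes fields).
# Assumes nodes in the Flink pool are labeled and tainted workload=flink (hypothetical).
nodeSelector:
  workload: flink
tolerations:
  - key: workload
    operator: Equal
    value: flink
    effect: NoSchedule
# Spread pods across zones. ScheduleAnyway keeps scheduling possible when a zone is full.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: flink-taskmanager   # hypothetical pod label
```

The taint keeps platform services off the Flink node pool, while the node selector keeps Flink pods on it. Both are needed for isolation in both directions.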

Networking and connectivity

For BYOC, networking is a first-class production topic. The most common deployment issue is blocked or incomplete egress. The Ververica agent uses an outbound-only secure connection model, so the key question is whether your cluster can reliably reach all required endpoints.

| Check | Priority | Development tolerance | Production expectation | What to validate |
|---|---|---|---|---|
| Outbound connectivity to Ververica control plane | REQUIRED | | Cluster can reach Ververica endpoints on port 443 | Egress rules, proxies, TLS inspection behavior |
| Container registry access | REQUIRED | | Nodes and pods can pull required images | Registry reachability, credentials, rate limits |
| DNS resolution | REQUIRED | | Internal and external names resolve consistently | CoreDNS health, external resolution, private zones |
| Private connectivity to data sources and sinks | RECOMMENDED | | Prefer private links or private endpoints for sensitive systems | PrivateLink, VPC peering, private endpoints, routing |
| Proxy compatibility | RECOMMENDED | | Proxy does not break TLS or WebSocket upgrade behavior | HTTP proxy variables, no-proxy exceptions, WebSocket support |
| Firewall / NSG rule review | REQUIRED | | Only the required outbound paths are allowed; no unnecessary inbound paths are opened | Direction, port, protocol, destination, owner |

Recommended networking questions to answer before production

  • Which Ververica endpoints must be reachable from the cluster?
  • Which source and sink systems are accessed by Flink jobs, over which ports and protocols?
  • Does any proxy, TLS inspection layer, or egress gateway alter long-lived secure connections?
  • Are DNS and routing paths identical across all worker nodes?
  • Is cross-region access part of the design, and if yes, what is the latency impact?

Identity, access, and secrets

Production readiness is not only about scale and uptime. It also means reducing operational and security risk: strong authentication, least privilege, short-lived credentials, and safe secret delivery.

| Check | Priority | Development tolerance | Production expectation | What to validate |
|---|---|---|---|---|
| SSO integration | REQUIRED | Local authentication may be acceptable | Use enterprise SSO via OIDC or SAML | Login flow, group mapping, logout behavior, MFA policy |
| Namespace RBAC | REQUIRED | Broad permissions may be acceptable | Use role separation with clear namespace ownership | Owner/Admin/Editor/Viewer mapping, access reviews |
| API token governance | RECOMMENDED | Long-lived tokens are acceptable | Scope tokens narrowly, define expiry and rotation procedures | Token inventory, expiry policy, automation users |
| Cloud IAM integration | REQUIRED | Static credentials may be acceptable | Use IRSA, Workload Identity, or equivalent federated identity patterns | No embedded cloud keys in manifests or Helm values |
| Secret delivery | REQUIRED | Kubernetes secrets only may be acceptable | Prefer external secret stores or managed secret delivery patterns for sensitive credentials | Vault, Secrets Manager, Key Vault, mounted files, rotation workflow |
| Encryption in transit | REQUIRED | Plaintext may be acceptable only in isolated local development | Enable TLS for Flink components and exposed REST/UI endpoints | Certificate management, internal RPC settings, ingress or load balancer TLS, REST/UI access paths |
| Kubernetes NetworkPolicies | RECOMMENDED | Open namespace networking may be acceptable | Restrict pod-to-pod and egress traffic to required platform, source, sink, and observability paths | Default deny posture, allowed namespaces, DNS, object storage, registries, and monitoring endpoints |
| Non-root execution | RECOMMENDED | Not always enforced | Run containers as non-root wherever supported and align with Pod Security Standards | SecurityContext coverage, admission policies, platform version caveats |
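
A default-deny posture with explicit allowances, as named in the NetworkPolicies item, could look like the following sketch. It uses the standard `networking.k8s.io/v1` API. The policy name and the `flink-jobs` namespace are hypothetical, and the allowed paths are deliberately minimal; a real policy also needs rules for your sources, sinks, platform namespace, and monitoring endpoints.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: flink-default-deny      # hypothetical name
  namespace: flink-jobs         # hypothetical namespace
spec:
  podSelector: {}               # applies to every pod in the namespace
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - podSelector: {}       # pod-to-pod traffic inside the namespace only
  egress:
    - to:
        - podSelector: {}       # JobManager <-> TaskManager traffic
    - to:                       # DNS lookups via kube-system
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
    - ports:                    # HTTPS egress (object storage, registries, control plane)
        - { protocol: TCP, port: 443 }
```

The policy only takes effect if the cluster's CNI plugin enforces NetworkPolicies, so confirm enforcement with a connectivity test rather than assuming it.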

Object storage and artifacts

For Flink, durable object storage is part of the runtime control plane. It stores checkpoints, savepoints, and often deployment artifacts. Poor storage design directly impacts reliability and recovery.

| Check | Priority | Development tolerance | Production expectation | What to validate |
|---|---|---|---|---|
| Durable checkpoint and savepoint storage | REQUIRED | | Use S3, ADLS Gen2, or another durable object store | Checkpoint path, savepoint path, durability guarantees |
| Separation of paths or buckets | RECOMMENDED | | Separate artifacts, checkpoints, and savepoints logically | Path structure, bucket ownership, IAM boundaries |
| Checkpoint retention | REQUIRED | | Retain multiple checkpoints, not just the latest one | state.checkpoints.num-retained and storage cleanup rules |
| Externalized checkpoint retention on cancel | REQUIRED | | Retain checkpoints on cancellation | RETAIN_ON_CANCELLATION setting |
| Encryption at rest | RECOMMENDED | | Enable provider-managed or customer-managed encryption | SSE-S3, SSE-KMS, Azure Storage encryption |
| Storage performance validation | RECOMMENDED | | Validate throughput under realistic checkpoint load | Checkpoint size, duration, timeout margin, concurrent job behavior |
| Artifact versioning | RECOMMENDED | | Use immutable or versioned artifact storage | Versioning policy, naming convention, rollback availability |

Flink application settings

Application settings are the heart of production readiness for Flink workloads. Most severe runtime failures are not caused by exotic bugs; they come from a handful of missing or unsafe production settings.

| Check | Priority | Development tolerance | Production expectation | What to validate |
|---|---|---|---|---|
| High availability enabled | REQUIRED | | Set high-availability.type: kubernetes or the equivalent supported HA mode | Deployment config, high-availability.storageDir, and recovery test |
| Checkpointing enabled | REQUIRED | | Checkpointing is active with realistic interval and timeout values | Interval, timeout, min pause, tolerated failures |
| State backend selection | REQUIRED | | Use RocksDB or Gemini/VERA backend for large state workloads | State size profile, memory model, checkpoint behavior, and state.backend.type usage |
| Incremental checkpoints | RECOMMENDED | | Enable for large-state workloads when supported | Backend compatibility, checkpoint size evolution |
| Exactly-once sink semantics | REQUIRED | At-least-once or idempotent-only sinks may be acceptable for non-critical tests | Use exactly-once-capable sinks or transactional connectors where correctness requires it, with transaction timeout safely above checkpoint duration and interval | Connector semantics, transaction timeout, checkpoint interval and timeout, failure recovery, duplicate and loss behavior |
| State schema and serializer evolution | REQUIRED | Breaking changes may be acceptable before state is durable | Plan compatible state schema and serializer evolution for upgrades and savepoint restores | Serializer snapshots, Avro/Protobuf compatibility rules, savepoint restore test, fallback or migration path |
| Restart strategy | REQUIRED | | Use a bounded automated restart strategy, not “no restart” | Failure rate, delay, escalation path |
| Stable operator UIDs | REQUIRED | | All stateful operators have explicit, stable UIDs | Code review for uid() usage |
| Explicit maxParallelism | RECOMMENDED | | Set intentionally on stateful operators | Code review and scaling plan |
| State TTL for SQL jobs | RECOMMENDED | | Configure TTL where state can otherwise grow without bound | table.exec.state.ttl and expected retention semantics |
| Restore mode strategy | RECOMMENDED | | Choose restore mode according to deployment strategy | LATEST_STATE, NO_CLAIM, upgrade flow |
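
Some of the settings in this table can be expressed directly as Flink configuration. The fragment below is a sketch with illustrative values, not recommended defaults. It assumes a Flink version that supports these keys, and the retention and parallelism numbers are hypothetical.

```yaml
# Checkpoint mode for Flink's internal state (also the Flink default).
execution.checkpointing.mode: EXACTLY_ONCE
# Job-wide default maximum parallelism (hypothetical value).
# Changing it after keyed state exists can prevent restoring that state.
pipeline.max-parallelism: 720
# Idle state retention for SQL jobs (hypothetical value).
# Expired state is dropped, which can change join and aggregation results.
table.exec.state.ttl: 24h
# Not shown: the sink transaction timeout is a connector-level property and
# must stay safely above checkpoint interval plus checkpoint duration.
```

Operator UIDs and per-operator maximum parallelism are set in application code, so they still require code review rather than configuration review.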

Resource sizing and memory

Memory and CPU sizing are engineering inputs, not default values to leave untouched. Production problems usually appear during backpressure, spikes, rescaling, or restore from state. Keep this checklist at go/no-go level; use the companion section Resource sizing guidance (Ververica Platform) below for prescriptive starting points, and validate them through the Load testing item in the table below.

| Check | Priority | Development tolerance | Production expectation | What to validate |
|---|---|---|---|---|
| JobManager memory | REQUIRED | | Size for job graph complexity, metadata pressure, and recovery operations | Heap usage, GC, failover behavior |
| TaskManager memory | REQUIRED | | Size according to state size, operator footprint, network buffers, and managed memory needs | Heap, managed memory, off-heap, disk spill |
| Unaligned checkpoints under backpressure | RECOMMENDED | Aligned checkpoints are acceptable for simple or low-pressure pipelines | Enable execution.checkpointing.unaligned.enabled for pipelines where sustained backpressure makes aligned checkpoints unstable or too slow | Backpressure metrics, checkpoint alignment time, checkpoint size increase, recovery behavior |
| CPU requests and limits | RECOMMENDED | | Set deliberately rather than reusing a single default ratio everywhere | Requests, limits, throttling metrics |
| Slots per TaskManager | RECOMMENDED | | Prefer simple slot layouts unless you have measured reasons not to | Slot strategy vs. workload shape |
| Namespace quotas and limits | RECOMMENDED | | Protect the cluster from a single runaway deployment | Quota definitions, observed saturation patterns |
| Load testing | RECOMMENDED | | Run a realistic pre-production benchmark or shadow test | Volume, cardinality, state growth, sink behavior, restore time |
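
For the unaligned checkpoints item, a minimal configuration sketch follows. It assumes a Flink version that accepts both keys, and the 30s timeout is an illustrative value to be tuned against measured alignment times.

```yaml
# Allow checkpoint barriers to overtake buffered records under backpressure.
execution.checkpointing.unaligned.enabled: true
# Start each checkpoint aligned and switch to unaligned only if alignment
# takes longer than this timeout (illustrative value).
execution.checkpointing.aligned-checkpoint-timeout: 30s
```

Unaligned checkpoints persist in-flight buffers, so expect larger checkpoints. Compare checkpoint size and restore time before and after enabling them.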

Resource sizing guidance (Ververica Platform)

Deployment Templates and namespace defaults

| Check | Priority | Guidance | Signals to observe |
|---|---|---|---|
| Deployment Templates | RECOMMENDED | Use Ververica Platform Deployment Templates to standardize namespace-level defaults for Deployment sizing: JobManager replicas, JobManager and TaskManager resources, parallelism, number of TaskManagers, and slots per TaskManager. Keep defaults conservative and explicit, then require a documented exception when a Deployment deviates from the namespace profile. | Repeated manual overrides, inconsistent resource settings across similar jobs, quota saturation, failed scheduling, and benchmark deltas between template defaults and actual load-test needs. |
| Exception process | RECOMMENDED | Require the owner to state why the Deployment needs a different profile, which benchmark supports the change, and when the exception will be reviewed. Treat higher memory, higher CPU limits, extra TaskManagers, or unusual slot layouts as operational exceptions rather than silent one-offs. | CPU throttling, OOM kills, checkpoint timeout margin, backpressure, restore duration, and cost or quota impact after the exception is applied. |

JobManager sizing

| Check | Priority | Starting guidance | Signals to observe |
|---|---|---|---|
| JobManager replicas | REQUIRED | For production Deployments, use more than one JobManager replica where Ververica Platform HA is enabled for the namespace or Deployment. Start with 2 replicas for HA and validate by failover testing; increase only when the platform guidance for the selected VVP version and operating model calls for it. | Leader failover time, job recovery time, failed or slow leadership transitions, restart loops, and whether recovery meets the documented RTO. |
| JobManager resources | REQUIRED | Size JobManager CPU and memory through the Deployment resources exposed by Ververica Platform, not by unmanaged runtime files. Start small for simple graphs, then increase for large SQL plans, many operators, high checkpoint metadata volume, frequent failover testing, or complex savepoint and upgrade operations. | JobManager heap pressure, GC pauses, metadata growth, slow checkpoint coordination, slow deployment submission, and failover or restore operations that exceed the expected window. |

TaskManager resources and memory model exposed by VVP

| Check | Priority | Guidance | Signals to observe |
|---|---|---|---|
| TaskManager resources | REQUIRED | Set TaskManager sizing through the Deployment resources fields in Ververica Platform. Treat CPU and memory as Deployment-level capacity inputs that must be benchmarked against expected event rate, key cardinality, state size, checkpoint interval, and recovery target. | Container OOM kills, pod evictions, JVM memory pressure, RocksDB/Gemini memory pressure, disk spill, checkpoint duration, and sustained backpressure. |
| Memory split | REQUIRED | Frame memory reviews in terms of the Flink memory components surfaced through VVP: framework and task heap for JVM objects and user code, managed memory for state backends such as RocksDB/Gemini and batch or sort workloads, network memory for shuffle buffers, and off-heap or JVM overhead for native memory and process overhead. For state-heavy jobs, do not size only from the Kubernetes memory value; validate whether the managed-memory and off-heap portions leave enough room for the backend and checkpoint activity. | OOM kills without high Java heap, RocksDB/Gemini native memory pressure, compaction or write stalls, spill growth, checkpoint alignment time, network buffer exhaustion, and GC pauses. |
| State-size interaction | RECOMMENDED | Use observed state size and growth rate to validate TaskManager memory. Large state does not imply all state is resident in memory, but larger keyed state increases backend working set, checkpoint metadata and I/O pressure, and recovery time. Start with headroom, then reduce only after benchmark evidence. | State growth trend, checkpoint size, checkpoint duration and timeout margin, restore time, local disk utilization, backend cache hit behavior where available, and backpressure during checkpointing. |
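
The memory split described above maps onto a small set of Flink configuration keys. The sketch below lists them with Flink's documented default values, so the starting point is explicit before any tuning. Whether and how your VVP version lets you override them alongside the Deployment resources is an assumption to verify.

```yaml
# Share of Flink memory reserved as managed memory (used by RocksDB when enabled).
taskmanager.memory.managed.fraction: 0.4
# Share of Flink memory reserved for network (shuffle) buffers.
taskmanager.memory.network.fraction: 0.1
# Share of process memory reserved for JVM overhead (native allocations, thread stacks).
taskmanager.memory.jvm-overhead.fraction: 0.1
# Let RocksDB size its block cache and write buffers from managed memory.
state.backend.rocksdb.memory.managed: true
```

For a state-heavy job, raising the managed fraction shrinks task heap by the same amount, so check heap usage and GC pauses in the same benchmark that measures backend pressure.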

Parallelism, TaskManagers, and slots

| Check | Priority | Guidance | Signals to observe |
|---|---|---|---|
| Capacity relationship | REQUIRED | In VVP Deployment sizing, total available slots come from numberOfTaskManagers multiplied by TaskManager slots. The job parallelism consumes those slots according to the execution graph. Keep the relationship explicit in templates so a change in parallelism is reviewed together with TaskManager count and slot layout. | Insufficient slots, unscheduled Deployments, idle slots, uneven subtask load, backpressure isolated to a few subtasks, and restore or rescale duration. |
| Slot layout | RECOMMENDED | Prefer simple layouts as starting points: 1 slot per TaskManager for state-heavy or isolation-sensitive jobs; 2 to 4 slots per TaskManager for lighter stateless or throughput-oriented jobs only when benchmarks show good resource sharing. Avoid high slot counts per TaskManager unless measured CPU, memory, and network behavior justify them. | Per-subtask backpressure, CPU saturation, heap and managed-memory contention, network buffer pressure, noisy-neighbor effects between operators, and checkpoint duration changes after changing slot count. |
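
The capacity relationship can be made concrete with a worked example. The fragment below is a sketch of the sizing part of a Deployment. The field layout is an assumption based on the VVP Deployment resource and should be checked against your platform version; all numbers are illustrative.

```yaml
# Sizing fragment of a Deployment (field names assumed; verify for your VVP version).
spec:
  template:
    spec:
      parallelism: 8
      numberOfTaskManagers: 4
      resources:
        taskmanager:
          cpu: 4
          memory: 16G
      flinkConfiguration:
        taskmanager.numberOfTaskSlots: "2"
# Capacity check: 4 TaskManagers x 2 slots = 8 slots.
# With all operators in the default slot sharing group, a job needs as many
# slots as its highest operator parallelism, so 8 slots cover parallelism 8.
```

Raising parallelism to 12 in this example without changing the other two values would leave the job short of 4 slots, which is why the three settings should be reviewed together.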

Requests, limits, and throttling

| Check | Priority | Guidance | Signals to observe |
|---|---|---|---|
| Requests and limits | RECOMMENDED | Use the Ververica Platform resource model and limit factors to derive Kubernetes requests and limits consistently from Deployment resources. Keep limit factors explicit in namespace defaults so teams understand how much burst capacity a Deployment receives and what the scheduler reserves. | Failed scheduling, node overcommit, memory limit OOM kills, CPU throttling, and mismatch between requested capacity and observed steady-state usage. |
| CPU throttling | REQUIRED | Do not treat CPU limits as harmless guardrails for latency-sensitive streaming jobs. If the limit factor is too tight, CPU throttling can look like application backpressure, checkpoint slowness, or sink lag. Validate CPU limits under peak input rate and during recovery. | Container CPU throttled time, busy time, source lag, checkpoint duration, backpressure, restart duration, and sink latency during load tests. |

Reference profiles to benchmark

The profiles below are starting points to validate by benchmark. They are intended for Deployment Templates and exception discussions, not as universal production defaults.

| Profile | Starting point to validate by benchmark | When to use | Signals to confirm or adjust |
|---|---|---|---|
| Small / simple | JobManager: 2 replicas, about 0.5 to 1 CPU and 1 to 2 GiB memory. TaskManagers: 1 to 2 TaskManagers, 1 to 2 slots each, about 1 to 2 CPU and 2 to 4 GiB memory per TaskManager. | Low-volume jobs, limited state, simple topology, non-critical throughput requirements. | OOM kills, GC pressure, checkpoint duration, source lag, idle slots, and any sustained backpressure during the Load testing benchmark. |
| Stateful-heavy | JobManager: 2 replicas, about 1 to 2 CPU and 2 to 4 GiB memory. TaskManagers: start with 1 slot per TaskManager, 4 to 8 CPU and 16 to 64 GiB memory per TaskManager; scale numberOfTaskManagers with required parallelism and state distribution. | Large keyed state, RocksDB/Gemini backend, long retention, expensive checkpoints, or strict recovery objectives. | Container OOM kills, RocksDB/Gemini native memory pressure, spill or compaction pressure, checkpoint size and duration, checkpoint timeout margin, restore time, and backpressure during checkpointing. |
| High-throughput | JobManager: 2 replicas, about 1 to 2 CPU and 2 to 4 GiB memory. TaskManagers: start with 2 to 4 slots per TaskManager, about 4 to 8 CPU and 8 to 32 GiB memory per TaskManager; increase parallelism and numberOfTaskManagers together after measuring source, network, and sink bottlenecks. | High event rate, lighter per-key state, CPU or network-heavy operators, demanding source and sink throughput. | CPU saturation or throttling, network buffer pressure, backpressure, source lag, sink latency, checkpoint alignment time, and throughput plateau despite added parallelism. |

Observability and alerting

You should never learn that a streaming application is unhealthy from a downstream business team. Production readiness requires metrics, logs, dashboards, and alerts before the first incident.

| Check | Priority | Development tolerance | Production expectation | What to validate |
|---|---|---|---|---|
| Metrics collection | REQUIRED | | Prometheus or equivalent collects JM and TM metrics | Scrape targets, labels, retention, dashboard usability |
| Dashboards | RECOMMENDED | | Grafana or equivalent dashboards exist for platform and jobs | Coverage for platform, jobs, and storage health |
| Alerting | REQUIRED | | Alerts exist for checkpoint failures, sustained backpressure, restart loops, lag growth, and OOM kills | Thresholds, routing, on-call ownership |
| Log aggregation | RECOMMENDED | | Centralized logs for JM, TM, and platform components | Retention, searchability, pod restart coverage |
| BYOC agent monitoring | REQUIRED | | Agent health and control-plane connectivity are monitored | Pod health, reconnect behavior, tunnel status |
| Audit logging | RECOMMENDED | | Enable and retain audit logs where supported | Retention policy, SIEM export, access review |

Minimum alert set worth defining before go-live

  • Checkpoint failures above a low threshold
  • Checkpoint duration trending upward beyond normal baseline
  • Sustained backpressure on any critical operator
  • Source lag growing continuously
  • Repeated job restarts within a short interval
  • Container OOM kills or pod evictions
  • Loss of BYOC agent connectivity
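
Part of the alert set above can be sketched as Prometheus alerting rules. This sketch assumes Flink's Prometheus metrics reporter with default scope formats (which determines the `flink_...` metric names and the `job_name` label) and kube-state-metrics for the OOM rule. The group name, thresholds, and severity labels are hypothetical and need tuning against your own baseline.

```yaml
groups:
  - name: flink-go-live            # hypothetical group name
    rules:
      - alert: FlinkCheckpointFailures
        expr: increase(flink_jobmanager_job_numberOfFailedCheckpoints[15m]) > 2
        labels:
          severity: page
        annotations:
          summary: "Checkpoints failing for {{ $labels.job_name }}"
      - alert: FlinkSustainedBackpressure
        # Subtask backpressured for more than half of each second, averaged over 10 minutes.
        expr: avg_over_time(flink_taskmanager_job_task_backPressuredTimeMsPerSecond[10m]) > 500
        for: 10m
        labels:
          severity: warn
      - alert: FlinkRestartLoop
        expr: increase(flink_jobmanager_job_numRestarts[10m]) > 3
        labels:
          severity: page
      - alert: ContainerOOMKilled
        expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
        labels:
          severity: page
```

Source lag and BYOC agent connectivity are not covered here because their metric names depend on the connector and the agent version in use.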

Upgrade and deployment strategy

Stateful systems need a deployment strategy, not just a deployment tool. Before production, you should know how upgrades work, how rollback works, what state artifacts are retained, and what assumptions must hold for zero-downtime patterns.

| Check | Priority | Development tolerance | Production expectation | What to validate |
|---|---|---|---|---|
| Stateful upgrade path | REQUIRED | | Understand how savepoint-based upgrades behave and how long they take | Savepoint creation time, restart duration, rollback artifacts |
| Rollback plan | REQUIRED | | Rollback is documented and tested before production | Previous artifact availability, restore flow, owner |
| Savepoint retention during deployment windows | RECOMMENDED | | Keep enough historical savepoints for safe rollback | Retention rules and manual override ability |
| Blue/Green prerequisites | RECOMMENDED | | If using advanced rollout patterns, verify state compatibility, sink semantics, and restore mode | Operator UIDs, sink idempotence, output equivalence, restore mode |
| Dynamic parameter update awareness | OPTIONAL | | Know which changes can be applied without full redeploy | Platform capability and runbook clarity |
| Platform upgrade procedure | RECOMMENDED | | Platform Helm or operator upgrade steps are documented, including CRD handling and smoke tests | CRD refresh, smoke tests, rollback of platform changes |
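
For the stateful upgrade path, the upgrade and restore behavior is typically declared on the Deployment itself. The fragment below is a sketch; the field names are an assumption based on the VVP Deployment resource and should be verified against the platform version in use.

```yaml
# Upgrade and restore fragment of a Deployment (field names assumed).
spec:
  upgradeStrategy:
    kind: STATEFUL               # take a savepoint, then restart the job from it
  restoreStrategy:
    kind: LATEST_STATE           # restore from the most recent available state
    allowNonRestoredState: false # fail the restore if state cannot be mapped to an operator
```

Keeping `allowNonRestoredState` off makes a missing or changed operator UID fail loudly during an upgrade test instead of silently dropping state.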

Disaster recovery and business continuity

Recovery should be designed and tested, not assumed. At minimum, you should be able to recover a failed job from retained checkpoints within a known recovery objective.

| Check | Priority | Development tolerance | Production expectation | What to validate |
|---|---|---|---|---|
| Recovery from latest checkpoint | REQUIRED | | Jobs recover automatically and predictably from the latest retained checkpoint | Kill-and-recover test in staging |
| Savepoint policy | RECOMMENDED | | Savepoints are retained according to a documented policy and not treated as ad hoc artifacts | Retention ownership and cleanup process |
| Disaster checkpointing / secondary copy | OPTIONAL | | Use a secondary location for checkpoint durability where business criticality justifies it | Secondary path, replication, test procedure |
| Platform metadata database backup and restore | REQUIRED | Ephemeral or manual exports may be acceptable only in non-production tests | For VVP self-managed deployments, the platform metadata database is backed up and restore is tested | Backup schedule, retention, encryption, restore runbook, recovery test for platform metadata |
| RTO / RPO definition | RECOMMENDED | | Each critical job has explicit recovery targets | Checkpoint interval, restart time, downstream tolerance |
| Namespace and configuration recovery | RECOMMENDED | | Deployment definitions and access mappings can be recreated quickly | Git-backed config, RBAC mapping, runbook completeness |

Governance and compliance

As the number of jobs and teams grows, governance becomes an operational necessity. Production readiness includes naming standards, default policies, approved versions, and a clear operating model.

| Check | Priority | Development tolerance | Production expectation | What to validate |
|---|---|---|---|---|
| Namespace taxonomy | RECOMMENDED | | Use consistent naming by team, environment, or domain | Naming rules and owner mapping |
| Deployment defaults | RECOMMENDED | | Standardize critical defaults such as checkpointing, HA, restart strategy, and resource bounds | Namespace-level defaults and exception process |
| Approved engine versions | RECOMMENDED | | Production runs on tested and approved engine versions only | Version catalog and upgrade ownership |
| Connector governance | OPTIONAL | | Maintain an approved connector list and version policy | Connector review process |
| Operational guardrails | RECOMMENDED | | Define acceptable bounds for checkpoint intervals, autoscaling, and resource consumption | Guardrail policy and exception approvals |
| Change management | RECOMMENDED | | Production deployment rights and approval steps are explicit | Approver model, auditability, release process |
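
A resource consumption guardrail can be enforced at the Kubernetes level with a namespace quota. The manifest below is a sketch using the standard `ResourceQuota` API. The names and all limits are hypothetical and should come from your capacity plan and exception process.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: flink-jobs-quota        # hypothetical name
  namespace: flink-jobs         # hypothetical namespace
spec:
  hard:
    requests.cpu: "64"          # total CPU the scheduler may reserve for the namespace
    requests.memory: 256Gi
    limits.cpu: "96"            # ceiling on burst capacity across all pods
    limits.memory: 256Gi
    pods: "100"
```

Once a quota on requests or limits exists, pods that do not declare those values are rejected in that namespace, so roll it out together with consistent Deployment resource defaults.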

Pre-production gate

Use this as a final gate before opening production traffic.

  • Kubernetes cluster sizing, storage, and scaling model have been reviewed
  • All required network paths and DNS behavior have been validated
  • SSO, RBAC, IAM, and secret delivery patterns are in place
  • Durable object storage is configured for checkpoints, savepoints, and artifacts
  • HA, checkpointing, state backend, restart strategy, and operator UIDs have been verified
  • Resource sizing has been tested with realistic workload assumptions
  • Metrics, dashboards, logs, and alerts are operational
  • Upgrade and rollback procedures are documented and tested
  • Recovery from checkpoint has been tested end to end
  • Governance rules, defaults, and ownership are clear

Minimum production configuration example

YAML
# High availability
high-availability.type: kubernetes
high-availability.storageDir: s3://<bucket>/ha
# Checkpointing
execution.checkpointing.interval: 60s
execution.checkpointing.timeout: 10min
execution.checkpointing.min-pause: 30s
execution.checkpointing.tolerable-failed-checkpoints: 3
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
# State backend and durable storage paths
state.backend.type: rocksdb
state.backend.incremental: true
state.checkpoints.dir: s3://<bucket>/checkpoints
state.savepoints.dir: s3://<bucket>/savepoints
state.checkpoints.num-retained: 3
# Bounded restart strategy: at most 3 failures per 10 minutes, 30s between attempts
restart-strategy.type: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 10min
restart-strategy.failure-rate.delay: 30s
# Slot layout
taskmanager.numberOfTaskSlots: 1