Production Readiness Checklist
This page provides a practical production readiness checklist for Ververica Platform deployments and Flink applications. It is intended to be used before go-live for both self-managed platform deployments and BYOC environments.
How to use this page
Use this checklist as a go/no-go gate before production. Items marked REQUIRED should be considered mandatory for production. Items marked RECOMMENDED reduce operational risk significantly. Items marked OPTIONAL are maturity improvements.
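The gate logic described above can be sketched in a few lines. This is a minimal illustration, not a Ververica tool: the `Item` type, the `gate` function, and the item names are hypothetical. It assumes only that a failed REQUIRED item blocks go-live, a failed RECOMMENDED item is reported as a warning, and OPTIONAL items never block.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    REQUIRED = "REQUIRED"
    RECOMMENDED = "RECOMMENDED"
    OPTIONAL = "OPTIONAL"


@dataclass(frozen=True)
class Item:
    name: str      # free-text label (hypothetical examples below)
    level: Level   # marking taken from the checklist
    passed: bool   # result of the review for this item


def gate(items):
    """Return (go, blockers, warnings) for a list of checklist items."""
    # Any failed REQUIRED item is a blocker; failed RECOMMENDED items are warnings.
    blockers = [i.name for i in items if i.level is Level.REQUIRED and not i.passed]
    warnings = [i.name for i in items if i.level is Level.RECOMMENDED and not i.passed]
    return (not blockers, blockers, warnings)


items = [
    Item("Checkpoint storage configured", Level.REQUIRED, True),
    Item("Restore tested in staging", Level.REQUIRED, False),
    Item("Dashboards reviewed", Level.RECOMMENDED, False),
    Item("Cost tagging", Level.OPTIONAL, False),
]
go, blockers, warnings = gate(items)
print("GO" if go else "NO-GO", blockers, warnings)
```

Keeping warnings separate from blockers lets a team go live with known, recorded gaps in RECOMMENDED items without weakening the REQUIRED boundary.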
Scope and operating principle
This checklist follows a strict production mindset: some configurations may be acceptable in development, but they should not be tolerated in production. The goal is to make the boundary explicit.
Work through the checklist in order: infrastructure first, then connectivity, security, storage, application settings, resource sizing, observability, deployment strategy, recovery, and governance.
Kubernetes and infrastructure
Your Kubernetes cluster is the operational foundation of the platform. Wrong instance families, undersized disks, missing autoscaling, or poor workload isolation typically surface only under load or during recovery.
Practical rule of thumb
Size TaskManagers for the workload they actually run, not for average utilization. Stateful jobs fail during peaks, backpressure, and recovery events, not during calm periods.
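To make the gap between average and peak concrete, here is a rough back-of-the-envelope sketch. The function `slots_needed`, the 1.3 headroom factor, and every number below are illustrative assumptions, not values from this page or from Ververica; the per-slot throughput in particular has to come from your own benchmark.

```python
import math


def slots_needed(records_per_s, per_slot_records_per_s, headroom=1.3):
    """Slots required to sustain a given rate with a safety factor applied."""
    if per_slot_records_per_s <= 0:
        raise ValueError("per-slot throughput must be positive")
    return math.ceil(records_per_s * headroom / per_slot_records_per_s)


# Hypothetical numbers for illustration only; measure your own in a load test.
average_rate = 40_000   # records/s over a typical day
peak_rate = 130_000     # records/s during the busiest observed interval
per_slot = 10_000       # records/s one slot sustained in a benchmark

at_average = slots_needed(average_rate, per_slot)
at_peak = slots_needed(peak_rate, per_slot)
print(f"slots at average: {at_average}, slots at peak: {at_peak}")
```

A job sized from the average figure would be running with roughly a third of the capacity it needs at peak, which is exactly when backpressure and slow checkpoints appear.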
Networking and connectivity
For BYOC, networking is a first-class production topic. The most common deployment issue is blocked or incomplete egress. The Ververica agent uses an outbound-only secure connection model, so the key question is whether your cluster can reliably reach all required endpoints.
Recommended networking questions to answer before production
- Which Ververica endpoints must be reachable from the cluster?
- Which source and sink systems are accessed by Flink jobs, over which ports and protocols?
- Does any proxy, TLS inspection layer, or egress gateway alter long-lived secure connections?
- Are DNS and routing paths identical across all worker nodes?
- Is cross-region access part of the design, and if yes, what is the latency impact?
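The first two questions can be partly answered with a simple TCP reachability probe run from inside the cluster. The sketch below is an assumption-laden starting point: `can_connect` and `check_endpoints` are hypothetical helper names, and the demo uses a local listener in place of real endpoints so that it runs anywhere. It only shows that a TCP connection opens; it does not detect TLS inspection or dropped long-lived connections.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_endpoints(endpoints, timeout=3.0):
    """Map each (host, port) pair to its reachability result."""
    return {(h, p): can_connect(h, p, timeout) for h, p in endpoints}


# Self-contained demo: a local listener stands in for a required endpoint.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))          # port 0 lets the OS pick a free port
server.listen(1)
open_port = server.getsockname()[1]

results = check_endpoints([("127.0.0.1", open_port)])
server.close()
closed = can_connect("127.0.0.1", open_port, timeout=1.0)

print(results, closed)
```

In practice you would pass the list of required platform endpoints and the source and sink addresses for your jobs, and run the probe from a pod on each node pool, since DNS and routing can differ between worker nodes.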
Identity, access, and secrets
Production readiness is not only about scale and uptime. It also means reducing operational and security risk: strong authentication, least privilege, short-lived credentials, and safe secret delivery.
Important
Do not treat “we will secure it later” as a production strategy. Authentication, IAM, and secret handling are part of production readiness, not post-go-live hardening.
Object storage and artifacts
For Flink, durable object storage is part of the runtime control plane. It stores checkpoints, savepoints, and often deployment artifacts. Poor storage design directly impacts reliability and recovery.
Flink application configuration
This is the heart of production readiness for Flink workloads. Most severe runtime failures are not caused by exotic bugs; they come from a handful of missing or unsafe production settings.
Production principle
If a job cannot recover cleanly from its latest durable state in a staging test, it is not production ready.
Resource sizing and memory
Memory and CPU sizing should be treated as engineering inputs, not default values left untouched. Production problems usually appear during backpressure, spikes, rescaling, or restore from state. Keep this checklist at go/no-go level; use the companion section Resource sizing guidance (Ververica Platform) below for prescriptive starting points, and validate them through the Load testing item in this table.
Resource sizing guidance (Ververica Platform)
Required framing
All numeric values in this section are starting points to validate by load test, not universal sizing rules. Stateful job sizing depends on workload shape, key cardinality, state growth, checkpoint profile, source and sink behavior, and recovery objectives. Use this section together with the Load testing checklist item above; adjust only from measured signals such as OOM kills, RocksDB/Gemini spill or compaction pressure, checkpoint duration, backpressure, restart time, and CPU throttling.
Deployment Templates and namespace defaults
JobManager sizing
TaskManager resources and memory model exposed by VVP
Parallelism, TaskManagers, and slots
Requests, limits, and throttling
Reference profiles to benchmark
The profiles below are starting points to validate by benchmark. They are intended for Deployment Templates and exception discussions, not as universal production defaults.
Production principle
A sizing profile is production-ready only after it survives the documented Load testing scenario, including peak traffic, checkpointing, failure and restore, and sink behavior. If the benchmark changes the conclusion, update the Deployment Template rather than relying on tribal knowledge.
Observability and alerting
You should never learn that a streaming application is unhealthy from a downstream business team. Production readiness requires metrics, logs, dashboards, and alerts before the first incident.
Minimum alert set worth defining before go-live
- Checkpoint failures above a low threshold
- Checkpoint duration trending upward beyond normal baseline
- Sustained backpressure on any critical operator
- Source lag growing continuously
- Repeated job restarts within a short interval
- Container OOM kills or pod evictions
- Loss of BYOC agent connectivity
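Two of the conditions in the list above, repeated restarts within a short interval and continuously growing source lag, can be expressed as small predicates. This is a sketch of the logic only; in a real deployment the same rules would normally live in your alerting system. The function names, window sizes, and thresholds are hypothetical.

```python
def restarts_exceeded(restart_times, window_s, max_restarts):
    """True if more than max_restarts restarts fall inside any window of window_s seconds."""
    times = sorted(restart_times)
    start = 0
    for end, t in enumerate(times):
        # Slide the window start forward until it spans at most window_s seconds.
        while t - times[start] > window_s:
            start += 1
        if end - start + 1 > max_restarts:
            return True
    return False


def lag_growing(samples, min_points=5):
    """True if the last min_points lag samples are strictly increasing."""
    recent = samples[-min_points:]
    if len(recent) < min_points:
        return False
    return all(a < b for a, b in zip(recent, recent[1:]))


# Hypothetical samples: restart timestamps in seconds, lag in records.
burst = restarts_exceeded([0, 100, 200, 250], window_s=600, max_restarts=3)
spread = restarts_exceeded([0, 700, 1400, 2100], window_s=600, max_restarts=3)
rising = lag_growing([10, 12, 15, 19, 30])
noisy = lag_growing([10, 12, 11, 19, 30])
print(burst, spread, rising, noisy)
```

Requiring several consecutive increasing samples, rather than a single high reading, keeps the lag alert from firing on short bursts that the job recovers from on its own.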
Upgrade and deployment strategy
Stateful systems need a deployment strategy, not just a deployment tool. Before production, you should know how upgrades work, how rollback works, what state artifacts are retained, and what assumptions must hold for zero-downtime patterns.
Disaster recovery and business continuity
Recovery should be designed and tested, not assumed. At minimum, you should be able to recover a failed job from retained checkpoints within a known recovery objective.
Good sign
A production-ready team can answer two questions clearly: “What is our latest recoverable point?” and “How long does recovery take under realistic conditions?”
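Both questions can be answered from data a team already has: the completion times of retained checkpoints and the durations measured in recovery drills. The sketch below assumes those two inputs are available as plain lists; the function names, timestamps, and the five-minute objective are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def latest_recoverable_point(checkpoint_times):
    """Completion time of the newest retained checkpoint, or None if there is none."""
    return max(checkpoint_times, default=None)


def recovery_within_objective(drill_durations, objective):
    """True only if at least one drill was run and the slowest one met the objective."""
    return bool(drill_durations) and max(drill_durations) <= objective


# Hypothetical data: three retained checkpoints and two measured restore drills.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
checkpoints = [now - timedelta(minutes=m) for m in (7, 4, 1)]
point = latest_recoverable_point(checkpoints)
exposure = now - point  # roughly how far back a restore would rewind right now

drills = [timedelta(minutes=3), timedelta(minutes=9)]
ok = recovery_within_objective(drills, objective=timedelta(minutes=5))
print(point.isoformat(), exposure, ok)
```

The check uses the slowest drill rather than the average, because the recovery objective has to hold under the least favorable conditions that were actually observed.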
Governance and compliance
As the number of jobs and teams grows, governance becomes an operational necessity. Production readiness includes naming standards, default policies, approved versions, and a clear operating model.
Pre-production gate
Use this as a final gate before opening production traffic.
- Kubernetes cluster sizing, storage, and scaling model have been reviewed
- All required network paths and DNS behavior have been validated
- SSO, RBAC, IAM, and secret delivery patterns are in place
- Durable object storage is configured for checkpoints, savepoints, and artifacts
- HA, checkpointing, state backend, restart strategy, and operator UIDs have been verified
- Resource sizing has been tested with realistic workload assumptions
- Metrics, dashboards, logs, and alerts are operational
- Upgrade and rollback procedures are documented and tested
- Recovery from checkpoint has been tested end to end
- Governance rules, defaults, and ownership are clear
Minimum production configuration example
high-availability.type: kubernetes
high-availability.storageDir: s3://<bucket>/ha
execution.checkpointing.interval: 60s
execution.checkpointing.timeout: 10min
execution.checkpointing.min-pause: 30s
execution.checkpointing.tolerable-failed-checkpoints: 3
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
state.backend.type: rocksdb
state.backend.incremental: true
state.checkpoints.dir: s3://<bucket>/checkpoints
state.savepoints.dir: s3://<bucket>/savepoints
state.checkpoints.num-retained: 3
restart-strategy.type: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 10min
restart-strategy.failure-rate.delay: 30s
taskmanager.numberOfTaskSlots: 1
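A configuration like the one above can be checked mechanically before deployment. The sketch below parses flat `key: value` lines and reports missing keys and unreplaced placeholders such as `<bucket>`. The `REQUIRED_KEYS` set is an assumption drawn from the example on this page, not an official list, and `parse_flat_config` and `find_problems` are hypothetical helpers.

```python
# Assumed minimum key set, taken from the example configuration on this page.
REQUIRED_KEYS = {
    "high-availability.type",
    "high-availability.storageDir",
    "execution.checkpointing.interval",
    "state.checkpoints.dir",
    "state.savepoints.dir",
    "restart-strategy.type",
}


def parse_flat_config(text):
    """Parse 'key: value' lines into a dict, skipping blanks and # comments."""
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values like s3://... stay intact.
        key, sep, value = line.partition(":")
        if sep:
            config[key.strip()] = value.strip()
    return config


def find_problems(config):
    """List missing required keys and values that still contain a <placeholder>."""
    problems = [f"missing: {k}" for k in sorted(REQUIRED_KEYS - set(config))]
    problems += [
        f"placeholder left in {k}"
        for k, v in sorted(config.items())
        if "<" in v and ">" in v
    ]
    return problems


# Hypothetical input with one missing key and one unreplaced placeholder.
sample = """
high-availability.type: kubernetes
high-availability.storageDir: s3://<bucket>/ha
execution.checkpointing.interval: 60s
state.checkpoints.dir: s3://my-bucket/checkpoints
restart-strategy.type: failure-rate
"""
problems = find_problems(parse_flat_config(sample))
print(problems)
```

A check of this kind fits naturally into a CI step, so that a deployment with a missing savepoint directory or a leftover placeholder is rejected before it reaches the cluster.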