Manage Deployment Tuning
Effective resource management is crucial for maintaining the performance and efficiency of deployments. Autopilot and Scheduled Tuning provide automated solutions for optimizing resource allocation and adapting to workload demands. This guide explains how to configure and leverage these features for optimal deployment performance.
Overview
Deployment tuning typically requires a significant time investment. For example:
- When you publish a draft, you must configure resources, parallelism, and the number and size of TaskManagers for the draft.
- When a deployment is running, you must adjust the resources of the deployment to maximize resource utilization.
- If backpressure occurs on the deployment or the latency increases, you must adjust the configurations of the deployment.
Tuning Modes
Ververica recommends that you choose a tuning mode that best meets your business requirements. The three available tuning modes are:
- Default mode allows manual tuning based on system-generated recommendations for the running deployment.
- Autopilot dynamically adjusts resources based on workload demands, ensuring efficient utilization without manual intervention. It continuously monitors performance metrics and scales resources accordingly.
- Scheduled Tuning allows for predefined optimizations at specific intervals. This helps manage predictable workload patterns and ensures deployments maintain peak efficiency.
The following table describes each of the available tuning modes, the scenarios that are best suited for each one, the benefits of each mode, and recommended resources for further learning.
Tuning Mode | Use Case | How it Works | Benefits | References |
---|---|---|---|---|
Disabled (default mode) | You want to optimize deployment resources without using automated tuning modes. | Default tuning mode provides system-generated recommendations for your review. Note: The optimization suggestions are not automatically applied to the deployment. | Using the default tuning mode, you can manually tune and adjust deployment resources based on the resource suggestions provided on the running status of the deployment. | None |
Autopilot | Your deployment uses 30 compute units (CUs). After the deployment runs for a period of time, the CPU utilization and memory usage of the deployment are sometimes excessively low when no latency and no backpressure occur in the source. | If you do not want to manually adjust the resources of the deployment, you can enable Autopilot to allow the system to automatically adjust the resources. When the resource usage is low, the system automatically downgrades the resource configuration. When the resource usage reaches a specified threshold, the system automatically upgrades the resource configuration. | Autopilot helps you adjust the parallelism and resource configuration for a deployment in an efficient manner. It globally optimizes your deployment. This helps handle performance issues, such as low deployment throughput, upstream and downstream backpressure, and a waste of resources. | For more information about the default tuning actions of Autopilot, see Default tuning actions of Autopilot. For more information about how to enable the Autopilot feature, see Enable and configure Autopilot. |
Scheduled | Your deployment experiences predictable workload fluctuations based on business activity. For example, the peak hours of the deployment are from 09:00:00 to 19:00:00 every day, and the off-peak hours of the deployment are from 19:00:00 to 09:00:00 of the next day. | Scheduled tuning allows predefined resource adjustments based on known time intervals. In this case, you can enable scheduled tuning to use 30 CUs for your deployment during the peak hours and 10 CUs during the off-peak hours. | Scheduled tuning enables proactive scaling based on expected workload patterns, improving efficiency. It helps prevent resource waste during low-demand periods while ensuring sufficient capacity during peak usage. Note: You must obtain the resource usage in each time period. | For more information about how to configure scheduled tuning, see Using Scheduled Mode. |
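The Scheduled example in the table can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how Ververica Cloud implements scheduling. The peak window and CU values come from the table, while the names `cus_for` and `daily_cu_hours` are hypothetical.

```python
from datetime import time

# Values from the table's example; all names here are hypothetical.
PEAK_START = time(9, 0)
PEAK_END = time(19, 0)
PEAK_CUS = 30
OFF_PEAK_CUS = 10


def cus_for(t: time) -> int:
    """Return the CUs a schedule like the example would allocate at time t."""
    return PEAK_CUS if PEAK_START <= t < PEAK_END else OFF_PEAK_CUS


def daily_cu_hours() -> int:
    """Sum the allocation over 24 one-hour slots."""
    return sum(cus_for(time(hour, 0)) for hour in range(24))


if __name__ == "__main__":
    print(cus_for(time(12, 0)))   # inside the peak window
    print(cus_for(time(23, 30)))  # outside the peak window
    print(daily_cu_hours())       # compare with a fixed allocation: 30 * 24
```

Under these assumptions, the schedule uses 10 hours at 30 CUs plus 14 hours at 10 CUs per day, which is less than running 30 CUs around the clock.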
About Autopilot Mode
Autopilot optimizes deployment performance using two distinct strategies:
- Stable Strategy: Maintains a steady configuration once optimal settings are achieved, preventing unnecessary adjustments.
- Adaptive Strategy: Continuously adjusts parameters in response to system demands and performance fluctuations.
The Stable Strategy ensures that once a deployment reaches a steady state, Autopilot stops making adjustments under the following conditions:
- No adjustments have been made for 24 consecutive hours.
- The system has been running for 72 hours in Stable Strategy, regardless of adjustments.
Once either condition is met, Autopilot ceases parameter modifications. However, restarting the deployment resets all Stable Strategy statuses, and the 24-hour and 72-hour conditions begin recalculating from scratch.
Changes to Stable Strategy parameters are saved but do not reset these conditions. Only a deployment restart resets the Stable Strategy logic.
Unlike the Stable Strategy, the Adaptive Strategy continuously monitors system behavior and resource usage, making real-time adjustments. This strategy is ideal for deployments with dynamic workloads requiring constant optimization.
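The two Stable Strategy stop conditions described above can be expressed as a small check. This is a minimal sketch, not the product's internal logic. The function name and arguments are hypothetical, and it assumes that the 24-hour quiet period is counted from the last restart when no adjustment has happened yet.

```python
from datetime import datetime, timedelta
from typing import Optional

NO_ADJUSTMENT_WINDOW = timedelta(hours=24)  # from the first condition
MAX_STABLE_RUNTIME = timedelta(hours=72)    # from the second condition


def stable_tuning_stopped(
    started_at: datetime,
    last_adjusted_at: Optional[datetime],
    now: datetime,
) -> bool:
    """Return True once either stop condition holds.

    started_at is the time of the last deployment restart, which is the
    only event the text says resets both timers.
    """
    # Assumption: with no adjustment yet, the quiet period starts at restart.
    quiet_since = last_adjusted_at or started_at
    quiet_long_enough = now - quiet_since >= NO_ADJUSTMENT_WINDOW
    ran_long_enough = now - started_at >= MAX_STABLE_RUNTIME
    return quiet_long_enough or ran_long_enough
```

Because a restart is modeled as a new `started_at` value, passing a fresh timestamp restarts both the 24-hour and the 72-hour countdown.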
Limits and Considerations
Review these constraints when using Autopilot.
- Unaligned Checkpoints: You cannot modify the parallelism for a deployment if you enable the Unaligned Checkpoints feature.
- Session Clusters: Autopilot is not supported.
- Performance Bottlenecks: Autopilot cannot resolve all bottlenecks, as performance is influenced by upstream and downstream systems. It works best when:
  - Traffic changes smoothly.
  - No data skew exists.
  - Throughput scales linearly with increased parallelism.
  If these conditions are not met, issues may arise, such as:
  - Parallelism changes failing or deployments repeatedly restarting.
  - Performance degradation in UDSFs, UDAFs, or UDTFs.
  - Increased parallelism overloading external systems, leading to failures.
Review these key considerations when using Autopilot.
- Deployment Restarts: Autopilot restarts deployments when triggered, temporarily pausing data processing.
- Trigger Interval: Autopilot triggers every 10 minutes by default, configurable via the cooldown.minutes parameter.
- Manual Parallelism Configuration: If a DataStream deployment or custom SQL connector explicitly sets parallelism, Autopilot is disabled.
- Policy Timing: A new Autopilot policy cannot be triggered within 30 minutes of an existing policy.
Default Tuning Actions
When enabled, Autopilot automatically adjusts resource configurations based on system metrics.
Parallelism Adjustments
Autopilot optimizes deployment throughput by dynamically adjusting parallelism based on system performance.
- No change needed: If deployment delay remains below 60s, parallelism stays the same.
- Scaling up: If deployment delay exceeds 60s and continues increasing for 3 minutes, parallelism is increased up to twice the current processing capacity (capped at 64 CUs).
- Scaling down: If CPU utilization or vertex node processing time remains below 20% for 24 consecutive hours, parallelism is reduced to optimize resource efficiency.
- Other conditions:
  - If vertex node processing time exceeds 80% for 6 minutes, parallelism is increased to lower slot utilization to 50%.
  - If average CPU utilization of all TaskManagers exceeds 80% for 6 minutes, parallelism is increased to bring CPU usage down to 50%.
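The parallelism rules above can be read as a simple decision procedure. The sketch below is one possible reading, not Autopilot's actual implementation. The `Observation` fields and both function names are hypothetical, the order in which the rules are checked is an assumption, and `parallelism_for_target` assumes that utilization falls linearly as parallelism grows.

```python
import math
from dataclasses import dataclass


@dataclass
class Observation:
    """Pre-aggregated metrics; all field names are hypothetical."""
    delay_seconds: float         # current source consumption delay
    delay_rising_minutes: float  # how long the delay has kept increasing
    high_usage_minutes: float    # time with CPU or vertex processing time above 80%
    low_usage_hours: float       # time with CPU or vertex processing time below 20%


def decide(obs: Observation) -> str:
    """Return 'scale-up', 'scale-down', or 'no-change' for the listed rules."""
    if obs.delay_seconds > 60 and obs.delay_rising_minutes >= 3:
        return "scale-up"
    if obs.high_usage_minutes >= 6:
        return "scale-up"
    if obs.low_usage_hours >= 24:
        return "scale-down"
    return "no-change"


def parallelism_for_target(current: int, utilization: float, target: float = 0.5) -> int:
    """Parallelism needed to reach the target utilization, assuming linear scaling."""
    return max(current, math.ceil(current * utilization / target))
```

For example, a deployment at parallelism 8 with 80% utilization would need a parallelism of 13 to approach the 50% target under the linear-scaling assumption. Real deployments with data skew or external bottlenecks will not follow this estimate.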
Memory Optimization
Autopilot monitors memory usage and adjusts configurations to prevent failures.
- Scaling up:
  - If the JobManager experiences frequent garbage collection (GC) or out-of-memory (OOM) errors, memory is increased (up to 16 GiB).
  - If a TaskManager experiences GC, OOM, or HeartBeatTimeout errors, memory is increased (up to 16 GiB).
  - If TaskManager memory usage exceeds 95%, memory allocation is increased.
- Scaling down:
  - If TaskManager memory usage falls below 30% for 24 hours, memory allocation is reduced (minimum 1.6 GiB).
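The TaskManager memory rules above can be sketched as a clamped adjustment. The thresholds and the 1.6 GiB and 16 GiB bounds come from the list. The step size is not documented here, so the `factor` of 1.5 is purely an assumption for illustration, as are the function and argument names.

```python
MIN_GIB = 1.6   # documented minimum after scaling down
MAX_GIB = 16.0  # documented maximum after scaling up


def next_memory_gib(
    current_gib: float,
    usage: float,
    low_usage_hours: float,
    saw_gc_or_oom: bool,
    factor: float = 1.5,  # hypothetical step size, not from the documentation
) -> float:
    """Return the next TaskManager memory size under the listed rules."""
    if saw_gc_or_oom or usage > 0.95:
        # Scale up, but never beyond the documented cap.
        return min(current_gib * factor, MAX_GIB)
    if usage < 0.30 and low_usage_hours >= 24:
        # Scale down, but never below the documented floor.
        return max(current_gib / factor, MIN_GIB)
    return current_gib
```

The clamping is the part that reflects the documented behavior. How aggressively Autopilot grows or shrinks memory between those bounds may differ from this sketch.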
Run a Deployment Using Autopilot
You can enable and configure Autopilot when starting a job or from the Deployments > Resources tab.
- On the Dashboard page, locate the workspace you want to manage.
- Click the title of the workspace or the vertical ellipsis icon (⋮), and select Open Console.
- In the left-side navigation pane, click Deployments.
- On the Deployments page, click the name of the desired deployment.
- Choose one of the following methods for enabling Autopilot.
  - To enable Autopilot on an existing deployment:
    - Open the Resources tab.
    - Click Autopilot Mode and toggle Autopilot to ON.
    - Click Edit in the Configurations section.
  - To enable Autopilot when starting a job:
    - Click Start at the top right of the Deployments window.
    - Select the job start mode (Initial Mode or Resume Mode). For details, see Starting Jobs.
    - Toggle Configure Autopilot to ON.
    - Set the Resource Tuning Mode to Autopilot Mode.
- Select a resource tuning strategy:
  - Stable Strategy: The system limits the impact of restarts on the job and reduces job resources based on how the job behaves over a longer period, so that the deployment converges to a steady state as quickly as possible.
  - Adaptive Strategy: The system focuses on the current latency and resource usage of the job and adjusts resources more quickly as the related metrics change.
- Edit the parameters. See Autopilot Parameters.
- Click Save.
Autopilot Parameters
Parameter | Description |
---|---|
Cooldown Minutes | The time interval at which Autopilot is triggered after a deployment is restarted due to Autopilot. |
Max CPU | The maximum number of CPUs that can be allocated to a deployment. Unit: CPUs. Default (Adaptive): 64 cores. Default (Stable): 4 cores. |
Max Memory | The maximum amount of memory that can be allocated to a deployment. Unit: GiB. Default (Adaptive): 256 GiB. Default (Stable): 16 GiB. |
Max Delay | The maximum delay that is allowed. The throughput of the deployment is measured based on the delay of source data consumption. Default value: 1. Unit: minutes. If the data processing capability is insufficient and the data processing delay is longer than 1 minute, the system performs the scale-up operation to increase the throughput of the deployment or the system provides a recommendation for performing the scale-up operation. The system can increase the parallelism or split chains to perform the scale-up operation. |
mem.scale-up.interval | The time interval for memory expansion (the time interval between the most recent tuning and memory expansion tuning). If the memory usage is more than the specified threshold, the system increases the memory size. Default (Stable & Adaptive): 6 minutes. |
mem.scale-down.interval | The time interval for memory reduction (the time interval between the most recent tuning and memory reduction tuning). If the memory usage is less than the specified threshold, the system decreases the memory size. Default (Stable & Adaptive): 24 hours. |
parallelism.scale.max | The maximum parallelism when the value of the Parallelism parameter is increased. Default value: -1. This value indicates that the maximum parallelism is not limited. Available in Adaptive and Stable modes. |
parallelism.scale.min | The minimum parallelism when the value of the Parallelism parameter is decreased. Default value: 1. This value indicates that the minimum parallelism is 1. Available in Adaptive and Stable modes. |
parallelism.scale.up.interval | Time interval for parallelism expansion (the time interval between the most recent tuning and parallelism expansion tuning). Default (Adaptive): 6 minutes. Default (Stable): 6 minutes. |
parallelism.scale.down.interval | Time interval for parallelism reduction (the time interval between the most recent tuning and parallelism reduction tuning). Default (Adaptive): 24 hours. Default (Stable): 11 hours. |
delay-detector.scale-up.threshold | Measures the internal metric “currentFetchEventTimeLag”, which defines the difference between the time when the data is generated and the time when the data enters the Flink Source. Unit: ms. Default: 1 ms. Available in Adaptive and Stable modes. |
slot-usage-detector.scale-up.threshold | If the percentage of the data processing time of a vertex node is greater than the value of this parameter, the parallelism for the deployment is increased. Default value: 0.8. The idle time of data processing nodes is monitored. The idle time of source nodes is not monitored. If the percentage of the data processing time is continuously greater than 0.8, the parallelism for the deployment is increased to reduce the slot utilization, or the system provides a recommendation for performing the scale-up operation. Available in Adaptive and Stable modes. |
slot-usage-detector.scale-down.threshold | If the percentage of the data processing time of a vertex node is less than the value of this parameter, the parallelism for the deployment is decreased. Default value: 0.2. The idle time of data processing nodes is monitored. The idle time of source nodes is not monitored. If the percentage of the data processing time is continuously less than 0.2, the parallelism for the deployment is decreased to improve the resource utilization, or the system provides a recommendation for performing the scale-down operation. Available in Adaptive and Stable modes. |
slot-usage-detector.scale-up.sample-interval | The interval at which the slot idle metric is monitored. This parameter is used to calculate the average slot usage over the interval. Default value: 3 minutes. This parameter takes effect together with the slot-usage-detector.scale-up.threshold and slot-usage-detector.scale-down.threshold parameters. If the average percentage of data processing time in a 3-minute period is greater than 0.8, the scale-up operation is performed. If the average percentage of data processing time in a 3-minute period is less than 0.2, the scale-down operation is performed. Available in Adaptive and Stable modes. |
resources.memory-scale-up.max | The maximum memory size of a TaskManager and the JobManager. Default value: 16. Unit: GiB. When the system automatically tunes the resource configuration or increases the parallelism for a TaskManager or the JobManager, the maximum memory size is 16 GiB. Available in Adaptive and Stable modes. |
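The interaction between the sample interval and the two slot-usage thresholds in the table can be illustrated as follows. This is a hedged sketch rather than the detector's real code. It assumes that each sample is the fraction of time a vertex spent processing data, and that the samples passed in cover one sample interval. The function name is hypothetical; the threshold values are the documented defaults.

```python
from statistics import mean
from typing import Sequence

SCALE_UP_THRESHOLD = 0.8    # default of slot-usage-detector.scale-up.threshold
SCALE_DOWN_THRESHOLD = 0.2  # default of slot-usage-detector.scale-down.threshold


def slot_usage_signal(samples: Sequence[float]) -> str:
    """Classify one sample window (for example, 3 minutes of readings).

    Each sample is assumed to be the processing-time fraction of a vertex,
    between 0.0 and 1.0. Returns 'scale-up', 'scale-down', or 'none'.
    """
    if not samples:
        return "none"  # nothing observed, so no signal
    average = mean(samples)
    if average > SCALE_UP_THRESHOLD:
        return "scale-up"
    if average < SCALE_DOWN_THRESHOLD:
        return "scale-down"
    return "none"
```

Averaging over the window means that a single short spike does not trigger a signal on its own. Whether a signal leads to an actual change also depends on the scale-up and scale-down interval parameters listed in the table.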
About Scheduled Mode
Scheduled Mode is a good choice when you know your peak traffic patterns in advance, such as traffic for big events like Black Friday, or when you have high and low traffic periods during a given day. For example, you run a taxi ride service and a football game happens every Sunday. Or you run a live broadcasting service and your traffic spikes every evening during a popular show. You can create a scheduled plan to handle these traffic peaks.
Scheduled Mode also covers scenarios that Autopilot does not cover. For example:
- If you are using Autopilot and the traffic jitters frequently, Autopilot is triggered repeatedly and the job restarts continuously.
- When traffic changes slowly, Autopilot may not detect the full change, so tuning cannot be completed in one attempt. In this scenario, many iterations are needed to achieve good results.
In these scenarios, you can use Scheduled Mode to set the optimal resource requirements for the job based on the business characteristics. The job is then not restarted when traffic jitters frequently. When you know the required resource allocation for peak or trough traffic, you can adjust the job to a better state.
Run a Deployment Using Scheduled Mode
Instead of using Autopilot Mode, you can set up your own scheduled plans to run deployments at specific times, with specific tuning parameters. To schedule a deployment, at least one scheduled plan must exist.
Create a Plan
You can create a scheduled plan and then apply it to a job. The plan is then available to select when you start a deployment.
Once you have scheduled one or more plans, they apply to all the running jobs under that deployment.
- In the Ververica Cloud console, display the Deployments > Resources tab.
- Click Scheduled Mode.
- In the Resource Plans section, click New Plan.
- Enter a Plan Name and configure the parameters:
- Trigger Period: Valid values: No Repeat, Every Day, Every Week, and Every Month. If you set this parameter to Every Week or Every Month, you must specify the related time range during which you want the policy to take effect.
- Trigger Time: The time when the plan takes effect.
- For other parameter descriptions, see Resources and Parameters.
- (Optional) Scroll down and click New Resource Setting Period below the Resource Setting panel and create another set of parameters to control the schedule (e.g. another time period).
- Click OK.
The plan will be saved and listed in the Resource Plans section.
Start a Job Using a Scheduled Plan
- Click the deployment you want to start in the Deployments window.
- Click Start in the Deployments toolbar.
  Note: There must be at least one saved plan available to apply at startup.
- In the Start Job dialog, set the mode (Initial or Resume).
- Specify a start time, if appropriate.
- Toggle Configure Autopilot to ON.
- Set the Resource Tuning Mode to Scheduled Mode.
- Select a scheduled plan from the drop-down menu.
  Note: If no scheduled plans are available, or if you want to create another one, you can choose Create new scheduled plan. This takes you back to the main Resources window, where you need to follow the instructions in Create a Plan.
- Click Start. The job starts with the specified scheduled plan.
Change the Applied Plan
You can change the scheduled plan that applies to the currently running job.
This might cause the job to restart.
- In the Deployments > Resources tab, locate the entry for the scheduled plan that is applied to the running job.
- Click Stop Applying:
  - You can click Stop Applying next to the scheduled plan entry in the Resource Plans list.
  - You can click the main Stop Applying button near the top of the Resources tab. This is useful if you have many scheduled plans defined and can't easily see the one you want in the Resource Plans list.
- Click Apply next to the new scheduled plan.
Edit an Existing Plan
- You cannot edit the name of an existing plan. You would need to delete the plan and recreate it with a new name, or just create a new plan.
- You cannot edit the details of a plan that is currently applied to a running job.
To edit the details of an existing scheduled plan:
- Display the Deployments > Resources tab.
- Click on the name of the scheduled plan, or on Details, in the Resource Plans table.
- Click Edit at the top of the resulting plan screen.
- Change the parameters.
- Click Save.
Delete a Plan
You cannot delete a plan that is currently applied to a running job.
To delete a saved plan:
- In the Deployments > Resources tab, locate the entry for the scheduled plan that you want to delete.
- Click Delete next to the plan's entry.
- Click OK to confirm.