Troubleshooting Application Manager

On this page we will cover a few common debugging scenarios by walking through fictitious examples, and learn how to use the dA platform to find the root cause.

A Deployment is transitioning to RUNNING for a long time

In this scenario, a deployment resource is “stuck” in a transitioning status for a long period of time, and we would like to understand why.

  • The first thing to do is check the event log of the deployment. For example, suppose the event log shows the message Waiting for required number of Flink task slots (available: 0, required: 1), and nothing else for a long period. This would indicate that no TaskManager has been able to “report for duty”, so no task slots have become available.
  • To understand why, let’s find the id of the Application Manager job that is currently trying to start. For example, in the web UI, the jobs tab under the desired deployment will show a job with the status STARTING.
  • Then, let’s look for problematic task manager Pods. To do that we can use kubectl -n <namespace> get pods -l component=taskmanager,jobId=<job id>.
  • Having found the desired pod, we can now describe it via kubectl -n <namespace> describe pod <pod id> and examine its status.
  • To find the logs that are associated with the task managers for this job, we can use Kibana and filter log lines with the job id obtained before and limit the loggerName field to org.apache.flink.runtime.taskmanager.TaskManager.
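
The lookup steps above can also be scripted. The following is only a minimal sketch, assuming kubectl is installed and configured for the cluster. The helper names (get_taskmanager_pods_cmd, describe_pod_cmd, kibana_query, run) are hypothetical and not part of the platform. The label selector and the loggerName value are taken from the steps above; the name of the log field that holds the job id ("jobId" below) is an assumption and may differ in your logging setup, as is the use of Lucene query syntax in Kibana.

```python
import subprocess
from typing import List

# Logger name taken from the troubleshooting steps above.
TASKMANAGER_LOGGER = "org.apache.flink.runtime.taskmanager.TaskManager"


def get_taskmanager_pods_cmd(namespace: str, job_id: str) -> List[str]:
    """Build the kubectl command that lists the TaskManager pods of one job."""
    selector = f"component=taskmanager,jobId={job_id}"
    return ["kubectl", "-n", namespace, "get", "pods", "-l", selector]


def describe_pod_cmd(namespace: str, pod: str) -> List[str]:
    """Build the kubectl command that describes a single pod."""
    return ["kubectl", "-n", namespace, "describe", "pod", pod]


def kibana_query(job_id: str, job_id_field: str = "jobId") -> str:
    """Build a Lucene-style query string for the Kibana search bar.

    ASSUMPTION: job_id_field is a guess at the name of the log field that
    carries the job id; check your index mapping before relying on it.
    """
    return f'{job_id_field}:"{job_id}" AND loggerName:"{TASKMANAGER_LOGGER}"'


def run(cmd: List[str]) -> str:
    """Run a command and return its standard output.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

For example, print(run(get_taskmanager_pods_cmd("<namespace>", "<job id>"))) lists the candidate pods, and run(describe_pod_cmd("<namespace>", "<pod id>")) returns the description of one of them. The commands are built as argument lists rather than shell strings, so the namespace and job id never pass through shell interpolation.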

After figuring out the root cause, we can reconfigure the Application Manager deployment resource and wait for a new job to appear.

The web user interface is showing a white page

The Application Manager user interface may occasionally go white after clicking a link or loading a page. Usually there is a stack trace in the console of your browser’s Developer Tools. If possible, please report the steps to reproduce the issue, together with the stack trace, to our support.