Talk at Open Source India 2023

8/4/2024

Introduction: A challenge unveiled

I had the pleasure of presenting a talk at Open Source India 2023 recently, where I shared an intriguing problem I encountered while working on my open source project, Canvasboard. The issue was simple to state yet perplexing: a spike in 5xx errors during rolling updates in a Kubernetes environment. This wasn't just any bug; it stubbornly refused to be solved until I dug deep into the inner workings of Node.js, Docker, and NPM.

Graceful Shutdowns – Why they matter

Graceful shutdowns are crucial in ensuring that an application stops accepting new requests, closes its connections, and finishes any pending work before the process exits. In our Node.js application we had implemented handlers for these shutdown signals, yet during containerized deployment the signals seemed to vanish into thin air.
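
For context, here is a minimal sketch of what such a shutdown handler looks like in a Node.js HTTP server (illustrative only, not the exact Canvasboard code; the port and log messages are placeholders):

    const http = require('http');

    const server = http.createServer((req, res) => {
      res.end('ok');
    });

    server.listen(3000);

    // On SIGTERM (what Docker and Kubernetes send first during a rolling update),
    // stop accepting new connections, let in-flight requests finish, then exit.
    process.on('SIGTERM', () => {
      console.log('SIGTERM received, shutting down gracefully');
      server.close(() => {
        console.log('All connections drained, exiting');
        process.exit(0);
      });
    });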

Problem: The mysterious 5xx spike

The problem surfaced when our Dockerized Node.js application was being deployed. Despite implementing graceful shutdown handling for termination signals, we noticed a significant spike in 5xx errors during rolling updates. This was particularly frustrating because everything worked perfectly in local environments, yet once inside a container the errors persisted.

Initially, we thought the problem was correlated with CPU and memory spikes, which occur due to the container shutting down and the subsequent start of a new container. However, this assumption turned out to be incorrect.

Analyzing the Setup – What could be wrong?

To get to the bottom of this, I started by looking into how the Node.js process was being started inside the container. In our Canvasboard project we used npm run start to kick off the application, a common way to run Node.js applications. Despite our best efforts the 5xx issues persisted, and even more perplexing was the complete absence of termination-signal logs, which led me to question whether the process was receiving any signals at all.
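
The image definition at the time looked roughly like this (a simplified sketch; the base image tag and paths are illustrative):

    FROM node:18
    WORKDIR /app
    COPY package*.json ./
    RUN npm install
    COPY . .
    # The application is started through NPM rather than by invoking node directly.
    CMD ["npm", "run", "start"]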

Understanding PID 1 in Docker

I dug deeper into the concept of PID 1 in Docker. In a Docker container, the first process that runs becomes PID 1, which comes with specific responsibilities, including reaping zombie processes and handling termination signals. If this process doesn't handle those signals correctly, it would explain why our application wasn't shutting down gracefully.
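
A quick way to see what actually ended up as PID 1 is to peek inside the running container (the container name here is hypothetical; ps may need to be installed in slim images):

    docker exec my-app ps -ef
    # or, without ps, read the command line of PID 1 directly:
    docker exec my-app cat /proc/1/cmdline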

The role of NPM in the problem

As I examined our setup, I realized that while NPM could relay signals to its child processes, there was a hitch when a shell (sh -c) was involved. This shell was being introduced, unbeknownst to us, when npm run start was executed inside a debian-based Docker image. The shell blocked the signals from reaching the Node.js process, which explained why our shutdown logic wasn’t working as intended.

Uncovering the shell’s impact

To confirm that the shell was indeed the culprit, I tested the application by starting it with the shell form of the command, CMD node index.js, in the Dockerfile; the shell form still wraps the command in sh -c. After sending a docker stop to that container I had to wait the full 10 seconds (the default grace period before the Docker daemon sends a SIGKILL) and found no termination logs before the container exited, which validated my suspicion. The shell was obstructing the signal flow, causing the graceful shutdown to fail.
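
The check itself used nothing beyond the standard Docker CLI (the container name is hypothetical):

    docker stop my-app         # sends SIGTERM to PID 1, then SIGKILL after the grace period
    docker logs my-app         # no shutdown log line appeared before the kill
    docker stop -t 30 my-app   # a longer grace period does not help if the signal
                               # never reaches the Node.js process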

A surprising twist with alpine

Curious to see if this issue was unique to debian, I decided to test the application in an alpine-based Docker image as well. The results were eye-opening. When running the same npm run start command in an alpine container, the termination logs appeared as expected. This indicated that alpine handled process management differently, effectively bypassing the shell interference that had caused issues in the debian setup.

A black box exploration of base images

To further understand why these base images behaved differently, I treated both debian and alpine as black boxes and probed them with a series of commands like sh -c 'sleep 10; sleep 15; sleep 20' &. This approach allowed me to observe how each image managed processes under the hood. In debian, the intermediate shell stayed in the process tree and added unnecessary complexity, whereas alpine kept the tree minimal.
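
The probing looked roughly like this, run in a throwaway container of each base image (illustrative; on debian, ps may need to be installed first):

    docker run -it --rm debian sh
    # inside the container:
    sh -c 'sleep 10; sleep 15; sleep 20' &
    ps    # is an sh process sitting between PID 1 and the sleep commands?

    # repeat the same steps in alpine for comparison:
    docker run -it --rm alpine sh
    sh -c 'sleep 10; sleep 15; sleep 20' &
    ps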

Visualizing the problem

For a clearer understanding, I visualized the process flow in both debian and alpine environments. In debian, the shell inserted by NPM was causing a disconnect between the termination signals and the Node.js process.
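
Roughly, the debian picture was the following (a sketch of the process tree, not actual ps output):

    PID 1: npm run start
      └── sh -c node index.js     <- SIGTERM stops here
            └── node index.js     <- the shutdown handler never fires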

In contrast, alpine avoided this issue by not introducing a shell when there is only a single child process. This behavior, likely a result of alpine's internal optimizations aimed at staying lightweight and efficient, allowed signals to pass directly to the application without shell interference.
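
The alpine picture, again as a sketch:

    PID 1: npm run start
      └── node index.js           <- SIGTERM arrives, the graceful shutdown runs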

This comparison between debian and alpine highlighted the critical difference in how they managed processes. This insight was crucial in formulating a solution. I must also give credit to my friend Sai Praneeth, who brainstormed and explored this issue with me.

The Final Solution – Avoiding NPM

After all the exploration and testing, the solution turned out to be straightforward yet powerful: stop using NPM as a process manager in Docker, and keep anything that blocks signals, such as an intermediate shell, out of the path. Instead, we used ENTRYPOINT ["node", "index.js"] in the Dockerfile. This simple change ensured that Node.js received termination signals directly, effectively eliminating the 5xx errors during rolling updates.
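
The relevant part of the Dockerfile after the change (a sketch; the base image tag is illustrative):

    FROM node:18
    WORKDIR /app
    COPY package*.json ./
    RUN npm install
    COPY . .
    # Exec form: node itself becomes PID 1 and receives SIGTERM directly,
    # with no npm process or shell sitting in between.
    ENTRYPOINT ["node", "index.js"]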

Alternatively, third-party packages like PM2 can handle process management and signal handling more gracefully in Docker environments. However, for our specific use case, the direct approach proved to be the most efficient solution.
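
For completeness, the PM2 route typically looks something like this in a Dockerfile (a sketch based on PM2's container-oriented pm2-runtime entry point; not what we shipped):

    RUN npm install -g pm2
    ENTRYPOINT ["pm2-runtime", "index.js"]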

Thanks!