1. Project environment: our web project uses nginx as a front-end proxy, with 3 Tomcat instances running on the same server.
2. Business logic involved: the project includes file upload (potentially large files, e.g. a 100 MB Android game), client interface requests, and website backend management.
3. Problem reproduction:
3.1 After the Tomcat instances were configured, nginx was set up in front of them as an HTTP proxy.
3.2 Issue 1: the administrator could not upload large files from the backend; the uploads timed out. Repeated testing showed that any upload taking longer than about 1 minute timed out, while smaller files were unaffected.
3.3 nginx's default HTTP connection timeout is 75 seconds; connections held longer than that were terminated, which is exactly what happened during large uploads. Raising the timeout to 30 minutes with `keepalive_timeout 1800;` resolved the upload problem.
3.4 After 2 days of operation the server went down. Restarting nginx fixed it temporarily, but the crash recurred 2 hours later. The nginx error log contained the message "socket() failed (24: Too many open files) while connecting to upstream", indicating that the nginx connection limit (default 1024) had been reached.
3.5 To address this, I raised the connection limit with `worker_connections 10240;` (both settings appear in the configuration sketch after this list). That seemed to help in the short term, but the "socket() failed (24: Too many open files) while connecting to upstream" error kept reappearing intermittently.
3.6 Simply raising the connection limit was clearly not a complete fix. Further research tied the problem to the overly long `keepalive_timeout`: client interface calls are short-lived and their HTTP connections should be released as soon as a request completes, but with this configuration the connections were never released, active connections piled up, and nginx eventually fell over.
4. So, how should this problem be solved? Lowering keepalive_timeout could make large uploads fail again; raising it leaves a large number of idle HTTP connections occupying nginx's connection count. It looks like a dilemma!
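For reference, here is a minimal sketch of how the two changes from steps 3.3 and 3.5 would sit in nginx.conf; the surrounding block structure is assumed, and only the two directive values come from the steps above.

```nginx
# nginx.conf fragments
events {
    worker_connections 10240;   # step 3.5: raise the per-worker connection limit
}

http {
    keepalive_timeout 1800;     # step 3.3: keep idle HTTP connections open for up to 30 minutes
}
```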
Here comes the important part:
How to Set Nginx's TCP KeepAlive
At the start, I mentioned a recent issue: a client sends a request to the Nginx server, and the server needs a long computation before it can return a response, longer than the 90-second hold time of the LVS session. Capturing traffic with tcpdump on the server and analyzing it locally in Wireshark produced results like the second picture: a gap of roughly 90 seconds between the timestamps of the 5th and the last packet. Having determined that the problem was the LVS session hold time expiring, I started looking into how to set Nginx's TCP KeepAlive. The first option I found was keepalive_timeout. A colleague told me that setting keepalive_timeout to 0 disables keepalive, while setting it to a positive integer keeps the connection open for that many seconds, so I set keepalive_timeout to 75s; actual testing showed it had no effect. Clearly, keepalive_timeout cannot solve a TCP-layer KeepAlive problem. In fact, Nginx has quite a few options related to keepalive. The usual Nginx usage is as follows:
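Roughly something like this; the upstream name `backend`, the address, and the concrete values are illustrative rather than taken from any particular setup:

```nginx
http {
    # HTTP keep-alive towards the client
    keepalive_timeout  75s;              # how long an idle client connection is kept open
    keepalive_requests 100;              # requests allowed over a single client connection

    upstream backend {
        server 127.0.0.1:8080;
        keepalive 32;                    # idle keep-alive connections cached towards the upstream
    }

    server {
        listen 80;
        location / {
            proxy_pass http://backend;
            proxy_http_version 1.1;          # HTTP/1.1 is required for upstream keep-alive
            proxy_set_header Connection "";  # clear the Connection header sent to the upstream
        }
    }
}
```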
From the TCP layer, Nginx has to care about KeepAlive not only towards the client but also towards the upstream. At the HTTP protocol layer, it likewise has to care about client Keep-Alive and, if the upstream speaks HTTP, about upstream Keep-Alive as well. Overall it is rather complex, but once you understand both TCP and HTTP KeepAlive, you won't misconfigure Nginx's KeepAlive. While solving the problem, I initially wasn't sure whether Nginx even had a configuration option for TCP KeepAlive, so I opened the Nginx source code and searched for TCP_KEEPIDLE to see where it is used.
From the context of that code I could tell that TCP KeepAlive is indeed configurable, so I kept searching for the option that exposes it. Eventually I found that the listen directive's so_keepalive parameter configures KeepAlive for the TCP socket.
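A rough sketch of the accepted forms (the concrete values below are only examples):

```nginx
# so_keepalive=on | off | [keepidle]:[keepintvl]:[keepcnt]
listen 80 so_keepalive=on;        # enable TCP keepalive with the operating system defaults
listen 80 so_keepalive=30m::10;   # first probe after 30 minutes idle, default interval, 10 probes
```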
so_keepalive takes one of three mutually exclusive forms: so_keepalive=on, so_keepalive=off, or so_keepalive=[keepidle]:[keepintvl]:[keepcnt] (for example, so_keepalive=30s:: waits 30 seconds with no data before sending the first probe packet). By setting `listen 80 so_keepalive=60s::;`, I solved the problem of keeping Nginx's long connections alive behind LVS without resorting to other, more costly workarounds. If you run into a similar issue with commercial load-balancing devices, the same approach applies.
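A minimal sketch of the resulting server block, assuming everything else stays at its default; only the 60-second idle value comes from the text above:

```nginx
server {
    # first TCP keepalive probe after 60 seconds of idle, so the LVS session
    # (90-second hold time) is refreshed while the backend is still computing
    listen 80 so_keepalive=60s::;
}
```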