Diagnosing Slow File Transfer Due to TCP Packet Loss

Background

A friend shared a case where it took over 30 seconds for a server to transfer an 8MB file to one client, while another client on the same LAN transferred the same file in just a few seconds. The difference in speed was significant. After investigating this slow file transfer issue, we discovered some unique factors contributing to the delay, which we would like to share with you.

Problem Information

The packet trace file information is mainly as follows:

The client runs 64-bit Windows 10, and the packets were captured with Wireshark. The capture spans 80 seconds and contains about 16k packets; the trace file is 20 MB, and the average rate is about 1.986 Mbps.

Wireshark's Expert Information reported previous segments not captured, suspected retransmissions, and Dup ACKs. Given how many of these there were, the initial suspicion was packet loss.
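
As a quick sanity check, the reported average rate can be roughly reproduced from the rounded figures above. This is only a sketch: the helper name `average_rate_mbps` is made up for illustration, and the inputs are the rounded 20 MB and 80 s, so the result only approximates the 1.986 Mbps that Wireshark computes from exact byte counts.

```python
def average_rate_mbps(total_bytes: float, duration_s: float) -> float:
    """Average throughput in megabits per second (1 Mbps = 1e6 bit/s)."""
    return total_bytes * 8 / duration_s / 1e6


# Rounded figures from the capture summary: about 20 MB over about 80 s.
rate = average_rate_mbps(20e6, 80)
print(f"{rate:.2f} Mbps")
```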

image.png
image.png

Problem Analysis

Client Behavior

Since the problem reported by the client concerns transfer speed, a quick look at the I/O Graphs shows that throughput stays at roughly 2 Mbps. The server sent about 19M bytes in total to the client. Given that the transferred file is 8 MB and that there is a gap of about 2 seconds with no traffic at around the 40-second mark, it appears that the client downloaded the file twice.

Slow File Transfer
image.png
image.png

The Time Sequence (tcptrace) view under TCP Stream Graphs looks roughly as follows:

image.png

After zooming in, a regular pattern is visible. Each time slow start begins, the amount of data sent per round trip (the congestion window, not the MSS itself) gradually increases until packet loss occurs (marked by a red circle and shown as a short red line), which indicates congestion. Transmission then pauses, and no data is sent for about 250 ms (marked by a red arrow). After that, slow start begins again, and the cycle repeats round after round.
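
The repeating pattern can be illustrated with a toy model. This is a sketch only: it assumes textbook slow start (the window doubles every round trip), and both the initial window and the window size at which loss occurs are hypothetical values, not measurements from this trace.

```python
from typing import List


def slow_start_rounds(initial_cwnd: int, loss_cwnd: int) -> List[int]:
    """Window size (in segments) per round trip, up to the round that hits loss.

    Toy model: the window doubles each round trip until it reaches or
    exceeds loss_cwnd, the (hypothetical) size at which the path drops packets.
    """
    if initial_cwnd <= 0:
        raise ValueError("initial_cwnd must be positive")
    rounds = []
    cwnd = initial_cwnd
    while cwnd < loss_cwnd:
        rounds.append(cwnd)
        cwnd *= 2
    rounds.append(cwnd)  # the round in which loss is assumed to occur
    return rounds


# Hypothetical parameters: start at 2 segments, loss once 10 or more are in flight.
one_cycle = slow_start_rounds(2, 10)
print(one_cycle)
```

In this model, every cycle ends in a loss, and if the loss can only be repaired by a timeout, each cycle also costs one idle retransmission timeout before the next one starts.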

image.png

Congestion window growth diagram

image.png

When a retransmitted segment arrives only after a pause of around 250 ms, it is almost certainly a timeout retransmission (RTO).
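
One way to locate such pauses in an exported list of packet timestamps is to scan for gaps above a threshold. The sketch below is illustrative: `find_stalls` is a made-up helper, the 0.2-second threshold is an assumption chosen to sit well above one round trip and below the observed 250 ms pause, and the sample timestamps are invented.

```python
from typing import List, Tuple


def find_stalls(timestamps: List[float], min_gap_s: float = 0.2) -> List[Tuple[float, float]]:
    """Return (time_of_last_packet_before_gap, gap_length) for each long gap."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap >= min_gap_s:
            stalls.append((prev, gap))
    return stalls


# Invented arrival times (seconds): three packets, a ~250 ms pause, then two more.
sample = [0.000, 0.033, 0.066, 0.316, 0.349]
stalls = find_stalls(sample)
for start, gap in stalls:
    print(f"stall after t={start:.3f}s lasting {gap * 1000:.0f} ms")
```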

image.png

Packet Analysis

Turning to the actual packets, we can see that the data is carried over TLSv1.2. From the TCP three-way handshake, the RTT is about 33 ms and the MSS is 1436.

image.png

In the last round of transmission before the packet loss, the server sent a total of 10 TCP segments, each 1383 bytes long on the wire and carrying 1329 bytes of TLS data, which is less than the full MSS of 1436 bytes.

image.png

The complete data packet suspected of packet loss and timeout retransmission is shown below:

image.png

Analysis of main phenomena:

  1. Packet loss is suspected between No.99 and No.100, where Wireshark flags "TCP Previous segment not captured". The difference between Seq 99027 and Seq 93711 is 5316 bytes, which is exactly 4 segments of 1329 bytes, confirming that 4 TCP segments were lost here.
  2. The client responds with a SACK in No.110, where ACK 93711 identifies the missing segment still required, and SLE=99027 SRE=112317 marks the data already received.
  3. The packets exchanged after that caused the client to send a TCP Dup ACK only twice. Because there were only two, the server never triggered a fast retransmission.
  4. The server timed out after about 250 ms and retransmitted the segment with Seq 93711 and the three segments that followed.
image.png
image.png
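
The sequence-number arithmetic in the list above can be checked with a few lines of Python. This is a minimal sketch: `segments_in_range` is a hypothetical helper, it assumes every segment carries exactly 1329 bytes, and it ignores sequence-number wraparound.

```python
def segments_in_range(left_edge: int, right_edge: int, seg_len: int) -> int:
    """Number of equal-sized segments covering [left_edge, right_edge)."""
    span = right_edge - left_edge
    if span < 0 or span % seg_len != 0:
        raise ValueError("range is not a whole number of segments")
    return span // seg_len


SEG_LEN = 1329  # payload bytes per segment observed in the data phase

# Hole between the cumulative ACK (93711) and the SACK left edge (99027).
lost = segments_in_range(93711, 99027, SEG_LEN)
# Data covered by the SACK block SLE=99027 SRE=112317.
sacked = segments_in_range(99027, 112317, SEG_LEN)
print(lost, sacked)
```

Under these assumptions, the hole corresponds to the 4 lost segments, and the SACK block covers the 10 segments No.100 to No.109.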

The packet loss continued afterwards, with similar symptoms each time. Because the threshold of three TCP Dup ACKs was never reached, fast retransmission was never triggered. Each timeout retransmission cost about 250 ms, and this happened 82 times in total, which adds up to about 20.5 seconds of idle waiting. The root cause of the slow transfer is therefore packet loss, and the transfer ended up so slow because of the accumulated delay from the repeated timeout retransmissions.
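
The reasoning in this paragraph can be written down as a small sketch. The function names are hypothetical, and the threshold of three duplicate ACKs is the standard fast-retransmit trigger; the server's actual TCP stack may differ in its details.

```python
DUP_ACK_THRESHOLD = 3  # standard fast-retransmit trigger


def fast_retransmit_triggered(dup_acks: int) -> bool:
    """True if enough duplicate ACKs arrived to trigger fast retransmit."""
    return dup_acks >= DUP_ACK_THRESHOLD


def idle_time_s(timeouts: int, rto_s: float) -> float:
    """Total time spent waiting for retransmission timers to expire."""
    return timeouts * rto_s


# Figures from the trace: 2 Dup ACKs per loss event, 82 timeouts of about 250 ms.
print(fast_retransmit_triggered(2))
print(idle_time_s(82, 0.25))
```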

In-depth analysis

So far, we have identified the root cause of the slow file transfer. The next step would be to find where the packet loss occurs. Why dig deeper? Because this is where I think I can keep learning and improving at packet analysis: observe more and think more.

The following are some peculiar questions that do not yet have clear answers.

1. TCP MSS

The TCP three-way handshake clearly negotiated an MSS of 1436 bytes, and the subsequent TLS handshake does include maximum-length packets of 1490 bytes (1436 + 54). In the data transfer phase that follows, however, each TCP segment is sent as 1383 bytes (1329 + 54), so the effective segment size there is 1329 bytes rather than the negotiated MSS.

Real answer: Unknown.

Guessed answer: The server's application layer controls how many bytes are written at a time; a more likely explanation is given under question 4.
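
The 54-byte difference between payload size and packet length can be made explicit. The breakdown below is an assumption rather than something read from the trace: it supposes Ethernet II framing (14 bytes), an IPv4 header without options (20 bytes), and a TCP header without options (20 bytes).

```python
# Assumed header sizes (not confirmed by the trace): Ethernet II, IPv4 and
# TCP headers, both without options.
ETH_HDR = 14
IPV4_HDR = 20
TCP_HDR = 20
HEADER_OVERHEAD = ETH_HDR + IPV4_HDR + TCP_HDR


def frame_len(tcp_payload: int) -> int:
    """Captured frame length for a given TCP payload size."""
    return tcp_payload + HEADER_OVERHEAD


print(frame_len(1436))  # full-MSS segment, as in the TLS handshake
print(frame_len(1329))  # segment size used in the data phase
```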

2. TCP DUP ACK

The analysis above only described the phenomenon: there were just two Dup ACKs, so fast retransmission could not be triggered. Careful readers may ask why the out-of-sequence TCP segments after No.100 did not make the client generate Dup ACKs continuously. Instead, only a single SACK was generated, in No.110.

What matters here is timing. Segments No.100 to No.109 were sent by the server back to back with no time gap between them, so the client did not acknowledge each out-of-order segment immediately; instead, the acknowledgments were effectively merged into a single SACK. The same pattern appears in the later transfers: the segments arrive back to back with no gap between them.

Of course, some readers may have doubts: this is just an ordinary Windows 10 client, and time deltas that are all 0 are more likely a limitation of the capture's timestamp precision. That is why I classify this as an unresolved question.

Real answer: Unknown.

Guessed answer: In this particular out-of-order situation, Windows 10 does not send Dup ACKs.

image.png

3. TCP timeout retransmission

Some readers may have noticed that the four segments retransmitted on timeout, No.116 and No.118-No.120, are separated by an interval of about 30 ms, which is one RTT. The actual explanation is that the four original segments lost between No.99 and No.100 were not sent in the same RTT.

The original of retransmitted segment No.116 was sent by the server together with No.99 in the previous RTT. The originals of the three retransmitted segments No.118-No.120 were sent together with No.100 in the current RTT, just before No.100.

This is what produces the time gap among the four timeout retransmissions No.116 and No.118-No.120.

Real answer: This needs to be verified against a server-side trace file.

Guessed answer: It is caused by the way the server sent the original TCP segments.

image.png

4. TCP SACK

In a later packet loss event, a strange SACK phenomenon also appeared. After two TCP segments were shown to be lost between No.273 and No.274, the client's SACK in No.284 carried SLE=291732 SRE=305022, which means that No.274 and No.275 were received, and the subsequent SACKs in No.286 and No.288 also indicated that these two segments were received.

However, after the server retransmitted the lost segments on timeout as No.287 and No.289, it also retransmitted No.290 and No.291. The sequence numbers of these two segments correspond to No.274 and No.275, which means that the server either did not receive the SACKs properly or ignored them. The client then responded with a D-SACK in No.292, indicating that SLE=291732 SRE=294390 was duplicate data.
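
How a D-SACK differs from an ordinary SACK can be sketched as follows. This is a simplified check in the spirit of RFC 2883, not Wireshark's or any TCP stack's actual code: `is_dsack` is a hypothetical helper, sequence-number wraparound is ignored, and the cumulative ACK values in the example are assumptions (the trace's real ACK numbers for No.284 and No.292 are not quoted in this article).

```python
from typing import List, Tuple


def is_dsack(ack: int, blocks: List[Tuple[int, int]]) -> bool:
    """Simplified D-SACK test on a list of (SLE, SRE) blocks.

    The first block reports duplicate data if it starts below the cumulative
    ACK, or if it lies entirely inside the second block.
    """
    if not blocks:
        return False
    left, right = blocks[0]
    if left < ack:
        return True
    if len(blocks) > 1:
        left2, right2 = blocks[1]
        return left2 <= left and right <= right2
    return False


# Assumed ACK 289074: two 1329-byte segments still missing below SLE=291732.
ordinary = is_dsack(289074, [(291732, 305022)])
# Assumed ACK 305022: everything received, block repeats already-acked data.
duplicate = is_dsack(305022, [(291732, 294390)])
print(ordinary, duplicate)
```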

From a packet loss perspective, it is hard to see how three consecutive SACKs could all be lost while normal packet exchange continued during the same period. Because we have no capture taken on the server, we can only guess that the server did not receive the SACKs properly.

Real answer: Unknown.

Guessed answer: Combined with the MSS change in question 1 above, I think it is more likely that some device (a proxy or security device) sits on the server side or in the middle and rewrites the MSS to 1436 in both directions, while the server's actual MSS is 1329. The same device may also rewrite the sequence numbers during NAT translation without translating the SLE and SRE values in the TCP options, causing the server to ignore the SACKs.

image.png

Summary of the problem

In summary, without knowledge of the overall environment and without a server-side trace file, some questions cannot be answered definitively, but that does not prevent us from analyzing the capture we do have. Studying unusual phenomena like these also helps consolidate the fundamentals of TCP.