Background
A friend shared a case where it took over 30 seconds for a server to transfer an 8MB file to one client, while another client on the same LAN transferred the same file in just a few seconds. The difference in speed was significant. After investigating this slow file transfer issue, we discovered some unique factors contributing to the delay, which we would like to share with you.
Problem Information
The packet trace file information is mainly as follows:
λ capinfos "client slow.pcapng"
File name: client slow.pcapng
File type: Wireshark/... - pcapng
File encapsulation: Ethernet
File timestamp precision: microseconds (6)
Packet size limit: file hdr: (not set)
Number of packets: 16 k
File size: 20 MB
Data size: 19 MB
Capture duration: 80.048161 seconds
First packet time: 2022-05-28 19:09:33.167685
Last packet time: 2022-05-28 19:10:53.215846
Data byte rate: 248 kBps
Data bit rate: 1986 kbps
Average packet size: 1172.93 bytes
Average packet rate: 211 packets/s
Strict time order: True
Capture hardware: Intel(R) Core(TM) i7-9700 CPU xxx
Capture oper-sys: 64-bit Windows 10 (2009), build 22000
Capture application: Dumpcap (Wireshark) 3.4.6 (v3.4.6-0-g6357ac1405b8)
Number of interfaces in file: 1
Interface #0 info:
Name = \Device\NPF_{xxx}
Description = 以太网
Encapsulation = Ethernet (1 - ether)
Capture length = 262144
Time precision = microseconds (6)
Time ticks per second = 1000000
Time resolution = 0x06
Operating system = 64-bit Windows 10 (2009), build 22000
Number of stat entries = 1
Number of packets = 16943
λ
The client is a 64-bit Windows 10 system, and the capture was taken with Wireshark. The capture lasts 80 seconds and contains about 16k packets, the file size is 20 MB, and the average rate is about 1.986 Mbps.
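As a quick sanity check, the average rate can be recomputed from the other capinfos fields. This is a minimal sketch: the variable names are mine, and the data size is reconstructed as packets times average packet size, which is only approximate because capinfos rounds the average.

```python
# Values copied from the capinfos output above.
packets = 16943              # "Number of packets"
avg_packet_size = 1172.93    # "Average packet size", in bytes
duration_s = 80.048161       # "Capture duration", in seconds

# Approximate total captured bytes (capinfos rounds the average size).
data_bytes = packets * avg_packet_size

byte_rate = data_bytes / duration_s   # bytes per second
bit_rate = byte_rate * 8              # bits per second

print(f"byte rate: {byte_rate / 1e3:.0f} kBps")
print(f"bit rate : {bit_rate / 1e6:.3f} Mbps")
```

The result lines up with the "Data byte rate" and "Data bit rate" fields, which is where the roughly 2 Mbps figure comes from.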
The Expert Information showed a large number of "previous segment not captured", suspected retransmission, and DUP ACK entries. Given how many there were, the initial suspicion was packet loss.
Problem Analysis
Client Behavior
Since the reported problem concerns transfer speed, a quick look at I/O Graphs shows the throughput hovering around 2 Mbps. The server sent about 19 MB to the client in total. Given that the file is 8 MB and that there is a gap of roughly 2 seconds with no traffic at around the 40-second mark, the client most likely downloaded the file twice.
The Time Sequence (tcptrace) graph under TCP Stream Graphs looks roughly as follows:
After zooming in, a regular pattern is visible. Each round begins with slow start, and the number of MSS-sized segments sent per round trip gradually increases until packet loss occurs (marked by a red circle and shown as a short red line), which indicates congestion. Transmission then pauses, with no data sent for about 250 ms (marked by a red arrow). After that, slow start begins again, and the cycle repeats round after round.
MSS growth diagram
A gap of around 250 ms before a retransmission is characteristic of a timeout retransmission rather than a fast retransmission.
Packet Analysis
Turning to the packets themselves, the transfer uses TLSv1.2. The TCP three-way handshake shows an RTT of about 33 ms and an MSS of 1436.
In the last round before the packet loss, the server sent a total of 10 TCP segments, each 1383 bytes long on the wire and carrying 1329 bytes of TLS data, which is less than a full segment at the negotiated MSS of 1436 bytes.
The complete packet sequence around the suspected packet loss and the timeout retransmission is shown below:
Analysis of main phenomena:
- There is suspected packet loss between No.99 and No.100, where Wireshark flags TCP Previous segment not captured. The difference between Seq 99027 and Seq 93711 is 5316 bytes, which is exactly 4 segments of 1329 bytes, confirming that 4 TCP segments were lost here.
- The client responds with a SACK in No.110, where ACK 93711 marks the first missing segment, and SLE=99027 SRE=112317 marks the data that was received.
- In the exchange that followed, the client generated TCP Dup ACK only twice. Because there were only two, the server did not trigger a fast retransmission.
- The server timed out after about 250 ms and retransmitted the segment at Seq 93711 and the three segments that followed.
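The arithmetic behind the list above can be sketched in a few lines. This is an illustration only: the helper name `segments_in_range` is hypothetical, the sequence numbers are the ones quoted in the list, and the threshold of three duplicate ACKs is the standard fast retransmit rule from RFC 5681.

```python
SEGMENT_LEN = 1329       # TCP payload bytes per segment seen in this transfer
DUP_ACK_THRESHOLD = 3    # duplicate ACKs needed for fast retransmit (RFC 5681)


def segments_in_range(start_seq: int, end_seq: int,
                      segment_len: int = SEGMENT_LEN) -> int:
    """Return how many equal-sized segments fit exactly in [start_seq, end_seq)."""
    span = end_seq - start_seq
    if span < 0 or span % segment_len != 0:
        raise ValueError("range is not a whole number of segments")
    return span // segment_len


ack = 93711                 # cumulative ACK in No.110: first missing byte
sle, sre = 99027, 112317    # SACK block in No.110: bytes already received
dup_acks_seen = 2           # TCP Dup ACKs observed before the timeout

lost_segments = segments_in_range(ack, sle)     # hole before the SACK block
sacked_segments = segments_in_range(sle, sre)   # segments covered by the SACK
fast_retransmit = dup_acks_seen >= DUP_ACK_THRESHOLD

print(lost_segments, sacked_segments, fast_retransmit)
```

With only two duplicate ACKs the threshold is not reached, so the sender has to wait for its retransmission timer instead.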
The packet loss persists afterwards and follows the same pattern. Because the threshold of three TCP Dup ACKs is never reached, fast retransmission is not triggered in any of these cases. Each timeout retransmission costs about 250 ms, and there were 82 of them in total, which adds up to 20.5 seconds of idle waiting. The root cause of the slow transfer is therefore packet loss, and the transfer ends up slow because of the accumulated timeout retransmissions.
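The idle-time estimate is simple arithmetic, sketched here with the figures from the text. The variable names are mine, and the 250 ms pause is the approximate value read from the graph, not a measured constant. Note that the capture duration covers the whole trace, which appears to contain two download attempts.

```python
timeouts = 82                    # timeout retransmissions counted in the trace
pause_per_timeout_s = 0.250      # approximate idle time before each one
capture_duration_s = 80.048161   # "Capture duration" from capinfos

idle_s = timeouts * pause_per_timeout_s
idle_share = idle_s / capture_duration_s

print(f"idle waiting: {idle_s} s ({idle_share:.0%} of the capture)")
```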
In-depth analysis
So far, the root cause of the slow file transfer has been identified. The next step would be to find the point where the packet loss occurs. Why go deeper? Because this is where I think there is more to learn about packet analysis: observe more and think more.
The following questions concern some unusual behavior that does not have clear answers yet.
1. TCP MSS
The TCP three-way handshake clearly negotiated an MSS of 1436 bytes, and the subsequent TLS handshake does include full-size packets of 1490 bytes (1436 + 54). In the data transfer phase, however, each TCP segment is sent as a 1383-byte packet (1329 + 54). The transfer proceeds this way throughout, so the effective segment size becomes 1329 bytes.
Real answer: Unknown.
Guessed answer: The server's application layer controls how many bytes are sent at a time. A more likely explanation is given under question 4.
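The packet lengths quoted for question 1 can be checked with a short sketch. The 54-byte overhead is taken from the text; breaking it down as Ethernet plus IPv4 plus TCP headers without options is my assumption, and `frame_len` is a hypothetical helper.

```python
ETHERNET_HEADER = 14   # bytes
IPV4_HEADER = 20       # bytes, assuming no IP options
TCP_HEADER = 20        # bytes, assuming no TCP options on data segments
OVERHEAD = ETHERNET_HEADER + IPV4_HEADER + TCP_HEADER   # 54, as in the text

NEGOTIATED_MSS = 1436    # from the three-way handshake
OBSERVED_PAYLOAD = 1329  # TCP payload per segment during the transfer


def frame_len(tcp_payload: int) -> int:
    """Length of the captured frame for a given TCP payload size."""
    return tcp_payload + OVERHEAD


full_frame = frame_len(NEGOTIATED_MSS)        # largest packet in the TLS handshake
data_frame = frame_len(OBSERVED_PAYLOAD)      # packet size during file transfer
unused_per_segment = NEGOTIATED_MSS - OBSERVED_PAYLOAD

print(full_frame, data_frame, unused_per_segment)
```

Each data segment leaves 107 bytes of the negotiated MSS unused, which is the gap the guesses in this section try to explain.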
2. TCP DUP ACK
The analysis above only described the phenomenon: there were just two DUP ACKs, so fast retransmission could not be triggered. Careful readers may ask why the out-of-sequence TCP segments from No.100 onward did not cause the client to generate DUP ACKs continuously. Instead, only a single SACK is generated at No.110.
The thing to note here is timing. The server's TCP segments No.100 to No.109 arrive back to back, with no time interval between them, so the client does not acknowledge each out-of-order segment immediately. The result can be understood as the acknowledgments being merged into a single SACK. The same pattern appears in the rest of the transfer: segments arrive back to back with no time interval.
Some readers may have doubts, of course. This is just an ordinary Windows 10 client, and time deltas of exactly zero are more likely a matter of capture timestamp precision, so I treat this as an unresolved question.
Real answer: Unknown.
Guessed answer: In this particular out-of-order situation, Windows 10 does not send DUP ACKs.
3. TCP timeout retransmission
Some readers may have noticed that among the 4 timeout retransmissions, No.116 and No.118-No.120 are separated by an interval of about 30 ms, which is one RTT. The explanation is that the 4 original TCP segments lost between No.99 and No.100 were not sent in the same RTT.
The original of the retransmitted segment No.116 was sent by the server together with No.99 in the previous RTT. The originals of the three retransmitted segments No.118-No.120 were sent by the server in the same RTT as No.100, just before it.
This is what produces the time interval between the timeout retransmissions No.116 and No.118-No.120.
Real answer: This needs to be verified against a server-side trace file.
Guessed answer: It is caused by the way the server sends TCP segments.
4. TCP SACK
A strange SACK phenomenon also appeared in a later packet loss event. After two TCP segments were lost between No.273 and No.274, the client's SACK in No.284 indicated SLE=291732 SRE=305022, which means that No.274 and No.275 were received, and the subsequent SACKs in No.286 and No.288 also indicated that these two segments were received.
However, after the server retransmitted the lost segments as No.287 and No.289 due to timeout, it also retransmitted No.290 and No.291. The Seq numbers of these two segments correspond to No.274 and No.275, which means that the server either did not receive the SACKs or ignored them. The client then responded with a DSACK in No.292, indicating that SLE=291732 SRE=294390 was a duplicate.
From a packet loss perspective, it is hard to see why three consecutive SACKs would be lost while other packets are exchanged normally during the same period. Because there is no capture from the server side, we can only guess that the server did not receive the SACKs normally.
Real answer: Unknown.
Guessed answer: Combined with the MSS behavior in question 1 above, I think it is more likely that some device (a proxy or security device) on the server side or in the path rewrites the MSS to 1436 in both directions, while the server's actual MSS is 1329. The device may also rewrite the Seq numbers during NAT translation without translating the SLE and SRE values in the TCP options, causing the server to ignore the SACKs.
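To make the SACK and DSACK reasoning in question 4 concrete, here is a minimal sketch that checks whether a retransmitted byte range was already covered by an earlier SACK block. The function name `covered_by_sack` is hypothetical, and the per-segment boundaries are an assumption derived from the 1329-byte segment size; only the SLE and SRE values come from the trace.

```python
SEGMENT_LEN = 1329   # TCP payload bytes per segment in this transfer


def covered_by_sack(seg_start: int, seg_end: int,
                    sack_blocks: list[tuple[int, int]]) -> bool:
    """True if the byte range [seg_start, seg_end) lies inside one SACK block."""
    return any(sle <= seg_start and seg_end <= sre for sle, sre in sack_blocks)


# SACK block reported by the client in No.284 (and again in No.286, No.288).
sack_blocks = [(291732, 305022)]

# Assumed byte ranges of retransmissions No.290 and No.291, which repeat
# the data originally carried in No.274 and No.275.
first = 291732
retransmitted = [
    (first, first + SEGMENT_LEN),
    (first + SEGMENT_LEN, first + 2 * SEGMENT_LEN),
]

# A retransmission of data the receiver already SACKed is unnecessary.
unnecessary = [covered_by_sack(s, e, sack_blocks) for s, e in retransmitted]

# The DSACK block in No.292 should span exactly the duplicated bytes.
dsack_block = (291732, 294390)
duplicated_span = (retransmitted[0][0], retransmitted[-1][1])

print(unnecessary, duplicated_span == dsack_block)
```

Both retransmitted ranges fall inside the earlier SACK block, and together they match the DSACK range, which is consistent with the server not acting on the SACK information.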
Summary of the problem
In summary, without knowledge of the overall environment and without server-side packet trace files, some questions cannot be answered with certainty, but this does not prevent us from analyzing the trace we do have. Studying unusual phenomena like these also helps consolidate the fundamentals of TCP.