Understanding Hypertext Transfer Protocol: A Comprehensive Guide to HTTP Functionality and Analysis

1. Introduction to the Hypertext Transfer Protocol (HTTP)

1. What is the HTTP Protocol?

HTTP stands for Hypertext Transfer Protocol.

HTTP is a protocol used to transfer Hypertext Markup Language (HTML) files and other web resources from a web server to a client browser. It is one of the most widely used protocols on the Internet, and the web pages we access every day are transferred over it.

2. Understanding How the Hypertext Transfer Protocol Works

When HTTP identifies a resource by name (for example, when a URL is entered in the browser), it follows the rules of the Uniform Resource Identifier (URI). The most commonly used form of URI on today's networks is the Uniform Resource Locator (URL). When a client enters a URL or clicks a hyperlink in the browser, the address to be accessed is determined.

Taking http://www.colasoft.com.cn/resource/index.html as an example, a URL is composed as follows:

1) http://: indicates that the Hypertext Transfer Protocol is used to retrieve the page from the web server. Users can usually omit it, because the browser adds it automatically.

2) www: the host name of the web server within the domain.

3) colasoft.com.cn/: the domain name of the web server, or the name of the site.

4) resource/: a subdirectory on the web server, similar to a folder on a computer.

5) index.html: a webpage file in the resource subdirectory on the web server, which is the file the web server transmits to the client browser.
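
The breakdown above can be reproduced with Python's standard `urllib.parse` module. This is only a minimal sketch: the helper name `describe_url` and the keys of the returned dictionary are our own choices for illustration, and the default port is filled in on the assumption that a URL without an explicit port uses 80 for http.

```python
from urllib.parse import urlsplit

def describe_url(url):
    """Split a URL into the parts discussed above (hypothetical helper)."""
    parts = urlsplit(url)
    # Separate the directory portion of the path from the file name.
    directory, _, filename = parts.path.rpartition("/")
    return {
        "scheme": parts.scheme,            # "http"
        "host": parts.hostname,            # host name plus domain name
        "port": parts.port or (443 if parts.scheme == "https" else 80),
        "directory": directory + "/",      # subdirectory on the web server
        "file": filename,                  # the page file that is requested
    }

info = describe_url("http://www.colasoft.com.cn/resource/index.html")
for key, value in info.items():
    print(f"{key}: {value}")
```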

HTTP runs over TCP, using port 80 by default, for reliable data transmission. An HTTP session is initiated by the client and includes the following steps:

1) The client identifies the URL it wishes to access in the browser;

2) The client initiates an HTTP connection request, starting an HTTP session between the client (UA, the user agent) and a WWW server or proxy server;

3) The WWW server or proxy server transmits the content requested by the client based on the client’s URL request.

3. How the Hypertext Transfer Protocol Operates

Macroscopic Operation:

1) The path of communication when a client (UA) directly connects to the web server is as shown in Figure 1. Communication between the client and the web server does not require any intermediary server, which is the simplest scenario.

                                             (Figure 1: Direct connection between client and web server)

 2) The communication path when a client (UA) connects to the web server through an intermediary server is shown in Figure 2. Communication between the client and the web server is forwarded by the intermediary server, of which there may be one or more.

                                     (Figure 2: Client connection to web server through intermediary server)

 3) The communication path when requests pass through a chain of intermediary servers is shown in Figure 3. The client sends its request to intermediary server 1, which forwards it to intermediary server 2, which in turn forwards it to the web server. The response travels back along the same chain, so the client receives the content from intermediary server 1 rather than directly from the web server.

                                           (Figure 3: Communication process between client and intermediary server)

Internal Operational Process:

As shown in Figure 4, an HTTP exchange is divided into four steps: establishing the connection, sending the request, sending the response, and closing the connection.

                                                 (Figure 4: Internal operation process of HTTP protocol)
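
The four steps can be sketched with a raw TCP socket. To keep the example self-contained, it starts a small throwaway server on the local machine with Python's standard `http.server` module instead of contacting a real site. The names `HelloHandler` and `fetch` are hypothetical, and the request is sent as HTTP/1.0 on purpose so that the server closes the connection after one response and all four steps stay visible.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    """Throwaway local web server used only for this illustration."""

    def do_GET(self):
        body = b"<html><body>hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress the default access log

def fetch(host, port, path="/"):
    # Step 1: establish the TCP connection.
    with socket.create_connection((host, port), timeout=5) as sock:
        # Step 2: send the request message.
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        # Step 3: receive the response until the server stops sending.
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    # Step 4: leaving the "with" block closes the connection.
    return b"".join(chunks)

server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
raw = fetch("127.0.0.1", server.server_port)
server.shutdown()
server.server_close()

status_line = raw.split(b"\r\n", 1)[0].decode("ascii")
print(status_line)
```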

4. HTTP Protocol Message Format

The HTTP message sent by the client is called a request message; the HTTP message sent by an intermediary or web server is called a response message. Both types of message use the following format:

  • A start line, i.e., the request line of a request message or the status line of a response message;
  • Message headers;
  • An empty line;
  • Message body.
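
As a rough illustration of this layout, the sketch below splits a raw message into its start line, headers, and body. The function name `split_message` and the sample request are assumptions made for the example. It is deliberately simplified: a header that appears more than once would overwrite the earlier value.

```python
def split_message(raw):
    """Split a raw HTTP message into (start line, headers, body)."""
    # The first empty line separates the head of the message from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body

# Sample request, written by hand for illustration.
example = (
    b"GET / HTTP/1.1\r\n"
    b"Host: www.colasoft.com.cn\r\n"
    b"Connection: Keep-Alive\r\n"
    b"\r\n"
)
start_line, headers, body = split_message(example)
print(start_line)
print(headers)
print(body)
```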

2. Analyzing HTTP Communication

1. Analyzing the Specific Process of HTTP Access

We use Colasoft Network Analysis System to capture and analyze an HTTP communication process. The client host name is “wangym,” the client browser is IE6.0, and the requested domain name is “www.colasoft.com.cn”.

Open the Colasoft Network Analysis System on the client. To avoid data interference, a filter can be set to capture only the local machine’s data communication. Once set, start capturing data and enter www.colasoft.com.cn in the browser on the local machine. After the webpage is fully opened, stop capturing.

Note: The HTTP access mentioned in this article refers to communication on the standard port 80. For HTTP access on non-80 ports, users can change it in “Project -> Advanced Analysis Module -> HTTP Analysis Module -> General Settings -> Port”. The default is 80. When the HTTP service has multiple ports, they are separated by a semicolon, such as 80;8080.

1) HTTP Request

Figure 5 shows the HTTP request trace captured by the Colasoft Network Analysis System while accessing www.colasoft.com.cn in the operation above.

                                            (Figure 5: HTTP GET Request Operation)

The packet list in Figure 5 reveals the raw information of the HTTP request in the above operation as follows:

  1. The first packet is a DNS query packet, where the local machine retrieves the IP address corresponding to www.colasoft.com.cn through a DNS query.
  2. The second packet is a DNS response packet, where the DNS server finds the IP corresponding to the domain www.colasoft.com.cn as 64.246.27.237 and transmits the query result to the client.
  3. The third, fourth, and fifth packets are the three-way handshake packets of the TCP connection between the local machine and the IP address corresponding to the domain www.colasoft.com.cn, 64.246.27.237.
  4. The sixth packet is the HTTP GET request initiated by the client, requesting content from the web server; its decoding contains the parameters of the GET request.

The HTTP request method in the above HTTP access is GET, and GET is just one of many methods in HTTP. HTTP uses different methods to achieve different functions. The table below lists common HTTP request methods.

Every HTTP request consists of two parts:

  1. The HTTP request line, whose method is usually GET or POST;
  2. Optional message headers, which vary depending on the client browser and its configuration options.

Specific analysis of the HTTP request decoding in the sixth packet of Figure 5 reveals the following information:

  1. HTTP request: The method is GET, “/” is the request URI (Uniform Resource Identifier) and refers to the web server’s root path, and “HTTP/1.1” indicates the protocol and its version;
  2. Accept: Specifies the content types the client can accept, such as gif, bitmap, and jpeg here. Preference among types is expressed with optional quality (q) values rather than by position alone;
  3. Accept-Language: Specifies the preferred language as Chinese;
  4. Accept-Encoding: Specifies that the client accepts content encoded with gzip or deflate;
  5. User-Agent: Identifies the browser running on the HTTP client;
  6. Host: Contains the host information, here www.colasoft.com.cn;
  7. Connection: Specifies the connection type as Keep-Alive.
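
The request line and a list-valued header such as Accept-Encoding can be taken apart as follows. This is a sketch under our own naming: `parse_request_line` and `parse_list_header` are hypothetical helpers, and the input strings are typed in by hand rather than copied from the capture.

```python
def parse_request_line(line):
    """Split a request line into method, request target, and protocol version."""
    method, target, version = line.split(" ", 2)
    protocol, _, number = version.partition("/")
    return {
        "method": method,      # e.g. GET or POST
        "target": target,      # the request URI, "/" for the root path
        "protocol": protocol,  # "HTTP"
        "version": number,     # "1.1"
    }

def parse_list_header(value):
    """Split a comma-separated header value such as Accept-Encoding."""
    return [item.strip() for item in value.split(",") if item.strip()]

request_line = parse_request_line("GET / HTTP/1.1")
encodings = parse_list_header("gzip, deflate")
print(request_line)
print(encodings)
```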

Note: When loading a webpage, the client browser may open multiple TCP connections to the web server at the same time. For example, the images on a page may be fetched over separate TCP connections.

2) HTTP Response

After receiving an HTTP request, the web server sends a response to the HTTP client.

Figure 6 shows the HTTP response trace captured by the Colasoft Network Analysis System while accessing www.colasoft.com.cn in the operation above.

                                                       (Figure 6: HTTP Response)

The eighth packet in Figure 6 is the HTTP response packet returned by the web server to the client. Detailed analysis of its decoding reveals the following information:

HTTP response: “HTTP/1.1” indicates the protocol and its version, and “200 OK” is the HTTP response status code with its reason phrase, indicating that the request succeeded and the requested page was returned normally.

  1. Date: Shows the date and time at which the response was generated.
  2. Server: Shows the type of web server software serving the requested page.
  3. X-Powered-By: Shows the scripting technology used to generate the requested page.
  4. Set-Cookie: Shows the cookie that the server asks the client to store.
  5. Keep-Alive: Shows the Keep-Alive time for this HTTP connection.
  6. Connection: Shows the connection type for this HTTP connection as Keep-Alive.
  7. Transfer-Encoding: Shows the transfer encoding applied to the message body.
  8. Content-Type: Shows the media type of the returned content.
  9. Line1-N: The HTML code transmitted by the web server to the client browser.
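
Python's standard `http.client.HTTPResponse` class can decode a response like this one. The sketch below feeds it a hand-written response through a small stand-in object instead of a real network socket. `FakeSocket` is a hypothetical helper, and the header values (including the server name `ExampleServer`) are invented for illustration and are not taken from the capture.

```python
import io
from http.client import HTTPResponse

class FakeSocket:
    """Stand-in for a socket: hands HTTPResponse a file over canned bytes."""

    def __init__(self, data):
        self._file = io.BytesIO(data)

    def makefile(self, *args, **kwargs):
        return self._file

# Hand-written response; the header values are illustrative only.
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: ExampleServer\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<html></html>"
)

response = HTTPResponse(FakeSocket(raw))
response.begin()                      # parse the status line and the headers
status = response.status              # numeric status code
reason = response.reason              # reason phrase
content_type = response.getheader("Content-Type")
page = response.read()                # the message body (the HTML code)

print(status, reason)
print(content_type)
print(page)
```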

In Ethernet, a frame ranges from 64 to 1518 bytes in size. If the response carrying the requested page does not fit in a single frame, it is split across multiple TCP segments and transmitted to the client in pieces. Once the client browser receives the HTML code transmitted by the web server, it parses the data and renders it as a webpage.
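
A quick calculation shows how many TCP segments a page of a given size needs. The figures below are assumptions for a typical case: a 1500-byte Ethernet payload (the 1518-byte maximum frame minus the 14-byte Ethernet header and the 4-byte frame check sequence), and IPv4 and TCP headers of 20 bytes each with no options. Real captures may differ, and the function name `segments_needed` is our own.

```python
import math

ETHERNET_PAYLOAD = 1500  # 1518-byte frame minus 14-byte header and 4-byte FCS
IP_HEADER = 20           # IPv4 header without options (assumption)
TCP_HEADER = 20          # TCP header without options (assumption)

# Largest amount of HTTP data that fits in one frame under these assumptions.
MAX_SEGMENT = ETHERNET_PAYLOAD - IP_HEADER - TCP_HEADER

def segments_needed(payload_bytes, max_segment=MAX_SEGMENT):
    """Number of TCP segments required to carry payload_bytes of HTTP data."""
    return math.ceil(payload_bytes / max_segment)

print(MAX_SEGMENT)              # 1460
print(segments_needed(1000))    # 1
print(segments_needed(10_000))  # 7
```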

Different HTTP status codes represent different types of HTTP responses, mainly including:

                                                  (Table 2: HTTP Response Status Codes)
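
Status codes are grouped into classes by their first digit, which a few lines of Python can show. This sketch uses the standard `http.HTTPStatus` enumeration for the reason phrases. The dictionary `STATUS_CLASSES`, its wording, and the function `describe_status` are our own and are not a reproduction of Table 2.

```python
from http import HTTPStatus

# Status classes keyed by the first digit of the code (labels are our own).
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client error",
    5: "Server error",
}

def describe_status(code):
    """Return a short description such as '404 Not Found: Client error'."""
    category = STATUS_CLASSES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "(unregistered code)"
    return f"{code} {phrase}: {category}"

print(describe_status(200))
print(describe_status(404))
print(describe_status(500))
```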

3) HTTP Access Flow

By tracking and analyzing the messages during the access to www.colasoft.com.cn, we can summarize the HTTP workflow diagram as shown in Figure 7.

Note: HTTP access can be done using a domain name or directly using an IP address. When accessing via IP, packets 1 and 2 representing the DNS data in Figure 5 will not be generated. Therefore, the HTTP flowchart does not include the DNS part and starts directly from the TCP three-way handshake.

                                                       (Figure 7: HTTP Access Workflow Diagram)
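
Whether the DNS packets appear depends on whether the host part of the address is a name or an IP address. The check below is a minimal sketch using Python's standard `ipaddress` module, and `needs_dns_lookup` is a hypothetical helper name. It only classifies the host string and does not perform any network access.

```python
import ipaddress

def needs_dns_lookup(host):
    """Return True if host is a name that must be resolved before connecting."""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return True   # not an IP literal, so a DNS query comes first
    return False      # already an IP address, go straight to the TCP handshake

print(needs_dns_lookup("www.colasoft.com.cn"))  # True
print(needs_dns_lookup("64.246.27.237"))        # False
```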

3. Summary

The above is a brief introduction to the HTTP protocol, together with a walkthrough that uses the Colasoft Network Analysis System to trace and analyze the process of accessing a webpage. When users encounter webpage access faults, they can combine this HTTP knowledge with network detection and analysis software (here, the Colasoft Network Analysis System) to trace and analyze the HTTP access packets and troubleshoot such faults quickly.