This article chronicles the journey of submitting a Bug to Apache’s top-level project, ShardingSphere, a database and table partitioning middleware, emphasizing the Bug tracking process.
Honestly, it was a rather tortuous Bug tracking journey. On October 28th, we submitted an issue on GitHub, which was closed twice due to the subjective assumptions of the official developers, and it wasn’t until November 20th that it was recognized as a Bug and a fix was released, taking over 20 days.
This article will reconstruct the analysis process of the Bug and extract valuable experiences and technical points. Through this article, you will gain:
1. Troubleshooting complex issues2. The principles of the database middleware Sharding Proxy3. The process and interaction protocol of MySQL pre-compilation4. Wireshark packet analysis cunning techniques for MySQL
01 Problem Description
This Bug came from a reader of my public account, who was conducting a pre-research on ShardingProxy (a sub-product of ShardingSphere that can be used for database and table partitioning, which will be detailed later) for the company. He wrote a very simple demo according to the official documentation, but it couldn’t fetch the data after being run.
The following is the information he sent me after encountering the issue, hoping I could help locate the cause with him.
>
The document in the screenshot details ShardingProxy’s configuration, debug analysis logs, and the specific phenomena of the problem.
To make it easier for everyone to understand, I will re-describe the business logic of this Demo and the issue manifestation.
1. Explanation of Demo’s Business Logic
This Demo is very simple, mainly designed to validate the database and table partitioning functionality of ShardingProxy. The program uses SpringBoot + MyBatis to implement a single-table query logic, with a long-type field in the table being used as a partitioning key, with partitioning done through ShardingProxy.
Below is a detailed definition of the table, with a total of 16 fields. Please focus on the first two fields as others are irrelevant to the Bug mentioned here.
>
The roles of the first two fields are as follows:
- BIZ_DT: Business field, date type, related to the Bug
- ECIF_CUST_NO: bigint type, used as a partitioning key
The code, in summary, involves the Controller calling the Service, the Service calling the Dao, and the Dao implemented through MyBatis, so it’s not pasted here.
Since ShardingProxy middleware is used, the configuration differs from directly connecting to a database. When defining the dataSource, the URL needs to be configured like this:
jdbc:mysql://127.0.0.1:3307/sharding_db?useServerPrepStmts=true&cachePrepStmts=true&serverTimezone=UTC
You can see that the JDBC connects to the logical data source sharding_db of ShardingProxy, and the port used is 3307, not the actual underlying database nor the real port 3306 of the MySQL Server, which will be introduced in the following text. The two blue-highlighted parameters, useServerPrepStmts, and cachePrepStmts, are related to the Bug discussed in this article, and will be specifically analyzed later.
Additionally, ShardingProxy’s table partitioning strategy is: using the long-type ecif_cust_no field to take modulo 2, splitting it into two tables. Specific configuration is as follows:
shardingColumn: ecif_cust_noalgorithmExpression: pscst_prdt_cvr${ecif_cust_no % 2}
2. Problem Description
Reiterating the problem encountered. Initially, insert a record with ECIF_CUST_NO equal to 10000 into the data table in advance:
Then launch the demo program, use curl to send a post request, querying for the record where ecifCustNo equals 10000, surprisingly returns no data:
At this point, the background is basically clear. Why is there data in the database, yet the program cannot query it? Is the problem with ShardingProxy, or the application itself?
02 Introduction to ShardingProxy Principles
Before starting the process of analyzing this issue, I will quickly popularize the basic principles of ShardingProxy, so that everyone can better understand my analysis thought process.
You must have encountered open-source database middleware, the most popular being MyCat and ShardingSphere. MyCat is open-sourced by Alibaba; ShardingSphere was open-sourced by Dangdang.com and gradually grew strong at JD.com, becoming an Apache top-level project in 2020.
The goal of ShardingSphere is an ecosystem, composed of three independent products: ShardingJDBC, ShardingProxy, and ShardingSidecar. This article focuses on ShardingProxy, and the other two will not be expanded upon.
1. What is ShardingProxy?
ShardingProxy is a product benchmarking MyCat, positioned as a transparent database proxy, which can be understood as a Server implementing the MySQL protocol (an independent process), applicable to scenarios like read-write separation, database and table partitioning, and flexible transactions.
For applications or DBAs, ShardingProxy can be considered a database proxy, capable of direct interaction with tools like MySQL client (Navicat) or command line, while internally it communicates with the real MySQL server using the native MySQL protocol.
Figure 1: ShardingProxy’s Application Architecture Diagram
From the architecture diagram, ShardingProxy is equivalent to MySQL itself, not storing data but shielding the storage details of the Database externally, allowing you to connect to ShardingProxy just like connecting to MySQL (except the port is different), using the familiar ORMapping framework to use it.
2. Internal Architecture of ShardingProxy
Let’s take a look at the internal architecture of ShardingProxy, as it will be part of the subsequent source code analysis.
Figure 2: Internal Architecture Diagram of ShardingProxy
The whole architecture is divided into frontend, core components, and backend:
The frontend (Frontend) is responsible for network communication with the client, using the NIO framework, completing the encoding and decoding of the MySQL protocol during communication.
The core component (Core-module), upon receiving the decoded MySQL command, begins to call Sharding-Core to parse, rewrite, route, and merge SQLs, and other core functions.
The backend (Backend) interacts with the real database, using the Hikari connection pool, also involving MySQL protocol encoding and decoding.
3. ShardingProxy’s Precompiled SQL Function
This Bug is related to ShardingProxy’s precompiled SQL, so I will introduce this function and the associated MySQL protocol individually. This is crucial to this article, please read patiently.
Students familiar with database development definitely understand: precompiled SQL (PreparedStatement), from the moment a database receives an SQL to its completion, generally divides into 3 steps:
1. Lexical and semantic parsing2. SQL optimization, formulating execution plan3. Executing and returning results
Many times, an SQL statement might be executed repeatedly, just with different parameter values at execution time. The precompiled feature replaces these values with placeholders, ultimately achieving once compilation, multiple runs, eliminating the parsing and optimization process, greatly improving SQL execution efficiency.
Suppose we were to execute the following SQL twice:
SELECT * FROM t_user WHERE user_id = 10;
Then the protocol messages between JDBC and MySQL are as follows:
Through the above process, you can see: The 1st message is a PreparedStatement, with the parameter values in the query replaced by a question mark, telling MySQL to precompile this SQL; the 2nd message is MySQL notifying JDBC of successful preparation; the 3rd message is JDBC setting the parameter to 1; the 4th message is MySQL returning the query results; the 5th and 6th messages represent the second execution of the same SQL, which no longer needs to be precompiled.
Back to ShardingProxy, if precompiled functionality needs to be supported, the interaction process must change, because when Proxy receives the JDBC’s PreparedStatement command, the sharding key in the SQL is a question mark, and it cannot determine which real database to route to currently.
Hence, the process changes as follows:
As you can see, upon receiving a PreparedStatement command, Proxy does not forward this message to MySQL but caches this SQL; when receiving the ExecuteStatement command, it determines the real database based on the sharding key and passed parameter values, and interacts with MySQL.
03 Problem Analysis
The previous chapter basically introduced the background knowledge related to this Bug clearly, now officially entering the problem analysis process.
Initially receiving this issue, I was honestly quite stumped, especially upon seeing the following information from the reader.
Of course, my skills are not adept enough for blind guesses, I will explain my complete thought process.
Step 1 Reproduce the Problem
I asked the reader to send me the source code of the Demo, database script, and ShardingProxy configuration package, then installed ShardingProxy 4.1.1 locally, and connected to ShardingProxy using Navicat to execute the database script, and the environment was basically ready.
Upon launching the Demo program and sending requests through Postman, the issue stably reproduced, indeed yielding no data.
Step 2 Confirm Whether the Application Has a Bug
Because the entire code is very simple, the only possible aspect with problems at the code level would be the MyBatis layer. To confirm this, I modified SpringBoot’s configuration to also log MyBatis’s debug logs. Sending the request again, the following detailed logs could be seen on the console:
No anomalies were found in the logs, and the parameters of PreparedStatement and ExecuteStatement were set correctly with actually empty query results.
To narrow the scope of the investigation, I changed the dataSource configuration back to direct connection to the real database, thus excluding ShardingProxy as an interference factor. The revised URL is as follows:
jdbc:mysql://127.0.0.1:3306/db1?useServerPrepStmts=true&cachePrepStmts=true&serverTimezone=UTC
Wherein, db1 is the real database, and 3306 is the port of the MySQL server. Sending the request again using Postman, it can be seen: correct data returned.
Through this step, I shifted the suspicion back onto ShardingProxy and reverted the dataSource configuration to its original state, continuing the investigation.
Step 3 Investigate ShardingProxy
First, check ShardingProxy’s runtime logs, no anomalies detected; secondly, the Actual SQL is correct in the logs, having been correctly routed onto pcsct_prdt_cvr0 table according to the partition key:
[INFO ] 17:25:48.804 [ShardingSphere-Command-15] ShardingSphere-SQL - Actual SQL: ds_0 ::: SELECTBIZ_DT,ECIF_CUST_NO,DEP_FLG ...FROM pscst_prdt_cvr0WHERE ECIF_CUST_NO = ? ::: [10000]
Therefore, it can be inferred that there should be no problem with ShardingProxy’s database and table partitioning configuration.
I began to wonder: could it be related to the database driver used by ShardingProxy? Since this Jar package is a version chosen by the application party, manually placed in ShardingProxy’s installation directory. Consequently, I changed the driver version from 5.1.47 to 8.0.13 (the same version used by the Demo), but the issue persisted.
Additionally, I wondered whether ShardingProxy’s latest version introduced a bug? Consequently, I installed its previous version 4.1.0 and retested it, the problem still existed.
At this point, it honestly seemed there were no other suspicious points, having checked all conceivable points. I revisited the Demo program itself; its only conjunction with ShardingProxy was in the DataSource’s URL.
jdbc:mysql://127.0.0.1:3307/sharding_db?useServerPrepStmts=true&cachePrepStmts=true&serverTimezone=UTC
The database name and port number configurations are correct; the only suspicious points are the other three parameters: useServerPrepStmts, cachePrepStmts, serverTimezone. Wherein, the first two are combined concerning precompiled SQL.
Therefore, I removed these two parameters from the URL and tested it. At this moment, miraculously, correct data returned. By this juncture, the issue was basically pinpointed, but what was the root cause? Was it indeed a Bug in ShardingProxy?
Step 4 Wireshark Packet Capture Analysis of MySQL Protocol
Having found the solution to this issue, I informed the reader concurrently. Concurrently, he also submitted an issue on ShardingProxy’s GitHub, reporting the latest findings.
Due to work reasons, this issue was temporarily set aside, with plans to resume investigation later.
About a week later I remembered this issue, upon opening the issue to learn about the investigation progress, I was profoundly surprised: the official developer actually, subjectively presuming it was an application issue after reproducing the issue shut it unexpectedly, their response was this:
The implication being: we conducted a lot of tests on the precompiled SQL feature, it is impossible to have an issue, suggesting changing the application’s database connection pool and grabbing packets for further analysis. (Such a claim is truly speechless, entirely deviating from the rigorous mindset of programmers!)
The following day, I began capturing packets using Wireshark to analyze the MySQL protocol, trying to ascertain the root cause. Also contacted the official party, to reopen the issue.
How to capture MySQL protocol data packets with Wireshark will not be expanded here, you can check online materials for details. Be sure to set Wireshark’s filtering conditions to:
mysql || tcp.port==3307
Where: mysql denotes data packets between ShardingProxy and MySQL Server, tcp.port==3307 represents data packets between the Demo program and ShardingProxy.
Upon launching the Wireshark packet capture and sending requests again through Postman to trigger the entire process, the following snapshot data packets could be successfully captured.
Focus on the 8 data packets with dark blue background. In the principle section of Chapter 2, ShardingProxy’s precompiled feature and the MySQL protocol messages for this process were detailed, the 8 data packets here perfectly align with the principle introduction.
How to further analyze from here? Thinking along the lines of ShardingProxy’s architectural diagram: Proxy merely acts as an intermediate proxy, seated between the application and the MySQL Server, entirely implementing the MySQL protocol, allowing for encoding and decoding MySQL commands, then adding its database and table partitioning logic.
If there is a Bug within ShardingProxy, certainly an issue occurred with a specific data packet. Following this thought, it swiftly reveals: after executing ExecuteStatement, MySQL Server returned the correct data packet to Proxy, however Proxy did not return the correct data packet to the application.
The following snapshot is the penultimate Response data packet, returned by MySQL Server to Proxy, displaying the record data in the Payload:
The following snapshot is the last Response data packet, returned by Proxy to the application, displaying only the table field definitions in the Payload while the record is nowhere to be seen.
Through this step of analysis, it essentially confirmed: ShardingProxy does indeed have a Bug. Subsequently, I conveyed these findings to the official developer, who began to pay attention, officially entering the source code analysis phase.
04 Root Cause Identification
That evening, the official developer identified the root cause and issued a Pull Request. I observed the code modification was merely a change of one line.
The changed line of code is precisely when ShardingProxy reassembles a data packet returning it to the application, where the exception arose.
Due to our data table containing a date-type field, the modified line of code forcibly converted the date-type to Timestamp, hence the exception arose. There remain a few doubts, which I will answer one by one based on my understanding of the source code.
1. Why was the exception thrown in the code, yet not logged in ShardingProxy’s console?
The screenshot above is: the entire call chain of the method throwing ClassCastException. As ShardingProxy did not capture this RuntimeException nor log it, ultimately this exception was swallowed by netty.
2. Why does ShardingProxy need to perform type conversion from date to Timestamp?
This can be understood from ShardingProxy’s architecture, as Proxy needs to encode and decode the MySQL protocol to insert its database and table partitioning logic midway.
Concerning date-type fields, ShardingProxy, through JDBC API, obtains a Date-type from the query result. The reason for converting to Timestamp is again related to the MySQL protocol, explained in the following description from MySQL’s official documentation:
In simple terms, while implementing ShardingProxy, a maximum range timestamp was used to store potentially three values, namely date, datetime, and timestamp, then put into binary according to the protocol specification mentioned.
3. Is this Bug only triggered when using the SQL pre-compilation feature?
Yes, processing ExecuteStatement commands is the only time this method is invoked. Why isn’t it invoked in regular SQL query scenarios?
This is again linked to the MySQL protocol; in regular SQL query scenarios, the payload is not binary protocol, but rather regular text protocol. In such cases, there is no need for this class’s conversion.
This marks the end of the entire analysis process.
05 In Conclusion
This article thoroughly recapped the analysis process of this Bug, summarizing the underlying knowledge and troubleshooting experience.
For an open-source project like ShardingSphere becoming a top-level one, I personally believe that a deep recap is also worthwhile. I do not agree with their handling of the issue, additionally, there must be inadequacies in their automated testing for core functionalities; otherwise, it would be impossible not to have discovered this serious Bug over consecutive versions.
06 Previous Recommendations
Previous recommendations
The interviewer asked me: What is request merging under high concurrency?
You definitely do not know about this Map, after all its presence is indeed very low.
To me, multi-threaded transactions definitely must be a pseudo-proposition!
Brother why quietly tells you about some HashCode issues.
Blood book from one person, requesting brother why to talk about this interview question.