
Solving Slow Database Performance In Java and MySQL


Tracelytics is currently in beta with new Java instrumentation to help businesses better understand their application's performance and steer future optimization efforts. We've been working with a handful of customers who are trying to solve complex issues with application speed on a Java stack, and learning a lot about the types of problems they frequently run into. Troubleshooting these setups can be tricky. The web application in today's example runs on an nginx / Tomcat / MySQL stack. That's a pretty common set of technologies, so let's dive into how you can target and understand application slowness on specific pages.

The specific application we're looking at is a message board (forum) that processes about 3,600 requests per hour. Viewing the heatmap, we notice a pattern: there are basically two clusters of requests, those that take under 100 ms and a slower set that takes over 600 ms, forming a line near the top:

We select the slow, high latency requests in the heat map and check out the top URLs:

Ah ha! This is a little more revealing as to the source of our problem: the majority of these requests are related to "profile" pages, so investigating those seems like an appropriate place to start. Let's break down an individual trace for the /user/profile/3.page request, where we notice that the majority of the time is spent in a single SQL query:

Viewing that extent, Tracelytics provides us with the exact SQL query being executed, along with the offending line of code in the application, in the form of a Java stack trace. The SQL query is:

SELECT COUNT(pm.privmsgs_to_userid) AS private_messages, u.*
    FROM jforum_users u
    LEFT JOIN jforum_privmsgs pm ON pm.privmsgs_type = 1
         AND pm.privmsgs_to_userid = u.user_id
    WHERE u.user_id = 3
    GROUP BY pm.privmsgs_to_userid

We connect to the MySQL database and run an “explain” on the query to provide us with the query plan:

mysql> explain SELECT COUNT(pm.privmsgs_to_userid) AS private_messages, u.*
    FROM jforum_users u
    LEFT JOIN jforum_privmsgs pm ON pm.privmsgs_type = 1
        AND pm.privmsgs_to_userid = u.user_id
    WHERE u.user_id = 3
    GROUP BY pm.privmsgs_to_userid\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: u
type: const
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: const
rows: 1
Extra: Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: pm
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 2100274
Extra:
2 rows in set (0.00 sec)

In the above "explain" output, we can immediately see the source of the problem: the "type: ALL" under "table: pm" means that no index is being used when accessing the private messages table (jforum_privmsgs), and MySQL is doing a full table scan.

For any table with a very small number of rows, this probably wouldn't matter. Unfortunately, this particular table has over 2.1 million rows, and MySQL indicates it is going to examine them all ("rows: 2100274"). This is likely the source of our performance problem. To verify, we look at the table's DDL with "show create table" and discover that the table actually has no indexes at all. This is exactly the type of issue that only becomes a problem at scale: when the message board was young and had only a few users, this page was fast and seemed innocuous. Now that the table holds a lot of data, even if the overall level of traffic is the same as before, every request has to scan through that huge table and takes much longer. Let's say that one more time: not every application scaling problem is the result of increased users or load. Your storage, and the way you access that storage, can break down too!
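
The original post doesn't show the "show create table" output itself, but the check looks roughly like the sketch below; the giveaway is that the resulting DDL lists no KEY or INDEX definitions covering the join columns:

mysql> SHOW CREATE TABLE jforum_privmsgs\G
-- (illustrative) if the "Create Table" DDL contains no KEY/INDEX entries for
-- privmsgs_type or privmsgs_to_userid, lookups on those columns can only be
-- satisfied by a full table scan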

Let's solve this issue quickly: by adding an index on the two join conditions (privmsgs_type and privmsgs_to_userid), the query should become much more efficient. A composite index on those two columns lets MySQL jump straight to the rows where privmsgs_type = 1 and privmsgs_to_userid matches the user, instead of scanning the entire table.

We run:

alter table jforum_privmsgs add key privmsgs(privmsgs_type,privmsgs_to_userid);

By running this command, we've added an index to the table that the query can use. To verify that it is actually being used, we run the same "explain" again:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: u
type: const
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: const
rows: 1
Extra: Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: pm
type: ref
possible_keys: privmsgs
key: privmsgs
key_len: 5
ref: const,const
rows: 8
Extra: Using index
2 rows in set (0.00 sec)

The "key: privmsgs" output indicates that our new index is being used. As a result, the number of rows examined in the jforum_privmsgs table has dropped from over 2.1 million to just 8. We're back to the kind of small row counts that MySQL can work through very quickly.
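
As an extra sanity check (not shown in the original walkthrough), the new index can also be listed directly; the composite key should appear with privmsgs_type as its first column and privmsgs_to_userid as its second:

mysql> SHOW INDEX FROM jforum_privmsgs;
-- expect two rows with Key_name "privmsgs":
--   Seq_in_index 1 -> Column_name privmsgs_type
--   Seq_in_index 2 -> Column_name privmsgs_to_userid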

About a half hour later, we take another look at the heatmap. There is a very noticeable change around 2:15, right after we added the index:

We filter our traces down to the /user/profile/3.page URL to see the effect our change has had on that particular page:

The profile requests are now averaging 11 ms. Originally, they were around 640 ms, which makes this about a 58x improvement. A very small change had a very large impact on the performance of our application! Not every problem will be this straightforward to solve, but without complete visibility into where time was spent in each part of the request, this would have been a much more difficult and complicated problem to diagnose.

You can get the same performance data for your applications by signing up for a Tracelytics trial: Start tracing today!

Team AppNeta