Tweet |
Modwest’s plan to deploy a new Grid database server has been lurking in the shadows for months. Though db2, our most recent addition, still had plenty of headroom, we wanted to offer the newest Community MySQL, 5.5, for those needing cutting edge MySQL features . Lack of problems have a way of de-prioritizing a project, so we delayed the new db3 database server in favor of stability and capacity improvements to our mail and storage clusters.
In May, disk IO usage on db2 leapt wildly upwards. Caught off-guard, we urgently revived the project to create db3. Several unexpected delays assailed the db3 project, and in early July, we still had weeks to go before db3 was ready. Db2 was starting to noticeably affect service. We stepped back to reassess the situation; we needed to help customers now and buy time for a long-term solution. And, something just did not feel quite right about the spike in usage.
We checked the normal indicators of MySQL performance:
- Slow Queries –They were numerous, but there were no smoking guns that explained the burgeoning disk wait.
- Query Cache – 1GB in size it was performing well at over 80% efficiency.
- Sort Merge Passes – This statistic was very low; it was only in the hundreds for a service that had been up for hundreds of days. So, we did not change sort_buffer_size. The benefit of changing sort_buffer_size is debatable, anyway. Additionally, sort_buffer_size applies to each session, and in a shared hosting environment, what works today could flatline the server tomorrow, as the number of connections grows.
Then we noticed the camouflaged obvious. Db2 is a machine with 12 SAS drives. 2 are in RAID-1, dedicated to the OS, and the remaining 10, in a RAID-10, spend their existence spinning at MySQL’s behest. Guess where /tmp was mounted? That is correct, on the RAID-1! A RAID-1, where writes cost double. All queries ‘using temporary’ were completely dependent on the write IO capacity of a single hard drive. This was a terrible mistake, and the source of the machine’s lethargy. Our intention when rolling out db2 was to put /tmp on the RAID-10, where there was 5 times the write IO capacity of the RAID-1.
Adjusting the TMPDIR environment variable to go into the RAID-10 was the obvious solution, but our System Administrators like to find the best solution. They explored some options, and we settled on at least trying tmpfs for MySQL’s temporary tables. Tmpfs uses virtual memory, which, as long as the machine is not swapping, is going to be much faster than disks. Tmpfs was attractive because of its speed, but also because we could set an explicit limit, and all the memory could go where it was needed, unlike sort_buffer_size. If tmpfs did not work well, we would use the RAID-10.
The results of tmpfs dwarfed expectations. The benefit to db2 was astounding. Almost all write IO on the RAID-1 ceased:
There are 10 times less slow queries:
If your website ran any queries that required filesorts larger than sort_buffer_size, we are happy to say you are going to notice a massive improvement. Our plan is to still roll out db3 with MySQL 5.5, but are relieved that no one is suffering database slowness while we work on it.
Well done. Good news for everyone!
Posted by: Tony Ferrini | August 01, 2012 at 02:44 PM