Monday, November 16, 2009

Fixing Poor MySQL Default Configuration Values

Fixing Poor MySQL Default Configuration Values

I've recently been accumulating some MySQL configuration variables that have defaults which have proven to be problematic in a high-volume production environment. The thing they all have in common is a network blip or two can trigger some very undesirable behavior.

max_connect_errors

If a client is having trouble connecting to MySQL, the server will give up waiting after connect_timeout seconds and increment the counter which tracks the number of connect errors it has seen for the host. Then, when that value reaches max_connect_errors, the client will be locked out until you issue a FLUSH HOSTS command. Worse yet, if you have occasionally network blips and never need to restart your MySQL boxes, these errors can accumulate over time and eventually cause you middle of the night pain.

See Host 'host name' is blocked in the MySQL docs. Sadly, there is no way to disable this check entirely. Setting the variable to 0 doesn't accomplish that. Your only real solutions are (a) setting it to a very high value (max_connect_errors=1844674407370954751), and (b) running an occasional FLUSH HOSTS command.

connect_timeout

This is related to the above problem. In situations of network congestion (either at the client or server), it's possible for an initial connection to take several seconds to complete. But the default value for connect_timeout is 5 seconds. When you trip over that, the max_connect_errors problem above kicks in.

To avert this, try setting connect_timeout to a value more like 15 or 20. And also consider making thread_cache_size a non-zero value. That will help in situations when the server occasionally gets a high number of new connections in a very short period of time.

skip-name-resolve

MySQL does a reverse DNS lookup on every incoming connection by default. This sucks. It seems that no matter how good your infrastructure is, there are blips in DNS service. MySQL's host cache exists to keep those lookups to a minimum. Yet I've seen this cause pain off and on for eight years now. I can only assume there's a bug in the host cache or the resolver library when this happens.

I recommend adding skip-name-resolve to your /etc/my.cnf to skip DNS entirely. Just use IP addresses or ranges for your GRANTs. It seems that slow replies from DNS servers can also help you to trip over connect_timeout as well. Imagine having 2 or 3 DNS servers configured but the first one is unavailable.

slave_net_timeout

When the network connection between a master and slave database is interrupted in a way that neither side can detect (like a firewall or routing change), you must wait until slave_net_timeout seconds have passed before the salve realizes that something is wrong. It'll then try to reconnect to the master and pick up where it left off. That's awesome.

However, the default value is 3600 seconds. That's a full hour! FAIL.

Who wants their slaves to sit idle for that long before checking to see if something might be wrong? I can't think of anyone who wants that.

My suggestion, if you're in a busy environment, is that you set that to something closer to 30 seconds.

Posted by jzawodn at November 09, 2009 07:49 AM

Reader Comments
# James Day said:

How do you want us to change these without making things worse for less capable users?

For max_connect_errors, how should we change the server so that it retains the intrusion detection feature but is less likely to be bothersome? Perhaps a regular decrease of all counts by n each hour? Each minute? 1844674407370954751 is big enough to disable the count for those who want that. It's five billion failures each second for a decade.

Anyone with a serious availability requirements who doesn't have an unavoidable need for hostname authentication should use skip-name-resolve to get DNS out of the possible failure cause list. It's also faster.

slave_net_timeout is too high for your use situations but those who want a higher setting are those who have infrequent updates and don't want their slaves disconnecting, logging and rotating relay log file just because no updates have happened for an hour. Think of cases like low volume hosting with a customer remote slave for backup or a business user where everyone goes home at nights and weekends and stops updating the accounting system. Those are more common than your situation, I think, and we should be defaulting to being nice to those who are less likely to have capable pros around to tweak. Your suggestion of 30 seconds is likely to be suitable for a high use situation where idleness for 30 seconds isn't likely. Adding more retry attempts would also be sensible so that a connectivity outage to a site doesn't cause that limit to expire on the database servers just because clients can't make requests.

You make an interesting point about high volume production setups. We know that we'd suggest changes for such situations and sometimes we note this with the settings but I don't think that we've gathered that information in one place as a set of suggestions for high volume users.
on November 9, 2009 10:07 AM
# Jeremy Zawodny said:

I'm not suggesting that you change them.

I'm suggesting that in high-volume environments they need to be changed.

That's why I started the post with this:

"I've recently been accumulating some MySQL configuration variables that have defaults which have proven to be problematic in a high-volume production environment."

Jeremy
on November 9, 2009 10:15 AM
# James Day said:

I was wondering if or how we might be able to change them so they would be closer to meeting both sets of need. And/or whether we might usefully document some of this ourselves.
on November 9, 2009 11:28 AM
# Tyson Lowery said:

I believe max_connect_errors get reset to 0 after each successful connection, at least as of MySQL 5.1 or so...
on November 9, 2009 12:17 PM
# Mark R said:

max_connect_errors is pretty irrelevant because in production people *never* get passwords wrong (people hardly connect at all; processes connect all the time and they always get it right). Hosts from the internet cannot access production servers, and those that do don't have random people trying to connect from them.

skip-name-resolve however is absolutely vital for any environment at all - those DNS lookups are expensive and pointless (and make GRANTs dependent on potentially flakey DNS).

slave timeout also - 1 hr is too long, we've set this lower too.
on November 9, 2009 02:03 PM
# Kevin Burton said:

Yeah. The MySQL slave_net_timeout thing is waaaay FRAKING fail..

Seriously. that's just evil.

That bit me like 3-4 years ago... I seriously had to wait 3600 minutes for a slave to read timeout before SLAVE STOP would finish.

WTF.
on November 9, 2009 09:43 PM
# mens dress shirts said:

I think Jeremy is right, when in high-volume environments they need to be changed, and for other situation, there is no need
on November 10, 2009 01:20 AM
# John Arundel said:

Thanks for this excellent post! I've been bitten by one or two of those in the past, especially 'skip-name-resolve', and I'll take note of the others and put them into production!

Much appreciated.
on November 10, 2009 01:59 AM
# Steve Souders said:

For years I've wished there was a YSlow-like script that would automatically check for tips like this. Possible?
on November 10, 2009 09:11 AM
# Jeremy Zawodny said:

Very possible, yes.
on November 10, 2009 09:17 AM
# Sheeri K. Cabral (Pythian) said:

As for the max_connect_errors, Tyson is right -- it's max errors *in a row*. The default is 100, which is enough to cover someone flubbing their password several times, but also enough to catch if an automated tool (like a monitoring tool) has a wrong password.
on November 10, 2009 10:37 AM
# Sebastian Bonhag said:

Type your comment here.

After you submit the comment, check your email. There will be
a link you need to click to make your comment visible.

Your email address WILL NOT appear on the site, so don't worry
about being anonymous, even if you think you are.
on November 11, 2009 12:40 AM
# Sebastian Bonhag said:

Thank you for your hints.

As for the automatically checking script see https://launchpad.net/mysql-tuning-primer

It helped me quite a bit with the tunig, but leaves out your suggestions. OK, most are not relevant in my environment.

(Sorry, if posted twice. I'm not sure if I fumbled the first time.)
on November 11, 2009 12:42 AM
# Robert said:

Excellent post! I personally spent so much time to configure my sql server. If I had found this post earlier I would have saved a lot.
on November 13, 2009 12:15 AM

No comments: