March 24th, 2010 posted by Bender Rodríguez
Performance tuning for Apache and PostgreSQL using robots.txt, mod_rewrite, memcached, and possibly StaticGenerator for Django.
Our site has been hammered frequently of late by various bots, malignant and benign, which have brought one of our servers to its knees on a few occasions. To address this, we set up a mod_rewrite rule, as documented here, to block most of the well-known nasty bots. That has worked out quite well, but it does not address the good bots that come along. To that end, we set up a few User-agent rules in our robots.txt file, using the Crawl-delay directive, to throttle the bots we do want crawling the site.
User-agent: gigabot
Crawl-delay: 120

User-agent: googlebot
Crawl-delay: 120

User-agent: Baiduspider
Crawl-delay: 120

User-agent: msnbot
Crawl-delay: 120

User-agent: teoma
Crawl-delay: 120

User-agent: slurp
Crawl-delay: 120

User-agent: opera
Crawl-delay: 120
That last entry might seem out of place, with the Opera browser in there amongst the bots, but Opera has an offline browsing function that crawls the sites you are visiting. It prefetches the content so that your browsing speed increases as you surf the site, displaying that "cached" content if and when you request it.
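The rules above can be sanity-checked with Python's standard-library robots.txt parser. This is only a sketch: the user-agent strings passed to crawl_delay() are made-up examples, and a value parsed here says nothing about whether a given crawler actually honors Crawl-delay, which is a non-standard extension.

```python
from urllib.robotparser import RobotFileParser

# The same records as in our robots.txt, one directive per line.
ROBOTS_TXT = """\
User-agent: gigabot
Crawl-delay: 120

User-agent: googlebot
Crawl-delay: 120

User-agent: Baiduspider
Crawl-delay: 120

User-agent: msnbot
Crawl-delay: 120

User-agent: teoma
Crawl-delay: 120

User-agent: slurp
Crawl-delay: 120

User-agent: opera
Crawl-delay: 120
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Example user-agent strings (assumptions, for illustration only).
for agent in ("Googlebot/2.1", "Baiduspider", "Opera/9.80", "SomeOtherBot"):
    # crawl_delay() returns the delay in seconds, or None if no record applies.
    print(agent, parser.crawl_delay(agent))
```

Agents listed in the file report 120 seconds; anything else gets None, meaning no throttle is requested.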
With both the rewrites and the robots.txt rules in place, we still needed to tune Apache and PostgreSQL to deal with spikes in traffic. For Apache, using the prefork MPM, we decreased MaxClients from 150 to 50, which prevents runaway spawning of child processes when traffic increases. For PostgreSQL, we increased max_connections from 100 to 200: when our Django applications started throwing 500 server errors under heavy load, it was because PostgreSQL could not keep up with the demand and was throwing a "Too many clients connected" error.
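The arithmetic behind those two numbers can be sketched as follows. It assumes each Apache prefork child holds at most one PostgreSQL connection (one Django process, one database) and that PostgreSQL keeps 3 connections back for superusers, which is the default for superuser_reserved_connections. The function names are ours, purely for illustration.

```python
def worst_case_db_connections(max_clients, connections_per_process=1):
    """Upper bound on simultaneous DB connections opened by Apache children."""
    return max_clients * connections_per_process


def fits(max_clients, max_connections, reserved=3, connections_per_process=1):
    """True if every Apache child can get a connection at the same time."""
    # Connections left for ordinary clients after the superuser reserve.
    available = max_connections - reserved
    needed = worst_case_db_connections(max_clients, connections_per_process)
    return needed <= available


# Old settings: 150 children competing for 97 usable connections.
print(fits(max_clients=150, max_connections=100))  # False
# New settings: 50 children, 197 usable connections.
print(fits(max_clients=50, max_connections=200))   # True
```

Under these assumptions the old configuration could run out of connections at peak load, while the new one leaves headroom for other clients of the database.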
All of these measures have helped reduce the downtime caused by bot traffic, but they still might not be enough. We are considering a static file generator like StaticGenerator for Django. It would certainly work well for our calendar, which accounts for much of the load when bots come a-knocking, and our news blogs would benefit as well. Beyond those, the majority of our site consists of pages, most of which are pretty stable and could be static.
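The idea behind static generation is simple enough to sketch in a few lines: write the rendered HTML for a URL to a file under the web root, so the web server can answer later requests without touching Django or PostgreSQL. The function below is a hypothetical illustration of that idea, not StaticGenerator's actual API; the name publish_static and the index.html layout are assumptions.

```python
from pathlib import Path


def publish_static(web_root, url_path, html):
    """Write rendered HTML to <web_root>/<url_path>/index.html; return the path."""
    root = Path(web_root).resolve()
    target = (root / url_path.strip("/") / "index.html").resolve()
    # Refuse URL paths that would escape the web root (e.g. "/../etc").
    if root not in target.parents:
        raise ValueError("URL path escapes the web root: %r" % url_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(html, encoding="utf-8")
    return target


if __name__ == "__main__":
    import tempfile

    # Publish one example page into a throwaway directory.
    with tempfile.TemporaryDirectory() as web_root:
        path = publish_static(web_root, "/calendar/2010/03/", "<h1>March 2010</h1>")
        print(path.relative_to(Path(web_root).resolve()))
```

A real setup would also need to regenerate or delete the file whenever the underlying content changes, which is the part a tool like StaticGenerator is meant to handle.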
Of course, we use Memcached to cache most of the site for anonymous users, and that helps greatly, but I am thinking that the static solution, especially for calendar, news, and pages, might be the wave of our future.
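For reference, anonymous-only caching with Memcached in a Django settings file of this era might look roughly like the sketch below. The setting names are those of Django 1.1/1.2 (later releases replaced CACHE_BACKEND with CACHES and dropped CACHE_MIDDLEWARE_ANONYMOUS_ONLY); the server address, timeout, and middleware list are assumptions, not our actual configuration.

```python
# settings.py fragment (Django 1.1/1.2-style names).

# Assumed: a single memcached instance on the local machine.
CACHE_BACKEND = "memcached://127.0.0.1:11211/"

# Assumed: cache whole pages for ten minutes.
CACHE_MIDDLEWARE_SECONDS = 600

# Only serve cached pages to visitors who are not logged in.
CACHE_MIDDLEWARE_ANONYMOUS_ONLY = True

# UpdateCacheMiddleware goes first and FetchFromCacheMiddleware last;
# the fetch step must run after AuthenticationMiddleware so it can
# tell anonymous requests apart from logged-in ones.
MIDDLEWARE_CLASSES = (
    "django.middleware.cache.UpdateCacheMiddleware",
    "django.contrib.sessions.middleware.SessionMiddleware",
    "django.contrib.auth.middleware.AuthenticationMiddleware",
    "django.middleware.common.CommonMiddleware",
    "django.middleware.cache.FetchFromCacheMiddleware",
)
```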
"We lie the loudest when we lie to ourselves."
Eric Hoffer