Thursday, March 19, 2009

Boost your Drupal site!

July 23, 2007 by Justin

Boost: Static HTML caching for Drupal
I've recently become quite familiar with Arto Bendiken's Boost for Drupal. For context, Drupal is an open source, modular, PHP-based content management system (CMS) that I use with many of my clients. Boost is a module for Drupal which assists you in caching content as static HTML, bypassing Drupal (and thereby PHP and MySQL) in order to handle much more traffic and serve content much more quickly. Essentially, you let Apache do what it does best -- serve HTML pages. With a busy site or one with a lot of content, this can be a lifesaver.
Arto has a good write-up about Boost in his original blog post. However, Boost is a little more complex than most Drupal modules, so I hope to add a few things here:
the basics of what Boost gives you
how the two "halves" of Boost complement each other
how Boost gets you outside of Drupal entirely
the status of Boost with regard to Drupal 5.x
a little more detail about how it works
some caveats that I've found
Arto's documentation for the setup of Boost is great, so I won't be rehashing that. Rather, I hope to provide a little more technical info about how the module works.
The Basics
Boost has two parts: first, a Drupal module (in the traditional sense) that manages the cache and provides an administrative user interface, and second, some rule lines to add to your Drupal site's top-level .htaccess file which allow Apache to bypass Drupal entirely and serve pages from the cache.
The key to understanding Boost is its use of Apache's mod_rewrite through the .htaccess file rule lines (by the way, .htaccess is just the default name for files in your site that Apache will read for configuration info). Many people don't realize that Drupal's clean URLs depend on mod_rewrite. Every URL on a Drupal site is basically just a path argument to the top-level index.php, which dispatches calls to various points in the code to handle that argument. So, when you go to /about, the index.php file actually gets an argument of about and determines what content to serve. Apache's mod_rewrite is able to keep the browser pointed at /about while actually running index.php. Another popular open source CMS, WordPress, behaves similarly.
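To make the clean URL idea concrete, here is a minimal Python sketch of the mapping. It is an illustration only, not Drupal's or Apache's actual code. It assumes the path is handed to index.php in a query parameter named q, and the function name internal_url is hypothetical.

```python
from urllib.parse import quote


def internal_url(request_path: str) -> str:
    """Map a clean URL path to the index.php call it stands for."""
    # Strip the leading slash: /about becomes the argument "about".
    arg = request_path.lstrip("/")
    if not arg:
        # The bare front page needs no path argument.
        return "/index.php"
    # Assumption: the path travels in a query parameter named q.
    return "/index.php?q=" + quote(arg, safe="/")


print(internal_url("/about"))     # /index.php?q=about
print(internal_url("/node/137"))  # /index.php?q=node/137
```

The browser never sees the second form; the rewrite happens inside the web server.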
Once you understand this, it's easy to understand what Boost does and why it requires mod_rewrite. Boost by default stores the cached versions of pages under /cache on your website (this path is configurable, though). Then, when a request comes in, the .htaccess file is consulted (because that's what Apache does), which tells it to look for cache files first before sending anything to Drupal's index.php. Since the cache files are plain HTML, they go out much more quickly than Apache running PHP, firing up Drupal, querying MySQL, and then serving content. Arto provides some graphs in his original post showing just how dramatic this improvement can be.
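The lookup order described above can be sketched in a few lines of Python. This is a rough model of the decision the rewrite rules make, not the rules themselves. The function name resolve is hypothetical, the sketch assumes the default /cache location, and it simply sends the front page to Drupal because the text above does not say how the front page is named in the cache.

```python
from pathlib import Path


def resolve(request_path: str, docroot: Path, method: str = "GET") -> str:
    """Return the file, relative to docroot, that would handle the request."""
    name = request_path.strip("/")
    # Refuse anything that tries to climb out of the cache directory.
    if not name or ".." in name.split("/"):
        return "index.php"
    # Assumption: a cached page for /about lives at <docroot>/cache/about.html.
    cached = docroot / "cache" / (name + ".html")
    if method == "GET" and cached.is_file():
        # Cache hit: the static file is served and PHP never runs.
        return cached.relative_to(docroot).as_posix()
    # Cache miss (or a non-GET request): fall through to Drupal.
    return "index.php"
```

For example, with a file at cache/about.html under the document root, resolve("/about", docroot) returns "cache/about.html", while a path with no cache file returns "index.php".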
Lastly, a word about the cache filename standard. If in fact the /about URL were cached, it would actually be in your site at /cache/about.html. If Apache finds this file, it assumes that the cache is still valid (the Drupal module side takes care of expiring and removing stale content) and serves it directly. For path aliases (such as "/about should serve the same content as /node/137"), Boost uses UNIX symbolic links in the cache filesystem, so /cache/about.html would be a link to /cache/node/137.html.
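Here is a small Python sketch of the alias-as-symlink idea, under stated assumptions. The helper names cache_path and link_alias are hypothetical, and the choice of a relative link target is mine; I have not checked whether Boost itself writes relative or absolute links.

```python
import os
from pathlib import Path


def cache_path(cache_root: Path, url_path: str) -> Path:
    """Cache file for a URL path: /about -> <cache_root>/about.html."""
    return cache_root / (url_path.strip("/") + ".html")


def link_alias(cache_root: Path, alias: str, system_path: str) -> Path:
    """Make the alias's cache file a symlink to the system path's cache file."""
    target = cache_path(cache_root, system_path)
    link = cache_path(cache_root, alias)
    link.parent.mkdir(parents=True, exist_ok=True)
    # Replace any previous file or link left at the alias location.
    if link.is_symlink() or link.exists():
        link.unlink()
    # Assumption: a relative target, so the cache directory can be moved as a unit.
    link.symlink_to(os.path.relpath(target, link.parent))
    return link
```

Calling link_alias(cache_root, "/about", "/node/137") leaves about.html pointing at node/137.html, so both URLs are served from a single cached copy.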
Boost and Drupal 5.x
I have been using Boost on a Drupal 5.1 site, thanks to this port of Boost to Drupal 5.x by the maintainer of drupal.ru. This seems to be the only source of Drupal 5.x-compatible Boost material currently. The only caveat to be aware of about this version is that by default, the front page is not cached -- more on this below. If your site is anything like the one I used Boost on, you will need to remedy this since your front page is likely your busiest as well as most complicated page and is in need of caching.
A Little More Detail
A couple other notes about Boost's operation:
Cache files are created on demand. For example, if your front page is not cached when someone requests it, Drupal will construct the page and cache the file, but serve the constructed page to the user. Every user thereafter, until the cache file becomes stale and is removed, will receive the cached version. If you have pages that are particularly demanding, think about running a cron job to request them anonymously in order to get them cached for regular users.
Special paths like /user/login and /admin, as well as HTTP POST requests and any request for a logged-in user, are not cached. Arto has put a lot of thought into this area. Note that this means that sites with mostly logged-in users will not benefit from Boost very much -- anonymous users see the real benefit.
Boost takes over the configuration interface for Drupal's built-in caching mechanism. This avoids confusion between two types of caching and simply "upgrades" your current setup to be Boost-ified.
Like the built-in cache, Boost has multiple cache lifetime intervals to choose from, anywhere from one minute up to one day.
Boost expires content in one of two ways. It implements hook_nodeapi to catch node updates, insertions, and deletions and responds to those, and it also implements hook_cron to expire content that has become stale but has not had any specific actions performed on it.
Technical note: Boost uses PHP's output control functions (i.e. ob_start et al.) and hook_init to intercept every Drupal page request, buffer the content, compare to and update the cache, and then send the content along through Drupal normally.
Nothing stops you from expiring content manually by deleting its file from the cache. However, note that for pages which have path aliases (and thus Boost symbolic links) to them, the links do not get removed automatically so you may cause some wonkiness by doing this.
Boost inserts a small HTML comment at the very bottom of cached pages with the start and end cache times so that you can tell if it's working and how long a given file will persist in the cache.
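Two of the notes above, which requests are eligible for caching and how cron-style expiry might clean up stale files, can be sketched in Python. This is a model under assumptions, not Boost's code: the function names are hypothetical, only GET requests are treated as cacheable, the list of special paths contains just the two named above, and staleness is judged by file modification time. The second pass removes alias symlinks left dangling, which is the cleanup that deleting a cache file by hand skips.

```python
import os
import time
from pathlib import Path
from typing import List, Optional

# Assumption: only the two special paths the notes mention; a real list is longer.
UNCACHED_PREFIXES = ("admin", "user/login")


def is_cacheable(method: str, path: str, logged_in: bool) -> bool:
    """Apply the rules from the notes: no POSTs, no logged-in users, no special paths."""
    if method.upper() != "GET" or logged_in:
        return False
    name = path.strip("/")
    return not any(name == p or name.startswith(p + "/") for p in UNCACHED_PREFIXES)


def cached_pages(cache_root: Path) -> List[Path]:
    """List every .html entry under the cache, including dangling symlinks."""
    found = []
    for dirpath, _dirs, files in os.walk(cache_root):
        found.extend(Path(dirpath) / f for f in files if f.endswith(".html"))
    return found


def expire_stale(cache_root: Path, lifetime: float,
                 now: Optional[float] = None) -> List[Path]:
    """Delete pages older than `lifetime` seconds, then any alias links left dangling."""
    now = time.time() if now is None else now
    removed = []
    for page in cached_pages(cache_root):
        # Assumption: age is judged by the file's modification time.
        if not page.is_symlink() and now - page.stat().st_mtime > lifetime:
            page.unlink()
            removed.append(page)
    for page in cached_pages(cache_root):
        # A symlink whose target no longer exists reports exists() == False.
        if page.is_symlink() and not page.exists():
            page.unlink()
            removed.append(page)
    return removed
```

A cron hook would call something like expire_stale(cache_root, 3600) for a one-hour lifetime.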
Caveats
Like any somewhat intrusive technology (and by this I mean that it works with every page and changes the way your site operates as a whole), Boost should be used with caution. Arto states that the project is still in an alpha state.
The biggest issue that I've noticed is a strange bug which occasionally caches the front page as a Drupal "access denied" page. Others have seen this as well and I've never been able to nail it down completely. This is the main reason why drupal.ru's port of Boost to Drupal 5.x leaves out the front page from caching. I was able to work around this by hacking Boost's boost.api.inc file, in the boost_cache_set function, to not cache pages containing the words "access denied". I hope to report more once I figure this out.
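The shape of that workaround, sketched in Python rather than the module's PHP, looks roughly like this. It is not the actual change to boost_cache_set; the function names and the dictionary standing in for the cache are hypothetical.

```python
from typing import Dict


def should_store(html: str) -> bool:
    """Return False for pages that look like an access-denied response."""
    # Assumption: a case-insensitive substring match, a deliberately blunt test.
    return "access denied" not in html.lower()


def cache_set(cache: Dict[str, str], path: str, html: str) -> bool:
    """Store the page unless the filter rejects it; report whether it was stored."""
    if not should_store(html):
        return False
    cache[path] = html
    return True
```

The trade-off is that any legitimate page that happens to contain the phrase will never be cached either, which is why this is a workaround and not a fix.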
The second issue is that currently, Boost will not work for sites that are not at the top level of a domain. That is, if your site is domain.com/mysite, it will not work -- only domain.com would work. I believe this is on the .htaccess side, but it only really affected me in testing a development version of the site and since I was able to set up a top-level sandbox, I didn't investigate it any further. Once again, if I make any improvements in this area, I'll update this post.
Conclusion
This concludes my overview of Boost. As I mentioned above, I will update this post if I make any progress on the (very minor) issues that I've had with it. It's a great system and I highly recommend it!
You may also be interested in my Drupal page here at Code Sorcery Workshop for more info about my work with Drupal.
Thanks for reading!
