Efficient dual caching of dynamic web content

Last Tuesday, I gave a short presentation at the very first “Web Monday” in Stockholm (don’t ask why it was on a Tuesday) about a technique I use at the site I run, arkadtorget.se. The cache method I implemented there was the result of a series of discussions I had with my very good friend Peter Svensson, who is the smartest scatterbrain in the county, if not the country, or the world.

Peter, who has recently become a Javascript wizard and evangelist, suggested that I solve a problem I had with heavy load on the MySQL server at Arkadtorget by implementing a client-side cache: storing content in local Javascript variables in the web browser. I had already determined that I needed to build a server-side cache, using PHP to write the results of MySQL queries into text files that I could later read from, instead of bugging MySQL about the data again just because another user wants to read the same forum thread that someone else just did.

There are plenty of PHP caching solutions out there, most of them using a similar method of writing database results into text files. Server-side caching is a wise idea in general, for almost any web site that accesses a database on most requests. Usually, the method is used to cache complete web pages after they have been rendered by a CMS or other web system. The complete resulting HTML from a complex series of PHP parsing and database queries is stored in a text file on the server before being sent to the web browser. The next time a client requests the very same page (usually determined by the URL/query string used to access the content) the server can return the contents of the text file instead of parsing PHP and making several database queries again.

I use my cache methods mostly to cache parts of pages that I know are expensive for PHP/MySQL to render. These are often lists, tables of different kinds, and simple counters or searches that are repeated a lot, but where the data does not necessarily need to be 100% fresh. (What if it does need to be 100% pristine, fresh and healthy-smelling? More about that later.) In fact, sometimes I use the server-side cache to store many parts of a page, separately and modularized, so that the same module (say, a sidebar navigation or statistics used in many different places on a site) can be called upon any number of times without asking the database to check its rows more than once – i.e. the first time the content is rendered, and subsequently cached.
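As a rough sketch of this kind of fragment caching (not the actual Arkadtorget code, which is written in PHP), the idea can be written in JavaScript for Node.js to match the browser example further down. The names `cachedFragment`, `renderSidebar` and `CACHE_DIR` are made up for illustration, and the sketch assumes the cache folder is writable.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical cache directory; any writable folder would do.
const CACHE_DIR = fs.mkdtempSync(path.join(os.tmpdir(), 'fragment-cache-'));

// Return the cached fragment if it exists; otherwise render it once,
// store the result as a text file, and return it.
function cachedFragment(name, render) {
  const file = path.join(CACHE_DIR, name + '.cache');
  if (fs.existsSync(file)) {
    return fs.readFileSync(file, 'utf8');
  }
  const html = render(); // the expensive part, e.g. database queries
  fs.writeFileSync(file, html, 'utf8');
  return html;
}

// Usage: the render function runs only on the first call.
let renderCount = 0;
function renderSidebar() {
  renderCount += 1;
  return '<ul><li>Forum</li><li>Statistics</li></ul>';
}

const first = cachedFragment('sidebar', renderSidebar);
const second = cachedFragment('sidebar', renderSidebar);
```

Because each module is cached under its own name, the same sidebar can be reused on many pages while the database is only asked once.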

Now, since we are talking about tables and lists, I have made an observation regarding user behaviour and pagination. After logging every click a user makes on a paginated list of results, I noticed that the very pagination itself seems to encourage “flipping” back and forth between pages of results. I even started noticing myself doing it. I’d make a web search on Google, scan page one of results, click to page two, maybe to page three, and then think “what was that third link on page one?” and flip back to page one, check it, flip forward to page three again and then go on to four, and so on.

What’s going on here? In my logs, I could see that as many as 33% of the requested pages were pages I had already sent to the browser during the same session! (Note, this is at a local, Swedish-language forum full of tech/arcade nerds – YMMV.) Even if I had used one of the common cache methods in PHP to store the forum pages in a text file on the server, the client would still ask the server for updates on every page, and even with low latency and a properly working cache, there would be plenty of requests sent from the client to the web server for changed graphics, included files, etc. (Most of which are preventable if you have set your expires headers correctly! More on this in another post.)
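For anyone who wants to measure the same thing on their own site, here is a small sketch of how the share of repeated requests could be computed from a click log. The log format (objects with `session` and `page`) and the function name `repeatShare` are assumptions, and the sample data is made up.

```javascript
// Count how many page requests in a session were for a page the
// browser had already received earlier in that same session.
function repeatShare(requests) {
  const seen = new Set();
  let repeats = 0;
  for (const { session, page } of requests) {
    const key = session + ':' + page;
    if (seen.has(key)) {
      repeats += 1;
    } else {
      seen.add(key);
    }
  }
  return requests.length === 0 ? 0 : repeats / requests.length;
}

// Made-up log: one user views pages 1, 2, 3, flips back to 1,
// returns to 3, then moves on to 4.
const log = [1, 2, 3, 1, 3, 4].map(page => ({ session: 'a', page }));
const share = repeatShare(log); // 2 repeated requests out of 6
```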

Now, I was already planning to migrate the forum to an AJAX-powered fetching method, but even using AJAX to get the contents of the forum, I would have to ask the backend for the same code over and over, for up to 33% of the requests! This is where my friend Peter had the idea to circumvent the extra round trips to the web server by storing the HTML needed to print the pages in a local JS variable.

Simple, elegant, yet powerful. Every time a user requests a page he has not previously viewed, I ask the AJAX backend (built with PHP) for the HTML-code needed to print the page. The PHP code will use the server-side cache method to make sure that the database will not have to bother with presenting content for the page if it has been loaded before, and send the HTML-code – from text file cache or not – to the client. The browser, upon receiving this content for the first time, displays it for the user to read, but also quietly stores the entire page of data in a Javascript variable. If the user ends up going to page two and then wants page one again (“flipping” as it were) we can check to see if page one is already present in the local cache, and serve up the content of the Javascript variable straight, on the rocks, without even checking with the server to see what time of day it is! It’s a beautiful solution, especially since the users who are doing the “flipping” notice nothing, except that a page they already visited before is loading instantly, without any noticeable delay.

In fact, why stop there? I made some changes to make the user experience even smoother by telling my AJAX backend to let the browser know whether the next page in the result set is available as server-cached content (I don’t want to bother the database if I don’t have to – I don’t know that the user is actually going to ask for the next page) and if it is, the browser will then proceed to load the next page of results straight into the local Javascript variable. The user sees and notices nothing of this, except if he clicks the next page, in which case I can now present that content instantaneously, without asking the server “hey, hand me page two – stat!”.
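The flow described above might be sketched like this. It is only an illustration under a few assumptions: `fetchPage(page, done)` is a hypothetical function standing in for the AJAX call, and it hands back the page HTML together with a `nextCached` flag, which is the backend’s hint that the following page already sits in the server cache. `fakeBackend` exists only to make the sketch runnable.

```javascript
function createPageCache(fetchPage) {
  const pageCache = new Map(); // page number -> HTML string

  // Calls show(html) as soon as the page is available.
  function load(page, show) {
    if (pageCache.has(page)) {
      show(pageCache.get(page)); // served locally, no request sent
      return;
    }
    fetchPage(page, function (html, nextCached) {
      pageCache.set(page, html);
      show(html);
      // Prefetch the next page only if the server already has it
      // cached, so the database is never bothered on a guess.
      if (nextCached && !pageCache.has(page + 1)) {
        fetchPage(page + 1, function (nextHtml) {
          pageCache.set(page + 1, nextHtml);
        });
      }
    });
  }

  return { load: load };
}

// Stand-in for the real backend: counts requests and pretends that
// pages 2 and 3 are already present in the server-side cache.
let requests = 0;
function fakeBackend(page, done) {
  requests += 1;
  done('<p>page ' + page + '</p>', page < 3);
}

const shown = [];
const cache = createPageCache(fakeBackend);
[1, 2, 1, 3].forEach(function (page) {
  cache.load(page, function (html) { shown.push(html); });
});
```

In this run, four page views cost three requests: page 1 is fetched and triggers a prefetch of page 2, the visit to page 2 and the flip back to page 1 are both served locally, and only page 3 needs a new round trip.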

OK, so what was actually gained here? First of all, the user experience is greatly improved. The dual caching method makes this forum seem fast as lightning, most of the time. Secondly, the database can take a deep breath of relief, since I never have to ask it for content it’s already served up once.

Which brings us to the server-side cache, and its expiration methods. A common practice is to define a timeout, an interval of time during which the content will be considered fresh. When the predefined time is out, the content will be generated by asking the database the very next time it is requested by the user. This works great for content that doesn’t need to be 100% fresh, all the time. With a relatively conservative timeout, say 15 minutes, any content on the site is never more than 15 minutes old, even if it’s being read from the server cached text files. But what if 15 minutes is too long? In my case, I have a fairly active forum that needs caching, and 15 minutes is a very long time to wait between posting a message and actually showing it on the site.
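A timeout-based expiry of this kind could look roughly like the sketch below, which uses the modification time of the cache file as its age. This is an assumption about how one might do it, written in JavaScript for Node.js rather than PHP; `isFresh`, `readThrough` and the temporary file are illustrative names only.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

const MAX_AGE_MS = 15 * 60 * 1000; // the 15-minute timeout from the text

// A cache file counts as fresh while its age is below the timeout.
function isFresh(modifiedMs, nowMs, maxAgeMs) {
  return nowMs - modifiedMs < maxAgeMs;
}

// Serve from the file while it is fresh; otherwise regenerate it.
function readThrough(file, render, nowMs = Date.now()) {
  if (fs.existsSync(file) &&
      isFresh(fs.statSync(file).mtimeMs, nowMs, MAX_AGE_MS)) {
    return fs.readFileSync(file, 'utf8');
  }
  const content = render();
  fs.writeFileSync(file, content, 'utf8');
  return content;
}

// Usage with a throwaway file and a render function that counts its calls.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'ttl-cache-'));
const file = path.join(dir, 'thread.cache');
let renders = 0;
const render = () => 'rendered ' + (++renders);

const a = readThrough(file, render); // nothing cached yet: renders
const b = readThrough(file, render); // still fresh: read from file
const c = readThrough(file, render, Date.now() + 16 * 60 * 1000); // expired
```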

We could, of course, delete the entire stock of stored text files containing the cached data every time a user posts in the forum. This would make sure that no cached content will be presented in case there is new data in the database. However, I don’t think that this is a very good idea in my case, because I have literally thousands of cached snippets (think hundreds of forum threads times tens and sometimes hundreds of pages) and deleting all of them would actually take some time for the server. Also, I am throwing the baby out with the bath water by deleting a lot of data that is not expired by a user posting in that very forum thread.

The answer is, again, simple. Always make sure that you can back-reference a cache file with its content! A very common method for storing cache files on the server is using a hash (usually md5()) to make a unique filename for the content being cached. However, the drawback is that once you’ve md5()-ed, you can’t go back, since you can’t make potatoes out of fries or a cow from a hamburger. In my case, I decided on making up the file names of the cache files based on the content being cached. If I am caching forum 34, thread 34124 and page 2, the cache file is called “34-34124-2.cache”. Simple, no?
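To illustrate why readable names help, here is a sketch of building such a name and picking out exactly the files that a new post makes stale. The helper names are made up, and treating every cached page of the affected thread as stale is an assumption about the invalidation rule, not something stated above.

```javascript
// Build the readable cache name: forum-thread-page.
function cacheName(forumId, threadId, page) {
  return forumId + '-' + threadId + '-' + page + '.cache';
}

// Recover the ids from a file name, or null if it is not a cache file.
function parseCacheName(name) {
  const m = /^(\d+)-(\d+)-(\d+)\.cache$/.exec(name);
  return m ? { forumId: +m[1], threadId: +m[2], page: +m[3] } : null;
}

// Pick the files that a new post in one thread makes stale,
// leaving every other thread's cache untouched.
function staleAfterPost(fileNames, forumId, threadId) {
  return fileNames.filter(function (name) {
    const ids = parseCacheName(name);
    return ids !== null &&
      ids.forumId === forumId &&
      ids.threadId === threadId;
  });
}

const files = [
  cacheName(34, 34124, 1),
  cacheName(34, 34124, 2),
  cacheName(34, 999, 1),
  cacheName(12, 5, 1)
];
const stale = staleAfterPost(files, 34, 34124);
```

With a hashed file name, the only way to find these two files would be to recompute the hash for every possible page number; with a readable name, a simple pattern match is enough.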

“But”, you say. “Having thousands of small text files on the server will make the request slow anyway, since the time the server needs to look up a file grows with the number of files in the same folder!”

Yes, it does. Have you ever made the mistake of not deleting session files in Apache after the session is expired? After a while, the web site will be extremely slow when Apache tries to locate your session file in a folder with a million files. My solution was to make a folder structure based on the md5() of the cache name, but using only the first two characters from the md5(). So, I currently have 235 folders storing the cache data, but I retained back-referencing!
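A sketch of that folder scheme, using Node’s built-in `crypto` module: two hexadecimal characters give at most 16 × 16 = 256 folders, while the file inside keeps its readable name. The function name `cachePath` and the base directory are hypothetical, and whether the hash is taken over the name with or without the “.cache” extension is an assumption.

```javascript
const crypto = require('crypto');
const path = require('path');

// Place a cache file in a sub-folder named after the first two hex
// characters of the md5 of its name. The file name itself stays
// readable, so back-referencing still works.
function cachePath(baseDir, name) {
  const shard = crypto.createHash('md5')
    .update(name)
    .digest('hex')
    .slice(0, 2);
  return path.join(baseDir, shard, name);
}

// Hypothetical base directory, for illustration only.
const example = cachePath('/var/cache/forum', '34-34124-2.cache');
```

Since the folder is derived from the name, the full path can always be recomputed from the forum, thread and page numbers alone.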

This solution is so simple it could be used anywhere paged results are used. It costs next to nothing to implement (a little memory usage on the client, but unless you have pages of HTML using hundreds of kilobytes, this is of no concern) and I have found no drawback to the method. Faster browsing, less traffic and less load on the server are your reward.

Here is a tiny bit of example Javascript just to show what I am talking about:

[quickcode:noclick]// pagesCached is an array of booleans,
// pageCache is an array of strings containing locally cached content.
var pagesCached = [];
var pageCache = [];

function printThread(forumid, threadid, page) {
  // Check for local cache first
  if (pagesCached[page] && pageCache[page].length > 10) {
    alert("From JS cache"); // We have a local cache!
    // First, choose what div to write to
    var div = document.getElementById("forumthread");
    // Next, set the innerHTML of that div to the content of local cache
    div.innerHTML = pageCache[page];
    // Now, if the next page in the result set is not
    // already present in the local cache - go get it!
    if (!pagesCached[page + 1]) {
      // Insert your own loading/caching code here
    }
  }
  else {
    alert("From AJAX");
    // Go get the page from the server,
    // but don't forget to cache it locally afterwards!
  }
}
[/quickcode]

/M;

2 thoughts on “Efficient dual caching of dynamic web content”

  1. If the forum entries / profile snippets are not changing upon profile changes (eg. new signature, number of posts written up-to-now, …), all pages except the last one of a forum thread are fixed and will never change. Thus they can get a far-future expires header and the browser will handle it. If the cached-page loading is still much slower than with the Ajax-caching you mention, improve the overall HTML (navigation, CSS, scripts, etc.) for performance.

    I once did a lot of Ajax for loading new “pages” faster to find out that standard website performance improvement tricks (eg. put CSS at top, script includes at the end of body,…) can be really fast as well. Note that I mean loading “big content” parts of the site, not those small snippets or autocompletion features for which Ajax is the only choice for better usability through speed.

    See http://developer.yahoo.com/performance/

  2. Alexander, thank you for your comments. What you say is basically true, and in a “standard” forum setup, far-future expires headers are definitely our friend. In my case, I have a somewhat “weird” forum setup, in which the order of posts is reversed, so the pages will be different pretty much every time somebody posts in the thread.

    Even so, far-future expires headers are very important for many reasons, some of which I intend to talk about in a not-so-far-future blog post. :)

    /M;
