Cache a large array: JSON, serialize or var_export?

Monday 06 July 2009 10:30
By Taco van den Broek

While developing software like our framework, sooner or later you will need to cache a large data array to a file. At that point you have to choose a caching method. In this article I compare three such methods: JSON, serialization and var_export() combined with include().

Too curious? Jump right to the results!
JSON

The JSON method uses the json_encode and json_decode functions. The JSON-encoded data is stored as-is in a plain text file.
Code example
// Store cache
file_put_contents($cachePath, json_encode($myDataArray));
// Retrieve cache (pass true so json_decode returns arrays instead of stdClass objects)
$myDataArray = json_decode(file_get_contents($cachePath), true);

pros

* Pretty easy to read when encoded
* Can easily be used outside a PHP application

cons

* Only works with UTF-8 encoded data
* Will not round-trip objects other than instances of the stdClass class: on decode, every object comes back as stdClass (illustrated below).
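
A quick illustration of that last con (the User class here is a hypothetical example of mine, not from the benchmark):

class User { public $name = 'Alice'; }

$encoded = json_encode(new User());   // '{"name":"Alice"}'
$decoded = json_decode($encoded);     // stdClass, not User
var_dump($decoded instanceof User);   // bool(false): the class is lost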

Serialization

The serialization method uses the serialize and unserialize functions. The serialized data is, just like the JSON data, stored as-is in a plain text file.
Code example
// Store cache
file_put_contents($cachePath, serialize($myDataArray));
// Retrieve cache
$myDataArray = unserialize(file_get_contents($cachePath));

pros

* Does not need the data to be UTF-8 encoded
* Works with instances of classes other than the stdClass class (see the example below).

cons

* Nearly impossible to read when encoded
* Cannot be used outside a PHP application without writing custom parsing functions
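
In contrast with JSON, serialize() does round-trip instances of arbitrary classes (again using a hypothetical User class):

class User { public $name = 'Alice'; }

$restored = unserialize(serialize(new User()));
var_dump($restored instanceof User);   // bool(true): the class survives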

Var_export

This method 'encodes' the data using var_export and loads it using the include statement (no need for file_get_contents!). The encoded data needs to live in a valid PHP file, so we wrap it in the following PHP code:

<?php
return /*var_export output goes here*/;

Code example
// Store cache
file_put_contents($cachePath, '<?php return ' . var_export($myDataArray, true) . ';');
// Retrieve cache
$myDataArray = include($cachePath);

pros

* No need for UTF-8 encoding
* Is very readable (assuming you can read PHP code)
* Retrieving the cache uses one language construct instead of two functions
* When using an opcode cache your cache file will be stored in the opcode cache. (This is actually a disadvantage, see the cons list).

cons

* Needs PHP wrapper code.
* Cannot encode objects of classes that do not implement the __set_state() method.
* When using an opcode cache, your cache file will be stored in the opcode cache. If you do not need a persistent cache this is useless: most opcode caches can store values in shared memory directly, so if you don't mind keeping the cache in memory only, use the shared memory without writing the cache to disk first.
* The stored file has to be valid PHP. If it contains a parse error (which could happen when your script crashes while writing the cache), your application will stop working. A mitigation is sketched after this list.
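
Two sketches to go with those last cons (the User class and the writeCacheAtomically() helper are hypothetical examples of mine, not part of the original article):

class User
{
    public $name;

    // var_export() output for this class calls User::__set_state(),
    // so the class must implement it to be restorable via include().
    public static function __set_state(array $props)
    {
        $user = new User();
        $user->name = $props['name'];
        return $user;
    }
}

// Writing to a temporary file and rename()-ing it into place prevents a
// crash mid-write from leaving a half-written, unparseable PHP file behind.
function writeCacheAtomically($cachePath, $data)
{
    $tmpPath = $cachePath . '.' . uniqid('tmp', true);
    file_put_contents($tmpPath, '<?php return ' . var_export($data, true) . ';');
    rename($tmpPath, $cachePath); // rename() is atomic on the same filesystem
}

For the shared-memory route mentioned above, APC for example offers apc_store() and apc_fetch().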

Benchmark

In my benchmark I used 5 data sets of different sizes (measured in memory usage): 904 B, ~18 kB, ~250 kB, ~4.5 MB and ~72.5 MB. For each of these data sets I ran the following routine with each encoding method:

1. Encode the data 10 times
2. Calculate the string length of the encoded data
3. Decode the encoded data 10 times
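
The routine boils down to something like the following (a minimal sketch of my own, shown here for serialize; the actual benchmark script is not reproduced in this article):

// 1. Encode the data 10 times
$start = microtime(true);
for ($i = 0; $i < 10; $i++) {
    $encoded = serialize($myDataArray);
}
$encodeTime = microtime(true) - $start;

// 2. Calculate the string length of the encoded data
$length = strlen($encoded);

// 3. Decode the encoded data 10 times
$start = microtime(true);
for ($i = 0; $i < 10; $i++) {
    $decoded = unserialize($encoded);
}
$decodeTime = microtime(true) - $start;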

Results

Yay, results! The result tables show the length of the encoded string, the total time used for encoding and the total time used for decoding. The benchmark was done on my laptop: 2.53 GHz, 4 GB RAM, Ubuntu Linux, PHP 5.3.0RC4.
904 B array
                JSON                 Serialization         var_export / include
Length          105                  150                   151
Encoding        0.0000660419464111   0.00004696846008301   0.00014996528625488
Decoding        0.0011160373687744   0.00092697143554688   0.0010221004486084

18.07 kB array
                JSON                 Serialization         var_export / include
Length          1965                 2790                  3103
Encoding        0.0005040168762207   0.00035905838012695   0.001352071762085
Decoding        0.0017290115356445   0.0011298656463623    0.0056741237640381

290.59 kB array
                JSON                 Serialization         var_export / include
Length          31725                45030                 58015
Encoding        0.0076849460601807   0.0057480335235596    0.02099609375
Decoding        0.014955997467041    0.010177850723267     0.030472993850708

4.54 MB array
                JSON                 Serialization         var_export / include
Length          507885               720870                1059487
Encoding        0.13873195648193     0.11841702461243      0.38376498222351
Decoding        0.29870986938477     0.21590781211853      0.53850317001343

72.67 MB array
                JSON                 Serialization         var_export / include
Length          8126445              11534310              19049119
Encoding        2.3055040836334      2.7609040737152       6.2211949825287
Decoding        4.5191099643707      8.351490020752        8.7873070240021

We ran the same benchmark on eight other machines, including Windows and Mac OS machines and some web servers running Debian. Some of these machines had PHP 5.2.9 installed; others had already switched to 5.3.0. All showed the same relative results, except for one MacBook on which serialize was faster at encoding the largest data set.
Conclusion

As you can see, the var_export method (without opcode cache!) doesn't come out that well, and serialize seems to be the overall winner. What bothered me, though, was the largest data set, on which JSON became faster than serialize. Wondering whether this was a glitch or a trend, I fired up my OpenOffice spreadsheet and created some charts:

[Charts: relative encoding and decoding speed of each method, per data set size]

The charts show the relative speed of each method compared to the fastest method (so 100% is the best a method can do). As you can see, both JSON and var_export become relatively faster as the data set grows (arrays of 70 MB and bigger? Maybe you should reconsider the structure of your data set :)). So when using a sanely sized data array: use serialize. When you want to go crazy with large data sets: use anything you like; disk I/O will become your bottleneck.

Reactions on "Cache a large array: JSON, serialize or var_export?"

garfix (Patrick van Bergen)
Placed on: 07-09-2009 16:30

Good job, Taco. Wish php.net had these kinds of stats.
Geert
Placed on: 08-04-2009 10:16

Very useful benchmarks. Thanks.
Ries van Twisk
Placed on: 08-13-2009 04:52

Do you happen to have any results where you have used an opcode cache?

I can only imagine that with an opcode cache the var_export method is faster. Purely theoretically this would mean that with an include the data is 'there' and shouldn't have to be parsed anymore.

Ries
Peter Farkas
Placed on: 09-29-2009 16:56

This style is the one I like so much!
Thank you!
Brilliant work!
Vasilis
Placed on: 01-15-2010 10:27

Great info man... I like benchmarks! Thank you
Nice work!
Placed on: 03-24-2010 17:35

Thanks a million - refreshing to see solid content.

Concise and well documented, perfect.
Frank Denis
Placed on: 05-13-2010 20:47

If speed and size matters, igbinary beats all of these hands down: http://opensource.dynamoid.com/

Reference: http://techblog.procurios.nl/k/618/news/view/34972/14863/Cache-a-large-array-JSON-serialize-or-var_export.html
