Friday, December 5, 2008

Handling large files with(out) PHP

Just as one man was quoted as saying "640K of memory should be enough for anybody", no one will ever need to access more than 2 GB of data. What happens if you - just for scientific reasons, of course - try to access larger files using your 32-bit hardware and your favorite programming language, PHP? For a first test, let's take this file
$ ls -hl dummyfile
-rw-r--r-- 1 johannes users 2.2G 2006-02-02 14:32 dummyfile

and a bit of code to read the first few bytes of it:

<?php
$fp = fopen("dummyfile", "r");
$data = fread($fp, 255);
fclose($fp);
?>


Now it's time to run the script, but first let's think about what the right behavior would be. In this case it's quite simple: we just expect an empty page. So let's see:
Warning: fopen(dummyfile): failed to open stream: File too large in ....

Ouch. In some cases it's enough to remember the initial quote and simply reject files that are too large. The code could look something like this:

<?php
$file = "dummyfile";
if (filesize($file) > 2147483647) {
    die("File too big");
}
$fp = fopen($file, "r");
$data = fread($fp, 255);
fclose($fp);
?>

So let's see what happens now:
Warning: filesize(): stat failed for dummyfile in ...
Warning: fopen(dummyfile): failed to open stream: File too large in ...

And once again it didn't help; even filesize() can't handle this file. So the only way to catch these errors is the same way we handle fopen() errors in general: shut down the error reporting and check the result. For silencing the error reporting we can simply use the @ operator. Now somebody could show up and say "hiding errors is bad, you need to prevent them with checks". The only answer I can give to such a statement is: I just showed an example we can't check for. And there are other things - think about race conditions - which we can't check for or prevent, so the only option is to check after the failure occurred. But back to our topic. Here is a script which handles the errors and processes the file, as far as PHP is able to:

<?php
$fp = @fopen("dummyfile", "r");
if (!$fp) {
    // You should do this a bit nicer...
    die("Unable to process the data.");
}
$data = fread($fp, 255);
fclose($fp);
?>


The only reliable way to show a better error message is by parsing the one PHP generated. You can get at it either through $php_errormsg, if the track_errors ini setting is enabled, or by registering your own error handler.
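
Just to illustrate the error-handler variant, a minimal sketch could look like the following; the handler name, the global variable and the message check are my own invention and not code from the application described later:

<?php
// Hypothetical sketch: remember the warning text fopen() would print so a
// friendlier message can be shown afterwards. Names are illustrative only.
$GLOBALS['last_warning'] = '';

function rememberWarning($errno, $errstr)
{
    $GLOBALS['last_warning'] = $errstr;
    return true; // tell PHP not to print the warning itself
}

set_error_handler('rememberWarning', E_WARNING);
$fp = fopen("dummyfile", "r");
restore_error_handler();

if (!$fp) {
    if (strpos($GLOBALS['last_warning'], 'File too large') !== false) {
        die("Sorry, this file is too large for this PHP build.");
    }
    die("Unable to process the data.");
}
?>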

I won't discuss error handling any further, since we still have another, bigger problem: what if we actually need to access the data in our huge file? The first step is to see why we are having these problems at all. The short explanation is quite simple: PHP isn't platform independent. It abstracts quite a lot of stuff, but far from everything. One example is that some functions are simply missing on a specific platform - checkdnsrr(), for instance, isn't available on Windows. Another difference between platforms, one which is overlooked quite often, concerns one of the most frequently used data types for PHP variables: the integer. In PHP's implementation the architecture-dependent C type long is used all over the place. In our case all file system operations use the operating system's functions, which return their data with this size, even though most systems have a way to handle larger files.

For the file system functions there is even a way to change this: if you add "-D_FILE_OFFSET_BITS=64" to your compiler's command line flags via the CFLAGS environment variable, the libc will use stat64 instead of the standard stat call, and the other file system functions follow suit. So let's give it a shot:
$ CFLAGS="-D_FILE_OFFSET_BITS=64" ./configure --with-stuff-i-need

Now it's time to test it: fopen() works, fread() works. It feels great. But that's it - some issues remain. For example, filesize() suffers from an integer overflow:
$ php -r 'echo filesize("dummyfile");'
-1988084300
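
If all you need is the size, one possible workaround - just a sketch, assuming a Unix-like system with GNU stat installed - is to ask the shell for it and keep the result as a float, which can hold values beyond PHP's integer range:

<?php
// Hedged sketch: determine the size of a possibly huge file without relying
// on PHP's integer-based filesize(). Assumes GNU coreutils' stat is present.
function big_filesize($file)
{
    $size = exec('stat -c%s ' . escapeshellarg($file));
    // Keep the value as a float so sizes above the 32-bit range survive
    return (float)$size;
}

printf("%.0f bytes\n", big_filesize("dummyfile"));
?>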

Other functions, like fseek(), won't work at all:
$ php -r '$fp = fopen("dummyfile", "r"); var_dump(fseek($fp, 2147483700));'
int(-1)
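
If you really have to read from an offset beyond the 2 GB mark, one conceivable workaround - again only a sketch, not something from the application described below - is to let a shell tool do the seeking and read its output through a pipe:

<?php
// Hedged sketch: read 255 bytes from an offset fseek() cannot reach by
// letting tail do the seeking. tail -c +N starts output at byte N (1-based).
$offset = 2147483700;
$fp = popen('tail -c +' . ($offset + 1) . ' ' . escapeshellarg("dummyfile"), 'r');
$data = fread($fp, 255);
pclose($fp);
?>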

In some cases this might be enough; in others it isn't. 64-bit platforms are getting cheaper and cheaper, so new hardware might be the solution to the problem, but if you can't get other hardware you need to find a solution within the application. An application we're building for one of our clients imports CSV files, processes them, generates some statistics and exports them afterwards. Currently we're processing around 5 GB per reporting date and we're expecting 20 GB soon. In that process not only does the total size grow, the individually imported files get bigger as well. Before we see how we solved the 2 GB problem, let's take a short look at how the application works:

First the data files are uploaded to the server as zip/gzip/bz2 compressed CSV files. Then these files are extracted and a short verification is done to check whether the data seems to be in a valid format and is correct. After these checks the file is split into several chunks. These chunks are validated and imported into the database by multiple parallel processes. Since we need the chunked files anyway, we re-ordered the import steps: first split the file, then do the validation. For splitting the data files we use the typical Unix shell tools, which - on modern systems - can all handle large files by default. The benefit is that we can rely on stock PHP packages and don't need to worry about whether the admin used the right compile flags or not. As we're handling CSV files, we always need the header from the first line of the original file in every chunk.

Let's look at some code:

<?php
function _makeFilechunks($srcFilename, $chunkSize, $filePrefix)
{
    $splitcmd = sprintf("split -a5 -l %u %s %s 2>&1", $chunkSize, $srcFilename, $filePrefix);
    $output = null;
    exec($splitcmd, $output, $return_code);
    if ($return_code) {
        die("Chunk split failed, code: ".$return_code." message: ".implode("\n", $output));
    }

    $chunkfiles = glob($filePrefix.'*');
    sort($chunkfiles);

    // We need the last element's key during the iteration...
    end($chunkfiles);
    $lastkey = key($chunkfiles);

    $chunks = array();
    $had_first = false;
    foreach ($chunkfiles as $key => $chunk) {
        if (!$had_first) {
            // We don't need to change the first chunk, but we need its first
            // line (the CSV header) for all the following chunks
            $header = exec("head -n1 ".$chunk);
            $fp = fopen($filePrefix."__header_", "w");
            fwrite($fp, $header."\n");
            fclose($fp);

            // The first chunk keeps the original header, so it holds
            // $chunkSize - 1 data rows; the following chunks hold $chunkSize
            // data rows, since we add the header to them separately
            $chunks[$key] = $chunkSize - 1;

            $had_first = true;
            continue;
        }

        // The second to the last chunks need the header, too - so add it
        $cmd = sprintf('cp %1$s__header_ %1$s__tmp_ && cat %2$s >> %1$s__tmp_ && mv %1$s__tmp_ %2$s', $filePrefix, $chunk);
        exec($cmd, $output, $return_code);
        if ($return_code) {
            die("Adding headers to chunk failed, code: ".$return_code." message: ".implode("\n", $output));
        }

        // The last chunk is usually smaller than $chunkSize
        if ($key == $lastkey) {
            // wc returns the result like "12345 filename"; trim and an int cast
            // give us just the number of lines. Since we already added the
            // header we need one row less.
            $numrows = (int)trim(exec("wc -l ".$chunk)) - 1;
        } else {
            $numrows = $chunkSize;
        }

        $chunks[$key] = $numrows;
    }

    // Remove the tempfile we used to store the CSV header
    unlink($filePrefix.'__header_');

    return $chunks;
}
?>
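
Called roughly like this - the path, chunk size and prefix are made-up values, not our production settings - the function returns an array with the number of data rows per chunk:

<?php
// Illustrative call only - the arguments are made-up example values.
$chunks = _makeFilechunks('/data/import/report.csv', 50000, '/data/import/chunk_');

foreach ($chunks as $i => $rows) {
    echo "Chunk #$i holds $rows data rows\n";
}
?>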

Now we've got our chunk files, which are all of a manageable size and need no special PHP setup, and chances are high that the approach will survive future PHP updates. And don't forget: "People said I should accept the world. Bullshit! I don't accept the world" - just to quote another man.


A client recently had a problem processing an XML file with his PHP script. "File too large" was the error, and the data file was over 2 gigabytes in size. It turns out you can recompile PHP to deal with large files (see the Requirements section here). But this


http://us2.php.net/manual/en/ref.filesystem.php

According to the PHP docs, you should also use the flag -D_LARGEFILE_SOURCE. I have not tried it, but I wonder if that would fix some of those overflow issues.


Depends on what you're trying to do. If you just want to know whether the files are different from one another, you can use md5_file() on each of the files and compare the MD5 hashes.

If your 10GB files contain data you want to compare individually, such as a CSV, you could always create a temporary MySQL table to house the CSV contents for both of the files and then import the CSV files into MySQL from the command line. (In your temp table you would need to identify the contents of the first file to the second -- such as assigning an ID of 1 to every record that is part of your first CSV.) Then it's just a matter of comparing the data between the two tables with vanilla SQL. You're not limited with PHP at all at that point.

The reason I'm here is I'm trying to find a solution for the filesize() issue. I have a file that is 11GB big and the return value of filesize() only represents about 2.7GB of the file. I need to be able to compare the file size of one file to another and give back the difference between the two files in MB. If anyone has a solution for that problem I'd greatly appreciate an answer. I think the solution is using exec('dir \\externalhost\path\to\file.ext') to get the number of bytes, but I still have the issue of the number of bytes being greater than the amount PHP will support when calculating the difference...
