Wednesday, December 16, 2009

Remote MySQL table comparison

Remote MySQL table comparison
by gmax

In the current Agust 2004 issue of SysAdmin there is an article about one of my favorite subjects, i.e. remote table comparison applied to MySQL databases.

The source code is actually working code (not as documented as I would have liked it, though, but space was not unlimited) with which you can compare two tables in two remote databases and see wether they differ at all, and if they do, there are method to find out which rows are different, without copying the data from a host to the other.
How this works is minutely explain in the article, so I won't repeat it here. However, since the article deals mostly with database issues, I would like to spend some words on the supporting Perl code, which does the magic of creating the not-so-obvious SQL queries to carry out the task. I won't examine the most complex script, the one finding the detailed differences between tables, because I'd need to introduce too much background knowledge to explain them in full. Therefore, I would like to present a smaller function that will tell you if two tables are different at all, so you can take further action.
This is a slightly modified version of the source code for the same task published in the magazine. I am thinking about making a CPAN module with the whole thing, but not right now.

Let's start with the algorithm used.


In short, if you want to compare two large records, you can make a signature of the whole record, by joining together their fields and applying a CRC function, such as MD5 or SHA1.
Comparing a whole table is trickier, because in standard SQL, without stored procedures and cursors, obtaining the CRC of a range of records is far from trivial. One possibility, though, is to make a CRC for each record and then sum them up to get a result that can be consider the table signature.

Here's the code.

sub get_table_CRC { my $dbh = shift; my $tablename = shift; my $fields = get_fields($dbh, $tablename); my $check_table = qq[ SELECT COUNT(*) AS cnt, CONCAT(SUM(CONV( SUBSTRING(\@CRC:=MD5( CONCAT_WS('/##/',$fields)),1,8),16,10 )), SUM(CONV(SUBSTRING(\@CRC, 9,8),16,10)), SUM(CONV(SUBSTRING(\@CRC,17,8),16,10)), SUM(CONV(SUBSTRING(\@CRC,25,8),16,10)) ) AS sig FROM $tablename ]; # uncomment the following line to see the full query # print $check_table,$/; my ($count, $crc); eval { ($count, $crc) = $dbh->selectrow_array($check_table)}; if ($@) { return undef; } return [$count, $crc]; }


The SQL part is quite complicated by the fact that MySQL can't handle arithmetic operations with the kind of number that can result from a MD5 signature. To get over this problem, I split the MD5 string into four chunks, using the SUBSTRING function, and convert them from base 16 to base 10 using CONV. The result is a simple number that is passed to SUM. The MD5 is calculated once per record and assigned to a global MySQL variable (@CRC). It is the signature of a string composed by all fields in the record, with some adjustments to avoid NULL values to come into the equation.
Perl's biggest involvement in all this, apart form making the query, which is a bitch to create manually, is the get_fields function that reads the table structure and creates the list of fields to be be passed to MD5. If one field is nullable, a call to the COALESCE function will be used instead of its bare name.

sub get_fields { my ($dbh, $tablename) = @_; my $sth = $dbh->prepare(qq[describe $tablename]); $sth->execute(); my @fields=(); while (my $row = $sth->fetchrow_hashref()) { my $field ="`$row->{Field}`"; # backticks # if the field is nullable, # then a COALESCE function is used # to prevent the whole CONCAT from becoming NULL if (lc $row->{Null} eq 'yes') { $field = qq[COALESCE($field,"#NULL#")]; } push @fields, $field; } return join ",", @fields; }


With these two functions ready, we can actually run a test on two tables in two different hosts.
To be sure that the function works as advertised, this script will actually create the tables in both hosts. The first time this script runs, the newly created tables have the same contents, and the result will be "no differences". If you run it a second time, one record will be altered, just barely, and the result will be different.

#!/usr/bin/perl -w use strict; use DBI; # for this test, we create the database handlers directly # in the script my $dbh1 = DBI->connect('dbi:mysql:test;host=localhost', 'localuser', 'localpassword', {RaiseError => 1}); my $dbh2 = DBI->connect('dbi:mysql:test;host=192.168.2.33;port=13330', + 'remoteuser', 'remotepassword', {RaiseError => 1}); # this is the table to be created in both hosts my $tablename = 'testcrc'; my $create = qq{create table if not exists $tablename (i int not null, j int, a char(1), b float )}; my ($table_exists) = $dbh1->selectrow_array( qq{SHOW TABLES LIKE '$tablename'}); if ($table_exists) { # table exists. Let's make a tiny change $dbh1->do(qq{update $tablename set j = j-1 where i = 50}); } else # table does not exists. Create and populate { # create both tables $dbh1->do($create); $dbh2->do($create); my $insert = qq{insert into $tablename values (?, ?, ?, ?)}; my $sth1 = $dbh1->prepare($insert); my $sth2 = $dbh2->prepare($insert); # populates both tables with the same values for ( 1 .. 100 ) { $sth1->execute($_, $_ * 100, chr(ord('A') + $_), 1 / 3 ); $sth2->execute($_, $_ * 100, chr(ord('A') + $_), 1 / 3 ); } } my %probes; # gets the local table record count and CRC $probes{'local'} = get_table_CRC($dbh1, $tablename) or die "wrong info: $DBI::errstr\n"; # gets the remote table record count and CRC $probes{'remote'} = get_table_CRC($dbh2, $tablename) or die "wrong info: $DBI::errstr\n"; # Checks the result and displays print "LOCAL : @{$probes{'local'}}\nREMOTE: @{$probes{'remote'}}\n"; if ( ($probes{'local'}->[0] != $probes{'remote'}->[0]) or ($probes{'local'}->[1] ne $probes{'remote'}->[1]) ) { print "there are differences\n"; } else { print "NO DIFFERENCES\n"; }


The first run gives this result (the query is displayed only if you uncomment the relative print statement).
SELECT
    COUNT(*) AS cnt,
    CONCAT(SUM(CONV(
        SUBSTRING(@CRC:=MD5(
            CONCAT_WS('/##/',`i`,
                COALESCE(`j`,"#NULL#"),
                COALESCE(`a`,"#NULL#"),
                COALESCE(`b`,"#NULL#"))),1,8),16,10 )),
    SUM(CONV(SUBSTRING(@CRC, 9,8),16,10)),
    SUM(CONV(SUBSTRING(@CRC,17,8),16,10)),
    SUM(CONV(SUBSTRING(@CRC,25,8),16,10))
    ) AS sig
FROM
    testcrc

LOCAL : 100 211184068521202404576502196746869468230923726643
REMOTE: 100 211184068521202404576502196746869468230923726643
NO DIFFERENCES

The first number is a simple record count. The second one is a concatenated string from the four sums calculated on the MD5 chunks.
When we run the script for the second time, column "j" in record nr. 50 is increased by 1. This displays the following result:

LOCAL : 100 208246712599204490057913196197913536230317654094
REMOTE: 100 211184068521202404576502196746869468230923726643
there are differences

For the third run, we modify the source code and change "set j = j+1" to "set j = j-1". Again, the script reports that there are no differences.


You should notice that what we actually get from the get_table_CRC function is a few dozen bytes, even if the tables contain a million records. The database server may have some number crunching to perform, but you don't need to send gigabytes of data through the network.
This technique can save you hours of searching if you need to know if two remote tables have differences. More granularity is provided by the other functions described in the article.

Comments welcome

=================
Re: Remote MySQL table comparison
by fab (Initiate) on Aug 25, 2004 at 12:07 UTC

I've also read this paper with great interest.

Although there are a few potential bugs in the algorithms and in the implementations presented, the idea is both simple and efficient.

Thus I've implemented a new version which hopefully solves some of the weaknesses I found and has a better theoretical behavior. It is dedicated to PostgreSQL, but may be adapted to other databases.

It is called pg_comparator, a tool for network and time efficient database table content comparison.

see http://www.coelho.net/pg_comparator/ for the perl implementation.
=====================
Another programmer extended Maxia's work even further. Fabien Coelho changed and generalized Maxia's technique, introducing symmetry and avoiding some problems that might have caused too-frequent checksum collisions. This work grew into pg_comparator, http://www.coelho.net/pg_comparator/. Coelho also explained the technique further in a paper titled "Remote Comparison of Database Tables" (http://cri.ensmp.fr/classement/doc/A-375.pdf).

This existing literature mostly addressed how to find the differences between the tables, not how to resolve them once found. I needed a tool that would not only find them efficiently, but would then resolve them. I first began thinking about how to improve the technique further with my article http://tinyurl.com/mysql-data-diff-algorithm, where I discussed a number of problems with the Maxia/Coelho "bottom-up" algorithm. After writing that article, I began to write this tool. I wanted to actually implement their algorithm with some improvements so I was sure I understood it completely. I discovered it is not what I thought it was, and is considerably more complex than it appeared to me at first. Fabien Coelho was kind enough to address some questions over email.

=======

Reference:
http://www.perlmonks.org/?node_id=381053
http://www.maatkit.org/doc/mk-table-sync.html

No comments: