Saturday, December 13, 2008

Remove Smart Quote Quotation / weird unwanted character from a CSV file (Comma separated values)

Remove Smart Quote Quotation / weird unwanted character from a CSV file (Comma separated values)

Today, I was using Regular Expression to parse the data. I encountered a weird character problem, which is so-called the smart quote. The smart quote messes up my rest of data, makes the parser not able to parse the data correctly. Then, I start looking for a method to edit hex in vi / vim . This is what I found:

method 1:
Type following command in the vi's command mode to convert the file to a display in hex.
:%!xxd

Type following command when you are done in hex.
:%!xxd -r

method 2:
start out with vi -b filename to edit a hex or binary file,

method 3:
Type following command in the vi's command mode
:set display=uhex.

The hex code of smart quote can be one of following depends on which encoding type you used:

windows-1252: 93
iso-8859-1: 3F (question mark, i.e., cannot be represented)
Big5: A1A7
UTF-8: E2809C
UTF-16: 201C
UTF-16BE: 201C
UTF-16LE: 1C20
UTF-7: /v8gHA (not the hex representation, that would be: 2F7638674851)
Shift-Jis: 8167
IBM500: 3F (wrongly represented by the serializer: SUB is a control code in IBM encodings)
GB 18030: A1B0
MacRoman: D2 (the mac codepage actually has a 'smart quote')
EUC-JP: A1C8

Finally, I come up with a batch script to replace these smart quotes to # (23) number sign:

Note: I saved the file as the Western European (Windows) encoding (windows-1252), so the code of smart quote is 93 (left quote) and 94 (right quote)

Recently I’ve been running into a lot of silliness that appears in files exported from MS applications (Word, Excel, etc.) called ‘Smart quotes’.

Basically MS uses higher level ascii characters to represent quotes that mean more than regular quotes (whatever for?!). You see this weirdness in vi as <93><92> etc. which are the hex values of these characters. I had to hunt a bunch on google to find out how to fix this, although the fix is very easy.

For each value that you see in your file, just do a string substitution, like so:

:%s/<93>/\’/g

of course, you can’t just type that <93> in there, so to get it in there you use (via: http://www.vim.org/htmldoc/usr_45.html)

CTRL-V x 93

which inserts hex 93 in place.

In recently exported CSV’s from excel, I’ve seen hex 91-97.

Quite annoying, frankly.



You could also use perl, like here

On Unix:
perl -pi -e 's/\x93\x94/"/g' myfile.html

On Windows:
perl -pi -i.bak -e "s/\x93\x94/\"/g" myfile.html

and use it on multiple files.

removeChars.pl:

#!C:\Perl\bin\perl.exe

# this script does four things:
# 1) remove the first line.
# 2) replace all commas inside quotes to underscore (_)
# 3) Remove all quotes
# 4) Remove the lines that its first field is not digit.
# Run: perl parse.pl file_name

#$fileName = @ARGV[0];
#printf("\n");

foreach my $fileName (@ARGV) {
open FILE, " open FILE2, ">zip_file/data/" . $fileName . ".done" or die $!;

$i = 0;
while () {
# I don't need the first line, so we start from second line.
if ($i==1) {
# assigning each line to $str
$str = $_;

$stopLoop = ( $str =~ m/^[^0]/ );
if ($stopLoop) {
last;
}

# $str=~ s/("[^"]+")/remove($1,$1)/eg; # for OO
$str=~ s/("[^"]+")/remove($1)/eg;
# Note, if you want to use your own defined function inside
# the s (substitution) command (remove() in above codes for example).
# You will have to use the e Modifier
# e: s (substitution) command only. Execute the output before you substitute it in

# write $str to file2
print FILE2 $str
} else{
$i = 1;
}
}

close FILE;
close FILE2;
}



#$str = '123, "how, are, you", "i, am", "fine, thank, you, and, you"';
#$str=~ s/("[^"]+")/remove($1)/eg;
#print $str;


sub remove {
# my( $self, $s ) = @_; # this is for object orient
my $s = shift or return;

# find: quote, comma, smart quote (left), smart quote (right)
# replace: nothing
# note: can use {$1} to separate other text
$s =~ s/"|,|\x93|\x94//g;
return $s;
}


#分小塊讀入文件到緩存,然後寫出。優點:服務器系統資源損耗小,客戶端下載速度快。
# while(sysread(DLFILE, my $buffer, 256)){print $buffer;}
# close (DLFILE) || $main->printMessage('關閉文件錯誤,可能文件不存在。出現此錯誤請和網管聯繫,謝謝。');


sub de_cp1252sub de_cp1252 {
my( $self, $s ) = @_;

# Map incompatible CP-1252 characters
$s =~ s/\x82/,/g;
$s =~ s-\x83-<em>f</em>-g;
$s =~ s/\x84/,,/g;
$s =~ s/\x85/.../g;

$s =~ s/\x88/^/g;
$s =~ s-\x89- °/°°-g;

$s =~ s/\x8B/</g;
$s =~ s/\x8C/Oe/g;

$s =~ s/\x91/'/g;
$s =~ s/\x92/'/g;
$s =~ s/\x93/"/g;
$s =~ s/\x94/"/g;
$s =~ s/\x95/*/g;
$s =~ s/\x96/-/g;
$s =~ s/\x97/--/g;
$s =~ s-\x98-<sup>~</sup>-g;
$s =~ s-\x99-<sup>TM</sup>-g;

$s =~ s/\x9B/>/g;
$s =~ s/\x9C/oe/g;

# Now check for any remaining untranslated characters.
if ($s =~ m/[\x00-\x08\x10-\x1F\x80-\x9F]/) {
for( my $i = 0; $i < length($s); $i++) {
my $c = substr($s, $i, 1);
if ($c =~ m/[\x00-\x09\x10-\x1F\x80-\x9F]/) {
printf(STDERR "warning--untranslated character 0x%02X i
+n input line %s\n",
unpack('C', $c), $s );
}
}
}

$s;
}


Reference:
http://aditya.sublucid.com/2008/01/18/replacing-those-pesky-smart-quotes-in-vim/
http://www.usenet-forums.com/linux-general/98618-can-vi-show-edit-binary.html
http://www.biglist.com/lists/xsl-list/archives/200709/msg00358.html
http://lists.freebsd.org/pipermail/freebsd-questions/2003-July/012019.html


Re: [xsl] RE: Smart Quote Encoding
Subject: Re: [xsl] RE: Smart Quote Encoding
From: Abel Braaksma
Date: Fri, 14 Sep 2007 03:17:15 +0200


Hi Roger,

see below



Roger L. Cauvin wrote:
messages to XML. For various reasons (including troubleshooting), I would
like to log the content of the e-mails exactly. It sounds like that's
simply not possible in XML, at least to the extent that "text-only" can
include characters not allowed in XML.
I have a program that is receiving text-only e-mails and logging the


Text only is something that is very much misunderstood, simply because there ain't such a thing as text-only. Is that IBM-437 format (that is: original DOS). Is that ASCII? Some people call it plain old ascii when it is in fact windows-1252. And many people don't know the differences between windows-1252 and ISO-8859-1. If text-only means UTF-8, it will contain a lot of "binary" looking bytes if viewed with windows-1252 encoding. If you have a mail from gmail, it will be send in UTF-7 (yes, I know, strange). Which is yet another binary format for text-only.

And last but not least, if you have text-only in EBCDIC encoding, it will look like a mess in Windows. But luckily, XML can handle all these encodings, but parsers are only required to handle UTF-8 and UTF-16. Saxon, in my experience, can handle ebcdic well, but not UTF-7 (but that's to blame with Sun who's been refusing for ages to include UTF-7).


I inserted the ISO 8859-1 encoding declaration myself. Apparently, Saxon
6.3 doesn't support windows-1252 encoding. Saxon 8.9J, which I just now
installed, does appear to support that encoding. However, it still
(correctly) flags the U+18 character as illegal.


that does not depend on Saxon but on the Java version you use. Btw, why would you want to use an age-old version of Saxon? Saxon 8.9 can do XML 1.1 and that can represent character U+18.

But you should workout what the encoding is your email program is using. It may be that it stores the file in the same encoding as it is received, but it is more likely that the email program transforms it into some other format. If you read your email from the interface of the email program (i.e., Thunderbird) then you will see what is done with the code. But when you view your email in bare format, you'll have to find out what the email program does to it. In the case of Thunderbird, I believe it stores all in one file and encodes it as UTF-8, but I am not sure (and you probably use a different mailer). Maybe TB even stores different text formats in one file...


More about U+18
---------------
Unfortunately, without knowing what the byte sequence of that character is and without a binary view of your whole file, it is quite hard to determine what the real encoding is. Like David already suggested: you should check the encoding in your file.

Furthermore, you are dealing with a lot of text it seems. If you want more control over what encoding you can choose, you can try unparsed-text($url, $encoding) in XSLT 2.0. Note that you must remove the XML declaration then, because the spec states that the encoding in the declaration has higher precedence than the user specified encoding.

If you want to check a bunch of encodings all at once and see if one fits, you can lookup the list of supported encodings at Sun's website. Make it into a sequence, i.e. and you can loop through all possibilities using:



But to me, it sounds like you have encountered a mail in the lesser used UTF-7 encoding. You were talking of the windows quote. Let's examine that (assuming that you are correct in your analysis that the quote appears where you have found U+18 (which is not U+18, but a byte with hex value 18 that does not translate to the correct character in the encoding that you guessed the file was in)), and this 18h is part of a longer byte sequence where the smart quote is used. The smart code, I hope, is ” (or &_#x201D if the browser/mailer screws it up), which is the one MS Word uses.

You read the file using 'ISO-8859-1'. That means that the first 127 bytes equal there counterparts in the Unicode table (i.e. U+0 to U+7F are the same as the bytes in the stream).

Now, let's see how the smart quote really should look like in certain encodings that I think are likely to be encountered in mail (alle bytes zijn in hex):

windows-1252: 93
iso-8859-1: 3F (question mark, i.e., cannot be represented)
Big5: A1A7
UTF-8: E2809C
UTF-16: 201C
UTF-16BE: 201C
UTF-16LE: 1C20
UTF-7: /v8gHA (not the hex representation, that would be: 2F7638674851)
Shift-Jis: 8167
IBM500: 3F (wrongly represented by the serializer: SUB is a control code in IBM encodings)
GB 18030: A1B0
MacRoman: D2 (the mac codepage actually has a 'smart quote')
EUC-JP: A1C8


I got this information by combining saxon:serialize, saxon:string-to-hexBinary and saxon:string-to-base64Binary (the last one for creating the not supported and non-unicode standard UTF-7). As you can see, there is no encoding in existence that has ever used byte 18 to represent a high character.

The only situation that I can think of to legally have a byte 18 in a sequence of bytes to represent a character, is in multi byte encoding formats. I.e., UTF-16, GB 18030, Big5 etc. I believe that, because it is a control character, that you will only find it as the second byte in any two byte sequence and never as the first byte.


All this just for one character that is illegally encoded in the source? Well, email programs, like I said, save their data in a variety of formats. Or set the output options to XML 1.1 and use unparsed-text.

Hope this "little" story clarifies things a bit for you.


cheers,
-- Abel Braaksma

1 comment:

Anonymous said...

Thank you so much for this post! I used to remember how to do this, but it's been driving me crazy trying to figure out how to type in hex values into my search/replace queries. By far the most informative post on this topic that I found!!! :)