Saturday, January 29, 2011

To check if a file starts with a UTF-8 BOM

To check whether a file starts with a UTF-8 BOM, dump its first three bytes:

# hexdump -n 3 -C 2.txt
00000000 ef bb bf

If the first three bytes are ef bb bf, then YES, the file starts with a
UTF-8 BOM.
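
The same check can be scripted. This is only a minimal sketch in
Python, not part of the original tip; the function names
starts_with_utf8_bom and has_utf8_bom are made up for illustration.

```python
import codecs


def starts_with_utf8_bom(data):
    """Return True if the bytes begin with EF BB BF (the UTF-8 BOM)."""
    return data.startswith(codecs.BOM_UTF8)


def has_utf8_bom(path):
    """Read only the first three bytes of the file and test them."""
    with open(path, "rb") as f:
        return starts_with_utf8_bom(f.read(3))
```

For the file used above, the call would be has_utf8_bom("2.txt").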

====================

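ISO8859-1 is almost identical to ISO8859-15. The latter replaces eight
characters (bytes 0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD and 0xBE)
with the Euro symbol and a few letters needed for French, Finnish and
Estonian. Every byte is valid in both encodings, so the only way to
tell them apart is to look at the symbols in context.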
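
To look at those symbols in context, it helps to list the bytes that
decode differently. A rough sketch in Python, assuming the file has
already been read as bytes; show_ambiguous_bytes is a hypothetical
helper name, and the codec names are the ones Python's standard
library uses.

```python
# The eight byte values that ISO8859-1 and ISO8859-15 interpret differently.
DIFFERING_BYTES = bytes([0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 0xBC, 0xBD, 0xBE])


def show_ambiguous_bytes(data):
    """Return (offset, char in ISO8859-1, char in ISO8859-15) per ambiguous byte."""
    found = []
    for offset, byte in enumerate(data):
        if byte in DIFFERING_BYTES:
            raw = bytes([byte])
            found.append(
                (offset, raw.decode("iso-8859-1"), raw.decode("iso-8859-15"))
            )
    return found
```

If the list comes back empty, the two encodings give identical text
for that file and the question does not matter. Otherwise a human
still has to decide which reading makes sense, e.g. a price followed
by a Euro sign rather than a generic currency sign.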

UTF-8 is identical to ISO8859 for the first 128 characters (ASCII),
which include all the standard US keyboard characters. Beyond that,
each character is encoded as a sequence of two to four bytes, all of
which have the high bit set.
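
Because the multi-byte sequences follow strict rules, text in ISO8859
with accented characters is rarely valid UTF-8, so a strict decode
makes a reasonable test. A minimal sketch in Python; is_valid_utf8 is
an illustrative name, not something from the original post.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")  # strict by default: raises on any bad sequence
        return True
    except UnicodeDecodeError:
        return False
```

Note that pure ASCII passes this test too, since it is valid in UTF-8
and ISO8859 alike.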

On Windows, "Unicode" usually means UTF-16. If you're lucky, there is
a BOM (Byte Order Mark) in the first two bytes of the file: FF FE for
little-endian or FE FF for big-endian. Otherwise, look for 0x00 (null)
as every other byte, which is what you get when the text contains only
basic 7-bit ASCII characters.
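
Both checks are easy to script. The following is a rough Python
sketch under simplifying assumptions: guess_utf16 is a made-up name,
the null-byte test is all-or-nothing (a real detector would use a
threshold), and UTF-32, whose little-endian BOM also begins with
FF FE, is not handled.

```python
import codecs


def guess_utf16(data):
    """Return 'utf-16-le', 'utf-16-be', or None if it does not look like UTF-16."""
    # First choice: an explicit BOM in the first two bytes.
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    if len(data) < 2:
        return None
    # Fallback: ASCII-only text in UTF-16 has 0x00 as every other byte.
    even, odd = data[0::2], data[1::2]
    if odd.count(0) == len(odd) and even.count(0) == 0:
        return "utf-16-le"  # low byte first, null second
    if even.count(0) == len(even) and odd.count(0) == 0:
        return "utf-16-be"  # null first, low byte second
    return None
```

Reading the first few hundred bytes of the file is normally enough
input for a guess like this.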

http://www.xpheads.com/forums/microsoft-public-windowsxp-help_and_support/164700-how-detect-if-text-file-iso8859-1-iso8859-15-utf-8-unicode-encoded.html
