File Encoding Validation

If you’re using a Unix or Unix-like operating system you can leverage the GNU iconv library to validate the encoding of a file or files. Although, the GNU iconv library is actually meant to do file conversions, it can still be used in a way that will give you some understanding if the file(s) contain invalid byte sequences.

$ iconv -f UTF-8 the_file -o /dev/null

In the above example you actually don’t need to necessarily specify the UTF-8 as this is the assumed default encoding for iconv. This example will output a 0 if the file is able to be converted (no encoding errors) or a 1 if it cannot (encoding errors). It will also display the byte offset where the invalid byte sequence occurs. Also, an alternative syntax for this command might be something like:

$ iconv -f UTF-8 < the_file > /dev/null

For more information refer to:

Written on May 3, 2012