public class CharsetToolkit extends Object
Utility class to guess the encoding of a given text file.
Unicode files encoded in UTF-16 (little or big endian), or UTF-8 files with a Byte Order Marker (BOM), are correctly detected. For UTF-8 files with no BOM, the charset should also be detected, provided the buffer is large enough.
A 4 KB byte buffer is used to guess the encoding.
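The BOM-less UTF-8 case described above can be illustrated with a strict decode of the buffer. This is a minimal sketch using only JDK classes, not the toolkit's actual implementation; the class name `Utf8Probe` and the method `looksLikeUtf8` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of CharsetToolkit.
final class Utf8Probe {
    /** Returns true if the first `length` bytes of `buffer` decode as strict UTF-8. */
    static boolean looksLikeUtf8(byte[] buffer, int length) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // Throws on any malformed sequence, including a multi-byte
            // sequence that is cut off at the end of the buffer.
            decoder.decode(ByteBuffer.wrap(buffer, 0, length));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] sample = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(looksLikeUtf8(sample, sample.length));
    }
}
```

Pure ASCII input also passes this check, so a larger buffer makes it more likely that a distinguishing multi-byte sequence is seen. Because this simple version rejects a sequence truncated at the buffer boundary, a real detector would need to tolerate that case.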
Usage:
CharsetToolkit toolkit = new CharsetToolkit(file);
// guess the encoding
Charset guessedCharset = toolkit.getCharset();
// create a reader with the correct charset
BufferedReader reader = toolkit.getReader();
// read the file content
String line;
while ((line = reader.readLine()) != null)
{
    System.out.println(line);
}
| Constructor and description |
|---|
| `CharsetToolkit(File file)` Constructor of the CharsetToolkit utility class. |
| Type Params | Return Type | Name and description |
|---|---|---|
| | `public static Charset[]` | `getAvailableCharsets()` Retrieves all the available Charsets on the platform, among which the default charset. |
| | `public Charset` | `getCharset()` |
| | `public Charset` | `getDefaultCharset()` Retrieves the default Charset. |
| | `public static Charset` | `getDefaultSystemCharset()` Retrieves the default charset of the system. |
| | `public boolean` | `getEnforce8Bit()` Gets the enforce8Bit flag, in case we do not want to ever get a US-ASCII encoding. |
| | `public BufferedReader` | `getReader()` Gets a BufferedReader (in fact a LineNumberReader) from the File specified in the constructor of CharsetToolkit, using the charset discovered, or the default charset if an 8-bit Charset is encountered. |
| | `public boolean` | `hasUTF16BEBom()` Has a Byte Order Marker for UTF-16 Big Endian (utf-16 and ucs-2). |
| | `public boolean` | `hasUTF16LEBom()` Has a Byte Order Marker for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le). |
| | `public boolean` | `hasUTF8Bom()` Has a Byte Order Marker for UTF-8 (used by Microsoft's Notepad and other editors). |
| | `public void` | `setDefaultCharset(Charset defaultCharset)` Defines the default Charset used in case the buffer represents an 8-bit Charset. |
| | `public void` | `setEnforce8Bit(boolean enforce)` If US-ASCII is recognized, enforce returning the default encoding rather than US-ASCII. |
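For orientation, the charsets available on a platform can be listed with the JDK alone. The sketch below assumes, without confirmation from this page, that `getAvailableCharsets()` and `getDefaultSystemCharset()` expose the same information as `java.nio.charset.Charset`; the class `CharsetListing` and its methods are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Collection;

// Hypothetical helper, not part of CharsetToolkit.
final class CharsetListing {
    /** All charsets supported by the running JVM, as an array. */
    static Charset[] availableCharsets() {
        Collection<Charset> all = Charset.availableCharsets().values();
        return all.toArray(new Charset[0]);
    }

    /** True if `candidate` is among the charsets the JVM supports. */
    static boolean isAvailable(Charset candidate) {
        for (Charset c : availableCharsets()) {
            if (c.equals(candidate)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("UTF-8 available: " + isAvailable(StandardCharsets.UTF_8));
        System.out.println("Charsets found: " + availableCharsets().length);
    }
}
```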
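CharsetToolkit(File file)
Constructor of the CharsetToolkit utility class.
file - the file whose encoding we want to know.

public static Charset[] getAvailableCharsets()
Retrieves all the available Charsets on the platform, among which the default charset.
Returns: the available Charsets.

public Charset getDefaultCharset()
Retrieves the default Charset.

public static Charset getDefaultSystemCharset()
Retrieves the default charset of the system.
Returns: the default system Charset.

public boolean getEnforce8Bit()
Gets the enforce8Bit flag, in case we do not want to ever get a US-ASCII encoding.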
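The effect of the enforce8Bit flag can be sketched as a small decision function. This is an illustration under stated assumptions, not the toolkit's code: the class `AsciiFallback` and its methods are hypothetical, and the handling of buffers that contain non-ASCII bytes is deliberately simplified.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of CharsetToolkit.
final class AsciiFallback {
    /** True if none of the first `length` bytes has the high bit set (values 128-255). */
    static boolean isPureAscii(byte[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            if (buffer[i] < 0) { // Java bytes are signed: 128-255 appear as negative
                return false;
            }
        }
        return true;
    }

    /** Picks a charset for the buffer, honouring the enforce8Bit flag. */
    static Charset choose(byte[] buffer, int length, boolean enforce8Bit, Charset defaultCharset) {
        if (isPureAscii(buffer, length)) {
            // With the flag set, never report US-ASCII: use the default instead.
            return enforce8Bit ? defaultCharset : StandardCharsets.US_ASCII;
        }
        // Simplification: a real detector would try BOMs and UTF-8 first.
        return defaultCharset;
    }

    public static void main(String[] args) {
        byte[] text = "plain text".getBytes(StandardCharsets.US_ASCII);
        System.out.println(choose(text, text.length, true, StandardCharsets.ISO_8859_1));
    }
}
```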

public BufferedReader getReader()
Gets a BufferedReader (in fact a LineNumberReader) from the File specified in the constructor of CharsetToolkit, using the charset discovered, or the default charset if an 8-bit Charset is encountered.
Returns: a BufferedReader.

public boolean hasUTF16BEBom()
Has a Byte Order Marker for UTF-16 Big Endian (utf-16 and ucs-2).

public boolean hasUTF16LEBom()
Has a Byte Order Marker for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le).

public boolean hasUTF8Bom()
Has a Byte Order Marker for UTF-8 (used by Microsoft's Notepad and other editors).
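The three BOM checks compare the first bytes of the buffer against fixed signatures: EF BB BF for UTF-8, FE FF for UTF-16 big endian, and FF FE for UTF-16 little endian. A minimal sketch of such checks follows; the class `BomSniffer` and its method names are hypothetical and only mirror the methods documented above.

```java
// Hypothetical helper, not part of CharsetToolkit.
final class BomSniffer {
    static boolean hasUtf8Bom(byte[] buffer) {
        return startsWith(buffer, 0xEF, 0xBB, 0xBF);
    }

    static boolean hasUtf16BeBom(byte[] buffer) {
        return startsWith(buffer, 0xFE, 0xFF);
    }

    static boolean hasUtf16LeBom(byte[] buffer) {
        // Note: a UTF-32 little endian BOM (FF FE 00 00) also matches this prefix.
        return startsWith(buffer, 0xFF, 0xFE);
    }

    /** True if `buffer` begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] buffer, int... prefix) {
        if (buffer.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((buffer[i] & 0xFF) != prefix[i]) { // mask to compare as unsigned
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i' };
        System.out.println(hasUtf8Bom(sample));
    }
}
```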

public void setDefaultCharset(Charset defaultCharset)
Defines the default Charset used in case the buffer represents an 8-bit Charset.
defaultCharset - the default Charset to be returned if an 8-bit Charset is encountered.

public void setEnforce8Bit(boolean enforce)
If US-ASCII is recognized, enforce returning the default encoding rather than US-ASCII. The file might currently contain no special characters in the range 128-255, yet be, or later become, a file encoded with the default charset rather than US-ASCII.
enforce - a boolean specifying whether or not to use US-ASCII.

Copyright © 2003-2024 The Apache Software Foundation. All rights reserved.