public class CharsetToolkit
extends java.lang.Object
Unicode files encoded in UTF-16 (little or big endian), or in UTF-8 with a Byte Order Marker (BOM), are correctly discovered. For UTF-8 files with no BOM, the charset should also be discovered if the buffer is wide enough.
A 4 KB byte buffer is read from the file and used to guess the encoding.
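One way to test a BOM-less buffer for UTF-8 is to run it through a strict decoder. The sketch below is an illustration built on the standard `java.nio.charset` API only, not the toolkit's actual implementation; the class and method names (`Utf8Probe`, `looksLikeUtf8`) are hypothetical. Passing `endOfInput = false` keeps a multi-byte sequence that is cut off at the end of the buffer from counting as an error. Pure ASCII input also passes, since ASCII is a subset of UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of CharsetToolkit.
class Utf8Probe {

    // Returns true if the first `length` bytes of `buffer` decode as UTF-8
    // without any malformed sequence.
    static boolean looksLikeUtf8(byte[] buffer, int length) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(buffer, 0, length);
        // UTF-8 never produces more chars than it consumes bytes.
        CharBuffer out = CharBuffer.allocate(Math.max(1, length));
        // endOfInput = false: a sequence truncated by the buffer boundary
        // is reported as underflow rather than as malformed input.
        CoderResult result = decoder.decode(in, out, false);
        return !result.isError();
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9 au lait".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksLikeUtf8(utf8, utf8.length));
        System.out.println(looksLikeUtf8(latin1, latin1.length));
    }
}
```

A wider buffer gives the decoder more chances to meet a byte sequence that is invalid in UTF-8, which is why a short sample is less conclusive.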
Usage:
CharsetToolkit toolkit = new CharsetToolkit(file);
// guess the encoding
Charset guessedCharset = toolkit.getCharset();
// create a reader with the correct charset
BufferedReader reader = toolkit.getReader();
// read the file content
String line;
while ((line = reader.readLine()) != null) {
    System.out.println(line);
}
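For comparison, the read loop above can be reproduced with the standard library alone once a charset is known. This is a minimal sketch under that assumption: it does no detection, and `ReaderSketch`, `open`, and `readAllLines` are hypothetical names used only for illustration. It wraps the stream in a `LineNumberReader`, the reader type the method summary mentions for `getReader()`.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.LineNumberReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of CharsetToolkit.
class ReaderSketch {

    // Wraps a byte stream in a LineNumberReader that decodes with `charset`.
    static BufferedReader open(InputStream in, Charset charset) {
        return new LineNumberReader(new InputStreamReader(in, charset));
    }

    // Reads every line and closes the reader, even if reading fails.
    static List<String> readAllLines(InputStream in, Charset charset) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = open(in, charset)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "premi\u00e8re ligne\nseconde ligne\n".getBytes(StandardCharsets.UTF_8);
        for (String line : readAllLines(new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            System.out.println(line);
        }
    }
}
```

The try-with-resources block closes the reader automatically, which the usage snippet above leaves to the caller.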
| Constructor and Description |
|---|
| CharsetToolkit(java.io.File file): Constructor of the CharsetToolkit utility class. |
| Modifier and Type | Method and Description |
|---|---|
| static java.nio.charset.Charset[] | getAvailableCharsets(): Retrieves all the Charsets available on the platform, including the default charset. |
| java.nio.charset.Charset | getCharset() |
| java.nio.charset.Charset | getDefaultCharset(): Retrieves the default Charset. |
| static java.nio.charset.Charset | getDefaultSystemCharset(): Retrieves the default charset of the system. |
| boolean | getEnforce8Bit(): Gets the enforce8Bit flag, which is set when a US-ASCII encoding should never be returned. |
| java.io.BufferedReader | getReader(): Gets a BufferedReader (in fact a LineNumberReader) for the File specified in the constructor of CharsetToolkit, using the discovered charset, or the default charset if an 8-bit Charset is encountered. |
| boolean | hasUTF16BEBom(): Has a Byte Order Marker for UTF-16 Big Endian (utf-16 and ucs-2). |
| boolean | hasUTF16LEBom(): Has a Byte Order Marker for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le). |
| boolean | hasUTF8Bom(): Has a Byte Order Marker for UTF-8 (used by Microsoft's Notepad and other editors). |
| void | setDefaultCharset(java.nio.charset.Charset defaultCharset): Defines the default Charset used in case the buffer represents an 8-bit Charset. |
| void | setEnforce8Bit(boolean enforce): If US-ASCII is recognized, forces the default encoding to be returned rather than US-ASCII. |
public CharsetToolkit(java.io.File file)
throws java.io.IOException
Constructor of the CharsetToolkit utility class.
Parameters: file - the file whose encoding we want to know.
Throws: java.io.IOException

public void setDefaultCharset(java.nio.charset.Charset defaultCharset)
Defines the default Charset used in case the buffer represents an 8-bit Charset.
Parameters: defaultCharset - the default Charset to be returned if an 8-bit Charset is encountered.

public java.nio.charset.Charset getCharset()
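
public void setEnforce8Bit(boolean enforce)
If US-ASCII is recognized, forces the default charset to be returned rather than US-ASCII.
Parameters: enforce - a boolean specifying whether or not US-ASCII may be returned.

public boolean getEnforce8Bit()
Gets the enforce8Bit flag, which is set when a US-ASCII encoding should never be returned.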
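The effect of the enforce8Bit flag can be pictured as a final substitution step applied to the guessed charset. The sketch below is an assumption about how such a rule could be written, based only on the descriptions on this page; `Enforce8BitSketch` and `resolve` are hypothetical names and are not part of the toolkit.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of CharsetToolkit.
class Enforce8BitSketch {

    // If the guess is US-ASCII and the flag is set, report the default
    // charset instead; otherwise keep the guess unchanged.
    static Charset resolve(Charset guessed, Charset defaultCharset, boolean enforce8Bit) {
        if (enforce8Bit && StandardCharsets.US_ASCII.equals(guessed)) {
            return defaultCharset;
        }
        return guessed;
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        System.out.println(resolve(StandardCharsets.US_ASCII, fallback, true));
        System.out.println(resolve(StandardCharsets.US_ASCII, fallback, false));
        System.out.println(resolve(StandardCharsets.UTF_8, fallback, true));
    }
}
```

A rule like this is useful when a file happens to contain only 7-bit characters today but may later gain characters from the default 8-bit charset.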
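
public java.nio.charset.Charset getDefaultCharset()
Retrieves the default Charset.

public static java.nio.charset.Charset getDefaultSystemCharset()
Retrieves the default charset of the system.
Returns: the default Charset.

public boolean hasUTF8Bom()
Has a Byte Order Marker for UTF-8 (used by Microsoft's Notepad and other editors).

public boolean hasUTF16LEBom()
Has a Byte Order Marker for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le).

public boolean hasUTF16BEBom()
Has a Byte Order Marker for UTF-16 Big Endian (utf-16 and ucs-2).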
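The three BOM checks amount to comparing the first bytes of the buffer against fixed signatures: EF BB BF for UTF-8, FF FE for UTF-16 little endian, and FE FF for UTF-16 big endian. The following is a minimal sketch of that comparison; `BomSketch` and its methods are hypothetical names, and the code is not taken from the toolkit.

```java
// Hypothetical helper, not part of CharsetToolkit.
class BomSketch {

    // True if `buffer` begins with the given unsigned byte values.
    static boolean startsWith(byte[] buffer, int... signature) {
        if (buffer.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            // Mask to compare as unsigned, since Java bytes are signed.
            if ((buffer[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }

    static boolean hasUtf8Bom(byte[] buffer) {
        return startsWith(buffer, 0xEF, 0xBB, 0xBF);
    }

    static boolean hasUtf16LeBom(byte[] buffer) {
        return startsWith(buffer, 0xFF, 0xFE);
    }

    static boolean hasUtf16BeBom(byte[] buffer) {
        return startsWith(buffer, 0xFE, 0xFF);
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(hasUtf8Bom(sample));
        System.out.println(hasUtf16LeBom(sample));
        System.out.println(hasUtf16BeBom(sample));
    }
}
```

Note that FF FE is also the start of the UTF-32 little endian BOM (FF FE 00 00), which this sketch does not try to tell apart.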

public java.io.BufferedReader getReader()
throws java.io.FileNotFoundException
Gets a BufferedReader (in fact a LineNumberReader) for the File specified in the constructor of CharsetToolkit, using the discovered charset, or the default charset if an 8-bit Charset is encountered.
Returns: BufferedReader
Throws: java.io.FileNotFoundException - if the file is not found.

public static java.nio.charset.Charset[] getAvailableCharsets()
Retrieves all the Charsets available on the platform, including the default charset.
Returns: the available Charsets.
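The standard library already exposes both pieces of information that the two static methods describe. As a rough sketch, assuming nothing about the toolkit's own code, the platform charsets and the system default can be obtained as follows; `AvailableCharsetsSketch` and its method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.Collection;

// Hypothetical helper, not part of CharsetToolkit.
class AvailableCharsetsSketch {

    // Charset.availableCharsets() returns a sorted map keyed by canonical
    // name; only the Charset values are needed here.
    static Charset[] availableCharsets() {
        Collection<Charset> all = Charset.availableCharsets().values();
        return all.toArray(new Charset[0]);
    }

    // The default charset of the running JVM.
    static Charset defaultSystemCharset() {
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        System.out.println("Default: " + defaultSystemCharset());
        System.out.println("Available: " + availableCharsets().length);
    }
}
```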