Byte order mark screws up file reading in Java

Question

I'm trying to read CSV files using Java. Some of the files may have a byte order mark at the beginning, but not all. When it is present, the byte order mark gets read along with the rest of the first line, which causes problems with string comparisons.

Is there an easy way to skip the byte order mark when it is present?

Thanks!

Gregory Pakosz · Accepted Answer

EDIT: I've made a proper release on GitHub: https://github.com/gpakosz/UnicodeBOMInputStream

Here is a class I coded a while ago; I just edited the package name before pasting. Nothing special, it is quite similar to the solutions posted in Sun's bug database. Incorporate it in your code and you're fine.

    /* ____________________________________________________________________________
     *
     * File:    UnicodeBOMInputStream.java
     * Author:  Gregory Pakosz.
     * Date:    02 - November - 2005
     * ____________________________________________________________________________
     */
    package com.stackoverflow.answer;

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.PushbackInputStream;

    /**
     * The <code>UnicodeBOMInputStream</code> class wraps any
     * <code>InputStream</code> and detects the presence of any Unicode BOM
     * (Byte Order Mark) at its beginning, as defined by
     * <a href="http://www.faqs.org/rfcs/rfc3629.html">RFC 3629 - UTF-8, a transformation format of ISO 10646</a>
     *
     * <p>The
     * <a href="http://www.unicode.org/unicode/faq/utf_bom.html">Unicode FAQ</a>
     * defines 5 types of BOMs:<ul>
     * <li><pre>00 00 FE FF  = UTF-32, big-endian</pre></li>
     * <li><pre>FF FE 00 00  = UTF-32, little-endian</pre></li>
     * <li><pre>FE FF        = UTF-16, big-endian</pre></li>
     * <li><pre>FF FE        = UTF-16, little-endian</pre></li>
     * <li><pre>EF BB BF     = UTF-8</pre></li>
     * </ul></p>
     *
     * <p>Use the {@link #getBOM()} method to know whether a BOM has been detected
     * or not.
     * </p>
     * <p>Use the {@link #skipBOM()} method to remove the detected BOM from the
     * wrapped <code>InputStream</code> object.</p>
     */
    public class UnicodeBOMInputStream extends InputStream
    {
      /**
       * Type safe enumeration class that describes the different types of Unicode
       * BOMs.
       */
      public static final class BOM
      {
        /**
         * NONE.
         */
        public static final BOM NONE = new BOM(new byte[]{}, "NONE");

        /**
         * UTF-8 BOM (EF BB BF).
         */
        public static final BOM UTF_8 = new BOM(new byte[]{(byte)0xEF,
                                                           (byte)0xBB,
                                                           (byte)0xBF},
                                                "UTF-8");

        /**
         * UTF-16, little-endian (FF FE).
         */
        public static final BOM UTF_16_LE = new BOM(new byte[]{(byte)0xFF,
                                                               (byte)0xFE},
                                                    "UTF-16 little-endian");

        /**
         * UTF-16, big-endian (FE FF).
         */
        public static final BOM UTF_16_BE = new BOM(new byte[]{(byte)0xFE,
                                                               (byte)0xFF},
                                                    "UTF-16 big-endian");

        /**
         * UTF-32, little-endian (FF FE 00 00).
         */
        public static final BOM UTF_32_LE = new BOM(new byte[]{(byte)0xFF,
                                                               (byte)0xFE,
                                                               (byte)0x00,
                                                               (byte)0x00},
                                                    "UTF-32 little-endian");

        /**
         * UTF-32, big-endian (00 00 FE FF).
         */
        public static final BOM UTF_32_BE = new BOM(new byte[]{(byte)0x00,
                                                               (byte)0x00,
                                                               (byte)0xFE,
                                                               (byte)0xFF},
                                                    "UTF-32 big-endian");

        /**
         * Returns a <code>String</code> representation of this <code>BOM</code>
         * value.
         */
        public final String toString()
        {
          return description;
        }

        /**
         * Returns the bytes corresponding to this <code>BOM</code> value.
         */
        public final byte[] getBytes()
        {
          final int    length = bytes.length;
          final byte[] result = new byte[length];

          // Make a defensive copy
          System.arraycopy(bytes, 0, result, 0, length);

          return result;
        }

        private BOM(final byte bom[], final String description)
        {
          assert(bom != null)               : "invalid BOM: null is not allowed";
          assert(description != null)       : "invalid description: null is not allowed";
          assert(description.length() != 0) : "invalid description: empty string is not allowed";

          this.bytes       = bom;
          this.description = description;
        }

                final byte   bytes[];
        private final String description;

      } // BOM

      /**
       * Constructs a new <code>UnicodeBOMInputStream</code> that wraps the
       * specified <code>InputStream</code>.
       *
       * @param inputStream an <code>InputStream</code>.
       *
       * @throws NullPointerException when <code>inputStream</code> is
       * <code>null</code>.
       * @throws IOException on reading from the specified <code>InputStream</code>
       * when trying to detect the Unicode BOM.
       */
      public UnicodeBOMInputStream(final InputStream inputStream) throws NullPointerException,
                                                                         IOException
      {
        if (inputStream == null)
          throw new NullPointerException("invalid input stream: null is not allowed");

        in = new PushbackInputStream(inputStream, 4);

        final byte bom[] = new byte[4];
        final int  read  = in.read(bom);

        // The cases fall through on purpose: when the longer marks do not
        // match, the shorter ones are tried on the same bytes.
        switch(read)
        {
          case 4:
            if ((bom[0] == (byte)0xFF) &&
                (bom[1] == (byte)0xFE) &&
                (bom[2] == (byte)0x00) &&
                (bom[3] == (byte)0x00))
            {
              this.bom = BOM.UTF_32_LE;
              break;
            }
            else
            if ((bom[0] == (byte)0x00) &&
                (bom[1] == (byte)0x00) &&
                (bom[2] == (byte)0xFE) &&
                (bom[3] == (byte)0xFF))
            {
              this.bom = BOM.UTF_32_BE;
              break;
            }
            // fall through

          case 3:
            if ((bom[0] == (byte)0xEF) &&
                (bom[1] == (byte)0xBB) &&
                (bom[2] == (byte)0xBF))
            {
              this.bom = BOM.UTF_8;
              break;
            }
            // fall through

          case 2:
            if ((bom[0] == (byte)0xFF) &&
                (bom[1] == (byte)0xFE))
            {
              this.bom = BOM.UTF_16_LE;
              break;
            }
            else
            if ((bom[0] == (byte)0xFE) &&
                (bom[1] == (byte)0xFF))
            {
              this.bom = BOM.UTF_16_BE;
              break;
            }
            // fall through

          default:
            this.bom = BOM.NONE;
            break;
        }

        // Push everything back; the BOM is only removed by skipBOM().
        if (read > 0)
          in.unread(bom, 0, read);
      }

      /**
       * Returns the <code>BOM</code> that was detected in the wrapped
       * <code>InputStream</code> object.
       *
       * @return a <code>BOM</code> value.
       */
      public final BOM getBOM()
      {
        // BOM type is immutable.
        return bom;
      }

      /**
       * Skips the <code>BOM</code> that was found in the wrapped
       * <code>InputStream</code> object.
       *
       * @return this <code>UnicodeBOMInputStream</code>.
       *
       * @throws IOException when trying to skip the BOM from the wrapped
       * <code>InputStream</code> object.
       */
      public final synchronized UnicodeBOMInputStream skipBOM() throws IOException
      {
        if (!skipped)
        {
          in.skip(bom.bytes.length);
          skipped = true;
        }
        return this;
      }

      /**
       * {@inheritDoc}
       */
      public int read() throws IOException
      {
        return in.read();
      }

      /**
       * {@inheritDoc}
       */
      public int read(final byte b[]) throws IOException,
                                             NullPointerException
      {
        return in.read(b, 0, b.length);
      }

      /**
       * {@inheritDoc}
       */
      public int read(final byte b[],
                      final int off,
                      final int len) throws IOException,
                                            NullPointerException
      {
        return in.read(b, off, len);
      }

      /**
       * {@inheritDoc}
       */
      public long skip(final long n) throws IOException
      {
        return in.skip(n);
      }

      /**
       * {@inheritDoc}
       */
      public int available() throws IOException
      {
        return in.available();
      }

      /**
       * {@inheritDoc}
       */
      public void close() throws IOException
      {
        in.close();
      }

      /**
       * {@inheritDoc}
       */
      public synchronized void mark(final int readlimit)
      {
        in.mark(readlimit);
      }

      /**
       * {@inheritDoc}
       */
      public synchronized void reset() throws IOException
      {
        in.reset();
      }

      /**
       * {@inheritDoc}
       */
      public boolean markSupported()
      {
        return in.markSupported();
      }

      private final PushbackInputStream in;
      private final BOM                 bom;
      private       boolean             skipped = false;

    } // UnicodeBOMInputStream

And you use it this way:

    // Assumed to live in the same package as UnicodeBOMInputStream.
    package com.stackoverflow.answer;

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;

    public final class UnicodeBOMInputStreamUsage
    {
      public static void main(final String[] args) throws Exception
      {
        FileInputStream fis = new FileInputStream("test/offending_bom.txt");
        UnicodeBOMInputStream ubis = new UnicodeBOMInputStream(fis);

        System.out.println("detected BOM: " + ubis.getBOM());

        System.out.print("Reading the content of the file without skipping the BOM: ");
        InputStreamReader isr = new InputStreamReader(ubis);
        BufferedReader br = new BufferedReader(isr);

        System.out.println(br.readLine());

        br.close();
        isr.close();
        ubis.close();
        fis.close();

        fis = new FileInputStream("test/offending_bom.txt");
        ubis = new UnicodeBOMInputStream(fis);
        isr = new InputStreamReader(ubis);
        br = new BufferedReader(isr);

        // Skip the BOM before the first read goes through the reader.
        ubis.skipBOM();

        System.out.print("Reading the content of the file after skipping the BOM: ");
        System.out.println(br.readLine());

        br.close();
        isr.close();
        ubis.close();
        fis.close();
      }

    } // UnicodeBOMInputStreamUsage

rescdsk · Answer

The Apache Commons IO library has an InputStream that can detect and discard BOMs: BOMInputStream (javadoc):

    BOMInputStream bomIn = new BOMInputStream(in);
    int firstNonBOMByte = bomIn.read(); // Skips BOM
    if (bomIn.hasBOM()) {
        // has a UTF-8 BOM
    }

If you also need to detect different encodings, it can distinguish among various byte order marks, e.g. UTF-8 vs. UTF-16 big- and little-endian; details are at the doc link above. You can then use the detected ByteOrderMark to choose a Charset to decode the stream. (There is probably a more streamlined way to do this if you need all of this functionality; maybe the UnicodeReader in BalusC's answer?) Note that, in general, there is no very good way to detect which encoding some bytes are in, but if the stream starts with a BOM, that can apparently help.
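To make the "choose a Charset from the detected mark" step concrete, here is a minimal sketch that uses only the JDK rather than Commons IO. The class BomCharsets and its method fromBom are hypothetical names made up for this illustration, and the sketch assumes the running JVM provides the UTF-32 charsets. The 4-byte marks are tested first because FF FE 00 00 begins with the UTF-16LE mark.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of Commons IO): maps leading BOM bytes to a Charset. */
final class BomCharsets {
    private BomCharsets() {}

    /**
     * Returns the charset implied by the BOM at the start of head,
     * or fallback when no BOM is recognised.
     */
    static Charset fromBom(byte[] head, Charset fallback) {
        // UTF-32 first: FF FE 00 00 (UTF-32LE) starts with FF FE (UTF-16LE).
        // Assumption: the JVM ships the UTF-32 charsets; forName throws otherwise.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return Charset.forName("UTF-32BE");
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return Charset.forName("UTF-32LE");
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(head, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(head, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        return fallback;
    }

    /** True when data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'i'};
        System.out.println(fromBom(head, StandardCharsets.ISO_8859_1));
    }
}
```

You would feed it the first four bytes of the stream and pass the result to InputStreamReader. One trade-off of this ordering: UTF-16LE text whose first character after the BOM is U+0000 would be taken for UTF-32LE.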

Edit: If you need to detect the BOM in UTF-16, UTF-32, etc., then the constructor should be:

new BOMInputStream(is, ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE)

Upvote @martin-charlesworth's comment :)

user1092126 · Answer

Simpler solution:

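    import java.io.IOException;
    import java.io.Reader;

    public class BOMSkipper {
        // The reader must support mark/reset (a BufferedReader does).
        public static void skip(Reader reader) throws IOException {
            reader.mark(1);
            char[] possibleBOM = new char[1];
            reader.read(possibleBOM);

            // Not a BOM: rewind so the caller reads this character again.
            if (possibleBOM[0] != '\ufeff') {
                reader.reset();
            }
        }
    }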

Usage example:

    BufferedReader input = new BufferedReader(
            new InputStreamReader(new FileInputStream(file), fileExpectedCharset));

    BOMSkipper.skip(input);
    // Now the UTF prefix is not present:
    input.readLine();
    ...

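It works with all 5 UTF encodings, provided the reader is created with the right charset: a BOM that survives decoding is always the single character \ufeff, whatever bytes encoded it.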
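As a rough check of that claim, the sketch below applies the same mark/read/reset idea to in-memory data in three encodings. BomSkipDemo and firstLine are hypothetical names for this illustration, and the sample CSV text is invented. The reset when the first character is not a BOM is what keeps the approach safe even when a decoder has already consumed the mark.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class: skips a decoded BOM character, then reads the first line. */
final class BomSkipDemo {
    private BomSkipDemo() {}

    /** Decodes bytes with charset and returns the first line without a leading BOM. */
    static String firstLine(byte[] bytes, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), charset))) {
            reader.mark(1);              // remember the start of the stream
            int first = reader.read();   // one decoded character, or -1 on empty input
            if (first != '\uFEFF') {
                reader.reset();          // not a BOM: put the character back
            }
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Invented sample data: a BOM followed by two CSV lines.
        String text = "\uFEFFid,name\n1,Ada\n";
        Charset[] charsets = {
            StandardCharsets.UTF_8, StandardCharsets.UTF_16LE, StandardCharsets.UTF_16BE
        };
        for (Charset cs : charsets) {
            System.out.println(cs + " -> " + firstLine(text.getBytes(cs), cs));
        }
    }
}
```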

BalusC · Answer

The Google Data API has a UnicodeReader which automatically detects the encoding.

You can use it instead of InputStreamReader. Here is a slightly compacted extract of its source, which is pretty straightforward:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.PushbackInputStream;
    import java.io.Reader;

    public class UnicodeReader extends Reader {
        private static final int BOM_SIZE = 4;
        private final InputStreamReader reader;

        /**
         * Construct UnicodeReader
         * @param in Input stream.
         * @param defaultEncoding Default encoding to be used if BOM is not found,
         * or <code>null</code> to use system default encoding.
         * @throws IOException If an I/O error occurs.
         */
        public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
            byte bom[] = new byte[BOM_SIZE];
            String encoding;
            int unread;
            PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
            int n = pushbackStream.read(bom, 0, bom.length);

            // Read ahead four bytes and check for BOM marks.
            // Note: FF FE is tested before FF FE 00 00, so with this ordering a
            // UTF-32LE BOM is reported as UTF-16LE.
            if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
                encoding = "UTF-8";
                unread = n - 3;
            } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
                encoding = "UTF-16BE";
                unread = n - 2;
            } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
                encoding = "UTF-16LE";
                unread = n - 2;
            } else if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00)
                    && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
                encoding = "UTF-32BE";
                unread = n - 4;
            } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)
                    && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
                encoding = "UTF-32LE";
                unread = n - 4;
            } else {
                encoding = defaultEncoding;
                unread = n;
            }

            // Unread bytes if necessary and skip BOM marks.
            if (unread > 0) {
                pushbackStream.unread(bom, (n - unread), unread);
            } else if (unread < -1) {
                pushbackStream.unread(bom, 0, 0);
            }

            // Use given encoding.
            if (encoding == null) {
                reader = new InputStreamReader(pushbackStream);
            } else {
                reader = new InputStreamReader(pushbackStream, encoding);
            }
        }

        public String getEncoding() {
            return reader.getEncoding();
        }

        public int read(char[] cbuf, int off, int len) throws IOException {
            return reader.read(cbuf, off, len);
        }

        public void close() throws IOException {
            reader.close();
        }
    }

Kevin Meredith · Answer

Apache Commons IO's BOMInputStream has already been mentioned by @rescdsk, but I did not see it mentioned how to get an InputStream without the BOM.

Here's how I did it in Scala.

    import java.io._
    import org.apache.commons.io.input.BOMInputStream

    val file = new File(path_to_xml_file_with_BOM)
    val fileInpStream = new FileInputStream(file)
    val bomIn = new BOMInputStream(fileInpStream, false) // false means don't include BOM

Brian Agnew · Answer

Unfortunately not. You'll have to identify and skip it yourself. This page details what you have to watch for. Also see this SO question for more details.
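For the common UTF-8 case, identifying and skipping the mark by hand can be sketched with the JDK alone. Utf8BomStripper and skipUtf8Bom are hypothetical names for this illustration; the sketch looks only for EF BB BF and leaves UTF-16 and UTF-32 marks untouched.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

/** Hypothetical helper: drops a leading UTF-8 BOM (EF BB BF) from a byte stream. */
final class Utf8BomStripper {
    private Utf8BomStripper() {}

    /** Returns a stream positioned just past a leading UTF-8 BOM, if one is present. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];

        // read() may return fewer bytes than requested, so loop until 3 bytes or EOF.
        int n = 0;
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }

        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;

        // No BOM: push back whatever was read so the caller sees every byte.
        if (!hasBom && n > 0) {
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'i', 'd'};
        InputStream in = skipUtf8Bom(new ByteArrayInputStream(data));
        System.out.println((char) in.read());
    }
}
```

The returned stream can be wrapped in an InputStreamReader with UTF-8 as usual.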

Andreas Baaserud · Answer

To simply remove the BOM characters from your file, I recommend using Apache Commons IO:

    public BOMInputStream(InputStream delegate, boolean include)

    Constructs a new BOM InputStream that detects a ByteOrderMark.UTF_8 and optionally includes it.

    Parameters:
        delegate - the InputStream to delegate to
        include - true to include the UTF-8 BOM or false to exclude it

Set include to false and your BOM characters will be excluded.

Amy B Higgins · Answer

I had the same problem, and because I wasn't reading in a bunch of files, I went with a simpler solution. I think my encoding was UTF-8, because when I printed out the offending character using this page: Get unicode value of a character, I found that it was \ufeff. I used the code System.out.println("\\u" + Integer.toHexString(str.charAt(0) | 0x10000).substring(1)); to print out the offending unicode value.

Once I had the offending unicode value, I replaced it in the first line of my file before I went on reading. The business logic of that section:

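    // Remove the BOM before trimming: trim() does not treat \ufeff as whitespace,
    // so trimming first would leave any spaces that follow the BOM in place.
    String str = reader.readLine().replace("\ufeff", "").trim();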

This fixed my problem. Then I was able to go on processing the file with no issue. I added trim() just in case of leading or trailing whitespace; you can do that or not, based on your specific needs.
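A slightly more defensive variant of the same idea, as a sketch: it removes \ufeff only when it is the very first character, tolerates the null that readLine() returns on an empty file, and trims afterwards. FirstLineCleaner and clean are hypothetical names for this illustration.

```java
/** Hypothetical helper: cleans the first line read from a file that may start with a BOM. */
final class FirstLineCleaner {
    private FirstLineCleaner() {}

    /** Removes one leading U+FEFF, then trims; returns null for null input (end of stream). */
    static String clean(String line) {
        if (line == null) {
            return null;
        }
        if (!line.isEmpty() && line.charAt(0) == '\uFEFF') {
            line = line.substring(1);   // drop the BOM only at the start of the line
        }
        return line.trim();
    }

    public static void main(String[] args) {
        System.out.println("[" + clean("\uFEFF id,name ") + "]");
    }
}
```

Usage would look like String str = FirstLineCleaner.clean(reader.readLine()); for the first line only. Unlike replace, this leaves any \ufeff that appears later in the line untouched.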