How do I convert a UTF-8 string to Unicode?

Question

I have a string that displays UTF-8 encoded characters, and I want to convert it back to Unicode.

For now, my implementation is the following:

    public static string DecodeFromUtf8(this string utf8String)
    {
        // read the string as UTF-8 bytes.
        byte[] encodedBytes = Encoding.UTF8.GetBytes(utf8String);

        // convert them into unicode bytes.
        byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, encodedBytes);

        // builds the converted string.
        return Encoding.Unicode.GetString(encodedBytes);
    }

I am testing with the word "déjà". I converted it to UTF-8 using this online tool, and so I started testing my method with the string "dÃ©jÃ".

Unfortunately, with this implementation the string just stays the same.

Where am I going wrong?

bames53 · Accepted Answer

So the issue is that UTF-8 code unit values have been stored as a sequence of 16-bit code units in a C# string. You simply need to verify that each code unit is within the range of a byte, copy those values into bytes, and then convert the new UTF-8 byte sequence into UTF-16.

    public static string DecodeFromUtf8(this string utf8String)
    {
        // copy the string as UTF-8 bytes.
        byte[] utf8Bytes = new byte[utf8String.Length];
        for (int i = 0; i < utf8String.Length; ++i)
        {
            //Debug.Assert( 0 <= utf8String[i] && utf8String[i] <= 255, "the char must be in byte's range");
            utf8Bytes[i] = (byte)utf8String[i];
        }

        return Encoding.UTF8.GetString(utf8Bytes, 0, utf8Bytes.Length);
    }

    DecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0"); // déjà
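The range check that is commented out above matters: a plain `(byte)` cast silently truncates any code unit above 255, so a character such as U+2013 would turn into the unrelated byte 0x13. A minimal sketch of a checked variant follows; the class name `Utf8Repair` and the method `TryDecodeFromUtf8` are hypothetical names chosen for this illustration, not part of any library.

```csharp
using System;
using System.Text;

public static class Utf8Repair
{
    // Hypothetical helper: reports failure instead of truncating
    // when a code unit does not fit in a single byte.
    public static bool TryDecodeFromUtf8(string mangled, out string decoded)
    {
        byte[] utf8Bytes = new byte[mangled.Length];
        for (int i = 0; i < mangled.Length; ++i)
        {
            if (mangled[i] > 0xFF)
            {
                // Not a byte value, so this string was not produced by
                // storing one UTF-8 byte per 16-bit code unit.
                decoded = string.Empty;
                return false;
            }
            utf8Bytes[i] = (byte)mangled[i];
        }

        decoded = Encoding.UTF8.GetString(utf8Bytes);
        return true;
    }
}
```

With this sketch, `Utf8Repair.TryDecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0", out string s)` succeeds and yields "déjà", while an input containing `\u2013` is rejected.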

This is easy, but it would be better to find the root cause: the place where someone is copying UTF-8 code units into 16-bit code units. The likely culprit is somebody converting bytes into a C# string using the wrong encoding, e.g. Encoding.Default.GetString(utf8Bytes, 0, utf8Bytes.Length).
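To see how that kind of mistake produces the mangled string, here is a small sketch that encodes text as UTF-8 and then decodes the bytes with a single-byte encoding. ISO-8859-1 is an assumption made for the illustration, because it maps every byte to the code point of the same value; the encoding actually involved in your case may be different (for example the system ANSI code page). `MojibakeDemo` and `DecodeWithWrongEncoding` are hypothetical names.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Reproduces the bug on purpose: UTF-8 bytes decoded as ISO-8859-1.
    public static string DecodeWithWrongEncoding(string original)
    {
        // Correct step: the text becomes a UTF-8 byte sequence.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);

        // Wrong step: each byte is read as one character.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        return latin1.GetString(utf8Bytes);
    }
}
```

Calling `MojibakeDemo.DecodeWithWrongEncoding("d\u00E9j\u00E0")` gives back `"d\u00C3\u00A9j\u00C3\u00A0"`, the same code units that are passed to DecodeFromUtf8 above.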

Alternatively, if you are sure you know which incorrect encoding was used to produce the string, and the incorrect transformation was lossless (usually the case when the incorrect encoding is a single-byte encoding), you can simply apply the inverse encoding step to recover the original UTF-8 data, and then do the correct conversion from the UTF-8 bytes:

    public static string UndoEncodingMistake(string mangledString, Encoding mistake, Encoding correction)
    {
        // the inverse of `mistake.GetString(originalBytes);`
        byte[] originalBytes = mistake.GetBytes(mangledString);
        return correction.GetString(originalBytes);
    }

    // On .NET Core / .NET 5+, code page 1252 is only available after
    // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
    UndoEncodingMistake("d\u00C3\u00A9j\u00C3\u00A0", Encoding.GetEncoding(1252), Encoding.UTF8);
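Since this only works when the transformation was lossless, it can help to make the repair fail loudly instead of quietly substituting '?' or U+FFFD. The following is a sketch of a strict variant, not a definitive implementation: `StrictRepair` and `UndoEncodingMistakeStrict` are hypothetical names, and the encoding is looked up by name so that exception fallbacks can be attached to it.

```csharp
using System;
using System.Text;

public static class StrictRepair
{
    // Hypothetical strict variant: throws instead of substituting
    // replacement characters when the round trip is not clean.
    public static string UndoEncodingMistakeStrict(string mangledString, string mistakeName)
    {
        // Re-encoding throws EncoderFallbackException if the string holds
        // a character the mistaken encoding cannot represent.
        Encoding mistake = Encoding.GetEncoding(
            mistakeName,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);

        // No BOM, and throw DecoderFallbackException on invalid UTF-8.
        Encoding strictUtf8 = new UTF8Encoding(false, true);

        byte[] originalBytes = mistake.GetBytes(mangledString);
        return strictUtf8.GetString(originalBytes);
    }
}
```

For example, `StrictRepair.UndoEncodingMistakeStrict("d\u00C3\u00A9j\u00C3\u00A0", "ISO-8859-1")` returns "déjà", whereas passing a string that was never mangled, such as "déjà" itself, throws because its bytes are not valid UTF-8.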

MEN · Answer

If you have a UTF-8 string where every byte is correct ('Ö' -> [195, 0], [150, 0]), you can use the following:

    public static string Utf8ToUtf16(string utf8String)
    {
        /*
         * Every .NET string stores text in UTF-16 (Encoding.Unicode).
         * Other encodings can exist only as a byte array, or incorrectly
         * stored inside a UTF-16 string.
         *
         * UTF-8 = 1 to 4 bytes per character
         *     [100] for the ASCII 'd'
         *     [206, 186] for the Greek 'κ' (U+03BA)
         *
         * UTF-16 = 2 or 4 bytes per character
         *     [100, 0] for the ASCII 'd'
         *     [186, 3] for the Greek 'κ'
         *
         * UTF-8 inside UTF-16
         *     [100, 0] for the ASCII 'd'
         *     [206, 0] and [186, 0] for the Greek 'κ'
         *
         * First we collect the UTF-8 bytes, dropping every 0 byte (filler)
         * along the way. In C-style APIs a single 0 byte ends a UTF-8
         * string, while UTF-16 needs two 0 bytes; .NET handles this for us.
         *
         * Then we decode the collected bytes as UTF-8 to get a proper
         * UTF-16 string.
         */

        // Get UTF-8 bytes and remove binary 0 bytes (filler).
        // The foreach explicitly converts each UTF-16 code unit to a byte.
        List<byte> utf8Bytes = new List<byte>(utf8String.Length);
        foreach (byte utf8Byte in utf8String)
        {
            // Remove binary 0 bytes (filler)
            if (utf8Byte > 0)
            {
                utf8Bytes.Add(utf8Byte);
            }
        }

        // Convert UTF-8 bytes to UTF-16 string
        return Encoding.UTF8.GetString(utf8Bytes.ToArray());
    }

In my case the DLL result is also a UTF-8 string, but unfortunately the UTF-8 string is interpreted with UTF-16 encoding ('Ö' -> [195, 0], [19, 32]). So the ANSI '–' (150) was converted to the UTF-16 '–' (8211). If this is your case too, you can use the following instead:

    public static string Utf8ToUtf16(string utf8String)
    {
        // Get UTF-8 bytes by reading each byte with ANSI encoding.
        // Note: Encoding.Default is the system ANSI code page on .NET Framework,
        // but it is always UTF-8 on .NET Core / .NET 5+.
        byte[] utf8Bytes = Encoding.Default.GetBytes(utf8String);

        // Convert UTF-8 bytes to UTF-16 bytes
        byte[] utf16Bytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);

        // Return UTF-16 bytes as UTF-16 string
        return Encoding.Unicode.GetString(utf16Bytes);
    }

Or the native method:

    [DllImport("kernel32.dll")]
    private static extern Int32 MultiByteToWideChar(
        UInt32 CodePage,
        UInt32 dwFlags,
        [MarshalAs(UnmanagedType.LPStr)] String lpMultiByteStr,
        Int32 cbMultiByte,
        [Out, MarshalAs(UnmanagedType.LPWStr)] StringBuilder lpWideCharStr,
        Int32 cchWideChar);

    public static string Utf8ToUtf16(string utf8String)
    {
        Int32 iNewDataLen = MultiByteToWideChar(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf8String, -1, null, 0);
        if (iNewDataLen > 1)
        {
            StringBuilder utf16String = new StringBuilder(iNewDataLen);
            MultiByteToWideChar(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf8String, -1, utf16String, utf16String.Capacity);

            return utf16String.ToString();
        }
        else
        {
            return String.Empty;
        }
    }

If you need it the other way around, see Utf16ToUtf8. I hope this helps.

Hans Passant · Answer

"I have a string that displays UTF-8 encoded characters"

There is no such thing in .NET. The string class can only store strings in the UTF-16 encoding. A UTF-8 encoded string can only exist as a byte[]. Trying to store bytes in a string will not end well; UTF-8 uses byte values that do not have a valid Unicode code point. The content will be destroyed when the string is normalized. So it is already too late to recover the string by the time your DecodeFromUtf8() starts running.

Only handle UTF-8 encoded text with byte[]. And use UTF8Encoding.GetString() to convert it.
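As a rough sketch of that advice, the UTF-8 form of the text stays in a byte[] from end to end, and a string only appears once the bytes have been decoded. `Utf8Bytes`, `ToUtf8` and `FromUtf8` are hypothetical names; `Encoding.UTF8` is the built-in UTF8Encoding instance.

```csharp
using System;
using System.Text;

public static class Utf8Bytes
{
    // string (UTF-16 in memory) -> UTF-8 byte sequence
    public static byte[] ToUtf8(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }

    // UTF-8 byte sequence -> string (UTF-16 in memory)
    public static string FromUtf8(byte[] utf8)
    {
        return Encoding.UTF8.GetString(utf8);
    }
}
```

For "déjà", `ToUtf8` produces six bytes (0x64 0xC3 0xA9 0x6A 0xC3 0xA0), and `FromUtf8` turns them back into the four-character string.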

Mark Tolonen · Answer

What you have seems to be a string incorrectly decoded from another encoding, likely code page 1252, which is the US Windows default. Here is how to reverse it, assuming no other loss. One loss that is not immediately apparent is the non-breaking space (U+00A0) at the end of your string, which is not displayed. Of course it would be better to read the data source correctly in the first place, but perhaps the data source was stored incorrectly to begin with.

    using System;
    using System.Text;

    class Program
    {
        static void Main(string[] args)
        {
            string junk = "dÃ©jÃ\xa0";  // Bad Unicode string

            // Turn string back to bytes using the original, incorrect encoding.
            byte[] bytes = Encoding.GetEncoding(1252).GetBytes(junk);

            // Use the correct encoding this time to convert back to a string.
            string good = Encoding.UTF8.GetString(bytes);
            Console.WriteLine(good);
        }
    }

Output:

déjà
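Because the trailing U+00A0 is invisible when the mangled string is printed, one way to check whether it survived a copy and paste is to list the code units of the string. This is only a diagnostic sketch; `CodeUnitDump` and `Dump` are hypothetical names.

```csharp
using System;
using System.Text;

public static class CodeUnitDump
{
    // Lists every UTF-16 code unit of the string as U+XXXX,
    // so invisible characters such as U+00A0 show up.
    public static string Dump(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text)
        {
            if (sb.Length > 0)
            {
                sb.Append(' ');
            }
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }
}
```

For the intact mangled string, `CodeUnitDump.Dump("d\u00C3\u00A9j\u00C3\u00A0")` returns `U+0064 U+00C3 U+00A9 U+006A U+00C3 U+00A0`. If the last entry is missing, the non-breaking space was lost and the final "à" cannot be recovered.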