By Xueming ShenThe UTF-8 charset implementation, which is available in all JDK/JRE releases from Sun, has been updated recently to reject non-shortest-form UTF-8 byte sequences. This is because the old implementation might be leveraged in security attacks. Since then I have been asked many times about what this “non-shortest-form” issue is and what the possible impact might be, so here are some answers.The first question usually goes: “What is the non-shortest-form issue”? The detailed and official answer is at Unicode Corrigendum #1: UTF-8 Shortest Form. Simply put, the problem is that Unicode characters can be represented in more than one way (form) in the “UTF-8 encoding” than many people think or believe. When asked what UTF-8 encoding looks like, the simplest explanation would be the following bit pattern: # Bits Bit pattern 1 7 0xxxxxxx 2 11 110xxxxx 10xxxxxx 3 16 1110xxxx 10xxxxxx 10xxxxxx 4 21 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx The pattern is close, but it’s actually wrong, based on the latest definition of UTF-8. The preceding pattern has a loophole in that you can actually have more than one form represent a Unicode character.For ASCII characters from u+0000 to u+007f, for example, the UTF-8 encoding form maintains transparency for all of them, so they keep their ASCII code values of 0x00..0x7f (in one-byte form) in UTF-8. Based on the preceding pattern, however, these characters can also be represented in 2-bytes form as [c0, 80]..[c1, bf], the “non-shortest-form”. The following code shows all of the non-shortest-2-bytes-form for these ASCII characters, if you run code against the “old” version of the JDK and JRE (Java Runtime Environment). <pre class="prettyprint"><code> byte[] bb = new byte[2]; for (int b1 = 0xc0; b1 < 0xc2; b1++) { for (int b2 = 0x80; b2 < 0xc0; b2++) { bb[0] = (byte)b1; bb[1] = (byte)b2; String cstr = new String(bb, "UTF8"); char c = cstr.toCharArray()[0]; System.out.printf("[%02x, %02x] -> U+%04x [%s]%n", b1, b2, c & 0xffff, (c>=0x20)?cstr:"ctrl"); } } The output would be as follows: <pre class="prettyprint"><code> ... [c0, a0] -> U+0020 [ ] [c0, a1] -> U+0021 [!] ... [c0, b6] -> U+0036 [6] [c0, b7] -> U+0037 [7] [c0, b8] -> U+0038 [8] [c0, b9] -> U+0039 [9] ... [c1, 80] -> U+0040 [@] [c1, 81] -> U+0041 [A] [c1, 82] -> U+0042 [B] [c1, 83] -> U+0043 [C] [c1, 84] -> U+0044 [D] ... So, for a string like “ABC”, you would have two forms of UTF-8 sequences: <pre class="prettyprint"><code> "0x41 0x42 0x43" and "0xc1 0x81 0xc1 0x82 0xc1 0x83" The Unicode Corrigendum #1: UTF-8 Shortest Form specifies explicitly that “The definition of each UTF specifies the illegal code unit sequences in that UTF. For example, the definition of UTF-8 (D36) specifies that code unit sequences such as [C0, AF] are illegal.”Our old implementation accepts those non-shortest-form (while it never generates them when encoding). The new UTF_8 charset now rejects the non-shortest-form byte sequences for all BMP characters. Only the “legal byte sequences” listed below are accepted. <pre class="prettyprint"><code> /* Legal UTF-8 Byte Sequences * * # Code Points Bits Bit/Byte pattern * 1 7 0xxxxxxx * U+0000..U+007F 00..7F * 2 11 110xxxxx 10xxxxxx * U+0080..U+07FF C2..DF 80..BF * 3 16 1110xxxx 10xxxxxx 10xxxxxx * U+0800..U+0FFF E0 A0..BF 80..BF * U+1000..U+FFFF E1..EF 80..BF 80..BF * 4 21 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx * U+10000..U+3FFFF F0 90..BF 80..BF 80..BF * U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF * U+100000..U10FFFF F4 80..8F 80..BF 80..BF */ The next question usually is: “What would be the issue/problem if we keep using the old version of JDK/JRE?” First, I’m not a lawyer — oops, I meant I’m not a security expert:-) — so my word does not count. We consulted with our security experts instead. Their conclusion is that while “it is not a security vulnerability in Java SE per se, it may be leveraged to attack systems running software that relies on the UTF-8 charset to reject these non-shortest form of UTF-8 sequences”.A simple scenario that might give you an idea about what the above “may be leveraged to attack…” really means: A Java application would like to filter the incoming UTF-8 input stream to reject certain key words, for example “ABC”. Instead of decoding the input UTF-8 byte sequences into Java char representation and then filter out the keyword string “ABC” at Java “char” level, for example: <pre class="prettyprint"><code> String utfStr = new String(bytes, "UTF-8"); if ("ABC".equals(strUTF)) { ... } The application might choose to filter the raw UTF-8 byte sequences “0x41 0x42 0x43” (only) directly against the UTF-8 byte input stream and then rely on (assume) the Java UTF-8 charset to reject any other non-shortest-form of the target keyword, if there is any. The consequence is the non-shortest form input “0xc1 0x81 0xc1 0x82 0xc1 0x83” will penetrate the filter and trigger a possible security vulnerability, if the underlying JDK/JRE runtime is an OLD version.So the recommendation is: Update to the latest JDK/JRE releases to avoid the potential risk. Wait, there is another big bonus for updating: performance.The UTF-8 charset implementation has not been updated or touched for years. UTF-8 encoding is very widely used as the default encoding for XML, and more and more websites use UTF-8 as their page encoding. Given that fact, we have taken the defensive position of “don’t change it if it works” during the past years.So Martin and I decided to take this opportunity to give it a speed boost as well. The following data is from one of my benchmark’s run data, which compares the decoding/encoding operations of new implementation and old implementation under -server vm. (This is not an official benchmark: it is provided only to give a rough idea of the performance boost.) The new implementation is much faster, especially when decoding or encoding single bytes (those ASCIIs). The new decoding and encoding are faster under -client vm as well, but the gap is not as big as in -server vm. I wanted to show you the best:-)<pre class="prettyprint"><code> Method Millis Millis(OLD) Decoding 1b UTF-8 : 1786 12689 Decoding 2b UTF-8 : 21061 30769 Decoding 3b UTF-8 : 23412 44256 Decoding 4b UTF-8 : 30732 35909 Decoding 1b (direct)UTF-8 : 16015 22352 Decoding 2b (direct)UTF-8 : 63813 82686 Decoding 3b (direct)UTF-8 : 89999 111579 Decoding 4b (direct)UTF-8 : 73126 60366 Encoding 1b UTF-8 : 2528 12713 Encoding 2b UTF-8 : 14372 33246 Encoding 3b UTF-8 : 25734 26000 Encoding 4b UTF-8 : 23293 31629 Encoding 1b (direct)UTF-8 : 18776 19883 Encoding 2b (direct)UTF-8 : 50309 59327 Encoding 3b (direct)UTF-8 : 77006 74286 Encoding 4b (direct)UTF-8 : 61626 66517 The new UTF-8 charset implementation has been integrated in JDK7, Open JDK 6, JDK 6 update 11 and later, JDK5.0u17, and 1.4.2_19.If you are interested in what the change looks like, you can take a peek at the webrev of the new UTF_8.java for OpenJDK7. Xueming Shen is an engineer at Sun Microsystems, working in the Java core technologies group. Java