Java String length confusion
Facts and Terminology
As you probably know, Java uses UTF-16 to represent Strings. To understand the confusion about String.length(), you need to be familiar with a few encoding and Unicode terms.
Code Point: A unique integer value which represents a character in the code space.
Code Unit: A bit sequence used to encode characters (Code Points). One or more Code Units may be required to represent a Code Point.
UTF-16
Unicode Code Points are logically divided into 17 planes. The first plane, the Basic Multilingual Plane (BMP), contains the “classic” characters (from U+0000 to U+FFFF). The other 16 planes contain the supplementary characters (from U+10000 to U+10FFFF).
Characters (Code Points) from the first plane are encoded in one 16-bit Code Unit with the same value. Supplementary characters (Code Points) are encoded in two Code Units, a so-called surrogate pair (how the pair is computed is specific to UTF-16, see Wiki: UTF-16 for the explanation).
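To make the two-Code-Unit encoding less abstract, here is a minimal sketch of the surrogate pair arithmetic. The class and method names (SurrogatePairSketch, encode) are made up for illustration, and the sketch does not treat the reserved surrogate range (U+D800 to U+DFFF) specially. In real code you would simply call Character.toChars(int), which does the same job.

```java
// Illustrative sketch: encode a Unicode code point into UTF-16 code units by hand.
class SurrogatePairSketch {

    // Returns one code unit for BMP code points, two (a surrogate pair) otherwise.
    static char[] encode(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("Not a Unicode code point: " + codePoint);
        }
        if (codePoint < 0x10000) {
            // BMP: a single code unit with the same value as the code point
            return new char[] { (char) codePoint };
        }
        // Supplementary: subtract 0x10000 to get a 20-bit value, then split it
        int offset = codePoint - 0x10000;
        char high = (char) (0xD800 + (offset >> 10));   // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // lower 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        // Print the code units of U+1D538 in hexadecimal
        for (char unit : encode(0x1D538)) {
            System.out.printf("%04X ", (int) unit);
        }
        System.out.println();
    }
}
```

For U+1D538 the offset is 0xD538, whose upper 10 bits are 0x35 and lower 10 bits are 0x138, which gives the code units D835 and DD38.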
Example
Character: A
Unicode Code Point: U+0041
UTF-16 Code Unit(s): 0041
Character: Mathematical double-struck capital A
Unicode Code Point: U+1D538
UTF-16 Code Unit(s): D835 DD38
As you can see here, there are characters which are encoded in two Code Units.
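If you want to check the two examples above yourself, the following sketch uses only methods of java.lang.Character. The class name CodeUnitCountSketch and the helper codeUnitsHex are made up for this post.

```java
// Illustrative sketch: inspect how many UTF-16 code units a code point needs.
class CodeUnitCountSketch {

    // Formats the UTF-16 code units of a code point as hex, separated by spaces.
    static String codeUnitsHex(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'A' and Mathematical double-struck capital A
        int[] examples = { 0x0041, 0x1D538 };
        for (int cp : examples) {
            System.out.printf("U+%04X -> %s (%d code unit(s), supplementary: %b)%n",
                    cp, codeUnitsHex(cp), Character.charCount(cp),
                    Character.isSupplementaryCodePoint(cp));
        }
    }
}
```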
String.length()
Let’s take a look at the Javadoc of the length() method:
public int length()
Returns the length of this string. The length is equal to the number of Unicode code units in the string.
So if a string contains a single supplementary character, which consists of two code units, the length of that string is two.
// Mathematical double-struck capital A
String str = "\uD835\uDD38";
System.out.println(str);
System.out.println(str.length()); // prints 2
This is correct according to the documentation, but it may not be what you expect.
~Solution
You need to count the code points, not the code units:
// Mathematical double-struck capital A
String str = "\uD835\uDD38";
System.out.println(str);
System.out.println(str.codePointCount(0, str.length())); // prints 1
See: codePointCount(int beginIndex, int endIndex)
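If you are on Java 8 or later, String.codePoints() is another way to get at the code points: it returns an IntStream, so you can count them or walk through them one by one. A minimal sketch follows. The class name CodePointIterationSketch and the helper countCodePoints are made up for illustration.

```java
// Illustrative sketch: count and iterate code points with String.codePoints() (Java 8+).
class CodePointIterationSketch {

    // Counts code points instead of UTF-16 code units.
    static long countCodePoints(String s) {
        return s.codePoints().count();
    }

    public static void main(String[] args) {
        // 'A' followed by Mathematical double-struck capital A
        String str = "A\uD835\uDD38";

        System.out.println(str.length());          // number of code units
        System.out.println(countCodePoints(str));  // number of code points

        // Print each code point in U+XXXX notation
        str.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

For this string, length() reports three code units while the code point count is two.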
References/Sources
- The Java Language Specification
- Unicode Glossary: Code Point
- Wiki: Code Point
- Unicode Glossary: Code Unit
- Wiki: Code Unit
- Wiki: Unicode
- Wiki: UTF-16
- Supplementary Characters in the Java Platform
- Wiki: Unicode Planes