New Methods on Java Strings With JDK 11
It appears likely that Java's String class will be gaining some new methods with JDK 11, expected to be released in September 2018.
| BUG # | BUG TITLE | NEW String METHOD | DESCRIPTION |
|---|---|---|---|
| JDK-8200425 | String::lines | lines() | "String instance method that uses a specialized Spliterator to lazily provide lines from the source string." |
| JDK-8200378 | String::strip, String::stripLeading, String::stripTrailing | strip() | "Unicode-aware" evolution of trim() |
| | | stripLeading() | "removal of Unicode white space from the beginning" |
| | | stripTrailing() | "removal of Unicode white space from the ... end" |
| JDK-8200437 | String::isBlank | isBlank() | "instance method that returns true if the string is empty or contains only white space" |
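To make the table concrete, here is a minimal sketch of how the five methods might be called, assuming they ship in JDK 11 with the names and behavior described in the bug reports above. The class name `NewStringMethodsDemo` and the helper `toLines` are hypothetical names chosen for this illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class; requires a JDK that includes the new String methods (JDK 11+).
class NewStringMethodsDemo {

    // Collects the lazily provided lines of the text into a list.
    static List<String> toLines(String text) {
        return text.lines().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String padded = "  JDK 11  ";

        // Brackets make the remaining white space visible.
        System.out.println("[" + padded.strip() + "]");         // [JDK 11]
        System.out.println("[" + padded.stripLeading() + "]");  // [JDK 11  ]
        System.out.println("[" + padded.stripTrailing() + "]"); // [  JDK 11]

        // A string of only white space counts as blank, and so does the empty string.
        System.out.println(" \t\n".isBlank());                  // true
        System.out.println("".isBlank());                       // true

        // lines() splits on line terminators and drops them from the results.
        System.out.println(toLines("one\ntwo\r\nthree"));       // [one, two, three]
    }
}
```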
Evidence of the progress that has been made related to these methods can be found in messages requesting "compatibility and specification reviews" (CSR) on the core-libs-dev mailing list:
- Please review CSR: JDK-8200425 String#lines (25 April 2018)
- Please review CSR: JDK-8200378 String#strip, String#stripLeading, String#stripTrailing (25 April 2018)
A common characteristic of four of these five new methods is that they use a different (newer) definition of "white space" than older methods such as String.trim(). Bug JDK-8200373 ["String::trim JavaDoc should clarify meaning of space"] even addresses this for the String.trim() method (mailing list review request):
The current JavaDoc for String::trim does not make it clear which definition of "space" is being used in the code. With additional trimming methods coming in the near future that use a different definition of space, clarification is imperative. String::trim uses the definition of space as any codepoint that is less than or equal to the space character codepoint (\u0020). Newer trimming methods will use the definition of (white) space as any codepoint that returns true when passed to the Character::isWhitespace predicate.
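The difference between the two definitions can be sketched with two characters that fall on opposite sides of it. This example assumes the new strip() method behaves as the bug report describes; the class name `TrimVersusStrip` and its constants are hypothetical names chosen for this illustration.

```java
// Hypothetical demo class contrasting the two definitions of "space" (JDK 11+).
class TrimVersusStrip {

    // EM SPACE (U+2003): Unicode white space, but its code point is above U+0020.
    static final String EM_SPACE = "\u2003";

    // START OF HEADING (U+0001): a control character at or below U+0020,
    // for which Character.isWhitespace returns false.
    static final String CONTROL = "\u0001";

    public static void main(String[] args) {
        String emPadded = EM_SPACE + "text" + EM_SPACE;
        // trim() leaves the em spaces in place; strip() removes them.
        System.out.println(emPadded.trim().length());   // 6
        System.out.println(emPadded.strip().length());  // 4

        String controlPadded = CONTROL + "text" + CONTROL;
        // trim() removes the control characters; strip() leaves them in place.
        System.out.println(controlPadded.trim().length());   // 4
        System.out.println(controlPadded.strip().length());  // 6
    }
}
```

In other words, neither method's result is a subset of the other's: which one is appropriate depends on whether the input might contain Unicode white space or low control characters.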
The method isWhitespace(char) was added to Character with JDK 1.1, but the method isWhitespace(int) was not introduced to the Character class until JDK 1.5. The latter method (the one accepting a parameter of type int) was added to support supplementary characters. The Javadoc comments for the Character class define supplementary characters (typically modeled with an int-based "code point") versus BMP characters (typically modeled with a single char):
The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values ... A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. ... The methods that only accept a char value cannot support supplementary characters. ... The methods that accept an int value support all Unicode characters, including supplementary characters.
I added the emphasis in the above quote to highlight the significance of a "code point," which is defined for the Java context as "a value that can be used in a coded character set". Four of the five proposed new methods for String in JDK 11 rely heavily on the concept embodied in Character.isWhitespace(int) when determining how to "trim" a given string or whether a given string is "blank."
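The char versus code point distinction can be illustrated with a short sketch. The class name `CodePointDemo` and the helper `isBlankByCodePoint` are hypothetical; the helper only approximates the idea behind isBlank() using Character.isWhitespace(int) and is not the JDK's actual implementation.

```java
// Hypothetical demo class showing char values versus int code points.
class CodePointDemo {

    // Approximation of the "blank" check: every code point must be white space.
    // An empty string has no code points, so allMatch returns true for it.
    static boolean isBlankByCodePoint(String s) {
        return s.codePoints().allMatch(Character::isWhitespace);
    }

    public static void main(String[] args) {
        // U+1F600 lies above U+FFFF, so it is a supplementary character.
        int supplementary = 0x1F600;
        String text = new String(Character.toChars(supplementary));

        System.out.println(Character.isSupplementaryCodePoint(supplementary)); // true
        System.out.println(text.length());                          // 2 (a surrogate pair of char values)
        System.out.println(text.codePointCount(0, text.length()));  // 1 (a single code point)

        // The int overload accepts any code point, here EM SPACE (U+2003).
        System.out.println(Character.isWhitespace(0x2003));         // true

        System.out.println(isBlankByCodePoint(" \t\u2003"));        // true
        System.out.println(isBlankByCodePoint(" x "));              // false
    }
}
```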
Speaking of Unicode, JEP 327 ["Unicode 10"] has been proposed to be added to JDK 11 as well. As that JEP states, its intent is to "upgrade existing platform APIs to support version 10.0 of the Unicode Standard." This will be especially exciting news for anyone wishing to work with the "56 new emoji characters" supported by this new version.
Conclusion
The new methods on String currently proposed for JDK 11 provide a more consistent approach to handling white space in strings, one that better supports internationalization. They also provide methods for trimming white space only at the beginning or only at the end of a string, and a method especially intended for the upcoming raw string literals.
Additional References
- Java's String.trim has a strange idea of whitespace
- Java Tutorial: Unicode
- Supplementary Characters in the Java Platform
- Unicode Chart
- STR01-J. Do not assume that a Java char fully represents a Unicode code point
- StackOverflow.com: Removing whitespace from strings in Java
- Checking for Null or Empty or White Space Only String in Java
- JEP 327: Unicode 10