SonarSource sponsored this post. Insight Partners is an investor in SonarSource and TNS.
Check out the following HTTP response. Notice anything?
HTTP/1.1 200 OK
Server: Some Server
Content-Type: text/html
Content-Length: 1337

<!DOCTYPE html>
<html>
<head><title>Some Page</title></head>
<body>
…
Looking at this small portion of the HTTP response, you can assume this web app is likely prone to a cross-site scripting (XSS) vulnerability. If you are questioning the Content-Type header, you are right. There is a minor imperfection here: The header is missing a charset attribute. A charset is a collection of characters that a computer can use to represent text. That may not sound like a big deal, but attackers can easily exploit it to inject arbitrary JavaScript code into a website by deliberately changing the character set the browser assumes.
Character Encodings
A common Content-Type header in an HTTP response may look like this:
HTTP/1.1 200 OK
Server: Some Server
Content-Type: text/html; charset=utf-8
…
This charset attribute tells the browser that UTF-8 was used to encode the HTTP response body. A character encoding defines a mapping between characters and bytes. When a web server sends an HTML document to the browser, it maps the characters of the document to the corresponding bytes and transmits these in the response body. This process of turning characters into bytes is called encoding.
Once the browser receives these bytes in the response body, it translates them back to the characters of the document. This process of turning bytes into characters is called decoding.
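The round trip described above can be sketched in a few lines of Python (the string and byte values here are purely illustrative):

```python
# Characters -> bytes (encoding), as performed by the server.
text = "Grüße"
body = text.encode("utf-8")
assert body == b"Gr\xc3\xbc\xc3\x9fe"   # the bytes sent in the response body

# Bytes -> characters (decoding), as performed by the browser.
restored = body.decode("utf-8")
assert restored == text                 # sender and receiver agree, so the
                                        # original characters are restored
```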
UTF-8 is just one of many character encodings that a modern browser has to support. It’s crucial for the browser to know which of all these encodings (including UTF-16, ISO-8859-xx, windows-125x, GBK, Big5 and more) the server used, or it can’t decode the bytes in the response body.
But what if there’s no charset attribute in the Content-Type header? What if it’s invalid?
Alternate Charset Attributes
When this happens, the browser looks for a <meta> tag in the HTML document itself, which may have a charset attribute that indicates the character encoding. Here, the browser is performing a balancing act: To read the HTML document, it needs to decode the response body, meaning it has to assume some encoding beforehand, decode the body, look for the <meta> tag, and then potentially re-decode the body with the indicated character encoding.
Less commonly, a browser may use the byte-order mark. This specific Unicode character, U+FEFF, can be placed in front of a string to indicate the byte endianness and character encoding. We mainly see this in files, but since these might be sent by a web server, modern browsers support it. When at the beginning of an HTML document, a byte-order mark even takes precedence over a charset attribute in the Content-Type header and <meta> tag.
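The BOM behavior is easy to observe in Python. In this sketch (the HTML body is hypothetical), U+FEFF encoded in UTF-8 yields the three bytes EF BB BF, and the `utf-8-sig` codec strips it during decoding while plain `utf-8` keeps it:

```python
# U+FEFF (the byte-order mark) encoded in UTF-8 is the bytes EF BB BF.
bom = "\ufeff".encode("utf-8")
assert bom == b"\xef\xbb\xbf"

body = bom + b"<!DOCTYPE html>"
# 'utf-8-sig' consumes a leading BOM; plain 'utf-8' leaves it in the text.
assert body.decode("utf-8-sig") == "<!DOCTYPE html>"
assert body.decode("utf-8") == "\ufeff<!DOCTYPE html>"
```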
But what if the charset information is missing altogether?
Missing Charset Information
The byte-order mark mentioned before is unusual in practice. More importantly, the charset attribute isn’t always present in the Content-Type header, or it can be invalid. Often, there’s also no <meta> tag indicating a character encoding. In this case, the browser has no idea which character set to use.
But you won’t see an error message for this. Similar to faulty HTML syntax, browsers work hard to recover from a missing charset when parsing content from a server and make the best of it. This non-strict behavior makes for a great user experience, but it also opens the door to vulnerabilities and exploitation techniques.
When it comes to missing charset information, browsers use the HTTP response body to make an educated guess. This is called auto-detection. It’s similar to Multipurpose Internet Mail Extensions (MIME) type sniffing but operates on the character-encoding level instead. For example, Chromium’s rendering engine, Blink, uses the Compact Encoding Detection (CED) library to automatically detect character encodings. From an attacker’s point of view, this auto-detection feature is incredibly powerful.
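The general idea can be sketched with a toy detector in Python. This is a deliberately simplistic illustration, not how real engines work — Chromium’s CED applies statistical models over the whole byte stream rather than simple rules like these:

```python
def sniff_charset(body: bytes) -> str:
    """Toy charset auto-detection (illustration only)."""
    # A UTF-8 byte-order mark is the strongest possible signal.
    if body.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    # ISO-2022-JP escape sequences rarely occur in other text by accident.
    if any(esc in body for esc in (b"\x1b$B", b"\x1b$@", b"\x1b(B", b"\x1b(J")):
        return "iso2022_jp"
    # If the body is valid UTF-8, assume UTF-8.
    try:
        body.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Fall back to a common single-byte encoding as a last guess.
        return "windows-1252"
```

Note how little evidence the ISO-2022-JP branch needs: a single escape sequence anywhere in the body is enough to flip the guess, which is exactly what makes auto-detection attractive to attackers.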
Knowing there are multiple mechanisms for a browser to determine character encoding, how can this fact be exploited?
Encoding Differentials
Character encodings translate characters into a computer-processable byte sequence, which can be transmitted over a network and decoded back to characters by the receiver. Thanks to this, the exact same characters the sender intended to transmit are restored.
But this only works when the sender and receiver agree upon the character encoding used. If there’s a mismatch, they’ll see different characters. This is called an encoding differential. For web apps, encoding differentials become vital when user-controlled data is sanitized to prevent cross-site scripting vulnerabilities: They can theoretically break sanitization and lead to severe vulnerabilities. There’s one encoding that is particularly interesting to attackers: ISO-2022-JP.
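A minimal Python sketch of such a mismatch (the example string is only illustrative): the sender encodes with UTF-8, but a receiver that guesses Latin-1 sees different characters.

```python
body = "café".encode("utf-8")             # sender encodes with UTF-8
assert body.decode("utf-8") == "café"     # receiver agrees: same characters
assert body.decode("latin-1") == "cafÃ©"  # receiver guesses wrong: the same
                                          # bytes decode to different characters
```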
ISO-2022-JP
ISO-2022-JP is a Japanese character encoding that browsers must support to follow the HTML standard. It supports four specific escape sequences to switch between different character sets.
This feature is very flexible but can also break some fundamental assumptions. This encoding is also auto-detected by Chrome through Blink and Firefox through Gecko. A single occurrence of one of these four escape sequences for ISO-2022-JP is usually enough to convince the auto-detection algorithm that the HTTP response body is encoded with ISO-2022-JP.
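Python’s codec support makes this easy to observe. In the sketch below, the ASCII-only bytes between ESC $ B (switch to JIS X 0208) and ESC ( B (switch back to ASCII) decode to Japanese text under ISO-2022-JP, while a single-byte view of the very same bytes sees only plain letters and punctuation:

```python
raw = b"Hi \x1b$B$3$s$K$A$O\x1b(B"        # ESC $ B ... ESC ( B
# Decoded as ISO-2022-JP, the bytes between the escape sequences are
# consumed in pairs as JIS X 0208 code points:
assert raw.decode("iso2022_jp") == "Hi こんにちは"
# Decoded as a single-byte encoding, the same bytes stay plain ASCII:
assert raw.decode("latin-1") == "Hi \x1b$B$3$s$K$A$O\x1b(B"
```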
Attackers can use two different exploitation techniques to make use of the ISO-2022-JP charset, depending on their capabilities:
Negating backslash escaping: This technique can be used to neutralize a backslash that would otherwise escape a character, such as a double quote, in a JavaScript string context.
Breaking HTML context: Commonly used in websites that support markdown, this technique requires an attacker to control values in two different HTML contexts. By consuming HTML special characters that designate the end of an HTML context, this technique allows attackers to inject data into an unintended HTML context.
Both of these techniques can be used by attackers to inject malicious JavaScript code into a website.
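To make the first technique concrete, here is a hedged Python sketch. The `escape_js_string` sanitizer is a hypothetical stand-in, not code from any real framework. If the attacker smuggles ESC $ B into the page and the browser auto-detects ISO-2022-JP, the sanitizer’s escaping backslash is absorbed into a two-byte JIS X 0208 sequence and never reaches the JavaScript parser:

```python
def escape_js_string(s: str) -> str:
    # Hypothetical sanitizer: backslash-escape quotes for a JS string literal.
    return s.replace("\\", "\\\\").replace('"', '\\"')

attacker = '\x1b$Bab"'                   # ESC $ B, filler bytes, then a quote
page = 'var name = "' + escape_js_string(attacker) + '";'
body = page.encode("latin-1")            # served without charset information

# Decoded as the server intended, the quote stays safely escaped:
assert '\\"' in body.decode("latin-1")

# Decoded as ISO-2022-JP, every byte after ESC $ B is consumed pairwise as
# a JIS X 0208 code point -- the escaping backslash simply disappears:
decoded = body.decode("iso2022_jp", errors="replace")
assert "\\" not in decoded
assert decoded.count('"') == 1           # only the opening quote survives
```

In a real attack the payload is crafted so that a later ESC ( B switches the decoder back to ASCII, letting an attacker-controlled quote terminate the string unescaped.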
Charset Makes a Difference in Security
To ensure an application is as attacker-proof as possible, it’s critical for developers to remember the importance of providing charset information when serving HTML documents. Forgetting to do so can lead to severe XSS vulnerabilities when attackers can manipulate a browser’s charset assumption.
Although the actual vulnerability boils down to the missing character set, a browser’s auto-detection greatly increases its overall impact. As long as browsers support auto-detection for the ISO-2022-JP character encoding, attackers might be able to exploit this.
Sonar equips developers and organizations to deliver quality, secure code fit for development and production, whether AI-generated or written by developers. Trusted by over 400,000 organizations globally to clean more than half a trillion lines of code, Sonar is integral to delivering software.
Stefan Schiller is a vulnerability researcher in the Sonar R&D team. He has been passionate about software and programming since his early childhood. With a background in red teaming, he has been working in the field of offensive IT security...