Bug 61053 - Using null bytes when setting innerHTML in XTHML results in assertion and a crash due to null-pointer dereference
Summary: Using null bytes when setting innerHTML in XTHML results in assertion and a c...
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: DOM (show other bugs)
Version: 528+ (Nightly build)
Hardware: PC Windows Vista
: P1 Normal
Assignee: Vicki Pfau
URL:
Keywords:
: 62070 (view as bug list)
Depends on:
Blocks:
 
Reported: 2011-05-18 07:49 PDT by Berend-Jan Wever
Modified: 2011-06-16 22:59 PDT (History)
9 users (show)

See Also:


Attachments
Repro ASSERT (268 bytes, application/xhtml+xml)
2011-05-18 07:49 PDT, Berend-Jan Wever
no flags Details
Repro NULL ptr (268 bytes, application/xhtml+xml)
2011-05-18 07:50 PDT, Berend-Jan Wever
no flags Details
Patch (7.45 KB, patch)
2011-06-07 11:12 PDT, Vicki Pfau
no flags Details | Formatted Diff | Diff
Patch (12.88 KB, patch)
2011-06-10 17:35 PDT, Vicki Pfau
no flags Details | Formatted Diff | Diff
Patch (9.61 KB, patch)
2011-06-16 13:58 PDT, Vicki Pfau
no flags Details | Formatted Diff | Diff
Patch (9.64 KB, patch)
2011-06-16 18:08 PDT, Vicki Pfau
no flags Details | Formatted Diff | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Berend-Jan Wever 2011-05-18 07:49:29 PDT
Created attachment 93913 [details]
Repro ASSERT

Two very similar repros trigger an ASSERT and a NULL ptr:

Repro ASSERT:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <body onload="document.body.innerHTML = 'x\x00\x3C';"></body>
</html>
(Ascii codes hint: that is "x\0<")

Source:
webkit\source\webcore\dom\xmldocumentparserlibxml2.cpp @ 1440:
bool XMLDocumentParser::appendFragmentSource(const String& chunk)
{
    ASSERT(!m_context);
    ASSERT(m_parsingFragment);

    CString chunkAsUtf8 = chunk.utf8();
    initializeParserContext(chunkAsUtf8.data());
    xmlParseContent(context());
    endDocument(); // Close any open text nodes.

    // FIXME: If this code is actually needed, it should probably move to finish()
    // XMLDocumentParserQt has a similar check (m_stream.error() == QXmlStreamReader::PrematureEndOfDocumentError) in doEnd().
    // Check if all the chunk has been processed.
    long bytesProcessed = xmlByteConsumed(context());
    if (bytesProcessed == -1 || ((unsigned long)bytesProcessed) != chunkAsUtf8.length()) {
        // FIXME: I don't believe we can hit this case without also having seen an error.
        // If we hit this ASSERT, we've found a test case which demonstrates the need for this code.
        ASSERT(m_sawError);
        return false;
    }

    // No error if the chunk is well formed or it is not but we have no error.
    return context()->wellFormed || !xmlCtxtGetLastError(context());
}

@eric: You wrote that comment and added the ASSERT. I've found your test case; it seems NULL bytes terminate processing of the string, so you do not end up processing as many bytes as there are in the chunkAsUtf8 string :)

----

Repro NULL ptr:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <body onload="document.body.innerHTML = '\x00\x3Cx';"></body>
</html>
(Ascii codes hint: that is "\0<x")

Source:
Let's see what happens in appendFragmentSource if chunk == ' <x':
    initializeParserContext(chunkAsUtf8.data());
    xmlParseContent(context());
==> "initializeParserContext" sets m_context, "xmlParseContent(context())" parses it. There is no check for a NULL m_context; this causes the NULL pointer.

webkit\source\webcore\dom\xmldocumentparserlibxml2.cpp @ 1279:
void XMLDocumentParser::initializeParserContext(const char* chunk)
{
<<<snip>>>
    if (m_parsingFragment)
        m_context = XMLParserContext::createMemoryParser(&sax, this, chunk);
<<<snip>>>
}

webkit\source\webcore\dom\xmldocumentparserlibxml2.cpp @ 504:
PassRefPtr<XMLParserContext> XMLParserContext::createMemoryParser(xmlSAXHandlerPtr handlers, void* userData, const char* chunk)
{
<<<snip>>>
    xmlParserCtxtPtr parser = xmlCreateMemoryParserCtxt(chunk, xmlStrlen((const xmlChar*)chunk));

    if (!parser)
        return 0;
<snip>

==> "xmlStrlen" (libxml\src\xmlstring.c @ 421) looks for the first '\0' and returns 0.
==> "xmlCreateMemoryParserCtxt" will return NULL if size == 0.
==> "createMemoryParser" will return NULL so "m_context" will be 0.
Comment 1 Berend-Jan Wever 2011-05-18 07:50:00 PDT
Created attachment 93914 [details]
Repro NULL ptr
Comment 2 Berend-Jan Wever 2011-05-18 07:58:32 PDT
Chromium: https://code.google.com/p/chromium/issues/detail?id=83053
Comment 3 Eric Seidel (no email) 2011-05-19 08:23:49 PDT
I suspect the fix would be to tell the TextDecoder to remove nulls (change them to ? or something).
Comment 4 Eric Seidel (no email) 2011-05-19 08:28:26 PDT
It's possible that convertUTF16ToUTF8 (in UTF8.h) should convert nulls to the replacement character in lenient mode (just like how it converts other illegal chars).
Comment 5 Darin Adler 2011-05-19 08:32:46 PDT
(In reply to comment #4)
> It's possible that convertUTF16ToUTF8 (in UTF8.h) should convert nulls to the replacement character in lenient mode (just like how it converts other illegal chars).

I don’t think that’s right.

For one thing, a null character is not an “illegal character”. We may have some call sites that can’t handle null characters because they are using null terminated strings, and they do need to be protected from unexpected null characters, but a null character is not intrinsically bad.

For another, I believe other illegal things are illegal sequences, not characters at all.
Comment 6 Eric Seidel (no email) 2011-05-19 08:40:30 PDT
I believe you to be much more familiar with this all than I Darin. :)

It does seem like:
CString String::utf8(bool strict) const
Should replace nulls, since it returns a null terminated string.

For other utf8 conversions, I have no opinion.
Comment 7 Berend-Jan Wever 2011-05-19 09:31:32 PDT
Not that I'm an expert, but what about "Modified utf-8" which encodes the nulls? I believe it was developed to address such issues.
http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
Comment 8 Darin Adler 2011-05-19 09:51:17 PDT
(In reply to comment #6)
> It does seem like:
> CString String::utf8(bool strict) const
> Should replace nulls, since it returns a null terminated string.

That’s an interesting point.

Although constructing a CString does add a null terminator, the class will work with strings that have embedded nulls. The CString length function will still return the correct length that includes all the nulls and not the null terminator, and as long as we don’t use the data pointer as a null-terminated string we can make code that works with it just fine.

But since CString clients are highly likely to use it as a null-terminated string, this could lead to problems and maybe even security vulnerabilities so we may want to change the design of the CString class or the String::utf8 function.

We’ll have to decide if we want different behavior, and if so, implement it.

None of that is necessarily required to fix this bug. Once we decide what behavior we want we can almost certainly fix this problem in the XML parser code, without changing the current design and implementation of CString.
Comment 9 Vicki Pfau 2011-06-07 10:06:55 PDT
*** Bug 62070 has been marked as a duplicate of this bug. ***
Comment 10 Vicki Pfau 2011-06-07 10:44:26 PDT
In the XML spec, null characters are disallowed. (In the 1.0 spec it is not shown in the permissible character range, and in 1.1 it is explicitly stated that it is disallowed.) However, this doesn't mean we should strip out the null characters; they're still there in the string that's being parsed. But we also shouldn't ignore them or treat them as an artificial string terminator, and given that they're disallowed in XML, we shouldn't just replace them either.

My proposed solution is to pass the length of the string along with the string that potentially has nulls in it. libxml2 keeps parsing until it hits the first NULL (even though you pass the length of the string in), at which point it returns that it processed however many characters there are until the null character. Since we know the expected number of bytes to be processed, we can tell if it stopped midway through the string and throw an error if so; we should always process the full string as it is, and there is already an ASSERT to that effect.
Comment 11 Vicki Pfau 2011-06-07 11:12:41 PDT
Created attachment 96263 [details]
Patch
Comment 12 Alexey Proskuryakov 2011-06-07 11:20:22 PDT
Comment on attachment 96263 [details]
Patch

This looks almost like the patch from bug 62070, with review comments not having been addressed.
Comment 13 Darin Adler 2011-06-07 11:37:41 PDT
Comment on attachment 96263 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=96263&action=review

> Source/WebCore/dom/XMLDocumentParserLibxml2.cpp:1446
> -    initializeParserContext(chunkAsUtf8.data());
> +    initializeParserContext(chunkAsUtf8.data(), chunkAsUtf8.length());

It’s not safe to pass a value that’s a size_t to a function that takes an int. This is especially true on 64-bit systems where size_t is a 64-bit unsigned integer and int is a 32-bit signed integer. Large lengths can turn into negative numbers or even zeros and cause problems such as security vulnerabilities.

The type size_t needs to be used throughout, or we need something here to explicitly check that the size fits into an int.
Comment 14 Vicki Pfau 2011-06-07 12:12:08 PDT
(In reply to comment #12)
> (From update of attachment 96263 [details])
> This looks almost like the patch from bug 62070, with review comments not having been addressed.

I can add the test cases from the other bug to this patch in the next revision.

I verified that my test case passes in Firefox. I prefixed the variable names with "chunk", but I see now that you meant unabbreviating "len" to "length" as well. I was following precedent that is set elsewhere in that file when I named the variable in the first place.

I fixed the out of bounds problem by short-circuiting the check for a null byte in the string if the length is negative.

I don't know why m_sawError doesn't get set for a null byte, and why it just bails out with the number of bytes up to the null being processed, but this appears to be behavior in libxml2 and not in WebKit code. m_sawError is generally set in the SAX callbacks that get set in intializeParserContext and invoked by xmlParseContext.

The JS exception itself is likely thrown when it realizes that it can't parse the XML, so it can't replace the outer/innerHTML of the element without corrupting the XML of the entire page--so it just throws an exception instead of killing the page.

As for the size_t issue, libxml2 uses int as its buffer length, so it probably can't handle any XML document larger than 2GiB. I can add a check to make sure that the value is within 2GiB, but beyond that, this patch doesn't really change the behavior of handling XML documents that are larger than 2GiB.
Comment 15 Vicki Pfau 2011-06-10 17:35:28 PDT
Created attachment 96828 [details]
Patch
Comment 16 Eric Seidel (no email) 2011-06-10 17:52:32 PDT
Comment on attachment 96828 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=96828&action=review

> Source/WebCore/dom/XMLDocumentParser.h:59
> +        static PassRefPtr<XMLParserContext> createMemoryParser(xmlSAXHandlerPtr, void*, const char*, int);

This argument should be named as the name adds clarity.  The void* too! :)

> Source/WebCore/dom/XMLDocumentParserLibxml2.cpp:1448
> +        // libxml2 takes an int for a length, and therefore can't handle XML chunks larger than 2 GiB

I would have put the comment outside the if block and thus removed the { }.  Also all comments should be sentences, beginning with a capital and ending with a period.  In this case only the latter applies. :)

> Source/WebCore/dom/XMLDocumentParserLibxml2.cpp:1452
> +    initializeParserContext(chunkAsUtf8.data(), chunkAsUtf8.length());

Should we just be passing the CString to this call instead of a pointer/length pair?

> Source/WebCore/dom/XMLDocumentParserLibxml2.cpp:1463
> +        ASSERT(m_sawError || (bytesProcessed >= 0 && !chunkAsUtf8.data()[bytesProcessed]));

So we're just bailing when we hit a null byte?
Comment 17 Vicki Pfau 2011-06-10 18:32:16 PDT
(In reply to comment #16)
> (From update of attachment 96828 [details])
> View in context: https://bugs.webkit.org/attachment.cgi?id=96828&action=review
> 
> > Source/WebCore/dom/XMLDocumentParser.h:59
> > +        static PassRefPtr<XMLParserContext> createMemoryParser(xmlSAXHandlerPtr, void*, const char*, int);
> 
> This argument should be named as the name adds clarity.  The void* too! :)

I didn't create this function signature, but I can add the argument names.

> > Source/WebCore/dom/XMLDocumentParserLibxml2.cpp:1448
> > +        // libxml2 takes an int for a length, and therefore can't handle XML chunks larger than 2 GiB
> 
> I would have put the comment outside the if block and thus removed the { }.  Also all comments should be sentences, beginning with a capital and ending with a period.  In this case only the latter applies. :)

I figured since it's the name of something, I should go with the capitalization of the library name, but I can change this, too.
 
> > Source/WebCore/dom/XMLDocumentParserLibxml2.cpp:1452
> > +    initializeParserContext(chunkAsUtf8.data(), chunkAsUtf8.length());
> 
> Should we just be passing the CString to this call instead of a pointer/length pair?

This sounds like a good idea to me.

> > Source/WebCore/dom/XMLDocumentParserLibxml2.cpp:1463
> > +        ASSERT(m_sawError || (bytesProcessed >= 0 && !chunkAsUtf8.data()[bytesProcessed]));
> 
> So we're just bailing when we hit a null byte?

The XML specification does not list null byte (nor most ASCII control characters) as acceptable characters in XML, and the XML 1.1 spec explicitly disallows them. If we hit a null byte, then the XML is technically malformed.
Comment 18 Darin Adler 2011-06-11 09:10:28 PDT
(In reply to comment #17)
> > > Source/WebCore/dom/XMLDocumentParserLibxml2.cpp:1448
> > > +        // libxml2 takes an int for a length, and therefore can't handle XML chunks larger than 2 GiB
> > 
> > I would have put the comment outside the if block and thus removed the { }.  Also all comments should be sentences, beginning with a capital and ending with a period.  In this case only the latter applies. :)
> 
> I figured since it's the name of something, I should go with the capitalization of the library name, but I can change this, too.

Maciej said “only the latter”, which means you should end with a period, but not begin with a capital letter.

> > > Source/WebCore/dom/XMLDocumentParserLibxml2.cpp:1463
> > > +        ASSERT(m_sawError || (bytesProcessed >= 0 && !chunkAsUtf8.data()[bytesProcessed]));
> > 
> > So we're just bailing when we hit a null byte?
> 
> The XML specification does not list null byte (nor most ASCII control characters) as acceptable characters in XML, and the XML 1.1 spec explicitly disallows them. If we hit a null byte, then the XML is technically malformed.

Sure, but the XML specification doesn’t tell us how innerHTML should work. Stopping at the first null byte is one possible behavior. Another is doing nothing at all if the string contains any null bytes. Yet another is to strip the null bytes (that’s a bad idea). Another possibility is to treat any string with null bytes in it as if you had set innerHTML to the empty string. It’s not immediately obvious to me which is the correct behavior. Since you’ve had more time to think about it and investigate you might have come to a good decision and rationale for why it’s right. I think that’s the drift of Maciej’s question.
Comment 19 Vicki Pfau 2011-06-13 16:25:47 PDT
(In reply to comment #18)
> > > > Source/WebCore/dom/XMLDocumentParserLibxml2.cpp:1463
> > > > +        ASSERT(m_sawError || (bytesProcessed >= 0 && !chunkAsUtf8.data()[bytesProcessed]));
> > > 
> > > So we're just bailing when we hit a null byte?
> > 
> > The XML specification does not list null byte (nor most ASCII control characters) as acceptable characters in XML, and the XML 1.1 spec explicitly disallows them. If we hit a null byte, then the XML is technically malformed.
> 
> Sure, but the XML specification doesn’t tell us how innerHTML should work. Stopping at the first null byte is one possible behavior. Another is doing nothing at all if the string contains any null bytes. Yet another is to strip the null bytes (that’s a bad idea). Another possibility is to treat any string with null bytes in it as if you had set innerHTML to the empty string. It’s not immediately obvious to me which is the correct behavior. Since you’ve had more time to think about it and investigate you might have come to a good decision and rationale for why it’s right. I think that’s the drift of Maciej’s question.

The Element.innerHTML method is defined in the HTML 5 spec (in this case, the XHTML portion thereof), and it explicitly says that, if a character that does not match the Char specification in the XML spec (e.g. a null byte) arises in the input stream, that the user-agent must throw an INVALID_STATE_ERR exception. The document does not specify an Element.outerHTML method, but by analog, a similar logic can be employed. Although I had not found this segment of the spec until after submitting the patch, it lines up with the behavior the patch gives.

So yes, we're just bailing if we hit a null byte.
Comment 20 Darin Adler 2011-06-14 08:18:42 PDT
(In reply to comment #19)
> Although I had not found this segment of the spec until after submitting the patch, it lines up with the behavior the patch gives.

Sounds good.
Comment 21 Alexey Proskuryakov 2011-06-15 12:06:14 PDT
Comment on attachment 96828 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=96828&action=review

Looks good to me. Please address Eric's comments too.

> LayoutTests/ChangeLog:8
> +        Added test cases covering four cases of using innerHTML and outerHTML with null bytes in XHTML

I don't really like that the tests check assigning strings that are not valid fragments even if nulls are ignored, so checking these is confusing at best. I would have tested strings such as "<p>Test fails</p>\x00" or "\x00<p>Test fails</p>". 

Please change the tests to fail if the null bytes are manually removed, so that they actually check what happens with these. Please double-check Firefox compatibility once you make this change.

> LayoutTests/fast/parser/xhtml-innerhtml-null-byte-first.xhtml:14
> +        // Base64 says: \0

Is there a reason to use Base64, not \x00 in a string literal? The latter would be more readable, and wouldn't require comments.

> Source/WebCore/dom/XMLDocumentParserLibxml2.cpp:1447
> +    if (chunkAsUtf8.length() > 0x7FFFFFFFL) {

Why not INT_MAX? Or you could even check for something like "chunkAsUtf8.length() == static_cast<int>(chunkAsUtf8.length())", although the latter is not a common WebKit style.
Comment 22 Vicki Pfau 2011-06-16 13:58:26 PDT
Created attachment 97489 [details]
Patch
Comment 23 Eric Seidel (no email) 2011-06-16 14:03:37 PDT
Comment on attachment 97489 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=97489&action=review

> Source/WebCore/dom/XMLDocumentParser.h:61
> +        static PassRefPtr<XMLParserContext> createMemoryParser(xmlSAXHandlerPtr handlers, void* userData, const CString& chunk);
> +        static PassRefPtr<XMLParserContext> createStringParser(xmlSAXHandlerPtr handlers, void* userData);

The handlers name doesn't add anything (it's already described well by the type), so I wouldn't add it.  But I appreciate you adding the void*.

> Source/WebCore/dom/XMLDocumentParserLibxml2.cpp:514
> +    ASSERT(chunk.length() <= INT_MAX);

What happens if we do pass a length over INT_MAX to xmlCreateMemoryParserCtxt?

> Source/WebCore/dom/XMLDocumentParserLibxml2.cpp:1464
> +        ASSERT(m_sawError || (bytesProcessed >= 0 && !chunkAsUtf8.data()[bytesProcessed]));

I'm not sure I understand why we're adjusting this assert.  Should we instead be setting m_sawError?  I'm not sure w/o looking what m_sawError does?
Comment 24 Vicki Pfau 2011-06-16 14:18:37 PDT
(In reply to comment #23)
> > Source/WebCore/dom/XMLDocumentParserLibxml2.cpp:514
> > +    ASSERT(chunk.length() <= INT_MAX);
> 
> What happens if we do pass a length over INT_MAX to xmlCreateMemoryParserCtxt?

xmlCreateMemoryParserCtxt takes an int, so if we pass something larger than INT_MAX, it's an integer overflow. I can add a comment that states such before the assert.

> > Source/WebCore/dom/XMLDocumentParserLibxml2.cpp:1464
> > +        ASSERT(m_sawError || (bytesProcessed >= 0 && !chunkAsUtf8.data()[bytesProcessed]));
> 
> I'm not sure I understand why we're adjusting this assert.  Should we instead be setting m_sawError?  I'm not sure w/o looking what m_sawError does?

m_sawError is only set in the handlers. As far as I know, if a null byte is encountered, the (memory) processor treats it as an EOF, so m_sawError can't be set by a handler, as it never sees the null byte. This appears to be a bug in libxml2, as it looks like the stream processor handles null characters properly.
Comment 25 Eric Seidel (no email) 2011-06-16 14:40:19 PDT
Comment on attachment 97489 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=97489&action=review

> Source/WebCore/dom/XMLDocumentParserLibxml2.cpp:516
> +    xmlParserCtxtPtr parser = xmlCreateMemoryParserCtxt(chunk.data(), chunk.length());

I'm surprised the compiler won't warn here then, if length() is unsigned and we're using it as signed.
Comment 26 Alexey Proskuryakov 2011-06-16 14:59:34 PDT
Comment on attachment 97489 [details]
Patch

View in context: https://bugs.webkit.org/attachment.cgi?id=97489&action=review

> LayoutTests/ChangeLog:8
> +        Added test cases covering four cases of using innerHTML with null bytes in XHTML.

four cases?

>> Source/WebCore/dom/XMLDocumentParser.h:61
>> +        static PassRefPtr<XMLParserContext> createStringParser(xmlSAXHandlerPtr handlers, void* userData);
> 
> The handlers name doesn't add anything (it's already described well by the type), so I wouldn't add it.  But I appreciate you adding the void*.

Yup, "handlers" needn't be here.

>> Source/WebCore/dom/XMLDocumentParserLibxml2.cpp:514
>> +    ASSERT(chunk.length() <= INT_MAX);
> 
> What happens if we do pass a length over INT_MAX to xmlCreateMemoryParserCtxt?

This assertion is not particularly useful. It won't help us uncover any problems earlier, because we will never have chunks that long in practice.

Perhaps a comment explaining where the length check is made would be more appropriate.
Comment 27 Vicki Pfau 2011-06-16 18:08:02 PDT
Created attachment 97531 [details]
Patch
Comment 28 WebKit Review Bot 2011-06-16 22:54:26 PDT
Comment on attachment 97531 [details]
Patch

Clearing flags on attachment: 97531

Committed r89118: <http://trac.webkit.org/changeset/89118>
Comment 29 WebKit Review Bot 2011-06-16 22:54:33 PDT
All reviewed patches have been landed.  Closing bug.