Bug 156137 - Document the native format of JSChar type
Summary: Document the native format of JSChar type
Status: RESOLVED FIXED
Alias: None
Product: WebKit
Classification: Unclassified
Component: JavaScriptCore
Version: WebKit Nightly Build
Hardware: Unspecified
OS: Unspecified
Importance: P2 Normal
Assignee: David Kilzer (:ddkilzer)
Reported: 2016-04-02 13:55 PDT by David Kilzer (:ddkilzer)
Modified: 2016-06-14 15:54 PDT
CC: 7 users

Attachments
Patch v1 (1.51 KB, patch)
2016-04-02 14:06 PDT, David Kilzer (:ddkilzer)
no flags
Patch v2 (1.24 KB, patch)
2016-04-03 10:45 PDT, David Kilzer (:ddkilzer)
no flags
Patch v3 (1.33 KB, patch)
2016-06-14 15:21 PDT, David Kilzer (:ddkilzer)
no flags

Description David Kilzer (:ddkilzer) 2016-04-02 13:55:43 PDT
We should document that JSStringCreateWithCharacters() takes UTF16LE as its native format.

At least one customer had trouble figuring this out:
<https://twitter.com/jeremy_wyld/status/716053938148143104>
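
For illustration, a minimal sketch of the call in question, using only the public JSStringRef API. The code units below are written as numeric values, so the compiler stores them in whatever byte order the CPU uses:

    // Sketch: creating a JSStringRef from a buffer of UTF-16 code units.
    #include <JavaScriptCore/JavaScriptCore.h>
    #include <stdio.h>

    int main(void)
    {
        const JSChar chars[] = { 'H', 'e', 'l', 'l', 'o' }; // BMP code units
        JSStringRef string = JSStringCreateWithCharacters(chars, sizeof(chars) / sizeof(chars[0]));

        // Length is reported in UTF-16 code units; prints 5.
        printf("%zu\n", JSStringGetLength(string));

        JSStringRelease(string);
        return 0;
    }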
Comment 1 David Kilzer (:ddkilzer) 2016-04-02 14:06:13 PDT
Created attachment 275484 [details]
Patch v1
Comment 2 David Kilzer (:ddkilzer) 2016-04-02 14:08:20 PDT
(In reply to comment #0)
> We should document that JSStringCreateWithCharacters() takes UTF16LE as its
> native format.
> 
> At least one customer had trouble figuring this out:
> <https://twitter.com/jeremy_wyld/status/716053938148143104>

And I didn't know off the top of my head, either.
Comment 3 Darin Adler 2016-04-02 20:51:19 PDT
Comment on attachment 275484 [details]
Patch v1

View in context: https://bugs.webkit.org/attachment.cgi?id=275484&action=review

> Source/JavaScriptCore/API/JSStringRef.h:54
> +@abstract         Creates a JavaScript string from a buffer of Unicode characters 
> + in UTF16LE format.

If we want to document the use of UTF-16, I think the best place to add a comment would be in the definition of the JSChar type. That’s the approach the Cocoa platform takes with UniChar in the MacTypes.h header.

This is the same semantic as CFStringCreateWithCharacters; the documentation for that function doesn’t specify UTF-16 explicitly. If we did have a reason to specify UTF-16 for this function, we would also want to specify it for JSStringGetLength and JSStringGetCharactersPtr.

If we do make this type of change, we should say UTF-16, not UTF16LE. It’s little endian in practice on Apple platforms, sure, but that’s because each JSChar is in the CPU’s native byte ordering. Because Apple platforms run Intel and ARM in little endian mode, it’s always little endian on those. UTF-16 is a good name for this behavior. It might make sense to specify UTF-16 LE if the argument type were an array of bytes rather than an array of JSChar. Back when Apple had PowerPC versions of this, the function took UTF-16 BE, and that’s still the case any time WebKit is compiled for a big endian processor.
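
To make that concrete, a hypothetical sketch (the helper name and the swap loop are illustrative, not WebKit code) of what a caller holding big-endian UTF-16 bytes would have to do before the data can be treated as an array of JSChar:

    // Sketch: assembling UTF-16 BE bytes into native-order JSChar values.
    // Building each code unit arithmetically yields the CPU's byte order,
    // whether the machine is little or big endian. Error handling omitted.
    #include <JavaScriptCore/JavaScriptCore.h>
    #include <stdlib.h>

    JSStringRef createStringFromUTF16BEBytes(const unsigned char* bytes, size_t byteCount)
    {
        size_t count = byteCount / 2; // assumes an even byte count
        JSChar* chars = malloc(count * sizeof(JSChar));
        for (size_t i = 0; i < count; ++i)
            chars[i] = (JSChar)((bytes[2 * i] << 8) | bytes[2 * i + 1]);
        JSStringRef string = JSStringCreateWithCharacters(chars, count);
        free(chars);
        return string;
    }

On a big-endian machine the loop is effectively a copy; on a little-endian machine it performs the swap.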
Comment 4 Darin Adler 2016-04-02 20:52:42 PDT
I just read the thread with Jeremy and see that he was confused. The answer is that this has the same semantic as CFStringCreateWithCharacters: UTF-16 in the CPU’s native byte ordering.
Comment 5 David Kilzer (:ddkilzer) 2016-04-03 10:45:40 PDT
Created attachment 275504 [details]
Patch v2
Comment 6 David Kilzer (:ddkilzer) 2016-04-03 10:46:04 PDT
(In reply to comment #4)
> I just read the thread with Jeremy and see that he was confused. The answer
> is that this has the same semantic as CFStringCreateWithCharacters: UTF-16
> in the CPU’s native byte ordering.

Ah, makes sense.  Updated patch.
Comment 7 Darin Adler 2016-04-03 11:58:09 PDT
Comment on attachment 275504 [details]
Patch v2

View in context: https://bugs.webkit.org/attachment.cgi?id=275504&action=review

> Source/JavaScriptCore/API/JSStringRef.h:45
> +@abstract A Unicode character in UTF-16 format.  Endianness depends on the 
> + underlying architecture.

I don’t think we should write “Endianness depends on the underlying architecture”; that’s true of all scalar types in C all the time. Just as true of "unsigned short" or "int" or even "char*" as it is of JSChar. I understand that Jeremy was confused about this, but I don’t think writing this sentence is the best way to make that clear.

Makes sense to mention UTF-16, but it’s inaccurate to call this a “character in UTF-16 format” since some UTF-16 code units are not characters. The comment should say something more like this:

    A UTF-16 code unit; one, or a sequence of two, can encode any Unicode character.
    Often loosely referred to as a "Unicode character" because only one code unit was needed in pre-1996 Unicode.

If we really want to mention endianness we could add a third sentence:

    As with all scalar types, endianness depends on the underlying architecture.

I’m not sure my long comments should be used. They are more accurate but may not be as easy to understand.
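
To illustrate the "one, or a sequence of two" phrasing, a small sketch: a character outside the Basic Multilingual Plane occupies two JSChar code units (a surrogate pair), and the string's reported length counts code units, not characters:

    // Sketch: U+1F600 GRINNING FACE is outside the BMP, so it encodes as
    // the high surrogate 0xD83D followed by the low surrogate 0xDE00.
    #include <JavaScriptCore/JavaScriptCore.h>
    #include <stdio.h>

    int main(void)
    {
        const JSChar emoji[] = { 0xD83D, 0xDE00 }; // one character, two code units
        JSStringRef string = JSStringCreateWithCharacters(emoji, 2);

        printf("%zu\n", JSStringGetLength(string)); // prints 2, not 1

        JSStringRelease(string);
        return 0;
    }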
Comment 8 David Kilzer (:ddkilzer) 2016-06-14 15:21:57 PDT
Created attachment 281290 [details]
Patch v3
Comment 9 David Kilzer (:ddkilzer) 2016-06-14 15:23:58 PDT
Comment on attachment 275504 [details]
Patch v2

View in context: https://bugs.webkit.org/attachment.cgi?id=275504&action=review

>> Source/JavaScriptCore/API/JSStringRef.h:45
>> + underlying architecture.
> 
> I don’t think we should write “Endianness depends on the underlying architecture”; that’s true of all scalar types in C all the time. Just as true of "unsigned short" or "int" or even "char*" as it is of JSChar. I understand that Jeremy was confused about this, but I don’t think writing this sentence is the best way to make that clear.
> 
> Makes sense to mention UTF-16, but it’s inaccurate to call this a “character in UTF-16 format” since some UTF-16 code units are not characters. The comment should say something more like this:
> 
>     A UTF-16 code unit; one, or a sequence of two, can encode any Unicode character.
>     Often loosely referred to as a "Unicode character" because only one code unit was needed in pre-1996 Unicode.
> 
> If we really want to mention endianness we could add a third sentence:
> 
>     As with all scalar types, endianness depends on the underlying architecture.
> 
> I’m not sure my long comments should be used. They are more accurate but may not be as easy to understand.

I updated the documentation to use your example, but omitted the "Often loosely...." sentence.
Comment 10 David Kilzer (:ddkilzer) 2016-06-14 15:28:15 PDT
(In reply to comment #7)
> I don’t think we should write “Endianness depends on the underlying
> architecture”; that’s true of all scalar types in C all the time. Just as
> true of "unsigned short" or "int" or even "char*" as it is of JSChar. I
> understand that Jeremy was confused about this, but I don’t think writing
> this sentence is the best way to make that clear.

What I was wondering at the time was whether one would ever have a string in UTF-16 BE format in memory on a little-endian system, or a string in UTF-16 LE format in memory on a big-endian system, rather than always matching endianness.  (That's because files saved on disk can be in either byte order, and the system that loads them has to deal with both.)

After thinking about it, though, I realized that you'd be crazy to store UTF-16 in memory in a byte order that didn't match the underlying architecture, since you'd just create a lot of extra work for yourself.  :)
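
For the on-disk case, a hypothetical sketch (the helper is illustrative, not WebKit code) of how a loader could use the byte-order mark to normalize file bytes to the CPU's byte order before treating them as JSChar values:

    // Sketch: normalize UTF-16 bytes from disk to native byte order using
    // the BOM (U+FEFF). A real loader would also handle a missing BOM, an
    // odd byte count, and allocation failure.
    #include <JavaScriptCore/JavaScriptCore.h>
    #include <stdlib.h>
    #include <string.h>

    JSChar* utf16BytesToNativeOrder(const unsigned char* bytes, size_t byteCount, size_t* outCount)
    {
        size_t count = byteCount / 2;
        JSChar* chars = malloc(count * sizeof(JSChar));
        memcpy(chars, bytes, count * sizeof(JSChar));

        // A BOM that reads back as 0xFFFE means the file's byte order is
        // the opposite of the CPU's, so every code unit needs a swap.
        if (count && chars[0] == 0xFFFE) {
            for (size_t i = 0; i < count; ++i)
                chars[i] = (JSChar)((chars[i] >> 8) | (chars[i] << 8));
        }

        *outCount = count;
        return chars; // chars[0] is now 0xFEFF; callers will likely skip it
    }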
Comment 11 WebKit Commit Bot 2016-06-14 15:54:24 PDT
Comment on attachment 281290 [details]
Patch v3

Clearing flags on attachment: 281290

Committed r202069: <http://trac.webkit.org/changeset/202069>
Comment 12 WebKit Commit Bot 2016-06-14 15:54:29 PDT
All reviewed patches have been landed.  Closing bug.