<div>On Sun, Mar 11, 2012 at 1:33 PM, Jason Dusek <<a href="mailto:jason.dusek@gmail.com">jason.dusek@gmail.com</a>> wrote:</div><div><br></div><div>> 2012/3/11 Jeremy Shaw <<a href="mailto:jeremy@n-heptane.com">jeremy@n-heptane.com</a>>:</div>
<div>> > Also, URIs are not defined in terms of octets.. but in terms</div><div>> > of characters. If you write a URI down on a piece of paper --</div><div>> > what octets are you using? None.. it's some scribbles on a</div>
<div>> > paper. It is the characters that are important, not the bit</div><div>> > representation.</div><div>></div><div><br></div><div><br></div><div>To quote RFC1738:</div><div><br></div><div> URLs are sequences of characters, i.e., letters, digits, and special</div>
<div> characters. A URL may be represented in a variety of ways: e.g., ink</div><div> on paper, or a sequence of octets in a coded character set. The</div><div> interpretation of a URL depends only on the identity of the</div>
<div> characters used.</div><div><br></div><div><br></div><div>Well, to quote one example from RFC 3986:</div><div>></div><div>> 2.1. Percent-Encoding</div><div>></div><div>> A percent-encoding mechanism is used to represent a data octet in a</div>
<div>> component when that octet's corresponding character is outside the</div><div>> allowed set or is being used as a delimiter of, or within, the</div><div>> component.</div><div>></div><div><br></div>
<div>Right. This describes how to convert an octet into a sequence of</div><div>characters, since the only thing that can appear in a URI is a sequence of</div><div>characters.</div><div><br></div><div><br></div><div>> The syntax of URIs is a mechanism for describing data octets,</div>
<div>> not Unicode code points. It is at variance to describe URIs in</div><div>> terms of Unicode code points.</div><div><br></div><div><br></div><div>Not sure what you mean by this. As the RFC says, a URI is defined entirely</div>
<div>by the identity of the characters that are used. There is definitely no</div><div>single, correct byte sequence for representing a URI. If I give you a</div><div>sequence of bytes and tell you it is a URI, the only way to decode it is to</div>
<div>first know what encoding the byte sequence represents.. ascii, utf-16, etc.</div><div>Once you have decoded the byte sequence into a sequence of characters, only</div><div>then can you parse the URI.</div><div><br></div>
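<div>To make the "decode first, then parse" point concrete, here is a minimal Haskell sketch (base only; the names and the deliberately naive decoders are illustrative, not a real API):</div><div><br></div>

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- The same URI characters, serialized as octets in two different encodings.
uri :: String
uri = "http://example.org/"

asciiOctets :: [Word8]
asciiOctets = map (fromIntegral . ord) uri

-- UTF-16BE: each ASCII character becomes two octets, 0x00 then the code.
utf16beOctets :: [Word8]
utf16beOctets = concatMap (\c -> [0, fromIntegral (ord c)]) uri

-- To recover the URI we must first pick the decoder matching the encoding;
-- only then does parsing the character sequence make sense.
decodeAscii :: [Word8] -> String
decodeAscii = map (chr . fromIntegral)

decodeUtf16be :: [Word8] -> String
decodeUtf16be (0 : lo : rest) = chr (fromIntegral lo) : decodeUtf16be rest
decodeUtf16be _               = []
```

<div>The two octet sequences differ, yet both decode to the same character sequence, and it is that character sequence that gets parsed as a URI.</div>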
<div><br></div><div>> > If you render a URI in a utf-8 encoded document versus a</div><div>> > utf-16 encoded document.. the octets will be diffiFor example, let's sa=</div><div>y</div><div>> that we have a unicode string and we want to use it in the URI path.</div>
<div>></div><div>> > the meaning will be the same. Because it is the characters</div><div>> > that are important. For a URI Text would be a more compact</div><div>> > representation than String.. but ByteString is a bit dodgy</div>
<div>> > since it is not well defined what those bytes represent.</div><div>> > (though if you use a newtype wrapper around ByteString to</div><div>> > declare that it is Ascii, then that would be fine).</div>
<div>></div><div>> This is all well and good for what a URI is parsed from</div><div>> and what it is serialized to; but once parsed, the major</div><div>> components of a URI are all octets, pure and simple.</div>
<div><br></div><div><br></div><div>Not quite. We can not, for example, change uriPath to be a ByteString and</div><div>decode any percent encoded characters for the user, because that would</div><div>change the meaning of the path and break applications.</div>
<div><br></div><div>For example, let's say we have the path segments ["foo", "bar/baz"] and we</div><div>wish to use them in the path info of a URI. Because / is a special</div><div>character, it must be percent encoded as %2F. So, the path info for the url</div>
<div>would be:</div><div><br></div><div> foo/bar%2Fbaz</div><div><br></div><div>If we had the path segments ["foo","bar","baz"], however, that would be</div><div>encoded as:</div><div><br></div>
<div> foo/bar/baz</div><div><br></div><div>Now let's look at decoding the path. If we simply decode the percent</div><div>encoded characters and give the user a ByteString then both urls will</div><div>decode to:</div>
<div><br></div><div> pack "foo/bar/baz"</div><div><br></div><div>Which is incorrect. ["foo", "bar/baz"] and ["foo","bar","baz"] represent</div><div>different paths. The percent encoding there is required to distinguish</div>
<div>between the two unique paths.</div><div><br></div><div>Let's look at another example. Let's say we want to encode the path</div><div>segments:</div><div><br></div><div> ["I❤λ"]</div><div>
<br></div><div>How do we do that?</div><div><br></div><div>Well.. the RFCs do not mandate a specific way. While a URL is a sequence of</div><div>characters -- the set of allowed characters is pretty restricted. So, we must</div>
<div>use some application specific way to transform that string into something</div><div>that is allowed in a uri path. We could do it by converting all characters</div><div>to their unicode character numbers like:</div><div>
<br></div><div> "u73u2764u03BB"</div><div><br></div><div>Since the string now only contains acceptable characters, we can easily</div><div>convert it to a valid uri path. Later when someone requests that url, our</div>
<div>application can convert it back to a unicode character sequence.</div><div><br></div><div>Of course, no one actually uses that method. The commonly used (and I</div><div>believe, officially endorsed, but not required) method is a bit more</div>
<div>complicated.</div><div><br></div><div> 1. first we take the string "I❤λ" and utf-8 encode it to get an octet</div><div>sequence:</div><div><br></div><div> 49 e2 9d a4 ce bb</div><div>
<br></div><div> 2. next we percent encode the bytes to get *back* to a character sequence</div><div>(such as a String, Text, or Ascii)</div><div><br></div><div> "I%E2%9D%A4%CE%BB"</div><div><br></div><div>So, that is character sequence that would appear in the URI. *But* we do</div>
<div>not yet have octets that we can transmit over the internet. We only have a</div><div>sequence of characters. We must now convert those characters into octets.</div><div>For example, let's say we put the url as an 'href' in an &lt;a&gt; tag in a web</div>
<div>page that is UTF-16 encoded.</div><div><br></div><div> 3. Now we must convert the character sequence to a (big endian) utf-16</div><div>octet sequence:</div><div><br></div><div> 00 49 00 25 00 45 00 32 00 25 00 39 00 44 00 25 00 41 00 34 00 25 00 43 00</div>
<div>45 00 25 00 42 00 42</div><div><br></div><div> So those are the octets that actually get embedded in the utf-16 encoded</div><div>.html document and transmitted over the net.</div><div><br></div><div> 4. the browser then decodes the utf-16 web page and gets back the sequence</div>
<div>of characters:</div><div><br></div><div> "I%E2%9D%A4%CE%BB"</div><div><br></div><div> Note that here the browser has a sequence of characters -- we know nothing</div><div>about how those bytes are represented internally by the browser. If the</div>
<div>browser was written in Haskell it might be String or Text.</div><div><br></div><div> Now let's say the browser wants to request the URL. It *must* encode the</div><div>url as ASCII (as per the spec).</div><div><br>
</div><div> 5. So, the browser encodes the string as the octet sequence</div><div><br></div><div> 49 25 45 32 25 39 44 25 41 34 25 43 45 25 42 42</div><div><br></div><div> 6. The server can now decode that sequence of octets back into a sequence</div>
<div>of characters:</div><div><br></div><div> "I%E2%9D%A4%CE%BB"</div><div><br></div><div> Now, the low-level Network.URI library can not really do much more than</div><div>that, because it does not know what those octets are really supposed to</div>
<div>mean (see the / example above).</div><div><br></div><div> 7. the application specific code, however, knows that it should now first</div><div>split the path on any / characters to get</div><div><br></div><div> ["I%E2%9D%A4%CE%BB"]</div>
<div><br></div><div> 8. next it should percent decode each path segment to get a ByteString</div><div>sequence:</div><div><br></div><div> 49 e2 9d a4 ce bb</div><div><br></div><div> 9. And now it can utf-8 decode that octet sequence to get a unicode character</div>
<div>sequence:</div><div><br></div><div> I❤λ</div><div><br></div><div>So... the basic gist is that if you have unicode characters embedded in an html</div><div>document, they will generally be encoded *three* different times. (First</div>
<div>the unicode characters are converted to a utf-8 byte sequence, then the</div><div>byte sequence is percent encoded, and then the percent encoded character</div><div>sequence is encoded as another byte sequence). But, applications can choose</div>
<div>to use other methods as well.</div><div><br></div><div>In terms of applicability to the URI type.. uriPath :: ByteString</div><div>definitely does not work. It is possible that uriPath :: [ByteString] might</div><div>
work... assuming / is the only special character we need to worry about in</div><div>the uriPath. But, doing all the breaking on '/' and the percent decoding</div><div>may not be required for many applications. So, choosing to always do the</div>
<div>extra work raises some concerns.</div><div><br></div><div>Also, even with, uriPath :: [ByteString], we are losing some information.</div><div>The browser is free to percent encode characters -- even if it is not</div>
<div>required. For example the browser could request:</div><div><br></div><div> "hello"</div><div><br></div><div>Or it could request:</div><div><br></div><div> "%68%65%6c%6c%6f"</div><div><br></div><div>
In this case the *meaning* is the same. So, doing the decoding is less</div><div>problematic. But I wonder if there might still be cases where we want</div><div>to distinguish between those two requests?</div><div>
<br></div><div>hope this helps.</div><div><br></div><div>- jeremy</div><div><br></div>