<br><br><div class="gmail_quote">On Mon, Mar 22, 2010 at 4:20 PM, Jeremy Shaw <span dir="ltr">&lt;<a href="mailto:jeremy@n-heptane.com">jeremy@n-heptane.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div class="im">On Sun, Mar 21, 2010 at 12:04 AM, Michael Snoyman <span dir="ltr">&lt;<a href="mailto:michael@snoyman.com" target="_blank">michael@snoyman.com</a>&gt;</span> wrote:</div><div class="gmail_quote"><div class="im">

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="gmail_quote"><div>That made perfect sense, thank you for doing such thorough research on this.</div><div><br></div><div>I&#39;ve attached two files; test1.html is UTF-8 encoded, test3.html is windows-1255 (Hebrew). On my system, both links point to the same location, implying to me that you are spot on that UTF-8 should always be used for URLs. I had made a mistake with my test on Friday; apparently we only have the encoding issue with the query string.</div>


</div></blockquote><div><br></div></div><div>Hmm. Those files do not contain valid urls. The strings in the hrefs contain characters that are not in the limited set allowed by the URI spec. The part that is true is that even though the files have different encodings (utf-8 vs windows-1255) the characters in the strings are the same, so the urls are the same. I guess maybe the reason you put in invalid characters is because it is hard to test whether different encodings matter if you are only testing characters that are represented by the same octets in both encodings. </div>


<div><br></div></div></blockquote><div>Well, you guessed correctly at my reason for constructing the files as I did. Not that this is actually relevant to the discussion at hand, but I believe that it is valid HTML to put values in the HREF fields that are not in the appropriate character range and assume the web browser will take care of things. &lt;/off-topic&gt;</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="gmail_quote"><div></div><div>Regarding your encoding issue with the query string: I believe there may have been &#39;nothing wrong&#39;. At the URI level there is no specification as to how the query string is to be interpreted, or what underlying charset it should be associated with. It does have the requirement that it can only contain a limited set of characters, and that other characters must be converted to octets and then percent encoded.</div>


<div><br></div><div>Now, things get interesting when you look at forms and application/x-www-form-urlencoded. When you create a form you have a form element that looks something like this:</div><div><br></div><div>&lt;form action=&quot;/submit&quot; method=POST enctype=&quot;application/x-www-form-urlencoded;charset=utf-8&quot;&gt;...&lt;/form&gt;</div>


<div><br></div><div>Except Internet Explorer and a bunch of servers get stupid if you actually set charset=utf-8. So the de facto standard is that the form is submitted using the same character encoding as the page it came from. So if the &lt;head&gt; contains &lt;meta charset=&quot;windows-1255&quot;&gt;, then the form data will be encoded as windows-1255, converted to octets, and then percent encoded, plus the other things that url encoding does (such as + for spaces). You can also add accept-charset=&quot;utf-8&quot; to the form if you want to override the default and have the form submit in some other character encoding. Not sure how widely supported that is.</div>


<div><br></div><div>Now, if we were to change the method=POST to method=GET, then the urlencoded data would be passed as a query string, with its windows-1255 encoded payload. And that is perfectly valid.</div><div><br></div>


<div>So, the choice of how to encode the pathInfo and query string is pretty much application specific. For the URLT stuff we are both generating and parsing the path components, so we can choose whatever encoding we want -- with utf-8 being a good choice.  </div>

</div></blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="gmail_quote"><div class="im"><div> </div></div></div></blockquote><div>I agree; the issue of query-string encoding not being under our control is further reason to discourage its inclusion in URLT.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="gmail_quote"><div class="im"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="gmail_quote"><div></div><div>Now, back to your point: I&#39;m not sure why you want to include the query string and fragment as part of the URL. Regarding the fragment: it will never be passed to the server, so it&#39;s *impossible* to consider it for parsing URLs. I understand that you might want to generate URLs with a fragment, but we would then need to have parse and render functions which do not parallel each other properly.</div>


</div></blockquote><div><br></div></div><div>Right. I forgot about how fragments actually work.</div><div class="im"><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="gmail_quote">

<div>Regarding the query string, I can see more of an argument being made to include it, but it feels wrong to me. Precedence in most places does not allow you to route requests based on the query string, and this seems like a Good Idea. I know it would be nice to be guaranteed that there is a certain GET parameter present, but I really think this should be dealt with at the handler level.</div>


</div></blockquote><div><br></div></div><div>What do you mean by &#39;precedence&#39; ?</div><div><br></div></div></blockquote><div>I mean I&#39;ve never seen a system that allows routing based on the query string. In PHP, you create files that match the pathinfo; in Django, you match regexes on the path info; I believe the same is true for Rails. This isn&#39;t a proof that this is the Right Thing, merely an observation.</div>

<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="gmail_quote"><div></div><div>Including query string in urlt is certainly nice for some contexts. For example:</div>

<div><br></div><div>data UserURL = AllUsers SortOrder</div>

<div><br></div><div>data SortOrder = Asc | Desc</div><div><br></div><div>Here the sort order is required. But the sort order does not really add hierarchy to the system, so it belongs more in the query string and less in the path. We might want a URL like:</div>


<div><br></div><div>/allusers?sortOrder=asc</div><div><br></div></div></blockquote><div>On the other hand, those two possible URLs are not really *unique resources* (to use more RESTful terminology). The sortOrder is not really specifying *what* to return, just *how* to return it. Most well-designed URL schemes would work that way. The badly designed ones, like /user.php?id=5&amp;name=michael&amp;... shouldn&#39;t really be considered I think.</div>

<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="gmail_quote"><div></div><div>Now let&#39;s say we wrap that up in a larger site:</div><div>

<br></div><div>data SiteURL = Users UserURL </div><div><br></div><div>The Users constructor is adding hierarchy, so it shouldn&#39;t be modifying the query string. So it will just add something like:</div>

<div><br></div><div>/users/allusers?sortOrder=asc</div><div><br></div><div>So only the last component gets to add a query string. </div><div><br></div></div></blockquote><div>Not quite sure how we should enforce something like that.</div>
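<div><br></div><div>One possible way to get that behaviour, as a rough sketch only: render each URL type to an intermediate value holding path segments and query parameters, and give wrapping constructors a helper that can only prepend a path segment. The Rendered type and the nest, renderUserURL, renderSiteURL and renderString names are hypothetical, not part of urlt, and percent-escaping is left out to keep the example short.</div>

```haskell
module Main where

import Data.List (intercalate)

data SortOrder = Asc | Desc
data UserURL = AllUsers SortOrder
data SiteURL = Users UserURL

-- Hypothetical intermediate form: path segments plus query parameters.
data Rendered = Rendered
  { pathSegments :: [String]
  , queryParams :: [(String, String)]
  }

-- The only operation a wrapping constructor uses: add one level of
-- hierarchy in front of the path and leave the query string untouched.
nest :: String -> Rendered -> Rendered
nest segment r = r { pathSegments = segment : pathSegments r }

-- The innermost URL type decides both its path and its query string.
renderUserURL :: UserURL -> Rendered
renderUserURL (AllUsers order) = Rendered ["allusers"] [("sortOrder", sortParam order)]
  where
    sortParam Asc = "asc"
    sortParam Desc = "desc"

-- The Users constructor only adds hierarchy.
renderSiteURL :: SiteURL -> Rendered
renderSiteURL (Users u) = nest "users" (renderUserURL u)

-- Flatten to a string. No escaping is done here; segments and parameters
-- are assumed to be URL-safe already.
renderString :: Rendered -> String
renderString (Rendered segments params) = concatMap ('/' :) segments ++ query
  where
    query
      | null params = ""
      | otherwise = '?' : intercalate "&" (map pair params)
    pair (k, v) = k ++ "=" ++ v

main :: IO ()
main = putStrLn (renderString (renderSiteURL (Users (AllUsers Asc))))
```

<div>Running this prints /users/allusers?sortOrder=asc. Note that this is enforcement by convention only: nothing stops a wrapper from building a Rendered by hand, so the constructor would have to be hidden behind a module boundary to make it a real guarantee.</div>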

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="gmail_quote"><div></div><div>The big trip up would be forms with method GET. The form submission is handled by taking the form data set, encoding it as application/x-www-form-urlencoded, and then appending ? and the encoded data to the end of the action. If the action already contained a ?, that would not work out.</div>


<div><br></div></div></blockquote><div>You can&#39;t have a URL containing a ?; the closest you can come is a URL containing an *escaped* ?, which will simply be absorbed by the [String] piece of the URL. Unless I&#39;m missing your point here.</div>

<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="gmail_quote"><div></div><div>So, the toUrl / fromUrl instances would have to know if the url was going to be used as the target for an action and prohibit the use of a query string. That could be tricky :-/</div>

<div><br></div><div>

Also, in my example, I am handling parameters that are url specific. But many sites might have some sort of global parameters that can be tacked on to every query string. Not really sure how that would work out either.</div>

<div class="im">

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="gmail_quote"><div>If we can agree on this, I don&#39;t see a necessity to rely on an external package to provide the URL datatype (since we would just be using [String]). I can provide the encodeURL/decodeURL functions in web-encodings if that&#39;s acceptable; your implementation seems correct to me. However, since it does not function on fully-qualified URLs, perhaps we should call it encodePathInfo/decodePathInfo?</div>


</div></blockquote><div><br></div></div><div>encodePathInfo  / decodePathInfo is probably a good choice of names. Adding them to web-encodings is likely useful, but I will just use local copies in urlt, because web-encodings brings in too many extra dependencies that I don&#39;t want at that level.  I don&#39;t think I will export them though, so it should not cause a conflict.</div>
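<div><br></div><div>For illustration, here is a minimal sketch of what such a pair could look like over [String], using only base. It is not the code from urlt or web-encodings: the helper names (isUnreserved, utf8Encode, unescape, utf8Decode, splitSlash) are made up for this example, it assumes UTF-8 for the octets, and it deliberately escapes more than the URI grammar requires.</div>

```haskell
module Main where

import Data.Char (chr, digitToInt, intToDigit, isAlphaNum, isAscii, isHexDigit, ord, toUpper)

-- Characters left unescaped. Deliberately conservative: everything else,
-- including '/', '?' and '#', is percent-encoded.
isUnreserved :: Char -> Bool
isUnreserved c = or [and [isAscii c, isAlphaNum c], c `elem` "-_.~"]

-- UTF-8 octets of one code point, using arithmetic rather than bit operations.
utf8Encode :: Char -> [Int]
utf8Encode c
  | n > 0xFFFF = [0xF0 + n `div` 0x40000, cont (n `div` 0x1000), cont (n `div` 0x40), cont n]
  | n > 0x7FF = [0xE0 + n `div` 0x1000, cont (n `div` 0x40), cont n]
  | n > 0x7F = [0xC0 + n `div` 0x40, cont n]
  | otherwise = [n]
  where
    n = ord c
    cont x = 0x80 + x `mod` 0x40

-- One octet as %HH with upper-case hex digits.
escapeOctet :: Int -> String
escapeOctet o = ['%', hex (o `div` 16), hex (o `mod` 16)]
  where
    hex = toUpper . intToDigit

encodeSegment :: String -> String
encodeSegment = concatMap encodeChar
  where
    encodeChar c
      | isUnreserved c = [c]
      | otherwise = concatMap escapeOctet (utf8Encode c)

-- Each segment is escaped and prefixed with '/'.
encodePathInfo :: [String] -> String
encodePathInfo = concatMap (('/' :) . encodeSegment)

-- Undo percent-escapes, yielding octets. Malformed escapes and raw
-- non-ASCII characters are rejected.
unescape :: String -> Maybe [Int]
unescape [] = Just []
unescape ('%' : h : l : rest)
  | and [isHexDigit h, isHexDigit l] =
      fmap ((digitToInt h * 16 + digitToInt l) :) (unescape rest)
unescape ('%' : _) = Nothing
unescape (c : rest)
  | isAscii c = fmap (ord c :) (unescape rest)
  | otherwise = Nothing

-- Lenient UTF-8 decoder: rejects truncated or out-of-range sequences but
-- does not check for overlong encodings or surrogates.
utf8Decode :: [Int] -> Maybe String
utf8Decode [] = Just []
utf8Decode (b : bs)
  | b > 0xF7 = Nothing
  | b > 0xEF = multi 3 (b - 0xF0)
  | b > 0xDF = multi 2 (b - 0xE0)
  | b > 0xBF = multi 1 (b - 0xC0)
  | b > 0x7F = Nothing -- stray continuation octet
  | otherwise = fmap (chr b :) (utf8Decode bs)
  where
    isCont x = and [x > 0x7F, 0xC0 > x]
    multi k lead
      | or [length cont /= k, not (all isCont cont), value > 0x10FFFF] = Nothing
      | otherwise = fmap (chr value :) (utf8Decode rest)
      where
        (cont, rest) = splitAt k bs
        value = foldl (\acc x -> acc * 0x40 + (x - 0x80)) lead cont

decodeSegment :: String -> Maybe String
decodeSegment s = unescape s >>= utf8Decode

-- Split on literal '/' only; an escaped %2F stays inside its segment.
splitSlash :: String -> [String]
splitSlash s = case break (== '/') s of
  (seg, []) -> [seg]
  (seg, _ : rest) -> seg : splitSlash rest

decodePathInfo :: String -> Maybe [String]
decodePathInfo "" = Just []
decodePathInfo ('/' : rest) = mapM decodeSegment (splitSlash rest)
decodePathInfo _ = Nothing

main :: IO ()
main = do
  let segments = ["users", "a/b", "\1513\1500\1493\1501"]
      encoded = encodePathInfo segments
  putStrLn encoded
  print (decodePathInfo encoded == Just segments)
```

<div>The point of the example is the round trip: a segment containing a literal / or Hebrew text is escaped on the way out and recovered unchanged on the way in, so no String value has special meaning. Segments equal to . or .. are passed through unescaped here, which is the open question raised further down.</div>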


<div><br></div></div></blockquote><div>I have no problem with that decision, but out of curiosity which dependencies are problematic? The only non-HP packages are failure, safe, text and wai. The only ones which could in theory be eliminated are failure and safe; if there is desire for me to do so, I&#39;ll look into it.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="gmail_quote"><div></div><div>Also, my implementation is not quite right. It escapes more characters than is strictly required. Path segments have the following ABNF:</div>

<div><br></div><div><div>      path_segments = segment *( &quot;/&quot; segment )</div>

<div>      segment       = *pchar *( &quot;;&quot; param )</div><div>      param         = *pchar</div><div><br></div><div>      pchar         = unreserved | escaped |</div><div>                      &quot;:&quot; | &quot;@&quot; | &quot;&amp;&quot; | &quot;=&quot; | &quot;+&quot; | &quot;$&quot; | &quot;,&quot;</div>
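<div><br></div><div>Read as a predicate, that grammar could be sketched as below. This assumes the RFC 2396 definitions of unreserved (alphanumerics plus the mark characters) and escaped (% followed by two hex digits), which the ABNF above appears to come from. The function names are made up for this example.</div>

```haskell
module Main where

import Data.Char (isAlphaNum, isAscii, isHexDigit)

-- unreserved = alphanum | mark, with mark as listed in RFC 2396.
isUnreserved :: Char -> Bool
isUnreserved c = or [and [isAscii c, isAlphaNum c], c `elem` "-_.!~*'()"]

-- Single characters allowed as pchar; the "escaped" alternative spans
-- three characters and is handled in isValidSegment.
isPCharLiteral :: Char -> Bool
isPCharLiteral c = or [isUnreserved c, c `elem` ":@&=+$,"]

-- segment = *pchar *( ";" param ), param = *pchar.
-- In effect: any run of pchars, escapes and ';' separators.
isValidSegment :: String -> Bool
isValidSegment [] = True
isValidSegment ('%' : h : l : rest) = and [isHexDigit h, isHexDigit l, isValidSegment rest]
isValidSegment ('%' : _) = False
isValidSegment (c : rest) = and [or [isPCharLiteral c, c == ';'], isValidSegment rest]

main :: IO ()
main = mapM_ (print . isValidSegment) ["a:b@c", "name;v=1", "a%2Fb", "a/b", "a b", "a%2"]
```

<div>The first three inputs are accepted and the last three rejected. An encoder that escapes only characters failing isPCharLiteral would produce the minimal escaping the grammar allows, though it would then also need to escape ; if parameters are not meant to be expressible.</div>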


<div><br></div><div>Also, . and .. are allowed in a path segment, but have special meaning. Not sure what we want to do about those. I like the property that *any* String value is automatically escaped and has no special meaning. So the same should be true for &#39;.&#39; and &#39;..&#39;. But if you do need to use &#39;.&#39; and &#39;..&#39; for some reason, there is no mechanism to do it in the current system. Though I am not sure what a compelling use case would be, so I am ok with just not allowing them for now. </div>


</div><div><br></div></div>

</blockquote></div>I&#39;m not sure if they have meaning at the HTTP level. At the HTML level, they specify relative paths, but I don&#39;t think they mean anything once it enters HTTP.<div><br></div><div>Michael</div>