Rendering URLs Considered Hard
One of the ways I judge the maturity of a web framework is by how it renders URLs. Is it just slamming strings together, or is it doing «The Right Thing»™?
Correctly rendering a URL is more involved than one might naïvely assume and includes several steps. Cutting corners is likely to cause trouble with non-ASCII content at some point. If you think you’ll only ever use ASCII, then maybe Hüttenkäse can convince you otherwise.
Break the URL Into Subcomponents
First you have to break the URL into subcomponents. If you treat a URL as an opaque string, you don’t have enough contextual information to render the individual parts correctly; we learned this the hard way. An & in a path element has to be treated differently from the & separating parameters. So if you start from a string, you first have to parse it into a high-level URL object.
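To make the path-versus-query distinction concrete, here is a minimal sketch using Python's standard `urllib.parse`. The URL is a made-up example, and the tuple returned by `urlsplit` merely stands in for whatever URL object your framework provides.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical URL: the "&" in the path is data,
# the "&" in the query string is a separator.
url = "http://example.com/fish&chips/menu?side=peas&drink=tea"

parts = urlsplit(url)

# Split the path into segments; the leading "/" yields an empty first item.
path_segments = parts.path.split("/")[1:]

# Split the query into (name, value) pairs.
query_params = parse_qsl(parts.query)

print(parts.hostname)   # example.com
print(path_segments)    # ['fish&chips', 'menu']
print(query_params)     # [('side', 'peas'), ('drink', 'tea')]
```

Once the pieces are separate, each one can be encoded by the rules that apply to its position in the URL.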
Translate to Octets
This is the step that most often gets ignored or done wrong. You cannot percent-encode a character directly, only bytes, aka octets. This means you first have to pick the right character encoding. So which one is it? RFC 1738 assumes you’ll only use US-ASCII, RFC 2396 says you’re free to use whatever you want, which in practice was probably often either ISO-8859-1 or CP1252 (which are not the same, BTW), and RFC 3986 says UTF-8. What now? Use whatever your server uses to decode the GET parameters. This is often the same as the page encoding, but not always.
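A small sketch of why the choice of encoding matters: the same characters turn into different octets depending on the encoding. The two encodings here are picked only for illustration; which one is right depends on your server.

```python
text = "Hüttenkäse"

# Same characters, different octets depending on the encoding.
utf8_octets = text.encode("utf-8")
latin1_octets = text.encode("iso-8859-1")

print(utf8_octets)    # b'H\xc3\xbcttenk\xc3\xa4se'
print(latin1_octets)  # b'H\xfcttenk\xe4se'
print(len(utf8_octets), len(latin1_octets))  # 12 10
```

If the server decodes the octets with a different encoding than the one used here, the ü and ä come out garbled.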
In addition, for the domain name you have to use Punycode. This involves so much Unicode processing that if you’re not on Java or .NET you have to use ICU.
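As an illustration of the domain name step, here is a sketch using Python's built-in `idna` codec. Note the assumptions: the host name is invented, and the built-in codec implements the older IDNA 2003 rules, so treat it as a demonstration of the idea rather than a recommendation.

```python
# Hypothetical internationalized host name.
host = "hüttenkäse.example"

# Each non-ASCII label is converted to an ASCII "xn--" label (Punycode).
ascii_host = host.encode("idna")

print(ascii_host)                 # an ASCII-only byte string starting with xn--
print(ascii_host.decode("idna"))  # hüttenkäse.example
```

Only the labels that contain non-ASCII characters are rewritten; the plain `example` label passes through unchanged.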
Percent Encode
Most people know about this step: everything that’s not “safe” has to be turned into %hex-value. Either here or in the next step you have to change from the octets you got in the previous step back into characters.
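A minimal sketch of this step with `urllib.parse.quote`, which accepts octets and returns a character string, so it also covers the change back from octets to characters. Passing `safe=""` is my choice for a single path segment, so that even `/` and `&` get escaped.

```python
from urllib.parse import quote

# Octets from the previous step (UTF-8 assumed here).
segment_octets = "Hüttenkäse".encode("utf-8")

# quote() takes bytes and returns a str.
encoded_segment = quote(segment_octets, safe="")
print(encoded_segment)  # H%C3%BCttenk%C3%A4se

# An "&" inside a path segment is data and gets escaped too.
encoded_ampersand = quote("fish&chips".encode("utf-8"), safe="")
print(encoded_ampersand)  # fish%26chips

# A different encoding in step two gives a different URL.
encoded_latin1 = quote("Hüttenkäse".encode("iso-8859-1"), safe="")
print(encoded_latin1)  # H%FCttenk%E4se
```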
HTML Escape
URLs, like every other piece of page content, have to be properly HTML escaped. This especially means turning & into &amp;. Forgetting this is one of the main causes of invalid HTML.
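The escaping step can be sketched with `html.escape` from the Python standard library. The URL and the link text are made up for the example.

```python
import html

# Hypothetical, already percent-encoded URL with two query parameters.
url = "/search?q=H%C3%BCttenk%C3%A4se&lang=de"

# Escape the URL before placing it into an attribute value.
escaped_url = html.escape(url, quote=True)
link = '<a href="{}">search</a>'.format(escaped_url)

print(link)
# <a href="/search?q=H%C3%BCttenk%C3%A4se&amp;lang=de">search</a>
```

The percent escapes pass through untouched; only the & separating the parameters changes.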
Response Encode
And finally, like every other piece of page content, the URL has to be encoded using the page encoding. This can be the same encoding as in step two, but doesn’t have to be.
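Here is a sketch of a page where the two encodings differ: the URL was percent-encoded from UTF-8 octets, while the page itself goes out as ISO-8859-1. Both choices are assumptions made for the example.

```python
# The href was percent-encoded from UTF-8 octets in the earlier steps;
# the link text still contains raw non-ASCII characters.
page = '<a href="/H%C3%BCttenk%C3%A4se">Hüttenkäse</a>'

# Assumed page encoding, deliberately different from the URL encoding.
page_encoding = "iso-8859-1"

body = page.encode(page_encoding)
content_type = "text/html; charset=" + page_encoding

print(content_type)  # text/html; charset=iso-8859-1
print(body)          # b'<a href="/H%C3%BCttenk%C3%A4se">H\xfcttenk\xe4se</a>'
```

The percent-encoded URL is plain ASCII at this point, so it survives any ASCII-compatible page encoding unchanged; only the link text depends on it.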
Further Reading
- useBodyEncodingForURI in the Tomcat configuration
- Tomcat Character Encoding FAQ
- java.net.URLEncoder#encode(String)
- UTF-8 Sampler