URL normalization is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized or canonical URL so it is possible to determine if two syntactically different URLs may be equivalent.
Our normalization uses only transformations that preserve semantics. Normalize the given URL using the following rules, and only these rules; they differ slightly from RFC 3986.
1. Converting the scheme and host to lower case.
HTTP://www.Example.com/ → http://www.example.com/
2. Capitalizing letters in escape sequences.
All letters within a percent-encoding triplet (e.g., "%3B") are case-insensitive, and should be capitalized.
http://www.example.com/a%c2%b1b → http://www.example.com/a%C2%B1b
3. Decoding percent-encoded octets of unreserved characters. For consistency, percent-encoded octets in the ranges
of ALPHA (%41–%5A and %61–%7A), DIGIT (%30–%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E)
should not be created by URI producers and, when found in a URI, should be decoded to their corresponding
unreserved characters by URI normalizers.
http://www.example.com/%7Eusername/ → http://www.example.com/~username/
4. Removing the default port. The default port (port 80 for the “http” scheme) should be removed from a URL.
http://www.example.com:80/bar.html → http://www.example.com/bar.html
5. Removing dot-segments. The segments “..” and “.” can be removed from a URL according to the algorithm
described in RFC 3986 (or a similar algorithm). ".." refers to the parent directory and "." to the current directory.
http://www.example.com/a/b/../c/./d.html → http://www.example.com/a/c/d.html
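The five rules above can be sketched in Python roughly as follows. This is one possible approach, not a reference solution. The names `normalize`, `_fix_escape` and `_remove_dot_segments` are illustrative and do not come from the task. The sketch assumes every URL has the form scheme://host[:port][/path][?query][#fragment], with no user info and no escapes in the host. It also assumes escapes are handled before dot-segments are removed, and that dot-segments are removed from the path only.

```python
import re
import string

# Unreserved characters as listed in rule 3.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def _fix_escape(match):
    """Rules 2 and 3: decode an unreserved octet, otherwise upper-case the triplet."""
    char = chr(int(match.group(1), 16))
    return char if char in UNRESERVED else match.group(0).upper()


def _remove_dot_segments(path):
    """Rule 5: resolve "." and ".." segments using a stack (a simplified
    variant of the RFC 3986 algorithm, assumed sufficient for valid input)."""
    segments = path.split("/")
    output = []
    for segment in segments:
        if segment == ".":
            continue
        if segment == "..":
            # Keep the leading empty string that stands for the root "/".
            if len(output) > 1:
                output.pop()
            continue
        output.append(segment)
    # A trailing "." or ".." leaves a trailing slash, e.g. "/a/." -> "/a/".
    if segments[-1] in (".", ".."):
        output.append("")
    return "/".join(output)


def normalize(url):
    # Assumption: the URL always contains "://" after the scheme.
    scheme, authority, path, tail = re.match(
        r"([^:/?#]+)://([^/?#]*)([^?#]*)(.*)", url, re.S
    ).groups()

    # Rule 1: lower-case the scheme and the host (authority).
    scheme, authority = scheme.lower(), authority.lower()

    # Rule 4: drop the default port for "http".
    if scheme == "http" and authority.endswith(":80"):
        authority = authority[:-3]

    def fix(text):
        # Rules 2 and 3 applied to every percent-encoded triplet.
        return re.sub(r"%([0-9A-Fa-f]{2})", _fix_escape, text)

    # Rule 5 is applied to the path only, after escapes are handled.
    path = _remove_dot_segments(fix(path))
    return scheme + "://" + authority + path + fix(tail)
```

For example, `normalize("http://www.example.com:80/a/b/../c/./d.html")` returns `"http://www.example.com/a/c/d.html"`. Because escapes are decoded before dot-segments are removed, an encoded `%2E%2E` segment is treated as `..`; whether the task's checker expects this is an assumption.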
Input: URL, a Unicode string.
Output: Normalized URL, a string.
checkio("Http://Www.Checkio.org") == "http://www.checkio.org"
checkio("http://www.checkio.org/%cc%b1bac") == "http://www.checkio.org/%CC%B1bac"
checkio("http://www.checkio.org/task%5F%31") == "http://www.checkio.org/task_1"
checkio("http://www.checkio.org:80/home/") == "http://www.checkio.org/home/"
checkio("http://www.checkio.org:8080/home/") == "http://www.checkio.org:8080/home/"
checkio("http://www.checkio.org/task/./1/../2/././name") == "http://www.checkio.org/task/2/name"
How it is used: URL normalization is useful in parsing and analytical processing. It is required when you need to compare URLs or when you run a system that is case-sensitive.
Precondition: All input URLs are valid.