Friendly URLs and Character Encoding


  • Darren Gordon
    started a blog post Friendly URLs and Character Encoding

    As implied by the name, vBulletin 4 Friendly URLs allow you to use 'friendlier' URLs for the links on your Forums, Blogs and CMS. Instead of a script name followed by obscure resource ids, friendly URLs are more informative to your users and give more clues to search engines about the nature of your content.

    vB4 comes with four options for Friendly URLs:
    • Standard URL:
      Code:
      http://www.example.com/showthread.php?t=42&page=3
    • Basic Friendly URL:
      Code:
      http://www.example.com/showthread.php?42-Honda-Motorbikes/page3
    • Advanced Friendly URL:
      Code:
      http://www.example.com/showthread.php/42-Honda-Motorbikes/page3
    • Rewritten Friendly URL:
      Code:
      http://www.example.com/threads/42-Honda-Motorbikes/page3
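As a rough illustration of how the four options differ, here is a minimal Python sketch that builds each form from a thread id, title and page number. It is not vBulletin's code: `slugify`, `thread_url` and the style names are hypothetical, and the slug helper only handles ASCII titles.

```python
import re

BASE = "http://www.example.com"  # placeholder host taken from the examples above


def slugify(title: str) -> str:
    """Hypothetical helper: collapse runs of non-alphanumerics into single hyphens."""
    return re.sub(r"[^A-Za-z0-9]+", "-", title).strip("-")


def thread_url(style: str, thread_id: int, title: str, page: int = 1) -> str:
    """Build a thread URL in one of the four styles listed above."""
    slug = f"{thread_id}-{slugify(title)}"
    suffix = f"/page{page}" if page > 1 else ""
    if style == "standard":
        query = f"t={thread_id}" + (f"&page={page}" if page > 1 else "")
        return f"{BASE}/showthread.php?{query}"
    if style == "basic":
        return f"{BASE}/showthread.php?{slug}{suffix}"
    if style == "advanced":
        return f"{BASE}/showthread.php/{slug}{suffix}"
    if style == "rewritten":
        return f"{BASE}/threads/{slug}{suffix}"
    raise ValueError(f"unknown style: {style}")


for style in ("standard", "basic", "advanced", "rewritten"):
    print(thread_url(style, 42, "Honda Motorbikes", page=3))
```

Only the part between the host and the resource id changes between the three friendly styles; the id and title slug stay the same.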


    Canonical URLs

    Alongside Friendly URLs, vBulletin also enforces a canonical URL for every resource, whichever URL option is currently selected. This matters if you change your board's URL setting: other websites linking to your old URLs can still reach the same resource even though the URL is out of date. It also matters for SEO, so that search engines do not split the indexing of your content over several URLs.

    So if you use Basic Friendly URLs:
    Code:
    http://www.example.com/entry.php?433-Top-10-Horror-Movies
    other websites may link to this content, and search engines will index it. If you later switch to Rewritten Friendly URLs due to a server upgrade that allows them:
    Code:
    http://www.example.com/entries/433-Top-10-Horror-Movies
    other websites may still link to your Basic Friendly URLs, and search engines will still have results for them, but the old links will remain accessible. When a user or search engine follows one, they are redirected with a "301 Moved Permanently" response that takes users to the new URL and tells search engines that the content has moved so they can update their records.
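The redirect logic can be sketched roughly as follows. This is an illustrative Python model, not vBulletin's implementation: the `ENTRIES` table stands in for the database, and `respond` and `canonical_path` are hypothetical names. It assumes the board is currently set to Rewritten Friendly URLs.

```python
import re
from typing import Optional, Tuple

# Hypothetical lookup table standing in for the database: entry id -> title slug.
ENTRIES = {433: "Top-10-Horror-Movies"}

# Accept both the Basic form (entry.php?433-...) and the Rewritten form (entries/433-...).
ID_PATTERN = re.compile(r"^/(?:entry\.php\?|entries/)(\d+)")


def canonical_path(entry_id: int) -> str:
    """Canonical form under the (assumed) current setting: Rewritten Friendly URLs."""
    return f"/entries/{entry_id}-{ENTRIES[entry_id]}"


def respond(request_target: str) -> Tuple[int, Optional[str]]:
    """Return (status, redirect location) for a requested path plus query."""
    match = ID_PATTERN.match(request_target)
    if match is None or int(match.group(1)) not in ENTRIES:
        return 404, None
    target = canonical_path(int(match.group(1)))
    if request_target == target:
        return 200, None   # already canonical, serve the page
    return 301, target     # permanent redirect to the canonical URL


print(respond("/entry.php?433-Top-10-Horror-Movies"))
print(respond("/entries/433-Top-10-Horror-Movies"))
```

The resource is located by its id alone, so an out-of-date or mistyped title in the link still ends up at the one canonical URL.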

    This is all well and good, until (cue Jaws soundtrack)... we start to consider character encoding.


    Character Encoding

    With the exception of the CMS, Friendly URLs are generated from your existing content. URLs for threads, forums and blog entries are generated from their titles, while URLs for user blogs and individual member profile pages are generated from the user's username. The CMS URLs are generated from an alias specified when creating content.

    UTF-8
    In an ideal world we'd all be using multibyte UTF-8. UTF-8 provides character encoding for almost every known language in the world. Even older browsers behave well with UTF-8 and it is the de facto standard in the modern world. URLs sent to the browser and then requested by the user are sent as urlencoded characters, which are simple for PHP to decode, parse and match up with the canonical URL. All modern browsers also display UTF-8 URLs correctly in the status bar, and more importantly the address bar.
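To show why the all-UTF-8 case is easy, here is a small Python sketch (the post's own context is PHP; Python is used purely for illustration). `encode_path` and `parse_thread_path` are hypothetical helpers, and the title is made up.

```python
from typing import Tuple
from urllib.parse import quote, unquote


def encode_path(path: str) -> str:
    """Percent-encode a path for output; quote() uses UTF-8 by default."""
    return quote(path, safe="/")


def parse_thread_path(encoded_path: str) -> Tuple[int, str]:
    """Decode a requested path (as UTF-8) and split the last segment into id and title."""
    decoded = unquote(encoded_path)
    slug = decoded.rsplit("/", 1)[-1]
    thread_id, _, title = slug.partition("-")
    return int(thread_id), title


sent = encode_path("/threads/42-Motos-Hónda")
print(sent)
print(parse_thread_path(sent))
```

Because the same encoding is used in both directions, what comes back from the browser decodes to exactly what was sent, and the comparison against the canonical URL is a plain string match.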

    So why not just do everything in UTF-8?
    The problem, as many of you are aware, is legacy data. It may be relatively simple to enforce UTF-8 on a brand new vBulletin installation, knowing that it is the safest standard to use, but vBulletin was not always that way. In the past, vB has tried to be 'charset agnostic', allowing any encoding to be passed back and forth between the web server, database and the browser; the many Arabic, Turkish, Cyrillic, Chinese and other language boards testify that it has sort of worked.

    This was an understandable (and possibly even correct) approach to character encoding in the past, when multibyte support in databases, servers and browsers was often missing or ill-defined, the processing overhead for multibyte characters was too high, and the storage cost was unacceptable (multibyte can make your database several times the size).

    These days the processing and storage cost of multibyte character encoding is negligible, standards have improved in all areas, and modern browsers allow UTF-8 to be used in the address bar.

    Currently, vBulletin allows board administrators to specify the charset of their content within their language definitions, and often alternative single byte character sets have been chosen (such as Windows-1256). When a page is loaded, the user's browser is informed of the charset defined in the language, letting the browser know how to display the content.

    This also has the side effect of implying to the browser which character set should be sent to the server, such as when submitting a new post. If you tell the browser that the page is encoded in Windows-1256 and don't specifically tell it which character set to use when sending data to the server, then Windows-1256 will be sent and eventually stored in the database unchanged. This problem is compounded if an administrator has created several languages with different charsets, as you end up with characters of different encodings all in the same table. When it comes to converting the database or handling any of the data (like matching Canonical URLs) there is no way to know what character set the data is encoded in.
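The "no way to know" problem can be demonstrated with a few lines of Python. This is only an illustration: the sample word and the list of candidate charsets are assumptions, and `decode_candidates` is a hypothetical helper.

```python
from typing import Dict, Iterable


def decode_candidates(raw: bytes, charsets: Iterable[str]) -> Dict[str, str]:
    """Decode the same stored bytes under each candidate charset."""
    return {name: raw.decode(name) for name in charsets}


# Bytes as they might be stored after a post submitted from a Windows-1256 page.
stored = "مرحبا".encode("cp1256")

results = decode_candidates(stored, ["cp1256", "cp1251", "latin-1"])
for name, text in results.items():
    print(name, repr(text))
```

All three decodes succeed without error and each gives different text, so the bytes alone cannot tell you which charset was intended. Single-byte charsets in particular accept almost any byte sequence.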

    Now that we are implementing features that completely rely on consistent character encoding we can see the approaches of the past posing major challenges for us.


    Handling Character Data for Friendly URLs

    The design of Friendly URLs has three goals to consider:
    1. Build clean and friendly URLs that can be displayed by the browser.
    2. Understand those URLs when they are submitted by the browser in order to fetch the requested resource (forum, thread, blog etc).
    3. Ensure that the requested URL matches the canonical URL for that resource, and redirect the user to the correct URL if it does not match.


    If everything were UTF-8 then Friendly URLs would be built in a consistent fashion, displayed correctly by the browser, and requested by the browser in a consistent way that always matches the canonical URL.

    With other character sets, the differences in behaviour from both browsers and webservers can make it difficult to achieve all of these goals.

    Some form of cleaning is always needed, both for security reasons and simply to allow us to extract the needed information from the URL to locate the requested resource. Strange characters in the URL can trip up even the simplest regex patterns. Because we need to clean the URL we need to know what character set to use. In addition, by the very definition of 'Canonical URLs' we also need to ensure that all URLs use the same character set.

    As many admins are using different (and sometimes several) character sets, inevitably some of their titles will no longer look correct when cleaned with a different character set. All characters that are invalid in the character set used for cleaning are removed; for some character sets this leaves nothing for the Friendly URL. Additionally, as all of the Friendly URLs must be of the same character set, they will inevitably look wrong when sent to the browser in a different character set.
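A minimal Python sketch of how cleaning with the wrong character set can leave nothing behind. `clean_as_utf8` is a hypothetical stand-in for the cleaning step (it simply drops bytes that are not valid UTF-8), not the actual vBulletin routine.

```python
def clean_as_utf8(raw: bytes) -> str:
    """Drop every byte sequence that is not valid UTF-8."""
    return raw.decode("utf-8", errors="ignore")


title = "مرحبا"

# Title stored as UTF-8: survives cleaning intact.
print(repr(clean_as_utf8(b"42-" + title.encode("utf-8"))))

# Same title stored as Windows-1256: every title byte is invalid UTF-8 and is removed.
print(repr(clean_as_utf8(b"42-" + title.encode("cp1256"))))
```

In the second case only the ASCII id and hyphen survive, which matches the "nothing left for the Friendly URL" situation described above.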

    The implementation of Friendly URLs in the initial vBulletin 4 release uses UTF-8 to clean the characters. This provides the safest solution for now, and guarantees that the last two goals are met. If you consistently use a character set that does not match the one defined in your language, this also exploits a behaviour found in all browsers: when unrecognised characters are found, UTF-8 is assumed. For this reason, if you use a character set like ISO-8859-1 but are always submitting Arabic, or CJK then you are likely to find that they are displayed correctly in URLs. However, if you specify a different character set, such as Windows-1256 and your users submit most of your content in that character set then you will almost certainly find that all of your characters are either stripped, or displayed incorrectly.


    Moving Forward

    In order to deal with those situations more elegantly, the first point release after vBulletin 4.0 will include a few options and improved flexibility to allow support for other character sets. However, there are some limitations which need to be understood.

    All browsers apart from IE refuse to display non-UTF-8 characters in the query part of the URL (after the ?) and URL encode them instead, so they appear like:
    Code:
    http://www.example.com/entry.php?433-%E1%E9%ED%F3%FA
    This affects Basic Friendly URLs for all of vBulletin, and also Standard URLs for vBCMS. We can work around this issue by sticking to standards: reencoding the URL as UTF-8 and urlencoding it for non-IE browsers. However, IE will not display these correctly. As an alternative for IE we can send the URLs in their original character set, but IE will not reencode them to UTF-8 for Basic Friendly and Standard URLs when sending the request. If we accept this, there will be two accessible URLs for a resource (the original character set, and UTF-8), which breaks the third goal of having a single canonical URL.
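The "two accessible URLs" problem can be seen by percent-encoding one title under two charsets. The sketch below assumes the example query above was produced from an ISO-8859-1 page (the post does not name the charset); in that encoding the bytes E1 E9 ED F3 FA spell áéíóú.

```python
from urllib.parse import quote

slug = "áéíóú"

# What a browser sends when it keeps the page's legacy charset (assumed ISO-8859-1).
legacy_query = "433-" + quote(slug, encoding="iso-8859-1")

# What a browser sends when it reencodes to UTF-8 first.
utf8_query = "433-" + quote(slug, encoding="utf-8")

print(legacy_query)  # 433-%E1%E9%ED%F3%FA
print(utf8_query)    # 433-%C3%A1%C3%A9%C3%AD%C3%B3%C3%BA
```

Both strings name the same entry, but as raw URLs they differ, so only one of them can be treated as canonical.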

    All browsers also reencode any characters that are a part of the path (after the last /) to UTF-8 regardless of the character set. This is standards compliant.

    For these reasons, support for other character sets will require iconv. Our minimum requirements for vBulletin are now PHP 5.1.6, which has iconv built in. However, some shared hosts still may not have iconv compiled into PHP. Please let me know in the comments if you are not using UTF-8 as your character set, are using PHP 5.1.6 and still don't have iconv.
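The kind of conversion iconv would be used for can be sketched in Python as follows. This is an assumption about the approach, not the planned vBulletin code: `to_utf8_query` is a hypothetical helper that tries UTF-8 first and falls back to a configured legacy charset.

```python
from urllib.parse import quote, unquote_to_bytes


def to_utf8_query(raw_query: str, legacy_charset: str) -> str:
    """Normalise a percent-encoded query string to percent-encoded UTF-8."""
    raw = unquote_to_bytes(raw_query)
    try:
        text = raw.decode("utf-8")          # already UTF-8: keep as is
    except UnicodeDecodeError:
        text = raw.decode(legacy_charset)   # otherwise assume the board's charset
    return quote(text, safe="/-")


print(to_utf8_query("433-%E1%E9%ED%F3%FA", "iso-8859-1"))
print(to_utf8_query("433-%C3%A1%C3%A9%C3%AD%C3%B3%C3%BA", "iso-8859-1"))
```

Both requests normalise to the same UTF-8 form, which can then be compared with the canonical URL. One known limitation of this fallback: a legacy byte sequence that happens to be valid UTF-8 would be misread, so it only works reliably when the legacy charset is known and used consistently.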


    vBulletin in the Future

    In the future we hope to make Friendly URLs the easy feature that they should be by moving vBulletin to be completely UTF-8. This is a much greater task than Friendly URLs and poses its own set of challenges. However, we will continue to support issues with other character sets until we are completely confident that we are providing a robust set of tools to allow everyone to migrate to UTF-8.

    So, to wrap up: Friendly URLs and Character Encoding is a complex issue, compounded by legacy data and mixed character sets. The initial vBulletin 4.0 release takes a one-size-fits-all approach to enabling them, but more flexibility and character set support will be provided in the coming weeks.

    By the way, if you have any bugs to report with Friendly URLs, the following information is really important to be able to fix them:
    1. Webserver and Version
    2. Browser and Version used to Create Content
    3. Character Set used when Creating Content
    4. Browser and Version used to View Content
    5. Character Set used when Viewing Content
    6. Friendly URL Option used when Viewing Content


    Comments and questions are welcome, as usual.

    • obmob
      #31
      obmob commented
      Is there any fix to rewrite spanish threads correctly?

      I mean, getting rid of ¡! ¿? áéíóú would be ideal.

    • R.o.o.t
      #32
      R.o.o.t commented
      Thanks.
      We will wait.

    • kingtech
      #33
      kingtech commented
      Is there any way to suppress characters in the post title that will screw up the SEF? For example, if someone uses \ or / or ? in the topic title, I would rather the SEF simply stripped those characters from the title before generating the URL.