Vbulletin Translation and encoding issue

**Dody** · Thu 19 Nov '09, 2:32am

Originally posted by vishnubhatia

Hi,

Has anyone been able to upload the Arabic UTF-8 language pack successfully? I have installed my DB with the UTF-8 General CI collation, changed the config.php file, and converted the master language file to UTF-8. After all this, when I try to upload the Arabic language file, which was created in UTF-8 by PHP Dev (http://www.vbulletin.com/forum/showt...guage-packages), all the phrases are converted to ?????. Can anyone help me?

You are using the wrong vBulletin language pack; it is not encoded correctly. Use my attempt at generating a correctly encoded UTF-8 pack, which works with both UTF-8 and windows-1256: check it here #24

**vishnubhatia** · Thu 19 Nov '09, 3:59am

Check Where?

Originally posted by Dody

You are using the wrong vBulletin language pack; it is not encoded correctly. Use my attempt at generating a correctly encoded UTF-8 pack, which works with both UTF-8 and windows-1256: check it here #24

Hi Dody,

Check where? What is #24?

**Dody** · Thu 19 Nov '09, 2:33pm

Sorry, it was meant to be a link. Anyway, here you go: #24

**vishnubhatia** · Sat 21 Nov '09, 10:49pm

Thanks, I found it and replied to you in the same thread.

**Darren Gordon** · Mon 11 Jan '10, 7:24am

Originally posted by sajid09

use iconv("windows-1256", "UTF-8",$text);

Just to point out, in this case that would be quite a bad idea. If you use ISO-8859-1 for English, you would end up with a post table that contains a mix of raw ISO-8859-1 and UTF-8. In fact, using WINDOWS-1256 and ISO-8859-1 for two separate languages is also a bad idea.

Unfortunately, vBulletin does not track the charset used on a per-post or per-thread basis, so if you ever wanted to convert such a table to another charset at a later date (e.g. UTF-8), it would be extremely difficult: the conversion process won't know whether the source character set is WINDOWS-1256 or ISO-8859-1 when reading the content.

If you need to use multiple languages, the safest solution is to use UTF-8 for everything. Even using ISO-8859-1 for Arabic content (as well as English) would be a better approach, despite being incorrect, because the browser will then at least encode the characters as NCRs (the htmlentities that you referred to). If you convert to UTF-8 later, the conversion can decode the NCRs into UTF-8 along with the surrounding content in the source character set.
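To make the ambiguity concrete, here is a minimal Python sketch (not vBulletin code; the sample text and the `decode_both` helper are made up for illustration). It shows that bytes written as windows-1256 also decode without any error as ISO-8859-1, so nothing in the stored bytes tells a converter which charset was really used.

```python
def decode_both(raw):
    """Decode the same bytes as windows-1256 and as ISO-8859-1."""
    return raw.decode("cp1256"), raw.decode("latin-1")

# Hypothetical post body, stored as windows-1256 bytes.
arabic_row = "مرحبا".encode("cp1256")

as_1256, as_latin1 = decode_both(arabic_row)

# Both decodes succeed without raising, yet only one matches what was typed.
print(repr(as_1256))
print(repr(as_latin1))
print(as_1256 == as_latin1)
```

A converter that guesses the wrong source charset would silently write the second, garbled string into the new table, which is why a table with mixed charsets is so hard to migrate.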

**Dody** · Mon 11 Jan '10, 12:28pm

Besides what Darren has said, I noticed by looking at my post table that it includes a mix of windows-1256 chars and Unicode characters in &ampersand; entities. I have always used windows-1256 and nothing else, and my content is only in Arabic, and yet this still happens. So converting to UTF-8 isn't that easy, especially for large forums.

**Merjawy** · Mon 11 Jan '10, 12:43pm

I've always used windows-1256 for all the language packs I use and it worked just fine... but not with 4.0 (there is an email issue with 4)

**apply** · Mon 18 Jan '10, 5:53pm

I have been running vB3.x on a Persian forum for quite a while with no issues, but after I upgraded to vB4, search for non-English words no longer works. In fact, it never returns any results. All my tables use the utf8 charset with utf8_general_ci collation. I do not use $config['Mysqli']['charset'] = 'utf8' in my config. If I enable it, my forum starts throwing errors.
I have rebuilt the index a few times, but the problem persists. It is really frustrating and my users are really pissed off because of this issue. Any help is appreciated.

**Darren Gordon** · Wed 20 Jan '10, 1:49am

Hey apply, have a look at this bug and the solution from Strateges.

**Darren Gordon** · Wed 20 Jan '10, 2:01am

Originally posted by Dody

I noticed by looking at my post table, that it includes a mix of windows-1256 chars and the unicode characters in &ampersand; entities. I have always used windows-1256 and nothing else and my content is only in arabic and yet this still happens. So converting to UTF-8 isn't quite easy

The &ampersand; entities are NCRs (Numeric Character References). They're submitted by your users' browsers whenever they enter characters that are outside the range of Windows-1256. An NCR holds a Unicode code point value, and Unicode covers pretty much every known character used in languages. Because of this, NCRs are very simple to decode and convert to another character set; they involve no loss of data and are actually the only thing we reliably know the encoding of. If you convert to UTF-8 later, NCRs won't pose any problems.

They also take more space (not a big problem these days, but it would still be better to use a more efficient character encoding) and can pose logical problems: the NCR for Σ, &#931;, counts as 6 characters. But these aren't much different from the challenges posed by multibyte characters anyway.
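As a rough illustration of why NCRs are easy to convert, here is a minimal Python sketch (again not vBulletin code; `post_to_utf8`, `ncr_to_char` and the sample bytes are hypothetical). It assumes the stored bytes really are windows-1256 and only handles decimal references of the form &#NNN;.

```python
import re

def ncr_to_char(match):
    """Turn a decimal reference such as '&#931;' into the character it names."""
    code_point = int(match.group(1))
    if code_point > 0x10FFFF:
        # Not a valid Unicode code point: leave the text untouched.
        return match.group(0)
    return chr(code_point)

def post_to_utf8(raw, source_charset="cp1256"):
    """Decode stored bytes, expand decimal NCRs, and return UTF-8 bytes."""
    text = raw.decode(source_charset)
    text = re.sub(r"&#(\d+);", ncr_to_char, text)
    return text.encode("utf-8")

# Hypothetical stored post: Arabic in windows-1256 plus an NCR for Σ (U+03A3).
stored = "مرحبا ".encode("cp1256") + b"&#931;"
converted = post_to_utf8(stored)

print(converted.decode("utf-8"))
print(len("&#931;"))  # the NCR occupies 6 characters before decoding
```

A real migration would need more care than this, for example deciding whether references to markup characters such as &#60; should stay escaped, but the core step is just a lookup from number to character.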

**apply** · Thu 21 Jan '10, 8:21am

Originally posted by Darren Gordon

Hey apply, have a look at this bug and the solution from Strateges.

Yes, Darren. I saw that bug and the solution a few days ago, but it didn't fix the problem. As I said in my previous post, I don't use $config['Mysqli']['charset'] = 'utf8' in my config.php. Could it be causing the problem?
