View Full Version : Ampersand encoding broken.



Kier
Tue 30th Jul '02, 3:22pm
Encoded ampersands are not being handled properly. I don't know when this happened, but I suspect it is something to do with htmlspecialchars_uni() vs htmlspecialchars().

In the template editor and the post text editor, you can enter an encoded ampersand (&amp;), and when you view the same thing again it will be reduced to a single &.

This badly breaks XHTML compatibility.

Kier
Tue 30th Jul '02, 3:39pm
In fact, the same is true for &nbsp; and &lt; and &gt; and &laquo; and &copy;... all of which are very important for template layout etc.

Allowing all these exceptions (and there are a lot more) is going to make htmlspecialchars_uni() a very big function indeed. But it's absolutely VITAL that these exceptions are made, or the templates are very soon going to be corrupted, as each time I edit a template with a do=something&amp;id=9 URI, it gets fudged by htmlspecialchars_uni().
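For anyone following along, here is a rough sketch of how the round trip loses the encoding. It assumes the ampersand step in use at the time is the pattern quoted later in this thread, and it uses html_entity_decode() as a stand-in for what the browser does with textarea content. The function names are made up for illustration and are not from the vBulletin source.

```php
<?php
// Assumption: the ampersand step as it stood when the bug was reported.
// It skips anything shaped like &name; or &#123;, so &amp; passes through unescaped.
function escape_amp_old(string $text): string
{
    return preg_replace('/&(?![a-z0-9#]+;)/si', '&amp;', $text);
}

// Hypothetical stand-in for the browser: text placed in a textarea is
// entity-decoded once before the user sees it and submits it back.
function browser_round_trip(string $escaped): string
{
    return html_entity_decode($escaped, ENT_QUOTES, 'UTF-8');
}

$stored = 'index.php?do=something&amp;id=9';

// What comes back from the editor after one view/save cycle.
echo browser_round_trip(escape_amp_old($stored)), "\n";
// index.php?do=something&id=9

// Escaping every ampersand instead survives the same cycle.
echo browser_round_trip(str_replace('&', '&amp;', $stored)), "\n";
// index.php?do=something&amp;id=9
```

The second call shows why an editor field wants every & escaped, named entities included: the browser removes exactly one level of encoding.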

Kier
Tue 30th Jul '02, 3:52pm
Can we change this line in htmlspecialchars_uni() from this:
$text = preg_replace('/&(?![a-z0-9#]+;)/si', '&amp;', $text);
to this:
$text = preg_replace('/&(?![#0-9]+;)/s', '&amp;', $text);
I think that would solve the problem, especially as HTML character entities are &[a-z]+; whereas Unicode characters are &#[0-9]+;
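A small side-by-side sketch of the two lookaheads, run on a made-up sample string. Only the two patterns come from the post; the wrapper function names and the sample are hypothetical.

```php
<?php
// Current pattern: leaves anything shaped like &name; or &#123; untouched.
function escape_amp_current(string $text): string
{
    return preg_replace('/&(?![a-z0-9#]+;)/si', '&amp;', $text);
}

// Proposed pattern: leaves only numeric references such as &#8364; untouched.
function escape_amp_proposed(string $text): string
{
    return preg_replace('/&(?![#0-9]+;)/s', '&amp;', $text);
}

$sample = 'a=1&amp;b=2 &copy; &#8364; AT&T';

echo escape_amp_current($sample), "\n";
// a=1&amp;b=2 &copy; &#8364; AT&amp;T

echo escape_amp_proposed($sample), "\n";
// a=1&amp;amp;b=2 &amp;copy; &#8364; AT&amp;T
```

With the proposed pattern, named entities get one extra level of encoding, so they survive being shown in an editor field, while numeric references are still passed through as-is.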

Kier
Tue 30th Jul '02, 4:19pm
ARSE.

That change severely breaks the template editor and lots of other things that expect full htmlspecialchars().

So we need to have a serious discussion about the use of htmlspecialchars_uni() vs htmlspecialchars().

Dev meeting tomorrow? (Wednesday)

Mike Sullivan
Tue 30th Jul '02, 6:06pm
I believe I've fixed it now by replacing the function with:


// ######################## Start htmlspecialchars_uni ###################
function htmlspecialchars_uni($text) {
    // this is a version of htmlspecialchars that still allows unicode to function correctly

    // this is breaking more things than it fixes, so I'm disabling it for now - Kier
    //return htmlspecialchars($text);

    // escape ampersands first, leaving numeric references such as &#8364; intact
    $text = preg_replace('/&(?!#[0-9]+;)/si', '&amp;', $text);
    $text = str_replace('<', '&lt;', $text);
    $text = str_replace('>', '&gt;', $text);
    $text = str_replace('"', '&quot;', $text);
    return $text;
}

The regex just needed to be moved up. (Otherwise the entities were getting double replaced.) This should be committed by the time you read this. Just need to do a bit more testing on it first.
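To illustrate the ordering point, here is a sketch that runs the same four replacements in both orders. It assumes the earlier version ran the ampersand regex after the str_replace() calls, which the post implies but does not show. The function names and the sample input are hypothetical.

```php
<?php
// Ampersand regex last (assumed to be the old order): the & inside the
// freshly created &lt;, &gt; and &quot; is escaped a second time.
function escape_regex_last(string $text): string
{
    $text = str_replace('<', '&lt;', $text);
    $text = str_replace('>', '&gt;', $text);
    $text = str_replace('"', '&quot;', $text);
    return preg_replace('/&(?!#[0-9]+;)/si', '&amp;', $text);
}

// Ampersand regex first (the order in the function above): only the
// ampersands present in the input are escaped.
function escape_regex_first(string $text): string
{
    $text = preg_replace('/&(?!#[0-9]+;)/si', '&amp;', $text);
    $text = str_replace('<', '&lt;', $text);
    $text = str_replace('>', '&gt;', $text);
    return str_replace('"', '&quot;', $text);
}

$input = '<b>&amp; &#8364;</b>';

echo escape_regex_last($input), "\n";
// &amp;lt;b&amp;gt;&amp;amp; &#8364;&amp;lt;/b&amp;gt;

echo escape_regex_first($input), "\n";
// &lt;b&gt;&amp;amp; &#8364;&lt;/b&gt;
```

In both orders the numeric reference &#8364; is left alone, which is the one thing that separates this function from plain htmlspecialchars().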

Then, to fix the templates, this search/replace regex should work:
Find: &(?![#a-z0-9]+;)
Replace: &amp;
(Hopefully this gets displayed correctly.)

Mike Sullivan
Tue 30th Jul '02, 6:29pm
The updated version is committed. Various editors worked correctly with it, so I believe this bug has now been quashed.

I haven't run the regex yet, but that should work. Basically, it will replace an & with &amp; if it's not followed by one or more #, a-z, or 0-9 characters and then a semicolon.
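A sketch of that template repair as a PHP call, reading the search as a bare & and the replacement as an encoded &amp; (an assumption, since the archive's rendering of the Find/Replace lines is unreliable). The helper name, the /i flag and the sample template are hypothetical additions.

```php
<?php
// Hypothetical helper: re-encode bare ampersands in a stored template,
// skipping anything that already looks like an entity or numeric reference.
// The /i flag is an assumption so that mixed-case entity names are skipped too.
function repair_template_ampersands(string $template): string
{
    return preg_replace('/&(?![#a-z0-9]+;)/i', '&amp;', $template);
}

$template = '<a href="index.php?do=something&id=9">Tom & Jerry &copy; &#169; &amp;</a>';

echo repair_template_ampersands($template), "\n";
// <a href="index.php?do=something&amp;id=9">Tom &amp; Jerry &copy; &#169; &amp;</a>
```

Because existing entities are skipped, running it a second time over the same template changes nothing. One caveat: a bare & that happens to be followed by letters or digits and then a semicolon would be mistaken for an entity and left alone.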