PDA

View Full Version : [vb3 fix] search feature: ongoing questions and worries


sparky2
Sun 22nd Sep '02, 12:48am
I'm fairly convinced that something is wrong with the vBulletin search algorithm (as of v2.2.7) and/or the underlying db schema, and offer the following as "proof":

During the past 2 months, I've tweaked our MySQL conf, php.ini,
and Apache httpd.conf countless times, trying to eliminate
anything else which might be a/the limiting factor... and it all
comes back to (down to) continual I/O bottlenecks from MySQL
threads reading vBulletin's halfGig+ searchindex.MYD and related
tables, with everything queueing up behind them.

THE SEARCHINDEX SHOULD NOT BE THIS LARGE.
-=-
SOMETHING _MUST_ BE WRONG WITH THE REGEX // PREG LOGIC USED IN BUILDING THE SEARCH INDEX!

I'm currently rebuilding the search index (for the nth time).
After emptying the prior (via the vbb adminCP interface)
and before rebuilding, I watched the WORD table begin
filling realtime ~~ looking for words I might want to define
as badwords array elements.

Hooboy! I went away for 12hrs and came back to find 560,000 records in the WORD table (and that about a third of our 2million posts had been indexed at that point). "There aren't that many words in the English language, dammit!" (I muttered) and started poking for answers.

I paged through the first 10,000 (or so) entries in the WORD table
e.g. SELECT title FROM word where wordid>1000 and wordid<2000;
and they were all CRAP ~~ non-words, with leading "punctuation" chars
( $something, &something, %something, ...something, *something, 45bucks,and even a.m )
!!!

"Oh, but these won't ALL actually be referenced in the searchindex table, right?!?" I wondered.
WRONG!

mysql> select distinct wordid from searchindex order by wordid limit 500;

I picked several to check, f'rinstance searchindex.wordid = 2737
and then double-checked (xrefed) them in the WORD table:

mysql> select wordid,title from word where title like '$%' limit 500;
+--------+-----------------+
| wordid | title |
+--------+-----------------+
| 2737 | $$ |
| 5639 | $$$ |
| 14701 | $$$$ |
| 14613 | $$$$$ |
{ snip }
| 215960 | $$$$$$$4 |
| 568393 | $$$$$$.tia |
| 219585 | $$$$$$s |
| 571997 | $$$$$...help |
| 194704 | $$$$$110.00 |
{ snip }

Not only is this "non-word" IN THE SEARCHINDEX, it's referenced THOUSANDS of times!, vis:
select count(wordid) from searchindex where wordid=2737;
+---------------+
| count(wordid) |
+---------------+
| 2477 |
+---------------+

What I'm saying (accusing?) is that although search phrases are being explode(ed) into words ~~ and punctuation stripped ~~ at runtime, NON [A-z][0-9] CHARS ARE CLEARLY NOT BEING STRIPPED (ref: vbb.word.title) WHEN THE SEARCH INDEX IS BUILT!

Okay, maybe the regex shouldn't / can't be THAT restrictive
(need to allow for other languages? math characters?)


The 900 (or so) "badwords" ARE being excluded, but these exclusions are tip-of-the-iceberg-insignificant in the scheme of things here.

Maybe the REGEX is okay. Maybe addslashes(htmlspecialcharacters(searchTerm) just needs to be called later, or stripslashes-ed before being checked? Whatever.

PLEASE don't put off fixing this until v3.0
This problem has kept many of us chasing our tails for months!

Scott MacVicar
Mon 23rd Sep '02, 11:18am
I'll have a look tonight but if we strip all non alphanumeric characters it could cause problems with multi-byte languages i think

sparky2
Mon 23rd Sep '02, 3:09pm
there is a wordsonly() function in ~admin/functions.php
which looks like it's designed to do EXACTLY what I'm suggesting/describing. However, I can't find any place in any of the vBulletin code that calls the function.

Considering that lines comprising the "wordsonly" function are immediately prior to the lines containing the "indexpost()" code,
I'm wondering if its inclusion was accidently dropped/forgotten.

I diffed our functions.php against a distribution copy to make sure the file has't been edited here. It hasn't.

============
Above, I said "exactly".
That's not quite correct.

In the line that reads:
$text=ereg_replace("[^ 0-9a-z]"," ",$text);

a space character is used for the replacement.
I'm thinking either of these would be more appropriate:

-- replace any non-alphanumeric char with an EMPTY
(strip, instead of replacing with space)

-=or=-

-- use a temp var, and use an "if" statement
{{{ if a non-alphanumeric char found middle-of-word, do NOT perform $text += $tempvar }}}
(perhaps we should allow a hyphen, or minus, char)
to weed out the "junk" words.


=========== Oh! Lookie what I just found! ===========
http://www.vbulletin.com/forum/showthread.php?s=&threadid=11710

So... this function *WAS* dropped...

...and people have been complaining about the slowness,
system load, and the HUGE size of the searchindex ever since, haven't they?

Scott MacVicar
Mon 23rd Sep '02, 6:20pm
wordsonly was dropped and replaced with indexpost, its identical to wordsonly but indexpost is a bit more complex and uses badwords.php

I'll see what other characters can be stripped, the problem you can't simply just use a regexp and strip everything thats not [a-zA-z0-9] is that it will strip all multi-byte characters which will cause problems for all the foreign customers, such as chinese, japanese and korean where a character can be up to 6 bytes i think depending if its zen-kaku or han-kaku which describes the width of the characters.

I'll try and see if i can find an interna php function to check if a word is multi-byte to use in indexpost.

sparky2
Wed 25th Sep '02, 5:32pm
you mean, like:

mb_detect_encoding()

Wow, I didn't realize what a can of worms this represented ~~
but I understand now that there's a lot more to consider.
Some of it might be dependent upon configuration options which are declared during PHP install. I don't know if you "require" PHP v4+ for the current vBulletin or if you're still coding to accomodate people who still (have to) use PHP3 due to their hosting arrangements.

Below are some bits-n-pieces I found in the PHP docs.
I'm at a loss to see the big picture, though.

PHP4 Character Encoding Requirements
-- Per byte encoding
-- Single byte characters in range of 00h-7fh which is compatible with ASCII
-- Multi-byte characters without 00h-7fh


Character encodings work with PHP4:
ISO-8859-*, EUC-JP, UTF-8


"If you use databases with PHP, it is recommended that you use the same character encoding for both database and internal encoding for ease of use and better performance. "


------php.ini excerpt:
[mbstring]
;mbstring.internal_encoding = EUC-JP
;mbstring.http_input = auto
;mbstring.http_output = SJIS
;mbstring.detect_order = auto
;mbstring.substitute_character = none;

MUG
Wed 25th Sep '02, 5:43pm
Also, vB code appears in the index without the brackets... search for "/quote" on these forums and you'll see what I mean. (351 pages of threads)

This obviously takes up useless rows in the searchindex, making it much less efficient. There should be an option in the Admin CP to enable features (such as the wordsonly() function) that only work with Roman-based languages, like English.

MUG
Wed 25th Sep '02, 5:58pm
If you want to find out how many rows in the database this bug is causing, run this query in phpMyAdmin:

SELECT wordid FROM word WHERE title IN('quote','/quote')

You should get two rows, with two different numbers. Use those two numbers in this query:

SELECT count(*) FROM searchindex WHERE wordid IN(wordid_1,wordid_2)

That query should return one row with the number of rows it is taking in your database.

This vB code problem is 2.5% of the rows in my searchindex table.

TommyBALL
Thu 26th Sep '02, 3:51am
How about making a bad-chars list? Then each admin can for himself decide what characters he does not want in his searchindex?

sparky2
Thu 26th Sep '02, 11:11am
Whew! thanks for chiming in. Re-reading my posts, I worried that I sounded like a l'il whiner.

I refrained from mentioning the error (lesser, and separate issue, I thought), but yeah, I see it all the time:
blahblah delete from searchindex where blah... and it does seem to be a related issue.

colon-bracketed codes (replacement text for smilies) are in there too (word and searchindex records for them). Arguably, this might be desirable in some installations... except you can't actually search for them (colon string colon) via the search interface. (!)

Maybe it's just our userbase (and their writing style),
but the biggest amount of clutter in our WORD table
is from someword-runtogether-anotherword strings
where "words" lack whitespace between them and have
ellipses (dot,dot,dot; n times) or hyphens.

Upon reading the vBulletin staff initial reply, I respectfully understand the multi-byte issue. Although I could "hack together" my own solution to regex exclude "words" with specified middle-of-word characters... I expect that would push our installation into the realm of "we can no longer support the product because it's been modified".

Without a doubt, I love vBulletin. Also without a doubt (for me), it's not meeting the scalability needs for our site, due to the inherent searchindex flaw(s) we're discussing here.

sparky2
Thu 26th Sep '02, 11:20am
Originally posted by TommyBALL
How about making a bad-chars list? Then each admin can for himself decide what characters he does not want in his searchindex? The fix, or improvement, isn't "as easy as just ADDING that".

As is, postindex() isn't even examining characters internal to the "words".
It's just chunking the message text field, handling it like a space-delimited imploded array.

TommyBALL
Thu 26th Sep '02, 3:34pm
I know it's not an "easy fix" :). But I've wished for it for a long time :)

GuruXL
Thu 26th Sep '02, 9:29pm
Originally posted by sparky2
The fix, or improvement, isn't "as easy as just ADDING that".

As is, postindex() isn't even examining characters internal to the "words".
It's just chunking the message text field, handling it like a space-delimited imploded array.

Sparky, the vBulletin search engine wasn't created to be VERY specific, its supposed to find threads that have in it words that are related to your query.

For example if I search for the word "tree" any instance of "tree" in the MySQL database will appear up in your results. If you want a more specific vB search engine, i think you will have to hack vB to get it to do that. Its simply that not everyone is as hung up on this search engine "flaw" as you call it, as you are...

EDIT: However, I can sympathize with your dilemma. Try to create a hack and fine tune it and see what you can work out...

John
Fri 27th Sep '02, 5:40am
We have had a think about this, and due to the fact it is going to require everyone to reindex their search engine to take effect, the changes will be made in vB3 not vB2. You will have to reindex the search engine for vB3 anyway, so this can be done at the same time.