"Did you mean&rdquo; feature on a dictionary database

"Did you mean” feature on a dictionary database

I have a ~300.000 row table; which includes technical terms; queried using PHP and MySQL + FULLTEXT indexes. But when I searching a wrong typed term; for example "hyperpext"; naturally giving no results.

I need to "compansate" little writing errors and getting nearest record from database. How I can accomplish such feaure? I know (actually, learned today) about Levenshtein distance, Soundex and Metaphone algorithms but currently not having a solid idea to implement this to querying against database.

Best regards. (Sorry about my poor English, I'm trying to do my best)

Source: Tips4all

Comments

Tips For AllApril 6, 2012 at 7:37 AM
See this article for how you might implement Levenshtein distance in a MySQL stored function.

For posterity, the author's suggestion is to do this:

CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN SET c = c_temp; END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END

He also supplies a LEVENSHTEIN_RATIO helper method which will evaluate the ratio of different/total characters, rather than a straight edit distance. For instance, if it's 60%, then three-fifths of the characters in the source word are different from the destination word.

CREATE FUNCTION LEVENSHTEIN_RATIO (s1 VARCHAR(255), s2 VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, max_len INT;
SET s1_len = LENGTH(s1), s2_len = LENGTH(s2);
IF s1_len > s2_len THEN SET max_len = s1_len; ELSE SET max_len = s2_len; END IF;
RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
END
ReplyDelete
Replies
Tips For AllApril 6, 2012 at 7:37 AM
From the comments of http://dev.mysql.com/doc/refman/5.0/en/udf-compiling.html

now i download the package from the mysql udf repository
http://empyrean.lib.ndsu.nodak.edu/~nem/mysql/

wget http://empyrean.lib.ndsu.nodak.edu/~nem/mysql/udf/dludf.cgi?ckey=28

ll

tar -xzvf dludf.cgi\?ckey\=28

gcc -shared -o libmysqllevenshtein.so mysqllevenshtein.cc -I/usr/include/mysql/

mv libmysqllevenshtein.so /usr/lib

mysql -uroot -pPASS

mysql> use DATABASE

mysql> CREATE FUNCTION levenshtein RETURNS INT SONAME 'libmysqllevenshtein.so';

mysql> select levenshtein(w1.word,w2.word) as dist from word w1, word w2 where ETC........... order by dist asc limit 0,10;
ReplyDelete
Replies
Tips For AllApril 6, 2012 at 7:37 AM
I suggest that you generate typo variations on the query input.

i.e. hyperpext > { hyperpeext, hipertext, ... } etc

One of these is bound to be the correct spelling (especially for common misspellings)

The way you identify the most likely match is to do a lookup for each on an index which tells you the document frequency of the term. (make sense?)
ReplyDelete
Replies
Tips For AllApril 6, 2012 at 7:37 AM
Why not add a table column for storing the word in its alternate (e.g., Soundex) form? that way, if your first SELECT does not find the exact match, you can do a second search to look for matching alternate forms.

The trick is to encode each word so that misspelled variations end up converted into the same alternate form.
ReplyDelete
Replies

Add comment

CCNA, CCNP, MCSA, CCNA Final Exam, All Answer Test Module With 100/100

Search This Blog

"Did you mean” feature on a dictionary database

Labels

Comments

Post a Comment

Popular posts from this blog

CCNA 1 Final Exam 2011 latest (hot hot hot)

Slow Android emulator

Java Urban Myths

CCNA, CCNP, MCSA, CCNA Final Exam, All Answer Test Module With 100/100