Skip to main content

"Did you mean” feature on a dictionary database



I have a ~300.000 row table; which includes technical terms; queried using PHP and MySQL + FULLTEXT indexes. But when I searching a wrong typed term; for example "hyperpext"; naturally giving no results.





I need to "compansate" little writing errors and getting nearest record from database. How I can accomplish such feaure? I know (actually, learned today) about Levenshtein distance, Soundex and Metaphone algorithms but currently not having a solid idea to implement this to querying against database.





Best regards. (Sorry about my poor English, I'm trying to do my best)



Source: Tips4all

Comments

  1. See this article for how you might implement Levenshtein distance in a MySQL stored function.

    For posterity, the author's suggestion is to do this:

    CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
    RETURNS INT
    DETERMINISTIC
    BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR;
    DECLARE cv0, cv1 VARBINARY(256);
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
    IF s1 = s2 THEN
    RETURN 0;
    ELSEIF s1_len = 0 THEN
    RETURN s2_len;
    ELSEIF s2_len = 0 THEN
    RETURN s1_len;
    ELSE
    WHILE j <= s2_len DO
    SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
    END WHILE;
    WHILE i <= s1_len DO
    SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
    WHILE j <= s2_len DO
    SET c = c + 1;
    IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
    SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
    IF c > c_temp THEN SET c = c_temp; END IF;
    SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
    IF c > c_temp THEN SET c = c_temp; END IF;
    SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
    END WHILE;
    SET cv1 = cv0, i = i + 1;
    END WHILE;
    END IF;
    RETURN c;
    END


    He also supplies a LEVENSHTEIN_RATIO helper method which will evaluate the ratio of different/total characters, rather than a straight edit distance. For instance, if it's 60%, then three-fifths of the characters in the source word are different from the destination word.

    CREATE FUNCTION LEVENSHTEIN_RATIO (s1 VARCHAR(255), s2 VARCHAR(255))
    RETURNS INT
    DETERMINISTIC
    BEGIN
    DECLARE s1_len, s2_len, max_len INT;
    SET s1_len = LENGTH(s1), s2_len = LENGTH(s2);
    IF s1_len > s2_len THEN SET max_len = s1_len; ELSE SET max_len = s2_len; END IF;
    RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
    END

    ReplyDelete
  2. From the comments of http://dev.mysql.com/doc/refman/5.0/en/udf-compiling.html


    now i download the package from the mysql udf repository
    http://empyrean.lib.ndsu.nodak.edu/~nem/mysql/


    wget http://empyrean.lib.ndsu.nodak.edu/~nem/mysql/udf/dludf.cgi?ckey=28

    ll

    tar -xzvf dludf.cgi\?ckey\=28

    gcc -shared -o libmysqllevenshtein.so mysqllevenshtein.cc -I/usr/include/mysql/

    mv libmysqllevenshtein.so /usr/lib

    mysql -uroot -pPASS

    mysql> use DATABASE

    mysql> CREATE FUNCTION levenshtein RETURNS INT SONAME 'libmysqllevenshtein.so';

    mysql> select levenshtein(w1.word,w2.word) as dist from word w1, word w2 where ETC........... order by dist asc limit 0,10;

    ReplyDelete
  3. I suggest that you generate typo variations on the query input.

    i.e. hyperpext > { hyperpeext, hipertext, ... } etc

    One of these is bound to be the correct spelling (especially for common misspellings)

    The way you identify the most likely match is to do a lookup for each on an index which tells you the document frequency of the term. (make sense?)

    ReplyDelete
  4. Why not add a table column for storing the word in its alternate (e.g., Soundex) form? that way, if your first SELECT does not find the exact match, you can do a second search to look for matching alternate forms.

    The trick is to encode each word so that misspelled variations end up converted into the same alternate form.

    ReplyDelete

Post a Comment

Popular posts from this blog

Slow Android emulator

I have a 2.67 GHz Celeron processor, 1.21 GB of RAM on a x86 Windows XP Professional machine. My understanding is that the Android emulator should start fairly quickly on such a machine, but for me it does not. I have followed all instructions in setting up the IDE, SDKs, JDKs and such and have had some success in staring the emulator quickly but is very particulary. How can I, if possible, fix this problem?