Skip to main content

"Did you mean” feature on a dictionary database



I have a ~300.000 row table; which includes technical terms; queried using PHP and MySQL + FULLTEXT indexes. But when I searching a wrong typed term; for example "hyperpext"; naturally giving no results.





I need to "compansate" little writing errors and getting nearest record from database. How I can accomplish such feaure? I know (actually, learned today) about Levenshtein distance, Soundex and Metaphone algorithms but currently not having a solid idea to implement this to querying against database.





Best regards. (Sorry about my poor English, I'm trying to do my best)



Source: Tips4all

Comments

  1. See this article for how you might implement Levenshtein distance in a MySQL stored function.

    For posterity, the author's suggestion is to do this:

    CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
    RETURNS INT
    DETERMINISTIC
    BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR;
    DECLARE cv0, cv1 VARBINARY(256);
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
    IF s1 = s2 THEN
    RETURN 0;
    ELSEIF s1_len = 0 THEN
    RETURN s2_len;
    ELSEIF s2_len = 0 THEN
    RETURN s1_len;
    ELSE
    WHILE j <= s2_len DO
    SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
    END WHILE;
    WHILE i <= s1_len DO
    SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
    WHILE j <= s2_len DO
    SET c = c + 1;
    IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
    SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
    IF c > c_temp THEN SET c = c_temp; END IF;
    SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
    IF c > c_temp THEN SET c = c_temp; END IF;
    SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
    END WHILE;
    SET cv1 = cv0, i = i + 1;
    END WHILE;
    END IF;
    RETURN c;
    END


    He also supplies a LEVENSHTEIN_RATIO helper method which will evaluate the ratio of different/total characters, rather than a straight edit distance. For instance, if it's 60%, then three-fifths of the characters in the source word are different from the destination word.

    CREATE FUNCTION LEVENSHTEIN_RATIO (s1 VARCHAR(255), s2 VARCHAR(255))
    RETURNS INT
    DETERMINISTIC
    BEGIN
    DECLARE s1_len, s2_len, max_len INT;
    SET s1_len = LENGTH(s1), s2_len = LENGTH(s2);
    IF s1_len > s2_len THEN SET max_len = s1_len; ELSE SET max_len = s2_len; END IF;
    RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
    END

    ReplyDelete
  2. From the comments of http://dev.mysql.com/doc/refman/5.0/en/udf-compiling.html


    now i download the package from the mysql udf repository
    http://empyrean.lib.ndsu.nodak.edu/~nem/mysql/


    wget http://empyrean.lib.ndsu.nodak.edu/~nem/mysql/udf/dludf.cgi?ckey=28

    ll

    tar -xzvf dludf.cgi\?ckey\=28

    gcc -shared -o libmysqllevenshtein.so mysqllevenshtein.cc -I/usr/include/mysql/

    mv libmysqllevenshtein.so /usr/lib

    mysql -uroot -pPASS

    mysql> use DATABASE

    mysql> CREATE FUNCTION levenshtein RETURNS INT SONAME 'libmysqllevenshtein.so';

    mysql> select levenshtein(w1.word,w2.word) as dist from word w1, word w2 where ETC........... order by dist asc limit 0,10;

    ReplyDelete
  3. I suggest that you generate typo variations on the query input.

    i.e. hyperpext > { hyperpeext, hipertext, ... } etc

    One of these is bound to be the correct spelling (especially for common misspellings)

    The way you identify the most likely match is to do a lookup for each on an index which tells you the document frequency of the term. (make sense?)

    ReplyDelete
  4. Why not add a table column for storing the word in its alternate (e.g., Soundex) form? that way, if your first SELECT does not find the exact match, you can do a second search to look for matching alternate forms.

    The trick is to encode each word so that misspelled variations end up converted into the same alternate form.

    ReplyDelete

Post a Comment

Popular posts from this blog

[韓日関係] 首相含む大幅な内閣改造の可能性…早ければ来月10日ごろ=韓国

div not scrolling properly with slimScroll plugin

I am using the slimScroll plugin for jQuery by Piotr Rochala Which is a great plugin for nice scrollbars on most browsers but I am stuck because I am using it for a chat box and whenever the user appends new text to the boxit does scroll using the .scrollTop() method however the plugin's scrollbar doesnt scroll with it and when the user wants to look though the chat history it will start scrolling from near the top. I have made a quick demo of my situation http://jsfiddle.net/DY9CT/2/ Does anyone know how to solve this problem?

Why does this javascript based printing cause Safari to refresh the page?

The page I am working on has a javascript function executed to print parts of the page. For some reason, printing in Safari, causes the window to somehow update. I say somehow, because it does not really refresh as in reload the page, but rather it starts the "rendering" of the page from start, i.e. scroll to top, flash animations start from 0, and so forth. The effect is reproduced by this fiddle: http://jsfiddle.net/fYmnB/ Clicking the print button and finishing or cancelling a print in Safari causes the screen to "go white" for a sec, which in my real website manifests itself as something "like" a reload. While running print button with, let's say, Firefox, just opens and closes the print dialogue without affecting the fiddle page in any way. Is there something with my way of calling the browsers print method that causes this, or how can it be explained - and preferably, avoided? P.S.: On my real site the same occurs with Chrome. In the ex