
PHP library for word clustering/NLP?


What I am trying to implement is a rather trivial "take search results (as in title & short description), cluster them into meaningful named groups" program in PHP.



After hours of googling and countless searches on SO (yielding interesting results as always, albeit nothing really useful) I'm still unable to find any PHP library that would help me handle clustering.



  • Is there such a PHP library out there that I might have missed?

  • If not, is there any FOSS that handles clustering and has a decent API?


Source: Tips4all

Comments

  1. Like this:

    Use a list of stopwords, get all words or phrases not in the stopwords, count occurrences of each, and sort in descending order.

    The stopword list needs to contain all common English terms. It should also include punctuation, and you will need to preg_replace all the punctuation so that each mark becomes a separate word first, e.g. "Something, like this." -> "Something , like this ." Or you can just remove all punctuation.
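
    If you go the first route, padding the punctuation could look roughly like the sketch below. The function name is made up for illustration, and the [[:punct:]] class treats every ASCII punctuation mark the same way, so "don't" becomes "don ' t"; adjust the pattern if that is not what you want.

    ```php
    <?php
    // Hypothetical helper: pad each punctuation mark with spaces so it becomes its own token.
    function separate_punctuation(string $text): string
    {
        $padded = preg_replace('/([[:punct:]])/', ' $1 ', $text);
        // Collapse the extra whitespace that the padding introduced.
        return trim(preg_replace('/\s+/', ' ', $padded));
    }

    echo separate_punctuation('Something, like this.'), "\n"; // Something , like this .
    ```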

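    $content=strtolower($content); // lowercase first so the regex below keeps every letter
    $content=preg_replace('/[^a-z\s]/', '', $content); // remove punctuation
    $content=preg_split('/\s+/', $content, -1, PREG_SPLIT_NO_EMPTY); // split into words

    $stopwords='the|and|is|your|me|for|where|etc...';
    $stopwords=explode('|',$stopwords);
    $stopwords=array_flip($stopwords); // flip so isset() lookups are fast

    $result=array(); $temp=array();
    foreach ($content as $s)
    {
        // a stopword or a very short word ends the current phrase
        if (isset($stopwords[$s]) OR strlen($s)<3)
        {
            if (sizeof($temp)>0)
            {
                $result[]=implode(' ',$temp);
                $temp=array();
            }
        } else $temp[]=$s;
    }
    if (sizeof($temp)>0) $result[]=implode(' ',$temp); // flush the last phrase

    $phrases=array_count_values($result);
    arsort($phrases); // most frequent phrases first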


    Now you have an associative array in order of the frequency of terms that occur in your input data.

    How you want to do the matches depends upon you, and it depends largely on the length of the strings in the input data.

    I would see if any of the top 3 array keys match any of the top 3 from any other in the data. These are then your groups.
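
    A rough sketch of that matching step, under a few assumptions: it uses single words rather than phrases to keep it short, the function names and sample data are invented for illustration, and it needs PHP 7.4+ for the arrow functions. A term only becomes a group name if at least two results have it among their top 3.

    ```php
    <?php
    // Hypothetical helper: the $n most frequent non-stopword terms of one text.
    function top_terms(string $text, array $stopwords, int $n = 3): array
    {
        $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        $kept = array_filter($words, fn($w) => strlen($w) >= 3 && !isset($stopwords[$w]));
        $counts = array_count_values($kept);
        arsort($counts);
        return array_slice(array_keys($counts), 0, $n);
    }

    // Hypothetical helper: map each shared top term to the ids of the results that have it.
    function group_by_shared_terms(array $docs, array $stopwords): array
    {
        $groups = [];
        foreach ($docs as $id => $text) {
            foreach (top_terms($text, $stopwords) as $term) {
                $groups[$term][] = $id;
            }
        }
        // Keep only terms that at least two results share.
        return array_filter($groups, fn($ids) => count($ids) > 1);
    }

    // Made-up sample data: id => title plus short description.
    $stopwords = array_flip(['the', 'and', 'for', 'with', 'how']);
    $docs = [
        'a' => 'PHP clustering library for PHP clustering',
        'b' => 'Clustering search results with k-means clustering',
        'c' => 'Apache hosts file wildcards',
    ];
    $groups = group_by_shared_terms($docs, $stopwords);
    print_r($groups);
    ```

    With this sample, only "clustering" is shared, so results a and b end up in one group and c stays ungrouped.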

    Let me know if you have any trouble with this.

  2. "... cluster them into meaningful groups" is a bit too vague; you'll need to be more specific.

    For starters you could look into K-Means clustering.
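
    To give an idea of what K-Means involves, here is a minimal sketch in plain PHP. It assumes each search result has already been turned into a numeric vector (for example term counts over a fixed vocabulary) and that the points are a 0-indexed list. The function names are mine, and seeding the centroids with the first k points is a simplification; real implementations usually pick seeds randomly and rerun several times.

    ```php
    <?php
    // Squared Euclidean distance between two equal-length numeric vectors.
    function sq_dist(array $a, array $b): float
    {
        $sum = 0.0;
        foreach ($a as $i => $v) {
            $sum += ($v - $b[$i]) ** 2;
        }
        return $sum;
    }

    // Minimal k-means: returns the cluster index assigned to each point.
    // Assumption: the first $k points seed the centroids.
    function kmeans(array $points, int $k, int $maxIter = 100): array
    {
        $centroids = array_slice($points, 0, $k);
        $assign = [];
        for ($iter = 0; $iter < $maxIter; $iter++) {
            // Assignment step: attach each point to its nearest centroid.
            $next = [];
            foreach ($points as $p => $point) {
                $best = 0;
                foreach ($centroids as $c => $centroid) {
                    if (sq_dist($point, $centroid) < sq_dist($point, $centroids[$best])) {
                        $best = $c;
                    }
                }
                $next[$p] = $best;
            }
            if ($next === $assign) {
                break; // assignments stopped changing, so we have converged
            }
            $assign = $next;
            // Update step: move each centroid to the mean of its members.
            foreach ($centroids as $c => $centroid) {
                $members = array_keys($assign, $c, true);
                if (!$members) {
                    continue; // leave an empty cluster's centroid where it is
                }
                foreach ($centroid as $d => $unused) {
                    $sum = array_sum(array_map(fn($p) => $points[$p][$d], $members));
                    $centroids[$c][$d] = $sum / count($members);
                }
            }
        }
        return $assign;
    }

    // Made-up term counts over the vocabulary [php, cluster, apache, hosts].
    $points = [
        [2, 1, 0, 0],
        [0, 0, 2, 1],
        [1, 2, 0, 0],
        [0, 0, 1, 2],
    ];
    $clusters = kmeans($points, 2);
    print_r($clusters);
    ```

    Note that K-Means only gives you unnamed clusters; you would still need something like the most frequent term in each cluster to label it.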

    Have a look at this page and website:

    PHP/ir: Information Retrieval and other interesting topics

    EDIT: You could try some data mining yourself by cross-referencing search results with something like the Open Directory (dmoz) RDF data dump and then enumerating the matching categories.

    EDIT2: And here is a dmoz/category question that also mentions "Faceted Search"!

    Dmoz/Monster algorithme to calculate count of each category and sub category?

  3. You could also have a look at Programming Collective Intelligence (Chapter 3: Discovering Groups) by Toby Segaran, which goes through just this use case using Python. You should be able to implement the same thing in PHP once you understand how it works.

    Even though it is not PHP, the Carrot2 project offers several clustering engines and can be integrated with Solr.

  4. This may be way off, but check out OpenCalais. They have a web service that lets you pass in a block of text and returns a parseable response listing the things it found in the text, such as places, people, facts, etc. You could use these categories to build your "clouds" and to choose which results to display.

    I've used this library a few times in PHP and it's always been quite easy to work with.

    Again, this might not be relevant to what you're trying to do. Maybe you could post an example of what you're trying to accomplish?

  5. If you can pre-define the filters for your faceted search (the named groups) then it will be much easier.

    Rather than relying on an algorithm that uses the current searcher's input and their particular results to generate the filter list, you would use an aggregate of the most commonly performed searches by all users and then tag results with them if they match.

    You would end up with a table (or something similar) of URLs in a many-to-many join to a table of tags, so each result URL could have several appropriate tags.

    When the user searches, you simply match their search against the full index. But for the filters, you take the top results from among the current resultset.
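
    As a rough in-memory sketch of the idea (plain arrays standing in for the URL and tag tables; the tag list, URLs and function names are all made up for illustration):

    ```php
    <?php
    // Hypothetical tag list, standing in for an aggregate of popular searches.
    $tags = ['php', 'clustering', 'apache'];

    // Hypothetical index: URL => title and description text.
    $index = [
        'http://example.com/1' => 'PHP clustering library',
        'http://example.com/2' => 'Clustering search results in PHP',
        'http://example.com/3' => 'Apache virtual hosts',
    ];

    // Build the many-to-many mapping: URL => list of matching tags.
    function tag_results(array $index, array $tags): array
    {
        $tagged = [];
        foreach ($index as $url => $text) {
            $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
            $tagged[$url] = array_values(array_intersect($tags, $words));
        }
        return $tagged;
    }

    // Count tags across the URLs in the current result set, most frequent first.
    function facets(array $tagged, array $resultUrls, int $limit = 5): array
    {
        $counts = [];
        foreach ($resultUrls as $url) {
            foreach ($tagged[$url] ?? [] as $tag) {
                $counts[$tag] = ($counts[$tag] ?? 0) + 1;
            }
        }
        arsort($counts);
        return array_slice($counts, 0, $limit, true);
    }

    $tagged = tag_results($index, $tags);
    // Pretend the user's search matched the first two URLs.
    $filters = facets($tagged, ['http://example.com/1', 'http://example.com/2']);
    print_r($filters);
    ```

    In a real setup the tagging would run ahead of time and be stored in the join table, so only the counting step happens per search.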

    I'll work on query examples if you want.
