
PHP library for word clustering/NLP?


What I am trying to implement is a rather trivial "take search results (as in title & short description), cluster them into meaningful named groups" program in PHP.



After hours of googling and countless searches on SO (yielding interesting results as always, albeit nothing really useful) I'm still unable to find any PHP library that would help me handle clustering.



  • Is there such a PHP library out there that I might have missed?

  • If not, is there any FOSS that handles clustering and has a decent API?


Source: Tips4all

Comments

  1. Like this:

    Use a list of stopwords, get all words or phrases not in the stopwords, count the occurrences of each, and sort in descending order.

    The stopword list needs to contain all common English terms. It should also include punctuation, in which case you will first need to preg_replace all the punctuation into separate words, e.g. "Something, like this." -> "Something , like this ." Alternatively, you can just remove all punctuation.

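    $content = strtolower($content);                      // normalise case so "The" matches "the"
    $content = preg_replace('/[^a-z\s]/', '', $content);  // remove punctuation (and digits)
    $words = preg_split('/\s+/', $content, -1, PREG_SPLIT_NO_EMPTY); // split the string into words

    $stopwords = 'the|and|is|your|me|for|where|etc...';
    $stopwords = explode('|', $stopwords);
    $stopwords = array_flip($stopwords);                  // flip so isset() can be used for lookups

    $result = array(); $temp = array();
    foreach ($words as $s) {
        if (isset($stopwords[$s]) || strlen($s) < 3) {
            // a stopword or very short word ends the current phrase
            if (count($temp) > 0) {
                $result[] = implode(' ', $temp);
                $temp = array();
            }
        } else {
            $temp[] = $s;
        }
    }
    if (count($temp) > 0) $result[] = implode(' ', $temp); // flush the last phrase

    $phrases = array_count_values($result);               // phrase => number of occurrences
    arsort($phrases);                                     // most frequent first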


    Now you have an associative array in order of the frequency of terms that occur in your input data.

    How you do the matching is up to you, and it depends largely on the length of the strings in the input data.

    I would check whether any of the top 3 array keys for one result match any of the top 3 for any other result in the data. Results that share a key are then your groups.
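
    The top-3 matching idea could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested implementation: topPhrases() and groupByTopPhrases() are hypothetical helper names, the stopword list and sample results are made up, and each group is simply labelled with the phrase its members share.

    ```php
    <?php
    // Hypothetical helper: the $n most frequent phrases of one text, using the
    // same "split on stopwords and short words" idea described above.
    function topPhrases(string $text, array $stopwords, int $n = 3): array
    {
        $text  = preg_replace('/[^a-z\s]/', '', strtolower($text));
        $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

        $phrases = [];
        $current = [];
        foreach ($words as $word) {
            if (isset($stopwords[$word]) || strlen($word) < 3) {
                // stopword or short word: close the phrase collected so far
                if ($current) { $phrases[] = implode(' ', $current); $current = []; }
            } else {
                $current[] = $word;
            }
        }
        if ($current) { $phrases[] = implode(' ', $current); }

        $counts = array_count_values($phrases);
        arsort($counts);
        return array_slice(array_keys($counts), 0, $n);
    }

    // Hypothetical helper: map each phrase to the results that have it among
    // their top phrases, keeping only phrases shared by two or more results.
    function groupByTopPhrases(array $results, array $stopwords, int $n = 3): array
    {
        $groups = [];
        foreach ($results as $id => $text) {
            foreach (topPhrases($text, $stopwords, $n) as $phrase) {
                $groups[$phrase][] = $id;
            }
        }
        return array_filter($groups, fn($ids) => count($ids) > 1);
    }

    // Made-up sample data (title + description already joined into one string).
    $stopwords = array_flip(['the', 'and', 'is', 'for', 'with', 'your']);
    $results = [
        'a' => 'Jaguar speed: the jaguar is fast',
        'b' => 'The jaguar is for sale',
        'c' => 'Cooking with garlic',
    ];
    $groups = groupByTopPhrases($results, $stopwords);
    print_r($groups);
    ```

    With this sample data, results a and b end up in one group labelled "jaguar", and c stays ungrouped. Exact phrase matching is strict ("jaguar cars" would not match "jaguar"), so for short titles you may prefer to compare single words instead.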

    Let me know if you have any trouble with this.

  2. "... cluster them into meaningful groups" is a bit too vague; you'll need to be more specific.

    For starters you could look into K-Means clustering.
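
    To give an idea of what K-Means involves, here is a minimal sketch in PHP. It is an illustration only and rests on several assumptions: each result is reduced to a plain term-count vector, distance is squared Euclidean, and the first $k vectors are used as the initial centroids so the run is deterministic (real implementations usually pick random or smarter starting points). The function names and sample texts are made up.

    ```php
    <?php
    // Turn a text into a term-count vector over a fixed vocabulary.
    function vectorize(string $text, array $vocab): array
    {
        $counts = array_count_values(str_word_count(strtolower($text), 1));
        return array_map(fn($term) => (float) ($counts[$term] ?? 0), $vocab);
    }

    function squaredDistance(array $a, array $b): float
    {
        $sum = 0.0;
        foreach ($a as $i => $value) {
            $sum += ($value - $b[$i]) ** 2;
        }
        return $sum;
    }

    // Plain k-means: returns an array of document id => cluster index.
    // Assumption: the first $k vectors are taken as the initial centroids.
    function kMeans(array $vectors, int $k, int $maxIterations = 20): array
    {
        $centroids  = array_slice(array_values($vectors), 0, $k);
        $assignment = [];
        for ($iter = 0; $iter < $maxIterations; $iter++) {
            // Assignment step: attach every vector to its nearest centroid.
            $new = [];
            foreach ($vectors as $id => $vector) {
                $best = 0;
                $bestDist = INF;
                foreach ($centroids as $c => $centroid) {
                    $d = squaredDistance($vector, $centroid);
                    if ($d < $bestDist) { $bestDist = $d; $best = $c; }
                }
                $new[$id] = $best;
            }
            if ($new === $assignment) {
                break; // nothing moved, so the clustering has converged
            }
            $assignment = $new;

            // Update step: move each centroid to the mean of its members.
            foreach ($centroids as $c => $centroid) {
                $members = array_keys($assignment, $c, true);
                if (!$members) {
                    continue; // leave an empty cluster's centroid where it is
                }
                $sum = array_fill(0, count($centroid), 0.0);
                foreach ($members as $id) {
                    foreach ($vectors[$id] as $i => $value) {
                        $sum[$i] += $value;
                    }
                }
                $centroids[$c] = array_map(fn($s) => $s / count($members), $sum);
            }
        }
        return $assignment;
    }

    // Made-up sample "search results".
    $docs = [
        'php array sort',
        'python snake habitat',
        'php array functions',
        'snake habitat desert',
    ];
    $vocab    = array_values(array_unique(str_word_count(strtolower(implode(' ', $docs)), 1)));
    $vectors  = array_map(fn($doc) => vectorize($doc, $vocab), $docs);
    $clusters = kMeans($vectors, 2);
    print_r($clusters);
    ```

    Here documents 0 and 2 land in one cluster and 1 and 3 in the other. Note that K-Means only gives you the groups; you still need a separate step to name them, for example by taking the most frequent terms in each cluster.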

    Have a look at this page and website:

    PHP/ir: Information Retrieval and other interesting topics

    EDIT: You could try some data mining yourself by cross-referencing the search results with something like the Open Directory (dmoz) RDF data dump and then enumerating the matching categories.

    EDIT2: And here is a dmoz/category question that also mentions "Faceted Search"!

    Dmoz/Monster algorithme to calculate count of each category and sub category?

  3. You could also have a look at Programming Collective Intelligence (Chapter 3: Discovering Groups) by Toby Segaran, which goes through just this use case using Python. You should be able to implement things in PHP once you understand how it works.

    Even though it is not PHP, the Carrot2 project offers several clustering engines and can be integrated with Solr.

  4. This may be way off, but check out OpenCalais. They have a web service that lets you pass in a block of text and returns a parseable response listing the things it found in the text, such as places, people, facts etc. You could use these categories to build your "clouds" and to choose which results to display.

    I've used this library a few times in PHP and it's always been quite easy to work with.

    Again, this might not be relevant to what you're trying to do. Maybe you could post an example of what you're trying to accomplish?

  5. If you can pre-define the filters for your faceted search (the named groups) then it will be much easier.

    Rather than relying on an algorithm that uses the current searcher's input and their particular results to generate the filter list, you would use an aggregate of the most commonly performed searches by all users and then tag results with them if they match.

    You would end up with a table (or something similar) of URLs in a many-to-many join to a table of tags, so each result URL could have several appropriate tags.

    When the user searches, you simply match their search against the full index. But for the filters, you take the top results from among the current resultset.
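
    As a rough illustration of the tagging and filter steps, here is a sketch using plain PHP arrays in place of database tables. Everything in it is an assumption made for the example: the tag list, the example.com URLs, the function names tagResults() and topFilters(), and the simple substring match used to decide whether a tag applies.

    ```php
    <?php
    // Hypothetical tags, predefined from the most commonly performed searches.
    $tags = ['php', 'clustering', 'python'];

    // Hypothetical full index: URL => title/description text.
    $index = [
        'http://example.com/a' => 'K-means clustering in PHP',
        'http://example.com/b' => 'PHP string functions',
        'http://example.com/c' => 'Clustering search results with Python',
    ];

    // Build the many-to-many link between URLs and tags as a list of
    // [url, tag] rows, much like a join table would hold them.
    function tagResults(array $index, array $tags): array
    {
        $urlTags = [];
        foreach ($index as $url => $text) {
            foreach ($tags as $tag) {
                if (stripos($text, $tag) !== false) { // naive case-insensitive match
                    $urlTags[] = ['url' => $url, 'tag' => $tag];
                }
            }
        }
        return $urlTags;
    }

    // Count the tags attached to the URLs in the current result set and
    // return the most frequent ones as tag => count.
    function topFilters(array $urlTags, array $resultUrls, int $limit = 5): array
    {
        $inResults = array_flip($resultUrls);
        $counts = [];
        foreach ($urlTags as $row) {
            if (isset($inResults[$row['url']])) {
                $counts[$row['tag']] = ($counts[$row['tag']] ?? 0) + 1;
            }
        }
        arsort($counts);
        return array_slice($counts, 0, $limit, true);
    }

    $urlTags = tagResults($index, $tags);

    // Pretend the user's search matched results a and c.
    $filters = topFilters($urlTags, ['http://example.com/a', 'http://example.com/c']);
    print_r($filters);
    ```

    For these two results, "clustering" comes out on top with a count of 2, followed by "php" and "python" with 1 each. In a real setup the tagging would run ahead of time and the counting would be a GROUP BY over the join table rather than a PHP loop.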

    I'll work on query examples if you want.

