
PHP library for word clustering/NLP?


What I am trying to implement is a rather trivial "take search results (as in title & short description), cluster them into meaningful named groups" program in PHP.



After hours of googling and countless searches on SO (yielding interesting results as always, albeit nothing really useful) I'm still unable to find any PHP library that would help me handle clustering.



  • Is there such a PHP library out there that I might have missed?

  • If not, is there any FOSS that handles clustering and has a decent API?


Source: Tips4all

Comments

  1. Like this:

    Use a list of stopwords, get all words or phrases not in the stopwords, count occurrences of each, and sort in descending order.

    The stopword list needs to contain all common English terms. It should also include punctuation, and you will need to preg_replace all the punctuation so that each mark becomes a separate word first, e.g. "Something, like this." -> "Something , like this ." Alternatively, you can just remove all punctuation.
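
    The first option, keeping punctuation as separate words, might be sketched as follows. The function name separate_punctuation is made up for illustration, and the [[:punct:]] class only covers ASCII punctuation, so treat this as a starting point rather than a complete tokenizer.

    ```php
    <?php
    // Hypothetical helper (not part of the original comment): pads each
    // punctuation mark with spaces so it can be treated as its own word.
    function separate_punctuation(string $text): string
    {
        // Surround every ASCII punctuation character with spaces.
        $spaced = preg_replace('/([[:punct:]])/', ' $1 ', $text);
        // Collapse the runs of whitespace created above and trim the ends.
        return trim(preg_replace('/\s+/', ' ', $spaced));
    }

    echo separate_punctuation('Something, like this.'), "\n"; // Something , like this .
    ```

    Note that this also splits apostrophes, so "don't" becomes "don ' t"; if that matters for your data, exclude the apostrophe from the pattern.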

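    $content=strtolower($content); // the pattern below only keeps a-z, so lowercase first
    $content=preg_replace('/[^a-z\s]/', '', $content); // remove punctuation
    $content=preg_split('/\s+/', $content, -1, PREG_SPLIT_NO_EMPTY); // split into words

    $stopwords='the|and|is|your|me|for|where|etc...';
    $stopwords=explode('|',$stopwords);
    $stopwords=array_flip($stopwords); // flip so isset() gives a fast lookup

    $result=array(); $temp=array();
    foreach ($content as $s)
    {
        if (isset($stopwords[$s]) || strlen($s)<3)
        {
            // a stopword or very short word ends the current phrase
            if (sizeof($temp)>0)
            {
                $result[]=implode(' ',$temp);
                $temp=array();
            }
        } else $temp[]=$s;
    }
    if (sizeof($temp)>0) $result[]=implode(' ',$temp); // flush the last phrase

    $phrases=array_count_values($result); // phrase => number of occurrences
    arsort($phrases); // most frequent phrases first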


    Now you have an associative array of phrases and their counts, ordered by how often each phrase occurs in your input data.

    How you want to do the matches depends upon you, and it depends largely on the length of the strings in the input data.

    I would check whether any of the top 3 array keys for one item match any of the top 3 keys for any other item in the data. Items that share a key then form your groups, and the shared key can serve as the group name.
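
    A minimal sketch of that matching step, assuming you have already computed the top phrases per item (for example with array_slice(array_keys($phrases), 0, 3) for each result). The function name group_by_shared_phrases and the sample data are invented for illustration.

    ```php
    <?php
    // Hypothetical helper: $topPhrases maps an item id to its most frequent
    // phrases (e.g. the first 3 keys of that item's $phrases array).
    function group_by_shared_phrases(array $topPhrases): array
    {
        $groups = array();
        foreach ($topPhrases as $id => $phrasesForItem) {
            foreach (array_unique($phrasesForItem) as $phrase) {
                $groups[$phrase][] = $id; // the phrase becomes the group name
            }
        }
        // Keep only phrases shared by at least two items.
        return array_filter($groups, function ($ids) { return count($ids) > 1; });
    }

    // Made-up sample data: three search results and their top phrases.
    $topPhrases = array(
        'result1' => array('php library', 'clustering', 'search results'),
        'result2' => array('clustering', 'python', 'book'),
        'result3' => array('jquery', 'filter', 'list'),
    );
    print_r(group_by_shared_phrases($topPhrases));
    ```

    With this sample data only "clustering" is shared, so it becomes the single group containing result1 and result2. An item can land in several groups if it shares more than one phrase with other items.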

    Let me know if you have any trouble with this.

  2. "... cluster them into meaningful groups" is a bit too vague; you'll need to be more specific.

    For starters you could look into K-Means clustering.
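
    To give an idea of what K-Means involves, here is a minimal sketch in plain PHP. It assumes each search result has already been turned into a numeric vector (for example term counts over a fixed vocabulary). The function names, the naive seeding with the first $k points, and the toy vectors are all assumptions made for illustration, not a tuned implementation.

    ```php
    <?php
    // Squared Euclidean distance between two equal-length numeric vectors.
    function squared_distance(array $a, array $b): float
    {
        $sum = 0.0;
        foreach ($a as $i => $value) {
            $sum += ($value - $b[$i]) ** 2;
        }
        return $sum;
    }

    // Hypothetical helper: returns the cluster index assigned to each point.
    // Assumes $k <= count($points) and that all vectors have the same length.
    function kmeans(array $points, int $k, int $maxIterations = 100): array
    {
        $points = array_values($points);
        $centroids = array_slice($points, 0, $k); // naive seeding: first $k points
        $assignments = array();

        for ($iteration = 0; $iteration < $maxIterations; $iteration++) {
            // Assignment step: attach each point to its nearest centroid.
            $newAssignments = array();
            foreach ($points as $p => $point) {
                $best = 0;
                $bestDistance = INF;
                foreach ($centroids as $c => $centroid) {
                    $distance = squared_distance($point, $centroid);
                    if ($distance < $bestDistance) {
                        $bestDistance = $distance;
                        $best = $c;
                    }
                }
                $newAssignments[$p] = $best;
            }
            if ($newAssignments === $assignments) {
                break; // nothing changed, so the clustering has converged
            }
            $assignments = $newAssignments;

            // Update step: move each centroid to the mean of its points.
            foreach ($centroids as $c => $centroid) {
                $members = array();
                foreach ($assignments as $p => $cluster) {
                    if ($cluster === $c) $members[] = $points[$p];
                }
                if (count($members) === 0) continue; // leave an empty cluster's centroid alone
                foreach ($centroid as $d => $unused) {
                    $centroids[$c][$d] = array_sum(array_column($members, $d)) / count($members);
                }
            }
        }
        return $assignments;
    }

    // Toy term-count vectors over the made-up vocabulary [php, cluster, jquery].
    $vectors = array(array(3, 1, 0), array(0, 0, 4), array(2, 2, 0), array(0, 1, 3));
    print_r(kmeans($vectors, 2)); // cluster indexes: 0, 1, 0, 1
    ```

    Keep in mind that K-Means only gives you cluster indexes, not names; you would still need something like the most frequent term in each cluster to label the groups, and you have to pick $k up front.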

    Have a look at this page and website:

    PHP/ir: Information Retrieval and other interesting topics

    EDIT: You could try some data mining yourself by cross referencing search results with something like the open directory dmoz RDF data dump and then enumerate the matching categories.

    EDIT2: And here is a dmoz/category question that also mentions "Faceted Search"!

    Dmoz/Monster algorithme to calculate count of each category and sub category?

  3. You could also have a look at Programming Collective Intelligence (Chapter 3: Discovering Groups) by Toby Segaran, which goes through just this use case using Python. However, you should be able to implement things in PHP once you understand how it works.

    Even though it is not PHP, the Carrot2 project offers several clustering engines and can be integrated with Solr.

  4. This may be way off, but check out OpenCalais. They have a web service which allows you to pass in a block of text, and it will pass you back a parseable response of things that it found in the text, such as places, people, facts etc. You could use these categories to build your "clouds" and to choose which results to display.

    I've used this library a few times in PHP and it's always been quite easy to work with.

    Again, this might not be relevant to what you're trying to do. Maybe you could post an example of what you're trying to accomplish?

  5. If you can pre-define the filters for your faceted search (the named groups) then it will be much easier.

    Rather than relying on an algorithm that uses the current searcher's input and their particular results to generate the filter list, you would use an aggregate of the most commonly performed searches by all users and then tag results with them if they match.

    You would end up with a table (or something) of URLs in a many-to-many join to a table of tags, so each result url could have several appropriate tags.

    When the user searches, you simply match their search against the full index. But for the filters, you take the top results from among the current resultset.
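
    As a rough sketch of that last step, with a plain PHP array standing in for the URL-to-tag join table. The example.com URLs, the tags, and the function name top_filters are all placeholders, and in practice this would more likely be a database query.

    ```php
    <?php
    // Hypothetical in-memory stand-in for the URL <-> tag join table:
    // each URL maps to the tags assigned to it.
    $urlTags = array(
        'http://example.com/a' => array('php', 'clustering'),
        'http://example.com/b' => array('php', 'search'),
        'http://example.com/c' => array('python', 'clustering'),
        'http://example.com/d' => array('php', 'search'),
    );

    // Counts the tags attached to the URLs in the current result set and
    // returns the $limit most frequent ones as filter candidates.
    function top_filters(array $resultUrls, array $urlTags, int $limit = 3): array
    {
        $counts = array();
        foreach ($resultUrls as $url) {
            $tags = isset($urlTags[$url]) ? $urlTags[$url] : array();
            foreach ($tags as $tag) {
                $counts[$tag] = isset($counts[$tag]) ? $counts[$tag] + 1 : 1;
            }
        }
        arsort($counts); // most common tags first
        return array_slice($counts, 0, $limit, true);
    }

    // Pretend the search matched these three URLs.
    $resultUrls = array('http://example.com/a', 'http://example.com/b', 'http://example.com/d');
    print_r(top_filters($resultUrls, $urlTags, 2)); // php => 3, search => 2
    ```

    The returned tags are the filters you would show next to the current result set; clicking one narrows the results to URLs carrying that tag.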

    I'll work on query examples if you want.

