Skip to main content

Best way to parse a text document


I'm trying to parse a plain text document in PHP but have no idea how to do it correctly. I want to separate each word, assign them an ID and save the result in JSON format.



Sample text:




"Hello, how are you (today)"



This is what im doing at the moment:




$document_array = explode(' ', $document_text);
json_encode($document_array);



The resulting JSON is




[["Hello,"],["how"],["are"],["you"],["(today)"]]



How do I ensure that spaces are kept in-place and that symbols are not included along with the words...




[["Hello"],[", "],["how"],[" "],["are"],[" "],["you"],[" ("],["today"],[")"]]



I’m sure some sort of regex is required... but have no idea what kind of pattern to apply to deal with all cases... Any suggestions guys?


Source: Tips4allCCNA FINAL EXAM

Comments

  1. This is actually a really complex problem, and one that's subject to a fair amount of academic reaserch. It sounds so simple (just split on whitespace! with maybe a few rules for punctuation...) but you quickly run into issues. Is "didn't" one word or two? What about hyphenated words? Some might be one word, some might be two. What about multiple successive punctuation characters? Possessives versus quotes? etc etc. Even determining the end of a sentence is non-trivial. (It's just a full stop right?!)

    This problem is one of tokenisation and a topic that search engines take very seriously. To be honest you should really look at finding a tokeniser in your language of choice.

    ReplyDelete
  2. Maybe this:?

    array_filter(preg_split('/\b/', $document_text))


    the 'array_filter', removes the empty values at the first and/or last index of the resulting array, which will appear if your string start or ends with a word boundary (\b see: http://php.net/manual/en/regexp.reference.escape.php)

    ReplyDelete

Post a Comment

Popular posts from this blog

Slow Android emulator

I have a 2.67 GHz Celeron processor, 1.21 GB of RAM on a x86 Windows XP Professional machine. My understanding is that the Android emulator should start fairly quickly on such a machine, but for me it does not. I have followed all instructions in setting up the IDE, SDKs, JDKs and such and have had some success in staring the emulator quickly but is very particulary. How can I, if possible, fix this problem?

CCNA 1 Final Exam 2011 latest (hot hot hot)

  Hi! I have been posted content of ccna1 final exam (latest and only question.) I will post the answer and insert image on sunday. If you care, please subscribe your email an become a first person have full test content. Subcribe now  Some question  have not content because this question have images content. So that can you wait for me? SUNDAY 1. A user sees the command prompt: Router(config-if)# . What task can be performed at this mode? Reload the device. Perform basic tests. Configure individual interfaces. Configure individual terminal lines. 2. Refer to the exhibit. Host A attempts to establish a TCP/IP session with host C. During this attempt, a frame was captured with the source MAC address 0050.7320.D632 and the destination MAC address 0030.8517.44C4. The packet inside the captured frame has an IP source address 192.168.7.5, and the destination IP address is 192.168.219.24. At which point in the network was this packet captured? leaving host A leaving ATL leaving...