
Shortest and fastest way to parse data files in PHP


I have files I need to convert into a database. These files (there are over 100k of them) come from an old system (generated by a COBOL script). I am now part of the team that is migrating data from this system to the new system.



Now, because we have a lot of files to parse (each file is 50 MB to 100 MB), I want to make sure I use the right methods to convert them into SQL statements.



Most of the files have the following format:




#id<tab>name<tab>address1<tab>address2<tab>city<tab>state<tab>zip<tab>country<tab>#\n



The address2 field is optional and can be empty. The other common format is:




#id<tab>client<tab>taxid<tab>tagid<tab>address1<tab>address2<tab>city<tab>state<tab>zip<tab>country<tab>#\n



These are the two most common line formats (I'd say around 50% of the lines); the rest look similar but carry different information.



Now, my question is: how should I open and parse these files so that the process is as efficient as possible and the records are parsed correctly? Thanks.



Comments

  1. Honestly, I wouldn't use PHP for this; I'd use awk. With input that's as predictably formatted as this, awk will run faster, and you can output SQL commands which you can then run from the command line.

    If you have other reasons why you need to use PHP, you probably want to investigate the fgetcsv() function. Its output is an array which you can parse into your insert; one of the first user-contributed examples on the manual page takes CSV and inserts it into MySQL. The function also lets you specify your own delimiter, so tab will be fine (see the sketch at the end of this comment).

    If the id# in the first column is unique in your input data, then you should definitely make it the primary key in MySQL, to save you from duplicating data if you have to restart your batch.
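
    For instance, a minimal sketch of that approach (the file name clients.dat, the table name, and the column list are my assumptions, based on the first format in the question):

        <?php
        // Sketch: parse one tab-delimited file line by line with fgetcsv()
        // and print INSERT statements for the 8-field name/address format.
        $handle = fopen('clients.dat', 'r');       // hypothetical input file
        if ($handle === false) {
            die("Cannot open input file\n");
        }
        while (($fields = fgetcsv($handle, 0, "\t")) !== false) {
            if ($fields === [null]) {
                continue;                          // skip blank lines
            }
            $fields[0] = ltrim($fields[0], '#');   // leading '#' is glued to the id
            if (end($fields) === '#') {
                array_pop($fields);                // drop the trailing '#' field
            }
            // Quote each value; real code should use prepared statements instead.
            $quoted = array_map(function ($f) {
                return "'" . addslashes((string) $f) . "'";
            }, $fields);
            echo 'INSERT INTO clients (id, name, address1, address2, city, state, zip, country) VALUES ('
                . implode(', ', $quoted) . ");\n";
        }
        fclose($handle);

    Piping the script's output into the mysql command-line client then loads each file in one pass.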

  2. When I worked on a project where it was necessary to parse huge and complex log files (Apache, firewall, SQL), we got a big performance gain from the preg_match_all() function (less than 10% of the time required by explode()/trim()/formatting).

    Huge files (>100 MB) are parsed in 2 or 3 minutes on a Core 2 Duo (the drawback is that memory consumption is very high, since it creates a giant array with all the information ready to be synthesized).

    Regular expressions also let you identify the content of a line when there are variations within the same file.

    But if your files are simple, try ghoti's suggestion (fgetcsv); it will work fine.
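
    For illustration, a minimal sketch of that approach, assuming the first line format from the question (the file name is hypothetical):

        <?php
        // Sketch: slurp the whole file and capture every line matching the
        // 8-field name/address format. Memory use is high because all the
        // matches are held in one array, as noted above.
        $data = file_get_contents('clients.dat');

        // One capture group per tab-separated field:
        // #id  name  address1  address2  city  state  zip  country  #
        $pattern = '/^#([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t'
                 . '([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t#\r?$/m';

        preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);

        foreach ($matches as $m) {
            list(, $id, $name, $addr1, $addr2, $city, $state, $zip, $country) = $m;
            // ... build the INSERT statement for this record here ...
        }

    A second pattern with ten capture groups would pick up the client/taxid/tagid lines from the same buffer.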

  3. If you're already familiar with PHP, then it's a perfectly fine tool for this.

    If records do not span multiple lines, the best way to guarantee that you won't run out of memory is to process one line at a time.

    I'd also suggest looking at the Standard PHP Library. It has nice directory iterators and file objects that make working with files and directories a bit nicer (in my opinion) than it used to be.

    If you use the CSV features of the SPL, make sure to set the delimiter option to the tab character.

    You can use trim() to remove the # from the first and last fields easily enough after the call to fgetcsv().
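
    Putting those pieces together, a minimal sketch (the file name is an assumption; the rest follows the suggestions above):

        <?php
        // Sketch: SplFileObject in CSV mode with a tab delimiter reads one
        // record at a time, so memory stays flat even on 100 MB files.
        $file = new SplFileObject('clients.dat');  // hypothetical input file
        $file->setFlags(SplFileObject::READ_CSV
            | SplFileObject::READ_AHEAD
            | SplFileObject::SKIP_EMPTY);
        $file->setCsvControl("\t");                // tab as the delimiter

        foreach ($file as $fields) {
            if ($fields === false || $fields === [null]) {
                continue;                          // skip blank lines
            }
            $fields[0] = ltrim($fields[0], '#');   // strip the leading '#' from the id
            if (end($fields) === '#') {
                array_pop($fields);                // drop the trailing '#' field
            }
            // count($fields) distinguishes the two formats from the question:
            // 8 fields = name/address record, 10 = client/taxid/tagid record.
            // ... build the matching INSERT statement here ...
        }

    Because SplFileObject is an iterator, the loop never holds more than one record in memory.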

  4. Just sit down and parse.
    It's a one-time operation, and looking for the most efficient way makes no sense.
    A more or less sane approach would be enough.
    As a matter of fact, you'll most likely waste more overall time looking for the super-extra-best solution. Say your code runs for an hour; you then spend another hour finding a solution that runs 30% faster. You'll have spent 1.7 hours vs. 1.


