i have files i need to convert into a database. these files (i have over 100k) are from an old system (generated from a cobol script). i am now part of the team that migrate data from this system to the new system.
now, because we have a lot of files to parse (each files is from 50mb to 100mb) i want to make sure i use the right methods in order to convert them to sql statement.
most of the files have these following format:
#id<tab>name<tab>address1<tab>address2<tab>city<tab>state<tab>zip<tab>country<tab>#\n
the address2 is optional and can be empty or
#id<tab>client<tab>taxid<tab>tagid<tab>address1<tab>address2<tab>city<tab>state<tab>zip<tab>country<tab>#\n
these are the 2 most common lines (i'll say around 50%), other than these, all the line looks the same but with different informatoin.
now, my question is what should i do to open them to be as efficient as possible and parse them correctly? thanks
Source: Tips4all, CCNA FINAL EXAM
Honestly, I wouldn't use PHP for this. I'd use awk. With input that's as predictably formatted as this, it'll run faster, and you can output into SQL commands which you can also insert via a command line.
ReplyDeleteIf you have other reasons why you need to use PHP, you probably want to investigate the fgetcsv() function. Output is an array which you can parse into your insert. One of the first user-provided examples takes CSV and inserts it into MySQL. And this function does let you specify your own delimiter, so tab will be fine.
If the id# in the first column is unique in your input data, then you should definitely insert this into a primary key in mysql, to save you from duplicating data if you have to restart your batch.
When I worked on a project where it was necessary to parse huge and complex log files (Apache, firewall, sql), we had a big gain in performance using the function preg_match_all(less than 10% of the time required using explode / trims / formatting).
ReplyDeleteHuge files (>100Mb) are parsed in 2 or 3 minutes in a core 2 duo (the drawback is that memory consumption is very high since it creates a giant array with all the information ready to be synthesized).
Regular expressions allow you to identify the content of line if you have variations within the same file.
But if your files are simple, try ghoti suggestion (fgetscv), will work fine.
If you're already familiar with PHP then using it is a perfectly fine tool.
ReplyDeleteIf records do not span multiple lines, the best way to do this to guarantee that you won't run out of memory will be to process one line at a time.
I'd also suggest looking at the Standard PHP Library. It has nice directory iterators and file objects that make working with files and directories a bit nicer (in my opinion) than it used to be.
If you can use the CSV features and you use the SPL, make sure to set your options correctly for the tab characters.
You can use trim to remove the # from the first and last fields easily enough after the call to fgetcsv
Just sit and parse.
ReplyDeleteIt's one-time operation and looking for the most efficient way makes no sense.
Just more or less sane way would be enough.
As a matter of fact, most likely you'll waste more overall time looking for the super-extra-best solution. Say, your code will run for a hour. You will spend another hour to find a solution that runs 30% faster. You'll spend 1,7 hours vs. 1.