I have 20 .txt files, each with anywhere from a few hundred thousand to a couple of million records. All these files are interdependent. For example, person.txt has a person_id field, and all the corresponding addresses can be found in address.txt, with person_id as the foreign key in address.txt.

My question is: what is the best way to generate a slice of the whole data set?

For example, say person_id runs from 1 to 1 million in person.txt, and I want all the related data in all the files for person_id = 1 - 10. What would be the best approach? Finally, I want to generate 20 new .txt files containing the data for just those 10 people.

Two approaches I can think of are:

1. Load the data from all the .txt files into different database tables, then write queries that join the related tables to pull out the slice.

2. Write a Python script that uses tools like grep, cut, etc. to extract the slice (rough sketch below). For example, first run grep on person.txt to get the matching records, then grep the other files (using person_id if that is the foreign key in the other file, or some other field taken from the previous grep's output).

Approach 1 has the advantage of database indexing, which makes reads from the tables faster. On the other hand, a Python script avoids the overhead of creating a database, loading all the data into it, and joining.

I'm not sure which is the best way to get better execution time for my requirement. Any help is appreciated. Thanks.
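For reference, this is the rough shape of approach 2 in plain Python (filtering directly rather than shelling out to grep; the file names, tab delimiter, and column positions are just placeholders for my actual layout):

    # Sketch of approach 2: stream each file once and keep only
    # matching rows. Assumes tab-delimited files; person_id is
    # assumed to be column 0 in person.txt and column 1 in
    # address.txt -- adjust to the real layout.
    wanted = set(str(i) for i in range(1, 11))  # person_id 1-10

    # Pass 1: slice person.txt
    with open("person.txt") as src, open("person_slice.txt", "w") as dst:
        for line in src:
            if line.split("\t")[0] in wanted:
                dst.write(line)

    # Pass 2: slice a dependent file such as address.txt
    with open("address.txt") as src, open("address_slice.txt", "w") as dst:
        for line in src:
            if line.split("\t")[1] in wanted:
                dst.write(line)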
I'd reach for SQLite. It's a great, fast way to do the joins, and it's also a much better way to transport this data than a collection of interrelated text files. If the data will be transmitted and refreshed periodically, this makes for a much more scalable, extensible, and smooth process.
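A minimal sketch with the stdlib sqlite3 module, using just two of the files; the table schemas and the tab delimiter are assumptions, so adapt them to the real columns:

    import csv
    import sqlite3

    con = sqlite3.connect("slice.db")
    # Assumed schemas: extend with the real columns of each file.
    con.execute("CREATE TABLE person (person_id INTEGER, name TEXT)")
    con.execute("CREATE TABLE address (person_id INTEGER, addr TEXT)")

    # Bulk-load the text files (assumed tab-delimited).
    with open("person.txt") as f:
        con.executemany("INSERT INTO person VALUES (?, ?)",
                        csv.reader(f, delimiter="\t"))
    with open("address.txt") as f:
        con.executemany("INSERT INTO address VALUES (?, ?)",
                        csv.reader(f, delimiter="\t"))

    # Index the foreign key so the join is fast.
    con.execute("CREATE INDEX idx_addr_pid ON address (person_id)")
    con.commit()

    # The slice: all addresses for person_id 1-10.
    rows = con.execute(
        "SELECT a.* FROM address a "
        "JOIN person p ON p.person_id = a.person_id "
        "WHERE p.person_id BETWEEN 1 AND 10").fetchall()

Once the .db file exists, every further slice is just another query, and you can ship the single database file instead of 20 text files.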
It depends on your case.

I can say that pandas is amazingly good and fast at joining; if your data is really huge, you can parallelize/segment it with dask.

If the joining is relatively simple, it will probably be quicker to process the .txt files directly, and the script won't be hard to write.

If the complexity is high, you might want to try SQLite.
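A sketch of the pandas route, assuming tab-delimited files with header rows and a person_id column (all names here are placeholders):

    import pandas as pd

    person = pd.read_csv("person.txt", sep="\t")
    address = pd.read_csv("address.txt", sep="\t")

    # Slice the driving table first.
    person_slice = person[person["person_id"].between(1, 10)]

    # Semi-join: keep only rows whose person_id appears in the slice.
    address_slice = address[address["person_id"].isin(person_slice["person_id"])]

    # Write the sliced files back out in the same format.
    person_slice.to_csv("person_slice.txt", sep="\t", index=False)
    address_slice.to_csv("address_slice.txt", sep="\t", index=False)

The same isin() filter repeats for the other 18 files, keyed on whatever field links each one back to the slice.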