
Ask HN: What is the best way to get slice of a bigdata set present in txt files?

2 points by sunilkumarc over 7 years ago
I have 20 .txt files, each with anywhere from a few hundred thousand to a couple of million records. All of these files are interdependent. For example, take person.txt and address.txt: person.txt has a person_id field, and all of the corresponding addresses can be found in address.txt, with person_id as the foreign key.

My question is: what is the best way to generate a slice of the whole data set?

For example, say person.txt contains person_id 1 to 1 million, and I want all of the related data across all the files for person_id 1 - 10. What would be the best approach? In the end I want to generate 20 new .txt files that contain the data for just those 10 people.

Two approaches I can think of are:

1. Load the data from all the .txt files into different tables, then write queries that join the related tables to pull out the data.

2. Write a Python script that uses tools like grep, cut, etc. to extract the data slice. For example, first run grep on person.txt to get the matching records, then run grep on the other files (using person_id if that is the foreign key in the other file, or some other field from the previous grep's result).

Approach 1 has the advantage of database indexing, which makes reads from the tables faster. On the other hand, a Python script avoids the overhead of creating a database, loading all the data into it, and joining.

I'm not sure which is the best way to get a good execution time for my requirement. Any help is appreciated. Thanks.
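A minimal sketch of approach 2 in pure Python, with no database involved. The file names, the tab delimiter, and the position of the person_id column in each file are assumptions for illustration:

```python
# Sketch of approach 2: filter each .txt file by person_id, no database.
# Assumes tab-separated files and a known person_id column position per file.
import csv

WANTED_IDS = {str(i) for i in range(1, 11)}  # person_id 1..10

def slice_file(src, dst, id_column):
    """Copy only the rows whose person_id column is in WANTED_IDS."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t")
        for row in reader:
            if row[id_column] in WANTED_IDS:
                writer.writerow(row)

# Assumed layout: person_id is the first column of both files.
slice_file("person.txt", "person_slice.txt", id_column=0)
slice_file("address.txt", "address_slice.txt", id_column=0)
```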

3 comments

unoti over 7 years ago
I’d reach for SQLite. It’s a great and fast way to do the joins. It also would make for a much better way to transport this data than a collection of interrelated text files. If the data will be transmitted and refreshed periodically this will make for a much more scalable, extensible, and smooth process.
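A rough sketch of that route using only Python's standard library; the table layouts, column names, and tab-separated format are assumptions for illustration:

```python
# Load the .txt files into SQLite, index the foreign key, and join.
import csv
import sqlite3

conn = sqlite3.connect("slice.db")
conn.execute("CREATE TABLE IF NOT EXISTS person  (person_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS address (person_id INTEGER, city TEXT)")

def load(path, table, n_cols):
    """Bulk-insert a tab-separated .txt file into a table."""
    with open(path, newline="") as f:
        rows = csv.reader(f, delimiter="\t")
        conn.executemany(
            f"INSERT INTO {table} VALUES ({','.join('?' * n_cols)})", rows)

load("person.txt", "person", 2)
load("address.txt", "address", 2)
conn.execute("CREATE INDEX IF NOT EXISTS idx_addr ON address(person_id)")
conn.commit()

# The index keeps this join cheap even with millions of rows.
slice_rows = conn.execute("""
    SELECT p.person_id, p.name, a.city
    FROM person p JOIN address a USING (person_id)
    WHERE p.person_id BETWEEN 1 AND 10
""").fetchall()
```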
PaulHoule over 7 years ago
It depends on your case.

I can say that pandas is amazingly good and fast at joining; if your data is really huge you can parallelize/segment it with dask.

If the joining is relatively simple, it will probably be quicker to process the .txt files directly, and the script won't be hard to write.

If the complexity is high, you might want to try SQLite.
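A quick pandas sketch of that kind of join; the column names and the tab delimiter are assumptions about the .txt layout:

```python
# Read the .txt files, filter the wanted person_ids, join, and write slices.
import pandas as pd

person = pd.read_csv("person.txt", sep="\t", names=["person_id", "name"])
address = pd.read_csv("address.txt", sep="\t", names=["person_id", "city"])

wanted = person[person["person_id"].between(1, 10)]

# Inner join on the foreign key, then write the slices back out as .txt files.
joined = wanted.merge(address, on="person_id", how="inner")
wanted.to_csv("person_slice.txt", sep="\t", index=False, header=False)
joined.to_csv("address_slice.txt", sep="\t", index=False, header=False)
```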
fiddlerwoaroof over 7 years ago
Spark makes processing data in files really easy: at work I’ve been using this to read in a bunch of events and generate various metrics from them.
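A small PySpark sketch along these lines; the file names, column names, and tab delimiter are assumptions:

```python
# Read both files, filter the driving table, join on the foreign key,
# and write the slice out as tab-separated files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("slice").getOrCreate()

person = (spark.read.option("sep", "\t").option("inferSchema", "true")
          .csv("person.txt").toDF("person_id", "name"))
address = (spark.read.option("sep", "\t").option("inferSchema", "true")
           .csv("address.txt").toDF("person_id", "city"))

wanted = person.filter(person.person_id.between(1, 10))
(wanted.join(address, "person_id")
       .write.option("sep", "\t").mode("overwrite").csv("slice_out"))
```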