
Ask HN: What is the best way to get slice of a bigdata set present in txt files?

2 points by sunilkumarc over 7 years ago
I have 20 .txt files, each with anywhere from a few hundred thousand to a couple of million records. All of these files are interdependent. For example, take person.txt and address.txt: person.txt has a person_id field, and all of the corresponding addresses can be found in address.txt, with person_id as the foreign key.

My question is: what is the best way to generate a slice of the whole data set?

For example, say person.txt contains person_id 1 to 1 million, and I want all of the related data across all the files for person_id 1 - 10. What would be the best approach? In the end I want to generate 20 new .txt files that contain the data for just those 10 people.

Two approaches I can think of are:

1. Load the data from all the .txt files into different tables, then write queries that join the related tables to pull out the data.

2. Write a Python script that uses tools like grep, cut, etc. to extract the data slice. For example, first run grep on person.txt to get the matching records, then run grep on the other files (using person_id if that is the foreign key in the other file, or some other field from the previous grep's result).

Approach 1 has the advantage of database indexing, which makes reads from the tables faster. On the other hand, a Python script avoids the overhead of creating a database, loading all the data into it, and joining.

I'm not sure which is the best way to get a good execution time for my requirement. Any help is appreciated. Thanks.
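A minimal sketch of approach 2 in pure Python, with no database involved. The file names, the tab delimiter, and the position of the person_id column in each file are assumptions for illustration:

```python
# Sketch of approach 2: filter each .txt file by person_id, no database.
# Assumes tab-separated files and a known person_id column position per file.
import csv

WANTED_IDS = {str(i) for i in range(1, 11)}  # person_id 1..10

def slice_file(src, dst, id_column):
    """Copy only the rows whose person_id column is in WANTED_IDS."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t")
        for row in reader:
            if row[id_column] in WANTED_IDS:
                writer.writerow(row)

# Assumed layout: person_id is the first column of both files.
slice_file("person.txt", "person_slice.txt", id_column=0)
slice_file("address.txt", "address_slice.txt", id_column=0)
```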

3 comments

unoti over 7 years ago
I’d reach for SQLite. It’s a great and fast way to do the joins. It also would make for a much better way to transport this data than a collection of interrelated text files. If the data will be transmitted and refreshed periodically this will make for a much more scalable, extensible, and smooth process.
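A rough sketch of that route using only Python's standard library; the table layouts, column names, and tab-separated format are assumptions for illustration:

```python
# Load the .txt files into SQLite, index the foreign key, and join.
import csv
import sqlite3

conn = sqlite3.connect("slice.db")
conn.execute("CREATE TABLE IF NOT EXISTS person  (person_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS address (person_id INTEGER, city TEXT)")

def load(path, table, n_cols):
    """Bulk-insert a tab-separated .txt file into a table."""
    with open(path, newline="") as f:
        rows = csv.reader(f, delimiter="\t")
        conn.executemany(
            f"INSERT INTO {table} VALUES ({','.join('?' * n_cols)})", rows)

load("person.txt", "person", 2)
load("address.txt", "address", 2)
conn.execute("CREATE INDEX IF NOT EXISTS idx_addr ON address(person_id)")
conn.commit()

# The index keeps this join cheap even with millions of rows.
slice_rows = conn.execute("""
    SELECT p.person_id, p.name, a.city
    FROM person p JOIN address a USING (person_id)
    WHERE p.person_id BETWEEN 1 AND 10
""").fetchall()
```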
PaulHoule over 7 years ago
It depends on your case.

I can say that pandas is amazingly good and fast at joining; if your data is really huge you can parallelize/segment it with dask.

If the joining is relatively simple, it will probably be quicker to process the .txt files directly, and the script won't be hard to write.

If the complexity is high, you might want to try SQLite.
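A quick pandas sketch of that kind of join; the column names and the tab delimiter are assumptions about the .txt layout:

```python
# Read the .txt files, filter the wanted person_ids, join, and write slices.
import pandas as pd

person = pd.read_csv("person.txt", sep="\t", names=["person_id", "name"])
address = pd.read_csv("address.txt", sep="\t", names=["person_id", "city"])

wanted = person[person["person_id"].between(1, 10)]

# Inner join on the foreign key, then write the slices back out as .txt files.
joined = wanted.merge(address, on="person_id", how="inner")
wanted.to_csv("person_slice.txt", sep="\t", index=False, header=False)
joined.to_csv("address_slice.txt", sep="\t", index=False, header=False)
```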
fiddlerwoaroof over 7 years ago
Spark makes processing data in files really easy: at work I’ve been using this to read in a bunch of events and generate various metrics from them.
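A small PySpark sketch along these lines; the file names, column names, and tab delimiter are assumptions:

```python
# Read both files, filter the driving table, join on the foreign key,
# and write the slice out as tab-separated files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("slice").getOrCreate()

person = (spark.read.option("sep", "\t").option("inferSchema", "true")
          .csv("person.txt").toDF("person_id", "name"))
address = (spark.read.option("sep", "\t").option("inferSchema", "true")
           .csv("address.txt").toDF("person_id", "city"))

wanted = person.filter(person.person_id.between(1, 10))
(wanted.join(address, "person_id")
       .write.option("sep", "\t").mode("overwrite").csv("slice_out"))
```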