
Ask HN: Data people, what things do you always look for in a new dataset?

4 points by jameskerr, almost 2 years ago

1 comment

kingkongjaffa, almost 2 years ago
Presuming the data is tabular and has rows and columns...

I look for a manifest or README file, which usually explains what the columns are.

I look for columns that could be used as unique identifiers or could be primary/foreign keys in a db table.

I look at the names of all the columns to understand the domain, and if I don't know what a column represents, I make a note of it to find out more.

I look for the data type used for each column.

For each numerical column, I look at the range of values and some basic stats: min/max/mean/mode/std. dev.

If the data is in a domain I know, I make a note of whether each column's numerical values make sense (does a temperature of -9000 degrees make sense, or is it a sensor malfunction / no-read value?).

I look for incomplete rows, and if anything is blank, why is that?

I suppose if you understand all of those, you should be ready to load the data into a db or use it for further analytics.

Practically, you want to understand the magnitude of the data: how many columns and rows does an average payload or batch contain?

Can the data fit in memory or not?

Does the data come in chunks, or is it streamed somehow?
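
Several of these checks (missing values, candidate key columns, column types, basic stats) can be roughed out in a few lines. Here is a minimal sketch using only the Python standard library. The function name `profile_rows`, the sample CSV, and its column names are all made up for illustration, and the type inference is deliberately crude: a column counts as numeric only if every non-blank value parses as a float.

```python
import csv
import io
import statistics


def profile_rows(rows):
    """Summarize each column of a list of dict rows (e.g. from csv.DictReader)."""
    columns = list(rows[0].keys()) if rows else []
    report = {}
    for col in columns:
        raw = [row.get(col) for row in rows]
        # Treat None and whitespace-only strings as missing.
        present = [v for v in raw if v is not None and v.strip() != ""]

        # Crude type inference: numeric only if every present value parses.
        numeric = []
        for v in present:
            try:
                numeric.append(float(v))
            except ValueError:
                numeric = None
                break

        info = {
            "missing": len(raw) - len(present),
            "type": "numeric" if numeric else "text",
            # Candidate key: no blanks and no repeated values.
            "candidate_key": len(present) == len(raw)
            and len(set(present)) == len(raw),
        }
        if numeric:
            info["min"] = min(numeric)
            info["max"] = max(numeric)
            info["mean"] = statistics.mean(numeric)
            info["stdev"] = statistics.stdev(numeric) if len(numeric) > 1 else 0.0
        report[col] = info
    return report


# Hypothetical sample data, including a blank cell and a suspicious reading.
SAMPLE_CSV = """sensor_id,temperature_c,site
a1,21.5,north
a2,-9000,north
a3,,south
a4,19.0,south
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = profile_rows(rows)

print(f"{len(rows)} rows, {len(report)} columns")
for name, info in report.items():
    print(name, info)
```

In this sample, the -9000 shows up as the minimum of `temperature_c`, which is the kind of value worth checking against domain knowledge before trusting the mean or standard deviation. This version reads everything into memory, so it only suits data that fits there; for larger or streamed data the same counters would need to be updated incrementally.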