
Ask HN: Who crawls websites on a regular basis and why?

5 points by sshaginyan over 8 years ago
Hi folks,

I'm thinking about building a web crawling service very similar to kimonolabs. Before I do, I'm trying to figure out who my target audience should or could be.

I started off thinking about a tool I would personally use, which is sort of a web polling trigger. For example: log in to website A, sort a list by relevance, check whether the first item on the relevance-sorted list is greater than X, and poll until that's true (once true, send an email, send an SMS, or make an API call), then log in to website B, insert the item from website A into an input field, and submit. If I were to start with this, who would use it the most? (Maybe a financial analyst? Investor? Sales? Market research?)

I've also been thinking about building the service specifically for data scientists and analysts. Features would include visualizing datasets, clustering analysis, sentiment analysis, relational and non-relational database modeling (similar to MySQL Workbench) directly in the browser, and integrations with IBM Watson (https://www.ibm.com/watson/developercloud/personality-insights.html).

What do you guys think?
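The polling-trigger workflow the post describes can be sketched in a few lines. This is a minimal sketch, not the author's design: the listing URL, CSS selectors, and threshold are hypothetical, and requests plus BeautifulSoup simply stand in for whatever login/scraping stack the real service would use.

```python
import time

import requests
from bs4 import BeautifulSoup

LISTING_URL = "https://example.com/items?sort=relevance"  # hypothetical "website A"
THRESHOLD = 100                                            # the "X" from the post
POLL_SECONDS = 300


def top_item_score(session: requests.Session) -> tuple[str, float]:
    """Fetch the relevance-sorted list and return (title, score) of the first item."""
    resp = session.get(LISTING_URL, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    first = soup.select_one(".item")                       # hypothetical selectors
    title = first.select_one(".title").get_text(strip=True)
    score = float(first.select_one(".score").get_text(strip=True))
    return title, score


def notify(title: str, score: float) -> None:
    """Placeholder for the email / SMS / API-call step."""
    print(f"Trigger fired: {title!r} scored {score}")


if __name__ == "__main__":
    with requests.Session() as session:
        # A real service would authenticate against website A here.
        while True:
            title, score = top_item_score(session)
            if score > THRESHOLD:
                notify(title, score)
                break          # then hand the item off to website B's input form
            time.sleep(POLL_SECONDS)
```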

4 comments

teapot01, over 8 years ago
My only suggestion is you should find a problem and build a solution. It sounds like you have a solution and you want to man-handle it onto a problem.
Jugurtha, over 8 years ago
This is all great to write down as a vision of what it could eventually become, but I think it has to be chunked into more manageable units first. I mean, whether you have the individual skills required or not, it will start with requesting a page and inevitably forgetting to show Unicode the respect it deserves.

That's what I'm doing to learn:

- See how a page is structured (does it use schema.org's stuff?): its fields, URL pattern, sitemap, resource URLs, selectors, etc.
- Fetch a page.
- Parse it and extract data.
- Save the data.
- Rinse, repeat.

I'm also learning about D3 and many cool things.

Check out http://atlas.media.mit.edu/en/profile/country/usa/

As to your target audience, I think you'll probably serve the Many-Faced God. In a gold rush, people who sell shovels make a good living. From your description, you probably want to make something that helps with decision making or triggers buying according to price.
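The fetch / parse / extract / save loop in that comment fits in a short script. A sketch along those lines, with a hypothetical seed URL and selectors; requests and BeautifulSoup are one common choice, not something the comment prescribes:

```python
import json

import requests
from bs4 import BeautifulSoup

START_URLS = ["https://example.com/articles"]   # hypothetical seed pages


def fetch(url: str) -> str:
    """Request a page; resp.text lets requests handle the Unicode decoding."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.text


def extract(html: str) -> list[dict]:
    """Parse the page and pull out the fields of interest (hypothetical selectors)."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {
            "title": node.select_one("h2").get_text(strip=True),
            "url": node.select_one("a")["href"],
        }
        for node in soup.select("article")
    ]


def save(records: list[dict], path: str = "items.jsonl") -> None:
    """Append records as JSON lines; a database would replace this later."""
    with open(path, "a", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Rinse, repeat over whatever URL pattern the sitemap yields.
    for url in START_URLS:
        save(extract(fetch(url)))
```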
cblock811, over 8 years ago
I used to work for a company, Zillabyte, that had a lot of web data. Mostly marketers and sales people were looking for lead generation. Let's say you work for Mixpanel, and you want to know all the websites out there with Kissmetrics installed. Even better, a report every month showing who uninstalled their analytics tools. Others looked for more general signals than javascript snippets (weak text analysis), but that was the bulk of what people asked me for.
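That kind of "who has tool X installed" signal usually comes down to checking fetched HTML for a known script host. A rough sketch under that assumption follows; the host strings and site list are illustrative, and real snippet detection needs proper fingerprints rather than a substring check. Comparing one month's results with the previous month's would give the "who uninstalled" report mentioned above.

```python
import requests

# Illustrative analytics script hosts; a real crawler would maintain
# vetted fingerprints per tool, not this hard-coded map.
TRACKER_HOSTS = {
    "Kissmetrics": ["kissmetrics.com"],
    "Mixpanel": ["cdn.mxpnl.com"],
}


def detect_trackers(url: str) -> list[str]:
    """Fetch a page and report which known analytics hosts appear in its HTML."""
    html = requests.get(url, timeout=10).text
    return [
        name
        for name, hosts in TRACKER_HOSTS.items()
        if any(host in html for host in hosts)
    ]


if __name__ == "__main__":
    # Hypothetical lead list to scan.
    for site in ["https://example.com", "https://example.org"]:
        print(site, detect_trackers(site))
```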
tyingq, over 8 years ago
80legs has a customer page with a few short blurbs about why each specific customer is crawling: http://www.80legs.com/our-customers.html

Competitive analysis is, I suspect, one of the more popular reasons, and there's not much public info on that... for obvious reasons.