科技回声

I've been web scraping for a while and I am running out of ideas. I use an anonymous crawling service that provides HTML content at scale. It also has a set of predefined scrapers, which I use instead of maintaining my own, and that speeds up any scraping I do.

I need new ideas for things to build that are unique and useful to people. I build services based on scraping that can be used in different fields, like marketing, SEO, drop shipping, etc. In many cases I need JS crawling capabilities, and the service I use gives me those widgets and rendered pages, so I can focus on the data and the idea.

Knowing that you have the resources I have, what would you be looking to scrape today? I like services that help people, so any feedback would be great. I am trying to think outside the box and find new ideas, and would appreciate some inspiration here.

I'm also open to hearing what data you would scrape from the web in real time, if you had the right tools to scale your scraping.

For 2020 I'd like to get the web browser out of my life as much as possible. Motivated by this work I've done:

<a href="https://ontology2.com/essays/HackerNewsForHackers/" rel="nofollow">https://ontology2.com/essays/HackerNewsForHackers/</a>
<a href="https://ontology2.com/essays/ClassifyingHackerNewsArticles/" rel="nofollow">https://ontology2.com/essays/ClassifyingHackerNewsArticles/</a>

I'd like to crawl a large number of sites that have quality articles, for instance

<a href="https://voxeu.org/" rel="nofollow">https://voxeu.org/</a>
<a href="https://www.anandtech.com/" rel="nofollow">https://www.anandtech.com/</a>

and put them through a workflow where I never see an article more than once, things get classified, etc.

One major issue I have is ads. In 2020 it is not just a matter of ads getting in the way of content, but rather ads getting in the way of ads. That voxeu site doesn't have ads, but it does abuse JavaScript in such a way that the back button really works wrong.

The web is breaking down to the extent that I'd really like to filter the junk out and have an order-of-magnitude better interface.
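The "never see an article more than once" part of that workflow could be sketched roughly like this in Python. This is only an illustration, not the commenter's actual setup: `normalize_url`, `SeenStore`, and `first_time` are hypothetical names, the normalization rules are an assumption, and the classification step is left out entirely.

```python
import hashlib
import sqlite3
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and any trailing slash.

    Assumed normalization rules; a real crawler may need more (e.g. tracking params).
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class SeenStore:
    """Hypothetical helper that remembers which article URLs were already shown."""

    def __init__(self, db_path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen (key TEXT PRIMARY KEY)"
        )

    def first_time(self, url: str) -> bool:
        """Record the URL and return True only if it was not recorded before."""
        key = hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO seen (key) VALUES (?)", (key,)
        )
        self.conn.commit()
        # rowcount is 1 when the row was inserted, 0 when it already existed.
        return cur.rowcount == 1
```

A crawler would call `first_time(url)` on each discovered article and only pass it on to the reader (or a classifier) when it returns `True`. Passing a file path instead of `":memory:"` keeps the history across runs.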

I'm toying with scraping a certain type of product listing and running classifiers to help users find the product they want.

It sounds like you're building a scraping service and looking for clients.

Ask HN: What info would you web scrape in 2020?

2 comments
