
Web scraping with Python open knowledge

219 points by PigiVinci83, about 3 years ago

13 comments

wraptile, about 3 years ago
I recently joined a brilliant web scraping API company called ScrapFly, who provided me with the resources to create lots of the open knowledge on web scraping that I always wanted to create!

So, to add to this list, here are my top 3 favorite articles that could expand OP's document:

0 - central guide to avoiding blocking - this one is a tough one because there's so much information: request headers, HTTP versions, TLS fingerprinting, JavaScript fingerprinting, etc. I spent almost a month working on these and it was an amazing research experience that I could never have afforded myself before.

1, 2 - XPath and CSS selector introduction articles, where I built a widget into the article that lets you test all CSS and XPath selectors right there in the learning material.

3 - introduction to reverse engineering - a quick introduction to using browser devtools for web scraping: how to inspect the network traffic and replicate it in your program. This is where I point all beginners, as understanding the browser really helps to understand web scraping!

0 - https://scrapfly.io/blog/how-to-scrape-without-getting-blocked-tutorial/
1 - https://scrapfly.io/blog/parsing-html-with-xpath/
2 - https://scrapfly.io/blog/parsing-html-with-css/
3 - https://scrapecrow.com/reverse-engineering-intro.html
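For readers who have not seen a selector in use, here is a minimal sketch of the idea behind the XPath and CSS articles mentioned above. It is not taken from those articles. It uses only Python's standard library, whose ElementTree module supports a limited XPath subset and needs well-formed markup; the sample markup and the `product_links` helper are made up for illustration. Real pages are rarely well-formed, so in practice a third-party parser such as lxml or parsel would likely be used instead.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup (an assumption for this sketch).
HTML = """
<html><body>
  <div class="product"><a href="/item/1">First</a></div>
  <div class="product"><a href="/item/2">Second</a></div>
  <div class="ad"><a href="/promo">Promo</a></div>
</body></html>
"""


def product_links(markup):
    """Return (href, text) pairs for links directly inside product divs."""
    root = ET.fromstring(markup.strip())
    # XPath: any <div> whose class attribute is exactly "product", then its <a> children.
    # A roughly equivalent CSS selector would be: div.product > a
    links = root.findall(".//div[@class='product']/a")
    return [(a.get("href"), a.text) for a in links]


print(product_links(HTML))
```

One difference worth knowing: the `[@class='product']` predicate compares the whole attribute string, while the CSS `.product` form matches one class among several.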
PigiVinci83, about 3 years ago
A work-in-progress guide to web scraping in Python, anti-bot software and techniques, and so on. Please feel free to share it and to contribute your own experience too.
tatoalo, about 3 years ago
I recently transitioned one of my scraping projects away from Selenium to Playwright, and I must say that the developer experience is way better, in my opinion.

I also set up my pipeline to send me a Telegram message with the debug trace whenever an error occurs, so that I have the entire scraping flow to analyze. That's pretty neat.
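The error alert described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not tatoalo's actual setup: the `report_errors` decorator and `send_telegram` helper are hypothetical names, and the bot token and chat id would have to come from your own Telegram bot. The only external call is the Telegram Bot API `sendMessage` method.

```python
import traceback
import urllib.parse
import urllib.request
from functools import wraps

TELEGRAM_LIMIT = 4096  # Telegram caps a text message at 4096 characters


def send_telegram(token, chat_id, text):
    """POST a message through the Telegram Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    with urllib.request.urlopen(url, data=data, timeout=10) as resp:
        return resp.status


def report_errors(notify):
    """Decorator: on an exception, pass the formatted traceback to notify, then re-raise."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                trace = traceback.format_exc()
                # Keep the tail of the message: the failing frame is at the end.
                notify(f"{func.__name__} failed:\n{trace}"[-TELEGRAM_LIMIT:])
                raise
        return wrapper
    return decorator
```

A pipeline step would be wrapped with something like `@report_errors(lambda text: send_telegram(TOKEN, CHAT_ID, text))`, where `TOKEN` and `CHAT_ID` are placeholders. Passing the notifier in as a plain callable keeps the decorator testable without touching the network.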
Xeoncross, about 3 years ago
Plug for https://commoncrawl.org/ if you need billions of pages but don't want to deal with scraping the web yourself.
afandian, about 3 years ago
If you're the kind of person who wants "open data" (read as broadly as you like) and could get it in snapshots direct from the source without having to scrape, what would your ideal format be?

I know it's a very open-ended question.
captn3m0, about 3 years ago
Good list, but I'm confused about the “tabs weighing less” bit. Isn't that a preference left to the end-devs?

Another tip I've found is to check whether the data is accessible in a mobile app, and to proxy the app's traffic to see if there is a JSON API available.
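Once a proxy capture reveals such a JSON endpoint, replaying the call from a script might look like the sketch below. Everything specific here is an assumption for illustration: the URL, the header values, and the `build_request` and `fetch_json` helpers are hypothetical stand-ins for whatever the capture actually shows.

```python
import json
import urllib.request

# Hypothetical endpoint and headers, standing in for what a proxy capture shows.
API_URL = "https://api.example.com/v1/items?page=1"
CAPTURED_HEADERS = {
    "Accept": "application/json",
    "User-Agent": "ExampleApp/1.0 (Android 12)",
}


def build_request(url, headers):
    """Recreate the captured call as a urllib Request object."""
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_json(url, headers, timeout=10):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(url, headers), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_json(API_URL, CAPTURED_HEADERS)` against a real endpoint would return the decoded payload as ordinary Python dicts and lists, with no HTML parsing involved.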
account-5, about 3 years ago
I was reading another thread about web scraping where someone mentioned CSS selectors being way quicker than XPath. I'm easy either way, but apart from a more powerful syntax, what other benefits are there?
loudthing, about 3 years ago
It can be frustrating learning web scraping with Python when so many sites actively block scraping.
jonatron, about 3 years ago
As the second sentence says, it's a cat and mouse game, so there's no incentive on either side of bot vs anti-bot to share information.
holografix, about 3 years ago
For websites that require auth via Google Auth, this is a non-starter. There's no way to bypass its bot detection.
lapser, about 3 years ago
Is there a FLOSS project that combines Scrapy scrapers and just makes the results publicly available?
sgtquack, about 3 years ago
As someone who recently dealt with scraping sites behind Cloudflare... I never want to scrape again.
rustdeveloper, about 3 years ago
I think that if you don't want to invest a lot of time in learning web scraping, or money in a pool of residential (or, even better, mobile) proxies, it's easy to get good results quickly with a web scraping API like https://scrapingfish.com. They have good blog posts, for example on how to scrape public Instagram profiles: https://scrapingfish.com/blog/scraping-instagram