科技回声

13 comments

wraptile, about 3 years ago

I recently joined a brilliant web scraping API company called ScrapFly, which gave me the resources to create lots of the open knowledge on web scraping that I always wanted to create!

So, to add to this list, here are my top 3 favorite articles that could expand OP's document:

2 - central guide to avoiding blocking. This one is a tough one because there's so much information: request headers, HTTP versions, TLS fingerprinting, JavaScript fingerprinting, etc. I spent almost a month working on these, and it was an amazing research experience that I could never have afforded on my own before.

0, 1 - CSS and XPath selector introduction articles, where I built a widget into the article that lets you test all CSS and XPath selectors right there in the learning material.

3 - introduction to reverse engineering. A quick introduction to using browser devtools for web scraping: how to inspect the network and replicate it in your program. This is where I point all beginners, as understanding the browser really helps to understand web scraping!

0 - https://scrapfly.io/blog/parsing-html-with-css/
1 - https://scrapfly.io/blog/parsing-html-with-xpath/
2 - https://scrapfly.io/blog/how-to-scrape-without-getting-blocked-tutorial/
3 - https://scrapecrow.com/reverse-engineering-intro.html
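As a rough illustration of the "inspect the network and replicate it in your program" step, here is a minimal sketch that rebuilds a request with browser-like headers using only the Python standard library. The header values and the `build_request` and `fetch` helpers are assumptions for illustration, not taken from the linked articles. In practice you would copy the values from the request shown in the devtools Network tab. Matching headers alone does nothing about TLS or JavaScript fingerprinting.

```python
import urllib.request
from typing import Dict, Optional

# Hypothetical header values; copy the real ones from the request
# shown in your browser's devtools Network tab.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url: str, extra_headers: Optional[Dict[str, str]] = None) -> urllib.request.Request:
    """Build a GET request that carries browser-like headers."""
    headers = {**BROWSER_HEADERS, **(extra_headers or {})}
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch(url: str, timeout: float = 10.0) -> str:
    """Send the request and return the decoded body (performs network I/O)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Keeping `build_request` separate from `fetch` makes it easy to print the request and compare it with what the browser actually sent before any traffic goes out.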

PigiVinci83, about 3 years ago

A work-in-progress guide to web scraping in Python, anti-bot software and techniques, and so on. Please feel free to share it and contribute your own experience too.

tatoalo, about 3 years ago

I recently transitioned one of my scraping projects from Selenium to Playwright, and I must say that the developer experience is way better, in my opinion.

I also set things up so that I receive a Telegram message with the debug trace whenever my pipeline hits an error, so that I have the entire scraping flow to analyze. That’s pretty neat.
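A minimal sketch of that kind of error notification, using only the standard library. It treats "debug trace" as a Python traceback, which is an assumption; a Playwright trace file would need a file upload instead. The helper names `build_telegram_request`, `send_telegram` and `notify_on_error` are hypothetical, and the bot token and chat id are placeholders you would supply yourself.

```python
import functools
import traceback
import urllib.parse
import urllib.request

TELEGRAM_LIMIT = 4096  # Bot API limit on message text length


def build_telegram_request(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build a Bot API sendMessage request; nothing is sent here."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    # Keep the tail of the text: the end of a traceback names the error.
    payload = urllib.parse.urlencode({"chat_id": chat_id, "text": text[-TELEGRAM_LIMIT:]})
    return urllib.request.Request(url, data=payload.encode("utf-8"), method="POST")


def send_telegram(token: str, chat_id: str, text: str) -> int:
    """Send the message and return the HTTP status (performs network I/O)."""
    request = build_telegram_request(token, chat_id, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def notify_on_error(notify):
    """Decorator: pass the formatted traceback to notify(), then re-raise."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                notify(f"{func.__name__} failed:\n{traceback.format_exc()}")
                raise
        return wrapper
    return decorator
```

A pipeline step could then be wrapped with `@notify_on_error(lambda text: send_telegram(TOKEN, CHAT_ID, text))`, where `TOKEN` and `CHAT_ID` are your own values. The exception is re-raised so the pipeline still fails visibly.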

Xeoncross, about 3 years ago

Plug for https://commoncrawl.org/ if you need billions of pages but don't want to deal with scraping the web yourself.

afandian, about 3 years ago

If you're the kind of person who wants "open data" (read as broadly as you like) and could get it in snapshots direct from the source without having to scrape, what would your ideal format be?

I know it's a very open-ended question.

captn3m0, about 3 years ago

Good list, but I'm confused about the “tabs weighing less” bit. Isn’t that a preference left to the end-devs?

Another tip I’ve found is to check whether the data is accessible in a mobile app, and to proxy the app's traffic to see if there is a JSON API available.
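To show why a JSON API is worth looking for, here is a small sketch that turns a captured response body into records with no HTML parsing at all. The payload shape and the `extract_products` helper are made up for illustration; a real app's API will look different.

```python
import json

# Hypothetical response body, shaped like what a proxy might capture
# from a mobile app's API call; real payloads will differ.
CAPTURED_BODY = """
{"items": [
  {"id": 1, "name": "Widget", "price": {"amount": "9.99", "currency": "USD"}},
  {"id": 2, "name": "Gadget", "price": {"amount": "24.50", "currency": "USD"}}
], "next_page": 2}
"""


def extract_products(body: str) -> list:
    """Flatten the hypothetical payload into (name, price) tuples."""
    data = json.loads(body)
    return [(item["name"], float(item["price"]["amount"])) for item in data["items"]]
```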

account-5, about 3 years ago

I was reading another thread about web scraping where someone mentioned CSS selectors being way quicker than XPath. I'm easy either way, but apart from a more powerful syntax, what other benefits are there?
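One concrete case of the "more powerful syntax": XPath can select an element by the text of a neighbouring element, which standard CSS selectors cannot do. The sketch below uses the limited XPath subset in Python's built-in `xml.etree.ElementTree` on a made-up, well-formed snippet. Real pages usually need a proper HTML parser with full XPath support, and `value_for` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup; real pages usually need an HTML parser.
HTML = """
<table>
  <tr><th>Brand</th><td>Acme</td></tr>
  <tr><th>Price</th><td>19.99</td></tr>
  <tr><th>Stock</th><td>12</td></tr>
</table>
"""

root = ET.fromstring(HTML)


def value_for(label: str) -> str:
    """Return the <td> text in the row whose <th> text equals label."""
    # [th='...'] keeps only rows that have a <th> child with exactly that text.
    cell = root.find(f".//tr[th='{label}']/td")
    if cell is None:
        raise KeyError(label)
    return cell.text
```

With CSS you would typically select every row and filter on the header text in Python instead.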

loudthing, about 3 years ago

It can be frustrating learning web scraping with Python when so many sites actively block scraping.

jonatron, about 3 years ago

As the second sentence says, it's a cat and mouse game, so there's no incentive on either side of bot vs anti-bot to share information.

holografix, about 3 years ago

For websites that require auth via Google Auth, this is a non-starter. There’s no way to bypass its bot detection.

lapser, about 3 years ago

Is there a FLOSS project that combines Scrapy scrapers and just makes the results publicly available?

sgtquack, about 3 years ago

As someone who recently dealt with scraping sites behind Cloudflare... I never want to scrape again.

rustdeveloper, about 3 years ago

I think that if you don’t want to invest a lot of time in learning web scraping, and money in a pool of residential (or, even better, mobile) proxies, it’s easy to quickly get good results with a web scraping API like https://scrapingfish.com. They have good blog posts, for example on how to scrape public Instagram profiles: https://scrapingfish.com/blog/scraping-instagram

Web scraping with Python open knowledge
