科技回声

9 comments

simonw, nearly 3 years ago

A trick I think would be useful to include here is running scrapers in GitHub Actions that write their results back to the repository.

This is free(!) to host, and the commit log gives an enormous amount of detail about how the scraped resource changed over time.

I wrote more about this trick here: https://simonwillison.net/2020/Oct/9/git-scraping/

Here are 267 repos that are using it: https://github.com/topics/git-scraping?o=desc&s=updated
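A minimal sketch of the scraper side of this trick, assuming the scraped data is already available as a Python dict. The function name `write_snapshot` and the file name are hypothetical, and the scheduled workflow that runs the script and commits the file is not shown. Writing the JSON with sorted keys and fixed indentation keeps each commit's diff limited to what actually changed.

```python
import json
from pathlib import Path


def write_snapshot(data, path):
    """Write data as stable JSON; return True if the file content changed.

    Hypothetical helper: a workflow could commit only when this returns True.
    """
    path = Path(path)
    # Sorted keys and fixed indentation keep diffs small and meaningful.
    new_text = json.dumps(data, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
    old_text = path.read_text(encoding="utf-8") if path.exists() else None
    if new_text == old_text:
        return False
    path.write_text(new_text, encoding="utf-8")
    return True


if __name__ == "__main__":
    # Placeholder data standing in for whatever the scraper fetched.
    changed = write_snapshot({"status": "ok", "items": [1, 2, 3]}, "data.json")
    print("changed" if changed else "unchanged")
```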

quyleanh, nearly 3 years ago

I usually write small scraping scripts for my daily life:
- Getting all of a comic's images and converting them to an e-book for my Kindle.
- Surveying information for buying a new house.
- Helping my wife collect data for her new writing.
- Transferring all my Facebook fan page posts to my personal blog.

And I have enjoyed my journey in scraping things to make my life easier and full of joy.

helsinki, nearly 3 years ago

Hi Sam,

It might be worth adding a section on distributed anonymous scrapers that use some form of messaging middleware to distribute the URLs to scrape. Regarding the anonymous aspect (independent of job distribution, of course), you could walk them through using https://github.com/aaronsw/pytorctl or even a rotating tor proxy. This is how I scraped all those Instagram locations + metadata we discussed about five years ago. Hope you’re doing well!

1vuio0pswjnm7, nearly 3 years ago

This is from 2020. Besides a small change to the "Introduction to the Command Line" section, it has not been updated.

Back in 2015, the author reported using CasperJS to scrape public LinkedIn profiles. The author reported this was a PITA.

Here the author recommends using WebDriver implementations, e.g., chromedriver or geckodriver, in addition to scripting language frameworks such as Puppeteer and Selenium. Is scraping LinkedIn still a PITA?

Because the examples given are always relatively simple, i.e., not LinkedIn, I am skeptical when I see "web scraping" tutorials using Python frameworks and cURL as the only recommended option for automated public data/information retrieval from the www.[FN1,2] I use none of the above. For small tasks like the examples given in these tutorials, the approaches I use are not nearly as sophisticated/complicated and yet they are faster and use fewer resources than using Python and/or cURL. They are also easier to update if something changes. That is in part because (1) the binaries I use are smaller, (2) I do not rely on scripting languages[FN3] and third party libraries (and so much less code is involved), (3) the programs I use start working immediately whereas Python takes seconds to start up and (4) compared to the programs I use, cURL as a means of sending HTTP requests is inflexible, e.g., one is limited to what "options" cURL provides and cURL has no option for HTTP/1.1 pipelining.

1. LinkedIn's so-called "technological measures" to prevent retrieval of public information have failed. Similarly, its attempts to prevent retrieval of public information through intimidation, e.g., cease-and-desist letters and threats of CFAA claims, have failed. Tutorials on "web scraping" that extol Python frameworks should use LinkedIn as an example instead of trivial examples for which using Python is, IMHO, overkill.

2. What would be more interesting is a Rosetta Code for "web scraping" tasks. There are many, many ways to do public data/information retrieval from the www. Using scripting languages such as Python, Ruby, NodeJS, etc. and frameworks is one way. That approach may be ideally suited for large scale jobs, like those undertaken by what the author calls "internet companies". But for smaller tasks undertaken by individual www users for noncommercial purposes, e.g., this author's concept of "scrapism", there are also faster, less complicated and more efficient options.

3. Other than the Almquist shell

Labo333, nearly 3 years ago

Good guide!

The "Scraping XHR" section [1] explains how to inspect network requests and reproduce them with Python. I actually built har2requests [2] to automate that process!

[1]: https://scrapism.lav.io/scraping-xhr/
[2]: https://github.com/louisabraham/har2requests
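For readers unfamiliar with the idea, here is a rough sketch of reproducing a request seen in the browser's network tab, using only the Python standard library. It is not taken from the guide and is not the output of har2requests; the endpoint URL, query parameters, and the helper names `build_xhr_request` and `fetch_json` are all hypothetical, and the headers are typical examples of what one might copy from the browser.

```python
import json
import urllib.parse
import urllib.request


def build_xhr_request(base_url, params):
    """Build a GET request that mimics a browser XHR call (hypothetical helper)."""
    query = urllib.parse.urlencode(params)
    return urllib.request.Request(
        f"{base_url}?{query}",
        headers={
            # Example headers of the kind copied from the network tab.
            "Accept": "application/json",
            "X-Requested-With": "XMLHttpRequest",
        },
    )


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder endpoint; replace with the URL observed in the browser.
    request = build_xhr_request("https://example.com/api/items", {"page": 2})
    print(request.full_url)
```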

saaaam, nearly 3 years ago

Hi! This is a guide that I started during the pandemic but never quite finished. I’m in the process of re-writing/re-recording some parts of it to bring it back up to date, and adding in the bits that are still missing.

CWuestefeld, nearly 3 years ago

I'm bothered that this doesn't mention any of the ethics involved, such as checking the robots.txt file and so forth.

More than half of my traffic is from bots, so I'm paying something like half my operational expenses to support them. And we've had to do a lot of work to mitigate what would otherwise be DoS attacks from badly written (or badly intended!) bots. I think that at least a tip of the hat to avoiding damage would be appropriate in a piece like this.
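As an illustration of the robots.txt check mentioned above, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the user agent name `my-scraper`, and the URLs are made up for the example; a real scraper would load the site's actual file (for instance with `set_url` and `read`) and also pause between requests.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, used here so the example runs offline.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(EXAMPLE_ROBOTS)

# Check each URL before fetching it.
for url in ("https://example.com/private/page", "https://example.com/public/page"):
    allowed = parser.can_fetch("my-scraper", url)
    print(url, "allowed" if allowed else "disallowed")

# If the site asks for a delay between requests, honor it.
print("crawl delay:", parser.crawl_delay("my-scraper"))
```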

rpastuszak, nearly 3 years ago

Tangentially related question:

Is Python still the most common tool used for web scraping and, if so, what's the advantage over jsdom/cheerio or, say, a headless browser based tool like puppeteer?

I've been using these tools for years, but I grew up in the JS world, so I'd be curious to hear from people with different backgrounds/biases than mine :)

fjallstrom, nearly 3 years ago

Was happy to find that the person behind it is Sam Lavigne, one of the people behind Stupid Hackathon.

Scrapism
