I'd like to help a friend learn more about web scraping (and also test automation, but that is less fun). Are you aware of any tutorial, competition, or anything in between that has tasks of varying difficulty?<p>E.g. easy: iterate over the article IDs and call curl on each one. Difficult: you need Puppeteer with multiple JS tricks to get through the first few pages, and the end is far away...
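For the easy tier, a minimal sketch of "iterate over the IDs and fetch each one" could look like this in Python. The URL pattern and the names `URL_TEMPLATE`, `fetch`, and `scrape_ids` are made up for illustration; swap in whatever the target site actually uses. The fetcher is passed in as a parameter so the loop can be tried without touching the network.

```python
from typing import Callable, Dict, Iterable
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# Hypothetical URL pattern; replace with the real site's article URL scheme.
URL_TEMPLATE = "https://example.com/articles/{id}"


def fetch(url: str) -> str:
    """Download one page, roughly what `curl <url>` would do."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def scrape_ids(
    ids: Iterable[int],
    fetch_page: Callable[[str], str] = fetch,
) -> Dict[int, str]:
    """Fetch every article ID, skipping IDs whose request fails."""
    pages: Dict[int, str] = {}
    for article_id in ids:
        url = URL_TEMPLATE.format(id=article_id)
        try:
            pages[article_id] = fetch_page(url)
        except (HTTPError, URLError):
            continue  # deleted or missing article: move on to the next ID
    return pages
```

Calling `scrape_ids(range(1, 101))` would walk IDs 1 to 100 with the real fetcher. A polite version would also sleep between requests and respect robots.txt.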
Surprised nobody has mentioned ASP websites yet - definitely among the hardest. Those sites carry so much state in cookies rather than URLs that you have to follow all the UI interactions in order to get to the result you're trying to parse. The markup is also typically really bloated and filled with randomly generated IDs.
Easy: find a random WordPress blog. Crawl by category, author, or page.<p>Medium: Scrape Yelp.<p>Hard: Scrape Yelp and exclude all the randomly generated garbage data, false phone numbers, and incorrect hours they serve once they detect you're a bot and start feeding you bad data instead of blocking you.<p>Hard, expensive: Purchase a pair of limited edition sneakers requiring 3D Secure and 2FA.
An easy challenge that is also very fruitful is “scraping” RSS feeds. A lot of good information is provided by RSS, and the challenge could be to aggregate and filter some RSS feeds, then generate a new one.
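A rough sketch of that aggregate, filter, and regenerate idea, using only the Python standard library. It assumes plain RSS 2.0 input (no Atom, no namespaces), filters on a keyword in the item title, and de-duplicates by link; `merge_feeds` and its parameters are names invented here. The feeds are passed in as XML strings, so downloading them is left out.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def merge_feeds(
    feed_xmls: Iterable[str],
    keyword: str,
    title: str = "Filtered feed",
) -> str:
    """Merge RSS 2.0 documents, keeping items whose title mentions `keyword`."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title

    seen_links = set()
    for xml_text in feed_xmls:
        root = ET.fromstring(xml_text)
        for item in root.iter("item"):
            item_title = item.findtext("title") or ""
            link = item.findtext("link") or ""
            if keyword.lower() not in item_title.lower():
                continue  # filtered out: keyword not in the title
            if link and link in seen_links:
                continue  # same story syndicated by two feeds
            seen_links.add(link)
            channel.append(item)

    return ET.tostring(rss, encoding="unicode")
```

A feed meant for real readers would also want the channel-level `link` and `description` elements, and probably sorting by `pubDate`; both are skipped here to keep the sketch short.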
Medium: scrape Craigslist, create a database with your results, and graph out prices.<p>Then link up reposts to track price history.<p>Medium-hard: use image recognition to find reused images.
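For the database and repost-linking part, here is one possible sketch with `sqlite3`. The schema, the function names, and the "fingerprint" heuristic (normalize the title and sort its words) are all assumptions made for illustration; real reposts often change the title, so a serious version would need fuzzier matching or the image comparison mentioned above.

```python
import sqlite3
from typing import List, Tuple


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) listings table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS listings (
               post_id     TEXT PRIMARY KEY,
               fingerprint TEXT NOT NULL,
               title       TEXT NOT NULL,
               price       REAL NOT NULL,
               seen_on     TEXT NOT NULL  -- ISO date, so text order is date order
           )"""
    )
    return conn


def fingerprint(title: str) -> str:
    """Crude repost key: lowercase, drop punctuation, sort the words."""
    words = "".join(c.lower() if c.isalnum() else " " for c in title).split()
    return " ".join(sorted(words))


def save_listing(
    conn: sqlite3.Connection,
    post_id: str,
    title: str,
    price: float,
    seen_on: str,
) -> None:
    """Store one scraped listing; a post ID already stored is ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO listings VALUES (?, ?, ?, ?, ?)",
        (post_id, fingerprint(title), title, price, seen_on),
    )


def price_history(conn: sqlite3.Connection, title: str) -> List[Tuple[str, float]]:
    """Return (date, price) pairs for every listing that looks like a repost."""
    rows = conn.execute(
        "SELECT seen_on, price FROM listings WHERE fingerprint = ? ORDER BY seen_on",
        (fingerprint(title),),
    )
    return rows.fetchall()
```

The `(date, price)` pairs that come back from `price_history` are in date order, so they can go straight into whatever plotting library you like for the price graph.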
Use web scraping to buy a PS5.