Hey HN,

This is Jan, founder of Apify, a web scraping and automation platform. Drawing on our team's years of experience, today we're launching Crawlee [1], the web scraping and browser automation library for Node.js that's designed for the fastest development and maximum reliability in production.

For details, see the short video [2] or read the announcement blog post [3].

Main features:

- Supports headless browsers with Playwright or Puppeteer

- Supports raw HTTP crawling with Cheerio or JSDOM

- Automated parallelization and scaling of crawlers for best performance

- Avoids blocking using smart sessions, proxies, and browser fingerprints

- Simple management and persistence of queues of URLs to crawl

- Written completely in TypeScript for type safety and code autocompletion

- Comprehensive documentation, code examples, and tutorials

- Actively maintained and developed by Apify (we use it ourselves!)

- Lively community on Discord

To get started, visit https://crawlee.dev or run the following command: npx crawlee create my-crawler

If you have any questions or comments, our team will be happy to answer them here.

[1] https://crawlee.dev/

[2] https://www.youtube.com/watch?v=g1Ll9OlFwEQ

[3] https://blog.apify.com/announcing-crawlee-the-web-scraping-and-browser-automation-library/
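To give a quick taste, here is roughly what a minimal Cheerio-based crawler looks like (a sketch based on the quick-start in the docs; check crawlee.dev for the exact current API):

    import { CheerioCrawler, Dataset } from 'crawlee';

    const crawler = new CheerioCrawler({
        // Called once per page; `$` is the parsed Cheerio document.
        async requestHandler({ request, $, enqueueLinks, log }) {
            const title = $('title').text();
            log.info(`${title} (${request.url})`);
            await Dataset.pushData({ url: request.url, title });
            // Queue same-site links discovered on this page.
            await enqueueLinks();
        },
        maxRequestsPerCrawl: 50,
    });

    await crawler.run(['https://crawlee.dev']);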
Looks like you took the good ideas from Scrapy's crawling engine and combined them with a great scraping API, which is all I ever wanted in a bot framework!

I'm especially excited about the unified API for browser and HTML scraping, which is something I've had to hack on top of Scrapy in the past, and it really wasn't a good experience (sketch of what the unified version looks like below). That, along with puppeteer-heap-snapshot, will make the common case of "we need this to run NOW, you can rewrite it later" so much easier to handle.

While I'm not particularly happy to see JavaScript taking over yet another field, as it truly is an awful language, more choice is always better, and this project looks valuable enough to make dealing with JS a worthwhile tradeoff.
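To illustrate what that unified API buys you: going by the docs, switching from raw HTTP to a real browser is (roughly) just a matter of swapping the crawler class while the handler keeps the same shape:

    import { PlaywrightCrawler } from 'crawlee';

    // Same handler shape as CheerioCrawler, but `page` is a live
    // Playwright page, so JS-rendered content is available.
    const crawler = new PlaywrightCrawler({
        async requestHandler({ request, page, enqueueLinks }) {
            const title = await page.title();
            console.log(`${title} (${request.url})`);
            await enqueueLinks();
        },
    });

    await crawler.run(['https://crawlee.dev']);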
Hi! It looks really REALLY cool!

Is there any kind of detection/stealthiness benchmark compared to libraries such as puppeteer-stealth or fakebrowser?

Honestly, no matter how feature-complete and powerful a scraping tool is, the main "selling point" for me will always be stealthiness/human-like behavior, even if the dev experience is crappy. (And IMHO that's the same for most serious scrapers/bot makers.)

Will it always be free, or could it turn into a product/paid SaaS? (Kind of like browserless.) I'm wondering whether it's worth learning if the next cool features are going to be for paying users only.

Is this something that you use internally, or is it just a way to promote your paid products?

Thanks :)
This looks cool at first glance. I'll dig into it more.

One note that may be helpful: if all you care about is the HTML, it's better to take a "snapshot" of the page by streaming the response directly to blob storage like S3. That way, if something fails and you need to retry, you can reference the saved raw data from storage instead of making another request and potentially getting blocked. Node pipelines make it really easy to chain this together with other logic.

For reference, I run a company that does large-scale scraping / data aggregation.
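A rough sketch of that pattern (bucket/key names are placeholders; assumes got and the AWS SDK v3):

    import { PassThrough } from 'node:stream';
    import { pipeline } from 'node:stream/promises';
    import got from 'got';
    import { S3Client } from '@aws-sdk/client-s3';
    import { Upload } from '@aws-sdk/lib-storage';

    const s3 = new S3Client({});

    // Stream the raw response straight into S3 without buffering it all in memory.
    async function snapshotToS3(url: string, bucket: string, key: string) {
        const body = new PassThrough();
        const upload = new Upload({
            client: s3,
            params: { Bucket: bucket, Key: key, Body: body },
        });
        await Promise.all([
            pipeline(got.stream(url), body), // HTTP response -> PassThrough
            upload.done(),                   // PassThrough -> S3 multipart upload
        ]);
    }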
I see you basically recommend bypassing rate limits by using proxies, etc.?

Why not just respect rate limits if they're set properly? A little bit of consideration for whatever/whoever is on the other end ;)
This looks really neat. I love the idea of a single API for both traditional and headless scraping.

In my experience, headless scraping is on the order of 10-100x slower and significantly more resource-intensive, even if you carefully block requests for images/ads/etc. (one way to do that is sketched below).

You should always start with traditional scraping, try as hard as you can to stick with it, and only move to headless if absolutely necessary. Sometimes, even if it takes 10x more "requests" to scrape traditionally, it's still faster than headless.
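For anyone curious, request blocking in plain Playwright looks roughly like this (a sketch, not Crawlee's own helper; Crawlee may ship a built-in for it):

    import { chromium } from 'playwright';

    const browser = await chromium.launch();
    const page = await browser.newPage();

    // Abort requests for heavy static assets before they hit the network.
    await page.route('**/*.{png,jpg,jpeg,gif,webp,svg,css,woff,woff2}', (route) =>
        route.abort(),
    );

    await page.goto('https://example.com');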
Jan, thanks for the open approach to running the tech behind Apify!

The libraries look useful. One question that wasn't obvious from the docs: how do you manage / suggest approaching rate limiting by domain? Ideally respecting Crawl-delay in robots.txt, or just defaulting to some sane value. Most naive queue implementations make it challenging, and queue-per-domain feels annoying to manage.
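For context, the kind of thing I mean, as a hypothetical sketch: a shared map of per-host "next slot" times instead of a separate queue per domain:

    // Hypothetical sketch: serialize requests per host with a minimum delay.
    const nextSlot = new Map<string, number>();

    async function politeFetch(url: string, minDelayMs = 2000): Promise<Response> {
        const host = new URL(url).hostname;
        // Reserve the next time slot for this host before awaiting, so
        // concurrent callers line up behind each other.
        const slot = Math.max(nextSlot.get(host) ?? 0, Date.now());
        nextSlot.set(host, slot + minDelayMs);
        const wait = slot - Date.now();
        if (wait > 0) await new Promise((r) => setTimeout(r, wait));
        return fetch(url);
    }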
Sweet that you went the free route and made it an npm package, with an optional upgrade to SaaS; that's the good way to do it. Cool stuff. I could have used this dearly the last time I scraped. Like others, I used mixed methods (headless browser for renders and direct calls) and wrote a lot of error-handling boilerplate.
Looks pretty cool. I'm working on a project that relies on regularly scraping large amounts of data. My codebase uses nodejs, and I'd love to try out a few of the features listed under "Helpful utils and configurability" as they might be able to solve a few pain points I have.
It would be very useful if this or some other library came with Captcha solvers or a way to add Captcha solvers to the scrapers. Even regular users get Captchas sometimes.
Can I use this to log into LinkedIn, run a query on posts and then send me an email of the results? (In theory of course as I am sure this will violate some policy)
Great job! Probably the best toolkit for DIY data extraction at the moment. Shines most "against" super-sophisticated sites. Well done, guys!
This looks really great. However, I can't find examples of how to handle scraping behind a login or a paywall, without having to 'type' credentials every time.
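In case it helps: one common pattern with plain Playwright (not Crawlee-specific; the example.com paths are placeholders) is to log in once, persist the session state, and reuse it on later runs instead of typing credentials each time:

    import { chromium } from 'playwright';

    const browser = await chromium.launch();

    // One-time: perform the login flow, then persist cookies/localStorage.
    const loginContext = await browser.newContext();
    const loginPage = await loginContext.newPage();
    await loginPage.goto('https://example.com/login');
    // ... fill in credentials and submit ...
    await loginContext.storageState({ path: 'auth.json' });

    // Later runs: restore the saved session.
    const context = await browser.newContext({ storageState: 'auth.json' });
    const page = await context.newPage();
    await page.goto('https://example.com/account');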
Cool.

One issue I have w/ webdriving a headless browser in general is host RAM usage per browser/Chromium/Puppeteer instance (e.g. ~600-900 MB) for a single browser/context/page.

Could Crawlee make it easier to run more browser contexts with less RAM usage?

E.g. concurrently running multiple of these (pages requiring JS execution):
<a href="https://crawlee.dev/docs/examples/forms" rel="nofollow">https://crawlee.dev/docs/examples/forms</a>
In a way, I hate you, but at the same time I love you. It's because I'm working on something similar to get data for my product. Seems like I'm going to use Apify instead to save my life.

Just some feedback from the developer point of view, tho: I think the documentation (both Crawlee & Apify) needs some work. It took me a while to get the difference between Crawlee & other headless browser tools like Playwright, etc.