Hey HN,

This is Jan, founder of Apify, a web scraping and automation platform. Drawing on our team's years of experience, today we're launching Crawlee [1], the web scraping and browser automation library for Node.js that's designed for the fastest development and maximum reliability in production.

For details, see the short video [2] or read the announcement blog post [3].

Main features:

- Supports headless browsers with Playwright or Puppeteer

- Supports raw HTTP crawling with Cheerio or JSDOM

- Automated parallelization and scaling of crawlers for best performance

- Avoids blocking using smart sessions, proxies, and browser fingerprints

- Simple management and persistence of queues of URLs to crawl

- Written completely in TypeScript for type safety and code autocompletion

- Comprehensive documentation, code examples, and tutorials

- Actively maintained and developed by Apify (we use it ourselves!)

- Lively community on Discord

To get started, visit https://crawlee.dev or run the following command: npx crawlee create my-crawler

If you have any questions or comments, our team will be happy to answer them here.

[1] https://crawlee.dev/

[2] https://www.youtube.com/watch?v=g1Ll9OlFwEQ

[3] https://blog.apify.com/announcing-crawlee-the-web-scraping-and-browser-automation-library/
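To give a quick taste, here is roughly what a minimal Cheerio-based crawler looks like (a sketch based on the quick-start in the docs; check crawlee.dev for the exact current API):

    import { CheerioCrawler, Dataset } from 'crawlee';

    const crawler = new CheerioCrawler({
        // Called once per page; `$` is the parsed Cheerio document.
        async requestHandler({ request, $, enqueueLinks, log }) {
            const title = $('title').text();
            log.info(`${title} (${request.url})`);
            await Dataset.pushData({ url: request.url, title });
            // Queue same-site links discovered on this page.
            await enqueueLinks();
        },
        maxRequestsPerCrawl: 50,
    });

    await crawler.run(['https://crawlee.dev']);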
Looks like you took the good ideas from Scrapy's crawling engine and combined them with a great scraping API, which is all I ever wanted in a bot framework!

I'm especially excited about the unified API for browser and HTML scraping, which is something I've had to hack on top of Scrapy in the past, and it really wasn't a good experience (sketch of what the unified version looks like below). That, along with puppeteer-heap-snapshot, will make the common case of "we need this to run NOW, you can rewrite it later" so much easier to handle.

While I'm not particularly happy to see JavaScript taking over yet another field, as it truly is an awful language, more choice is always better, and this project looks valuable enough to make dealing with JS a worthwhile tradeoff.
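To illustrate what that unified API buys you: going by the docs, switching from raw HTTP to a real browser is (roughly) just a matter of swapping the crawler class while the handler keeps the same shape:

    import { PlaywrightCrawler } from 'crawlee';

    // Same handler shape as CheerioCrawler, but `page` is a live
    // Playwright page, so JS-rendered content is available.
    const crawler = new PlaywrightCrawler({
        async requestHandler({ request, page, enqueueLinks }) {
            const title = await page.title();
            console.log(`${title} (${request.url})`);
            await enqueueLinks();
        },
    });

    await crawler.run(['https://crawlee.dev']);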
Hi! It looks really REALLY cool!

Is there any kind of detection/stealthiness benchmark compared to libraries such as puppeteer-stealth or fakebrowser?

Honestly, no matter how feature-complete and powerful a scraping tool is, the main "selling point" for me will always be stealthiness/human-like behavior, even if the dev experience is crappy. (And IMHO that's the same for most serious scrapers/bot makers.)

Will it always be free, or could it turn into a product/paid SaaS? (Kind of like browserless.) I'm wondering whether it's worth learning if the next cool features are going to be for paying users only.

Is this something that you use internally, or is it just a way to promote your paid products?

Thanks :)
This looks cool at first glance. I'll dig into it more.

One note that may be helpful: if all you care about is the HTML, it's better to take a "snapshot" of the page by streaming the response directly to blob storage like S3. That way, if something fails and you need to retry, you can reference the saved raw data from storage instead of making another request and potentially getting blocked. Node pipelines make it really easy to chain this together with other logic.

For reference, I run a company that does large-scale scraping / data aggregation.
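A rough sketch of that pattern (bucket/key names are placeholders; assumes got and the AWS SDK v3):

    import { PassThrough } from 'node:stream';
    import { pipeline } from 'node:stream/promises';
    import got from 'got';
    import { S3Client } from '@aws-sdk/client-s3';
    import { Upload } from '@aws-sdk/lib-storage';

    const s3 = new S3Client({});

    // Stream the raw response straight into S3 without buffering it all in memory.
    async function snapshotToS3(url: string, bucket: string, key: string) {
        const body = new PassThrough();
        const upload = new Upload({
            client: s3,
            params: { Bucket: bucket, Key: key, Body: body },
        });
        await Promise.all([
            pipeline(got.stream(url), body), // HTTP response -> PassThrough
            upload.done(),                   // PassThrough -> S3 multipart upload
        ]);
    }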
I see you basically recommend bypassing rate limits by using proxies, etc.?

Why not just respect rate limits if they're set properly? A little bit of consideration for whatever/whoever is on the other end ;)
This looks really neat. I love the idea of a single API for both traditional and headless scraping.

In my experience, headless scraping is on the order of 10-100x slower and significantly more resource-intensive, even if you carefully block requests for images/ads/etc. (one way to do that is sketched below).

You should always start with traditional scraping, try as hard as you can to stick with it, and only move to headless if absolutely necessary. Sometimes, even if it takes 10x more "requests" to scrape traditionally, it's still faster than headless.
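For anyone curious, request blocking in plain Playwright looks roughly like this (a sketch, not Crawlee's own helper; Crawlee may ship a built-in for it):

    import { chromium } from 'playwright';

    const browser = await chromium.launch();
    const page = await browser.newPage();

    // Abort requests for heavy static assets before they hit the network.
    await page.route('**/*.{png,jpg,jpeg,gif,webp,svg,css,woff,woff2}', (route) =>
        route.abort(),
    );

    await page.goto('https://example.com');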
Jan, thanks for the open approach to running the tech behind Apify!

The libraries look useful. One question that wasn't obvious from the docs: how do you manage / suggest approaching rate limiting by domain? Ideally respecting Crawl-delay in robots.txt, or just defaulting to some sane value. Most naive queue implementations make it challenging, and queue-per-domain feels annoying to manage.
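For context, the kind of thing I mean, as a hypothetical sketch: a shared map of per-host "next slot" times instead of a separate queue per domain:

    // Hypothetical sketch: serialize requests per host with a minimum delay.
    const nextSlot = new Map<string, number>();

    async function politeFetch(url: string, minDelayMs = 2000): Promise<Response> {
        const host = new URL(url).hostname;
        // Reserve the next time slot for this host before awaiting, so
        // concurrent callers line up behind each other.
        const slot = Math.max(nextSlot.get(host) ?? 0, Date.now());
        nextSlot.set(host, slot + minDelayMs);
        const wait = slot - Date.now();
        if (wait > 0) await new Promise((r) => setTimeout(r, wait));
        return fetch(url);
    }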
Sweet that you went the free route and made it an npm package, with an optional upgrade to SaaS; that's the good way to do it. Cool stuff. I could have used this dearly the last time I scraped. Like others, I used mixed methods (headless browser for renders and direct calls) and wrote a lot of error-handling boilerplate.
Looks pretty cool. I'm working on a project that relies on regularly scraping large amounts of data. My codebase uses nodejs, and I'd love to try out a few of the features listed under "Helpful utils and configurability" as they might be able to solve a few pain points I have.
It would be very useful if this or some other library came with Captcha solvers or a way to add Captcha solvers to the scrapers. Even regular users get Captchas sometimes.
Can I use this to log into LinkedIn, run a query on posts and then send me an email of the results? (In theory of course as I am sure this will violate some policy)
Great job! Probably the best toolkit for DIY data extraction at the moment. Shines most "against" super-sophisticated sites. Well done, guys!
This looks really great. However, I can't find examples of how to handle scraping behind a login or a paywall, without having to 'type' credentials every time.
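In case it helps: one common pattern with plain Playwright (not Crawlee-specific; the example.com paths are placeholders) is to log in once, persist the session state, and reuse it on later runs instead of typing credentials each time:

    import { chromium } from 'playwright';

    const browser = await chromium.launch();

    // One-time: perform the login flow, then persist cookies/localStorage.
    const loginContext = await browser.newContext();
    const loginPage = await loginContext.newPage();
    await loginPage.goto('https://example.com/login');
    // ... fill in credentials and submit ...
    await loginContext.storageState({ path: 'auth.json' });

    // Later runs: restore the saved session.
    const context = await browser.newContext({ storageState: 'auth.json' });
    const page = await context.newPage();
    await page.goto('https://example.com/account');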
Cool.

One issue I have w/ webdriving a headless browser in general is host RAM usage per browser/Chromium/Puppeteer instance (e.g. ~600-900 MB) for a single browser/context/page.

Could Crawlee make it easier to run more browser contexts with less RAM usage?

E.g. concurrently running multiple of these (pages requiring JS execution):
<a href="https://crawlee.dev/docs/examples/forms" rel="nofollow">https://crawlee.dev/docs/examples/forms</a>
In a way, I hate you, but at the same time I love you. It's because I'm working on something similar to get data for my product. Seems like I'm going to use Apify instead to save my life.

Just some feedback from the developer point of view, tho: I think the documentation (both Crawlee & Apify) needs some work. It took me a while to get the difference between Crawlee & other headless browser tools like Playwright, etc.