Hey all,<p>This is Jan, the founder of Apify (<a href="https://apify.com/" rel="nofollow">https://apify.com/</a>) — a full-stack web scraping platform. After the success of Crawlee for JavaScript (<a href="https://github.com/apify/crawlee/">https://github.com/apify/crawlee/</a>) and strong demand from the Python community, we're launching Crawlee for Python today!<p>The main features are:<p>- A unified programming interface for both HTTP (HTTPX with BeautifulSoup) & headless browser crawling (Playwright)<p>- Automatic parallel crawling based on available system resources<p>- Written in Python with type hints for enhanced developer experience<p>- Automatic retries on errors or when you’re getting blocked<p>- Integrated proxy rotation and session management<p>- Configurable request routing - direct URLs to the appropriate handlers<p>- Persistent queue for URLs to crawl<p>- Pluggable storage for both tabular data and files<p>For details, you can read the announcement blog post: <a href="https://crawlee.dev/blog/launching-crawlee-python" rel="nofollow">https://crawlee.dev/blog/launching-crawlee-python</a><p>Our team and I will be happy to answer any questions you might have here.
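<p>For anyone who wants a feel for the API without clicking through, here is a minimal sketch of a BeautifulSoup-based crawler along the lines of the announcement post; treat the import path and option names as approximate, since they may differ between versions:
<pre><code>import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Limit the crawl so the demo finishes quickly.
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=50)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Store one record per page into the default dataset.
        await context.push_data({
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        })

        # Add every link found on the page to the persistent request queue.
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
</code></pre>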
You'll want to prioritize documenting the <i>existing</i> features, since it's no good having a super awesome full stack web scraping platform if only you can use it. I ordinarily would default to a "read the source" response but your cutesy coding style makes that a non-starter<p>As a concrete example: command-f for "tier" on <a href="https://crawlee.dev/python/docs/guides/proxy-management" rel="nofollow">https://crawlee.dev/python/docs/guides/proxy-management</a> and tell me how anyone could possibly know what `tiered_proxy_urls: list[list[str]] | None = None` should contain and why?
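<p>For what it's worth, here is a best guess at what that parameter expects, inferred from the type hint and the JavaScript docs rather than from the Python docs; the import path and proxy URLs below are assumptions:
<pre><code>from crawlee.proxy_configuration import ProxyConfiguration

# Guessing from the type hint: each inner list is one "tier" of proxies,
# ordered from cheapest/most likely to be blocked to most expensive.
# The crawler presumably starts at the lowest tier and escalates when
# requests keep failing or getting blocked.
proxy_configuration = ProxyConfiguration(
    tiered_proxy_urls=[
        # tier 0: cheap datacenter proxies (placeholder URLs)
        ['http://dc-proxy-1.example.com:8000', 'http://dc-proxy-2.example.com:8000'],
        # tier 1: residential proxies, only used when tier 0 keeps failing
        ['http://residential-proxy.example.com:8000'],
    ],
)
</code></pre>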
Does it have support for web scraping opt-out protocols, such as robots.txt, HTTP headers, and content meta tags?
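<p>Not an answer from the team, but while waiting for built-in support, robots.txt at least can be checked with the standard library before a URL is enqueued. A rough sketch; the helper name and user agent string are made up, and Crawlee may or may not offer anything like this out of the box:
<pre><code>from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def is_allowed_by_robots(url: str, user_agent: str = 'my-crawler') -> bool:
    """Fetch the site's robots.txt and check whether `url` may be crawled."""
    parts = urlsplit(url)
    robots_url = urlunsplit((parts.scheme, parts.netloc, '/robots.txt', '', ''))
    parser = RobotFileParser(robots_url)
    parser.read()  # downloads and parses robots.txt
    return parser.can_fetch(user_agent, url)
</code></pre>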
These are getting more important now, especially in the EU after the DSM directive.
I don't really understand it. I tried it on some fund site and it didn't really do much besides apparently grepping for links.<p>The example should show how to literally find and target all <i>data</i> (.csv and .xlsx tables, etc.) and actually download it.<p>Anyone can use requests to get the text and grep for URLs. I don't get it.<p>Remember: pick an example where you need to parse one thing to get thousands of other things, then hit some other endpoints, then get the 3-5 things at each of those. Any example that doesn't look like that is not going to impress anyone.<p>I'm not even clear whether this is a framework or an automation tool, where automation means it actually autodetects where to look.
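<p>For context, the listing-to-detail fan-out described above maps onto the request routing feature from the announcement (labeled handlers plus a persistent queue). A rough sketch of the shape; the selectors, label, and start URL are placeholders, and the import path may differ between versions:
<pre><code>import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    # Listing pages: fan out to thousands of detail pages and follow pagination.
    @crawler.router.default_handler
    async def list_handler(context: BeautifulSoupCrawlingContext) -> None:
        await context.enqueue_links(selector='a.document-link', label='DETAIL')
        await context.enqueue_links(selector='a.next-page')  # stays on the default handler

    # Detail pages: extract the handful of fields (or file URLs) you actually want.
    @crawler.router.handler('DETAIL')
    async def detail_handler(context: BeautifulSoupCrawlingContext) -> None:
        await context.push_data({
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        })

    await crawler.run(['https://example.com/documents'])


if __name__ == '__main__':
    asyncio.run(main())
</code></pre>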
I have been running my project with Selenium for some time.<p>Now I am using Crawlee. Thanks. I will keep working on integrating it better into my project, but I can already tell it works flawlessly.<p>My project, with Crawlee: <a href="https://github.com/rumca-js/Django-link-archive">https://github.com/rumca-js/Django-link-archive</a>
I found Crawlee a few days ago while figuring out a stack for a project. I wanted a Python library, but found Crawlee with TypeScript so much easier that I ended up coding the entire project in less than a week in TypeScript + Crawlee + Playwright.<p>I found the API a lot better than any Python scraping API to date. However, I am tempted to try out Python with Crawlee.<p>The Playwright integration with gotScraping makes the entire programming experience a breeze. My crawling and scraping involves all kinds of frontend-rendered websites with a lot of modified XHR responses to be captured. And IT JUST WORKS!<p>Thanks a ton. I will definitely use the Apify platform to scale, given the integration.
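<p>For anyone wondering how the XHR-capturing part might look in the Python version: the Playwright crawler exposes the raw Playwright page in the handler, so standard Playwright response-waiting should work. A sketch under those assumptions; the crawler import path, button selector, and URL filter are placeholders:
<pre><code>import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page  # the underlying Playwright page

        # Trigger an action and capture the XHR/fetch response it causes.
        async with page.expect_response(lambda r: '/api/items' in r.url) as response_info:
            await page.click('button.load-more')  # placeholder selector

        response = await response_info.value
        await context.push_data(await response.json())

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
</code></pre>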
Nice list, but what would be the arguments for switching over from other libraries? I’ve built my own crawler over time, but from what I see, there’s nothing truly unique.
Looks nice, and modern Python.<p>The code example on the front page has this:<p>`const data = await crawler.get_data()`<p>That looks like JavaScript? Is there a missing underscore?
Can this work on intranet sites like SharePoint or Confluence, which require employee SSO?<p>I was trying to build a small LangChain-based RAG on internal documents, but getting the documents from SharePoint/Confluence (we have both) is very painful.
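<p>Not a Crawlee-specific answer, but with any Playwright-based setup the usual workaround for SSO-gated intranets is to log in once interactively, save the browser storage state (cookies plus local storage), and reuse it in later headless runs. A rough sketch with plain Playwright; the URL and file name are placeholders, and wiring the saved state into Crawlee's browser options is an assumption on my part:
<pre><code>import asyncio

from playwright.async_api import async_playwright


async def save_sso_session() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()

        await page.goto('https://yourcompany.sharepoint.com')  # placeholder URL
        # Complete the SSO login by hand in the opened browser window.
        input('Press Enter once you are logged in...')

        # Persist cookies and local storage for later headless runs.
        await context.storage_state(path='auth_state.json')
        await browser.close()


if __name__ == '__main__':
    asyncio.run(save_sso_session())
</code></pre>
<p>Later runs can then create a browser context with `storage_state='auth_state.json'` so requests arrive already authenticated.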
Does it have event listeners to wait for specific elements based on certain pattern matches? One reason I am still using PhantomJS is that it simulates the entire browser and you can compile your own WebKit into it.
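<p>Since the headless side is Playwright, the usual Playwright waiting primitives should be available from inside a handler; a PhantomJS-style "wait until this shows up" check looks roughly like this (the crawler import path and selector are assumptions):
<pre><code>from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler()


@crawler.router.default_handler
async def handler(context: PlaywrightCrawlingContext) -> None:
    # Block until an element matching the pattern appears (placeholder selector).
    element = await context.page.wait_for_selector(
        'div.result-row:has-text("Price")', timeout=30_000
    )
    if element is not None:
        await context.push_data({'text': await element.inner_text()})
</code></pre>
<p>For pattern matches that can't be expressed as a selector, Playwright's `page.wait_for_function(...)` (a JS predicate polled inside the page) is the usual escape hatch.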
Pretty cool, and any scraping tool is really welcome - I'll try it out for my personal project. At the moment, due to AI, scraping is like selling shovels during a gold rush.