Hey all,<p>This is Jan, the founder of Apify (<a href="https://apify.com/" rel="nofollow">https://apify.com/</a>) — a full-stack web scraping platform. After the success of Crawlee for JavaScript (<a href="https://github.com/apify/crawlee/">https://github.com/apify/crawlee/</a>) and strong demand from the Python community, we're launching Crawlee for Python today!<p>The main features are:<p>- A unified programming interface for both HTTP (HTTPX with BeautifulSoup) & headless browser crawling (Playwright)<p>- Automatic parallel crawling based on available system resources<p>- Written in Python with type hints for enhanced developer experience<p>- Automatic retries on errors or when you’re getting blocked<p>- Integrated proxy rotation and session management<p>- Configurable request routing - direct URLs to the appropriate handlers<p>- Persistent queue for URLs to crawl<p>- Pluggable storage for both tabular data and files<p>For details, you can read the announcement blog post: <a href="https://crawlee.dev/blog/launching-crawlee-python" rel="nofollow">https://crawlee.dev/blog/launching-crawlee-python</a><p>Our team and I will be happy to answer any questions you might have here.
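<p>For anyone who wants a feel for the API without clicking through, here is a minimal sketch of a BeautifulSoup-based crawler along the lines of the announcement post; treat the import path and option names as approximate, since they may differ between versions:
<pre><code>import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Limit the crawl so the demo finishes quickly.
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=50)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Store one record per page into the default dataset.
        await context.push_data({
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        })

        # Add every link found on the page to the persistent request queue.
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
</code></pre>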
You'll want to prioritize documenting the <i>existing</i> features, since it's no good having a super awesome full stack web scraping platform if only you can use it. I ordinarily would default to a "read the source" response but your cutesy coding style makes that a non-starter<p>As a concrete example: command-f for "tier" on <a href="https://crawlee.dev/python/docs/guides/proxy-management" rel="nofollow">https://crawlee.dev/python/docs/guides/proxy-management</a> and tell me how anyone could possibly know what `tiered_proxy_urls: list[list[str]] | None = None` should contain and why?
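<p>For what it's worth, here is a best guess at what that parameter expects, inferred from the type hint and the JavaScript docs rather than from the Python docs; the import path and proxy URLs below are assumptions:
<pre><code>from crawlee.proxy_configuration import ProxyConfiguration

# Guessing from the type hint: each inner list is one "tier" of proxies,
# ordered from cheapest/most likely to be blocked to most expensive.
# The crawler presumably starts at the lowest tier and escalates when
# requests keep failing or getting blocked.
proxy_configuration = ProxyConfiguration(
    tiered_proxy_urls=[
        # tier 0: cheap datacenter proxies (placeholder URLs)
        ['http://dc-proxy-1.example.com:8000', 'http://dc-proxy-2.example.com:8000'],
        # tier 1: residential proxies, only used when tier 0 keeps failing
        ['http://residential-proxy.example.com:8000'],
    ],
)
</code></pre>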
Does it have support for web scraping opt-out protocols, such as robots.txt, HTTP headers, and content meta tags?
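<p>Not an answer from the team, but while waiting for built-in support, robots.txt at least can be checked with the standard library before a URL is enqueued. A rough sketch; the helper name and user agent string are made up, and Crawlee may or may not offer anything like this out of the box:
<pre><code>from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def is_allowed_by_robots(url: str, user_agent: str = 'my-crawler') -> bool:
    """Fetch the site's robots.txt and check whether `url` may be crawled."""
    parts = urlsplit(url)
    robots_url = urlunsplit((parts.scheme, parts.netloc, '/robots.txt', '', ''))
    parser = RobotFileParser(robots_url)
    parser.read()  # downloads and parses robots.txt
    return parser.can_fetch(user_agent, url)
</code></pre>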
These are getting more important now, especially in the EU after the DSM directive.
I don't really understand it. I tried it on some fund site and it didn't really do much besides apparently grepping for links.<p>The example should show how to literally find and target all <i>data</i> (.csv and .xlsx tables, etc.) and actually download it.<p>Anyone can use requests to get the text and grep for URLs. I don't get it.<p>Remember: pick an example where you need to parse one thing to get thousands of other things, then hit some other endpoints, then get the 3-5 things at each of those. Any example that doesn't look like that is not going to impress anyone.<p>I'm not even clear whether this is a framework or an automation tool, where automation means it actually autodetects where to look.
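<p>For context, the listing-to-detail fan-out described above maps onto the request routing feature from the announcement (labeled handlers plus a persistent queue). A rough sketch of the shape; the selectors, label, and start URL are placeholders, and the import path may differ between versions:
<pre><code>import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    # Listing pages: fan out to thousands of detail pages and follow pagination.
    @crawler.router.default_handler
    async def list_handler(context: BeautifulSoupCrawlingContext) -> None:
        await context.enqueue_links(selector='a.document-link', label='DETAIL')
        await context.enqueue_links(selector='a.next-page')  # stays on the default handler

    # Detail pages: extract the handful of fields (or file URLs) you actually want.
    @crawler.router.handler('DETAIL')
    async def detail_handler(context: BeautifulSoupCrawlingContext) -> None:
        await context.push_data({
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        })

    await crawler.run(['https://example.com/documents'])


if __name__ == '__main__':
    asyncio.run(main())
</code></pre>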
I have been running my project with Selenium for some time.<p>Now I am using Crawlee. Thanks. I will keep working on integrating it better into my project, but I can already tell it works flawlessly.<p>My project, with Crawlee: <a href="https://github.com/rumca-js/Django-link-archive">https://github.com/rumca-js/Django-link-archive</a>
I found Crawlee a few days ago while figuring out a stack for a project. I wanted a Python library, but found Crawlee with TypeScript so much easier that I ended up coding the entire project in less than a week in TypeScript + Crawlee + Playwright.<p>I found the API a lot better than any Python scraping API to date. However, I am tempted to try out Python with Crawlee.<p>The Playwright integration with gotScraping makes the entire programming experience a breeze. My crawling and scraping involves all kinds of frontend-rendered websites with a lot of modified XHR responses to be captured. And IT JUST WORKS!<p>Thanks a ton. I will definitely use the Apify platform to scale, given the integration.
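<p>For anyone wondering how the XHR-capturing part might look in the Python version: the Playwright crawler exposes the raw Playwright page in the handler, so standard Playwright response-waiting should work. A sketch under those assumptions; the crawler import path, button selector, and URL filter are placeholders:
<pre><code>import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page  # the underlying Playwright page

        # Trigger an action and capture the XHR/fetch response it causes.
        async with page.expect_response(lambda r: '/api/items' in r.url) as response_info:
            await page.click('button.load-more')  # placeholder selector

        response = await response_info.value
        await context.push_data(await response.json())

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
</code></pre>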
Nice list, but what would be the arguments for switching over from other libraries? I’ve built my own crawler over time, but from what I see, there’s nothing truly unique.
Looks nice, and modern Python.<p>The code example on the front page has this:<p>`const data = await crawler.get_data()`<p>That looks like JavaScript? Is there a missing underscore?
Can this work on intranet sites like SharePoint or Confluence, which require employee SSO?<p>I was trying to build a small LangChain-based RAG on internal documents, but getting the documents from SharePoint/Confluence (we have both) is very painful.
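<p>Not a Crawlee-specific answer, but with any Playwright-based setup the usual workaround for SSO-gated intranets is to log in once interactively, save the browser storage state (cookies plus local storage), and reuse it in later headless runs. A rough sketch with plain Playwright; the URL and file name are placeholders, and wiring the saved state into Crawlee's browser options is an assumption on my part:
<pre><code>import asyncio

from playwright.async_api import async_playwright


async def save_sso_session() -> None:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()

        await page.goto('https://yourcompany.sharepoint.com')  # placeholder URL
        # Complete the SSO login by hand in the opened browser window.
        input('Press Enter once you are logged in...')

        # Persist cookies and local storage for later headless runs.
        await context.storage_state(path='auth_state.json')
        await browser.close()


if __name__ == '__main__':
    asyncio.run(save_sso_session())
</code></pre>
<p>Later runs can then create a browser context with `storage_state='auth_state.json'` so requests arrive already authenticated.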
Does it have event listeners to wait for specific elements based on certain pattern matches? One reason I am still using PhantomJS is that it simulates the entire browser and you can compile your own WebKit into it.
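<p>Since the headless side is Playwright, the usual Playwright waiting primitives should be available from inside a handler; a PhantomJS-style "wait until this shows up" check looks roughly like this (the crawler import path and selector are assumptions):
<pre><code>from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler()


@crawler.router.default_handler
async def handler(context: PlaywrightCrawlingContext) -> None:
    # Block until an element matching the pattern appears (placeholder selector).
    element = await context.page.wait_for_selector(
        'div.result-row:has-text("Price")', timeout=30_000
    )
    if element is not None:
        await context.push_data({'text': await element.inner_text()})
</code></pre>
<p>For pattern matches that can't be expressed as a selector, Playwright's `page.wait_for_function(...)` (a JS predicate polled inside the page) is the usual escape hatch.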
Pretty cool, and any scraping tool is really welcome - I'll try it out for my personal project. At the moment, due to AI, scraping is like selling shovels during a gold rush.