Show HN: Crawlee for Python – a web scraping and browser automation library

254 points by jancurn, 11 months ago
Hey all,

This is Jan, the founder of Apify (https://apify.com/), a full-stack web scraping platform. After the success of Crawlee for JavaScript (https://github.com/apify/crawlee/) and the demand from the Python community, we're launching Crawlee for Python today!

The main features are:

- A unified programming interface for both HTTP (HTTPX with BeautifulSoup) and headless browser crawling (Playwright)
- Automatic parallel crawling based on available system resources
- Written in Python with type hints for enhanced developer experience
- Automatic retries on errors or when you're getting blocked
- Integrated proxy rotation and session management
- Configurable request routing: direct URLs to the appropriate handlers
- Persistent queue for URLs to crawl
- Pluggable storage for both tabular data and files

For details, you can read the announcement blog post: https://crawlee.dev/blog/launching-crawlee-python

Our team and I will be happy to answer any questions you might have here.
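For a concrete feel of the API described above, here is a minimal sketch of a BeautifulSoup-based crawler, pieced together from the launch-time quick start. The import path and parameter names reflect the launch-era layout and may have moved in later releases, so check the current docs before relying on it:

    import asyncio

    from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


    async def main() -> None:
        # Cap the crawl so the example finishes quickly.
        crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

        @crawler.router.default_handler
        async def handler(context: BeautifulSoupCrawlingContext) -> None:
            context.log.info(f'Processing {context.request.url} ...')
            # Store one record per page in the default dataset (pluggable storage).
            await context.push_data({
                'url': context.request.url,
                'title': context.soup.title.string if context.soup.title else None,
            })
            # Discovered links go into the persistent request queue.
            await context.enqueue_links()

        await crawler.run(['https://crawlee.dev'])


    if __name__ == '__main__':
        asyncio.run(main())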

18 comments

mdaniel, 11 months ago
You'll want to prioritize documenting the *existing* features, since it's no good having a super awesome full stack web scraping platform if only you can use it. I ordinarily would default to a "read the source" response but your cutesy coding style makes that a non-starter.

As a concrete example: command-F for "tier" on https://crawlee.dev/python/docs/guides/proxy-management and tell me how anyone could possibly know what `tiered_proxy_urls: list[list[str]] | None = None` should contain and why?
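For readers with the same question: judging by the type hint and by how the JavaScript version handles proxy tiers, each inner list appears to be one tier of proxy URLs, and the crawler seems to fall back to the next tier when requests from the current one keep getting blocked. A hedged sketch of what a call might look like, with made-up proxy URLs and an import path that should be verified against the docs:

    from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
    from crawlee.proxy_configuration import ProxyConfiguration

    # Hypothetical proxies: tier 0 (cheap datacenter) is tried first; tier 1
    # (residential) is the assumed fallback when tier 0 keeps getting blocked.
    proxy_configuration = ProxyConfiguration(
        tiered_proxy_urls=[
            ['http://datacenter-proxy-1.example.com:8000',
             'http://datacenter-proxy-2.example.com:8000'],
            ['http://residential-proxy-1.example.com:8000'],
        ],
    )

    crawler = BeautifulSoupCrawler(proxy_configuration=proxy_configuration)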
Findecanor, 11 months ago
Does it have support for web scraping opt-out protocols, such as Robots.txt, HTTP and content tags? These are getting more important now, especially in the EU after the DSM directive.
nobodywillobsrv, 10 months ago
I don't really understand it. I tried it on some fund site and it didn't really do much besides apparently grepping for links.

The example should show how to literally find and target all *data*, as in .csv and .xlsx tables etc., and actually download it.

Anyone can use requests and just get the text and grep for URLs. I don't get it.

Remember: pick an example where you need to parse one thing to get thousands of other things, to then hit some other endpoints, to then get the 3-5 things at each of those. Any example that doesn't look like that is not going to impress anyone.

I'm not even clear if this is saying it's a framework or actually some automation tool. Automation meaning it actually autodetects where to look.
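The multi-stage shape described here (one listing page fans out into many detail pages, each yielding a handful of fields) maps onto Crawlee's labeled request routing. A rough sketch under assumed selectors, labels, and URLs (all made up for illustration):

    import asyncio

    from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


    async def main() -> None:
        crawler = BeautifulSoupCrawler()

        @crawler.router.default_handler
        async def listing_handler(context: BeautifulSoupCrawlingContext) -> None:
            # Stage 1: the listing page yields many detail links.
            # The CSS selector and the 'DETAIL' label are hypothetical.
            await context.enqueue_links(selector='a.fund-detail', label='DETAIL')

        @crawler.router.handler('DETAIL')
        async def detail_handler(context: BeautifulSoupCrawlingContext) -> None:
            # Stage 2: extract the few fields of interest from each detail page.
            heading = context.soup.select_one('h1')
            await context.push_data({
                'url': context.request.url,
                'name': heading.get_text(strip=True) if heading else None,
            })

        await crawler.run(['https://example.com/funds'])


    if __name__ == '__main__':
        asyncio.run(main())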
renegat0x0, 10 months ago
I have been running my project with Selenium for some time. Now I am using Crawlee. Thanks. I will work on integrating it better into my project, but I can already tell it works flawlessly.

My project, with Crawlee: https://github.com/rumca-js/Django-link-archive
c0brac0bra, 11 months ago
Wanted to say thanks for apify/crawlee. I'm a long-time Node.js user and your library has worked better than all the others I've tried.
intev, 11 months ago
How is this different from Scrapy?
ranedk, 11 months ago
I found Crawlee a few days ago while figuring out a stack for a project. I wanted a Python library, but found Crawlee with TypeScript so much easier that I ended up coding the entire project in less than a week in TypeScript + Crawlee + Playwright.

I found the API a lot better than any Python scraping API to date. However, I am tempted to try out Python with Crawlee.

The Playwright integration with gotScraping makes the entire programming experience a breeze. My crawling and scraping involves all kinds of frontend-rendered websites with a lot of modified XHR responses to be captured. And IT JUST WORKS!

Thanks a ton. I will definitely use the Apify platform to scale, given the integration.
marban, 11 months ago
Nice list, but what would be the arguments for switching over from other libraries? I’ve built my own crawler over time, but from what I see, there’s nothing truly unique.
VagabundoP, 11 months ago
Looks nice, and modern Python.

The code example on the front page has this:

`const data = await crawler.get_data()`

That looks like JavaScript? Is there a missing underscore?
fforflo, 11 months ago
I'd suggest bringing more code snippets from the test cases into the documentation as examples.

Nice work though.
manishsharan, 11 months ago
Can this work on intranet sites like SharePoint or Confluence, which require employee SSO?

I was trying to build a small LangChain-based RAG over internal documents, but getting the documents out of SharePoint/Confluence (we have both) is very painful.
holoduke, 11 months ago
Does it have event listeners to wait for specific elements based on certain pattern matches? One reason I am still using PhantomJS is that it simulates the entire browser and you can compile your own WebKit into it.
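On the element-waiting part: in Playwright mode the crawling context exposes the underlying Playwright page, so the usual waiting primitives should be available. A rough sketch under that assumption (the selector and URL are made up, and the import path reflects the launch-era layout):

    import asyncio

    from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


    async def main() -> None:
        crawler = PlaywrightCrawler()

        @crawler.router.default_handler
        async def handler(context: PlaywrightCrawlingContext) -> None:
            # context.page is a regular Playwright page, so we can wait for a
            # specific element (hypothetical selector) before extracting data.
            await context.page.wait_for_selector('div.result-card', timeout=10_000)
            await context.push_data({'url': context.request.url})

        await crawler.run(['https://example.com/search'])


    if __name__ == '__main__':
        asyncio.run(main())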
barrenko, 11 months ago
Pretty cool, and any scraping tool is really welcome - I'll try it out for my personal project. At the moment, due to AI, scraping is like selling shovels during a gold rush.
ijustlovemath, 11 months ago
Do you have any plans to monetize this? How are you supporting development?
renegat0x0, 11 months ago
Can it be used to obtain RSS contents? Most of the examples focus on HTML.
bmitc, 11 months ago
Can you use this to auto-logon to systems?
localfirst, 11 months ago
In one sentence, what does this do that existing web scraping and browser automation libraries don't do?
thelastgallon, 11 months ago
I wonder if there are any AI tools that do web scraping for you without having to write any code?