How to Scrape Web Using Python, Selenium and Beautiful Soup

54 points by chsasank over 6 years ago

3 comments

xarball over 6 years ago
Why would you switch from Selenium to Beautiful Soup halfway through what you're trying to do, and force your program to re-request the same information from the web server? Selenium has access to the entire DOM and the entire JavaScript session already loaded in a running web browser. It has way more power for data mining than Beautiful Soup does.

It looks like they're just trying to use selectors, but these directions seem to completely miss that functionality in Selenium's API. Just search the WebDriver documentation for 'find_element_by_':

https://selenium-python.readthedocs.io/api.html

I use Selenium for all my web crawling, exactly because I would rather have one crawler with all the backing support of a modern web browser than corner myself into not having something as crucial as a JavaScript parser halfway through implementing a bot that's designed to hook what's basically an end-user interface sitting on top of all that.

The most obvious benefit of Selenium, to me, is that by having all that, I can make my interactions with a web server look *more* like a user and fly under the radar a little more. This tends to require less work on my part when I treat websites more like a whole package (though more RAM, yes!).
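
To make the point above concrete, here is a minimal sketch of reading links straight from the DOM that Selenium has already loaded, with no second request and no second parser. It assumes Selenium 4 and a local Chrome install. The names `Link`, `links_from_elements`, and `scrape_links` are hypothetical, not from the article. Note that the `find_element_by_*` helpers mentioned above were deprecated in Selenium 4, so the sketch uses `find_elements(By.CSS_SELECTOR, ...)` instead.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Plain record for one anchor found on the page (hypothetical name)."""
    text: str
    href: str


def links_from_elements(elements):
    """Convert WebElement-like objects into Link records.

    Relies only on `.text` and `.get_attribute("href")`, which Selenium's
    WebElement provides.
    """
    links = []
    for element in elements:
        href = element.get_attribute("href")
        if href:  # skip anchors that have no target
            links.append(Link(text=element.text.strip(), href=href))
    return links


def scrape_links(url, css_selector="a"):
    """Load `url` once in a real browser and query the live DOM."""
    # Imported lazily so the helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumption: Chrome is installed locally
    try:
        driver.get(url)
        # Single page load: select from the DOM the browser already built,
        # including anything JavaScript added after the initial response.
        elements = driver.find_elements(By.CSS_SELECTOR, css_selector)
        return links_from_elements(elements)
    finally:
        driver.quit()  # always release the browser process
```

Calling `scrape_links("https://example.com")` would open a browser, collect the anchors, and close it again. The URL is only a placeholder.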
haloux over 6 years ago
Ryan Mitchell gave an excellent talk at DEF CON 23 about defeating bot checks and other common barriers that web scrapers face. Excellent watch for anyone interested in scraping: https://youtu.be/PADKIdSPOsc

Shameless plug: her O'Reilly book and associated GitHub work, "Web Scraping with Python", is an excellent read.
fareesh over 6 years ago
Coming from a Ruby background, I've always been curious about Python's libraries for scraping. I've tried Scrapy and Beautiful Soup, but somehow kept going back to Nokogiri and Mechanize.

I found the CSS selector or XPath based syntax and the DSL to be a lot more convenient and less verbose to deal with.

Is Selenium still the best bet for parsing JS-powered pages these days? I was under the impression that headless Chrome was more memory / performance efficient.

I do a lot of scraping work but my methods have not really evolved in the past 3-4 years. Always on the lookout for something more elegant / quicker.
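
On the headless Chrome question: the two are not mutually exclusive, since Selenium can drive Chrome in headless mode. A rough sketch follows, assuming Selenium 4 and a reasonably recent Chrome. The function names are hypothetical, and this makes no claim about how much memory it actually saves.

```python
def headless_chrome_arguments(width=1280, height=800):
    """Command-line flags for running Chrome without a visible window."""
    return [
        # Assumption: a recent Chrome build. Older builds use plain "--headless".
        "--headless=new",
        f"--window-size={width},{height}",
    ]


def make_headless_driver():
    """Start a Selenium-controlled Chrome with the flags above."""
    # Imported lazily so the flag helper works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in headless_chrome_arguments():
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

The driver returned by `make_headless_driver()` is used like any other Selenium driver, so existing selector code should not need to change.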