科技回声

3 comments

xarball, over 6 years ago

Why would you switch from Selenium to Beautiful Soup halfway through what you're trying to do, and force your program to re-request the same information from the web server? Selenium has access to the entire DOM and to the entire JavaScript session already loaded in a running web browser. It has far more power for data mining than Beautiful Soup does.

It looks like they're just trying to use selectors, but these directions seem to completely miss that functionality in Selenium's API. Just search the WebDriver documentation for 'find_element_by_': https://selenium-python.readthedocs.io/api.html

I use Selenium for all my web crawling, precisely because I would rather have one crawler with all the backing support of a modern web browser than corner myself into not having something as crucial as a JavaScript parser halfway through implementing a bot that's designed to hook what's basically an end-user interface sitting on top of all that.

The most obvious benefit of Selenium to me is that, by having all that, I can make my interactions with a web server look more like a user's and fly under the radar a little more. This tends to require less work on my part when I treat websites more like a whole package (though it costs more RAM, yes!).
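For anyone following along, here is a minimal sketch of the selector-only approach described above: query the DOM that Selenium already has loaded instead of handing the page to a second parser. The helper name `extract_links`, the selector `"a"`, and the URL `https://example.com/` are placeholders for illustration, not anything from the original tutorial. Note that recent Selenium releases (4.3 and later) removed the `find_element_by_*` helpers in favor of `find_element(By.CSS_SELECTOR, ...)`, so the sketch uses the newer form; the string `"css selector"` is the value of `By.CSS_SELECTOR`.

```python
# Same value as selenium.webdriver.common.by.By.CSS_SELECTOR.
CSS_SELECTOR = "css selector"


def extract_links(driver, selector):
    """Return (text, href) pairs for every element matching a CSS selector.

    `driver` is assumed to be a Selenium WebDriver (or anything exposing
    the same find_elements(by, value) method).
    """
    results = []
    for element in driver.find_elements(CSS_SELECTOR, selector):
        results.append((element.text.strip(), element.get_attribute("href")))
    return results


def main():
    # Imported here so extract_links() can also be tried with a stand-in
    # driver on a machine that has no browser installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com/")  # placeholder URL
        # The page is fetched once; the query runs against the live DOM.
        for text, href in extract_links(driver, "a"):
            print(text, href)
    finally:
        driver.quit()
```

Calling `main()` needs Chrome and the `selenium` package installed. The page is requested a single time, which is the point being made about not re-fetching it for Beautiful Soup.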

haloux, over 6 years ago

Ryan Mitchell gave an excellent talk at DEF CON 23 about defeating bot checks and other common barriers that web scrapers face. An excellent watch for anyone interested in scraping: https://youtu.be/PADKIdSPOsc

Shameless plug: her O'Reilly book and associated GitHub work, "Web Scraping with Python", is an excellent read.

fareesh, over 6 years ago

Coming from a Ruby background, I've always been curious about Python's libraries for scraping. I've tried Scrapy and Beautiful Soup, but somehow kept going back to Nokogiri and Mechanize.

I found the CSS selector or XPath based syntax and the DSL to be a lot more convenient and less verbose to deal with.

Is Selenium still the best bet for parsing JS-powered pages these days? I was under the impression that headless Chrome was more memory- and performance-efficient.

I do a lot of scraping work, but my methods have not really evolved in the past 3-4 years. I'm always on the lookout for something more elegant or quicker.

How to Scrape Web Using Python, Selenium and Beautiful Soup
