
How to scrape and extract hyperlink networks with BeautifulSoup and NetworkX

53 points by spacejunkjim over 3 years ago

3 comments

rabuse over 3 years ago
This breaks when using standard web scraping methods (non-headless JS engines). Had to deal with this issue recently due to everything being a damn SPA now. Look into Selenium for running headless browsers if you're looking to scrape the modern web.
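A minimal sketch of the headless-browser approach this comment describes, assuming Selenium 4 with Chrome installed; the target URL is a placeholder and the link-extraction step is added for illustration, not taken from the comment:

```python
# Headless Chrome renders the JavaScript first, then BeautifulSoup parses the result.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless=new")        # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")         # placeholder URL; any JS-heavy SPA
    html = driver.page_source                 # DOM after JavaScript has executed
finally:
    driver.quit()

# With the rendered HTML in hand, the usual BeautifulSoup parsing works again.
soup = BeautifulSoup(html, "html.parser")
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
```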
funnyflamigo over 3 years ago
I know some people think all scraping is bad or malicious. I'd like to point out this is a perfectly legitimate use case for it, in fact this is how Google Search operates.

Web scraping done correctly should be barely noticeable if at all to the operators. Don't send 10,000 req/s, have aggressive delays, make your retries extremely generous, try to avoid pages or actions you know are "heavy". You don't need to update data from every product page every 5 minutes.
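A rough sketch of the gentle-crawling advice above, assuming the requests library; the URLs, delay values, and User-Agent string are illustrative placeholders, not figures from the comment:

```python
import time
import requests

DELAY_SECONDS = 5           # generous pause between every request
BACKOFF_SECONDS = 60        # wait even longer before retrying a failure
MAX_RETRIES = 3

def polite_get(url):
    """Fetch a page slowly, backing off hard on failures instead of retrying in a tight loop."""
    for attempt in range(MAX_RETRIES):
        try:
            resp = requests.get(url, headers={"User-Agent": "my-research-crawler"}, timeout=30)
            if resp.status_code == 200:
                return resp.text
        except requests.RequestException:
            pass
        time.sleep(BACKOFF_SECONDS * (attempt + 1))   # increasingly generous retries
    return None

for url in ["https://example.com/page1", "https://example.com/page2"]:
    html = polite_get(url)
    time.sleep(DELAY_SECONDS)   # space requests out; never hammer the site
```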
pjfin123 over 3 years ago
I wrote a similar Python library to do Beautiful Soup scraping with basic PageRank and a Flask web app: https://github.com/argosopentech/argos-search
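For readers who want the gist of what the submission title and this comment describe, here is a minimal sketch assuming requests, BeautifulSoup, and NetworkX; the seed URL and crawl limit are placeholders, and this is not the code from the linked article or library:

```python
from urllib.parse import urljoin
import requests
import networkx as nx
from bs4 import BeautifulSoup

def extract_links(url):
    """Return the absolute URLs of every hyperlink on a page."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        return []
    soup = BeautifulSoup(resp.text, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

graph = nx.DiGraph()
seed = "https://example.com"            # placeholder seed URL
frontier, seen = [seed], set()

while frontier and len(seen) < 20:      # tiny crawl limit, purely for illustration
    page = frontier.pop()
    if page in seen:
        continue
    seen.add(page)
    for target in extract_links(page):
        graph.add_edge(page, target)    # one directed edge per hyperlink
        frontier.append(target)

# Rank pages by the structure of the hyperlink network.
for node, score in sorted(nx.pagerank(graph).items(), key=lambda kv: -kv[1])[:10]:
    print(f"{score:.4f}  {node}")
```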