
How to scrape and extract hyperlink networks with BeautifulSoup and NetworkX

53 points by spacejunkjim over 3 years ago

3 comments

rabuse over 3 years ago
This breaks when using standard web scraping methods (non-headless JS engines). Had to deal with this issue recently due to everything being a damn SPA now. Look into Selenium for running headless browsers if you're looking to scrape the modern web.
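A minimal sketch of the headless-browser approach this comment describes, assuming Selenium 4 with Chrome installed; the target URL is a placeholder and the link-extraction step is added for illustration, not taken from the comment:

```python
# Headless Chrome renders the JavaScript first, then BeautifulSoup parses the result.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless=new")        # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")         # placeholder URL; any JS-heavy SPA
    html = driver.page_source                 # DOM after JavaScript has executed
finally:
    driver.quit()

# With the rendered HTML in hand, the usual BeautifulSoup parsing works again.
soup = BeautifulSoup(html, "html.parser")
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
```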
funnyflamigo over 3 years ago
I know some people think all scraping is bad or malicious. I'd like to point out this is a perfectly legitimate use case for it, in fact this is how Google Search operates.

Web scraping done correctly should be barely noticeable if at all to the operators. Don't send 10,000 req/s, have aggressive delays, make your retries extremely generous, try to avoid pages or actions you know are "heavy". You don't need to update data from every product page every 5 minutes.
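A rough sketch of the gentle-crawling advice above, assuming the requests library; the URLs, delay values, and User-Agent string are illustrative placeholders, not figures from the comment:

```python
import time
import requests

DELAY_SECONDS = 5           # generous pause between every request
BACKOFF_SECONDS = 60        # wait even longer before retrying a failure
MAX_RETRIES = 3

def polite_get(url):
    """Fetch a page slowly, backing off hard on failures instead of retrying in a tight loop."""
    for attempt in range(MAX_RETRIES):
        try:
            resp = requests.get(url, headers={"User-Agent": "my-research-crawler"}, timeout=30)
            if resp.status_code == 200:
                return resp.text
        except requests.RequestException:
            pass
        time.sleep(BACKOFF_SECONDS * (attempt + 1))   # increasingly generous retries
    return None

for url in ["https://example.com/page1", "https://example.com/page2"]:
    html = polite_get(url)
    time.sleep(DELAY_SECONDS)   # space requests out; never hammer the site
```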
pjfin123 over 3 years ago
I wrote a similar Python library to do Beautiful Soup scraping with basic PageRank and a Flask web app: https://github.com/argosopentech/argos-search
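For readers who want the gist of what the submission title and this comment describe, here is a minimal sketch assuming requests, BeautifulSoup, and NetworkX; the seed URL and crawl limit are placeholders, and this is not the code from the linked article or library:

```python
from urllib.parse import urljoin
import requests
import networkx as nx
from bs4 import BeautifulSoup

def extract_links(url):
    """Return the absolute URLs of every hyperlink on a page."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        return []
    soup = BeautifulSoup(resp.text, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

graph = nx.DiGraph()
seed = "https://example.com"            # placeholder seed URL
frontier, seen = [seed], set()

while frontier and len(seen) < 20:      # tiny crawl limit, purely for illustration
    page = frontier.pop()
    if page in seen:
        continue
    seen.add(page)
    for target in extract_links(page):
        graph.add_edge(page, target)    # one directed edge per hyperlink
        frontier.append(target)

# Rank pages by the structure of the hyperlink network.
for node, score in sorted(nx.pagerank(graph).items(), key=lambda kv: -kv[1])[:10]:
    print(f"{score:.4f}  {node}")
```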