Web Scraping a Javascript Heavy Website: Keeping Things Simple

49 points by kuhn, over 11 years ago

8 comments

rgarcia, over 11 years ago
I used to use the network tab for stuff like this, but now I almost exclusively use mitmproxy [0]. Once things get sufficiently complicated, the constant scrolling and clicking around in the network tab feels tedious. Plus it's difficult to capture activity if a site has popups or multiple windows. mitmproxy solves these problems and also has a ton more features, like replaying requests and saving to files. My ideal tool involves something that translates mitmdump into code that performs the equivalent raw HTTP requests (e.g. using Python's requests). Sort of like Selenium's IDE but for super lightweight scraping.

[0] http://mitmproxy.org/
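As a rough illustration of the kind of code such a tool would emit, here is a minimal sketch using Python's requests; the endpoint, header, and cookie values are hypothetical stand-ins for whatever the proxy captured:

    # Sketch of the raw-HTTP replay code a mitmdump-to-code translator
    # might emit. Endpoint, headers, and cookie values are hypothetical
    # placeholders for whatever mitmproxy actually captured.
    import requests

    captured_headers = {
        "User-Agent": "Mozilla/5.0",
        "X-Requested-With": "XMLHttpRequest",  # marks the call as an XHR
    }
    captured_cookies = {"session_id": "abc123"}  # copied from the capture

    resp = requests.get(
        "http://example.com/api/items",  # hypothetical endpoint seen in the proxy
        headers=captured_headers,
        cookies=captured_cookies,
    )
    resp.raise_for_status()
    print(resp.json())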
hazz, over 11 years ago
In many cases, websites that load data asynchronously through an API are much nicer to scrape, as the data is already structured for you. You don't have to go through the pain of extracting data from a mess of tables, divs and spans.
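To make the contrast concrete, pulling from a JSON endpoint collapses the extraction step to almost nothing; a sketch with a hypothetical endpoint and field names:

    # Sketch: when data arrives as JSON, no HTML parsing is needed.
    # The endpoint and field names below are hypothetical.
    import requests

    data = requests.get("http://example.com/api/products?page=1").json()
    for item in data["items"]:  # assumes rows live under an "items" key
        print(item["name"], item["price"])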
bdcravens, over 11 years ago
I've done a lot of scraping. Some sites use heavy JavaScript frameworks that generate session IDs and request IDs that the XHR requests use to "authenticate" the request. In these situations, the amount of work to reverse engineer that workflow is pretty large, so I lean on headless Selenium. I know there are some lighter solutions, but Selenium offers some distinct advantages:

1) lots of library support, in multiple languages

2) without having to fake UAs, etc., the requests look more like a regular user (all media assets downloaded, normal browser UA, etc.)

3) simple clustering: setting up a Selenium grid is very easy, and switching from a local instance of Selenium to using the grid requires very little code change (one line in most cases, as sketched below)
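A minimal sketch of that one-line switch, using the current Selenium Python bindings (the grid hub URL is a placeholder):

    # Sketch: switching from a local Selenium instance to a grid.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    # Local instance:
    driver = webdriver.Firefox(options=Options())

    # The one-line change to use a grid instead
    # (hub URL is a placeholder for your own grid):
    # driver = webdriver.Remote(
    #     command_executor="http://my-grid-host:4444/wd/hub",
    #     options=Options(),
    # )

    driver.get("http://example.com")
    print(driver.page_source[:200])
    driver.quit()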
hayksaakian, over 11 years ago
Before any naysayers complain about the idea of using undocumented endpoints, keep in mind that this is all in the context of web scraping.
timscott, over 11 years ago
I've recently been learning all this the hard way.

1. Documented API. Failing that...

2. HTTP client fetching structured data (XHR calls). Failing that...

3. HTTP client fetching and scraping HTML documents. Failing that...

4. Headless browser

I recently found myself pushed to #4 to handle sites with over-complex JS or anti-automation techniques.
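Steps 2 and 4 resemble the sketches in the comments above; for step 3, a minimal illustration with requests and BeautifulSoup (the URL and CSS selector are hypothetical):

    # Sketch of step 3: fetch an HTML document and scrape it.
    # URL and selector are hypothetical.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("http://example.com/listings").text
    soup = BeautifulSoup(html, "html.parser")
    for cell in soup.select("table.listings tr td.title"):
        print(cell.get_text(strip=True))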
wslh, over 11 years ago
If you liked this article, you might also be interested in "Scraping Web Sites which Dynamically Load Data": http://blog.databigbang.com/scraping-web-sites-which-dynamically-load-data/
corford, over 11 years ago
For JS-heavy sites, I've found proxying the traffic through Fiddler is the easiest way to discover the API endpoints I need to hit.
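Once the endpoints are identified, the same proxy can be used to compare a script's traffic against the browser's; a sketch pointing Python's requests at Fiddler's default local listener (port 8888 unless reconfigured):

    # Sketch: route a scraper's traffic through a local Fiddler instance
    # so its requests appear alongside the browser's for comparison.
    import requests

    proxies = {
        "http": "http://127.0.0.1:8888",
        "https": "http://127.0.0.1:8888",
    }
    # verify=False because Fiddler re-signs HTTPS traffic with its own cert.
    resp = requests.get("http://example.com/api/data", proxies=proxies, verify=False)
    print(resp.status_code)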
cpayne, over 11 years ago
I'm getting a 404 - Page not found