Next.js v13 websites are a heaven for scraping

81 points by punkpeye, over 1 year ago
Just a random thought while analyzing some Next.js v13 website HTML markup.

Due to the way that RSC hydrates elements during client-side rendering, Next.js has to provide all their property data using JSON (see this issue: https://github.com/vercel/next.js/discussions/42170).

As a result, any website that uses Next.js with RSC is extremely easy to extract data from, since you can tap into the JSON of every element.

Just an interesting observation without implying whether it is good or bad for the ecosystem.
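To make the observation concrete, here is a minimal Python sketch. It assumes the page was rendered by the Next.js App Router and that the RSC payload is inlined as `self.__next_f.push([...])` script calls; the sample HTML, the function names, and the row-parsing shortcut are all illustrative assumptions, and real payloads contain row types this does not handle.

```python
import json
import re

# Assumption: each inlined chunk looks like
#   <script>self.__next_f.push([1,"<escaped payload>"])</script>
PUSH_RE = re.compile(
    r"self\.__next_f\.push\((\[.*?\])\)\s*</script>", re.DOTALL
)

# Hypothetical page fragment, made up for this example.
SAMPLE_HTML = (
    r'<script>self.__next_f.push([0])</script>'
    r'<script>self.__next_f.push([1,"1:{\"title\":\"Engineer\"}\n"])</script>'
)


def extract_flight_chunks(html: str) -> list[str]:
    """Return the string payload of every push() call found in the HTML."""
    chunks = []
    for match in PUSH_RE.finditer(html):
        try:
            entry = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # argument was not plain JSON; skip it
        if len(entry) >= 2 and isinstance(entry[1], str):
            chunks.append(entry[1])
    return chunks


def parse_rows(chunks: list[str]) -> dict[str, object]:
    """Simplified parsing: treat each line as '<id>:<json>' and keep what parses."""
    rows = {}
    for line in "".join(chunks).split("\n"):
        row_id, sep, payload = line.partition(":")
        if not sep:
            continue
        try:
            rows[row_id] = json.loads(payload)
        except json.JSONDecodeError:
            continue  # module references, text rows, etc. are ignored here
    return rows


if __name__ == "__main__":
    print(parse_rows(extract_flight_chunks(SAMPLE_HTML)))
```

No headless browser is involved: the data is read straight out of the served HTML, which is the point the post is making.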

7 comments

isbvhodnvemrwvn, over 1 year ago
It also shows how awful some sites are. I'm scraping a job offer portal written in Next.js (although I use the endpoint for hydration). One of the lovely things they do is pass in props containing all possible skills candidates can have: they send 200KB per listing, of which less than 10KB on average is job-specific (actually required) data.

Previously they were rendering it from a REST endpoint which only had the necessary data, but now I get a lovely dictionary of all possible values of various properties, provided on a silver platter.

Another portal I'm scraping uses GraphQL, and they don't have access control. I get all the data they have on impressions, I know how much money somebody paid for a listing, I know when the offer is going to be auto-bumped. Just lovely. And no need to use Playwright either.
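The comment does not say which hydration endpoint is meant. As a hedged sketch, assume a Pages Router site, where the page embeds its props in a `<script id="__NEXT_DATA__">` tag and the same props are served as JSON under `/_next/data/<buildId>/<path>.json`. The sample HTML, the key names (`job`, `allSkills`), and the helper names are made up for illustration.

```python
import json
import re

# Assumption: Pages Router markup with an inline __NEXT_DATA__ JSON script tag.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)

# Hypothetical listing page; "allSkills" stands in for the oversized dictionary.
SAMPLE_HTML = (
    '<html><script id="__NEXT_DATA__" type="application/json">'
    '{"buildId":"abc123","page":"/jobs/[id]","props":{"pageProps":'
    '{"job":{"id":7,"title":"Engineer"},"allSkills":["a","b","c"]}}}'
    "</script></html>"
)


def read_next_data(html: str):
    """Return the parsed __NEXT_DATA__ object, or None if the tag is absent."""
    match = NEXT_DATA_RE.search(html)
    return json.loads(match.group(1)) if match else None


def data_url(origin: str, build_id: str, path: str) -> str:
    """Build the JSON data URL for a page path; '/' maps to 'index'."""
    trimmed = path.strip("/") or "index"
    return f"{origin.rstrip('/')}/_next/data/{build_id}/{trimmed}.json"


def pick(props: dict, wanted: list[str]) -> dict:
    """Keep only the job-specific keys and drop the bulk dictionaries."""
    return {key: props[key] for key in wanted if key in props}


if __name__ == "__main__":
    data = read_next_data(SAMPLE_HTML)
    page_props = data["props"]["pageProps"]
    print(data_url("https://example.com", data["buildId"], "/jobs/7"))
    print(pick(page_props, ["job"]))
```

The `pick` step mirrors the complaint above: most of the payload is a shared dictionary, so a scraper keeps the few job-specific keys and discards the rest.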
javascriptdante, over 1 year ago
I sometimes wonder if the JavaScript world has some sort of divine significance, as though they carry the weight of all the developer sins in the world for some particular transcendental purpose. After all, Dante's nine circles of hell are insufficient for programmers who are able to recurse infinitely. How much further can we go?

Consider the descent down, and the contradictions therein, considering JUST data concerns for brevity:

1. SPA ostensibly being marginally more data efficient by shipping the markup/interactivity first, and fetching only the data as needed.

2. REST being too inefficient and cumbersome in practice, the rise of GraphQL to minimize payload over the wire.

3. Lack of SSR impacting SEO, the need to render on the server and send HTML over on initial payload (which, by the way, goes against (1) for first render).

4. Impedance mismatch creeping in, seemingly necessitating sending data along with the rendered HTML for hydration purposes (again, going against the idea of saving data over the wire).

I eagerly await, from the sidelines, news from even deeper depths. Perhaps shipping the entirety of the server code, which then operates against a local SQLite database? Which is then kept in sync with the authoritative copy "on the edge"? After all, if simple old SSR is beneath consideration, then we would do well to disdain the notion of a boring centralized database as well.
nine_k, over 1 year ago
I wish websites could just serve the data worth scraping as nice data files, alongside the content. But it's impossible due to most websites' business model.

Those who can do it for free, do; see e.g. Wikipedia.

More websites could offer a paid API though: no longer wasting your engineers' time on anti-scraping measures, and collecting payments from serious actors, could offset the possible losses from copycats.
moomoo11, over 1 year ago
I'm glad stuff like Next and other bloat exists so I can avoid it and stick to plain ol' server-rendered tech.

Go backend, sprinkle some Svelte, call it a day. Serve m/billions with server-rendered pages without breaking a sweat or over-engineering. $24 a month.
quickthrower2, over 1 year ago
Hasn't it always been that way? Had fun on manifold.markets a year ago beating some markets using that kind of data.
revskill, over 1 year ago
No need for it.

I made an RSC framework which offers both for you, the HTML or the data.

HTML: https://www.revskill.dev

Only the JSON: https://www.revskill.dev/?data=true

Want the HTML without the embedded JSON? https://revskill.dev/?isBot=true

As you can see, there is no need for GraphQL or for separating HTML and API.

Your HTML is the API!
ChrisArchitect, over 1 year ago
Tell HN: