科技回声

7 comments

isbvhodnvemrwvn, over 1 year ago

It also shows how awful some sites are. I'm scraping a job listings portal written in Next.js (although I use the hydration endpoint). One of the lovely things they do is pass in props containing all possible skills candidates can have: they send 200KB per listing, while the job-specific (actually required) data averages less than 10KB.

Previously they were rendering it from a REST endpoint which only had the necessary data, but now I get a lovely dictionary of all possible values of various properties, provided on a silver platter.

Another portal I'm scraping uses GraphQL, and they don't have access control. I get all the data they have on impressions, I know how much money somebody paid for a listing, I know when the offer is going to be auto-bumped - just lovely. And no need to use Playwright either.
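For anyone wondering what "props on a silver platter" looks like, here is a minimal sketch. It assumes the site uses the Next.js pages router, which embeds the page props as JSON in a `<script id="__NEXT_DATA__">` tag (App Router sites ship their payload differently). The sample HTML and the field names `job`, `skillIds` and `allSkills` are made up for illustration and are not taken from any real portal.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__">, if present."""

    def __init__(self):
        super().__init__()
        self._in_next_data = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._in_next_data = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_next_data = False

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so accumulate them.
        if self._in_next_data:
            self.chunks.append(data)


def extract_page_props(html):
    """Return the pageProps dict embedded in the page, or None if absent."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    if not parser.chunks:
        return None
    payload = json.loads("".join(parser.chunks))
    return payload.get("props", {}).get("pageProps")


# Hypothetical page: one job listing plus a dictionary of every possible skill.
SAMPLE_HTML = """<html><body><div id="__next">Senior Go Developer</div>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {
  "job": {"title": "Senior Go Developer", "skillIds": [2, 3]},
  "allSkills": {"1": "Java", "2": "Go", "3": "SQL", "4": "Rust"}
}}}
</script></body></html>"""

props = extract_page_props(SAMPLE_HTML)
job = props["job"]
# Resolve the listing's skill ids against the full dictionary the page ships.
skills = [props["allSkills"][str(i)] for i in job["skillIds"]]
print(job["title"], skills)
```

No headless browser is involved: the structured data is already in the HTML response, so a plain HTTP fetch plus a JSON parse is enough.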

javascriptdante, over 1 year ago

I sometimes wonder if the JavaScript world has some sort of divine significance, as though they carry the weight of all the developer sins in the world for some particular transcendental purpose. After all, Dante's nine circles of hell are insufficient for programmers who are able to recurse infinitely. How much further can we go?

Consider the descent down, and the contradictions therein, considering JUST data concerns for brevity:

1. SPAs ostensibly being marginally more data efficient by shipping the markup/interactivity first, and fetching only the data as needed.
2. REST being too inefficient and cumbersome in practice, hence the rise of GraphQL to minimize payload over the wire.
3. Lack of SSR impacting SEO, hence the need to render on the server and send HTML in the initial payload (which, by the way, goes against (1) for first render).
4. Impedance mismatch creeping in, seemingly necessitating sending the data along with the rendered HTML for hydration purposes (again, going against the idea of saving data over the wire).

I eagerly await, from the sidelines, news from even deeper depths. Perhaps shipping the entirety of the server code, which then operates against a local SQLite database? Which is then kept in sync with the authoritative copy "on the edge"? After all, if simple old SSR is beneath consideration, then we would do well to disdain the notion of a boring centralized database as well.

nine_k, over 1 year ago

I wish websites could just serve the data worth scraping as nice data files, alongside the content. But it's impossible due to most websites' business model.

Those who can do it for free, do; see e.g. Wikipedia.

More websites could offer a paid API though; they would stop wasting their engineers' time on adding anti-scraping measures, and collecting payments from serious actors could offset the possible losses from copycats.

moomoo11, over 1 year ago

I’m glad stuff like Next and other bloat exists so I can avoid it and stick to plain ol' server-rendered tech.

Go backend, sprinkle some Svelte, call it a day. Serve m/billions with server-rendered pages without breaking a sweat or over-engineering. $24 a month.

quickthrower2, over 1 year ago

Hasn’t it always been that way? Had fun on manifold.markets a year ago beating some markets using that kind of data.

revskill, over 1 year ago

No need for it. I made an RSC framework which offers both for you, the HTML or the data.

HTML: https://www.revskill.dev

Only the JSON: https://www.revskill.dev/?data=true

Want the HTML without the embedded JSON? https://revskill.dev/?isBot=true

You can see, no need for GraphQL, or for HTML and API separation. Your HTML is the API!

ChrisArchitect, over 1 year ago

Tell HN:

Next.js v13 websites are a heaven for scraping
