TechEcho
Tips for reliable web automation and scraping selectors

122 points by tschiller over 4 years ago

7 comments

imgabe over 4 years ago
Another tip I've found extremely helpful for webscraping: check the <head> for <meta> tags or a <script type="application/ld+json"> tag that might already have the information you want collected neatly in one place. You may be able to save yourself a lot of time and grief.
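
A minimal sketch of that tip, using only the Python standard library. The class name, function name, and sample page below are hypothetical, made up for illustration; real pages vary in which <meta> keys and JSON-LD types they expose.

```python
import json
from html.parser import HTMLParser


class StructuredDataParser(HTMLParser):
    """Collects <meta> key/content pairs and parsed JSON-LD blocks."""

    def __init__(self):
        super().__init__()
        self.meta = {}        # e.g. {"og:title": "..."}
        self.json_ld = []     # one parsed object per JSON-LD script tag
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Open Graph tags use "property"; classic meta tags use "name".
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content") is not None:
                self.meta[key] = attrs["content"]
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        # Script text can arrive in several chunks, so buffer it.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip a malformed block instead of failing the page


def extract_structured_data(html):
    """Return (meta_dict, json_ld_list) for an HTML document string."""
    parser = StructuredDataParser()
    parser.feed(html)
    parser.close()
    return parser.meta, parser.json_ld


# Hypothetical sample page, for illustration only.
SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="Example Widget">
<script type="application/ld+json">
{"@type": "Product", "name": "Example Widget", "offers": {"price": "19.99"}}
</script>
</head><body></body></html>
"""

meta, json_ld = extract_structured_data(SAMPLE_HTML)
print(meta["og:title"])
print(json_ld[0]["offers"]["price"])
```

If the data is there, a scraper built this way does not depend on the page's visible layout at all, which is where the saved time and grief come from.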
SomewhatLikely over 4 years ago
Here's a browser extension for working with selectors that was shared on the front page sometime last year: https://github.com/hermit-crab/ScrapeMate

Edit: I think it was from this discussion: https://news.ycombinator.com/item?id=24057228
tschiller over 4 years ago
Author here, happy to answer any questions.

For our product (PixieBrix) we actually generally grab the data directly from the front-end framework (e.g., React props). It's a bit less stable since it's effectively an internal API, but it means you can grab a lot of data with a single selector and can generally avoid parsing values out of text.
indysigners over 4 years ago
Both the :has and the :contains selector (as in ul:has(> li:contains("Built")) ) were new to me. So thanks to the author for sharing that little trick!
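
For anyone unsure what that selector matches, here is a rough sketch that spells out the same logic with Python's standard library: find every ul that has a direct li child whose text contains a given string. Note that :contains is a jQuery extension rather than standard CSS. The function name and sample markup are hypothetical, and ElementTree only accepts well-formed markup, so this illustrates the semantics rather than replacing a real selector engine.

```python
import xml.etree.ElementTree as ET


def lists_with_item_containing(root, text):
    """Emulate ul:has(> li:contains(text)) on a parsed element tree."""
    matches = []
    for ul in root.iter("ul"):
        # findall("li") only returns direct children, like the "> li" part.
        for li in ul.findall("li"):
            # itertext() includes text of nested elements, like :contains.
            if text in "".join(li.itertext()):
                matches.append(ul)
                break
    return matches


# Hypothetical sample markup, for illustration only.
DOC = """
<div>
  <ul id="specs"><li>Built in 1990</li><li>3 floors</li></ul>
  <ul id="other"><li>Garage</li></ul>
</div>
"""

root = ET.fromstring(DOC)
print([ul.get("id") for ul in lists_with_item_containing(root, "Built")])
```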
Chernobog over 4 years ago
For e2e testing I have seen various patterns, and the article mentions data-test-id for instance. In my own tests, I have opted for something similar, which has given us a bit more flexibility.

Singular elements: data-test-save-button, data-test-name-input

Elements that are part of a list: data-test-user={user.id}, data-test-listing={listing.id}

This allows us to name our elements with data-test attributes, but also provide values to them where applicable.

I have also created a testSelector function that takes an id and a value, and spits out either [data-test-${id}="${value}"] or [data-test-${id}].

We have also experimented with letting shared components populate their own data-test-* attribute automatically based on other props. For example, our modal component sets data-test-modal={title}, so the result is data-test-modal="Delete user" rather than data-test-delete-user-modal. In that case, the dev does not need to provide the data-test-* attribute manually, since the component takes care of it.
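
A rough Python sketch of a helper along those lines. The commenter's own function is not shown, so the name data_test_selector, its signature, and the quote escaping are assumptions made for illustration.

```python
def data_test_selector(name, value=None):
    """Build a CSS attribute selector for a data-test-* attribute.

    Hypothetical helper: with no value it matches on attribute presence,
    otherwise it matches on the exact attribute value.
    """
    if value is None:
        return f"[data-test-{name}]"
    # Escape backslashes and double quotes so the value stays a valid
    # CSS string (an addition not described in the comment above).
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'[data-test-{name}="{escaped}"]'


print(data_test_selector("save-button"))
print(data_test_selector("user", 42))
print(data_test_selector("modal", "Delete user"))
```

Keeping selector construction in one function means the attribute naming scheme can change later without touching every test.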
1vuio0pswjnm7 over 4 years ago
Selectors are very brittle. I do not use them and IMO the scrapers I create are less likely to break and easier to fix if they do.
hombre_fatal over 4 years ago
Nice list, esp for anyone getting started. I remember web scraping was my entrypoint into web development. I take it for granted now, but 15+ years ago I loved the idea of being able to completely mine a website of all its content.