
Show HN: Flyscrape – A standalone and scriptable web scraper in Go

208 points by philippta, over 1 year ago

12 comments

oefrha, over 1 year ago
I had a look at https://github.com/philippta/flyscrape/blob/master/scrape.go. It's just using the built-in HTTP client to fire off requests, with an identifying user agent, which means it's useless for scraping most real-world sites you may want to scrape, unfortunately. You'll get served a JavaScript challenge, or not even that (many sites will refuse to serve anything if they see a random user agent like flyscrape/1.0).
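A minimal sketch of the kind of request flyscrape sends, adjusted to present a browser-like User-Agent via Go's standard net/http client; the URL and header string are only placeholders, and a realistic User-Agent alone will not get past an actual JavaScript challenge:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	req, err := http.NewRequest("GET", "https://example.com", nil)
	if err != nil {
		log.Fatal(err)
	}
	// Present a browser-like User-Agent instead of a library default
	// such as "flyscrape/1.0".
	req.Header.Set("User-Agent",
		"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0 Safari/537.36")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(resp.Status, len(body), "bytes")
}
```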
1vuio0pswjnm7, over 1 year ago
"default = 100 [requests per second]"

How many new TCP connections per second is that?

Is this a "scraper" or a "crawler"? It appears to accept a "starting URL" and to follow links.

Opening many TCP connections is arguably still a reason why website operators try to prevent crawling (except from Googlebot IPs). As for scraping, it can be done with a single TCP connection. Perhaps "developers" instead opt to use many TCP connections and then complain when they get blocked.
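For what it's worth, plain Go can do the single-connection style of scraping described here: capping http.Transport at one connection per host makes sequential keep-alive requests reuse one TCP connection. A sketch with placeholder URLs:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	// Cap the client at one connection per host; with keep-alive,
	// sequential requests then reuse a single TCP connection.
	client := &http.Client{
		Transport: &http.Transport{MaxConnsPerHost: 1},
	}

	urls := []string{
		"https://example.com/page/1",
		"https://example.com/page/2",
	}
	for _, u := range urls {
		resp, err := client.Get(u)
		if err != nil {
			log.Fatal(err)
		}
		// Drain and close the body so the connection can be reused.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		time.Sleep(500 * time.Millisecond) // polite pacing
	}
}
```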
moehm, over 1 year ago
Interesting. Can you compare it to colly? [0]

Last time I looked, it was the most popular choice for scraping in Go, and I have some projects using it.

Is it similar? Does it have more/fewer features, or is it more suited to a different use case? (Which one?)

[0] https://github.com/gocolly/colly
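For readers who haven't used colly: its model is callback-driven crawling, registered per CSS selector. A minimal example (the selector and URL are illustrative):

```go
package main

import (
	"fmt"
	"log"

	"github.com/gocolly/colly"
)

func main() {
	// Restrict the crawl to a single domain.
	c := colly.NewCollector(
		colly.AllowedDomains("news.ycombinator.com"),
	)

	// Run a callback for every element matching the selector.
	c.OnHTML(".titleline > a", func(e *colly.HTMLElement) {
		fmt.Println(e.Text, "->", e.Attr("href"))
	})

	if err := c.Visit("https://news.ycombinator.com/"); err != nil {
		log.Fatal(err)
	}
}
```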
bryanrasmussen, over 1 year ago
Looks like it doesn't have the option of presenting itself as a particular browser, etc. Which I guess makes it fine for a lot of pages, but a lot of scraping tasks would also be affected. Am I right, or did I miss something?
fyzix, over 1 year ago
What happens if 'find()' returns a list and you call '.text()'? Intuition tells me it should fail, but maybe it implicitly gets the text from the first item if it exists.

Either way, I think you should create a separate method 'find_all()' that returns a list, to make the API easier to reason about.
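Flyscrape's own semantics aside, goquery, which flyscrape builds on, resolves this question the jQuery way: Text() on a multi-element selection concatenates the text of every match, and you call First() to narrow to one. A small demonstration:

```go
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	page := `<ul><li>one</li><li>two</li><li>three</li></ul>`
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(page))
	if err != nil {
		log.Fatal(err)
	}

	sel := doc.Find("li")           // matches three <li> elements
	fmt.Println(sel.Text())         // "onetwothree": text of all matches, concatenated
	fmt.Println(sel.First().Text()) // "one": text of the first match only
	fmt.Println(sel.Length())       // 3
}
```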
krick, over 1 year ago
This looks like something I could use. Maybe not revolutionary, but I do this from time to time, and even if only for organizational purposes it seems to make sense to store that stuff as a bunch of configuration files for some external tool, rather than a bunch of Python scripts that I implement somewhat differently every time.

Right now I'm just wrapping my head around how this works, and haven't tried it hands-on yet, but I struggle to evaluate from the existing documentation how useful this actually is. All examples in the repository right now are ultimately one-page scrapers, which, honestly, would be quite useless to me. Pretty much every scraper I write has at least 2-3 logical layers. Like, consider your HN example, but you want to include the top 10 comments for each post. Is that even possible? Well, I guess for HN you could get by using allowedURLs and treating the default function as a parser for the comment page, but this isn't generic enough. Consider some internet shop. That would be: (1) the product category tree, sometimes much easier to hard-code than to scrape every time; hard-coding is often generative (e.g. example.com/X/A-B-C, where X is a string from a list and A, B and C are padded numbers, each with a different range); (2) you go into each category and retrieve either a sub-category list (possibly JS-rendered, multiple pages) or a product list (same applies); (3) you open each product URL and do the actual parsing (name, price, specification, etc.). Each JSON object from (3) often has to include some minimal parsed data from level (2), like the category name. (A sketch of such a layered pipeline follows below.)

More advanced, but also way too common to imagine a generic web scraper without it: in addition to some JSON metadata, you download pictures, or PDF files, etc. (Sometimes you don't even need the metadata.) Maybe just text files, but the result is several GBs and isn't suitable to be handled as a single JSON object, but rather as a file/directory tree.

Is any of this possible with this tool?

Also, regardless of whether it's useful for my cases, some minor comments:

1. The links in docs/readme.md#configuration don't work (but the .md files they point to actually exist).

2. I would suggest making "url" in the configuration either a list, or string|list. I suppose that pretty much doesn't change the logic, but it would make a lot of basic use cases much easier to implement.
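As a point of comparison for the layered workflow described above, a hand-rolled level (2) to (3) pipeline is short in plain Go with goquery; the shop URL and selectors below are hypothetical, and the sketch assumes absolute product hrefs:

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

// fetch downloads and parses one page into a goquery document.
func fetch(url string) (*goquery.Document, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return goquery.NewDocumentFromReader(resp.Body)
}

func main() {
	// Layer 2: collect product links from a (hypothetical) category page.
	cat, err := fetch("https://shop.example.com/category/1")
	if err != nil {
		log.Fatal(err)
	}
	catName := cat.Find("h1").First().Text()

	cat.Find("a.product").Each(func(_ int, a *goquery.Selection) {
		href, ok := a.Attr("href")
		if !ok {
			return
		}
		// Layer 3: parse each product page, carrying the
		// category name from layer 2 along with it.
		p, err := fetch(href)
		if err != nil {
			log.Println(err)
			return
		}
		fmt.Printf("%s | %s | %s\n",
			catName,
			p.Find(".name").First().Text(),
			p.Find(".price").First().Text())
	})
}
```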
slig, over 1 year ago
Thanks for sharing! Just a small nit: the links at the bottom of this page are broken [1].

[1]: https://github.com/philippta/flyscrape/blob/master/docs/readme.md#configuration
lucgagan, over 1 year ago
This looks great. I wish I had this a few months ago! Giving it a try.
unixhero, over 1 year ago
I will test this; great stuff.
sunshadow, over 1 year ago
These days I'm not even using Go for scraping that much, as webpage changes drive me crazy and JS code evaluation is a lifesaver, so I moved to TypeScript+Playwright. (The Crawlee framework is cool, though not strictly necessary.)

It's been 8+ years since I started scraping. I even wrote a popular Go web scraping framework previously: https://github.com/geziyor/geziyor.

My favorite stack as of 2023: TypeScript+Playwright+Crawlee (optional). If you're serious about scraping, you should learn JavaScript, and so Playwright should serve you well.

Note: There are niche cases where a lower-level language would be required (C++, Go, etc.), but probably only <5%.
xyzzy_plugh, over 1 year ago
I like web scraping in Go. The support for parsing HTML in x/net/html is pretty good, and libraries like github.com/PuerkitoBio/goquery go a long way toward matching the ergonomics of other tools. This project uses both, but then also goes on to use github.com/dop251/goja, which is a JavaScript VM, *and* its accompanying Node.js compatibility layer, *and* even esbuild, in order to *interpret scraping instruction scripts*.

I mean, at this point I am not sure Go is the right tool for the job (I am *actually* pretty confident that it is *not*).

A pretty neat stack of engineering, sure! This is cool, nicely done. But I can't help but feel disturbed.
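For a taste of the lowest layer mentioned here, parsing and walking a document with x/net/html alone takes only a few lines:

```go
package main

import (
	"fmt"
	"log"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	doc, err := html.Parse(strings.NewReader(`<p>Hello, <a href="/x">world</a></p>`))
	if err != nil {
		log.Fatal(err)
	}

	// Recursively walk the node tree, printing element names.
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode {
			fmt.Println(n.Data)
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
}
```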
snake117, over 1 year ago
Looks interesting, and thank you for sharing this! One common issue with scraping web pages is dealing with data that is dynamically loaded. Is there a solution for this? For example, when using Scrapy, you can have Splash running in Docker via scrapy-splash (https://github.com/scrapy-plugins/scrapy-splash).
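Flyscrape itself fires plain HTTP requests (see the scrape.go discussion above), so dynamically loaded content would need a separate rendering step. The usual Go-side analogue of Splash is driving headless Chrome, for example with github.com/chromedp/chromedp; a sketch with a placeholder URL:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/chromedp/chromedp"
)

func main() {
	// Start a headless Chrome session.
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	var rendered string
	// Capture the DOM only after the page's JavaScript has run.
	err := chromedp.Run(ctx,
		chromedp.Navigate("https://example.com/js-heavy-page"),
		chromedp.OuterHTML("html", &rendered),
	)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(rendered), "bytes of rendered HTML")
}
```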