
Show HN: Flyscrape – A standalone and scriptable web scraper in Go

208 points by philippta, over 1 year ago

12 comments

oefrha, over 1 year ago
I had a look at https://github.com/philippta/flyscrape/blob/master/scrape.go. It's just using the built-in HTTP client to fire off requests, with an identifying user agent, which means it's useless for scraping most real-world sites you may want to scrape, unfortunately. You'll get served a JavaScript challenge, or not even that (many sites will refuse to serve anything if they see a random user agent like flyscrape/1.0).
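A minimal sketch of the kind of request flyscrape sends, adjusted to present a browser-like User-Agent via Go's standard net/http client; the URL and header string are only placeholders, and a realistic User-Agent alone will not get past an actual JavaScript challenge:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	req, err := http.NewRequest("GET", "https://example.com", nil)
	if err != nil {
		log.Fatal(err)
	}
	// Present a browser-like User-Agent instead of a library default
	// such as "flyscrape/1.0".
	req.Header.Set("User-Agent",
		"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0 Safari/537.36")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(resp.Status, len(body), "bytes")
}
```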
1vuio0pswjnm7, over 1 year ago
"default = 100 [requests per second]"

How many new TCP connections per second is that?

Is this a "scraper" or a "crawler"? It appears to accept a "starting URL" and to follow links.

Opening many TCP connections is arguably still a reason why website operators try to prevent crawling (except from Googlebot IPs). As for scraping, it can be done with a single TCP connection. Perhaps "developers" instead opt to use many TCP connections and then complain when they get blocked.
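For what it's worth, plain Go can do the single-connection style of scraping described here: capping http.Transport at one connection per host makes sequential keep-alive requests reuse one TCP connection. A sketch with placeholder URLs:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	// Cap the client at one connection per host; with keep-alive,
	// sequential requests then reuse a single TCP connection.
	client := &http.Client{
		Transport: &http.Transport{MaxConnsPerHost: 1},
	}

	urls := []string{
		"https://example.com/page/1",
		"https://example.com/page/2",
	}
	for _, u := range urls {
		resp, err := client.Get(u)
		if err != nil {
			log.Fatal(err)
		}
		// Drain and close the body so the connection can be reused.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
		time.Sleep(500 * time.Millisecond) // polite pacing
	}
}
```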
moehm, over 1 year ago
Interesting. Can you compare it to colly? [0]

Last time I looked, it was the most popular choice for scraping in Go, and I have some projects using it.

Is it similar? Does it have more/fewer features, or is it more suited to a different use case? (Which one?)

[0] https://github.com/gocolly/colly
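For readers who haven't used colly: its model is callback-driven crawling, registered per CSS selector. A minimal example (the selector and URL are illustrative):

```go
package main

import (
	"fmt"
	"log"

	"github.com/gocolly/colly"
)

func main() {
	// Restrict the crawl to a single domain.
	c := colly.NewCollector(
		colly.AllowedDomains("news.ycombinator.com"),
	)

	// Run a callback for every element matching the selector.
	c.OnHTML(".titleline > a", func(e *colly.HTMLElement) {
		fmt.Println(e.Text, "->", e.Attr("href"))
	})

	if err := c.Visit("https://news.ycombinator.com/"); err != nil {
		log.Fatal(err)
	}
}
```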
bryanrasmussen, over 1 year ago
Looks like it doesn't have the option of presenting itself as a particular browser, etc. Which I guess makes it fine for a lot of pages, but a lot of scraping tasks would also be affected. Am I right, or did I miss something?
fyzix, over 1 year ago
What happens if 'find()' returns a list and you call '.text()'? Intuition tells me it should fail, but maybe it implicitly gets the text from the first item if it exists.

Either way, I think you should create a separate method 'find_all()' that returns a list, to make the API easier to reason about.
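Flyscrape's own semantics aside, goquery, which flyscrape builds on, resolves this question the jQuery way: Text() on a multi-element selection concatenates the text of every match, and you call First() to narrow to one. A small demonstration:

```go
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	page := `<ul><li>one</li><li>two</li><li>three</li></ul>`
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(page))
	if err != nil {
		log.Fatal(err)
	}

	sel := doc.Find("li")           // matches three <li> elements
	fmt.Println(sel.Text())         // "onetwothree": text of all matches, concatenated
	fmt.Println(sel.First().Text()) // "one": text of the first match only
	fmt.Println(sel.Length())       // 3
}
```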
krick, over 1 year ago
This looks like something I could use. Maybe not revolutionary, but I do this from time to time, and even if only for organizational purposes it seems to make sense to store that stuff as a bunch of configuration files for some external tool, rather than a bunch of Python scripts that I implement somewhat differently every time.

Right now I'm just wrapping my head around how this works, and haven't tried it hands-on yet, but I struggle to evaluate from the existing documentation how useful this actually is. All examples in the repository right now are ultimately one-page scrapers, which, honestly, would be quite useless to me. Pretty much every scraper I write has at least 2-3 logical layers. Like, consider your HN example, but you want to include the top 10 comments for each post. Is that even possible? Well, I guess for HN you could get by using allowedURLs and treating the default function as a parser for the comment page, but this isn't generic enough. Consider some internet shop. That would be: (1) the product category tree, sometimes much easier to hard-code than to scrape every time; hard-coding is often generative (e.g. example.com/X/A-B-C, where X is a string from a list and A, B and C are padded numbers, each with a different range); (2) you go into each category and retrieve either a sub-category list (possibly JS-rendered, multiple pages) or a product list (same applies); (3) you open each product URL and do the actual parsing (name, price, specification, etc.). Each JSON object from (3) often has to include some minimal parsed data from level (2), like the category name. (A sketch of such a layered pipeline follows below.)

More advanced, but also way too common to imagine a generic web scraper without it: in addition to some JSON metadata, you download pictures, or PDF files, etc. (Sometimes you don't even need the metadata.) Maybe just text files, but the result is several GBs and isn't suitable to be handled as a single JSON object, but rather as a file/directory tree.

Is any of this possible with this tool?

Also, regardless of whether it's useful for my cases, some minor comments:

1. The links in docs/readme.md#configuration don't work (but the .md files they point to actually exist).

2. I would suggest making "url" in the configuration either a list, or string|list. I suppose that pretty much doesn't change the logic, but it would make a lot of basic use cases much easier to implement.
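As a point of comparison for the layered workflow described above, a hand-rolled level (2) to (3) pipeline is short in plain Go with goquery; the shop URL and selectors below are hypothetical, and the sketch assumes absolute product hrefs:

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

// fetch downloads and parses one page into a goquery document.
func fetch(url string) (*goquery.Document, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return goquery.NewDocumentFromReader(resp.Body)
}

func main() {
	// Layer 2: collect product links from a (hypothetical) category page.
	cat, err := fetch("https://shop.example.com/category/1")
	if err != nil {
		log.Fatal(err)
	}
	catName := cat.Find("h1").First().Text()

	cat.Find("a.product").Each(func(_ int, a *goquery.Selection) {
		href, ok := a.Attr("href")
		if !ok {
			return
		}
		// Layer 3: parse each product page, carrying the
		// category name from layer 2 along with it.
		p, err := fetch(href)
		if err != nil {
			log.Println(err)
			return
		}
		fmt.Printf("%s | %s | %s\n",
			catName,
			p.Find(".name").First().Text(),
			p.Find(".price").First().Text())
	})
}
```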
slig, over 1 year ago
Thanks for sharing! Just a small nit: the links at the bottom of this page are broken [1].

[1]: https://github.com/philippta/flyscrape/blob/master/docs/readme.md#configuration
lucgagan, over 1 year ago
This looks great. I wish I had this a few months ago! Giving it a try.
unixhero, over 1 year ago
I will test this; great stuff.
sunshadow, over 1 year ago
These days I'm not even using Go for scraping that much, as webpage changes drive me crazy and JS code evaluation is a lifesaver, so I moved to TypeScript+Playwright. (The Crawlee framework is cool, though not strictly necessary.)

It's been 8+ years since I started scraping. I even wrote a popular Go web scraping framework previously: https://github.com/geziyor/geziyor.

My favorite stack as of 2023: TypeScript+Playwright+Crawlee (optional). If you're serious about scraping, you should learn JavaScript, and so Playwright should serve you well.

Note: There are niche cases where a lower-level language would be required (C++, Go, etc.), but probably only <5%.
xyzzy_plugh, over 1 year ago
I like web scraping in Go. The support for parsing HTML in x/net/html is pretty good, and libraries like github.com/PuerkitoBio/goquery go a long way toward matching the ergonomics of other tools. This project uses both, but then also goes on to use github.com/dop251/goja, which is a JavaScript VM, *and* its accompanying Node.js compatibility layer, *and* even esbuild, in order to *interpret scraping instruction scripts*.

I mean, at this point I am not sure Go is the right tool for the job (I am *actually* pretty confident that it is *not*).

A pretty neat stack of engineering, sure! This is cool, nicely done. But I can't help but feel disturbed.
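For a taste of the lowest layer mentioned here, parsing and walking a document with x/net/html alone takes only a few lines:

```go
package main

import (
	"fmt"
	"log"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	doc, err := html.Parse(strings.NewReader(`<p>Hello, <a href="/x">world</a></p>`))
	if err != nil {
		log.Fatal(err)
	}

	// Recursively walk the node tree, printing element names.
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode {
			fmt.Println(n.Data)
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
}
```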
snake117, over 1 year ago
Looks interesting, and thank you for sharing this! One common issue with scraping web pages is dealing with data that is dynamically loaded. Is there a solution for this? For example, when using Scrapy, you can have Splash running in Docker via scrapy-splash (https://github.com/scrapy-plugins/scrapy-splash).
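Flyscrape itself fires plain HTTP requests (see the scrape.go discussion above), so dynamically loaded content would need a separate rendering step. The usual Go-side analogue of Splash is driving headless Chrome, for example with github.com/chromedp/chromedp; a sketch with a placeholder URL:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/chromedp/chromedp"
)

func main() {
	// Start a headless Chrome session.
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	var rendered string
	// Capture the DOM only after the page's JavaScript has run.
	err := chromedp.Run(ctx,
		chromedp.Navigate("https://example.com/js-heavy-page"),
		chromedp.OuterHTML("html", &rendered),
	)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(rendered), "bytes of rendered HTML")
}
```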