科技回声

I'm building a website that lets people aggregate the numbers they care about in one spot.

Right now it is most popular with authors, who use it to get alerts when they get a new review on Amazon.

They have suggested that I make it possible to track their author rank on Amazon. I've been playing with that and have found that a regex is a nice fit for that particular job. (I've been using XPath expressions and CSS selectors up to this point.) So I'll probably add that soon as a specialized function on my website.

Because regexes are so useful (not for parsing, but for finding known patterns), I'm tempted to let users create automatic scrapers with their own regexes. But it seems like the kind of thing you want to research a bit first.
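To make "finding known patterns" concrete, here is a minimal Python sketch of pulling a rank out of a page with a regex. The markup in `SAMPLE_HTML`, the pattern, and the helper name `find_rank` are all made up for illustration; the real page structure is not given here and would need its own pattern.

```python
import re

# Hypothetical markup: an assumption for illustration, not Amazon's real HTML.
SAMPLE_HTML = '<div class="rank">#1,234 in Science Fiction Authors</div>'

# "#" followed by a number (with or without thousands separators),
# then " in " and the category name up to the next tag.
RANK_RE = re.compile(r"#(\d{1,3}(?:,\d{3})*|\d+)\s+in\s+([^<]+)")


def find_rank(html):
    """Return (rank, category) for the first match, or None if nothing matches."""
    m = RANK_RE.search(html)
    if m is None:
        return None
    rank = int(m.group(1).replace(",", ""))
    return rank, m.group(2).strip()


if __name__ == "__main__":
    print(find_rank(SAMPLE_HTML))
```

The pattern only looks for one known shape of text and ignores the rest of the document, which is the "finding, not parsing" use described above.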

1 comment

mtmail, more than 10 years ago

Does your target audience understand regular expressions? I like the approach import.io took: you go to one or more pages with their browser, select the fields you're interested in, and they build the extraction (XPath, CSS selectors) for you. An engineer can take that configuration and instruct the scraper to call a URL and get JSON back. Even with their special browser, help pages, and videos, I had trouble explaining it to a non-technical person.

"Normal" regular expressions are probably fine. Only with backtracking or lookahead might it be possible to create a regex complex enough that it takes too long to run. Wrapping it in a block with a fixed timeout should work.
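One possible way to do the "fixed timeout" wrapping, sketched in Python: the standard `re` module has no timeout option, so this version runs the search in a separate interpreter process and lets `subprocess.run` kill it when the limit is exceeded. The names `search_with_timeout` and `RegexTimeout` are hypothetical, and this is one approach under those assumptions, not the only one.

```python
import json
import subprocess
import sys

# Child script: reads {"pattern": ..., "text": ...} as JSON on stdin and
# writes the first match (or null) as JSON on stdout.
_CHILD = r"""
import json, re, sys
job = json.load(sys.stdin)
m = re.search(job["pattern"], job["text"])
json.dump(m.group(0) if m else None, sys.stdout)
"""


class RegexTimeout(Exception):
    """Raised when the search did not finish within the time limit."""


def search_with_timeout(pattern, text, seconds=1.0):
    """Return the first match of pattern in text, or None if there is none.

    The search runs in a separate Python process so that a runaway
    pattern can be killed after `seconds`.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", _CHILD],
            input=json.dumps({"pattern": pattern, "text": text}),
            capture_output=True,
            text=True,
            timeout=seconds,  # the child is killed when this expires
            check=True,       # an invalid pattern raises CalledProcessError
        )
    except subprocess.TimeoutExpired:
        raise RegexTimeout(f"pattern exceeded {seconds}s") from None
    return json.loads(result.stdout)


if __name__ == "__main__":
    print(search_with_timeout(r"\d+", "rank 42"))
    try:
        # Nested quantifiers backtrack exponentially on this input.
        search_with_timeout(r"(a+)+$", "a" * 40 + "b", seconds=0.5)
    except RegexTimeout as exc:
        print("timed out:", exc)
```

Starting a process per search is slow, so a real service would likely batch work or use a regex engine with guaranteed linear-time matching instead. The point of the sketch is only that a user-supplied pattern never gets to run unbounded.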


Ask HN: what could go wrong?

1 comment
