Show HN: Jam API, turn any site into a JSON api using CSS selectors

300 points by gavino, about 9 years ago

26 comments

A little word of warning/encouragement. I did something similar a long time ago (JSONDuit), which got posted to HN by someone else.

You will probably run into a healthy mix of "that's cool" / "I did that before you!" / "but how will it make money?". Ignore it and do your thing. If you figure out how to monetize it, great! Even if you don't or if you have no desire to, you will have learned and grown during the course of the project. That is invaluable.

Have fun and screw the haters...

adriancooney, about 9 years ago

This is a fantastic idea and I'm really surprised nothing like this has existed before; it seems like such a no-brainer. Great work.

ptwt, about 9 years ago

I put this similar project[0] together a while ago. Almost the same concept, but I skipped the JSON layer altogether as I just wanted a quick way of getting nuggets of content from webpages into my terminal.

For example:

    curl https://news.ycombinator.com/news | tq -tj ".title a"

[0] https://github.com/plainas/tq
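
For readers who want to see what a descendant selector such as `.title a` picks out, here is a minimal sketch using only Python's standard library. It is not how tq works internally; the class name `TitleLinkText`, the helper `title_link_texts` and the sample markup are assumptions made for illustration, and the parser assumes reasonably well-formed HTML in which non-void tags are closed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not touch the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TitleLinkText(HTMLParser):
    """Collect the text of every <a> that has an ancestor with class "title"."""

    def __init__(self):
        super().__init__()
        self.stack = []        # one bool per open element: does it carry class "title"?
        self.in_link = False   # True while inside a matching <a>
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Match only if some already-open ancestor carries the class.
        if tag == "a" and any(self.stack):
            self.in_link = True
            self.results.append("")
        self.stack.append("title" in classes)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag == "a":
            self.in_link = False
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.in_link:
            self.results[-1] += data


def title_link_texts(html):
    """Return the text of each link matched by the `.title a` pattern."""
    parser = TitleLinkText()
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical sample markup, loosely shaped like a table-based listing page.
SAMPLE = ('<table><tr><td class="title"><a href="https://example.com/">'
          'First <b>story</b></a><br></td>'
          '<td><a href="/x">skip</a></td></tr></table>')

print(title_link_texts(SAMPLE))  # prints ['First story']
```

The second link is skipped because none of its ancestors carries the `title` class, which is the behaviour a descendant selector is expected to have.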

jstanley, about 9 years ago

With curl:

    $ curl -d url=https://news.ycombinator.com/ -d json_data='{"title":"title"}' http://www.jamapi.xyz/

=>

    { "title": "Hacker News" }

Also, the Ruby example appears to post to the wrong URL?
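
The same request can be sketched in Python with the standard library. This only builds the form-encoded POST and does not send it; the endpoint and the `url` and `json_data` field names come from the curl command above, while the function name `build_jam_request` is an assumption made for illustration.

```python
import json
from urllib.parse import parse_qs, urlencode
from urllib.request import Request

# Endpoint taken from the curl command in the comment above.
JAM_ENDPOINT = "http://www.jamapi.xyz/"


def build_jam_request(page_url, selectors):
    """Build (but do not send) a form-encoded POST like the one curl -d makes.

    `selectors` maps output keys to CSS selectors, e.g. {"title": "title"}.
    """
    body = urlencode({
        "url": page_url,
        "json_data": json.dumps(selectors),
    }).encode("ascii")
    return Request(JAM_ENDPOINT, data=body, method="POST")


request = build_jam_request("https://news.ycombinator.com/", {"title": "title"})
print(request.get_method(), request.full_url)

# Decode the body again to show the fields a server would receive.
print(parse_qs(request.data.decode("ascii")))
```

Passing the request to `urllib.request.urlopen` would actually send it; that step is left out here so the sketch runs without network access.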

chriswarbo, about 9 years ago

Very nice idea. Although scraping should always be a last resort, I could imagine using this for semi-serious purposes, i.e. when I care enough about the output, will be doing many requests, don't mind relaying data via a third party, etc.

I currently do quite a bit of scraping for my own use (generating RSS feeds for sites, making simple command-line interfaces to automate common tasks, etc.). I've found xidel to be pretty good for this: it starts off pretty simple (e.g. with CSS selectors or XPath), but gets pretty gnarly for semi-complicated things. For example, it allows templating the output, using a language I struggle to grasp. This service seems to address that middle ground, e.g. restricting its output to JSON, and hence making the specification of the output much simpler (a nice JSON structure, rather than messing around with splicing text together).

NicoJuicy, about 9 years ago

I'm actually wondering if it would be possible to add forms authentication to this? E.g. a POST with some sort of CSS selectors and then a "cookie memory".

fryiee, about 9 years ago

Great! I've been trying to get my head around Scrapy, and I have little Python experience. This seems to fit in a lot better with my skillset for the project I'm working on.

denishaskin, about 9 years ago

Application Error

An error occurred in the application and your page could not be served. Please try again in a few moments.

If you are the application owner, check your logs for details.

OJFord, about 9 years ago

Yes, yes, yes!

I'm using Apifier at the moment, which I really like, but my biggest gripe is the awkwardness of source (and VCS) integration. The best I've come up with is to export the JSON config (which contains the scraper source code as a value - yuck) and try to remember to keep re-exporting and checking it in.

Having also had to hack around the inability to parameterise the scrape URL (e.g. 'profile/$username') - which they've since added support for - I started to wonder if I mightn't as well just use BeautifulSoup (Python HTML parser lib) and check it in properly.

This is probably my ideal. I can keep it all in source control because it's just an HTTP request body, and I can parameterise it because, well, it's just an HTTP request body!

It's also open source because you're an amazing person; so if I had one little concern left about the availability of your site I can dismiss it right away since I could run my own on Heroku should jamapi.xyz prove unsustainable. It's possibly a better idea to do that anyway, but I often wonder if Heroku doesn't consider that a problem - multiple instances of the same app running on free dynos under different accounts...
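
The "it's just an HTTP request body" point can be sketched as a small, version-controllable Python module. Everything site-specific here is hypothetical: the `example.com` profile URL, the selectors and the function name `scrape_body` are assumptions for illustration, and the `url` / `json_data` field names are the ones used in the curl example elsewhere in this thread.

```python
import json
from string import Template
from urllib.parse import quote

# Hypothetical target site; $username is the parameter to fill in per request.
PROFILE_URL = Template("https://example.com/profile/$username")

# Hypothetical mapping of output keys to CSS selectors, kept in source control.
SELECTORS = {"name": "h1.name", "bio": "p.bio"}


def scrape_body(username):
    """Return the form fields for one parameterised scrape request."""
    # Percent-encode the username so it stays a single path segment.
    safe_name = quote(username, safe="")
    return {
        "url": PROFILE_URL.substitute(username=safe_name),
        "json_data": json.dumps(SELECTORS),
    }


print(scrape_body("alice")["url"])
```

Because the selectors and URL template are plain data in a file, changes to the scraper show up as ordinary diffs rather than as a re-exported config blob.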

staticelf, about 9 years ago

I just get "invalid json" when I try to use the form on the page.

soheil, about 9 years ago

I think that with the advent of tools like this, developers will more and more be thinking of ways to make it hard for someone to scrape their website into data structures. I wonder if we are going to see the same thing that happened to minified JS happening to HTML more and more. I know there are sites that dynamically change CSS class names and IDs. But I think soon we will also see div hierarchies dynamically change form without looking any different to the end user.

WA, about 9 years ago

HTTPS results in 500 Internal Server Error.

Edit: Well no, it's only some sites, e.g. https://medium.com

MetaMetaApplyHN, about 9 years ago

Does anyone have any information on anyone that's used HTTP as an API to share/create metadata for any transactions, content, etc. publicly online? I would be very curious to know about it!

I welcome feedback on my "Apply HN" on doing exactly this: https://news.ycombinator.com/item?id=11583348

loisaidasam, about 9 years ago

Might be helpful to have the example execute inline so you can see what's going on/experiment without having to leave the page.

splatcollision, about 9 years ago

Nice work, thanks for adding the GitHub link. I can think of lots of immediate uses for this. Consider publishing on npm?

bartkappenburg, about 9 years ago

OT perhaps: I'm still looking for a solution that has a graphical UI that allows users to point and click an element on their page and returns the corresponding CSS selector. SelectorGadget does this as a Chrome extension but I'm looking for something that works without an extension.

daw___, about 9 years ago

Wonderful idea.

What about DOM nodes generated by JavaScript? Will Jam render the page before scraping?

karlcoelho1, about 9 years ago

If anyone remembers, there was a YC company that did exactly this. It was called Kimono Labs. I think it failed and just got acquired a year ago. "Jam API" will probably do way better because, well, open source.

paulmd, about 9 years ago

I've been thinking about writing some website-to-JSON scrapers myself and this basically solves that problem (since I would have been going after CSS selectors or xpath anyway myself). Nice job.

dimino, about 9 years ago

How will someone like CloudFlare stop a tool like this from scraping their customers' sites? Just blocking the tool's IP?

smadge, about 9 years ago

I wish site publishers annotated their markup with RDFa tags so every web page was already an "api"

nsgi, about 9 years ago

If it's going to be used for serious purposes it really needs HTTPS support, as most APIs do these days.

thomasahle, about 9 years ago

What do you think would be a good syntax for enabling following links?

Say I wanted the Hacker News links + first comment?

uberneo, about 9 years ago

http://blog.webkid.io/nodejs-scraping-libraries/ -- Good scraping options in NodeJS. My personal favourite is https://github.com/rc0x03/node-osmosis

amelius, about 9 years ago

Isn't this exactly what XML (or for that matter XHTML) was supposed to do?

joelbondurant, about 9 years ago

This!... is why we can't have nice things.