How does it know I want CSV? – An HTTP trick

220 points by calpaterson, over 2 years ago

22 comments

> at least not until the IANA get around to officially assigning them a media type.

This is the wrong characterisation. IANA does not take such initiative; their role is administrative rather than regulatory or active. It’s up to an interested party to register media types.

For Parquet, that’s easy: the developers can fill out https://www.iana.org/form/media-types in probably less than ten minutes, probably choosing the media type application/vnd.apache.parquet. It’ll be processed quickly.

For JSON Lines/NDJSON, it’s messier, calling for standards tree registration, which generally means taking a proper specification through some relevant IETF working group. (There are a few media types in customary use presently, all bad: application/x-ndjson, application/x-jsonlines, application/jsonlines; all are in the standards tree despite nonregistration, and two include the long-obsolete x- prefix.) Such an adventurer will doubtless encounter at least some resistance due to the existing JSON Text Sequences (application/json-seq, defined in RFC 7464, https://www.rfc-editor.org/rfc/rfc7464), which is functionally equivalent, mildly harder to work with, and technically superior, due to being unambiguously not-just-JSON, using a ␞ (U+001E RECORD SEPARATOR) prefix on every record. But given the definite popularity of JSON Lines/NDJSON, an Internet Draft will easily be enough for provisional registration.
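
To make the difference between the two formats concrete, here is a minimal sketch of both serialisations using only the standard library. The function names are made up for illustration, and the parser is deliberately naive: it does not handle the truncated-record recovery that RFC 7464 describes.

```python
import json

RS = "\x1e"  # U+001E RECORD SEPARATOR, the prefix RFC 7464 puts before every record


def dump_json_lines(records):
    """Serialise records as JSON Lines/NDJSON: one JSON value per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


def dump_json_seq(records):
    """Serialise records as an RFC 7464 JSON text sequence: RS + JSON + LF."""
    return "".join(RS + json.dumps(record) + "\n" for record in records)


def load_json_seq(text):
    """Parse a JSON text sequence by splitting on RS and skipping empty chunks."""
    return [json.loads(chunk) for chunk in text.split(RS) if chunk.strip()]


records = [{"id": 1}, {"id": 2}]
ndjson = dump_json_lines(records)
seq = dump_json_seq(records)
```

A parser that sees the leading RS byte knows immediately that it is not looking at a single plain JSON document, which is the "unambiguously not-just-JSON" property mentioned above. A JSON Lines payload containing one record, by contrast, is also a valid JSON document.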

larsnystrom, over 2 years ago

Good article, but this is not a “trick”; it’s a core part of the HTTP protocol. Worthy of an article nonetheless, judging by how misunderstood the topic is among the commenters here.

joosters, over 2 years ago

There are two downsides to this approach:

1) The discovery of the different response formats. How do I know that I can get csv files from that URL, other than by hoping that the website documents this somewhere? There's nothing in the underlying HTTP response from https://csvbase.com/meripaterson/stock-exchanges that tells me I can get an HTML or CSV version. Is there a JSON version available? What other variants exist? How do I know that this URL will deliver different responses?

2) Will the website always default to csv files or will my app break when they decide that XML is superior? (Well, obviously not, especially for a site called csvbase!)

But if your program expects CSV data, it is probably best to always request that, and a URL that ends in .csv gives you far more certainty that the data is going to be in that format.
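
One way to "always request that" without changing the URL is to send an explicit Accept header and then verify the Content-Type that comes back. A rough sketch with the standard library follows; the helper names are made up for illustration, and whether a given server honours the header is an assumption you would need to check against its documentation.

```python
import urllib.request

URL = "https://csvbase.com/meripaterson/stock-exchanges"  # URL from the comment above


def build_csv_request(url):
    """Build a GET request that asks for CSV explicitly instead of trusting a default."""
    return urllib.request.Request(url, headers={"Accept": "text/csv"})


def is_csv_response(content_type):
    """Check a Content-Type header value, ignoring parameters such as charset."""
    return content_type.split(";")[0].strip().lower() == "text/csv"


req = build_csv_request(URL)

# To actually send it (needs network access):
# with urllib.request.urlopen(req) as resp:
#     if not is_csv_response(resp.headers.get("Content-Type", "")):
#         raise ValueError("server did not return CSV")
```

Checking the response type covers the second worry above: if the server ever changes its default or ignores the header, the client fails loudly instead of parsing HTML as CSV.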

ilyt, over 2 years ago

READ THE SPEC GUYS, you might find other "hidden" "tricks" there lmao

hcarvalhoalves, over 2 years ago

This is a supported use of "Accept" headers, but I kind of miss the pre-SEO web where it was okay for URLs to just carry a file extension - having "example.com/dataset/foo.html" and "example.com/dataset/foo.csv" is pretty simple and less ambiguous too.
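
The two styles are not mutually exclusive. Below is a rough sketch of a server-side chooser that lets a file extension win and falls back to the Accept header otherwise. Everything in it (the tables, the function names, the fallback policy) is hypothetical, and the negotiation is simplified: it honours q-values and `*/*` but ignores partial ranges such as `text/*`.

```python
# Hypothetical tables for illustration; not taken from any real site.
EXTENSIONS = {".html": "text/html", ".csv": "text/csv"}
SUPPORTED = ["text/html", "text/csv"]  # the first entry doubles as the default


def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        if not fields[0]:
            continue
        q = 1.0  # a missing q parameter means a weight of 1
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable weight as "not acceptable"
        prefs.append((fields[0].lower(), q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def choose_media_type(path, accept_header=""):
    """Prefer an explicit file extension, otherwise negotiate on Accept."""
    for extension, media_type in EXTENSIONS.items():
        if path.endswith(extension):
            return media_type
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type == "*/*":
            return SUPPORTED[0]
        if media_type in SUPPORTED:
            return media_type
    return SUPPORTED[0]  # a stricter server might answer 406 Not Acceptable
```

With this shape, `choose_media_type("/dataset/foo.csv", "text/html")` picks CSV because the extension is explicit, while `choose_media_type("/dataset/foo", "text/csv")` picks CSV through negotiation.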

tboerstad, over 2 years ago

In case the author sees this: Thank you for enabling CORS so that it's possible to plot examples from other sites. It would be awesome if the Content-Range header was allowed as well. Here is an example of plotting the first dataset that popped up for me: https://csvplot.com/remote_file.html?url=https://csvbase.com/ubieratlupa/monuments

apwheele, over 2 years ago

For a random Python/pandas trick: I have come across web APIs that cannot be read directly into pandas using the URL (I imagine folks on here can comment better on the difference in web serving tech), but you can read the response into an IO object and pass that to pandas. There is a blog post at https://andrewpwheeler.com/2022/11/02/using-io-objects-in-python-to-read-data/ but a simple example fits in a comment:

<pre><code>import pandas as pd
from io import StringIO
import requests

# Download URL, split across lines for readability
url = ('https://data.townofcary.org/explore/dataset/cpd-incidents/download/'
       '?format=csv&timezone=America/New_York&lang=en&use_labels_for_header=true'
       '&csv_separator=%2C')
res = requests.get(url)  # fetch the CSV body with requests instead of pandas
df = pd.read_csv(StringIO(res.text))  # wrap the text in a file-like object for pandas
</code></pre>

huntedsnark, over 2 years ago

I'm kind of shocked by how poorly understood basic HTTP stuff like this is for HN audience based on the comments and article itself. My filter bubble must be tuned to "web."

hk1337, over 2 years ago

Similar to how ifconfig.me returns just your IP address in curl but the full page in a browser.

pelasaco, over 2 years ago

Was Rails one of the first frameworks to make this work automatically? https://github.com/rails/rails/blob/bbf0d35bf6148752911c1da4b7449450faea8755/actionpack/lib/action_controller/metal/mime_responds.rb#L25

contravariant, over 2 years ago

Curiously enough, some links in the article are missing the TLD. It does work if you go to https://csvbase.com/meripaterson/stock-exchanges instead.

alganet, over 2 years ago

HTTP is great. Another common "How does it know" is resuming downloads: that's done by the Range header. Curl supports it by using `--continue-at -` (the dash means "figure out where it stopped"; you can also give an explicit byte offset).
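
As a rough illustration of what a client does when resuming, the sketch below builds a Range request from the size of a partial file on disk. The function name, file name, and example.com URL are placeholders, and the demonstration stops short of sending anything over the network.

```python
import os
import tempfile
import urllib.request


def build_resume_request(url, partial_path):
    """Build a request for only the bytes missing from a partial download."""
    offset = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    req = urllib.request.Request(url)
    if offset > 0:
        # "bytes=N-" means: from byte N through the end of the resource.
        req.add_header("Range", f"bytes={offset}-")
    return req, offset


# Demonstration without network access: pretend 1024 bytes were already saved.
with tempfile.TemporaryDirectory() as tmp:
    partial = os.path.join(tmp, "dataset.csv.part")  # hypothetical file name
    with open(partial, "wb") as f:
        f.write(b"\0" * 1024)
    req, offset = build_resume_request("https://example.com/dataset.csv", partial)
```

A server that supports ranges answers with 206 Partial Content and a Content-Range header, and the client appends the body to the partial file. If the answer is a plain 200, the full body was sent and the client should overwrite the file rather than append.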

ElfinTrousers, over 2 years ago

If you came here to point out how content negotiation isn't a "trick" but rather a simple basic part of the core protocol: think first on the fact that there was once a time when you didn't know that.

tleb_, over 2 years ago

I've long thought that this concept could/should be applied to user-targeted content. Blogs could be capable of delivering content as HTML but possibly Markdown, plaintext, Gemini, PDF, etc. as well. SQLite tables, CSV, JSON for sharing data. Downloading of a directory with archives. What is missing for adoption is proper readers for alternative formats in the big web browsers though; I wonder if that would be accepted upstream.

rjh29, over 2 years ago

Speaking of HTTP tricks, I'm more impressed by how you can clone GitHub URLs (i.e. they serve git and regular HTTP on the same port).

daniel-s, over 2 years ago

The author uses pandas as an example, but pandas also has a great built-in API to download and read HTML-formatted tables in one line: https://pandas.pydata.org/docs/reference/api/pandas.read_html.html

Asmod4n, over 2 years ago

I’d pay money for an RSS reader which uses this. Last time I checked, none of the popular ones used content negotiation.

cybrjoe, over 2 years ago

You mention jsonlines in the escape hatch section, is there an escape hatch for jsonlines? I tried .jsonl but I get a 500 error.

jcuenod, over 2 years ago

I loved this about building on Rails.

bayesian_horse, over 2 years ago

You always want CSV, don't you?

superlupo, over 2 years ago

You've basically reinvented REST

Gys, over 2 years ago

Instead of relying on settings in clients I would have required something more explicit in the URL. For example https://csvbase.com/meripaterson/stock-exchanges for the HTML version and https://csvbase.com/meripaterson/stock-exchanges/csv for the CSV version.

Generally speaking, implicit magic is cool but can also be frustrating if it's not working as expected.

Very OT: Apple has this attitude of 'it just works'. Really great. Unless it is not working while there are no settings and no info whatsoever on what the requirements are to make it work.
