Google Dataset Search

388 points by abraxaz about 4 years ago

19 comments

abraxaz about 4 years ago

Information on how to annotate datasets: <a href="https://developers.google.com/search/docs/data-types/dataset" rel="nofollow">https://developers.google.com/search/docs/data-types/dataset</a>

> We can understand structured data in Web pages about datasets, using either schema.org Dataset markup, or equivalent structures represented in W3C's Data Catalog Vocabulary (DCAT) format. We also are exploring experimental support for structured data based on W3C CSVW, and expect to evolve and adapt our approach as best practices for dataset description emerge. For more information about our approach to dataset discovery, see Making it easier to discover datasets.

For more info on those:

- W3C's Data Catalog Vocabulary: <a href="https://www.w3.org/TR/vocab-dcat-3/" rel="nofollow">https://www.w3.org/TR/vocab-dcat-3/</a>
- Schema.org dataset: <a href="https://schema.org/Dataset" rel="nofollow">https://schema.org/Dataset</a>
- CSVW Namespace Vocabulary Terms: <a href="https://www.w3.org/ns/csvw" rel="nofollow">https://www.w3.org/ns/csvw</a>
- Generating RDF from Tabular Data on the Web (examples on how to use CSVW): <a href="https://www.w3.org/TR/csv2rdf/" rel="nofollow">https://www.w3.org/TR/csv2rdf/</a>
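To make the schema.org option concrete, here is a minimal sketch of what such an annotation might look like when generated from Python. The dataset name, URLs, and helper function names are all hypothetical, and the property selection is only one plausible subset of schema.org's Dataset vocabulary, so check the linked Google documentation for what is actually required.

```python
import json


def dataset_jsonld(name, description, url, license_url, download_url,
                   encoding_format="text/csv"):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        # Each distribution entry describes one downloadable form of the data.
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": encoding_format,
                "contentUrl": download_url,
            }
        ],
    }


def as_script_tag(jsonld):
    """Wrap the JSON-LD in a script element for the dataset's landing page."""
    body = json.dumps(jsonld, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical dataset; every value below is a placeholder.
example = dataset_jsonld(
    name="Example city tree inventory",
    description="Location and species of street trees, updated yearly.",
    url="https://example.org/datasets/trees",
    license_url="https://creativecommons.org/publicdomain/zero/1.0/",
    download_url="https://example.org/datasets/trees.csv",
)
print(as_script_tag(example))
```

The printed script element would go in the HTML of the page that describes the dataset, which is where a crawler looks for the markup.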

chatmasta about 4 years ago

This is a great resource. At Splitgraph, we index ~40k open data sets, and we make sure to include structured metadata for each one, so we show up in these results. (example [0])

One cool aspect of this metadata is that it allows a dataset to have multiple sources. So if two sites index the same dataset, there is no duplicate content penalty like there might be with textual content. If you search for a dataset, it will include links to all its sources (whether canonical or otherwise).

For most of the data we index at Splitgraph, the canonical source is an open government data portal powered by Socrata (e.g. data.cdc.gov). We noticed that Socrata powered a lot of portals, so we wrote a Socrata plugin for Splitgraph, along with a scraper to index the metadata. The plugin basically implements a Postgres FDW so that Splitgraph can translate from SQL to the upstream query language. In this case, the plugin translates to Socrata's bespoke API language. But for private deployments we also have plugins for Snowflake, Postgres, some SaaS services, etc.

If you find some data on Google Dataset Search with Splitgraph listed as a source, please take a look! Our "Data Delivery Network" (DDN) is implemented on top of the Postgres wire protocol, so you can connect with any Postgres client (or use our web editor). All the Postgres query syntax is available to you; you can even JOIN across any of the other 40k+ datasets indexed at Splitgraph. That includes "live data" like Socrata portals, but also versioned snapshots of data called "data images."

Here's an example of a point-in-time query across two snapshots (basically a diff) [1], and another query that joins across tables at data.cityofchicago.org and data.cambridgema.gov [2].

[0] <a href="https://www.splitgraph.com/cdc-gov/distribution-of-covid19-deaths-and-populations-by-jwta-jxbg" rel="nofollow">https://www.splitgraph.com/cdc-gov/distribution-of-covid19-d...</a> - "View Source" to see the Schema.org metadata

[1] <a href="https://bit.ly/3epvxcj" rel="nofollow">https://bit.ly/3epvxcj</a>

[2] <a href="https://bit.ly/3f1ll8K" rel="nofollow">https://bit.ly/3f1ll8K</a>

(Sorry for the bit.ly links. The URL for our query editor includes the full SQL string, and I don't want to mess up HN formatting.)
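For readers curious what "translating SQL to Socrata's API language" could look like, here is a rough sketch. It is not Splitgraph's plugin code: it only maps the parts of a simple SELECT onto the `$select`, `$where`, and `$limit` query parameters of a Socrata resource endpoint. The function name, the resource ID `abcd-1234`, and the column names are made up for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def soql_url(domain, resource_id, columns, where=None, limit=None):
    """Build a Socrata request URL for a simple SELECT ... WHERE ... LIMIT."""
    params = {"$select": ", ".join(columns)}
    if where:
        params["$where"] = where
    if limit is not None:
        params["$limit"] = str(limit)
    # safe="$" keeps the parameter names readable; values are still encoded.
    query = urlencode(params, safe="$")
    return f"https://{domain}/resource/{resource_id}.json?{query}"


# SQL being translated (illustrative only):
#   SELECT state, deaths FROM some_table WHERE deaths > 100 LIMIT 5
url = soql_url(
    "data.cdc.gov",
    "abcd-1234",  # hypothetical resource ID
    ["state", "deaths"],  # hypothetical column names
    where="deaths > 100",
    limit=5,
)
print(url)
```

A real foreign data wrapper also has to handle type mapping, paging, and predicates the upstream API cannot express, all of which this sketch leaves out.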

davcancas about 4 years ago

This dataset search engine has been around for years! We created DataMarket (<a href="https://datamarket.es" rel="nofollow">https://datamarket.es</a>) inspired by this site (and Auren Hoffman's SafeGraph).

Der_Einzige about 4 years ago

Stop, you're making the barrier to entry too low! /s

This is really, really cool. Between this and Hugging Face's dataset and model hubs, AI/ML is really getting easier to use.

john-tells-all about 4 years ago

Dataset with 9,000 annotated cat images! => <a href="https://datasetsearch.research.google.com/search?query=cat&docid=L2cvMTFqY2tkNTI3MQ%3D%3D" rel="nofollow">https://datasetsearch.research.google.com/search?query=cat&d...</a>

uptime about 4 years ago

I have a lot to read before I get excited but if the team is here: Can we get DCAT for sets that are otherwise only discoverable with OAI-PMH? Seems like a divide between govt and academic repos that hinders harvesting.

damirkotoric about 4 years ago

Shameless plug. I wrote a piece about The State of Open Data Portals <a href="https://uxdesign.cc/designing-open-data-portals-for-government-85e2524f5877" rel="nofollow">https://uxdesign.cc/designing-open-data-portals-for-governme...</a> where I predict that it'll take a Google to really provide a single searchable dataset portal for the whole world.

Doesn't take a genius to predict, but there ya go! Governments are assembling datasets in a very fragmented way. It'll take a private company to provide one single website to explore and find all datasets from around the world, making it easier to look at holistic patterns that are happening around the world, or compare patterns between countries.

Though, I would expect a much better UX from Google nowadays. This site has more in common with Google Scholar than Google Search.

And ultimately I'd like to see them build something where people don't need to download datasets in order to make use of the data.

I compare the state of open data to the state of mapping software before Google Maps. You needed to download map files and open them in special software on your computer to make sense of the data. And then Google Maps came along and flipped that whole model. Open data needs the same leap forward in order for more people to make greater use of open data.

igravious about 4 years ago

Discussion from Jan 2020: <a href="https://news.ycombinator.com/item?id=22130874" rel="nofollow">https://news.ycombinator.com/item?id=22130874</a> | 32 comments

Discussion from Sept 2018: <a href="https://news.ycombinator.com/item?id=17919297" rel="nofollow">https://news.ycombinator.com/item?id=17919297</a> | 76 comments

plaidfuji about 4 years ago

I’ve come across this a few different times over the years... always seems enticing and potentially useful, but I’ve never found a real use for it. I suppose it provides a library of well-prepped datasets to test ML models on? Anyone ever used this for any practical purpose beyond a sandbox-type use case?

smhx about 4 years ago

another good resource that's more specific to machine learning is <a href="https://paperswithcode.com/datasets" rel="nofollow">https://paperswithcode.com/datasets</a>

ravila4 about 4 years ago

The lab I work in has a project that helps annotate datasets with metadata and register their schemas: <a href="https://discovery.biothings.io/" rel="nofollow">https://discovery.biothings.io/</a>

A common barrier to making FAIR datasets is that not all data lends itself to being schema.org compliant. The idea is that instead of enforcing one schema to rule them all, we allow people to make their own schemas by extending existing ones, and register them in an API to be easily discoverable.

jeresuikkila about 4 years ago

They should add an "I'm Feeling Lucky" button

a_square_peg about 4 years ago

Also in case the team is here... the updated date for ERA5 back extension to 1950-1978 (Preliminary version - <a href="https://datasetsearch.research.google.com/search?query=ERA5%20atmospheric%20reanalysis&docid=L2cvMTFwM2tiM3NmYw%3D%3D" rel="nofollow">https://datasetsearch.research.google.com/search?query=ERA5%...</a>) is incorrect as this was only released last year (2020) but is stated as 2011.

antpls about 4 years ago

No mention of <a href="https://frictionlessdata.io/" rel="nofollow">https://frictionlessdata.io/</a> dataset metadata format which is also used by Kaggle

fabcomm about 4 years ago

This is every data scientist's dream.

paulcole about 4 years ago

I was looking up dentistry data sets (my industry) and came across this: <a href="https://www.arcgis.com/home/item.html?id=9850793c688e4eebaab8be8e98cc6b28" rel="nofollow">https://www.arcgis.com/home/item.html?id=9850793c688e4eebaab...</a>

Can anybody explain why this showed up in a dataset search and what exactly the data is?

sigmonsays about 4 years ago

Do they have a deprecation notice up already?

apprenticer about 4 years ago

The privacy problem should be considered

lunatuna about 4 years ago

I'm not a scientist, so not the most scholarly first lookup, but tried searching for penis data [0]. The first link sent me to a site that requires signup to use [1]. No fun. Won't use again.

[0] - <a href="https://datasetsearch.research.google.com/search?query=penis&docid=L2cvMTFqbl82NWZ3Ng%3D%3D" rel="nofollow">https://datasetsearch.research.google.com/search?query=penis...</a>

[1] - <a href="https://data.world/jemus42/world-penis-data" rel="nofollow">https://data.world/jemus42/world-penis-data</a>
