Launch HN: Hubble (YC S20) – Monitor data quality inside data warehouses

125 points by oliver101 almost 5 years ago

Hey everyone! We’re Oliver and Hamzah from Hubble (https://gethubble.io/hn). Hubble runs tests on your data warehouse so you can identify issues with data quality. You can test for things like missing values, uniqueness of data or how frequently data is added/updated.

We worked together for the last 4 years at a startup where we built and managed data products for insurers and banks. A common pattern we saw was teams taking data from their internal tools (CRM, HR system, etc.), application databases, and 3rd party data and storing it in a warehouse for analysis. However, when analysts/data scientists used the data for reports they would spot something suspicious and the engineering team would have to manually go through the data pipelines to find the source of the problem. More often than not it was simple things like a spike in missing values because an ETL job failed or stale data because a 3rd party data source hadn’t updated correctly. We realised that reliability/trustworthiness of the raw data was essential before you could start abstracting away more interesting tasks like analysis, insight or predictions.

We wanted to do this without having to write and maintain lots of individual tests in our code. So we built Hubble, which connects to a data warehouse and creates tests based on the type of data being stored (e.g. freshness of timestamps, the cardinality of strings, max value of numbers, missing values, etc.). We’ve also added the ability to write any custom tests using a built-in SQL editor. All the tests run on a schedule and you’ll get an email or Slack alert when they fail. We’re also building webhooks and an Airflow operator so you can run tests immediately after running an ETL job or trigger a process to fix a failing test.

Instead of asking users to send their data to us, the tests are run in the data warehouse and we track the test results over time. Today we support BigQuery, Snowflake and Rockset (which lets us work with MongoDB and DynamoDB) and are adding more on request.

We’re planning on charging $200 a month for a few seats, and $30-50 for extra users after that.

We’re still at an early access stage but want the HN community’s feedback, so we’ve opened up access to the app for a few days. You can try it out here: https://gethubble.io/hn. We’ve added a demo data warehouse you can start with that has data on COVID-19 cases in Italy and bike-share trips in San Francisco. Thanks and looking forward to hearing your ideas, experiences and feedback!
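
For readers who want a concrete picture of the kinds of checks described above (missing values, uniqueness, freshness), here is a minimal sketch of how they can be expressed as plain SQL. This is not Hubble's implementation. The `trips` table, its columns, and the helper names are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3


def null_fraction(conn, table, column):
    """Share of rows where `column` is NULL (0.0 for an empty table)."""
    total, missing = conn.execute(
        f"SELECT COUNT(*), SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) "
        f"FROM {table}"
    ).fetchone()
    return (missing or 0) / total if total else 0.0


def duplicate_count(conn, table, column):
    """Number of non-NULL values in `column` that repeat an earlier value."""
    (dupes,) = conn.execute(
        f"SELECT COUNT({column}) - COUNT(DISTINCT {column}) FROM {table}"
    ).fetchone()
    return dupes


def latest_value(conn, table, column):
    """Most recent value in a timestamp column, for a freshness check."""
    (latest,) = conn.execute(f"SELECT MAX({column}) FROM {table}").fetchone()
    return latest


# Hypothetical demo data: one missing duration and one repeated trip_id.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (trip_id INTEGER, duration_s INTEGER, started_at TEXT)"
)
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        (1, 600, "2020-08-18T09:00:00"),
        (2, None, "2020-08-19T09:30:00"),
        (2, 420, "2020-08-19T10:00:00"),
        (3, 300, "2020-08-20T08:15:00"),
    ],
)

print(null_fraction(conn, "trips", "duration_s"))
print(duplicate_count(conn, "trips", "trip_id"))
print(latest_value(conn, "trips", "started_at"))
```

Each helper returns a number or timestamp rather than pass/fail, so a scheduler could store the result over time and alert when it crosses a threshold. Table and column names are interpolated directly into the SQL here, which is only acceptable because they are hard-coded in the sketch.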

10 comments

jeremynevans almost 5 years ago

Customer here (comment not solicited!). We've been trying out Hubble for a month or so and it's looking really promising.

I love the idea of being able to outsource the creativity/problem solving of predicting things that could go wrong with our data to a service that specialises in just that, and I can totally see how they can automate this in a big way as they grow.

verhey almost 5 years ago

How does Hubble compare to Great Expectations or DBT for pipeline testing? It looks like there's more emphasis on automated profiling than "having to write and maintain lots of individual tests", and obviously Hubble being a SaaS offering is the big difference?

Also, any plans to profile and test file-based stores as well? There's a lot that can go wrong in a pipeline before data even reaches BigQuery or Snowflake, and you may help your customers save money if you could profile data in S3 before it goes through a potentially expensive transform process.

Best of luck, though! Data testing is a very real need in most data organizations I've been in, and I'm glad more and more tools seem to be popping up recently to help with it.

mushufasa almost 5 years ago

this is interesting! running tests on data is certainly a pain point for me, and there doesn't seem to be nearly the kind of infrastructure available that there is for, say, tests for code functionality.

Is this open source? Sending my data to a third party is a no-go, as is having a third party connect to the database. Something that is part of a managed hosting service, though, or an add-on to an existing trusted hosted service that has gone through compliance (e.g. Heroku, AWS), would be more palatable.

LittlePeter almost 5 years ago

Running a full table scan on BigQuery every hour can get quite expensive. Do you support some sort of deltas?

I signed up. Unlike the video, I do not see Redshift as an option. Any idea when Redshift will be supported?

How does billing per user make sense here? What prevents me from monitoring thousands of tables under a single user? Your workload costs will be higher than $200 here, no?

Do you have a set of fixed IPs you're connecting from to allow me to whitelist you?
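
To illustrate what "deltas" could mean in practice, here is a minimal sketch of a watermark-based check that only looks at rows loaded since the previous run. The thread does not say whether Hubble works this way. The `events` table, the `loaded_at` column, and the `check_new_rows` helper are hypothetical, and SQLite stands in for the warehouse. On BigQuery specifically, a filter like this only reduces the bytes scanned if the table is partitioned or clustered on the filtered column.

```python
import sqlite3


def check_new_rows(conn, table, ts_column, checked_column, watermark):
    """Profile only rows newer than `watermark` and return the advanced watermark."""
    new_rows, new_nulls, newest = conn.execute(
        f"SELECT COUNT(*), "
        f"SUM(CASE WHEN {checked_column} IS NULL THEN 1 ELSE 0 END), "
        f"MAX({ts_column}) "
        f"FROM {table} WHERE {ts_column} > ?",
        (watermark,),
    ).fetchone()
    return {
        "new_rows": new_rows,
        "new_nulls": new_nulls or 0,
        # Keep the old watermark when nothing new has arrived.
        "watermark": newest if newest is not None else watermark,
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (loaded_at TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("2020-08-20T01:00:00", 10.0),
        ("2020-08-20T02:00:00", 12.5),
        ("2020-08-20T03:00:00", 9.0),
    ],
)

# First run: an empty-string watermark sorts before every ISO timestamp.
first = check_new_rows(conn, "events", "loaded_at", "amount", "")

# A later load adds one row with a missing amount.
conn.execute("INSERT INTO events VALUES (?, ?)", ("2020-08-20T04:00:00", None))
second = check_new_rows(conn, "events", "loaded_at", "amount", first["watermark"])

# Nothing new since the second run.
third = check_new_rows(conn, "events", "loaded_at", "amount", second["watermark"])

print(first, second, third, sep="\n")
```

The trade-off is that a watermark only catches problems in newly loaded rows. Checks that depend on the whole table, such as global uniqueness, still need a periodic full scan or a running aggregate.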

scapecast almost 5 years ago

Co-founder of intermix.io here (which we sold in March). We came more from the performance monitoring angle (specifically for Redshift), but then shifted to a product that works horizontally across all warehouses, to track usage, workflows and user engagement. "Shift to Data Products" was the narrative we started using in Q4 2019. If you read the copy on the current intermix.io website, I think you'll find yourself nodding. (FYI - we got bought by a small PE Fund that is rolling the product into Xplenty, an ETL product).

My experience is that monitoring data quality is still an under-appreciated discipline. I've found that most teams still have a "not invented here" mentality, or don't even know they have the problem! That can lead to an "oh, we can just fix it when it happens" type of mentality. But your timing may be better than ours - we started back in 2016.

I haven't played with your product (yet), only took a look at this thread and your website. Some observations:

- SQL Editor - big plus! I think giving your users a space where they can take action is a super value-add, we didn't have that.

- nice work running the tests inside the customer's warehouse. That has two benefits for you. 1) you're not incurring the cost to crunch the metadata, it can get quite expensive, depending on the number of tables in the warehouse. 2) you're avoiding data access issues, getting access to the warehouse was always a hurdle, even though we only needed access to the system tables.

- pricing model. I think the per-seat model is the way to go. We tried charging by number of rows, and size of the warehouse (number of nodes), but then you run into weird situations with customers who are dealing with huge historic datasets, but really only look at the last 30 of data.

My unsolicited $0.02 is that you think hard about distribution. I think you want to think about hitching your wagon to the cloud marketplaces, and Snowflake's marketplace. For example, attaching themselves to Snowflake is what made all the difference for Fivetran.

I have a bunch more scars that I can share if you care to know them :-)

_Microft almost 5 years ago

Have you considered picking a different name? Searching for "Hubble" for whatever reason is going to return millions of irrelevant results for your customers.

hribo almost 5 years ago

I signed up and I think the concept is promising. It was very easy to add a couple of tests. The SQL interface is handy and convenient, but sometimes still limiting. It would be good to add support for custom scripts (e.g. Python, R). Another important thing for my team would be seamless integration with other tools (e.g. email, SMS, Slack) to notify the team about failed tests.

12ian34 almost 5 years ago

+1 for relieving data scientists/engineers of boring, repetitive manual tasks and empowering them to focus on the more challenging stuff

iblaine almost 5 years ago

What does the tech stack look like?

Is there any caching for those situations where you may read the same historical data over & over?

hg_ almost 5 years ago

Do you have/think you need an on-prem version?
