Launch HN: Matano (YC W23) – Open-Source Security Lake Platform (SIEM) for AWS

140 points by wizwit999 over 2 years ago

Hi HN! We’re Shaeq and Samrose, co-founders of Matano (https://matano.dev). Matano is a high-scale, low-cost alternative to traditional SIEM (e.g. Splunk, Elastic) built around a vendor-agnostic security data lake that deploys to your AWS account.

Don’t worry, we’ll explain all this jargon in a second.

SIEM stands for “Security Information and Event Management” and refers to log management tools used by security teams to detect threats from an organization's security logs (network, host, cloud, SaaS audit logs, etc.) and send alerts about them. Security engineers write detection rules inside the SIEM as queries to detect suspicious activity and create alerts. For example, a security engineer could write a detection rule that checks the fields in each CloudTrail log and creates an alert whenever an S3 bucket is modified with public access, to prevent data exfiltration.

Traditional SIEM tools (e.g. Splunk, Elastic) used to analyze security data are difficult to manage for security teams on the cloud. Most don’t scale because they are built on top of a NoSQL database or search engine like Elasticsearch. And they are expensive: the enterprise SIEM vendors have costly ingest-based licenses. Since security data from SaaS and cloud environments can exceed hundreds of terabytes, teams are left with unsatisfactory options: not collect some data, leave some data unprocessed, pay exorbitant fees to an enterprise vendor, or build their own large-scale solution for data storage (aka “data lake”).

Companies like Apple, HSBC, and Brex do the latter: they build their own security data lakes to analyze their security data without breaking the bank. “Data lake” is jargon for heterogeneous data that is too large to be kept in a standard database and is analyzed directly from object storage like S3.
A “security data lake” is a repository of security logs parsed and normalized into a common structure and stored in object storage for cost-effective analysis. Building your own data lake is a fine option if you’re big enough to justify the cost, but most companies can’t afford it.

Then there’s the vendor lock-in issue. SIEM vendors store data in proprietary formats that make it difficult to use outside of their ecosystem. Even with "next-gen" products that leverage data lake technology, it's nearly impossible to swap out your data analytics stack or migrate your security data to another tool because of a tight coupling of systems designed to keep you locked in.

Security programs also suffer because of poor data quality. Most SIEMs today are built as search engines or databases that query unstructured/semi-structured logs. This requires you to heavily index data upfront, which is inefficient, expensive, and makes it hard to analyze months of data. Writing detection rules requires analysts to use vendor-specific DSLs that lack the flexibility to model complex attacker behaviors. Without structured and normalized data, it is difficult to correlate across data sources and build effective rules that don’t create many false positive alerts.

While the cybersecurity industry has been stuck dealing with these legacy architectures, the data analytics industry has seen a ton of innovation through open-source initiatives such as Apache Iceberg, Parquet, and Arrow, delivering massive cost savings and performance breakthroughs.

We encountered this problem when building out petabyte-scale data platforms at Amazon and Duo Security.
We realized that most security teams don't have the resources to build a security data lake in-house or take advantage of modern analytics tools, so they’re stuck with legacy SIEM tools that predate the cloud.

We quit our jobs at AWS and started Matano to close the gap between these two worlds by building an OSS platform that helps security teams leverage the modern data stack (e.g. Spark, Athena, Snowflake) and efficiently analyze security data from all the disparate sources across an organization.

Matano lets you ingest petabytes of security and log data from various sources, store and query them in an open data lake, and create Python detections as code for realtime alerting.

Matano works by normalizing unstructured security logs into a structured realtime data lake in your AWS account. All data is stored in optimized Parquet files in S3 object storage for cost-effective retention and analysis at petabyte scale. To prevent vendor lock-in, Matano uses Apache Iceberg, a new open table format that lets you bring your own analytics stack (Athena, Snowflake, Spark, etc.) and query your data from different tools without having to copy any data. By normalizing fields according to the Elastic Common Schema (ECS), we help you easily search for indicators across your data lake, pivot on common fields, and write detection rules that are agnostic to vendor formats.

We support native integrations to pull security logs from popular SaaS, Cloud, Host, and Network sources and custom JSON/CSV/Text log sources. Matano includes a built-in log transformation pipeline that lets you easily parse and transform logs at ingest time using Vector Remap Language (VRL) without needing additional tools (e.g. Logstash, Cribl).

Matano uses a detection-as-code approach which lets you use Python to implement realtime alerting on your log data, and lets you use standard dev practices by managing rules in Git (test, code review, audit).
Advanced detections that correlate across events and alerts can be written using SQL and executed on a scheduled basis.

We built Matano to be fully serverless using technologies like Lambda, S3, and SQS for elastic horizontal scaling. We use Rust and Apache Arrow for high performance. Matano works well with your existing data stack, allowing you to plug in tools like Tableau, Grafana, Metabase, or Quicksight for visualization and use query engines like Snowflake, Athena, or Trino for analysis.

Matano is free and open source software licensed under the Apache-2.0 license. Our use of open table and common schema standards gives you full ownership of your security data in a vendor neutral format. We plan on monetizing by offering a cloud product that includes enterprise and collaborative features to be able to use Matano as a complete replacement to SIEM.

If you're interested to learn more, check out our docs (https://matano.dev/docs), GitHub repository (https://github.com/matanolabs/matano), or visit our website (https://matano.dev).

We’d love to hear about your experiences with SIEM, security data tooling, and anything you’d like to share!
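To make the CloudTrail example from the post concrete, here is a minimal sketch of a Python detection that flags an S3 bucket ACL change granting public access. The `detect(record)` entry point, the `_as_list` helper, and the exact shape of `requestParameters` are assumptions for illustration; check them against the Matano docs and real CloudTrail samples before relying on them.

```python
# Illustrative sketch only: the function names and the record layout are
# assumptions, not Matano's documented API.

PUBLIC_GROUP_URIS = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}
PUBLIC_CANNED_ACLS = {"public-read", "public-read-write"}


def _as_list(value):
    """Wrap a scalar in a list, pass lists through, treat None as empty."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def detect(record: dict) -> bool:
    """Return True when a PutBucketAcl event appears to grant public access."""
    if record.get("eventSource") != "s3.amazonaws.com":
        return False
    if record.get("eventName") != "PutBucketAcl":
        return False

    params = record.get("requestParameters") or {}

    # Case 1: a canned ACL passed with the request (assumed key: "x-amz-acl").
    canned_acls = _as_list(params.get("x-amz-acl"))
    if any(acl in PUBLIC_CANNED_ACLS for acl in canned_acls):
        return True

    # Case 2: explicit grants to one of the global groups (assumed nesting).
    policy = params.get("AccessControlPolicy") or {}
    grants = _as_list((policy.get("AccessControlList") or {}).get("Grant"))
    return any(
        (grant.get("Grantee") or {}).get("URI") in PUBLIC_GROUP_URIS
        for grant in grants
    )
```

A production rule would likely also cover bucket policy changes and removal of public access blocks; this sketch only covers ACL changes.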

16 comments

sullivanmatt over 2 years ago

This issue exists to the right of your solution and is (for now) out of scope, but the biggest issue I have with security data lakes is the need to (easily) get both row-based data and visualizations. Back when I had access to a well-built and cared for Splunk environment, I would constantly run queries, build visualizations, go back to the results index, tweak the query, go back to viz, etc. This feedback loop is important and allows for fast iteration, especially if you are conducting a high-stakes investigation and need answers rapidly. I should be able to look at my available fields and tweak the viz accordingly in under a few seconds; preferably in one mouse click.

Now I live on an ELK stack and I experience nothing but full-time agony as I switch between Kibana and Kibana Lens constantly. It's clear they are two completely separate "products" built for different use-cases. The experience reminds you constantly that they were not purpose-built for how I use them, unlike Splunk.

Increasingly we are moving towards the reality of a security data lake, and all I can think is that I'm about to lose what little power I had left as I have to move to something like Mode, Sisense, or Tableau which again, were not purpose-built for these use-cases and even further separate the query/data discovery and visualization layers.

I hate how crufty and slow Splunk has gotten as an organization, and they use their accomplishments from 15 years ago to justify the exorbitant price they charge. I really hope the OSS/next-gen SaaS options can fill this need and security data lake becomes a reality. But for that to happen, more focus is needed on the user experience as well.

Regardless, very cool stuff and could definitely fill a need for organizations that are just starting to dip toes into security data lakes. I wish you success!


mdaniel over 2 years ago

I loaded the GitHub link, bracing myself for yet another AGPL license, but no, it's Apache 2! So I wanted to say thank you for that and I hope to take a deeper look when I'm back at my desk because trying to keep Splunk alive and happy is a monster pain point. There are so many data sources we'd love to throw at it but we don't have the emotional energy to put up with Splunk crying about it


jillesvangurp over 2 years ago

Interesting project. A few remarks though.

- Doing real time data processing on terabytes or petabytes involves a lot of IO, which is a significant part of the cost in AWS. Things like Athena are simply not cheap to run at that scale.
- With time series data, the emphasis is usually on querying recent data, not all of the data. You retain older data for auditing for some time. But this can essentially be cold storage.
- Alerting-related querying in particular is effectively against recent data only. There's no good reason for this to be slow.
- People tend to scale Elasticsearch for the whole data set instead of just recent data. However, with suitable data stream and index life cycle management policies, you can contain the cost quite effectively.
- Elastic Common Schema is nice but also adds a lot of verbosity to your data and queries, bloating individual log entries to a KB or more. Parquet is a nice option for sparsely populated column oriented data of course. Probably the online disk storage is not massively different from a well tuned Elastic index.
- Elastic and OpenSearch have both announced stateless as their next goal. So, architecturally similar to this and easier to scale horizontally.
- SIEM is just one use case. What about APM, log analytics, and other time series data? Security events usually involve looking at all of that.
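The IO cost point can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes a scan-priced query engine that charges a flat rate per terabyte scanned (the 5 USD/TB default is an assumed list price, so substitute your own) and compares a query pruned to one day of partitions with one that scans a full retention window. The daily volume and retention figures are invented for illustration.

```python
TB = 1000 ** 4  # bytes per terabyte (assumption: decimal units for billing)


def scan_cost_usd(bytes_scanned: float, usd_per_tb: float = 5.0) -> float:
    """Cost of one query that scans `bytes_scanned` bytes."""
    return bytes_scanned / TB * usd_per_tb


def query_cost_usd(daily_bytes: float, days_scanned: int,
                   usd_per_tb: float = 5.0) -> float:
    """Cost of one query over `days_scanned` daily partitions of equal size."""
    return scan_cost_usd(daily_bytes * days_scanned, usd_per_tb)


if __name__ == "__main__":
    daily = 2 * TB  # hypothetical: 2 TB of scanned columns per day
    recent = query_cost_usd(daily, days_scanned=1)
    full = query_cost_usd(daily, days_scanned=90)
    print(f"last 24h: ${recent:.2f} per query")
    print(f"90 days:  ${full:.2f} per query")
```

The ratio between the two is simply the number of partitions scanned, which is why partition pruning and column selection matter more than the absolute rate.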


molsongolden over 2 years ago

Excited to give this a try and follow your progress!

In case anybody else is wondering how Matano compares to Panther (my first thought reading this launch post) there's a comparison on the Matano website[0].

Quick note to the Matano team, the "Elastic Common Schema (ECS)" link in the readme[1] seems to be broken.

[0] https://www.matano.dev/alternative-to/panther

[1] https://github.com/matanolabs/matano#-log-transformation--data-normalization


protoduction over 2 years ago

Hi Shaeq and Samrose - congrats on the launch! Matano looks great.

Out of curiosity, at some point I believe you were working on a predecessor called AppTrail which tackled (customer-facing) audit logs. It was something I was interested in at the time (and still am! I would've loved to use that).

Would you perhaps be willing to share your learnings from that product, and (I assume) why it evolved into Matano?


bovermyer over 2 years ago

Oh, this very much has my attention. I'll be checking this out in depth.

brunes over 2 years ago

How do you position this against AWS's own Security Lake announced at re:Invent in November (https://aws.amazon.com/security-lake/)?

Your architecture diagram looks like a carbon copy of theirs.


boundlessdreamz over 2 years ago

What distinguishes a SIEM from traditional log analysis? I know the feature set is oriented towards SIEM but it seems like a superset of regular log analysis. I don't have a need for a SIEM now but this looks good even for non-security logs.


waihtis over 2 years ago

I'm a vendor in the cyber space so not a potential customer (feel free not to waste time answering) but am just intellectually curious who you're targeting this at. High-skill tech companies who are just building up a security program? I don't see most security teams building their own SIEM'ish solution just because they really don't have the chops or resources to do it. OTOH, it would be a big rip-out operation for F100 companies to change to this from Splunk et al.


debarshri over 2 years ago

I have been exploring this realm of SIEM, XDR, NDR etc. Sure, all proprietary SIEMs are expensive. But what is not clear is how you are going to price it. Security teams have dedicated budgets. If you are coming in cheaper than them, then you are destroying your TAM, because I know customers would not mind paying those license fees. OSS GTM might work but might work against your TAM.


badrabbit over 2 years ago

What distinguishes Matano's existing or planned products from Google Chronicle? Would you have any limits on data ingestion or retention?

Also, Python detections sounds horrible! I love Python but it sounds like you haven't considered the challenges of detection engineering. This is one of my main "expertise" if you will. You should think more along the lines of flexible SQL than Python. People who write detection rules for the most part don't know Python, and even if they do it would be a nightmare to use for many reasons.

I hope someone from your team reads this comment: DO NOT try to invent your own query language, but if you do, DON'T start from scratch. Your product could be the best, but people who like the fabulous Splunk need to also like it. And for a security data lake, you must support Sigma rule conversion into your query/rule format. Python is a general purpose language; there are very good reasons why no one else from Splunk, Elastic, Graylog, Google, Microsoft uses Python. Don't learn this hard lesson with your own money. Querying it needs to be very simple and most importantly you need to support regex with capture groups and the equivalent of the "|stats" command from Splunk if you want to quickly capture market share. I have used and evaluated many of these tools and have written a lot of detection content.

Your users are not coders, DB admins or exploit developers. They are really smart people whose focus is understanding threat actors and responding to incidents, not coding or anything sophisticated. FAANG background founders/devs have a hard time grasping this reality.
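For readers unfamiliar with the two features named in this comment, the following is roughly what regex extraction with capture groups followed by a `| stats count by <field>` style aggregation amounts to, sketched in plain Python. The log lines, field names, and the `count_by` helper are invented for illustration; this is not how Splunk or Matano implement it.

```python
import re
from collections import Counter

# Made-up sshd-style log lines using documentation IP ranges.
LOG_LINES = [
    "Failed password for root from 203.0.113.7 port 52144 ssh2",
    "Failed password for admin from 203.0.113.7 port 52150 ssh2",
    "Accepted password for alice from 198.51.100.4 port 40022 ssh2",
    "Failed password for root from 192.0.2.55 port 33812 ssh2",
]

# Named capture groups play the role of extracted fields.
PATTERN = re.compile(r"Failed password for (?P<user>\S+) from (?P<src_ip>\S+)")


def count_by(lines, pattern, field):
    """Count matching lines grouped by the value of one named capture group."""
    counts = Counter()
    for line in lines:
        match = pattern.search(line)
        if match:
            counts[match.group(field)] += 1
    return counts


if __name__ == "__main__":
    for src_ip, n in count_by(LOG_LINES, PATTERN, "src_ip").most_common():
        print(src_ip, n)
```

The comment's argument is that a query language expresses this in one line, whereas general-purpose code needs the scaffolding shown here.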


alecbell over 2 years ago

This is awesome. Nice work open-sourcing it! I used Splunk at Expedia and it was super expensive and slow. While I wasn't using it for security purposes, it could take 15-30 min for us to detect error logs, and I can imagine that's not okay for security purposes. Good luck guys!


slt2021 over 2 years ago

Question to Matano authors - won't your solution simply enrich AWS by blowing up my cloud bill?

Did you estimate how many times Lambda will get invoked and what the AWS bill will be for 1 million events ingested? I am curious to learn the price to pay for serverless SIEM.
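A rough way to frame this question is to note that invocation count depends on batching rather than on raw event count. The sketch below is a hypothetical estimate, not a measurement of Matano: the batch size, duration, and memory values are invented, and the two default prices are assumed list prices that should be replaced with current figures for your region. It also ignores S3, SQS, and query costs.

```python
import math


def lambda_cost_usd(events: int, batch_size: int, avg_duration_s: float,
                    memory_gb: float,
                    usd_per_million_requests: float = 0.20,
                    usd_per_gb_second: float = 0.0000166667):
    """Return (invocations, estimated cost in USD) for processing `events`.

    Assumes each invocation handles up to `batch_size` events and runs for
    `avg_duration_s` seconds with `memory_gb` of configured memory.
    """
    invocations = math.ceil(events / batch_size)
    request_cost = invocations / 1_000_000 * usd_per_million_requests
    compute_cost = invocations * avg_duration_s * memory_gb * usd_per_gb_second
    return invocations, request_cost + compute_cost


if __name__ == "__main__":
    # Hypothetical: 1M events, batches of 1,000, 2 s per batch at 1 GB.
    batched = lambda_cost_usd(1_000_000, 1000, 2.0, 1.0)
    # Worst case: one invocation per event, 50 ms each at 1 GB.
    unbatched = lambda_cost_usd(1_000_000, 1, 0.05, 1.0)
    print("batched:  ", batched)
    print("unbatched:", unbatched)
```

Under these assumptions the per-event compute cost is dominated by how well events are batched, so the batch size is the parameter worth asking the authors about.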


wdb over 2 years ago

Anyone aware of a similar solution for Google Cloud / GCP?


simonebrunozzi over 2 years ago

Shaeq and Samrose: for us investors here, where are you in terms of fundraising? I'm an ex AWS (google me, you'll have a few laughs!), turned VC in the past few years. $HN_username at gmail if you want to reach out and chat!

Edit: here's me with Andy, from a millennium ago [0].

[0]: https://www.youtube.com/watch?v=bWL0_Xdntzo&t=2907s

napolux over 2 years ago

Super random question... I wonder if the name is related to Frank Matano, the Italian youtuber/comedian.
