Ask HN: Good tech talks on how analytics systems are implemented?

193 points by psankar, almost 6 years ago

I am building a new sub-system for analytics, which I can design / implement from scratch. I get a bunch of unique users (say a few thousand). Now I need to track each of these users and do some analytics in my API server: Which places are they logging in from? How long do their sessions last on average? What are the usual links/endpoints that they visit? etc. I have a few thousand active users and about two dozen parameters that I want to track and analyze.

I have never implemented such an analytics system. I want to learn from people who have already implemented similar systems. Are there any good tech talks, engineering blog posts, video courses, etc. that highlight the design/technology/architecture choices and their benefits?

31 comments

kthejoker2 almost 6 years ago
Speaking as an analytics architect ...

You'll be a lot better off spending your mental energy thinking about the outcomes you want to achieve (user engagement, upselling, growth, etc) and the types of analysis you'll need to understand what changes you need to make to produce those outcomes. Protip: this is actually really hard, and people underestimate it by orders of magnitude. A blog post by Roger Peng (with indirect commentary from John Tukey): https://simplystatistics.org/2019/04/17/tukey-design-thinking-and-better-questions/

One other immediate tip is to start thinking about correlating your telemetry with user surveys - again, strongly focusing on outcomes and the controllable aspects of those outcomes.

Don't let the data lead the discussion; decide on the question you're asking, and the implications of all of the possible answers to that question (clearly yes, clearly no, mixed, etc) before you ask it.

Then engineer the lightest-weight system possible to ingest, process, store, analyze, and visualize that data.

For me, that would just be:

1. Log data in whatever logging tool you like. Persist the raw stuff forever in a cheap data lake.
2. Batch at some fixed interval into a staging area of a relational DB.
3. Transform it with stored procedures for now (while you figure out what the right transforms are) into a flat fact table.
4. Visualize in Superset or PowerBI or even plain old Excel.

Once you've got the patterns of analysis at least fundamentally right you can consider stream processing (Flink or Kafka Streams are fine) to replace 2 and 3.
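A rough sketch of steps 2 and 3 above, assuming Postgres; the staging_events / fact_events tables and field names are illustrative, not part of the original comment:

    # Minimal sketch of steps 2-3: batch raw JSON log lines into a staging
    # table, then flatten them into a fact table. All table and column names
    # are hypothetical; adjust to your own schema.
    import psycopg2  # assumes a reachable Postgres instance

    conn = psycopg2.connect("dbname=analytics user=app")

    def load_batch(log_path: str) -> None:
        with conn, conn.cursor() as cur, open(log_path) as f:
            # Stage the raw events untouched so the transform can be rerun later.
            for line in f:
                cur.execute(
                    "INSERT INTO staging_events (payload) VALUES (%s)",
                    (line.strip(),),
                )
            # Flatten staged JSON into a queryable fact table, then clear staging.
            cur.execute("""
                INSERT INTO fact_events (user_id, event_name, url, occurred_at)
                SELECT payload::jsonb ->> 'user_id',
                       payload::jsonb ->> 'event',
                       payload::jsonb ->> 'url',
                       (payload::jsonb ->> 'ts')::timestamptz
                FROM staging_events
            """)
            cur.execute("TRUNCATE staging_events")

    load_batch("/var/log/app/events.log")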
sfkdjf9j3j almost 6 years ago
Seriously, just put everything in Postgres. You have so little data, you shouldn't even be *thinking* about an "analytics system".

I have seen so many developers over-engineer this exact problem. Forget about Kafka, Kinesis, Redshift, Airflow, Storm, Spark, Cassandra etc. You don't need them, not even close. Unless you want to add a bunch of expensive distributed systems and operational overhead for fun/resume building, they're going to waste your time and hurt your stability.
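For a sense of scale, a single events table plus plain SQL already answers the questions in the original post (average session length, most-visited endpoints). A minimal sketch, assuming Postgres and an illustrative schema:

    # "Just use Postgres": one events table, plain SQL for the analysis.
    # Schema and column names are illustrative only.
    import psycopg2

    conn = psycopg2.connect("dbname=analytics user=app")

    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS events (
                id          bigserial PRIMARY KEY,
                user_id     text NOT NULL,
                session_id  text NOT NULL,
                endpoint    text NOT NULL,
                country     text,
                occurred_at timestamptz NOT NULL DEFAULT now()
            )
        """)

        # Average session length: span between first and last event per session.
        cur.execute("""
            SELECT avg(last_seen - first_seen) AS avg_session_length
            FROM (
                SELECT session_id,
                       min(occurred_at) AS first_seen,
                       max(occurred_at) AS last_seen
                FROM events
                GROUP BY session_id
            ) s
        """)
        print(cur.fetchone())

        # Most-visited endpoints over the last 30 days.
        cur.execute("""
            SELECT endpoint, count(*) AS hits
            FROM events
            WHERE occurred_at > now() - interval '30 days'
            GROUP BY endpoint
            ORDER BY hits DESC
            LIMIT 20
        """)
        print(cur.fetchall())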
madhadron almost 6 years ago
I would be suspicious of most tech talks on this. If someone is giving a tech talk on their analytics system, they are either working at enormous scale (Facebook, Google), selling something (Splunk), or over-engineering their system (many startups).

I second the advice elsewhere in this thread: log it into PostgreSQL. If you start overloading that, look into sampling your data before you look into a fancier system. Make sure you have identifiers in each row for each entity a row is part of: user, session, web request. If you're not building replicated PostgreSQL (which you probably won't need for this), log to files first, and build another little process that tails the files and loads the rows into PostgreSQL. That log-then-load advice is hard-learned experience from working at Splunk.
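A minimal sketch of the log-then-load pattern described above, assuming a line-delimited JSON log file and a hypothetical events table (not the commenter's actual code):

    # Tail an append-only log file and load each new line into PostgreSQL.
    # The file path, table, and columns are hypothetical.
    import json
    import time
    import psycopg2

    conn = psycopg2.connect("dbname=analytics user=app")

    def follow(path: str):
        """Yield new lines appended to the file, like `tail -f`."""
        with open(path) as f:
            f.seek(0, 2)  # start at the end of the file
            while True:
                line = f.readline()
                if not line:
                    time.sleep(0.5)
                    continue
                yield line

    for line in follow("/var/log/app/events.log"):
        event = json.loads(line)
        with conn, conn.cursor() as cur:
            cur.execute(
                "INSERT INTO events (user_id, session_id, endpoint, occurred_at) "
                "VALUES (%s, %s, %s, %s)",
                (event["user_id"], event["session_id"],
                 event["endpoint"], event["ts"]),
            )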
serial_dev almost 6 years ago
I was working for a startup implementing analytics tools. In my opinion, our setup was over-engineered, but I wasn't there at the beginning, so I might be wrong. Also, requirements changed a couple of times, so this could also explain why something that looked necessary for scaling and speed ended up being this over-engineered mess. This is how it worked: after the JavaScript tracker fired, we got log files, passed them through Kafka, then parsed the log files and performed calculations through Storm (Java). For storage, we used Cassandra. The system also had other parts, but I don't remember why they were there, tbh.

My thought process for solving your problem would be the following. First, you need to understand that what's good for you and what's good for your company might not be the same. You want the challenge, you want to implement something that could scale, and you want to use exotic tools for achieving this. It's interesting and looks good on your CV. Your company might just want the results. You need to decide which is more important.

If we prioritize your company's needs over keeping you entertained, I'd follow this thought process:

Can't you just use Google Analytics? You can also connect it to BigQuery and do lots of customizations. Maybe time would be better spent learning GA. It's powerful, but most of us cannot use it well.

Second question: if for some reason you don't want to use Google Analytics, can you use another, possibly open-source and/or self-hosted, analytics solution? Just because you *can* design it from scratch doesn't mean you should.

Third: alright, you want to implement something from scratch. For this scale, you can probably just log and store events in an SQL database, write the queries, and display them in a dashboard.

Then, if you really want to go further, there are many tools that are designed to scale well and perform analytics on "big data". By looking for talks about these tools, you will get a better understanding of how things work. There are various open-source projects you should read more about: Cassandra, Scylla, Spark, Storm, Flink, Hadoop, Kafka, Parquet, just to name a few.
solidasparagus almost 6 years ago
My go-to v0 solution is JSON (simple, with no nested objects or lists) written to S3 (partitioned by date, see Hive date partitioning) and AWS Athena (serverless Presto) to do SQL queries on those JSONs. You can build the system in less than an hour, you don't have to manage any VMs, and it's relatively easy to extend to a more serious solution (e.g. if you need major scale or Spark-like analytic jobs).

Relational databases, like some are suggesting, are fine, but you have to manage them, unlike S3 + Athena, and they tend to make you design around relational database concepts, which can make it difficult to migrate to a full-blown analytics solution that often abandons relational guarantees.

This solution also lets you be flexible in your raw data schema, unlike relational databases where you have to have a well-defined schema or hacks like saving the raw JSON as a string.

When you need to evolve your data schema (you will, as you learn what things you want to measure), a relational database requires you to be thoughtful about how to do this (e.g. you can't have your data producer writing a new schema before the table has been changed). Often this requires you to add some sort of queue between data producers and database so that you can make changes to the table without stopping the data producers. With S3 + Athena, you can just upgrade your data producer, it will start saving the new format to S3, and then you upgrade your Athena table definition whenever you want to start querying the new data (because in relational databases, the schema defines how data is stored, but in the S3 + Athena world, the schema just tells the SQL engine how to read whatever data exists on S3).
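A rough sketch of that v0 using boto3; the bucket, database, and table names are placeholders (assumptions, not the commenter's setup), and the Athena table is assumed to already exist with dt as a partition key:

    # Write flat JSON events to S3 under a Hive-style date partition, then
    # query them with Athena. Bucket, database, and table names are hypothetical.
    import json
    import uuid
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")
    athena = boto3.client("athena")

    def write_event(event: dict) -> None:
        now = datetime.now(timezone.utc)
        key = f"events/dt={now:%Y-%m-%d}/{uuid.uuid4()}.json"
        s3.put_object(Bucket="my-analytics-bucket", Key=key,
                      Body=json.dumps(event).encode())

    write_event({"user_id": "u123", "endpoint": "/api/items", "country": "DE"})

    # Run a SQL query over the raw JSON; results land in the given S3 prefix.
    athena.start_query_execution(
        QueryString="""
            SELECT endpoint, count(*) AS hits
            FROM analytics.events
            WHERE dt >= cast(date_add('day', -30, current_date) AS varchar)
            GROUP BY endpoint
            ORDER BY hits DESC
        """,
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://my-analytics-bucket/athena-results/"},
    )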
srrr almost 6 years ago
1) Design the reports you want. Pay special attention to interactive elements like filters and drilldowns. List all dimensions and metrics you need. Think about privacy.

2) Find your visualisation tool of choice. This is more important than any architecture choice for the tracking, because this is what makes your data usable. [1]

3) Select a main data storage that is compatible with your visualisation tool, data size, budget, servers, security, ... SQL is always better because it has a schema the vis tools can work with. For a low amount of data you might just want to use your existing database (if you have one) and not build up new infrastructure that has to be maintained.

4) If you need higher availability on the data ingress than your DB can provide, use a high-availability streaming ingress [2] to buffer the data.

5) Design a schema to connect your DB to the visualisation tool. Also think about how you will evolve this schema in the future. (The simplest thing in SQL is: add columns.)

I hope this helps. If you have selected some tools it is fairly easy to search for blog posts and tech talks. But don't think too big (data). "A few thousand users" and "two dozen parameters" may be handled with Postgres and Metabase. Also, in most enterprise environments there already exists a data analytics / data science stack that is covered by SLAs and accepted by privacy officers. Ask around.

[1] https://github.com/onurakpolat/awesome-bigdata#business-intelligence
[2] https://github.com/onurakpolat/awesome-bigdata#data-ingestion
kfk almost 6 years ago
I have designed one for 500 dashboard users and various other requirements. My advice would be to get a cheap SQL-compliant database that does not require a lot of maintenance (if you can afford it, buy a cloud one). Then for the analysis part the quickest thing to do is use Jupyter + SQLAlchemy. You can also use a dashboarding tool, there are many, to connect to the database, but I think with Jupyter you can ask more interesting questions that require more blending or transformations. That's it, you'll grow from here in the coming months and years, but if you over-engineer analytics at the beginning you'll most likely get tired of it and stop doing it at some point.
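A minimal sketch of the Jupyter + SQLAlchemy workflow mentioned here, assuming a Postgres connection string and an illustrative events table:

    # In a Jupyter notebook: pull an aggregate straight into a DataFrame and plot.
    # The connection string and table/column names are hypothetical.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://app:secret@localhost/analytics")

    daily_active_users = pd.read_sql(
        """
        SELECT date_trunc('day', occurred_at) AS day,
               count(DISTINCT user_id)        AS active_users
        FROM events
        GROUP BY 1
        ORDER BY 1
        """,
        engine,
    )

    daily_active_users.plot(x="day", y="active_users")  # quick sanity-check chart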
usgroup almost 6 years ago
IMO, your requirements are too basic to need a serious system. Either log interactions to a file or a database, parse the output and query it with SQL to produce your basic metrics, or just write to Google Analytics.

When this starts creaking at the seams it'll mean that you either have bigger analysis and/or scalability requirements, and it'll be much clearer what you need to look for.
tekkk almost 6 years ago
I made one for our company using AWS Kinesis Firehose, which I thought was really good, having used GA, Mixpanel and Segment before. Shame we haven't been able to put it into wider use. Extremely simple and very robust: to deploy it you just have to run the CloudFormation stacks with Sceptre in a single command and then add the client library with some event listeners for clicks, pageviews, et cetera. I'd love to be able to open-source it, but I should think through the benefits and disadvantages of both options with my CEO. We probably couldn't get customers to pay for an expensive custom analytics platform if it was open-source.

Having spent some time on this I'll just say: don't overthink it. Over-engineering such a system is way too easy while the actual benefits might not be that great. Sure, if you're receiving a lot of data there might be some pitfalls to be aware of, e.g. using proper bucket partitioning with Athena for queries.
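For reference, sending an event into a Kinesis Data Firehose delivery stream is a one-call affair; a minimal sketch with a hypothetical stream name and payload (not the commenter's actual stack):

    # Push a single analytics event into a Kinesis Data Firehose delivery stream,
    # which then batches and delivers it to S3. The stream name is hypothetical.
    import json
    import boto3

    firehose = boto3.client("firehose")

    event = {
        "type": "pageview",
        "user_id": "u123",
        "path": "/pricing",
        "ts": "2019-06-10T12:34:56Z",
    }

    firehose.put_record(
        DeliveryStreamName="web-analytics-stream",
        Record={"Data": (json.dumps(event) + "\n").encode()},  # newline-delimited JSON
    )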
acidbaseextract almost 6 years ago
Go with off the shelf. You'll get something far better than you could build yourself, and if you need something custom, you'll have a much clearer idea what your analysis is missing.

Writing to Google Analytics, Amplitude, Mixpanel (all of which have free tiers) or an equivalent should handle your case well.
karanke almost 6 years ago
Note: I build and maintain such systems for a living.

There's a lot of context missing from your post; some questions that can help us guide you in the right direction:

1) Can your website call out to external services, or are you limited to operating behind a company network?

2) Is this more of an ad-hoc analysis, or do you want to invest in a framework to be able to track such metrics systematically over time?

3) How important is data accuracy? Adblock can easily mess with client-side metrics.

4) How real-time do metrics need to be? The big trade-off here is speed vs accuracy.

5) How long do you intend to keep this data? This is a pretty big concern with regards to privacy and storage costs.

If you'd rather not share some of these answers on a public forum, feel free to shoot me an email.
princeverma almost 6 years ago
Start by adopting https://github.com/snowplow/snowplow, then grow as and where you feel restricted.
gt565k almost 6 years ago
You can just crunch your data with SQL / service-layer code in a background worker and store it in Redis. Then you can use the objects from Redis to render charts, build dashboards, etc.

Structure your code so you crunch your historical data once, store it in Redis, and then new data gets shoved into the Redis cache as the time dimensions on your metrics progress, based on business logic.

Until your data is at enterprise volume, you really don't need an OLAP system.
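A small sketch of that pattern, assuming Postgres for the raw events and Redis for the pre-crunched aggregates (key names, query, and cadence are illustrative):

    # Background worker: crunch an aggregate with SQL, cache the result in Redis,
    # and let the dashboard read from Redis only. Names and cadence are illustrative.
    import json
    import psycopg2
    import redis

    pg = psycopg2.connect("dbname=analytics user=app")
    cache = redis.Redis(host="localhost", port=6379)

    def refresh_top_endpoints() -> None:
        with pg, pg.cursor() as cur:
            cur.execute("""
                SELECT endpoint, count(*) AS hits
                FROM events
                WHERE occurred_at > now() - interval '1 day'
                GROUP BY endpoint
                ORDER BY hits DESC
                LIMIT 10
            """)
            rows = [{"endpoint": e, "hits": h} for e, h in cur.fetchall()]
        # Dashboards read this key; expire it so a stalled worker is noticeable.
        cache.set("metrics:top_endpoints:24h", json.dumps(rows), ex=60 * 30)

    refresh_top_endpoints()  # run this from cron / a scheduled worker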
buremba almost 6 years ago
I have an open-source project that collects customer events via SDKs and stores them in a data warehouse.

It's a distributed system: the mobile and web SDKs batch the user events on their devices and push them to our API in JSON format. The API servers enrich & sanitize the data, validate the schema, convert it to a serialized AVRO binary, and push it to a commit-log system such as Kinesis or Kafka (it's pluggable).

We have another project that fetches data from Kafka & Kinesis in small batches, converts the data into columnar format and stores it in an S3 bucket / Google Cloud Storage. Then, we integrate their preferred data warehouse with their distributed filesystem. That way they have all their raw data in their infrastructure for other systems such as fraud detection, recommendation, etc., but they have SQL access to their data as well.

That being said, this architecture is for >100M events per month. If your data is not that much, you can actually ingest your data into an RDBMS and it just works fine. We support PostgreSQL at Rakam, and all you need in that case is the API server and a PostgreSQL instance. Our open-source version supports PostgreSQL, so you can look into the source code here: https://github.com/rakam-io/rakam Would love to get some contributions. :)

For the analysis part, all these metrics can be created using just SQL; the modern data-warehouse solutions (BigQuery and Snowflake) also support JavaScript, and it's relatively easy to build funnel & retention queries that way. It requires more work, but you have more control & flexibility over your data.
codingdave almost 6 years ago
This sounds like a classic case of build vs. buy. If analytics are not your core product, inventing a new solution is going to cost you more than buying an existing analytics solution. There are dozens, a few of which have even been in the news the last few days due to acquisitions.

I'm not going to endorse any of them over the others, but I will say you'll be better off using a 3rd party than coding this yourself.
drunkpotato almost 6 years ago
The quickest and easiest thing to do would be to hook up Segment or a similar system (heap analytics, google analytics, etc). I would stay away from GA given my own choice though. It’s free but google won’t give your own data back to you without an enterprise agreement which runs 6 figures minimum. For open source there’s snowplow, which I haven’t used but many in the data community do.
petercooper almost 6 years ago
If your analytics are merely a 'nice to have' and losing a day or two of results would be acceptable in a crisis, I'd log everything to Redis and then run a daily report to drag aggregated values into another database system. I would avoid clogging your main database system up with analytics-related queries on a day-to-day basis, for sure.
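A sketch of that split, with Redis as the cheap write buffer and a daily job rolling events up into a separate reporting database (list key, table, and schema are illustrative assumptions):

    # Hot path: append each event to a Redis list (fast, off the main DB).
    # Daily job: drain the list, aggregate, and store the rollup elsewhere.
    import json
    from collections import Counter

    import psycopg2
    import redis

    cache = redis.Redis()
    reporting_db = psycopg2.connect("dbname=reporting user=app")

    def record_event(event: dict) -> None:
        cache.rpush("raw_events", json.dumps(event))

    def daily_rollup() -> None:
        hits = Counter()
        while True:
            raw = cache.lpop("raw_events")
            if raw is None:
                break
            hits[json.loads(raw)["endpoint"]] += 1
        with reporting_db, reporting_db.cursor() as cur:
            for endpoint, count in hits.items():
                cur.execute(
                    "INSERT INTO daily_endpoint_hits (day, endpoint, hits) "
                    "VALUES (current_date, %s, %s)",
                    (endpoint, count),
                )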
gorkemcetin almost 6 years ago
Definitely not a "good talk / blog post / video course" type of thing, however if you are interested in how we built Countly from the ground up, together with the technology stack behind it, you can check our source code here:

https://github.com/countly/countly-server

While we have used MongoDB, Node.js and Linux as the underlying platforms, there are several options out there you may check.

Note that some (if not most) of the effort would go into SDK development, and to tell you the truth, SDK development is no easy task as it requires knowledge of how different platforms work.

The point is (and take it as a warning): you will never be satisfied with what you have - after you are done with vital data, then there are custom events, raw data, user profiles, online users, and then you will start eating your own dog food as it becomes your part-time job.
erlanganalytics almost 6 years ago
Erlang Factory 2014 -- Step By Step Guide to Building an Application Analytics System

    https://www.youtube.com/watch?v=XBuQg1ZElao
    http://www.erlang-factory.com/static/upload/media/1395920416434704ef2014_anton_lavrik.pdf
vinay_ys almost 6 years ago
For such a small scale, you can use a simple event tracking schema from your client-side and server-side code and have a simple stream processor to join these events and then save them to a simple event table in a SQL database. The DB tech you choose should be something suitable for OLAP workloads. For your scale, PostgreSQL or MySQL would work just fine. When your data grows you can look at more distributed systems like Vertica or MemSQL or ClickHouse etc.

In this architecture, most of your brain cycles will go into designing the queries for generating aggregates at regular intervals from the raw events table and storing them in various aggregate tables. You should be familiar with fact and dimension tables as understood in the data warehouse context.
ergest almost 6 years ago
I've worked with a GA implementation before, which I don't recommend if you want to own your data or if you want unsampled, detailed logs. I've also seen a full end-to-end implementation that uses server log shipping to S3, log parsing, and complicated ETL processes, which I also don't recommend due to the sheer effort it would take to build.

I'd say go with something like Matomo (formerly Piwik): https://matomo.org. If you wanted to build your own, I'd suggest keeping it simple. Look at Matomo's architecture and replicate it: https://github.com/matomo-org/matomo
anonymousDan almost 6 years ago
Designing Data-Intensive Applications, M. Kleppmann
sfifs almost 6 years ago
If it's only a few thousand enterprise users, I'd actually say log to a bog-standard relational database from the server side, like MySQL or Postgres. Think through table schemas for everything you're going to log and make sure primary keys and nomenclature for everything talk to each other. Virtually any analytics platform or software talks to standard databases. Record as much as you can, because analytics use cases typically get generated following first data collection and analysis and are iterative - so you also want to build in flexibility.
norejisace almost 6 years ago
I have a slightly adjusted question: are there any good talks / online training programs that touch on digital measurement across channels (media pixels, web analytics, 3rd-party data, etc.)? Any pointers?
reilly3000 almost 6 years ago
Some questions for you:

- Who will be viewing these reports when they are done? Who do you want to have a view of the data eventually?

- How fresh do you need the data to be? Is 24 hours, 4 hours, or 4 seconds okay to wait?

- Do you need to be alerted of anomalies in the data?

- How long do you intend to store the raw data? Aggregated data?

- Does your data need to contain anything that could personally identify a user in order to make a useful analysis? Do you serve customers in the EU?

I'll check back later today and see if I can provide any insights based on your response.
Spone almost 6 years ago
Ahoy (https://github.com/ankane/ahoy) is an interesting tool that we use to replace Google Analytics in most projects now.

It covers all the basic needs, and even if you're not using Rails, I think you can draw inspiration from it!
nwsm almost 6 years ago
Since no one is answering the question, this talk by Sonos engineers at AWS re:Invent 2015 is really good:

https://www.youtube.com/watch?v=-70wNNrxf6Q
unixhero almost 6 years ago
Here you go:

AWS re:Invent 2017: Building Serverless ETL Pipelines with AWS Glue (ABD315)

https://www.youtube.com/watch?v=eQBHIINW8VY&t=2692s
ssvss almost 6 years ago
https://druid.apache.org/
unixhero almost 6 years ago
AWS: SFTP -> S3 -> Glue -> Redshift ... PowerBI, Tableau
tedmiston almost 6 years ago
I helped build an analytics platform that served millions of events. Nothing about it is really difficult *until you scale heavily*.

The tl;dr - it's not worth your time / energy to build from scratch at this scale. Leveraging an existing standard like the analytics.js spec [1] makes web analytics very easy and quick to get started. With this approach you just have to add one JS snippet to your website and never need to update it in your code. If you are interested in the internals, you might enjoy digging deeper into the spec to understand why it was designed this way.

Two services that implement this spec are Segment [2] and MetaRouter [3] [full disclosure: I helped build the product that became MetaRouter at a previous job]. They have different target audiences and pricing models but both are worth a look.

You can think of these types of services as a meta analytics service that routes your events to destination analytics services and data stores of your choice. The great thing about using the standard is you can benefit from all of the many integrations that have already been created with various analytics services, databases, data warehouses, etc. [4]. These destination catalogs can also help you decide what services to explore and try next as you need more advanced features.

To get started with a meta analytics service, in the management dashboard, just add your API keys and config values for each service. For a simple service like Google Analytics this is literally just one simple key to copy and paste.

As far as adding custom event monitoring to your site, within the analytics.js spec, first you mainly want to be concerned with the Track call [5], which is a way to say: for an arbitrary event, e.g. ProductAddedToCart, I would like to attach this JSON object of properties, e.g. a product name and price.

And finally, user info like name, email, IP, etc. is handled by Identify [6]. You can add custom fields too (traits on an identify are ~= properties on a track event, but less transient).

Going with an existing standard and SaaS-based approach will save a ton of time and engineering effort.

[1]: https://segment.com/docs/spec/
[2]: https://segment.com/
[3]: https://www.metarouter.io/
[4]: https://segment.com/docs/destinations/
[5]: https://segment.com/docs/spec/track/#example
[6]: https://segment.com/docs/spec/identify/#example
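To make the Track / Identify distinction concrete, a minimal server-side sketch using Segment's Python library (analytics-python); the write key, user id, and event properties are placeholders:

    # Server-side example of the two core analytics.js-spec calls: Identify ties
    # traits to a user, Track records an event with arbitrary properties.
    # The write key and all values below are placeholders.
    import analytics  # pip install analytics-python

    analytics.write_key = "YOUR_SEGMENT_WRITE_KEY"

    # Identify: who the user is (traits persist across events).
    analytics.identify("user_123", {
        "email": "jane@example.com",
        "plan": "pro",
    })

    # Track: what the user did, with a JSON object of properties.
    analytics.track("user_123", "Product Added To Cart", {
        "product_name": "Blue Widget",
        "price": 19.99,
        "currency": "USD",
    })

    analytics.flush()  # force-send queued events before the process exits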