It's interesting work, but it's not really 'big data'.<p>"Every day, we collect around 8 million data points on exercise and video interactions, and a few million more around community discussion, computer science programs, and registrations. Not to mention the raw web request logs, and some client-side events we send to MixPanel."<p>OK - 8 million records per day. Let's double that for the argument's sake.<p>Even if they were fairly fat records (1 KB each), that's only 16 GB / day. That makes it around 2 months / TB.<p>I can easily put together a machine with 20 TB of storage and run a traditional free relational DB (or even a single free node of Greenplum) and store more than 3 years of this data.<p>Then bang against it with SQL. Transactions are free.
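For anyone who wants to check the arithmetic above, here is a rough back-of-envelope sketch. Every input is an assumption taken from the comment (doubled record count, 1 KB per record, a 20 TB box), and it uses decimal units and ignores indexes, replication and compression.

```python
# Back-of-envelope storage estimate. All inputs are assumptions from the
# comment above, not measured figures from Khan Academy.
RECORDS_PER_DAY = 16_000_000   # 8 million per day, doubled for the argument's sake
BYTES_PER_RECORD = 1_000       # "fairly fat" 1 KB records (decimal units)
DISK_BYTES = 20 * 10**12       # a single machine with 20 TB of storage

bytes_per_day = RECORDS_PER_DAY * BYTES_PER_RECORD
gb_per_day = bytes_per_day / 10**9            # daily volume in GB
days_per_tb = 10**12 / bytes_per_day          # how long 1 TB lasts
years_on_disk = DISK_BYTES / bytes_per_day / 365

print(f"{gb_per_day:.0f} GB/day")
print(f"{days_per_tb:.1f} days per TB")
print(f"{years_on_disk:.2f} years on 20 TB")
```

Under those assumptions a terabyte lasts a bit over two months and the 20 TB machine holds a little more than three years of raw data.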
Interesting.<p>There's a whole emerging field called "learning analytics", which at the moment appears to be more a theoretically good idea than anything with practical outcomes (sadly, much in education is like this: something will emerge in the technology field, and then 6 months later there will be a XXX-in-education movement) - although Khan Academy is in a good position to get that data and use it.<p>But for those of you who have kids who do Kumon Math (or similar), it's pretty easy to see how analytics could speed up the Kumon process (of selecting questions that exercise very specific skills).<p>For those interested, there is an upcoming "Big Data in Education" Coursera course[1] that I'm planning on doing. It will be my first Coursera experience, so I'm not quite sure what to expect. I'm in the fortunate position of having access to a fairly significant amount of educational usage data, so I'm hoping it will be useful.<p>[1] <a href="https://www.coursera.org/course/bigdata-edu" rel="nofollow">https://www.coursera.org/course/bigdata-edu</a>
Isn't this a flawed approach? It seems like Khan Academy is trying to reconstruct a record of behaviours across their business by stitching together:<p>1. Parsing web logs for web page views and API accesses<p>2. Exporting "some client-side events" from MixPanel<p>3. Mining their transactional databases for state changes<p>On #1 - web caching and client-side events have long invalidated web-log-based analytics approaches. How is Khan different?<p>On #3 - this is reverse engineering your user behaviours by mining state changes in your transactional systems. This is typically a ton of work, it breaks when you change your data models, and your operational systems aren't designed to reveal user behaviours anyway.<p>Has Khan Academy explored alternative approaches? Typically: defining with the analyst team a set of events you want to monitor, making sure all of your systems (client-side, mobile, server-side, whatever) emit immutable streams of these events, and then collecting, storing, enriching, and analyzing them at your leisure.
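To make the event-stream alternative above a bit more concrete, here is a minimal sketch. It is not anything Khan Academy actually runs: `Event`, `EventLog`, the field names and the example event names are all hypothetical, and an in-memory list stands in for whatever durable append-only store you would really use.

```python
import json
import time
import uuid
from dataclasses import FrozenInstanceError, asdict, dataclass, field


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Event:
    """Hypothetical event shape agreed with the analyst team."""
    name: str                    # e.g. "exercise_answered"
    user_id: str
    source: str                  # "web", "mobile", "server", ...
    properties: dict = field(default_factory=dict)
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


class EventLog:
    """Append-only log; a list of JSON lines stands in for real storage."""

    def __init__(self):
        self._lines = []

    def emit(self, event):
        # Serialize once at write time; stored lines are never edited.
        self._lines.append(json.dumps(asdict(event), sort_keys=True))

    def read(self):
        return [json.loads(line) for line in self._lines]


log = EventLog()
log.emit(Event("exercise_answered", "u1", "web",
               {"exercise": "fractions-1", "correct": True}))
log.emit(Event("video_paused", "u1", "mobile",
               {"video": "intro-to-algebra", "at_seconds": 42}))

# Analysis happens later, straight off the stream, not off transactional tables.
counts = {}
for e in log.read():
    counts[e["name"]] = counts.get(e["name"], 0) + 1
print(counts)
```

The point of the design is that every system emits the same event shape, so the analysis code never needs to know how the operational data models are laid out.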
This was a nice read but I'm much more interested to know what they do with the data. From hanging around "big data" people the emphasis still seems to be on storage and simple SQL-esque querying. For most people this is a solved problem, and it's time to go beyond storage and see what value we can get from data. I believe in most cases this requires a different skill set <i>and</i> different mindset. Most people think in binary terms, but statistical models deal with shades of grey -- nothing is ever certain -- and even simple models like linear regression are difficult for the untrained to understand.