Let's say you have some data that you want to do something with... to keep it simple, let's say you want to compute an average from the data samples that were taken on a weekday. Normally you'd read in the data, add up each weekday sample, and divide by the count... easy, right? Well that's great, but what if the data is too big to fit into memory? Ok, so you break it up into chunks... but how big should those chunks be*? I don't know, but you need to make sure it runs on your manager's laptop at the same time as his 40 IE browser windows. How much of this code is reusable? Maybe if you structure it carefully you can reuse most of it, but if you want to add a different calculation (say, the variance of the samples) the code will likely get more complex.
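For concreteness, here's roughly what that naive, read-everything-into-memory version might look like. This is just a minimal sketch in Python; the comma-separated `timestamp,value` file format and the `is_weekday` helper are assumptions for illustration, not anything from a real data set.

```python
from datetime import datetime

def is_weekday(timestamp):
    # weekday() returns 0..4 for Monday through Friday
    return datetime.fromisoformat(timestamp).weekday() < 5

def weekday_average(path):
    # Naive version: load every weekday sample into a list, then average it.
    # Fine until the data no longer fits in memory.
    samples = []
    with open(path) as f:
        for line in f:
            timestamp, value = line.split(",")
            if is_weekday(timestamp):
                samples.append(float(value))
    return sum(samples) / len(samples)
```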
Stream processing can solve these problems by viewing your computation as a pipeline of computations and your data as a stream of samples passing through that pipeline. For the example above, you have two pipeline stages -- a filter and an adder. The first stage filters out samples that weren't taken on a weekday, the second keeps a running sum (and count). Since samples are processed one at a time, it doesn't matter how big the underlying data set is. And the components are reusable... if you have another data source you want to remove non-weekday samples from, you just drop in the weekday filter. If you have another value you want to sum, you just drop in the adder.
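Here's a minimal sketch of what that pipeline might look like as Python generators. The stage names and the `"data.csv"` path are mine, not part of the example above, and it reuses the hypothetical `is_weekday` helper from the earlier sketch.

```python
def read_samples(path):
    # Source stage: yield one (timestamp, value) pair at a time,
    # so the full data set never has to fit in memory.
    with open(path) as f:
        for line in f:
            timestamp, value = line.split(",")
            yield timestamp, float(value)

def weekday_filter(samples):
    # Filter stage: pass through only samples taken on a weekday.
    for timestamp, value in samples:
        if is_weekday(timestamp):
            yield timestamp, value

def running_sum(samples):
    # Adder stage: keep a running sum and count instead of storing samples.
    total = count = 0
    for _, value in samples:
        total += value
        count += 1
    return total, count

# Compose the stages: each one works on any stream of (timestamp, value) pairs.
total, count = running_sum(weekday_filter(read_samples("data.csv")))
print(total / count)
```

Swapping in a different calculation, say the variance, just means dropping a different stage onto the end of the chain; the source and the weekday filter don't change at all.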
"Isn't this why god invented databases?", you ask... true, a database can do all of these things mentioned above. But maybe you don't want to go through the trouble of setting up a database, writing the code to load it and query it. Or maybe if you want the average to be per-customer but the customer id is stored in a different place. If "a different place" means another table in the same database you can just do a join... no problem. But if the customer id is stored in a different database, or in a file, or on another machine, you're out of luck.
There are more significant drawbacks to using a database, however. Many computations are difficult to express in SQL, and SQL queries are not very flexible -- they don't compose well and can be hard to update and tune. They also tend to use a lot of intermediate storage... if you want to chain two computations, the usual approach is to run the first, save its result in a temporary table, and then run the second against that. This isn't reusable, composable, or efficient. Moreover, in a database, your computation is location- and format-dependent. Treating the data as a stream abstracts the computation from the storage mechanism.
Stream processing is good. If you don't believe me, there are three hours of lectures on streams in MIT's classic SICP course that show some extremely cool things you can do with them. Another good source is this paper by Richard Fateman. If you still don't believe me, maybe the next few posts, showing some other cool things you can do with streams, will convince you.
* If you're asking "why don't you just use MapReduce / Hadoop?", I'll address that in a later post.