Everyone's talking about big data, so I thought I'd talk about my experience too. Let me tell you, this post is a bit long. I tried to keep it short but failed. So I have highlighted two sections you could read to get the gist! (But I really want you to read all of it :)
Prologue: So, we at Simplilearn decided to capture some fairly arbitrary data from our applications. We thought, well, we need something that scales, supports very high throughput, needs little maintenance and, of course, is cost effective. DynamoDB was right there. It's NoSQL, so we were not worried about structure. It's a managed service, so no worries about throughput and scalability either. We could also control the cost a little by adjusting the provisioned throughput, but in spite of that it costs us good money, so I won't give this one to DynamoDB.
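(As an aside, that throughput knob is just a single UpdateTable call. Here's a minimal boto3 sketch; the table name and capacity numbers are made up for illustration:)

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Table name and capacity units are made-up examples.
# Dialling provisioned read/write capacity up or down is what moves the bill.
dynamodb.update_table(
    TableName="app_events",
    ProvisionedThroughput={
        "ReadCapacityUnits": 50,
        "WriteCapacityUnits": 200,
    },
)
```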
We created a data store and started capturing data. There was some concern about the queries we could run, and about the fact that reading data has a cost as well. That cost was something we could not afford to ignore! So we focused on the querying part. We had to move data from DynamoDB, which is a datastore, to a warehouse. That's how it works, right? Move data from a datastore to a data warehouse. But which one? And then there was a revelation. Just kidding, we used Redshift this time.
Now, moving data from DynamoDB to Redshift is just one command. What that command does is read all the data from a DynamoDB table and insert it into a Redshift table. That's the problem right there: reads. Remember I said reading has a cost as well. Consider this: we are generating roughly 100GB of data every month, and let's say we want to analyse our data weekly. We would be copying 100GB after the first month, 125GB a week after that, 150GB a week after that ... eventually terabytes of data, every week. Now that can't happen!
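For the curious, that one command is Redshift's COPY, which can read straight from a DynamoDB table. Here's a minimal sketch, run from Python with psycopg2; the cluster endpoint, table names and IAM role are placeholders:

```python
import psycopg2  # Redshift speaks the Postgres wire protocol

# Connection details are placeholders -- point this at your own cluster.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="secret",
)

# COPY reads the *entire* DynamoDB table on every run.
# READRATIO caps the share of the table's provisioned read throughput
# the copy is allowed to consume (here, 50%).
copy_sql = """
    COPY events
    FROM 'dynamodb://app_events'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    READRATIO 50;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)
```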
So we needed a way to copy data incrementally. Enter DynamoDB Streams. A stream is a time-ordered log of every item-level modification to a table, and you read the records that have been added since you last checked. So now we can have a daily cron job that picks up yesterday's stream records and inserts/updates the corresponding rows in Redshift. That done, we could just use all the column-oriented goodness of Redshift to analyse our data. I'll post details about how to handle some special cases, like unix timestamps and JSON data, in another post.
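To make that concrete, here is a rough sketch of what such a job could look like with boto3 and psycopg2. The stream ARN, connection details, table and column names are all made up, and a real job would checkpoint shard positions between runs instead of replaying the stream from the start:

```python
import boto3
import psycopg2

# All identifiers below (stream ARN, cluster endpoint, table/column names)
# are placeholders for illustration only.
STREAM_ARN = "arn:aws:dynamodb:us-east-1:123456789012:table/app_events/stream/LABEL"

streams = boto3.client("dynamodbstreams", region_name="us-east-1")
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="secret",
)

def new_records(stream_arn):
    """Walk every shard of the stream and yield item-level change records."""
    shards = streams.describe_stream(StreamArn=stream_arn)["StreamDescription"]["Shards"]
    for shard in shards:
        iterator = streams.get_shard_iterator(
            StreamArn=stream_arn,
            ShardId=shard["ShardId"],
            ShardIteratorType="TRIM_HORIZON",  # a real job would checkpoint and resume here
        )["ShardIterator"]
        while iterator:
            resp = streams.get_records(ShardIterator=iterator, Limit=1000)
            for record in resp["Records"]:
                yield record
            if not resp["Records"]:
                break  # caught up on this shard
            iterator = resp.get("NextShardIterator")

def upsert(cur, record):
    """Simplified upsert: delete-then-insert keyed on event_id."""
    image = record["dynamodb"].get("NewImage")
    if not image:
        return  # a REMOVE event; handle deletes however suits your warehouse
    event_id = image["event_id"]["S"]
    payload = image.get("payload", {}).get("S")
    cur.execute("DELETE FROM events WHERE event_id = %s", (event_id,))
    cur.execute("INSERT INTO events (event_id, payload) VALUES (%s, %s)",
                (event_id, payload))

with conn, conn.cursor() as cur:
    for rec in new_records(STREAM_ARN):
        upsert(cur, rec)
```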
Epilogue: So there you have it. You don't use a datastore like DynamoDB for ad-hoc analytics queries. You use its high throughput, availability and reliability to capture mission-critical data. You copy this data to a data warehouse like Redshift to run all the queries the Analytics team keeps bugging you about. That way, you can also archive data month on month in DynamoDB and still have the full historical dataset in Redshift.
Also, we are now evaluating Google's BigQuery. It supports a datatype called Record, which is basically the kind of nested data you would get from a NoSQL db. Its support for date functions also makes it very easy to work with my kind of data. I will discuss that in an upcoming post. Till then, hang around!!!
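To give a feel for what a Record looks like, here is a tiny sketch using the google-cloud-bigquery client. The project, dataset, table and field names are invented for illustration:

```python
from google.cloud import bigquery

client = bigquery.Client()

# A RECORD field nests sub-fields inside a single column, much like the
# nested documents you'd get out of a NoSQL store.
# Project, dataset, table and field names below are made up.
schema = [
    bigquery.SchemaField("event_id", "STRING"),
    bigquery.SchemaField("created_at", "TIMESTAMP"),
    bigquery.SchemaField(
        "user", "RECORD",
        fields=[
            bigquery.SchemaField("id", "STRING"),
            bigquery.SchemaField("plan", "STRING"),
        ],
    ),
]

table = bigquery.Table("my-project.analytics.app_events", schema=schema)
client.create_table(table)
```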