Cassandra Data Modeling

I ended up having to miss the JHipster webinar last week because my company invited me to attend the DataStax DS220: Data Modeling with DataStax Enterprise class on Monday and Tuesday. DataStax came out and taught the class onsite. The instructor was Andrew Lenards and he did a great job.

I have been using Cassandra for a little while, but I hadn’t done anything serious with it. CQL is at once a great blessing and a curse. On the upside, it is immediately familiar, so anyone who has done SQL work can get comfortable creating tables and executing queries quickly. On the downside, it abstracts some things about the data store away from you, and I think at a certain point, for performance, you need to understand what is going on under the hood.

This class gave us that. It starts out presenting a data model like you might see in a relational database, and then you work through the ways you might model that data in Cassandra and the trade-offs of the different models (which questions you can ask, which fields are required to ask those questions, etc.).

One of the biggest things I was missing prior to the class was the whole concept of partitions vs. rows, and what the partition key is vs. the clustering keys. I had been using the data store like a SQL database, so my partitions always had at most one row. We spent a lot of time looking at what happens if we instead model the data so that partitions have many rows, and what the advantages and disadvantages of doing so are.

On day two we got very deep into the technical aspects of what is going on under the hood: how data is stored on disk and how to do things like estimate partition sizes. We were also able to ask a lot of questions specific to how we have been using Cassandra in our organization and what the limitations are going to be as we expand its usage to even more areas of our product.
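To make the partition vs. row distinction a bit more concrete, here is a toy Python sketch (not Cassandra code, and not material from the class). It assumes a hypothetical sensor-readings table whose primary key is ((sensor_id), reading_time), so sensor_id is the partition key and reading_time is the clustering key. The ToyTable class and the estimate_cell_count helper are names made up for illustration, and the cell-count formula is a commonly quoted rule of thumb for partition sizing, so treat it as an approximation rather than how Cassandra actually accounts for storage.

```python
# Toy in-memory model of partitions and rows. Everything here is illustrative.
# Assumed (hypothetical) table: PRIMARY KEY ((sensor_id), reading_time)


class ToyTable:
    def __init__(self):
        # partition key -> {clustering key -> value}
        self.partitions = {}

    def insert(self, partition_key, clustering_key, value):
        # Writing the same full primary key twice overwrites the row (an upsert).
        rows = self.partitions.setdefault(partition_key, {})
        rows[clustering_key] = value

    def read_partition(self, partition_key):
        # Rows inside a partition come back ordered by clustering key.
        rows = self.partitions.get(partition_key, {})
        return sorted(rows.items())


def estimate_cell_count(rows, columns, primary_key_columns, static_columns=0):
    # Rule-of-thumb estimate (an assumption, not an exact storage formula):
    # every row stores one cell per regular column, and static columns are
    # stored once per partition.
    regular_columns = columns - primary_key_columns - static_columns
    return rows * regular_columns + static_columns


table = ToyTable()
table.insert("sensor-1", 3, 20.5)
table.insert("sensor-1", 1, 19.0)
table.insert("sensor-2", 2, 30.0)

# One partition, many rows, ordered by the clustering key.
print(table.read_partition("sensor-1"))

# Rough cell count for a partition of 1000 rows in a 4-column table
# where 2 of the columns make up the primary key.
print(estimate_cell_count(rows=1000, columns=4, primary_key_columns=2))
```

If the clustering key were dropped, every insert for a given sensor would overwrite the previous one, which is the "at most one row per partition" situation described above.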