First of all thanks for the good questions.
There are a couple of thing in place here that may help guide your answer with Calpont’s InfiniDB engine.
First, a brief description of our dictionary handling, i.e. strings longer than 8 bytes. We will actually store those in a separate structure and store an 8-byte key pointing to the strings. So, in a way the normalization you describe is already happening under the covers without you having to create separate structures. There can be some places where you may know there are only a small number of possible strings, and you could implement a normalization using a 2 byte or a 1 byte key that would save space vs. our basic implementation.
The whole question about flat file vs. normalized is complex, but is based on 2 features;
1. We do have multi-threaded and optionally distributed scan operations against the fact table that allow us to churn through even de-normalized data very quickly. The column storage works well with these operations by significantly reduction I/O for many analytic queries. So we can run well against the flat file data.
2. However, we also have multi-threaded and optionally distributed hash-join operations that allow for joining to the fact table very quickly. In addition, we handle a large number of joins very well:
Join rate trended through 20 dimensions:
www.infinidb.org/myblog-admin/profiling-...nidb-multi-join.html
Star Schema Benchmark:
www.infinidb.org/myblog-admin/mysql-para...s-via-infinidb-.html
Are we generally faster if you don’t have to join? Yes, but that isn’t always the complete picture. If you save enough bytes in storage then InfiniDB can be faster when joining. In addition, saving bytes may allow more queries to be satisfied from cache and gaining overall acceleration
So, I would use these general guidelines in this general order:
1) Tight data type declarations. Because we don’t (yet) have compression for non-dictionary data types, if you declare a bigint when you only need an int data type then there will be storing and moving around twice as much data as needed.
2) Using codes, flags extensively. If you can replace ‘Promotional Code 001’ - a varchar(40) with the value ‘P001’ - a varchar(4) there can be significant space and performance benefits. And the 4-byte string is probably faster as a fact table column. So any ETL transform process that allows for replacing a long string with a non-dictionary string is highly useful.
3) Use a flexible table definition. Because additional columns not referenced in a query only impact query speed when they are used, it is may be ok to store additional, infrequently used columns.
4) For longer strings that you can’t replace on a 1-1 basis with shorter strings, then it is worth taking a look at benefits from normalization. The actual benefits depend on your hardware and your data, so there is no hard and fast rule here. One option while prototyping is to do both and compare the results. Generally very long strings with few distinct values will benefit from normalization, especially when a smaller key can be used. However, these may also be good candidates for replacement with shorter codes.
Also, we don’t (yet) have the autoincrement option available, so that may limit how you create dimension tables.
Thanks for the good questions,
Jim Tommaney
Chief Product Architect