Contributed by Calpont, InfiniDB Community Edition is an open source, scale-up analytics database engine for your data warehousing, business intelligence and read-intensive application needs. Enabled via MySQLTM and purpose-built for an analytical workload with column-oriented technology at its core, the multi-threaded capabilities of InfiniDB Community Edition fully encompass query, transactional support and bulk load operations.  So come on in, grab a download and get started.

             | 

Welcome, Guest
Username Password: Remember me

To Normalize or not?
(1 viewing) (1) Guest
  • Page:
  • 1
  • 2

TOPIC: To Normalize or not?

To Normalize or not? 1 year, 11 months ago #461

  • srkiNZ84
  • OFFLINE
  • Fresh Boarder
  • Posts: 3
  • Karma: 0
Hi,

Currently I am working with a very large table of approx. 300M rows in SQL Server which is being constantly updated (The table is used to record log entries). The table schema is quite inefficient/inelegant in my opinion as it is a very flat structure consisting of about 15 columns, most of which are of the type VARCHAR(255, NULL). The table also contains an extremely large amount of duplicate data.

As a part of setting up a BI/Analysis tool, I would like to do two things (in no particular order):

1. Import the data into Calpont's InfiniDB

2. Take this database and normalize it, turning as many of VARCHAR columns into INT's acting as foreign keys to tables with distinct values.

Essentially what the step requires is to do a 'SELECT DISTINCT [column name] INTO [new table]', where [new table] will have an auto increment, primary key column. The transformation should then add a foreign key column in the original table pointing to the value in the [new table] and finally drop the [column name] column.

I've been told on a seperate forum that there is no point in normalizing the data, and that it might even degrade performance in InfiniDB to restructure the schema this way. Is this true? Am I better off keeping the data as one large table of VARCHAR values?

Thank you

Re:To Normalize or not? 1 year, 11 months ago #480

  • davidhill
  • OFFLINE
  • Moderator
  • Posts: 554
  • Karma: 4
Thanks for checking out InfiniDB and appreciate any feedback you have.

We are looking into your request and should get a response posted in the next day.

Thanks, David

Re:To Normalize or not? 1 year, 11 months ago #487

  • jtommaney
  • OFFLINE
  • Senior Boarder
  • Posts: 49
  • Karma: 4
First of all thanks for the good questions.

There are a couple of thing in place here that may help guide your answer with Calpont’s InfiniDB engine.

First, a brief description of our dictionary handling, i.e. strings longer than 8 bytes. We will actually store those in a separate structure and store an 8-byte key pointing to the strings. So, in a way the normalization you describe is already happening under the covers without you having to create separate structures. There can be some places where you may know there are only a small number of possible strings, and you could implement a normalization using a 2 byte or a 1 byte key that would save space vs. our basic implementation.

The whole question about flat file vs. normalized is complex, but is based on 2 features;

1. We do have multi-threaded and optionally distributed scan operations against the fact table that allow us to churn through even de-normalized data very quickly. The column storage works well with these operations by significantly reduction I/O for many analytic queries. So we can run well against the flat file data.

2. However, we also have multi-threaded and optionally distributed hash-join operations that allow for joining to the fact table very quickly. In addition, we handle a large number of joins very well:

Join rate trended through 20 dimensions:
www.infinidb.org/myblog-admin/profiling-...nidb-multi-join.html

Star Schema Benchmark:
www.infinidb.org/myblog-admin/mysql-para...s-via-infinidb-.html


Are we generally faster if you don’t have to join? Yes, but that isn’t always the complete picture. If you save enough bytes in storage then InfiniDB can be faster when joining. In addition, saving bytes may allow more queries to be satisfied from cache and gaining overall acceleration

So, I would use these general guidelines in this general order:

1) Tight data type declarations. Because we don’t (yet) have compression for non-dictionary data types, if you declare a bigint when you only need an int data type then there will be storing and moving around twice as much data as needed.

2) Using codes, flags extensively. If you can replace ‘Promotional Code 001’ - a varchar(40) with the value ‘P001’ - a varchar(4) there can be significant space and performance benefits. And the 4-byte string is probably faster as a fact table column. So any ETL transform process that allows for replacing a long string with a non-dictionary string is highly useful.

3) Use a flexible table definition. Because additional columns not referenced in a query only impact query speed when they are used, it is may be ok to store additional, infrequently used columns.

4) For longer strings that you can’t replace on a 1-1 basis with shorter strings, then it is worth taking a look at benefits from normalization. The actual benefits depend on your hardware and your data, so there is no hard and fast rule here. One option while prototyping is to do both and compare the results. Generally very long strings with few distinct values will benefit from normalization, especially when a smaller key can be used. However, these may also be good candidates for replacement with shorter codes.

Also, we don’t (yet) have the autoincrement option available, so that may limit how you create dimension tables.

Thanks for the good questions,

Jim Tommaney
Chief Product Architect

Re:To Normalize or not? 1 year, 11 months ago #489

  • srkiNZ84
  • OFFLINE
  • Fresh Boarder
  • Posts: 3
  • Karma: 0
Thank you for the reply. I didn't realise that InfiniDB treated long strings that way. From your description, I'm guessing that it doesn't do de-duplication of values i.e. if you have two strings with the same value they will point to two different 8-byte keys, whose values are the same? (as opposed to pointing to the exact same 8-byte key).

I will try to get both types of schema (normalized and flat table) set up and benchmark them to see what kind of performance difference there is, if any.

My feeling is that the normalized dataset will most likely out perform the flat table, due to the fact that I'm testing this out on a spare development machine with a small amount of RAM, a slow hard disk, but a fast processor. My hope is to get the foreign key tables small enough, so that they fit into memory (I assume that the fact table is kept in memory?).

The fact that InfiniDB doesn't support auto incrementing columns is a bit of a pain, as I'm probably going to have to import and transform the data into an InnoDB table, to generate the keys, and then do a straight copy of the data into InfiniDB.

Will post the results of my benchmarks once I have them.

Re:To Normalize or not? 1 year, 9 months ago #656

  • pedrorjbr
  • OFFLINE
  • Fresh Boarder
  • Posts: 3
  • Karma: 0
Do u have any answer abou your first question?
"From your description, I'm guessing that it doesn't do de-duplication of values i.e. if you have two strings with the same value they will point to two different 8-byte keys, whose values are the same?"

I would like to know too.

Thanks

Re:To Normalize or not? 1 year, 9 months ago #660

  • wweeks
  • OFFLINE
  • Moderator
  • Posts: 87
  • Karma: 3
Hi Pedro,

InfiniDB will reuse the same string values for columns that have a few unique values provided you use the cpimport program for your loads. The strings will be reused when all of the values fit into a single 8K block. Each row will get it's own string once the values exceed a single block. There's a little more to it as a column is broken into multiple store files and the rule above applies to each store file.

Insert and load data infile do not currently reuse the string values. We have an internal future enhancement open to reuse strings for those. Using cpimport is the way to go if you are tables with millions as it is much faster than load data infile.

Thanks very much for posting and we appreciate your checking out InfiniDB.

Walt Weeks
  • Page:
  • 1
  • 2
Time to create page: 0.38 seconds