Browsed by
Month: September 2010

Hierarchical data with SCD Type 2

Hierarchical data with SCD Type 2

The question that is always asked with storing data with Slowly Changing Dimensions, is to what extent you want to keep the historical data. I would like to explain a situation of hierarchical data that is stored with the SCD method and how this affects the data.

As an example where I first got in touch with this subject, I’ll take the hierarchical relationship between employees and managers.
The employees’ data is stored as SCD Type 2, and this table also includes a column “Manager”. This column has a foreign key relationship to the surrogate key of the same table. So what does this do to your data?
To keep the example simple, I’ll take two employees (John and Jane), and Bill is their manager. This results in the following structure in the table:
Let’s say Bill moves to another location as of October 1st 2010. Since I have the column “Location” also set to a type 2 change, you might expect the following changes and additions to the table:
The changes for Bill seems correct, an extra line is added with a new surrogate key (line 4) and his original line has gotten an end date. Now I still know that Bill has been working in New York and then in Boston.
Nothing special than a simple SCD type 2 change until so far. But the above situation, is that really what should happen, isn’t there more that should be done? I probably wouldn’t have posted this if it was just that 🙂
Look at John and Jane. They have Bill as their manager because they both have the surrogate key of Bill in the column “ManagerKey”. The active record for Bill has now been changed to key 4, so Jane and Joe have to “change manager” as well. I have put this as a quote, because they don’t really change managers, but the link to their manager’s record in the employees table has changed.
Therefore, the end result of Bill’s change adds the following lines to the table for his subordinates:
In this example it doesn’t seem like a big change, but this entails that every single change in a manager’s data, the subordinates all the way down will get a new record in the table as well. In my case, there is a new CEO in the organization that led to these changes for every one of the 2500 employees in the company!
We’re not talking about a lot of data here, but in other hierarchical structures than employee data, this could lead to a massive data load that is not even necessary. Therefore you should ask yourself if you really need that information so much in detail. If you don’t need this, there are two options that look like the most straight-forward options:
  1. Change the column “ManagerKey” to an SCD Type 1 column. If this is used in the above example, the manager’s key in John’s and Jane’s profile will be overwritten from 3 to 4. A drawback of this option is that you lose the history of who has been whose manager.
  2. If you need to keep the history, you can use a foreign key relationship not on the surrogate key, but on the business key. In the above example, John and Jane would have the value 145 in the column “ManagerKey”, and nothing changes with their information and no new record for them are added.
It is advised to make this choice early on in your design phase, since going back from one to another could be a very time consuming job!
Dimensional model in SCD2 (part 2)

Dimensional model in SCD2 (part 2)

Continued from part 1:

In addition to my previous post, I thought it would be helpful to display screenshots of how I designed my dimension.

This is the attributes tab:
… and this is how the attribute relationships look like:
With these attributes, I would like to be able to do analyses like the number of resources per gender, per country, per cost center, per language, etc…
Dimensional model in SCD2 (part 1)

Dimensional model in SCD2 (part 1)

This post is dedicated to an issue I’ve been dealing with for a while now and until so far I haven’t found a solid solution yet. Since I’ve been explaining and asking this to several experts over and over again, I’ll dedicate a blog post to it.

I’m a big fan of using Kimball’s Slowly Changing Dimensions, especially since it keeps track of historical data and it doesn’t explode your database. But… mapping this data into an SSAS dimension is still an effort, and let me explain my challenge by using some example data:

First of all, my dimension table looks like this:

Let’s take John’s example. John started the company on January 1st 2009 in the sales department in Delft. Halfway through the year they realized that he is needed in the sales team in Amsterdam, therefore John is transferred to the Amsterdam office on June 1st. Since John is staying in Amsterdam longer than expected, he is assigned another manager in February 2010.

In my fact table, the sales activities of John look like this:

In wording, he won a deal of 3000 euros when he was in the Delft office, a deal of 1000 euros after he moved to Amsterdam and another deal of 6000 euros after he was assigned another manager.

With this data, what I would like to achieve first of all is an analysis of the sales activities per location with a drilldown of sales reps on that location. In other words, it should show that the Delft location has a 7000 euro turnover (record 1 and 2) and that the Amsterdam location also has 7000 (record 3 and 4). With the drilldown, I should be able to see that it was John who was responsible for these sales.

Into SSAS where my challenge already starts with the creation of the dimension. These are the steps that I take:

1. Creation of the dimension: I create the dimension according to my dimension table, and the dimension is created as usual. No problems here.

2. Adding a hierarchy: I add a hierarchy according to the following structure: Location –> Business Key –> Surrogate Key.

3. Attribute relationships: By creating the hierarchy in the previous step, I also have to set the attribute relationships accordingly.

4. Deploy and process the dimension.

Processing the dimension generates the error that a duplicate key attribute has been found on the field “BusinessKey”. I understand the error, because if the location is a parent of the business key in the hierarchy, it goes wrong because there are more than 1 locations for one business key.

Then I continue with a few settings:

5. I set the key value for the field “BusinessKey” to the fields “BusinessKey” (as it originally was) and the field “SurrogateKey”.

6. Deploy and process the dimension.

The processing goes well, but when I browse the dimension, the hierarchy is displayed as follows:

  • Delft
    • Erik Monchen
      • 1
    • Erik Monchen
      • 2
    • Erik Monchen
      • 3
    • John Smith
      • 4
  • Amsterdam
    • John Smith
      • 5
    • John Smith
      • 6

I also understand why this is the behaviour because the key value of the business key is set to the business key and the surrogate key, but this is the only way how I can use the dimension without having errors…

In the end, I would like to achieve a structure like this:

  • Delft
    • Erik Monchen
      • 1
      • 2
      • 3
    • John Smith
      • 4
  • Amsterdam
    • John Smith
      • 5
      • 6

This way, it’s easy to link the fact table to the surrogate key value and the sales totals are added up correctly per location. Also, when I browse the cube where I’ve linked this dimension, it will display the location first, then when I drill down, it displays the names over and over again because that’s the next level in the hierarchy. If there is a way to change this view, I’d be open to use that as well though and not change anything to the dimension! The whole point of this exercise is to avoid having a list of double names when I browse the cube…

Since the method of Slowly Changing Dimensions is widely used, I was wondering how other people have accomplished this. I’m sure I’m not the only one out there facing this challenge!

Thanks in advance for your feedback!