Several large systems integration consulting firm leaders in my network have confirmed a resurgence in demand for OLAP and related dimensional modeling skills. Dimensional modeling design patterns are still being used today with small data, medium data and even with big data OLAP on Hadoop implementations. I’ll be sharing tips for OLAP on Hadoop next week. This week, let’s enjoy a little dimensional modeling blast from the past.
When I talk to implementation peers, it sure sounds like Ralph Kimball’s classic “objection removers” from 2005 might still be true 12 years later despite advances in GPU acceleration and in-memory computing. Admittedly, I have tested and deployed solutions that don’t require the historical OLAP / dimensional modeling level of effort. I do know the data warehouse is not dead yet…neither is OLAP.
Since OLAP seems to be cool again, I dusted off an old SQL Saturday #8 presentation that covers this topic and re-posted it for the younger talent. Here are a few tips and resources to get you started designing good data models.
An Era of Bad Data Models
Dimensional modeling design patterns that were born in the 90s still provide tried-and-true data models for reporting accurately and efficiently over time. Today there are more automation tools for this type of work. However, most projects evolve from self-service BI tools and rely on human brain power. Thus, it is prudent to understand the basics of a good dimensional model design.
Note modern self-service BI tools and hardware in-memory performance improvements do make it easier to create and live with bad data models for a while. I have seen numerous reporting data messes generated by non-technical users with good intentions. Usually I get the call from a line of business sponsor that realizes their team needs professional help to get them back on track. Over time, bad models can become big problems to manage.
Most of the self-service BI tools apply a dimensional approach in-memory “logically” but not necessarily “physically”. Several modern self-service BI tools do still “physically” use OLAP dimensional models as the core engine and when the model grows it can be upgraded to an OLAP server or cloud hosting service.
Even though you might not immediately need a dimensional design, consider the bigger picture. Somewhere you will need to map source data to the user interface. Bigger picture questions to consider when deciding between using good data modeling techniques versus let’s get it done fast, right now, data modeling.
- Will the source data always be available?
- How will you deal with data quality when you encounter issues or duplicates?
- How will you combine you report data with other department report data?Referred to as “conforming” / sharing dimensions across multiple facts.
- How will you handle changes to people, organizations, products, etc.? Will you only be able to report on what those look like today? Test what happens if it were to change. Test reporting over more than a year of data. Can you report on both the before and after state of the data? A dimensional design pattern called slowly changing dimensions elegantly handles those issues.
- Do you want to be able to drill anywhere, not just declared hierarchies?
I could go on and on. Essentially, if you are developing data models that need to grow, scale and be accurate over time…you might need a dimensional model.
What is Dimensional Modeling
A dimensional model is also commonly called a star schema. It provides a way to improve report query performance without affecting data integrity. This type of model is popular in data warehousing because it can provide better query performance than transactional, normalized, OLTP data models. It also allows for data history to be stored accurately over time for reporting. Another reason why dimensional models are created…they are easier for non-technical users to navigate. Creating reports by joining many OLTP database tables together becomes overwhelming quickly.
Dimensional models contain facts surrounded by descriptive data called dimensions. Facts contains numerical values of what you measure such as sales or user counts that are additive, or semi-additive in nature. Fact tables also contain the keys/links to associated dimension tables. Compared to most dimension tables, fact tables typically have a large number of rows.
Dimension contain the descriptive details about the facts. Common dimensions are date, type, and location. Dimension tables are usually denormalized and have a large number of columns. These tables have far fewer rows of data than fact tables with the exception of “monster dimensions” ie. person/user. The values in a dimension table are typically used as rows and columns in reports.
In my favorite dimensional modeling resource, the IBM Redbook Dimensional Modeling: In a Business Intelligence Environment, good versus bad fact table designs are illustrated and discussed in more detail.
From there, you will define varying types of dimensions and facts to enable multidimensional data analytics from different perspectives. OLAP consists of three basic analytical operations: consolidation (roll-up), drill-down, and slicing and dicing.
You’ll also define hierarchies, pivot or drill paths. For example, a user might want to display total sales “by region and country down to stock exchange”. Hierarchies also have Natural and Unnatural flavors.
Surrogate keys are crucial in dimensional designs. Those are artificial keys that keep dimension member keys independent from source system natural keys. They are not the source system primary key. In a dimensional design, dimension to fact relationships are based on surrogate keys. That design pattern allows flexibility for data source and other changes over time.
Another critical dimensional design pattern for accurately reporting over time is called Slowly Changing Dimensions (SCDs). Slowly changing dimensions are often used with Person, Customer, Employee, or Region dimensions. If you work for a public company, provide compliance or legal reporting, you will likely need slowly changing dimensions in your dimensional design.
Slowly changing dimension types provide different approaches to overwrite or retain history.
- Type 1 Overwrite – Historical values are updated in place (replaced) so comparisons use current data only; Provides only an AS-IS view of the business
- Type 2 – Create another Dimension record as the active record; Tracks historical data by creating multiple records in dimension with separate keys, effective dates and status flags effective dates and status flags; Keys referenced in Facts as of the effective date; Historical values are not replaced and historical comparisons contain valid context; Provides AS-IS and AS-WAS views of the business
- Type 3 – Creating new columns to extend Dimension record; Can be used to view current and prior values side-by-side in same Dimension row; Partially limited to amount of created SCD columns
- Other Types 4, 6 and Hybrid – Lesser used techniques that are out of scope for an intro
Dimensional Modeling Process
The process for dimensional modeling is extremely iterative in nature. It begins with identifying the reporting requirements for a business process and working through a series of steps. In the real world, it would not be uncommon to throw together a quick proof-of-concept dimensional model in a day or two and then continually update it as more information is collected or users begin exploring demo reports.
- Identify business process requirements
- Identify the grain
- Identify the dimensions
- Identify the facts
- Verify the model
As you design your dimensional model, you’ll develop a bus matrix of “conformed” / shared dimensions across multiple areas of the business. The bus matrix helps ensure you build a model that can be expanded upon later to avoid building a reporting “silo” that only works for one area or use case.
My favorite resources for dimensional modeling include free IBM Redbooks and the Kimball Group resources. If you are lucky enough to take one of the Kimball classes, do it!
Detailed, step-by-step, dimensional modeling process instructions and examples are available in the IBM Redbook Dimensional Modeling: In a Business Intelligence Environment. It is the #1 resource that I refer to newbies to download and read since it is free.
In the free IBM Redbook, best practice
dimensional modeling techniques are taught in-depth. The authors also help readers understand how to maintain, and use a dimensional model for common business intelligence scenarios. Complex dimensional design techniques and examples are also sprinkled in throughout the book making it a wonderful resource to bookmark and revisit.
I have also found the Kimball Data Warehouse Toolkit books and templates to be truly invaluable for data warehousing projects.
To Learn Dimensional Modeling
Dimensional modeling is a valuable skill to learn. I also find the process to be fun – almost like putting together pieces in a data puzzle – pivoting, slicing and dicing data. Since system integrators are desperately trying to find talent that knows how to develop good dimensional data models for OLAP projects, taking the time to dig in and learn it could be a good investment of your time.
As mentioned above, here is my old SQL Saturday #8 session that includes a plethora of other recommended resources. Enjoy!