One-Page Data Warehouse Development Steps

Data warehouse is the basis of Business Intelligence (BI). It not only provides the data storage of your production data but also provides the basis of the business intelligence you need. Almost all of the books today have very elaborated and detailed steps to develop a data warehouse. However, none of them is able to address the steps in a single page. Here, based on my experience in data warehouse and BI, I summarize these steps in a page. These steps give you a clear road map and a very easy plan to follow to develop your data warehouse.

Step 1. De-Normalization. Extract an area of your production data into a “staging” table containing all data you need for future reporting and analytics. This step includes the standard ETL (extraction, transformation, and loading) process.

Step 2. Normalization. Normalize the staging table into “dimension” and “fact” tables. The data in the staging table can be disposed after this step. The resulting “dimension” and “fact” tables would form the basis of the “star” schema in your data warehouse. These data would support your basic reporting and analytics.

Step 3. Aggregation. Aggregate the fact tables into advanced fact tables with statistics and summarized data for advanced reporting and analytics. The data in the basic fact table can then be purged, if they are older than a year.

Read more at One-Page Data Warehouse Development Steps

What do you think about this topic? Share your opinions below and subscribe us to get updates in your inbox.

 

Data Lake vs Data Warehouse: Key Differences

Some of us have been hearing more about the data lake, especially during the last six months. There are those that tell us the data lake is just a reincarnation of the data warehouse—in the spirit of “been there, done that.” Others have focused on how much better this “shiny, new” data lake is, while others are standing on the shoreline screaming, “Don’t go in! It’s not a lake—it’s a swamp!”

All kidding aside, the commonality I see between the two is that they are both data storage repositories. That’s it. But I’m getting ahead of myself. Let’s first define data lake to make sure we’re all on the same page. James Dixon, the founder and CTO of Pentaho, has been credited with coming up with the term. This is how he describes a data lake:

“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

And earlier this year, my colleague, Anne Buff, and I participated in an online debate about the data lake. My rally cry was #GOdatalakeGO, while Anne insisted on #NOdatalakeNO. Here’s the definition we used during our debate:

“A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.”

Read more Data Lake vs Data Warehouse: Key Differences

What do you think about this topic? Share your opinions below and subscribe us to get updates in your inbox.