A Data Warehousing Primer for Database Designers

 

Over the last several years, data warehousing has emerged as a distinct discipline in the information technology world. As a database designer, you may wonder what, if anything, is different about this approach to systems development. This article outlines some of the fundamentals of data warehousing and suggests, in broad terms, where this technology might be applicable.

For the purposes of this article, we’ll look at data warehousing’s logical and physical design criteria. The considerations for logical design will focus primarily on the database schema. The physical design discussion will zero in on the unique aspects of a data warehouse and why, for certain applications, some database products are better than others. Note that I’ll be discussing only the most common solutions and products in this arena. This article is not intended to be comprehensive and there are many business problems and products in the marketplace which do not fit into this article’s framework.

Relational database usage falls primarily into two main categories: transaction processing systems (OLTP) and analytical systems (OLAP). When E.F. Codd laid out his rules for OLTP relational database design in the 1980’s, he also set forth a similar set of rules for OLAP. His primary point for creating this distinction is still valid: namely, that both approaches to database design are equally sound from a theoretical construct point of view. Thus your selection of one or the other design should be determined by the functional requirements of the system.

Transaction processing systems are used by businesses both to record and view information in real-time. Such systems are typically optimized for supporting a specific set of business processes. Although historical data is available in such systems, the main design goal is to support the needs of the people who run the business operations. For example, though an order entry system can undoubtedly provide the marketing department with useful historical information about what customers are buying at various times of the year, the primary goal in designing the system is to meet the needs of users of current business information. The design is defined by sale’s need to track the order, manufacturing’s need to build it, logistics’ need to ship it and finance’s requirements for invoicing the customer. Once the customer pays the bill, the need for access to the discrete records by the business operations people diminishes.

When building such a system, the database designer will almost certainly encounter performance constraints. How much data does the system need to retain and for how long? How many transactions does it need to support in a given timeframe? How shall business rules and relational integrity be implemented and enforced? When all is said and done, the end result will probably look a lot like a textbook third-normal form schema. Since writes to the database happen, on an approximate basis, as often as reads, care is taken to ensure that key fields are not duplicated, thus allowing efficient updates to data.

Query processing is typically done on a single transactional unit of data. For example, to look at a single order the database designer might create a query that pulls information from the following tables: customer, address, order, order detail, product and price. Although this joins a number of individual tables, the very nature of the query will constrain the order table to a single row. This will axiomatically constrain customer and address to a single row. Order details is likely to be a manageable number of rows and the remainder of the tables will be limited to a number of rows equivalent to the number of order details. So, for an order of 6 items, our query will need to retrieve just 21 rows from 6 separate tables.

Now imagine a query to the same database to report sales by department by product by fiscal quarter over the last two years. Assume that we’re a mail-order apparel firm that grosses $200M annually with an average order size of 6 items and an average price per item of $25.00. This query will need to traverse over 50 million rows to fetch the raw data needed for the report. A third normal form database schema may not necessarily perform very well in processing this query. To compound the problem, what if there is no algorithmetric method of determining which dates belong in a particular fiscal year or if your queries need to incorporate the concepts of this year vs. last year, or quarter-to-date? What if the company was reorganized recently and you need to be able to report last year’s data as if it reflected the reorganization and this year’s data as if the reorganization had not taken place? Chances are, these concepts are beyond the scope of the line-of-business requirements. Enter the data warehouse or datamart (N.B. Think of a datamart as a data warehouse that covers a single subject or operational area, such as sales).

The heart of the data warehouse is the dimensional (a.k.a. star or snowflake) schema: a central fact table surrounded by dimension and transform tables. Think of the fact table as the business process you are measuring, such as a sale or shipment. The dimension tables will contain the selections and their attributes which will be used to constrain the records in the fact table, as well as provide the values used for grouping and sorting the results.

The simple example above models a retail business that operates stores in a number of different markets. Each row in the fact table (Sale Fact) contains information about the sale of an item (UnitsSold and UnitPrice). Notice that the fact table is inherently normalized. You should also notice that the facts are additive along all dimensions, i.e. no matter which dimension we choose to group the sales facts by, we can sum either UnitsSold or UnitPrice and the results make sense. Semiadditive facts can only be summed in certain dimensions. An example is a monthly account balance fact, which is not additive along the time dimension (try adding up the ending monthly balance of your checking account for a year and see if the resultant number is meaningful!).

Another consideration in the schema is the grain of the fact table. We might choose to have each row in the fact table represent an individual consumer transaction, daily, weekly or monthly sales summaries. Clearly, the finer the grain of the table, the larger the warehouse will be. The business analysis which (hopefully) preceded the database design will point out how fine the granularity must be in order to meet the functional requirements of the system.

We also need to decide on the metrics we wish to apply to our facts. These are the arithmetic operations we need to perform to produce the numeric results the users of the system need. Common metrics include SUM, COUNT, AVERAGE, and STD_DEV. Note that the some metrics, such as COUNT and AVERAGE, don’t necessarily correspond to their typical SQL database equivalents. Metrics which need a count of items usually require the number of unique values of a dimension attribute, not the count of rows returned. Depending on the tools used to implement the database queries, you may need to address this common shortcoming of relational databases.

The dimension tables store the descriptions or attributes of our business. To perform a query, we typically select the dimension rows which meet our desired selection criteria (e.g., all coats smaller than size 10 sold in the Northeast during 1996). The intersections (i.e. Cartesian product) of the rows selected from the dimension tables with the fact table form the result set. As a database designer, you should recognize that this typical query as the result of a many-to-many join of the dimension and fact tables.

The dimension tables in a pure star schema, as illustrated above, are not normalized. As a designer, you may have to fight the temptation to normalize these tables, turning the star into a snowflake schema. You will have to consider a number of things in making that decision. First of all, a number of very good OLAP tools on the market today (MicroStrategy’s DSSAgent is one) essentially force you to use the snowflake approach. Another consideration is how insulated the end users are from the schema design. Although a normalized database is a thing of beauty to a systems designer, it is intimidating to the average user. Denormalized dimension tables lend themselves very naturally to browsing using simple SQL SELECT DISTINCT queries. Since a data warehouse does not require individual records to be changed on a frequent basis, the update performance justification for normalization is not present. Finally, any rationalization based on table size is probably ill-considered. In the typical warehouse, we have fact tables with tens or hundreds of millions of rows and dimension tables with only hundreds or thousands. There are probably no space savings benefits that justify the added complexity.

Transform tables (not illustrated above) are used to join dimension tables back on themselves. They allow us to model useful concepts such as this year vs. last year (1:1) or year-to-date (1:M). Some OLAP tools allow you to create dimensional metrics which make it easy to form query templates that incorporate these concepts.

A powerful tool is the technique of drill-down and drill-up. As an example, envision a query that displays total sales by region by month this year vs. last year. A segment of our SQL query will look, in part, like this: SELECT SUM(UnitSales), REGION … GROUP BY REGION. In analyzing the results, the user notices that sales in the Northeast are off 20% over the same time period last year. By re-querying the database and adding additional attributes in the store dimension (e.g. SELECT SUM(UnitSales), REGION, METROMARKET … GROUP BY REGION, METROMARKET), the user might discover that the reason for the change is that sales are off in a single metropolitan area. From a design standpoint , we can see that the concept of drill-down and drill-up is simply the process of adding or subtracting row headers to our query results.

One way to extract significant query performance is through aggregations. This involves examining the way in which facts are commonly reported and adding tables, which are updated on a periodic basis, that pre-summarize data along one or more dimensions. For example, if the grain of our fact table in the example above was daily sales totals and we know that reporting by month and quarter is done regularly, we might add additional fact tables with the grain reduced to monthly and quarterly totals, respectively. More complex aggregations are possible by adding additional dimensions, e.g. region by quarter or product by store by week. The better OLAP query tools understand automatically to use an aggregate table for a query if an appropriate one is available.

A final consideration (final only in the context of this article!) concerns changing dimensions. In the real world, which we expect our database to model, change is constant. Managers change jobs, stores open and close, sales territories shift, product specifications and their suppliers change. During the business analysis phase of a project, careful attention must be paid to the requirements for "changing history". Perfectly valid outcomes are: a. we don’t care to track old history; b. segment history between changed dimension attributes; or c. add new dimension records while keeping the old. This is a fairly complex topic and further elaboration is beyond the scope of this article.

Once settled on a logical model for your data, you will need to consider a number of factors in choosing a database product or optimizing your current database manager. There are a number of somewhat esoteric products designed for use in extreme or unusual applications (very large databases, unusual data models or complex numerical analysis) which we won’t discuss here. Most datamarts and many warehouses are constructed using conventional relational databases such as Microsoft SQL Server, Sybase, DB2 and Oracle. There are also specialized relational products available that are designed for data warehousing, such as Red Brick. These specialized products offer some significant advantages over more general purpose database managers.

The typical data warehouse is updated on a periodic basis, such as nightly or weekly. This data is usually extracted from legacy or operational systems, or supplied by outside syndicates. The data is cleaned up (transformed) and loaded into the warehouse. Depending on the application, the data loaded might be simply appended to existing data, or the warehouse might be completely emptied and reloaded. In any event, changes to discrete data are usually not allowed during normal business hours. This results in some unique advantages.

First, no transaction log is needed or wanted. Since we are making sure that the data is clean before it is loaded into the warehouse, and since we don’t perform update operations in the conventional sense, we have no need to implement transactions with their associated commit and rollback overhead.

Second, data structures that promote fast retrieval, such as the bitmap index, become a viable indexing option with the data warehouse. As you know, indexes are data structures that allow random access to data records based on the values of one or more fields. In the transaction processing world, they have been implemented using the binary tree (b-tree) almost exclusively for as long as I can remember. The b-tree’s main advantage is that it allows pretty fast access to any random record, while performing pretty well when you need to insert or delete a node in the tree, as when inserting or deleting a database record. Structures such as the bitmap index are much faster at retrieval than the b-tree but have the disadvantage that, once created, they are not updateable. Unfettered by the need to provide update capability, you can drop the database indexes before every load and rebuild them after the load is complete, expanding your index options.

Third, the special purpose databases, such as Red Brick, are extremely fast at loading tables and building indexes, thereby overcoming problems that may result from the strict operational time constraints of data warehouses. A typical application running a nightly update must complete within the time allotted for it, or the system will not be available the next day. In addition to the time it takes to physically load the new records and rebuild the indexes, you might also be constrained by limited access to the systems that feed data to the warehouse. Having 14 hours of warehouse downtime daily in which to load data may not be helpful if the order entry system feeding it doesn’t finish reconciliation until 2 A.M. The special purpose databases like Red Brick respond well to these situations.

Finally, products like Red Brick can also be used to address query optimization problems. From a practical standpoint, all of the current generation of conventional RDBMS’s are somewhat eccentric at processing queries against the star schema database. The reasons are many. Often, the query processor will decide that the constraints on some of the dimension records are too complex, or weak. It will then decide to process the fact table first, testing the results against each of the remaining dimension records. This will destroy the whole indexing scheme, as the optimizer unwittingly forces a sequential scan of the biggest tables in the database in order to select records in the smallest! Unfortunately, the tools that the database administrator has at their disposal to combat the problem are indirect. They are usually limited to fiddling with the database statistics by overwriting the systems actual numbers with values intended to "fool" the optimizer into thinking that a particular query strategy is acceptable. A bigger problem is that the relationships between data elements in the data warehouse are far more deterministic than those in a conventional OLTP system. We don’t need the query processor to "discover" those relationships for us; we already know what they are!

 

Al Kozakiewicz is President of Hourglass Systems, Inc. If you have any questions or comments about this article, feel free to contact him at (518) 452-9062 or akozak@hourglass.com

 

 

[Home][Customers][Contents][Jobs][Relationships][Newsletters]