Data warehouse
A data warehouse is the main repository of an organization's historical data, its corporate memory. It contains the raw material for management's decision support system. The critical factor leading to the use of a data warehouse is that a data analyst can perform complex queries and analysis, such as data mining, on the information without slowing down the operational systems.
Inmon and Kimball
Bill Inmon, an early and influential practitioner, has formally defined a data warehouse in the following terms:
- Subject-oriented
- The data in the database is organized so that all the data elements relating to the same real-world event or object are linked together;
- Time-variant
- The changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time;
- Non-volatile
- Data in the database is never over-written or deleted - once committed, the data is static, read-only, but retained for future reporting; and
- Integrated
- The database contains data from most or all of an organization's operational applications, and this data is made consistent.
Ralph Kimball, a leading proponent of the dimensional approach to building data warehouses, provides a succinct definition for a data warehouse: "a copy of transaction data specifically structured for query and analysis."[1]
These two influential experts represent the prevailing views on data warehousing. Kimball, in 1997, stated that "...the data warehouse is nothing more than the union of all the data marts", indicating a bottom-up methodology in which individual data marts providing thin views into the organizational data could be created first and later combined into a larger, all-encompassing data warehouse.
Inmon responded in 1998 by saying, "You can catch all the minnows in the ocean and stack them together and they still do not make a whale," indicating the opposing view that the data warehouse should be designed from the top-down to include all corporate data. In this methodology, data marts are created only after the complete data warehouse has been created.
Data warehousing
A data warehouse might be used to find the day of the week on which a company sold the most widgets in May 1992, or how employee sick leave the week before the winter break differed between California and New York from 2001 to 2005.
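As a minimal sketch of the first of these queries, the example below uses Python's built-in sqlite3 module against a hypothetical `sales` table; the schema and data are invented for illustration:

```python
import sqlite3

# In-memory stand-in for a warehouse table; schema and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, product TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("1992-05-04", "widget", 120),  # a Monday
     ("1992-05-05", "widget", 80),   # a Tuesday
     ("1992-05-11", "widget", 95)],  # another Monday
)

# Which day of the week sold the most widgets in May 1992?
# In SQLite, strftime('%w', ...) yields '0' (Sunday) through '6' (Saturday).
row = conn.execute("""
    SELECT strftime('%w', sale_date) AS weekday, SUM(units) AS total
    FROM sales
    WHERE product = 'widget'
      AND sale_date BETWEEN '1992-05-01' AND '1992-05-31'
    GROUP BY weekday
    ORDER BY total DESC
    LIMIT 1
""").fetchone()
print(row)  # ('1', 215) -> Monday sold the most
```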
While operational systems are optimized for simplicity and speed of modification (see OLTP) through heavy use of database normalization and an entity-relationship model, the data warehouse is optimized for reporting and analysis (online analytical processing, or OLAP). Frequently, data in data warehouses are heavily denormalized, summarized, or stored in a dimension-based model. However, this is not always required to achieve acceptable query response times.
History
Data warehouses are a distinct type of computer database, first developed during the late 1980s and early 1990s. They were developed to meet a growing demand for management information and analysis that could not be met by operational systems. Operational systems were unable to meet this need for a range of reasons:
- The processing load of reporting reduced the response time of the operational systems;
- The database designs of operational systems were not optimized for information analysis and reporting;
- Most organizations had more than one operational system, so company-wide reporting could not be supported from a single system;
- Development of reports in operational systems often required writing specific computer programs, which was slow and expensive.
As a result, separate computer databases began to be built that were specifically designed to support management information and analysis purposes. These data warehouses were able to bring in data from a range of different sources, such as mainframe computers, minicomputers and personal computers, and from office automation software such as spreadsheets, and integrate this information in a single place. This capability, coupled with user-friendly reporting tools and freedom from operational impacts, has led to the growth of this type of computer system.
As technology improved (lower cost for more performance) and user requirements increased (faster data load cycle times and more features), data warehouses have evolved through several fundamental stages:
- Offline Operational Databases
- Data warehouses in this initial stage are developed by simply copying the database of an operational system to an offline server where the processing load of reporting does not impact the operational system's performance.
- Offline Data Warehouse
- Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems, and the data is stored in an integrated reporting-oriented data structure.
- Real-Time Data Warehouse
- Data warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order, a delivery or a booking).
- Integrated Data Warehouse
- Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization.
The Data Warehouse Architecture
The data warehouse architecture consists of several interconnected layers:
- Operational and external database layer: the source data for the data warehouse.
- Informational access layer: the tools that give end users access to extract and analyze the data.
- Data access layer: the interface between the operational and informational access layers.
- Metadata layer: the data directory, or repository of metadata.
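The toy script below maps these four layers onto Python's sqlite3 module, with in-memory databases standing in for the real systems; every table and name is hypothetical:

```python
import sqlite3

# 1) Operational and external database layer: the source system.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
source.execute("INSERT INTO orders VALUES (1, 9.99), (2, 14.50)")

# 4) Metadata layer: a simple data directory describing warehouse contents.
metadata = {"orders_summary": {"source": "orders", "grain": "total per load"}}

# 3) Data access layer: moves data from the source into the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders_summary (total REAL)")
total = source.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
warehouse.execute("INSERT INTO orders_summary VALUES (?)", (total,))

# 2) Informational access layer: the end-user query and analysis tools.
print(metadata["orders_summary"])
print(warehouse.execute("SELECT total FROM orders_summary").fetchone())  # (24.49,)
```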
The concept of "data warehousing" dates back at least to the mid-1980s, and possibly earlier. In essence, it was intended to provide an architectural model for the flow of data from operational systems to decision support environments. It attempted to address the various problems associated with this flow, and the high costs associated with it. In the absence of such an architecture, there usually existed an enormous amount of redundancy in the delivery of management information. In larger corporations it was typical for multiple decision support projects to operate independently, each serving different users but often requiring much of the same data. The process of gathering, cleaning and integrating data from various sources, often legacy systems, was typically replicated for each project. Moreover, legacy systems were frequently being revisited as new requirements emerged, each requiring a subtly different view of the legacy data.
Based on analogies with real-life warehouses, data warehouses were intended as large-scale collection/storage/staging areas for corporate data. From here data could be distributed to "retail stores" or "data marts" which were tailored for access by decision support users (or "consumers"). While the data warehouse was designed to manage the bulk supply of data from its suppliers (e.g. operational systems), and to handle the organization and storage of this data, the "retail stores" or "data marts" could be focused on packaging and presenting selections of the data to end-users, to meet specific management information needs.
Somewhere along the way this analogy and architectural vision were lost, as some vendors and industry speakers redefined the data warehouse as simply a management reporting database. This is a subtle but important deviation from the original vision of the data warehouse as the hub of a management information architecture, where the decision support systems were actually the data marts or "retail stores".
Storage
Online transaction processing (OLTP) systems use relational database designs that follow the discipline of data modeling and generally adhere to Codd's rules of data normalization in order to ensure data integrity. In this approach, each complex information item is resolved into a set of records in multiple tables, each of which satisfies the normalization rules. Codd defined a series of increasingly stringent normal forms, and OLTP systems typically achieve third normal form. Fully normalized OLTP database designs often result in information from a single business transaction being stored in dozens to hundreds of tables. Relational database managers are efficient at managing the relationships between these tables, and insert/update performance is very fast because each relational transaction affects only a small amount of data.
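A minimal sketch of that resolution, with an invented order-taking schema: a single sales transaction is spread across several third-normal-form tables linked by keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One business transaction (an order) resolved into several 3NF tables:
# each fact is stored once and referenced by key rather than repeated.
conn.executescript("""
    CREATE TABLE customers   (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products    (product_id  INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE orders      (order_id    INTEGER PRIMARY KEY,
                              customer_id INTEGER REFERENCES customers(customer_id),
                              order_date  TEXT);
    CREATE TABLE order_lines (order_id    INTEGER REFERENCES orders(order_id),
                              product_id  INTEGER REFERENCES products(product_id),
                              quantity    INTEGER);
""")

# Recording the transaction touches only a few small rows, which is why
# OLTP insert/update performance stays fast.
conn.execute("INSERT INTO customers VALUES (1, 'Acme Ltd')")
conn.execute("INSERT INTO products VALUES (10, 'widget', 2.50)")
conn.execute("INSERT INTO orders VALUES (100, 1, '2007-03-01')")
conn.execute("INSERT INTO order_lines VALUES (100, 10, 4)")
```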
OLTP databases are efficient because they typically deal with the information around only a single transaction at a time. In reporting and analysis, thousands to billions of transactions may need to be reassembled, imposing a huge workload on the relational database. Given enough time, the software can usually return the requested results, but because of the negative performance impact on the machine and all of its hosted applications, data warehousing professionals recommend that reporting databases be physically separated from the OLTP database.
In addition, data warehousing suggests that data be restructured and reformatted to facilitate query and analysis by novice users. OLTP databases are designed to provide good performance for rigidly defined applications built by programmers fluent in the constraints and conventions of the technology. Add in frequent enhancements, and many a database becomes a collection of cryptic names and seemingly unrelated, obscure structures storing data under incomprehensible coding schemes; all factors that, while improving performance, complicate use by untrained people. Finally, the data warehouse needs to support high volumes of data gathered over extended periods of time, is subject to complex queries, and must accommodate formats and definitions inherited from independently designed packages and legacy systems.
Designing the data architecture for a data warehouse is the realm of data warehouse architects. The goal of a data warehouse is to bring data together from a variety of existing databases to support management and reporting needs. The generally accepted principle is that data should be stored at its most elemental level, because this provides the most useful and flexible basis for reporting and information analysis. However, because designers focus on different specific requirements, there are alternative methods for designing and implementing data warehouses. There are two leading approaches to organizing the data in a data warehouse: the dimensional approach advocated by Ralph Kimball and the normalized approach advocated by Bill Inmon. While the dimensional approach is very useful in data mart design, it can result in a rat's nest of long-term data integration and abstraction complications when used in a data warehouse.
In the "dimensional" approach, transaction data is partitioned into either a measured "facts" which are generally numeric data that captures specific values or "dimensions" which contain the reference information that gives each transaction its context. As an example, a sales transaction would be broken up into facts such as the number of products ordered, and the price paid, and dimensions such as date, customer, product, geographical location and salesperson. The main advantages of a dimensional approach is that the data warehouse is easy for business staff with limited information technology experience to understand and use. Also, because the data is pre-joined into the dimensional form, the data warehouse tends to operate very quickly. The main disadvantage of the dimensional approach is that it is quite difficult to add or change later if the company changes the way in which it does business.
The "normalized" approach uses database normalization. In this method, the data in the data warehouse is stored in third normal form. Tables are then grouped together by subject areas that reflect the general definition of the data (customer, product, finance, etc.) The main advantage of this approach is that it is quite straightforward to add new information into the database — the primary disadvantage of this approach is that because of the number of tables involved, it can be rather slow to produce information and reports. Furthermore, since the segregation of facts and dimensions is not explicit in this type of data model, it is difficult for users to join the required data elements into meaningful information without a precise understanding of the data structure.
Subject areas are just a method of organizing information and can be defined along any lines. The traditional approach has subjects defined as the subjects or nouns within a problem space. For example, in a financial services business, you might have customers, products and contracts. An alternative approach is to organize around the business transactions, such as customer enrollment, sales and trades.
Advantages
There are many advantages to using a data warehouse; some of them are:
- Data warehouses enhance end-user access to a wide variety of data.
- Decision support system users can obtain specified trend reports, e.g. the item with the most sales in a particular area within the last two years (see the sketch after this list).
- Data warehouses can be a significant enabler of commercial business applications, particularly customer relationship management (CRM) systems.
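As a sketch of such a trend report, the query below finds the item with the most sales in one region over a two-year window, using an invented, deliberately simplified fact table (the dates are fixed for reproducibility):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (product TEXT, region TEXT, sale_date TEXT, units INTEGER);
    INSERT INTO fact_sales VALUES
        ('widget', 'California', '2006-07-01', 40),
        ('gadget', 'California', '2006-08-15', 55),
        ('widget', 'California', '2007-01-10', 30);
""")

# Item with the most sales in California over a fixed two-year window.
print(conn.execute("""
    SELECT product, SUM(units) AS total
    FROM fact_sales
    WHERE region = 'California'
      AND sale_date BETWEEN '2005-06-01' AND '2007-06-01'
    GROUP BY product
    ORDER BY total DESC
    LIMIT 1
""").fetchone())  # ('widget', 70)
```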
Concerns
- Extracting, transforming and loading data consumes a lot of time and computational resources.
- Data warehousing project scope must be actively managed to deliver a release of defined content and value.
- Data warehouses can have compatibility problems with systems already in place.
- Security could develop into a serious issue, especially if the data warehouse is web accessible.
- Controversy over data storage design warrants careful consideration and perhaps prototyping of the data warehouse solution for each project's environment.
See also
- Bitmap index
- Business intelligence (BI)
- Business Performance Management (BPM)
- Data integration
- Data mart
- Data mining
- Database Management System (DBMS)
- Executive Information System (EIS)
- Extract, transform, and load (ETL)
- Intelligent document
- Master Data Management (MDM)
- Online analytical processing (OLAP)
- Online transaction processing (OLTP)
- Operational Data Store (ODS)
- Shadow system
- Snowflake schema
- Star schema
References
- William H. Inmon, Richard D. Hackathorn: Using the Data Warehouse, John Wiley & Sons, ISBN 0-471-05966-8
- Pyle, Dorian. Business Modeling and Data Mining. Morgan Kaufmann, 2003. ISBN 1-55860-653-X
- Ralph Kimball, Margy Ross: The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition), John Wiley & Sons, ISBN 0-471-20024-7
- Stephen Haag, Maeve Cummings, Donald J. McCubbrey, Alain Pinsonneault, Richard Donovan: Management Information Systems for the Information Age, Third Canadian Edition, McGraw-Hill Ryerson, ISBN 0-07-095569-7
- Data Warehousing Gotchas
- ODP Resources on Data Warehouse
- ODP Articles on Data Warehouse
- Greenfield, Larry. DW InfoCenter
- Manning, Ian: Data Warehousing - What Is It. http://www.datawarehouse.com/article/?articleId=3204&searchTerm=dilemma
- Agile Best Practices for Data Warehousing
- Darmawikarta, Djoni (2007): Dimensional Data Warehousing with MySQL, BrainySoftware. ISBN 0975212826
- Practical Data Warehousing
1. Kimball, R. and Ross, M.: The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd Ed., Wiley, 2002, p. 310. ISBN 0-471-20024-7