Data loading
The loading of the data in a warehouse is one of the most important process in developing a business intelligent software. In the staging area after the data has been extracted from the source system and transformed according to the target system the next step comes is to load the data in the warehouse.
There are three common techniques for data loading:
1. Full data refresh.
2. Incremental Load / refresh.
3. Trickle Feed.
The loading could be of two types: "Immediate" and "Deffered".
The first two techniques are deffered one while the third one is immediate type.
Things to Consider:
Main Objectives under consideration while deciding the loading strategies are:
a) Increase data freshness.
b) Increase data loading performance.
To obtain these objectives, we can "remove referential integrity constraints", "drop secondary indexes" and "disable logging". These terms are relative to the system under use as some or all of them are available or non-available in that system.
Normally, dropping indexes is not allowed in incremental and trickle feed techniques.
Full Data Refresh:
In full data refresh, following steps are taken on target system(table):
a) Target table data is deleted.
b) All new data is loaded in the target from the source.
c) New indexes are created(of the target table/system).
d) New statistics are calculated for the refreshed table.
Incremental Load/Refresh:
In incremental refresh:
a) New and updated data is acquired from the source system.
b) The new data is appended to the existing data in the target system and the updated data is updated in the target system's existing data.
c) The indexes and statistics are updated accordingly.
Trickle Feed:
In trickle feed: When ever the source system is updated, the changing are made in the target system immediately.
How to choose a specific technique:
Choosing a specific technique to load/refresh your target system depends upon different factors:
1. Amount of data updated/changed/added.
2. Quality of data required i.e. How fresh data should be.
For case 1: if the updated/changed/added data is > (in some cases equal to) 10% of the existing data then we are going to use FULL REFRESH instead of INCREMENTAL as the time required to update the existing system will be much higher than the time required to reload the complete data (Insert operation is almost 10 times faster than the update operation).
"But for the two techniques, the data to be loaded should be in bulk and thus the freshness is low."
For case 2: if we need fresh data, then we have to use the TRICKLE FEED, regardless of the performance(time required) of this technique.
Loading an already loaded system:
--Bilal.zubairy 06:03, 20 September 2007 (UTC)