Extract, Transform and Load (ETL) / Data Transformation
ETL process: Extract, Transform and Load is the process of transferring data from source
databases to a target database after performing the following set of operations:
Extract: this process entails extracting data from a single data source or from multiple sources. The data sources could be unstructured (such as Excel
spreadsheets) or structured, as in relational databases. For example, your input data sources could be SQL Server tables,
Oracle tables, Excel files, flat files, etc.
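As a minimal sketch of the extract step, the following reads a flat-file (CSV) source into rows of dictionaries. The sample data and column names are hypothetical, standing in for a real source file:

```python
import csv
import io

# Hypothetical flat-file source; in practice this would be open("customers.csv").
raw = io.StringIO(
    "customer_id,name,region\n"
    "1,Acme,North\n"
    "2,Globex,South\n"
)

def extract_csv(handle):
    """Read a delimited source into a list of row dictionaries."""
    return list(csv.DictReader(handle))

rows = extract_csv(raw)
```

The same row-of-dictionaries shape works for rows fetched from SQL Server or Oracle via a database driver, so downstream transform code does not need to care where the data came from.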
Transform: this process entails the following set of operations:
o Cleansing the data:
this entails removing test data and redundant data from the input data sources.
o Transforming the data:
this entails performing operations on the input data set so that the output is in the required format. We could perform the
following set of operations:
§ Choosing a subset of columns from the input source.
§ A large number of operations on character data (trimming, case conversion, concatenation, and so on).
§ Generating surrogate keys.
§ Lookups: finding the required values from standalone reference tables.
§ Conditional transforms: applying multiple 'if' conditions.
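The transform operations listed above can be sketched in a few lines of Python. The input rows, the reference table, and all column names here are hypothetical:

```python
# Hypothetical extracted rows and a standalone reference (lookup) table.
rows = [
    {"id": "10", "name": "  acme ", "country": "US", "amount": "250"},
    {"id": "11", "name": "globex", "country": "DE", "amount": "75"},
]
country_lookup = {"US": "United States", "DE": "Germany"}

def transform(rows):
    out = []
    for sk, r in enumerate(rows, start=1):  # sk acts as a generated surrogate key
        amount = int(r["amount"])
        out.append({
            "customer_sk": sk,                            # surrogate key
            "name": r["name"].strip().title(),            # character-data operations
            "country": country_lookup[r["country"]],      # lookup against reference table
            "tier": "high" if amount > 100 else "low",    # conditional ('if') transform
        })  # note the source "id" column is dropped: column subsetting
    return out

clean = transform(rows)
```

Each output row carries a surrogate key, cleaned character data, a looked-up value, and a derived conditional column, while the untransformed source key has been subset away.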
Load: the data is normally loaded into the target databases. The target databases could be Operational
data stores (in 3rd normal form), Data staging areas (keeping a copy of the data from the input sources, after
data cleansing, on a single platform), or a Data warehouse (in 3rd normal form or star schema format). When we load
the data into the target databases, we need to take the following into consideration:
o Update strategy: whether to truncate
the target tables before loading or to append to them. Bulk insert is also an option.
o Parallel processing:
how many parallel threads an ETL job will execute in.
o Job dependencies: this entails deciding the order in which the ETL jobs need to be executed.
This is normally handled by the scheduling script in the ETL tool.
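Deciding the execution order from job dependencies amounts to a topological sort, which the standard library provides directly. The job names below are hypothetical:

```python
from graphlib import TopologicalSorter

# Hypothetical job-dependency graph: each job maps to the jobs it depends on.
deps = {
    "load_fact": {"load_dim_customer", "load_dim_product"},
    "load_dim_customer": {"stage_sources"},
    "load_dim_product": {"stage_sources"},
}

# static_order() yields a valid execution order: dependencies always run first.
order = list(TopologicalSorter(deps).static_order())
```

An ETL tool's scheduler does essentially this, and in addition may run independent jobs (here, the two dimension loads) in parallel threads.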
Data loading into the Data warehouse or operational reporting sources: when we load the data into the target databases,
we need to consider the following intricacies:
o Handling type-1 entities: initial and incremental load.
o Handling type-2 entities: initial and incremental load.
o Working with sources with time stamps.
o Working with sources without time stamps.
Our firm has written a paper on handling the above four issues
during our ETL projects.
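For a source with time stamps, an incremental load typically filters on a high-water mark from the previous run, as in this sketch (dates and column names are hypothetical). For sources without time stamps, a full extract compared against the target, or change-data-capture, is needed instead:

```python
from datetime import datetime

# Hypothetical timestamped source rows.
source = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 3, 15)},
]

# High-water mark recorded by the previous load run.
last_load = datetime(2024, 2, 1)

# Incremental extract: only rows modified since the last load.
delta = [r for r in source if r["updated_at"] > last_load]
```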
Handling of fact table loads: a fact table should contain the surrogate keys from the dimension tables and the measures
to be analyzed. To load the fact tables, we make extensive use of lookup or outer-join operations.
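The lookup described above can be sketched as follows: incoming transactions carry natural keys, which are swapped for dimension surrogate keys (an outer join would serve the same purpose in SQL). All names and the -1 "unknown member" convention here are hypothetical:

```python
# Hypothetical dimension lookup: natural key -> surrogate key.
dim_customer = {"C1": 101, "C2": 102}

transactions = [
    {"customer": "C1", "amount": 50},
    {"customer": "C9", "amount": 20},  # no matching dimension row
]

# Fact rows carry the surrogate key and the measure; unmatched natural
# keys are assigned -1, a placeholder "unknown" dimension member.
fact_rows = [
    {"customer_sk": dim_customer.get(t["customer"], -1), "amount": t["amount"]}
    for t in transactions
]
```

Routing unmatched rows to an "unknown" member (rather than dropping them) mirrors the outer-join behavior: no transaction is silently lost from the fact table.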
Handling of aggregate table loads: when we need to aggregate the fact table data to a higher level of granularity (say
the fact table is loaded on a daily basis but the aggregate needs to be monthly), we generate aggregate tables.
This operation normally entails performing sum or average operations on the measures in the fact table and grouping by the dimension keys at the higher grain.
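A daily-to-monthly rollup of this kind can be sketched as a group-by with a sum over the measure. The fact rows and column names are hypothetical:

```python
from collections import defaultdict

# Hypothetical daily-grain fact rows.
daily_fact = [
    {"date": "2024-01-05", "product_sk": 1, "amount": 10},
    {"date": "2024-01-20", "product_sk": 1, "amount": 15},
    {"date": "2024-02-03", "product_sk": 1, "amount": 7},
]

# Aggregate to monthly grain: group by (month, dimension key), sum the measure.
monthly = defaultdict(int)
for row in daily_fact:
    month = row["date"][:7]  # truncate "YYYY-MM-DD" to "YYYY-MM"
    monthly[(month, row["product_sk"])] += row["amount"]
```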
Factless fact: a factless fact table is a table with no measures.
It is used in those cases where we need to perform aggregation operations on entities in dimension tables. Say you have a
person dimension and you want to find the count of persons in a particular district; you would then generate a factless
fact table that is a copy of the dimension table.
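The person/district example above can be sketched as follows, with hypothetical dimension data: the factless fact carries only keys, and the count falls out of aggregating its rows.

```python
from collections import Counter

# Hypothetical person dimension.
person_dim = [
    {"person_sk": 1, "district": "North"},
    {"person_sk": 2, "district": "North"},
    {"person_sk": 3, "district": "South"},
]

# Factless fact table: a copy of the dimension's keys, with no measures.
factless = [{"person_sk": p["person_sk"], "district": p["district"]} for p in person_dim]

# Counting rows per district answers "how many persons in each district?".
persons_per_district = Counter(row["district"] for row in factless)
```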