A Basic Understanding of the ETL Process

In the first article of our BI series, we gave a basic overview of Business Intelligence, covering the necessary tools for BI and its business advantages. In this article, we will dig deep into one of the core BI processes: gathering data from mass storage and turning it into useful business data that is ready for analytics and reporting.

What is ETL (Extract, Transform, Load)?

In short, ETL is the process of gathering data from any number of sources, organizing it, and centralizing it into a single repository from which the business can draw insight. It is the process by which data is extracted from heterogeneous data sources that are not optimized for analytics, transformed with business logic applied, and then moved to a target system (e.g. a data warehouse, DW) that is optimized for analytics.

Some people perform ETL through hand coding in SQL or Java, but there are tools available that simplify the process. This article takes a close look at ETL business cases, the advantages of using an ETL tool rather than hand coding, and some basic do's and don'ts.
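To give a feel for what hand coding looks like, here is a minimal sketch in T-SQL-flavored SQL. The table and column names (oltp.customer, dw.dim_customer) are hypothetical; the point is simply that extract, transform and load can be collapsed into a single statement.

    -- A minimal hand-coded "ETL" in plain SQL (hypothetical schema).
    -- Extract from the OLTP table, transform (trim names, standardize
    -- codes) and load into the DW table in one statement.
    INSERT INTO dw.dim_customer (customer_id, full_name, country_code, email)
    SELECT  c.customer_id,
            LTRIM(RTRIM(c.first_name)) + ' ' + LTRIM(RTRIM(c.last_name)),
            UPPER(c.country_code),
            LOWER(c.email)
    FROM    oltp.customer AS c
    WHERE   c.email IS NOT NULL;

This works for simple cases, but as the number of sources and rules grows, hand-coded scripts become hard to maintain, which is where dedicated ETL tools earn their keep.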

Business Case for ETL

Consider a company that already has multiple systems (e.g. SAP, budgeting, production systems, online systems, mainframes, databases, Access files, etc.) in different departments. Departments also store some data manually in text, CSV and Excel files on a regular basis. Yet more files are generated every day by remote systems and placed on an FTP server or a shared location. Note that each system can have its own logic for preparing data.

We need a set of processes that can load all of this data into a database after applying various rules. We can choose between two methods:

1)  We can manually import data from the different systems. The problems here are:

  1. Loading the data takes a huge amount of time.
  2. It is hard to standardize data manually.
  3. By the time we are done loading all the data, new files/data have been generated that remain unprocessed!

2)  We can use an ETL process that integrates the data from the source systems into clean, standardized, analytics-ready data in the target database, with a set of business logic applied.

The business can then not only see all of its data in one place but also put BI tools or applications on top of the centralized data to gain insight into the whole business. The other main benefit is that the data has a single version of the truth.

Types of ETL Processes

There are basically two kinds of ETL process: the traditional ETL process and the modern ELT process.

While the ETL concept has been popular since the 1970s, the rise of distributed databases and commodity/cluster systems has recently made another process, ELT, more popular, and it is being adopted by many tools (like Oracle Data Integrator, ODI). Microsoft SSIS uses the ETL approach, but we can also do some ELT if we design our processes that way.

In ETL, transformations happen mainly in the ETL tool, and the data is then loaded into the target database (e.g. a DW). In ELT, the data is loaded into the target database (e.g. a DW) first, and the transformations are applied there. So, for ELT it is a must to have a very powerful target system (DW).
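To make the contrast concrete, here is an ELT-style sketch in T-SQL-flavored SQL. The file path and table names (stg.sales_raw, dw.fact_sales) are hypothetical: the raw file is loaded as-is into the target system first, and the transformation runs there as a second step.

    -- ELT style: load raw data first, then transform inside the target DW.
    -- Step 1: bulk-load the raw file into a staging table, untouched.
    BULK INSERT stg.sales_raw
    FROM 'D:\feeds\sales.csv'
    WITH (FIELDTERMINATOR = ',', FIRSTROW = 2);

    -- Step 2: transform using the target database's own SQL engine.
    INSERT INTO dw.fact_sales (sale_date, product_id, quantity, amount)
    SELECT  CAST(sale_date AS DATE),
            product_id,
            quantity,
            CAST(amount AS DECIMAL(18, 2))
    FROM    stg.sales_raw
    WHERE   quantity > 0;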


In a traditional ETL flow, the extracted data is transformed in a staging area, e.g. cleansed and optimized for analysis. The transformed data is then loaded into an OLAP analytics database or DW. BI teams/applications then run queries and reports on top of that database, or use it as input for ML algorithms.

In the modern era, the ETL approach is shifting to ELT, which does away with staging areas and instead performs the transformations in the target analytics database or DW, residing on powerful machines, a grid or the cloud.

Since the concepts behind the different phases of the ETL and ELT processes are mostly the same, the word "ETL" is used throughout this article. The same concepts and sections apply to ELT too.

ETL Phases

There are three phases in the ETL process:

Extract Phase:

During the Extract phase of ETL, the data sources, rows and fields must be identified. Data is extracted (copied) from OLTP and other databases, flat files, CSV and Excel files, FTP locations or even the Web. Make sure this does not have a negative impact on the source systems.

Data extraction commonly happens in one of three ways: update notification, incremental extraction or full extraction. Whichever way we choose, the main consideration is not to keep the source systems busy for long during this phase.
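As an illustration of incremental extraction, here is a T-SQL-flavored sketch built on hypothetical tables (oltp.orders, stg.orders, etl.watermark). Only rows changed since the last successful run are pulled, so the source system is touched as briefly as possible.

    -- Incremental extraction using a watermark (hypothetical schema).
    -- etl.watermark stores the last successfully extracted timestamp.
    DECLARE @last_run DATETIME2;

    SELECT @last_run = last_extracted_at
    FROM   etl.watermark
    WHERE  source_name = 'oltp.orders';

    -- Pull only the rows changed since the last run.
    INSERT INTO stg.orders (order_id, customer_id, amount, modified_at)
    SELECT order_id, customer_id, amount, modified_at
    FROM   oltp.orders
    WHERE  modified_at > @last_run;

    -- Advance the watermark to the newest row we actually extracted
    -- (assumes stg.orders is truncated before each run).
    UPDATE etl.watermark
    SET    last_extracted_at = (SELECT MAX(modified_at) FROM stg.orders)
    WHERE  source_name = 'oltp.orders';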

Transform Phase:

The Transform phase is all about cleansing, applying business logic, looking up or joining data from multiple sources, aggregating data, etc., to prepare it for analysis. One of the main goals of this phase is to bring diverse data into a single standard that is ready for business analytics.

Examples of transformations include: cleaning, validation, deduplication, format changes, data enrichment, filtering, merging, applying business rules, summarization, aggregation, etc. More examples can be found in the documentation for Microsoft SSIS or Pentaho Data Integration.
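A few of these transformations combined in a T-SQL-flavored sketch (the staging table stg.customer is hypothetical): trimming and casing fixes, a crude validity filter, and deduplication that keeps only the latest record per business key.

    -- Cleansing, standardization and deduplication in one pass.
    WITH ranked AS (
        SELECT  customer_id,
                LTRIM(RTRIM(full_name)) AS full_name,       -- cleaning
                UPPER(country_code)     AS country_code,    -- format change
                LOWER(email)            AS email,
                ROW_NUMBER() OVER (
                    PARTITION BY customer_id                -- dedup key
                    ORDER BY modified_at DESC) AS rn
        FROM    stg.customer
        WHERE   email LIKE '%_@_%.__%'                      -- crude validation
    )
    SELECT customer_id, full_name, country_code, email
    FROM   ranked
    WHERE  rn = 1;   -- keep only the latest version of each customer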

Load Phase:

In this third and final step of the ETL process, the extracted and transformed data is loaded into the target database (or DW). There are two methods: full load and incremental load. The Load phase ultimately puts the final dataset into the data warehouse's dimensional model. There are several optimization techniques for loading very large fact tables that are not discussed here.
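An incremental load is often implemented as an upsert. The sketch below uses the standard SQL MERGE statement (supported by SQL Server and Oracle, among others) against hypothetical tables: new customers are inserted, existing ones updated.

    -- Incremental load as an upsert into a dimension table (hypothetical schema).
    MERGE dw.dim_customer AS tgt
    USING stg.customer_clean AS src
        ON tgt.customer_id = src.customer_id
    WHEN MATCHED THEN
        UPDATE SET tgt.full_name    = src.full_name,
                   tgt.country_code = src.country_code,
                   tgt.email        = src.email
    WHEN NOT MATCHED THEN
        INSERT (customer_id, full_name, country_code, email)
        VALUES (src.customer_id, src.full_name, src.country_code, src.email);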

ETL Best Practices (DO’s)

  1.    Gather and properly analyze the requirements (functional, data, operational, maintenance, etc.).
  2.    Properly identify the source systems and their data volume and granularity.
  3.    Collect, define and document unambiguous, complete and non-conflicting business logic/rules.
  4.    Produce the target database's logical and physical design, along with design documents.
  5.    Keep cleansing and validation mechanisms in place for dates, empty fields and the standardization process.
  6.    Try to follow good practices for SQL code, SSIS, ODI, the DW, etc.
  7.    Make use of synonyms and db-links/linked servers to make code dynamic.
  8.    Try to apply parallelism in ETL processes.
  9.    Make sure Data Governance and MDM are in place if required.
  10.   Extract only the minimal subset of data from the sources to reduce latency and the impact on those systems.
  11.   Avoid lookup caching for large tables. Try to use the smallest possible data types.
  12.   Avoid row-by-row processing; process relational data in the DBMS as much as possible.
  13.   Handle indexing and partitioning [partition switching] on the target system, especially for fact tables (see the sketch after this list).
  14.   Break the whole ETL process into sub-processes, and make it possible to run the ETL from the beginning or from a specific sub-process.
  15.   Keep logging, notification and alerting sub-processes in place.
  16.   Keep proper LLD (Low-Level Documentation) and support documentation.
  17.   Have a maintenance plan and a plan for handling process failures.
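For point 13, a common pattern is to disable the non-clustered indexes on a large fact table before the load and rebuild them afterwards. The T-SQL-flavored sketch below uses hypothetical index and table names.

    -- Disable a non-clustered index before a big fact load (hypothetical names).
    ALTER INDEX ix_fact_sales_product ON dw.fact_sales DISABLE;

    -- ... bulk load into dw.fact_sales happens here ...

    -- Rebuild the index and refresh statistics after the load.
    ALTER INDEX ix_fact_sales_product ON dw.fact_sales REBUILD;
    UPDATE STATISTICS dw.fact_sales;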

ETL Bad Practices (DON’Ts)

If we follow the DO's, then many of the DON'Ts can be avoided. Below is a list of further DON'Ts:

  1.    Do not start creating ETL sub-processes without prior planning, design and documentation.
  2.    Do not hardcode filenames, paths, FTP locations, credentials, etc. Use variables, parameters, etc. (see the sketch after this list).
  3.    Try to avoid snowflaking and 3NF in the target DW database.
  4.    Avoid transferring data to the staging/target database without applying proper cleansing and data type conversion. Keep proper handling for type mismatches in place (ask the client).
  5.    Avoid row-by-row transformations, and try to avoid sorting in the ETL tool.
  6.    Do not use non-numeric types for surrogate and primary keys.
  7.    In SQL and in the target database, try to avoid cursors, loops and indexing while loading data.
  8.    Create indexes and statistics in a separate step after loading the data.
  9.    Refrain from promising "real-time" ETL solutions. If you really need "near-real-time" ETL, take extra care to extract/transform/load only minimal data.
  10.   Do not bypass dev testing, even for simple processes.
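For point 2, one way to avoid hardcoding is to keep file paths and similar settings in a configuration table and build the load statement dynamically. This T-SQL-flavored sketch assumes a hypothetical etl.config table.

    -- Avoid hardcoded paths: read them from a config table (hypothetical schema).
    DECLARE @path NVARCHAR(400), @sql NVARCHAR(MAX);

    SELECT @path = config_value
    FROM   etl.config
    WHERE  config_key = 'sales_feed_path';

    -- Build the BULK INSERT dynamically, since it requires a literal path.
    SET @sql = N'BULK INSERT stg.sales_raw FROM ''' + @path +
               N''' WITH (FIELDTERMINATOR = '','', FIRSTROW = 2);';
    EXEC sp_executesql @sql;

In a tool such as SSIS, the same idea is expressed with package parameters and expressions rather than dynamic SQL.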

Tools for the ETL Process

There are many ETL tools out there; which one suits you depends on your business needs. Some of them are mentioned below.

  1. Informatica
  2. SAP Business Objects Data Integrator – for enterprises
  3. IBM InfoSphere DataStage
  4. Microsoft SQL Server Integration Services (SSIS)
  5. Oracle Data Integrator 12c (ODI)
  6. Pentaho Data Integration 8.1 (PDI) – commercial and open source
  7. Jasper ETL – enterprise and open source
  8. Talend Open Studio for Data Integration – open source

We will come back with more insights into different BI processing methods in our next articles.

Feel free to reach out to us if you need any guidance or help with BI solutions.
