Pentaho Data Integration (PDI): Doing Simple ETL to Produce Dimensional Model


In a recent article, I tried to give some idea on ETL (Extract-Transform-Load) process with some points on what to avoid or what to embrace for ETL. We also listed Pentaho Data Integration (PDI) as an ETL tool.

This step-by-step hands-on article walks you through PDI tool installation, SQL JDBC Driver setup and carries out a very basic ETL process to transform a sample csv file into dimensional model. We hope to provide yet another article on dimensional modeling.

A. Pentaho BI Suite


Pentaho is a BI suite built using Java and as in Nov’18 version 8.1 is released that is the commercial version. For this article’s demo purpose, I am using 30-day-trial version from Hitachi Vantara website. There is also a Community edition with free tools that lack some functionalities of commercial product and also some functionalities are modified.

Pentaho BI suite is collection of different tools for ETL or Data Integration, Metadata, OLAP, Reporting and Dashboard, etc. tools.

Below sections are some short descriptions of what I did using Pentaho Data Integration (PDI) tool, a.k.a Spoon.

Installing Pentaho BI Suite 8.1 (30-day trial)

ETL – Using Pentaho Data Integration (PDI) tool (aka spoon)

For this demo, we are going to load a small dummy file (downloaded from internet) into staging table of SQL Server and then create dimension and fact tables from that staging table.

Let’s open the PDI tool and first step is to make sure that we can connect to target SQL Server.

In PDI GUI, go to File -> New ->“Database Connection…” and “test” the connection to SQL Server:

As we see, we need to make PDI tool to identify SQL JDBC driver. To do so, download and unzip the file “sqljdbc_6.0.8112.200_enu.exe” and copy 2 files (jre8\sqljdbc42.jar and auth\x64\sqljdbc_auth.dll) to <Pentaho_Install_Path>\design-tools\data-integration\lib folder.

Also make sure that TCP/IP and Named Pipe protocols are enabled through ‘SQL Server Configuration Manager’.

Now restart the PDI tool and try again to connect to the SQL database.

We are all set and now we will go through the input/output and then create some files in Pentaho Data Integration (PDI) tool in step-by-step manner.

Dummy Data file (csv) – 9.4mb (88476 rows x 11 columns)

B. PDI Transformations:

Pentaho Data Integrator (PDI) transformations are like SQL Server Integration Services (SSIS) dtsx package that can be developed full or a part of the ETL process.

As part of the Demo POC, I have created 3 PDI transformations:

1.Staging – This transformation file (DemoStage1.ktr) just loads the csv file into staging SQL2014 table. The “Strings cut” is used to make “Q1 2012” type data from csv file to convert to quarter number {1, 2, 3, 4}.

a. Csv file ->ProductSales

2. Dimension Load – This transformation file (DemoDim1.ktr) further truncate/load the staging table’s data into separate dimensions.

  1. ProductSales -> dbo.dimRetailer
  2. ProductSales -> dbo.dimOrderMethodType
  3. ProductSales -> dbo.dimProduct
  4. ProductSales -> dbo.dimPeriod                                                             

3. Fact Load – This transformation file (DemoFact1.ktr) truncate/load the staging table’s data into fact table by looking up each of the dimension tables built for surrogate keys.

a. ProductSales -> dbo.factProductSales

  1. lookup RetailerID
  2. lookup OrderMethodID
  3. lookup ProductID
  4. lookup YearQuarterID (PeriodID)

Below are the screenshots of each of the transformations and the job.

1. Transformation 1: Staging (DemoStage1.ktr) ->
Time Taken 1.9 seconds (88475 rows)

1a. CSV file input: This is under ‘Input’ node of “Design” tab at left side pan of PDI

1b. Strings Cut: This can be found under “Transform” node of Design tab in left side of PDE.

1c. Stage Table: This is table output of “output” node of Design pan.

2. Transformation 2: Dimension Tables (DemoDim1.ktr) ->
Time Taken 0.3 seconds
Below are 2 screenshots of DemoKim1.ktr, before and after execution of the transformation package.

In this transformation, the concept is to drop-create all the dimension tables then populating each of the dimension tables.

2a. Execute SQL script: This is under “Scripting” node and it contain drop-create DDL statements of all 4 dimension tables (dimRetailer, dimOrderMethodType, dimProduct and dimPeriod).

2b. Table Input: this tool from “Input” node is used to read distinct required fields to populate dimension tables. This ‘Table Input’ is used for all 4 transformation tasks (e.g. dimRetailer, dimOrderMethodType, dimProduct and DimPeriod).

2c. Table Output: This transformation tool is used for transferring Table Input result set to Table Output hence populates individual dimension tables. All 4 bottom transformations (highlighted yellow) utilizes same concept.

3. Transformation 3: Fact Table (DemoFact1.ktr)
Time Taken 2.3 seconds

3a. Execute SQL script: This task drop-creates the fact table (factProductSales).

3b. Table Input: “ProductSales” task is actually a ‘Table Input’ transformation task that selects rows from staging table (ProductSales). From here, we will use lookups to get surrogate keys of each of the dimension tables we created. Finally we will populate our fact table with surrogate keys and measure fields.

3c. Lookup: ‘Database Value Lookup’ transformation task from “Lookup” node is used to get corresponding surrogate keys from the dimension tables. For instance, in below screenshot, we are getting RetailerID surrogate key from dimRetailer dimension table by joining 2 fields.

Same concept is used for all 4 lookup transformation tools:

3d. Table Output: Finally, we are pushing surrogate keys (yellow highlighted) and other measures into factProductSales table.

C. PDI Job

Pentaho Data Integrator (PDI) can also create JOB apart from transformations. Job is just a collection of transformations that runs one after another. For example, a complete ETL project can have multiple sub projects (e.g. separate transformation files) that Job can trigger one after another. PDI Job has other functionalities that can be added apart from just adding transformations.

As part of the DEMO POC, I have created a single Job that executes 3 transformations in specific order.

PDI Job: Demo Job (DemoJob1.kjb) executes all 3 above transformations in a single go.

4a. Start and Success:

4b. Executing Transformation saved files: The 3 transformation tasks actually execute 3 saved transformation files (e.g. DemoStage1.ktr, DemoDim1.ktr and DemoFact1.ktr) from file system in specific order.

That was all for a simple demo on Pentaho Data Integration (PDI) tool. If you have any queries regarding to BI solution, feel free to knock us anytime.

1 thought on “Pentaho Data Integration (PDI): Doing Simple ETL to Produce Dimensional Model”

Leave a Comment

Your email address will not be published. Required fields are marked *