ETL Data Transformation and Why We Should Use It

Linh Pham

Data is king, especially in data analysis. Analysis requires collecting data from many places, and it takes real effort to transform that data before it can be used. What would you think if we had a system that could help without requiring deep developer expertise? Data transformation is exactly that.

What is Data Transformation?

Data transformation is a chain of processes and activities that modifies, structures, formats, and computes values for data coming from multiple sources, and often multiple formats (CSV, JSON, TXT, binary, image, etc.). The result is data in a standard format, stored in a destination system or application, that we can use for research, as an application's input, for reporting, or for predicting the future (AI).
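To make this concrete, here is a minimal sketch in plain Python of what "multiple sources, multiple formats, one standard schema" can look like. The field names and sample data are hypothetical; real pipelines would read from files or APIs rather than inline strings.

```python
import csv
import io
import json

# Hypothetical inputs: the same kind of customer data arriving as CSV and as JSON.
csv_source = "id,name,signup\n1,Alice,2022-01-05\n2,Bob,2022-03-17\n"
json_source = '[{"customer_id": 3, "full_name": "Carol", "signed_up": "2022-06-02"}]'

def normalize_csv(text):
    # Map the CSV columns onto the standard schema.
    return [{"id": int(r["id"]), "name": r["name"], "signup_date": r["signup"]}
            for r in csv.DictReader(io.StringIO(text))]

def normalize_json(text):
    # Map the JSON field names onto the same schema.
    return [{"id": r["customer_id"], "name": r["full_name"], "signup_date": r["signed_up"]}
            for r in json.loads(text)]

# After normalization, both sources share one format the destination can use.
records = normalize_csv(csv_source) + normalize_json(json_source)
```

Whatever the tool, the core idea is the same: each source format gets its own small mapping onto a single agreed-upon schema.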

The next thing I want to show you is the main workflow of Data Transformation.

Figure 1. Data Transformation Workflow

Look at Figure 1 above; it is easy to see that the data transformation process has four basic steps:

Collect data

Determine which data sources must be gathered, as well as their original data formats.

Put data into a Data Lake / Data Warehouse

We must choose between two types of storage, a data lake or a data warehouse:

  1. Consider a data warehouse when the source data is structured and has a clear format.
  2. Consider a data lake (preferred) when the source data is unstructured, messy, or in mixed formats, or when the source storage needs to scale on demand.

Run an ETL (Extract, Transform, Load) job with data pipelines

This is the main step: extract data from the sources and execute the transformations. After this step, the data is structured. (Keep reading to understand ETL in detail.)

Sink to destination storage

The data is now clean and structured, ready to be queried by any service.

Below are some very simple cases where data transformation applies (no advanced scenarios here; let's keep it simple to understand).

Figure 2. Transform raw data into structured and formatted data

Figure 3. Remove duplicate data
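The two figures above can be sketched in a few lines of plain Python. The sample values are hypothetical; the point is the shape of the two operations: normalize formatting (Figure 2), then drop duplicates (Figure 3).

```python
# Raw rows with inconsistent formatting and duplicates (hypothetical data).
raw = ["  alice@example.com ", "BOB@example.com", "alice@example.com", "bob@example.com"]

def clean(value):
    # Figure 2 idea: trim whitespace and normalize case into one standard format.
    return value.strip().lower()

seen, result = set(), []
for row in raw:
    row = clean(row)
    # Figure 3 idea: drop duplicates while preserving the original order.
    if row not in seen:
        seen.add(row)
        result.append(row)

print(result)  # ['alice@example.com', 'bob@example.com']
```

ETL tools perform the same logic through built-in activities instead of hand-written loops, but the underlying operations are exactly these.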

ETL Process in Data Transformation

ETL is not a new concept; it was introduced in the 1970s and is now the main method used in data warehouse projects. Data transformation involves multiple steps, and ETL is the most important one for moving data from multiple source types to destination storage. ETL (Extract, Transform, Load) is a data integration standard that combines data from multiple sources into a single entity and then stores it in a data warehouse, database, or any other storage location you need to query from your services.

So how does ETL work?

The best way to understand how ETL works is to walk through each step in detail. So what happens inside the ETL process?


In practice, the ETL process runs via one or more pipelines, and every pipeline runs through the three steps below:

Note: ETL supports parallel execution.


Step 1 - Extract Activities: By default, an ETL data pipeline cannot run without a trigger from a data source. In this step, one or more activities receive the triggered data (a zip file, CSV, image, etc.) and load the raw data into pipeline memory. This in-memory data is normally called a "DataSet", and a data pipeline can hold one or more DataSet objects.

Example: if the triggered source file is a zip package, we first need an activity to unzip it, and then another activity to load all the data files inside the package into DataSet objects.
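The unzip-then-load example above can be sketched with Python's standard `zipfile` module. The zip package is built in memory here purely so the example is self-contained; in a real pipeline the trigger would deliver an actual file, and the "DataSet" would be the tool's own object, not a plain dict.

```python
import io
import zipfile

# Build an in-memory zip package standing in for the triggered source file
# (hypothetical file names and contents).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("orders.csv", "id,total\n1,9.99\n")
    zf.writestr("refunds.csv", "id,total\n2,4.50\n")
buf.seek(0)

# Extract activity: unzip the package and load every data file it contains
# into an in-memory "DataSet" (here, a dict of file name -> file contents).
dataset = {}
with zipfile.ZipFile(buf) as zf:
    for name in zf.namelist():
        dataset[name] = zf.read(name).decode("utf-8")
```

Once the raw files are in memory as DataSets, the transform step can operate on them without touching the original source again.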

In summary, the main job of this step is to prepare the input data for the ETL data pipeline.


Step 2 - Transform Activities: This step contains the main activities of ETL data pipeline processing. ETL tools provide powerful built-in transformations such as filtering, removing duplicate data, inserting and removing columns, and applying math formulas, plus data processing features such as grouping, joining, and combining data from multiple datasets, making HTTP requests, calling webhooks, and so on. (Some ETL tools also let you connect to third-party software or services, or execute scripts.)
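A few of those built-in transformations (filter, join, group) can be illustrated in plain Python. The datasets and field names below are hypothetical stand-ins for what the extract step would produce.

```python
from collections import defaultdict

# Hypothetical DataSets produced by the extract step.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 25.0},
    {"order_id": 2, "customer_id": 10, "amount": 40.0},
    {"order_id": 3, "customer_id": 11, "amount": 15.0},
]
customers = {10: "Alice", 11: "Bob"}

# Filter: keep only orders at or above a threshold.
filtered = [o for o in orders if o["amount"] >= 20.0]

# Join: attach the customer name from the second dataset.
joined = [{**o, "customer": customers[o["customer_id"]]} for o in filtered]

# Group + aggregate: total spend per customer.
totals = defaultdict(float)
for o in joined:
    totals[o["customer"]] += o["amount"]

print(dict(totals))  # {'Alice': 65.0}
```

In an ETL tool these would be drag-and-drop activities rather than code, but seeing the operations spelled out makes it clear what each built-in block is doing to the DataSet.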


However, ETL is a true low-code platform, so sometimes its built-in tools alone cannot meet every expectation. Is there a way to execute a transform activity in such special cases? Yes: some special handling requires intervention from developers.


Example scenario: you need to append header and footer strings to your destination output file, but your ETL tool has no built-in support for adding free text to a file at an arbitrary position. So how do we resolve this? ETL tools allow us to call serverless functions (or execute scripts, e.g. in Python, depending on the ETL provider), so a workable solution is:

  1. Sink the output file into storage.
  2. Create a serverless function in C#, JavaScript, Java, or another language (with a file path as its input parameter) that appends the strings to the file and saves the overwritten file back to storage.
  3. To execute the transform, have ETL invoke the serverless function from step (2).
  4. Finish the transformation activity.
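The core of that serverless function, sketched in Python under the assumption that the sunk file is reachable as a local path (a real function would read from and write back to cloud storage, and the function name is hypothetical):

```python
def append_header_footer(path, header, footer):
    """Hypothetical serverless handler: wrap an existing file in a header
    and a footer line, then overwrite the file in place."""
    with open(path, "r", encoding="utf-8") as f:
        body = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(header + "\n" + body + footer + "\n")

# The ETL pipeline would invoke this with the path of the file it just sank,
# e.g. append_header_footer(output_path, "REPORT START", "REPORT END").
```

The ETL tool only needs to know the function's endpoint and the file path to pass; everything the built-in activities cannot express lives inside this small piece of code.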


Step 3 - Load Activities: This is the final step in the ETL processing flow. It depends on your business and solution: you sink the final data to the target storage you will use, such as a database, file storage system, data warehouse, data lake, etc.
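Putting the three steps together, a whole pipeline can be sketched end to end. The sample rows are hypothetical, and SQLite stands in for whatever destination storage your solution actually uses.

```python
import sqlite3

def extract():
    # Extract: raw rows as they might arrive from a source file (hypothetical data).
    return [{"id": "1", "name": " Alice "},
            {"id": "2", "name": "Bob"},
            {"id": "1", "name": " Alice "}]  # note the duplicate row

def transform(rows):
    # Transform: clean values and drop duplicates.
    seen, out = set(), []
    for r in rows:
        rec = (int(r["id"]), r["name"].strip())
        if rec not in seen:
            seen.add(rec)
            out.append(rec)
    return out

def load(rows, conn):
    # Load: sink the structured rows into the destination storage (SQLite here).
    conn.execute("CREATE TABLE IF NOT EXISTS people (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT COUNT(*) FROM people").fetchone()[0])  # 2
```

Each function maps onto one stage of the pipeline, which is also why ETL tools can run independent pipelines (or independent activities within a stage) in parallel.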

Refer to the image below for an overview of the process.

Figure 4. ETL process workflow

Introducing ETL and Data Transformation Tools

There are a lot of ETL and data transformation tools, but I want to introduce and highly recommend some popular ones with powerful features and support from big corporations:

Azure Data Factory (ADF)

ADF is a cloud-based ETL and data transformation tool developed by Microsoft that runs on the Microsoft Azure cloud. It can automatically pull data from services both outside and inside Azure, such as an FTP server, an on-premises database, Azure Blob Storage, or other clouds (AWS, Google, ...), and it supports CI/CD with Azure DevOps.

Reference: Azure Data Factory

IBM DataStage

DataStage is an ETL tool from IBM and part of the IBM platform services. However, it is not a cloud-based tool; you need to buy a license and install the software on your own machines. When working with it, you must spend effort managing the software yourself, for example setting up replication or parallel processing. If you do not want to spend time and resources on this, one option is IBM DataStage on AWS, since AWS already hosts IBM DataStage as a cloud service.

Reference: IBM DataStage, AWS DataStage

Oracle Data Integrator

ODI is a high-performance ETL tool developed by Oracle. It offers two types of environments: on-premises and cloud-based (Oracle Cloud). ODI's strength is its open architecture, which makes it easy to interface with other big data tools such as Hadoop, Spark Streaming, Hive, Kafka, HBase, Sqoop, Pig, Cassandra, NoSQL databases, etc.

Reference: Oracle Data Integrator

Apache Hadoop

Hadoop is an open-source software framework from Apache and a free, powerful tool for big data processing. ETL is supported in Hadoop (via MapReduce). It is not easy to learn, but Hadoop has a large community, so you can get support from it.

Reference: Hadoop

AWS Glue

AWS Glue is a serverless data integration tool that runs on Amazon cloud services and supports ETL and data transformation. However, AWS Glue only connects to data sources hosted on AWS, so it is a great fit if your solution already runs in an AWS environment.

Reference: AWS Glue

Why should we use ETL and data transformation?

In fact, many companies already use ETL and data transformation to aggregate and process data from many places in order to make predictions and shape their business strategies. With ETL, they can move data to any location, on any system, quickly and easily, without needing to assemble a team of professional programmers.

Let's take a look at the strengths of ETL and data transformation to understand why they should be used.

Low cost: there is no need to write custom software; a data engineer can use an ETL tool without coding experience.

Scale and replication: there is no need to worry about this, because almost all ETL tools handle it automatically.

Support for big data in different file formats, from anywhere in the world.

Support for non-relational databases, relational databases, data lakes, and data warehouses.

Support for connecting to multiple data analytics tools (Power BI, Power Platform, Tableau, Zoho, ...).

A real-world example I'd like to share with you involves the use of ETL and data transformation. (Of course, I cannot share details because of a security agreement.)

Our partner provides solutions for finance companies to analyze trading signals and their investors. The problem is that every company has its own data format with different types of data, and the source storage differs completely as well. We are not allowed to access their databases directly, only the data in shared storage: an FTP server, a Web API, S3, or Blob storage.

The partner's old solution was custom software, running on an on-premises server, that pulled data from those companies automatically, plus another piece of software that converted the raw data into databases with a clear structure.

Everything worked fine at first, but in the long run they ran into a few main problems:

  • The software kept producing issues, and a strong team was needed to develop it and fix them.
  • Performance was slow when data grew large; downstream services took a long time to receive data.
  • Scaling the on-premises servers was costly; they had to build IT and DevOps teams to manage them.

A data engineering expert suggested using ETL tools to replace this custom software. You can refer to the solution in the image below:


Figure 5. Apply Azure Data Factory to build Data Transformation

Benefits after applying this solution:

  • Data engineers can take part in the transformation processes, reducing the size of the development team.
  • Everything is managed by the Azure cloud, so there is no need to keep and maintain on-premises servers.
  • Automatic scaling and replication with the Azure cloud.
  • High availability and performance.
  • Fast updates and deployment with Azure CI/CD.
  • Better security, since the Azure cloud handles security concerns.
  • No problems with big data when using a Data Lake.


This article comes from my own practical experience on a project that used ETL tools and data transformation to great effect. I hope it helps you learn these techniques and apply them to collecting and processing data, especially as data grows ever larger and more complex.


  • Demo source code: N/A




