ETL Data Transformation and Why We Should Use It

Linh Pham

Data is king, especially in data analysis. Analysis requires collecting data from many places, and it takes real effort to transform that data before it can be used. What would you think if we had a system that could help without requiring deep developer expertise? Data transformation is exactly that.

What is Data Transformation?

Data transformation is a chain of processes and activities that modifies, structures, formats, and computes values for data coming from multiple sources, and often multiple formats (CSV, JSON, TXT, binary, image, etc.). The result is data in a standard format, stored in a destination system or application, that we can use for research, as an application's input, for reporting, or for predicting the future (AI).
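To make this concrete, here is a minimal sketch in plain Python of what "multiple sources, multiple formats, one standard schema" can look like. The field names and sample data are hypothetical; real pipelines would read from files or APIs rather than inline strings.

```python
import csv
import io
import json

# Hypothetical inputs: the same kind of customer data arriving as CSV and as JSON.
csv_source = "id,name,signup\n1,Alice,2022-01-05\n2,Bob,2022-03-17\n"
json_source = '[{"customer_id": 3, "full_name": "Carol", "signed_up": "2022-06-02"}]'

def normalize_csv(text):
    # Map the CSV columns onto the standard schema.
    return [{"id": int(r["id"]), "name": r["name"], "signup_date": r["signup"]}
            for r in csv.DictReader(io.StringIO(text))]

def normalize_json(text):
    # Map the JSON field names onto the same schema.
    return [{"id": r["customer_id"], "name": r["full_name"], "signup_date": r["signed_up"]}
            for r in json.loads(text)]

# After normalization, both sources share one format the destination can use.
records = normalize_csv(csv_source) + normalize_json(json_source)
```

Whatever the tool, the core idea is the same: each source format gets its own small mapping onto a single agreed-upon schema.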

The next thing I want to show you is the main workflow of Data Transformation.

Figure 1. Data Transformation Workflow

Look at Figure 1 above; it is easy to see that the data transformation process has four basic steps:

Collect data

Determine which data sources must be gathered, as well as their original data formats.

Put data into a Data Lake / Data Warehouse

We must choose between two types of storage, a data lake or a data warehouse:

  1. Consider a data warehouse when the source data is structured and has a clear format.
  2. Consider a data lake (preferred) when the source data is unstructured, messy, or in mixed formats, or when the source storage needs to scale on demand.

Run an ETL (Extract, Transform, Load) job with data pipelines

This is the main step: extract data from the sources and execute the transformations. After this step, the data is structured. (Keep reading to understand ETL in detail.)

Sink to destination storage

The data is now clean and structured, ready to be queried by any service.

Below are some very simple cases where data transformation applies (no advanced scenarios here; let's keep it simple to understand).

Figure 2. Transform raw data into structured and formatted data

Figure 3. Remove duplicate data
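The two figures above can be sketched in a few lines of plain Python. The sample values are hypothetical; the point is the shape of the two operations: normalize formatting (Figure 2), then drop duplicates (Figure 3).

```python
# Raw rows with inconsistent formatting and duplicates (hypothetical data).
raw = ["  alice@example.com ", "BOB@example.com", "alice@example.com", "bob@example.com"]

def clean(value):
    # Figure 2 idea: trim whitespace and normalize case into one standard format.
    return value.strip().lower()

seen, result = set(), []
for row in raw:
    row = clean(row)
    # Figure 3 idea: drop duplicates while preserving the original order.
    if row not in seen:
        seen.add(row)
        result.append(row)

print(result)  # ['alice@example.com', 'bob@example.com']
```

ETL tools perform the same logic through built-in activities instead of hand-written loops, but the underlying operations are exactly these.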

ETL Process in Data Transformation

ETL is not a new concept; it was introduced in the 1970s and is now the main method used in data warehouse projects. Data transformation involves multiple steps, and ETL is the most important one for moving data from multiple source types to destination storage. ETL (Extract, Transform, Load) is a data integration standard that combines data from multiple sources into a single entity and then stores it in a data warehouse, database, or any other storage location you need to query from your services.

So how does ETL work?

The best way to understand how ETL works is to walk through each step in detail. So what happens inside the ETL process?


In practice, the ETL process runs via one or more pipelines, and every pipeline runs through the three steps below:

Note: ETL supports parallel execution.


Step 1 - Extract Activities: By default, an ETL data pipeline cannot run without a trigger from a data source. In this step, one or more activities receive the triggered data (a zip file, CSV, image, etc.) and load the raw data into pipeline memory. This in-memory data is normally called a "DataSet", and a data pipeline can hold one or more DataSet objects.

Example: if the triggered source file is a zip package, we first need an activity to unzip it, and then another activity to load all the data files inside the package into DataSet objects.
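The unzip-then-load example above can be sketched with Python's standard `zipfile` module. The zip package is built in memory here purely so the example is self-contained; in a real pipeline the trigger would deliver an actual file, and the "DataSet" would be the tool's own object, not a plain dict.

```python
import io
import zipfile

# Build an in-memory zip package standing in for the triggered source file
# (hypothetical file names and contents).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("orders.csv", "id,total\n1,9.99\n")
    zf.writestr("refunds.csv", "id,total\n2,4.50\n")
buf.seek(0)

# Extract activity: unzip the package and load every data file it contains
# into an in-memory "DataSet" (here, a dict of file name -> file contents).
dataset = {}
with zipfile.ZipFile(buf) as zf:
    for name in zf.namelist():
        dataset[name] = zf.read(name).decode("utf-8")
```

Once the raw files are in memory as DataSets, the transform step can operate on them without touching the original source again.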

In summary, the main job of this step is to prepare the input data for the ETL data pipeline.


Step 2 - Transform Activities: This step contains the main activities of ETL data pipeline processing. ETL tools provide powerful built-in transformations such as filtering, removing duplicate data, inserting and removing columns, and applying math formulas, plus data processing features such as grouping, joining, and combining data from multiple datasets, making HTTP requests, calling webhooks, and so on. (Some ETL tools also let you connect to third-party software or services, or execute scripts.)
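A few of those built-in transformations (filter, join, group) can be illustrated in plain Python. The datasets and field names below are hypothetical stand-ins for what the extract step would produce.

```python
from collections import defaultdict

# Hypothetical DataSets produced by the extract step.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 25.0},
    {"order_id": 2, "customer_id": 10, "amount": 40.0},
    {"order_id": 3, "customer_id": 11, "amount": 15.0},
]
customers = {10: "Alice", 11: "Bob"}

# Filter: keep only orders at or above a threshold.
filtered = [o for o in orders if o["amount"] >= 20.0]

# Join: attach the customer name from the second dataset.
joined = [{**o, "customer": customers[o["customer_id"]]} for o in filtered]

# Group + aggregate: total spend per customer.
totals = defaultdict(float)
for o in joined:
    totals[o["customer"]] += o["amount"]

print(dict(totals))  # {'Alice': 65.0}
```

In an ETL tool these would be drag-and-drop activities rather than code, but seeing the operations spelled out makes it clear what each built-in block is doing to the DataSet.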


However, ETL is a true low-code platform, so sometimes its built-in tools alone cannot meet every expectation. Is there a way to execute a transform activity in such special cases? Yes: some special handling requires intervention from developers.


Example scenario: you need to append header and footer strings to your destination output file, but your ETL tool has no built-in support for adding free text to a file at an arbitrary position. So how do we resolve this? ETL tools allow us to call serverless functions (or execute scripts, e.g. in Python, depending on the ETL provider), so a workable solution is:

  1. Sink the output file into storage.
  2. Create a serverless function in C#, JavaScript, Java, or another language (with a file path as its input parameter) that appends the strings to the file and saves the overwritten file back to storage.
  3. To execute the transform, have ETL invoke the serverless function from step (2).
  4. Finish the transformation activity.
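The core of that serverless function, sketched in Python under the assumption that the sunk file is reachable as a local path (a real function would read from and write back to cloud storage, and the function name is hypothetical):

```python
def append_header_footer(path, header, footer):
    """Hypothetical serverless handler: wrap an existing file in a header
    and a footer line, then overwrite the file in place."""
    with open(path, "r", encoding="utf-8") as f:
        body = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(header + "\n" + body + footer + "\n")

# The ETL pipeline would invoke this with the path of the file it just sank,
# e.g. append_header_footer(output_path, "REPORT START", "REPORT END").
```

The ETL tool only needs to know the function's endpoint and the file path to pass; everything the built-in activities cannot express lives inside this small piece of code.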


Step 3 - Load Activities: This is the final step in the ETL processing flow. It depends on your business and solution: you sink the final data to the target storage you will use, such as a database, file storage system, data warehouse, data lake, etc.
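Putting the three steps together, a whole pipeline can be sketched end to end. The sample rows are hypothetical, and SQLite stands in for whatever destination storage your solution actually uses.

```python
import sqlite3

def extract():
    # Extract: raw rows as they might arrive from a source file (hypothetical data).
    return [{"id": "1", "name": " Alice "},
            {"id": "2", "name": "Bob"},
            {"id": "1", "name": " Alice "}]  # note the duplicate row

def transform(rows):
    # Transform: clean values and drop duplicates.
    seen, out = set(), []
    for r in rows:
        rec = (int(r["id"]), r["name"].strip())
        if rec not in seen:
            seen.add(rec)
            out.append(rec)
    return out

def load(rows, conn):
    # Load: sink the structured rows into the destination storage (SQLite here).
    conn.execute("CREATE TABLE IF NOT EXISTS people (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT COUNT(*) FROM people").fetchone()[0])  # 2
```

Each function maps onto one stage of the pipeline, which is also why ETL tools can run independent pipelines (or independent activities within a stage) in parallel.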

Refer to the image below for an overview of the process.

Figure 4. ETL process workflow

Introducing ETL and Data Transformation Tools

There are a lot of ETL and data transformation tools, but I want to introduce and highly recommend some popular ones with powerful features and support from big corporations:

Azure Data Factory (ADF)

ADF is a cloud-based ETL and data transformation tool developed by Microsoft that runs on the Microsoft Azure cloud. It can automatically pull data from services both outside and inside Azure, such as an FTP server, an on-premises database, Azure Blob Storage, or other clouds (AWS, Google, ...), and it supports CI/CD with Azure DevOps.

Reference: Azure Data Factory

IBM DataStage

DataStage is an ETL tool from IBM and part of the IBM platform services. However, it is not a cloud-based tool; you need to buy a license and install the software on your own machines. When working with it, you must spend effort managing the software yourself, for example setting up replication or parallel processing. If you do not want to spend time and resources on this, one option is IBM DataStage on AWS, since AWS already hosts IBM DataStage as a cloud service.

Reference: IBM DataStage, AWS DataStage

Oracle Data Integrator

ODI is a high-performance ETL tool developed by Oracle. It offers two types of environments: on-premises and cloud-based (Oracle Cloud). ODI's strength is its open architecture, which makes it easy to interface with other big data tools such as Hadoop, Spark Streaming, Hive, Kafka, HBase, Sqoop, Pig, Cassandra, NoSQL databases, etc.

Reference: Oracle Data Integrator

Apache Hadoop

Hadoop is an open-source software framework from Apache and a free, powerful tool for big data processing. ETL is supported in Hadoop (via MapReduce). It is not easy to learn, but Hadoop has a large community, so you can get support from it.

Reference: Hadoop

AWS Glue

AWS Glue is a serverless data integration tool that runs on Amazon cloud services and supports ETL and data transformation. However, AWS Glue only connects to data sources hosted on AWS, so it is a great fit if your solution already runs in an AWS environment.

Reference: AWS Glue

Why should we use ETL and data transformation?

In fact, many companies already use ETL and data transformation to aggregate and process data from many places in order to make predictions and shape their business strategies. With ETL, they can move data to any location, on any system, quickly and easily, without needing to assemble a team of professional programmers.

Let's take a look at the strengths of ETL and data transformation to understand why they should be used.

Low cost: there is no need to write custom software; a data engineer can use an ETL tool without coding experience.

Scale and replication: there is no need to worry about this, because almost all ETL tools handle it automatically.

Support for big data in different file formats, from anywhere in the world.

Support for non-relational databases, relational databases, data lakes, and data warehouses.

Support for connecting to multiple data analytics tools (Power BI, Power Platform, Tableau, Zoho, ...).

A real-world example I'd like to share with you involves the use of ETL and data transformation. (Of course, I cannot share details because of a security agreement.)

Our partner provides solutions for finance companies to analyze trading signals and their investors. The problem is that every company has its own data format with different types of data, and the source storage differs completely as well. We are not allowed to access their databases directly, only the data in shared storage: an FTP server, a Web API, S3, or Blob storage.

The partner's old solution was custom software, running on an on-premises server, that pulled data from those companies automatically, plus another piece of software that converted the raw data into databases with a clear structure.

Everything worked fine at first, but in the long run they ran into a few main problems:

  • The software kept producing issues, and a strong team was needed to develop it and fix them.
  • Performance was slow when data grew large; downstream services took a long time to receive data.
  • Scaling the on-premises servers was costly; they had to build IT and DevOps teams to manage them.

A data engineering expert suggested using ETL tools to replace this custom software. You can refer to the solution in the image below:


Figure 5. Apply Azure Data Factory to build Data Transformation

Benefits after applying this solution:

  • Data engineers can take part in the transformation processes, reducing the size of the development team.
  • Everything is managed by the Azure cloud, so there is no need to keep and maintain on-premises servers.
  • Automatic scaling and replication with the Azure cloud.
  • High availability and performance.
  • Fast updates and deployment with Azure CI/CD.
  • Better security, since the Azure cloud handles security concerns.
  • No problems with big data when using a Data Lake.


This article comes from my own practical experience on a project that used ETL tools and data transformation to great effect. I hope it helps you learn these techniques and apply them to collecting and processing data, especially as data grows ever larger and more complex.


  • Demo source code: N/A




