Data Integration and ETL for Dummies (Like Me)
In early 2020, I was introduced to the idea of data integration through a friend who was working in the industry. Yes, I know. Extremely late. All I knew about it was that I could have my data in one (virtual) place and then have it magically appear in another (virtual) place. I had no clue how it was done or how important it was to modern businesses.
To give you some background, my past work experience is not in any kind of technical space. It is in business development and marketing for non-technical products. I probably should have been more aware of the technical world around me, but for now, you must forgive me for my ignorance.
A couple days after my friend and I had our initial discussion about the data integration space, he enlightened me to the importance of being able to extract, transform, and load (ETL) data. I was curious about the inner workings behind the process, so I decided I wanted to learn more about ETL and data integration.
My goal with this article is to help you learn the basics that I believe are the most important in understanding what data integration is. As a non-technical professional, I had no clue where to start other than to look up “What is data integration?” So let's start there.
What Is Data Integration?
In general, it is a bridge for data. Data integration allows for data that is sitting in separate places (often referred to as sources) to be brought together into a unified view. A unified view is when all the data you have gathered from various sources is put all in one place, allowing you to view all your data in one place.
This doesn’t mean that the data no longer exists in the original source you pulled it from. A more accurate way of portraying it is as if someone copied the data from the original source and then sent that copy of the original data across the bridge. This is similar to the idea of data replication, except for that data replication is just copying data into different places for security and safety purposes.
So in reality, data integration is not moving the original data around, but rather replicating it and putting the copy of the original data somewhere else (that somewhere else usually referred to as a target) for purposes we will get into a later in the article.
Another important piece of the data integration space is the transformation of data. This is an essential piece of the data integration world as it can inhibit your data integration pipeline (a pipeline is the data integration from start to finish — importing the data to exporting the data) from being able to join all the data.
The problem is that data from different sources often come in different formats which cannot be combined into a unified view unless some or all of the data is transformed to the same format. I will touch more on this later in the article.
Who Needs Data Integrated?
There are two major use cases for data integration.
Business Analysts
The first is business analysts who want to be able to check all their business intelligence in one place. Instead of going to Google Analytics to check how many people visited their website, Mixpanel to check what people clicked inside their website, Chargebee to check analytics on their subscription billings, and Mailchimp to check how their cold email campaign is doing, data integration allows these analysts to see it all in one place. Let’s look at an example.
This is a simple dashboard with two graphs. The first shows monthly recurring revenue and the second shows the churn for the company. Using a data integration pipeline, the user of this dashboard was able to connect their business accounts to populate these graphs. They are now able to view their data from multiple sources in a simple unified view. Starting to see why this can be useful?
Developers
The second use case is for developers who are building products that require some sort of input of data. Try to put yourself in this scenario.
You are a developer building an accounting product that helps accountants recognize inconsistencies with their (online) bookkeeping. To do this, you need to be able to access the data of the bank account they are working with, their ERP system (an ERP system is software that helps manage main business processes), and their billing system.
To bring all of this data together, the developer would have to set up a data integration pipeline that pulls from the bank account, the ERP system, and the billing system. Once they have done that, they can load the data into their product and check for any inconsistencies in the bookkeeping.
As you can now see, data integration is a valuable and important part of almost any company’s business processes these days. We will touch more on why it is so important later in this article.
How Does Data Integration Happen?
This is probably the most abstract thing that I’ll touch in this article. As I mentioned before, you have to imagine data integration as a virtual bridge between a source and a target.
Analysts or developers will choose from three options when it comes to building their data integration pipeline. The first is having their own team build integrations. This is a time consuming process — one integration (meaning an integration with one specific source) can take a small team up to two weeks to build, but that is not even the biggest expense.
Once built, these integrations need to be maintained. This is because, without getting too technical, there are some moving pieces that change on a consistent basis, requiring the specific integration to be updated regularly.
The second option is hiring a team of professionals to build integrations for you. While this is a hands-off custom solution, this option is typically expensive. The team of professionals would be doing the exact same work if you were to build it yourself, but (obviously) it would abstract away any of the work from your own team.
The third option is using a third-party tool to help a company build their pipeline. This is a hybrid between the two previous options. You are still using your own team to set up the integrations, but many of the necessary resources are abstracted away by the software platform they decide to use. You do not have to build the integration or build the target, but usually just define the transformation layer (this means that you are deciding the format you want your data to be in).
Something to note with these tools is that some are made for developers (like hotglue) and other tools are made for analysts (like Fivetran). The tools for developers often handle more complex processes as it allows developers to take a more granular approach, while the analyst tools are great for non-technical professionals looking to set up a simple data integration pipeline.
In the past few years, many companies have been trending towards using these types of tools as they are becoming increasingly more efficient, customizable, and affordable.
The reason that some companies elect to handle their own data integration pipeline is because of the complexity the integrations may require. Typically, third-party solutions tend to be a bit more rigid and don’t allow for companies to customize their pipeline on a granular level.
On a more specific level, there are a bunch of specific processes that are housed under the term data integration. The most popular one is…
ETL (Extract, Transform, Load)
ETL is the most popular processes within data integration. It is used when someone wants to bring data from multiple sources together but also requires that the format of the data to change.
For example, if someone had an Excel or CSV file that they wanted to move to a specific database but all the other data sources were in JSON format (a very popular data interchange format), that file would have to be transformed into a JSON in order for it to be combined with the others.
Once the file is transformed, it is sent to the target defined for this specific pipeline where it is put together with all the other information that was pulled through.
Sometimes the process is altered to be ELT (Extract, Load, Transform) for different use cases with smaller amounts of data, but each step still serves the same function.
Why Is Data Integration Important?
Companies that want to remain competitive need to embrace the benefits that data will give them, despite the challenges that arise when dealing with integrating data. Having an efficient data integration pipeline allows for these companies to utilize humongous datasets. It helps these companies improve their value to customers — from deepening their business intelligence to improving their real time information delivery.
Conclusion
Let’s quickly do a review of some of the vocabulary we learned today.
- Source: Where data originally sits and will be pulled from
- Unified view: The singular record that is made up of the data brought in from various sources
- Data replication: The process of creating copies of data and storing them in different locations — usually for safety and security measures
- Target: Where you want the data to end up
- Transform/transformation: Changing the formatting of data
- Pipeline: A data integration process from start (initially getting the data) to end (where the data is outputted)
- ERP: Enterprise resource planning is software that helps companies manage main business processes like accounting
- ETL: Stands for Extract, Transform, Load — the most popular data integration framework used today
- CSV: A popular file format called Comma Seperated Values (CSV)
- JSON: A popular type of data format
This is just the beginning to data integration. Data integration as a topic is quite large and this barely scratches the surface, but hopefully this article helped get you started. Cheers to no longer being a data integration dummy (maybe)!
Thank you so much for reading.