5 Things to Look for in Your Next Data Integration Platform
Learn what to look for in a data integration platform that will meet your company's unique needs.
- By Troy Hiltbrand
- May 28, 2021
As any business looks to expand its analytics program, one of the first decisions it faces is how to bring disparate data together in a unified fashion conducive to analysis and reporting. This process of integrating data often starts with a set of manual, labor-intensive processes coupled with generic desktop technology such as spreadsheets and text editors.
Quickly, analytics teams start to realize how much time and effort exercises of this type require. They also see how inflexible these mechanisms are in supporting growth. It is at this point they look to the market for data integration tools, searching for something that will provide scalability and enhanced features to help their analytics program mature.
If this sounds familiar to you and your team in your search for a data integration platform that will meet your company's unique requirements, it is helpful to understand what areas are essential to include in your evaluation. These five characteristics of modern data integration platforms will help you to pare down potential solutions to the one that fits your specific use case.
Sources and Destinations
The first criterion in evaluating a data integration platform is whether the platform supports the types of data in your data environment. You need to ensure the tool can connect to your sources to pull data and store your clean and formatted results in the correct destination.
This discussion includes an evaluation of whether the data integration platform supports the specific type of technology deployed in your environment. If you have a specific brand of database, ensure that the data integration platform has connectors to access that data. When evaluating these connectors, ensure that you understand if this is a native connection or a more generic connection, such as ODBC or JDBC. The method the data integration platform uses to connect to the database can have an overall impact on the complexity of setup and the performance of the connection. Both factors can impact the timeliness of getting data flowing through your environment.
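One way to make the connector review concrete is to compare the systems you run against what a vendor says it supports, noting whether each connection is native or generic. The following is only a minimal sketch: the Connector type, the coverage_report function, and the catalog contents are hypothetical and do not describe any real product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connector:
    system: str  # e.g. "postgresql"
    kind: str    # "native" or "generic" (ODBC/JDBC)


def coverage_report(required, connectors):
    """Classify each required system as native, generic, or missing."""
    by_system = {}
    for c in connectors:
        # Keep a native connector if one was already recorded for this system.
        if by_system.get(c.system) != "native":
            by_system[c.system] = c.kind
    return {system: by_system.get(system, "missing") for system in required}


# Hypothetical vendor catalog and hypothetical list of systems in your environment.
vendor_catalog = [
    Connector("postgresql", "native"),
    Connector("postgresql", "generic"),
    Connector("oracle", "generic"),
]
required_systems = ["postgresql", "oracle", "mongodb"]

report = coverage_report(required_systems, vendor_catalog)
for system, status in report.items():
    print(f"{system}: {status}")
```

Anything reported as generic or missing is a prompt for a deeper conversation with the vendor about setup effort and performance.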
As part of this evaluation, make sure you understand how the data integration platform can tunnel through security protocols in your environment to allow data to move freely as needed. This is essential if you have data in multiple parts of your network, in the cloud, or spanning both on-premises and off-premises environments. You must understand how your data integration platform will work effectively with firewalls and other forms of perimeter security as you work to integrate your data assets into a consistent, unified view.
In your evaluation, go deeper than just a surface discussion of how your structured databases will integrate through your data integration platform. Also take into account unstructured data sources. These could include spreadsheets (on-premises or in the cloud), document repository systems, and exposed APIs and webhooks using different methods of authentication. They can also include alternative forms of structured data, such as NoSQL databases, including key-value, graph, columnar store, and in-memory databases.
All of these technologies can be both sources and destinations in your data landscape and need to be considered in your discussions with data integration platform providers.
Bulk and Batch Data Movement
Once you are assured the data integration platform can communicate with your sources and destinations, the next step in the evaluation process is understanding the timing associated with the data movement.
You will need to assess how data can be moved in pre-defined batches. With batch management, one of the key criteria to understand is how scheduling is configured and managed. If you have a highly technical data integration team, you can probably leverage more technical configurations that use scripting or the command line. If you need a clean user interface for scheduling jobs and monitoring their progress, make sure you are clear about what your data integration platform offers.
With the batch movement of data, it is also important to understand logging, how logs are accessible, and what type of alerting is available if something goes awry in the batch process. With many modern logging platforms, there is incredible power associated with being able to pump every event and log into a central system and then query that platform to see exactly what you want. However, this comes at the cost of added complexity. The more flexible the platform is in capturing all logs and events in the system, the more effort it takes to extract that data in a format that can drive the right decisions.
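To illustrate the trade-off, here is a minimal sketch of batch jobs emitting structured events that are later queried for failures. The event fields, job names, and helper functions are assumptions made for the example; a real logging platform will have its own schema and query language.

```python
import json
from datetime import datetime, timezone


def make_event(job, status, rows, detail=""):
    """Serialize one batch-job event as a JSON line (hypothetical schema)."""
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "job": job,
        "status": status,
        "rows": rows,
        "detail": detail,
    })


def failed_jobs(log_lines):
    """Return the sorted names of jobs with at least one failed run."""
    events = (json.loads(line) for line in log_lines)
    return sorted({e["job"] for e in events if e["status"] == "failed"})


# Stand-in for a central log store: a list of JSON lines.
log_lines = [
    make_event("load_orders", "succeeded", 1200),
    make_event("load_customers", "failed", 0, "connection timeout"),
    make_event("load_orders", "succeeded", 1185),
]

alerts = failed_jobs(log_lines)
print("jobs needing attention:", alerts)
```

Even in this small case, someone has to decide which fields to capture and write the query that turns raw events into an alert, which is the effort the paragraph above describes.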
With bulk processes, you will often need mechanisms to limit the amount of data flowing through your data integration process at any given time. Identify how your candidate tools can manage batches. This could include a control mechanism based on timestamp data integrated into your source or could utilize more complex forms of change data capture. These mechanisms help you ensure you have the right data flowing across without overloading the system by copying data multiple times from your source to your destination, potentially causing duplicates and performance inefficiencies.
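A timestamp-based control mechanism can be sketched in a few lines. This example assumes a hypothetical orders table with an updated_at column stored as ISO 8601 text, and uses an in-memory SQLite database as a stand-in for the source system.

```python
import sqlite3


def extract_since(conn, watermark):
    """Return rows changed after the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # Advance the watermark only when something was extracted.
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark


# Hypothetical source table for the sketch.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2021-05-01T00:00:00"),
        (2, 25.5, "2021-05-02T00:00:00"),
        (3, 40.0, "2021-05-03T00:00:00"),
    ],
)

# The first run picks up everything after the stored watermark.
first_batch, watermark = extract_since(source, "2021-05-01T00:00:00")
# The second run finds nothing new, so no rows are copied twice.
second_batch, watermark = extract_since(source, watermark)
print(len(first_batch), len(second_batch), watermark)
```

One limitation to keep in mind: a strict greater-than comparison can miss rows that share the watermark's exact timestamp, which is one reason some teams move to log-based change data capture.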
Streaming Data
As a companion to batch processing, modern data environments need to access streaming data. When working with streaming data, you need to understand how your data integration platform works with messages and queues. These form the backbone of a real-time data environment.
As part of the evaluation of the streaming capability, determine if processing can be parallelized so you can increase capacity as the data load grows. You also need to understand the capabilities and limitations associated with the inbound processing of stream data. This will help you to understand what types of transformation can occur during streaming and what types of transformation must occur post load (in a subsequent batch process that is either scheduled or event-driven).
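Parallel stream processing commonly relies on partitioning messages by key so that each partition can be handled independently. The sketch below illustrates the idea with a thread pool; the message shape, function names, and partition count are assumptions, and a real platform would distribute partitions across processes or machines rather than threads.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_for(key, partitions):
    """Map a key to a partition with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % partitions


def process_partition(messages):
    """In-flight transformation: total the amounts per key."""
    totals = defaultdict(float)
    for key, amount in messages:
        totals[key] += amount
    return dict(totals)


def process_stream(messages, partitions=4):
    """Route messages to partitions, then process the partitions concurrently."""
    buckets = defaultdict(list)
    for key, amount in messages:
        buckets[partition_for(key, partitions)].append((key, amount))
    merged = {}
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        for result in pool.map(process_partition, buckets.values()):
            # Each key lives in exactly one partition, so results never collide.
            merged.update(result)
    return merged


# Hypothetical (key, amount) messages.
messages = [("store-a", 5.0), ("store-b", 2.5), ("store-a", 1.0), ("store-c", 4.0)]
totals = process_stream(messages)
print(totals)
```

Because all messages for a key land in the same partition, adding partitions increases capacity without changing the per-key result. Transformations that need data from several keys at once are the kind that often have to wait for a post-load step.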
With message queuing, it is important to understand what the platform offers in terms of high availability and recoverability. You will want to ensure that if there is a problem with the data mid-flight in the stream, it can be replayed without corrupting your target destination.
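One common way to make replay safe is to record each message's ID in the same transaction that applies its effect, so a replayed message is recognized and skipped. This is a minimal sketch of that idea, with hypothetical table names and message shapes and an in-memory SQLite database standing in for the target.

```python
import sqlite3


def apply_message(conn, message_id, account, amount):
    """Apply a message once; return False if it was already processed."""
    with conn:  # one transaction: commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed (message_id) VALUES (?)",
            (message_id,),
        )
        if cur.rowcount == 0:
            return False  # duplicate delivery or replay: skip the side effect
        conn.execute(
            "INSERT OR IGNORE INTO balances (account, balance) VALUES (?, 0)",
            (account,),
        )
        conn.execute(
            "UPDATE balances SET balance = balance + ? WHERE account = ?",
            (amount, account),
        )
        return True


# Hypothetical target tables for the sketch.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE processed (message_id TEXT PRIMARY KEY)")
target.execute(
    "CREATE TABLE balances (account TEXT PRIMARY KEY, balance REAL NOT NULL)"
)

stream = [("m1", "acct-1", 100.0), ("m2", "acct-1", -30.0)]
for message in stream:
    apply_message(target, *message)

# Replaying the whole stream leaves the target unchanged.
replayed = [apply_message(target, *message) for message in stream]
balance = target.execute(
    "SELECT balance FROM balances WHERE account = ?", ("acct-1",)
).fetchone()[0]
print(replayed, balance)
```

When you talk to vendors, it is worth asking whether the platform provides a comparable guarantee itself or expects you to build it into the target.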
Metadata Management
As your data catalog grows and the number of sources increases, you will want to be clear about where the data came from and what types of transformations it underwent in the process. This is where metadata management tools, whether included with your data integration platform or added to it, become important. You must understand what data lineage tools are provided.
In addition to understanding where your data came from, you will want to have traceability on the business and technical rules in place during the transformation process. This could include data cleansing processes, match and merge processes, and specific business logic associated with getting your data ready for downstream analysis.
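To show what lineage and rule traceability might capture, the sketch below attaches a lineage record to a dataset each time a transformation runs. The Dataset class, the step names, and the sample rows are all hypothetical; commercial lineage tools record far more detail.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    rows: list
    lineage: list = field(default_factory=list)


def apply_step(dataset, step_name, rule, func):
    """Run a transformation and append a lineage entry describing it."""
    new_rows = func(dataset.rows)
    entry = {
        "step": step_name,
        "rule": rule,  # the business or technical rule, in plain language
        "input": dataset.name,
        "rows_in": len(dataset.rows),
        "rows_out": len(new_rows),
    }
    return Dataset(f"{dataset.name}.{step_name}", new_rows, dataset.lineage + [entry])


def normalize(rows):
    """Cleansing rule: trim and lowercase the email address."""
    return [{"email": r["email"].strip().lower()} for r in rows]


def dedupe(rows):
    """Match-and-merge rule: keep the first row for each email."""
    seen, unique = set(), []
    for r in rows:
        if r["email"] not in seen:
            seen.add(r["email"])
            unique.append(r)
    return unique


# Hypothetical source data.
raw = Dataset("crm.customers", [
    {"email": " A@Example.com "},
    {"email": "a@example.com"},
    {"email": "b@example.com"},
])
cleansed = apply_step(raw, "cleanse", "trim and lowercase email", normalize)
merged = apply_step(cleansed, "merge", "deduplicate on email", dedupe)

for entry in merged.lineage:
    print(entry)
```

Each entry answers the two questions raised above: where the data came from and which rule changed it.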
Be sure you understand how you will audit changes to the data over time. This will allow you to assure information consumers that the data they are using for key decisions is accurate, durable, and won't change without reason.
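At its simplest, auditing means recording the old value, the new value, who made the change, and when. The following sketch assumes a hypothetical record layout and function name; many platforms offer this as a built-in capability instead.

```python
from datetime import datetime, timezone


def audited_update(record, changes, changed_by, audit_log):
    """Apply changes to a record and log every field that actually changed."""
    for field_name, new_value in changes.items():
        old_value = record.get(field_name)
        if old_value == new_value:
            continue  # nothing changed, so nothing to audit
        audit_log.append({
            "record_id": record["id"],
            "field": field_name,
            "old": old_value,
            "new": new_value,
            "changed_by": changed_by,
            "changed_at": datetime.now(timezone.utc).isoformat(),
        })
        record[field_name] = new_value
    return record


# Hypothetical customer record and an in-memory audit trail.
audit_log = []
customer = {"id": 42, "segment": "smb", "region": "west"}
audited_update(
    customer,
    {"segment": "enterprise", "region": "west"},
    changed_by="nightly_batch",
    audit_log=audit_log,
)
print(audit_log)
```

With a trail like this, you can tell an information consumer exactly why a value differs from what they saw last week.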
Data Virtualization
Finally, be sure you understand if the tool supports data virtualization. Can it federate multiple data sources in real time to provide end users with a unified view, without removing the data from its source system? Data virtualization can be beneficial when there are requirements for a real-time view of the data or when there are regulatory or business reasons that limit where the data can be copied. It is important to understand what the limitations are in terms of real-time data transformations as the data is being federated and what the performance impacts are as these transformation requirements become more complex.
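Federation can be illustrated on a small scale with SQLite's ATTACH DATABASE, which lets one connection query several databases at read time. In this sketch two in-memory databases stand in for separate source systems; the schema names, table names, and data are hypothetical.

```python
import sqlite3

# This connection stands in for the virtualization layer.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS sales")

# Each attached database plays the role of an independent source system.
conn.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE sales.orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
)
conn.executemany(
    "INSERT INTO sales.orders VALUES (?, ?)", [(1, 50.0), (1, 25.0), (2, 10.0)]
)


def unified_view(connection):
    """Join both sources when queried; nothing is copied into a third store."""
    return connection.execute(
        """
        SELECT c.name, SUM(o.amount)
        FROM crm.customers AS c
        JOIN sales.orders AS o ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY c.name
        """
    ).fetchall()


result = unified_view(conn)
print(result)
```

The join and aggregation run every time the view is requested, which is why transformation complexity translates directly into query latency in a virtualized design.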
In your data management projects, you might have a combination of data marts, a data warehouse, and a data lake. Each might have different requirements for what data is stored in the environment, what data is pulled in real time from a secondary source, and what data passes through the environment temporarily. Understanding this will help you better discuss your data integration platform requirements for data virtualization.
If you can delineate what you need in these five areas as you investigate data integration platforms, you will have a solid base for a comparative evaluation. This will help you to objectively assess multiple data integration platforms based specifically on what you need for success. Doing this will increase your chances of choosing a tool that meets your needs without overinvesting in functionality that has a low probability of being used or of contributing to your company's or team's goals.
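A weighted scoring matrix is one simple way to run such a comparison. The sketch below is illustrative only: the area labels, weights, platform names, and scores are invented for the example, and you would substitute your own.

```python
def weighted_score(scores, weights):
    """Weighted average of the per-area scores."""
    total_weight = sum(weights.values())
    return sum(scores[area] * w for area, w in weights.items()) / total_weight


# Hypothetical weights reflecting how much each area matters to your team.
weights = {
    "sources_destinations": 3,
    "batch": 2,
    "streaming": 2,
    "metadata": 1,
    "virtualization": 2,
}

# Hypothetical 1-5 scores from your own evaluation of two candidates.
candidates = {
    "platform_a": {
        "sources_destinations": 4, "batch": 5, "streaming": 2,
        "metadata": 3, "virtualization": 1,
    },
    "platform_b": {
        "sources_destinations": 3, "batch": 3, "streaming": 4,
        "metadata": 4, "virtualization": 4,
    },
}

ranking = sorted(
    candidates,
    key=lambda name: weighted_score(candidates[name], weights),
    reverse=True,
)
for name in ranking:
    print(name, round(weighted_score(candidates[name], weights), 2))
```

Setting a low weight for an area you are unlikely to use keeps a platform from winning on features that will not contribute to your goals.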