Enabling the Real-Time Enterprise with Data Streaming
Data-driven businesses are replacing batch ETL uploads with real-time streaming pipelines to transform analytics and business intelligence.
- By Yair Weinberger
- August 7, 2017
In virtually every industry, organizations want to unlock real-time intelligence from disconnected data sources in order to improve business agility and competitiveness. However, legacy batch upload architectures are too rigid to accommodate applications that require continuous data streams for adaptive decision making.
Although a growing number of organizations have placed their data warehouses in the cloud, on-demand storage and compute cycles do not address their data pipeline challenges.
Because batch processes upload data only once a day, twice a day, or at best once an hour, they cannot keep pace with the continuous data needs of warehouses and today's real-time BI applications.
Data Streaming: A Disruptive New Approach
To replace these legacy batch architectures, a new approach called data streaming has emerged. It creates real-time data pipelines for streaming applications, powered by open source technologies such as Apache Kafka.
Data streaming creates secure pipelines that stream data in real time from various sources -- notably databases, applications, and APIs -- to cloud data warehouse platforms.
It enables organizations to connect any data source within minutes to Amazon Redshift, Google BigQuery, Snowflake Computing, and other cloud data warehouses.
Organizations in virtually every industry are moving to a real-time data model to be more agile and achieve a competitive edge through faster, better decision making. Data streaming ensures that customers can integrate all their disparate data silos with the cloud provider of their choice.
Using Amazon Redshift, Google BigQuery, Snowflake Computing, and others, enterprises can create a central warehouse for data available from back-end relational databases, online events and metrics, support services, and other internal and external sources. Without centralization, analytics are both piecemeal and siloed. This makes it difficult, if not impossible, to produce real-time intelligence.
Creating this centralized data warehouse is not without its challenges because the data is spread across multiple sources and different systems in different formats. Some of it is flat, some is relational, some is JSON. Instead of writing custom scripts to integrate it all -- an effort beyond the resources of most companies -- data streaming technology can perform these tasks.
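To make the format problem concrete, here is a minimal sketch (with illustrative function and field names, not any vendor's API) of normalizing flat, relational, and JSON records into one row shape a warehouse loader could consume:

```python
import csv
import json

def normalize(record, source_format):
    """Turn one record from a flat (CSV), relational, or JSON source
    into a flat dict of column/value pairs."""
    if source_format == "csv":
        # record is a (header, line) pair from a flat file
        header, line = record
        return dict(zip(header, next(csv.reader([line]))))
    if source_format == "json":
        # flatten one level of nesting: {"user": {"id": 1}} -> {"user_id": 1}
        flat = {}
        for key, value in json.loads(record).items():
            if isinstance(value, dict):
                for k, v in value.items():
                    flat[f"{key}_{k}"] = v
            else:
                flat[key] = value
        return flat
    return dict(record)  # relational rows are already column/value pairs

row = normalize('{"user": {"id": 7, "plan": "pro"}, "event": "login"}', "json")
```

A real pipeline handles many more cases (deep nesting, arrays, type coercion), which is exactly why hand-written scripts become a maintenance burden.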
Data streaming relieves IT staff of the drudgery of data movement so IT can focus entirely on data analytics. Because data streaming technologies can support a comprehensive set of integrations, enterprises can easily stream and access all their data in the cloud data warehouse of their choice. Every bit of data -- no matter how big or small and regardless of source or format -- can be moved to the cloud without errors and without requiring a team of engineers to write scripts.
Data Streaming Checklist
Here are the key elements to consider when planning a data streaming project.
Flexible data integration: The ability to transport data in the format required to any data warehouse, regardless of whether the data is structured or semistructured, direct or customized, static or changing. This includes sources such as:
- Transactional databases (e.g. Oracle, PostgreSQL)
- Salesforce.com (account information, stage, ownership, etc.)
- Website tracking (all Web event data)
- Web servers (customer activity such as adding inputs and deploying new code in the code engine)
- Back-end logs (internal platform events, such as data being loaded to the output or a new table being created)
- Monitoring systems (to capture system issues such as input connections and latency)
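A pipeline that accepts this many source types typically wraps each raw event in a common envelope (source name, timestamp, payload) before it enters the stream, so downstream stages can route and monitor events without knowing each source's format. A minimal sketch, with illustrative field names:

```python
import json
import time

def make_envelope(source, payload, now=None):
    """Wrap a raw event from any source (database row, Salesforce object,
    web event, server log line, ...) in a uniform envelope."""
    return {
        "source": source,  # e.g. "postgres", "salesforce", "web"
        "received_at": now if now is not None else time.time(),
        "payload": payload,
    }

event = make_envelope("web", {"page": "/pricing", "user_id": 42}, now=0)
line = json.dumps(event)  # envelopes serialize cleanly onto the stream
```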
Schema import and schema inference: Expect step-by-step data preconfiguration tools that make it easy to map every field of structured or semistructured data to a table and control how data is loaded into the data warehouse.
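Schema inference can be sketched as scanning a sample of records and proposing a warehouse column type for each field. The type names and widening rule below are illustrative assumptions, not any specific warehouse's DDL:

```python
def infer_schema(records):
    """Propose a column type per field from sample records,
    widening to VARCHAR when observations conflict."""
    types = {}
    for record in records:
        for field, value in record.items():
            if isinstance(value, bool):  # check bool before int: bool is an int subtype
                t = "BOOLEAN"
            elif isinstance(value, int):
                t = "BIGINT"
            elif isinstance(value, float):
                t = "DOUBLE"
            else:
                t = "VARCHAR"
            types[field] = t if types.get(field, t) == t else "VARCHAR"
    return types

schema = infer_schema([{"id": 1, "name": "a"}, {"id": 2, "active": True}])
```

In practice a preconfiguration tool would show this proposed mapping field by field and let the user override it before loading begins.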
Code engine: Data scientists and engineers should be able to write custom code to enrich and cleanse data, create alerts, implement sessionization, and detect anomalies. To eliminate lengthy, high-latency data preparation jobs, all changes should be executed in-stream, in real time, so data arrives at its destination ready for analysis.
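The code-engine idea can be sketched as a user-supplied function applied to each event as it flows through the stream, able to enrich, cleanse, or drop it; the function and field names here are assumptions for illustration:

```python
def transform(event):
    """User code run in-stream on every event."""
    if not event.get("user_id"):  # cleanse: drop malformed events
        return None
    event["email"] = event.get("email", "").strip().lower()       # normalize
    event["is_internal"] = event["email"].endswith("@example.com")  # enrich
    return event

def run_stream(events, fn):
    """Apply the transform event by event; no batch preparation step."""
    for event in events:
        out = fn(event)
        if out is not None:
            yield out

cleaned = list(run_stream(
    [{"user_id": 1, "email": " Ann@Example.COM "}, {"email": "x@y.z"}],
    transform,
))
```

Because the transform runs per event rather than per batch, enrichment adds no scheduling latency.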
Live monitoring: Real-time visibility into data streams for monitoring behavior, identifying potential discrepancies, and debugging data records saves considerable time and helps you avoid problems. Live monitoring also lets you track incoming throughput, latency, loading rates, and error rates and can generate Web and email alerts.
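The counters live monitoring exposes can be sketched as running totals for throughput, error rate, and average latency; the alert threshold and its behavior here are illustrative assumptions:

```python
class StreamMonitor:
    """Track per-stream counters and flag when the error rate
    crosses a threshold (which would trigger a Web/email alert)."""

    def __init__(self, error_rate_alert=0.05):
        self.events = 0
        self.errors = 0
        self.total_latency = 0.0
        self.error_rate_alert = error_rate_alert

    def record(self, latency_seconds, ok=True):
        self.events += 1
        self.total_latency += latency_seconds
        if not ok:
            self.errors += 1

    def stats(self):
        rate = self.errors / self.events if self.events else 0.0
        avg = self.total_latency / self.events if self.events else 0.0
        return {
            "events": self.events,
            "error_rate": rate,
            "avg_latency": avg,
            "alert": rate > self.error_rate_alert,
        }

m = StreamMonitor()
for _ in range(95):
    m.record(0.2)
for _ in range(5):
    m.record(1.5, ok=False)
```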
Pipeline transparency: A dashboard that provides continuous views of data in motion and notifications that allow users to view incoming events, monitor throughput and latency, and identify errors in real time.
Schema management: When data changes, a real-time response is needed to make sure no event is lost. The ability to manage these schema changes automatically, or to generate notifications so changes can be made on demand, is critical.
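A hypothetical sketch of that behavior: compare each incoming event's fields against the known schema, and either extend the schema automatically or queue a notification, so the event itself is never lost either way. Names and the default type are illustrative assumptions:

```python
def handle_event(event, schema, auto_add=True, notifications=None):
    """Detect fields not yet in the schema; auto-add them or notify.
    The event is kept in both modes, so no data is dropped."""
    new_fields = set(event) - set(schema)
    for field in sorted(new_fields):
        if auto_add:
            schema[field] = "VARCHAR"  # conservative default column type
        elif notifications is not None:
            notifications.append(f"new field seen: {field}")
    return event

schema = {"id": "BIGINT"}
kept = handle_event({"id": 1, "country": "DE"}, schema)
```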
Yair Weinberger is cofounder and CTO of Alooma, a company that specializes in data integration. He is an expert in data integration, real-time data platforms, big data, and data warehousing. Previously, he led development for ConvertMedia (later acquired by Taboola). Yair began his career with the Israel Defense Forces (IDF) where he managed cybersecurity and real-time support systems for military operations.