7 Vendors at Strata/Hadoop World 2016: First Impressions
Fern Halper, vice president and senior director of TDWI research for advanced analytics, offers her perspective on announcements from seven vendors.
- By Fern Halper
- December 5, 2016
If you've ever attended Strata/Hadoop World, you know that there are a lot of vendors there. I met with many of them while I was at the conference and heard a lot about data lakes, analytics, and data science. Here's my quick rundown on seven of them (in alphabetical order).
Cambridge Semantics
Who they are: Cambridge Semantics was formed in 2007 as an outgrowth of the Web semantics group at IBM. The company's goal is to enable organizations to speed time to value for smart data discovery and exploratory analytics. Its graph-based approach helps companies and end users to combine, manage, and analyze all of their data (structured and unstructured) to deliver deep and insightful business guidance. As part of this, the company just announced the creation of a "Smart Data Lake" solution that provides data ingestion, transformation, metadata management, discovery, and data analytics. The Smart Data Lake provides a semantic overlay on top of all the data in the lake. The offering includes three dozen NLP applications; some are internal to the lake and some are add-ons.
What they announced: The company formally announced the release of its Anzo Smart Data Lake solution that can combine, manage, and analyze all of an enterprise's data regardless of format or location. The Smart Data Lake makes use of graph technology that can create facts by linking and contextualizing data. For example, a pharmaceutical company running 150 clinical trials at the same time can create relationships between compounds, entities, research materials, doctors, and clients. The solution can track lineage back to the original source documents.
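To make the idea of linked facts with lineage concrete, here is a minimal sketch in plain Python. It is not the Anzo API; the `FactGraph` class, the trial and compound identifiers, and the file names are all hypothetical and exist only for illustration. Each fact is stored as an edge that remembers the document it came from.

```python
class FactGraph:
    """Tiny in-memory graph of facts; every fact keeps its source document."""

    def __init__(self):
        # subject -> list of (predicate, object, source_document)
        self._edges = {}

    def add_fact(self, subject, predicate, obj, source):
        self._edges.setdefault(subject, []).append((predicate, obj, source))

    def related(self, subject, predicate):
        """Return the objects linked to `subject` by `predicate`."""
        return [o for p, o, _ in self._edges.get(subject, []) if p == predicate]

    def subjects_linked_to(self, predicate, obj):
        """Return every subject that has a `predicate` edge to `obj`."""
        return sorted(
            s for s, edges in self._edges.items()
            if any(p == predicate and o == obj for p, o, _ in edges)
        )

    def lineage(self, subject, obj):
        """Return the source documents asserting any link subject -> obj."""
        return sorted({src for _, o, src in self._edges.get(subject, []) if o == obj})


# Hypothetical facts extracted from hypothetical documents.
graph = FactGraph()
graph.add_fact("trial-017", "tests_compound", "compound-A", "protocol_017.pdf")
graph.add_fact("trial-017", "led_by", "Dr. Rivera", "site_roster.xlsx")
graph.add_fact("trial-042", "tests_compound", "compound-A", "protocol_042.pdf")

trials = graph.subjects_linked_to("tests_compound", "compound-A")
sources = graph.lineage("trial-017", "compound-A")
```

Here `trials` lists both trials that share a compound, and `sources` traces one of those links back to the document that asserted it. A production graph store would add typed entities, a query language, and persistence.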
Why it matters: At TDWI, we're seeing considerable interest in data lakes -- where data is ingested in its raw, original state, straight from data sources, with little or no cleansing or transformation. Most data lakes are built on top of Hadoop, which means that a wide range of disparate data types can be ingested -- including unstructured data. What good is populating a data lake with unstructured data if you can't use it as data? It is important to be able to gain insight from this data in a useful way. Semantic technology can help here to extract entities, concepts, facts, and sentiments. Additionally, the graph technology illustrates the relationships between entities in a usable format.
Dataiku
Who they are: Paris-based Dataiku was founded three years ago with the mission to take raw data and turn it into predictions. The software -- Dataiku Data Science Studio -- connects to multiple data sources and installs on the desktop, server, or in the cloud. It provides a collaborative platform with a team work space that can be used by organizations to build flows to collect data, transform it, and analyze it. Currently the company connects to 25 different data sources including Hadoop, NoSQL databases, and big data appliances from the major vendors. The company supports five different machine-learning libraries (including scikit-learn, MLlib, and XGBoost) and connects via APIs to products such as H2O.
The company offers a free version on its website.
What they announced: The company announced a $14M Series A funding round led by FirstMark Capital, which it plans to use to accelerate global growth by nearly doubling its staff in the coming year at its main offices in Paris and New York. The company now has 100 customers, mostly large enterprises, and most are based in Europe. However, the company is expanding in the U.S. and has opened a New York City office.
Why it matters: Organizations want easy-to-use tools for advanced analytics such as machine learning, and Dataiku provides that for business analysts in addition to providing a coding environment for data scientists. This role or "persona"-driven approach is one that we're seeing more frequently at TDWI and speaks to the diversity of people in an organization who want to make use of data science. What is interesting about Dataiku is that it provides a collaboration space as part of its solution. This is important because as more business analysts try to make use of more advanced technologies, they will need to have the support of the data scientists that are part of their team.
Informatica
Who they are: Informatica provides end-to-end data management solutions in the cloud, on premises, or in a hybrid environment. The company is 100 percent focused on data and its platform includes data integration, data quality, data governance, master data management, data archiving, and data security. With purpose-built interfaces for multiple kinds of users (business analysts, data stewards, IT data engineers, etc.), Informatica's solutions include pre-built connectors, pre-built parsers, pre-built transformations along with development and runtime optimizations for data integration. It utilizes machine-learning-based data, domain, and schema discovery along with a metadata-driven enterprise information catalog and real-time data mastering to help organizations discover data relationships.
What they announced: Informatica said it is expanding its data lake management solution (called Intelligent Data Lake) for Hadoop to help organizations easily find, prepare, share, master, govern, and protect data in an integrated platform supported both on-premises and in the cloud. New platform components include:
- Informatica Enterprise Information Catalog helps users discover and understand enterprise data
- Informatica Intelligent Streaming helps enterprises process real-time events and big data streams
- Informatica Blaze increases processing performance using data pipeline optimization with job partitioning and recovery
- Informatica Secure@Source provides proactive monitoring and alerting functionality for the proliferation of sensitive data
- Informatica Big Data Management supports deployment in Microsoft Azure HDInsight
Why it matters: TDWI is seeing increased interest in data lakes to manage large volumes of raw data containing highly diverse data types for analysis. That said, enterprises must be able to manage that data to quickly and repeatably get value from it, as well as to make sure that users with the appropriate roles can work with it. Informatica has a multi-year history of providing data management solutions so databases and data warehouses don't become data swamps. With increasing quantity and complexity of big data, there is an even greater need for data management on Hadoop data lake platforms.
InfoTrellis
Who they are: Canada-based InfoTrellis, formed in 2007, focuses on helping MDM customers design and implement their MDM solutions. By 2012, the company had introduced InfoTrellis AllSight to the market, a product that allows organizations to efficiently manage and analyze customer-related big data. The company's focus is on customer intelligence management (CIM) -- i.e., a pre-built system to help to manage and analyze structured and unstructured customer data and provide an intelligent 360-degree view of customers to organizations.
The software is based on open source and runs on Hadoop using MapReduce as well as Spark. InfoTrellis partners with analytics companies for specialized, deep analytics on this data.
What they announced: AllSight 4.6 includes three major enhancements:
1. Graph-based visualization. This feature helps data scientists find, use, and analyze customer data more efficiently. Users can search for customers and find relationships to other data (products, concepts, events, etc.). Complex graph networks of billions of entities and hierarchies of 10,000+ entities can be summarized and visualized in a single customer 360 dashboard.
2. Spark as a compute engine. InfoTrellis announced support for Spark for compute-intensive workloads.
3. AllSight support for salesforce.com. AllSight shares intelligent customer data with business applications via service APIs; version 4.6 contains an enhanced integration with salesforce.com.
Why it matters: Organizations want to understand their customers and get a complete view of their customer activity. Yet, customer data evolves. What is interesting about InfoTrellis is that it is using machine learning to synthesize customer data sources, as well as a graph-based approach to understanding customers, products, and accounts. AllSight manages complex relationship structures with advanced graph database technology and its engine finds and relates data attributes related to that customer. In other words, it provides a graph network view of relationships between that customer and any piece of relevant data. AllSight is also built to learn from new sources of data.
At Strata, InfoTrellis stressed the use of AllSight 4.6 with data lakes; however, it can also provide a customer 360-view to other systems such as sales and marketing platforms and augment master data management for a consolidated customer 360-view.
Pentaho, A Hitachi Group Company
Who they are: Pentaho, a Hitachi Group company, provides a data integration and business analytics platform that incorporates data preparation and integration for analytics processes, as well as a suite of reports, visualizations, and dashboards for business users. These analytics can be embedded into existing applications for increased adoption. Pentaho focuses on data preparation across a variety of traditional and emerging data sources and formats, as part of Pentaho Data Integration (PDI) which is core to the Pentaho platform. Pentaho supports Hadoop and other big data stores, enabling teams to visually create transformation processes that can run in these environments.
Additionally, given its open source heritage, Pentaho integrates with more advanced analytics tools such as R, Weka, and Python.
What they announced: Pentaho made a series of big-data-related announcements, including:
- Enhanced Spark integration: Pentaho wants to make Spark more accessible so existing IT resources can manage and coordinate Spark applications. It announced SQL on Spark so that analysts can write SQL to pull Spark data. It also announced drag-and-drop Spark libraries (such as the machine-learning library) that can be part of PDI -- the data pipeline. There are also new Spark orchestration capabilities.
- Enhanced Hadoop security: Pentaho said it has expanded its Hadoop data security integration to promote better big data governance. This included enhanced Kerberos compatibility for secure multi-user authentication and an Apache Sentry capability to enforce rules that control access to specific Hadoop data assets. In other words, it is providing enhanced granularity around Kerberos access control.
- Metadata injection: Pentaho announced expanded metadata injection to ingest more data sources at scale. This enables data engineers to dynamically generate PDI transformations at runtime; i.e., to create one transformation process that has intelligence to interrogate each file individually and obtain the metadata to then drive it into a conformed data set on Hadoop.
- Support for Kafka, Avro, and Parquet: Pentaho announced optional support for additional PDI steps to further integrate with the broader big data ecosystem. This includes support for Kafka, a publish/subscribe messaging queue for big data and IoT use cases. In addition, PDI offers support for Avro and Parquet output files, two common formats for storing data in Hadoop in big data onboarding use cases.
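The metadata injection idea in the list above (one transformation that interrogates each file and uses what it finds to produce a conformed data set) can be sketched in plain Python. This is an illustration only, not PDI's API: the target schema, the synonym table, and the sample files are hypothetical.

```python
import csv
import io

# Hypothetical target schema for the conformed data set.
TARGET_FIELDS = ["customer_id", "amount"]

# Hypothetical synonyms used to map source column names onto target fields.
SYNONYMS = {
    "customer_id": {"customer_id", "cust_id", "customerid"},
    "amount": {"amount", "amt", "total"},
}


def discover_mapping(header):
    """Inspect one file's header and return {target_field: source_column}."""
    mapping = {}
    for column in header:
        key = column.strip().lower()
        for target, names in SYNONYMS.items():
            if key in names:
                mapping[target] = column
    missing = [f for f in TARGET_FIELDS if f not in mapping]
    if missing:
        raise ValueError(f"no source column for: {missing}")
    return mapping


def conform(text):
    """Run the single template transformation, driven by discovered metadata."""
    reader = csv.DictReader(io.StringIO(text))
    mapping = discover_mapping(reader.fieldnames)
    return [{t: row[mapping[t]] for t in TARGET_FIELDS} for row in reader]


# Two hypothetical source files with different column names and order.
file_a = "cust_id,amt\n1,9.50\n2,3.00\n"
file_b = "Total,CustomerID\n7.25,3\n"
conformed = conform(file_a) + conform(file_b)
```

The same `conform` function handles both layouts because the column mapping is discovered at runtime rather than hard-coded per file, which is the point of the technique.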
Why it matters: Pentaho's goal with these new product announcements is to accelerate time to value for big data and big data analytics as well as to help organizations using open source (more often a path to big data and data science) to feel more comfortable doing so.
We're definitely seeing increased interest in Spark and other open source technologies at TDWI. For instance, in a recent TDWI survey, approximately 30 percent of respondents were already using Hadoop on premises (with additional respondents using it in the cloud). Data security is especially important in any kind of data management platform and we're seeing more vendors enhance their security of open source platforms. Likewise, the attention on Spark is important. Spark is an open source in-memory processing engine that is known for its speed. It also has a sophisticated analytics library and supports streaming. In a recent TDWI survey, close to a quarter of respondents rated Spark as a top technology for data science in the coming year. It makes sense that Pentaho would look to support Spark as part of its analytics pipeline.
Sisense
Who they are: Sisense's solution helps enterprises prepare, analyze, and visualize big or disparate datasets using a single technology stack. The company's mission is to simplify BI against complex data for business users and allow BI to be deployed in days rather than weeks. It supports dozens of data sources and runs on commodity hardware using Sisense's patented in-chip analytics, an alternative to in-memory technology designed to maximize disk, memory, and CPU. This helps business users analyze and visualize big data quickly.
What they announced: Sisense talked about Sisense Everywhere, which is now in private beta. The idea behind Sisense Everywhere is to make BI an ambient thing that tells you what you need to know when you need to know it. There are several components to Sisense Everywhere:
- An NLP interface to enable users to ask BI questions in a natural way. This currently integrates with Amazon Echo.
- Visual cues: connected lamps that use color rather than dashboards to convey BI insights. For example, the light bulb will change color based on BI results (green is good, for instance).
- Sisense bots provide two-way dialogue through instant-message conversations about data.
Why it matters: Sisense provides tools to end-user organizations and offers its tools via OEM licenses. Its goal with Sisense Everywhere is to make self-service more self-serving by providing an easy-to-use platform as well as new devices and interfaces for humans to interact with data in a more natural way so BI becomes more pervasive, consumable, and human.
Zoomdata
Who they are: Zoomdata provides fast visualizations for big data using its patented Data Sharpening technology -- the visualization progressively renders and becomes clearer as more data is processed from a query. This allows business users to visually consume and interact with data in seconds, even across billions of rows of data. Zoomdata can support a multitude of data types, including real-time streaming data and text/search data, with native connectors to multiple big data sources such as Hadoop, MPP databases, NoSQL databases, as well as traditional SQL databases.
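The general idea of a result that becomes clearer as more data is processed can be illustrated with a running aggregate. The sketch below is a generic example of progressive refinement, not Zoomdata's patented Data Sharpening method; the function name and sample data are hypothetical.

```python
def progressive_mean(values, chunk_size):
    """Yield (rows_seen, running_mean) after each chunk is processed."""
    total = 0.0
    seen = 0
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        total += sum(chunk)
        seen += len(chunk)
        # An early, approximate answer is available after every chunk.
        yield seen, total / seen


data = [10, 20, 30, 40, 50, 60]
estimates = list(progressive_mean(data, chunk_size=2))
```

A chart driven by `estimates` could be redrawn after each chunk, so the user sees a rough answer immediately and the exact one once all rows have been read.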
What they announced: The company now has a partnership with Teradata and supports the Teradata Unified Data Architecture (UDA). In addition, Zoomdata announced support for Google BigQuery. The company also announced a joint solution with Cloudera for customer insight. The Customer Insights Solution combines the visual analytics of Zoomdata with Cloudera Enterprise to provide a high-performance solution for businesses to better understand their customers. It is also partnering with Amazon to provide free credits towards a trial in the Amazon cloud.
Why it matters: TDWI research finds that the vast majority of organizations that participate in our surveys are performing some kind of visual data discovery. Likewise, organizations are collecting ever-increasing amounts of data and are modernizing their data warehouse to include new platforms such as Hadoop or data appliances. They are also expanding into the cloud. These organizations want to be able to perform data discovery and visual analytics on this big data, which is increasingly disparate in nature, and they want to do it at scale and with high performance.