TDWI Blog

Data Management Blog Posts

See the most recent Data Management related items below.


The Role of Centralization and Self-Service in a Successful Data Hub

A hub should centralize governance, standards, and other data controls, plus provide self-service data access and data prep for a wide range of user types.

By Philip Russom, Senior Research Director for Data Management, TDWI

I recently spoke in a webinar run by Informatica Corporation, sharing the stage with Informatica’s Scott Hedrick and Ron van Bruchem, a business architect at Rabobank. The three of us had an interactive conversation about the technology and business requirements of data hubs, as faced today by data management professionals and the organizations they serve. There’s a lot to say about data hubs, but we focused on the roles played by centralization and self-service, because these are two of the most pressing requirements. Please allow me to summarize my portion of the webinar.

A data hub is a data platform that serves as a distribution hub.

Data comes into a central hub, where it is collected and repurposed. Data is then distributed out to users, applications, business units, and so on.

The feature sets of data hubs vary. Home-grown hubs tend to be feature poor, because there are limits to what the average user organization can build themselves. By comparison, vendor-built data hubs are more feature rich, scalable, and modern.

A true data hub provides many useful functions. Two of the highest priority functions are:

  • Centralized control of data access for compliance, governance, security
  • Self-service access to data for user autonomy and productivity

A comprehensive data hub integrates with tools that provide many data management functions, especially those for data integration, data quality, technical and business metadata, and so on. The hallmark of a high-end hub is the publish-and-subscribe workflow, which certifies incoming data and automates broad but controlled outbound data use.

A data hub provides architecture for data and its management.

A quality data hub will assume a hub-and-spoke architecture, but be flexible so users can customize the architecture to match their current data realities and future plans. Hub-and-spoke is the preferred architecture for integration technologies (for both data management and applications), because it falls into obvious, predictable patterns that are easy to learn, design, optimize, and maintain. Furthermore, a hub-and-spoke architecture greatly reduces the number of interfaces deployed, as compared to a point-to-point approach, which in turn reduces complexity for greater ease of use and maintainability.
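To make the interface-count claim concrete, here is a minimal back-of-the-envelope sketch in Python; the system counts are hypothetical, and the only point is the difference in growth rates.

```python
# Back-of-the-envelope comparison of integration interfaces (illustrative only).
# Point-to-point may need an interface for every pair of systems;
# hub-and-spoke needs only one spoke per system, connecting it to the hub.

def point_to_point_interfaces(n_systems: int) -> int:
    # One interface per unordered pair of systems: n * (n - 1) / 2
    return n_systems * (n_systems - 1) // 2

def hub_and_spoke_interfaces(n_systems: int) -> int:
    # One spoke per system
    return n_systems

for n in (5, 10, 25, 50):  # hypothetical numbers of applications and data stores
    print(f"{n:>2} systems: point-to-point = {point_to_point_interfaces(n):>4}, "
          f"hub-and-spoke = {hub_and_spoke_interfaces(n):>2}")
```

Even at a modest 50 systems, the point-to-point “hairball” implies up to 1,225 interfaces, while the hub stays at 50 spokes.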

A data hub centralizes control functions for data management.

When a data hub follows a hub-and-spoke architecture, it provides a single point of integration that fosters technical standards for data structures, data architecture, data management solutions, and multi-department data sharing. That single point also simplifies important business control functions, such as governance, compliance, and collaboration around data. Hence, a true data hub centralizes and facilitates multiple forms of control, for both the data itself and its usage.

A data hub enables self-service for controlled data access.

Self-service is very important, because it’s what your “internal customers” want most from a data hub. (Many technical users benefit from self-service, too.) Self-service has many manifestations and benefits:

  • Self-service access to data makes users autonomous, because they needn’t wait for IT or the data management team to prepare data for them.
  • Self-service creation of datasets makes users productive.
  • Self-service data exploration enables a wide range of user types to study data from new sources and discover new facts about the business.

These kinds of self-service are enabled by an emerging piece of functionality called data prep, which is short for data preparation and is sometimes called data wrangling or data munging. Instead of overwhelming mildly technical or non-technical users with the richness of data integration functionality, data prep boils it down to a key subset of functions. Data prep’s simplicity and ease of use yield speed and agility. It empowers data analysts, data scientists, DM developers, and some business users to construct a dataset with spontaneity and speed. With data prep, users can quickly create a prototype dataset, improve it iteratively, and publish it or push it into production.
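As a rough illustration of the prototype-and-iterate style of data prep described above, here is a small Python sketch using pandas; the file names, column names, and rules are hypothetical rather than any product’s workflow.

```python
# Illustrative data-prep sketch (hypothetical file and column names).
import pandas as pd

# 1. Quickly load a raw extract and take a first look.
orders = pd.read_csv("raw_orders.csv")            # hypothetical source extract
print(orders.head())                              # eyeball the data before shaping it

# 2. Iteratively shape it into an analysis-ready dataset.
prepped = (
    orders
    .rename(columns={"cust_id": "customer_id"})   # standardize a column name
    .dropna(subset=["customer_id", "amount"])     # drop rows missing key fields
    .assign(amount=lambda df: df["amount"].astype(float))
    .query("amount > 0")                          # keep only valid order amounts
)

# 3. Publish the prototype dataset, or push it toward production.
prepped.to_csv("orders_prepped.csv", index=False)
```

The point is the working style: load, inspect, reshape, and republish in minutes, then refine the same script as requirements become clearer.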

Hence, data prep and self-service work together to make modern use cases possible, such as data exploration, discovery, visualization, and analytics. Data prep and self-service are also inherently agile and lean, thus promoting productive development and nimble business.

A quality hub supports publish and subscribe methods.

Centralization and self-service come together in one of the most important functions found in a true data hub, namely publish-and-subscribe (or simply pub/sub). This type of function is sometimes called a data workflow or data orchestration.

Here’s how pub/sub works: Data entering the hub is certified and cataloged on the way in, so that the data is in canonical form, of high quality, and audited, ready for repurposing and reuse. The catalog and its user-friendly business metadata then make it easy for users and applications to subscribe to specific datasets and generic categories of data. That way, users get quality data they can trust, but within the governance parameters of centralized control.
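To ground that description, here is a highly simplified Python sketch of a hub that certifies inbound records before cataloging them and then notifies subscribers; the certification rule is a placeholder, and this is a conceptual illustration rather than any vendor’s implementation.

```python
# Minimal conceptual sketch of publish-and-subscribe in a data hub.
# Real hubs add richer certification rules, metadata, security, and persistence.
from collections import defaultdict
from typing import Callable, Dict

class DataHub:
    def __init__(self):
        self.catalog: Dict[str, list] = {}            # certified datasets, by topic
        self.subscribers = defaultdict(list)          # topic -> subscriber callbacks

    def publish(self, topic: str, records: list) -> None:
        certified = [r for r in records if self._certify(r)]  # only certified data enters the hub
        self.catalog[topic] = certified                        # catalog it for later subscribers
        for callback in self.subscribers[topic]:               # push to current subscribers
            callback(certified)

    def subscribe(self, topic: str, callback: Callable[[list], None]) -> None:
        self.subscribers[topic].append(callback)
        if topic in self.catalog:                              # deliver already-cataloged data
            callback(self.catalog[topic])

    @staticmethod
    def _certify(record: dict) -> bool:
        # Placeholder certification rule: require an id and a non-empty name.
        return bool(record.get("id")) and bool(record.get("name"))

# Usage: a downstream application subscribes to certified customer data.
hub = DataHub()
hub.subscribe("customers", lambda recs: print(f"received {len(recs)} certified customer records"))
hub.publish("customers", [{"id": 1, "name": "Acme"}, {"id": 2, "name": ""}])  # second record fails certification
```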

Summary and Recommendations.

  • Establish a data architecture and stick with it. Rely on a data hub based around a hub-and-spoke architecture, not point-to-point hairballs.
  • Adopt a data hub for the business benefits. At the top of the list would be self-service for data access, data exploration, and diverse analytics, followed by centralized functions for data governance and stewardship.
  • Deploy a data hub for technical advancement. A hub can organize and modernize your infrastructure for data integration and data management, as well as centralize technical standards for data and development.
  • Consider a vendor-built data hub. Home-grown hubs tend to be feature-poor compared to vendor-built ones. When it comes to data hubs, buy it, don’t build it.
  • Demand the important, differentiating functions, especially those you can’t build yourself. This includes pub/sub, self-service data access, data prep, business metadata, and data certification.
  • A modern data hub potentially has many features and functions. Choose and use the ones that fit your requirements today, then grow into others over time.

If you’d like to hear more of my discussion with Informatica’s Scott Hedrick and Rabobank’s Ron van Bruchem, please click here to replay the Informatica Webinar.

Posted on July 12, 2016


Comprehensive and Agile End-to-End Data Management

The trend toward integrated platforms of multiple tools and functions enables broader designs and practices that satisfy new requirements.

By Philip Russom, Senior Research Director for Data Management, TDWI

Earlier this week, I spoke in a webinar run by Informatica Corporation and moderated by Informatica’s Roger Nolan. I talked about trends in user practices and vendor tools that are leading us toward what I call end-to-end (E2E) data management (DM). My talk was based on three assumptions:

  1. Data is diversifying into many structures from new and diverse sources.
  2. Business wants to diversify analytics and other data-driven practices.
  3. End-to-end data management can cope with the diversification of data, analytics, and business requirements in a comprehensive and agile manner.

In our webinar, we answered a number of questions pertinent to comprehensive and agile end-to-end data management. Allow me to summarize some of the answers for you:

What is end-to-end (E2E) data management (DM)?

End-to-end data management is one way to adapt to data’s new requirements. In this context, “end-to-end” has multiple meanings:

End-to-end DM functions. Today’s diverse data needs diverse functions for data integration, quality, profiling, event processing, replication, data sync, MDM, and more.

End-to-end tool platform. Diverse DM functions (and their user best practices) must be enabled by a portfolio of many tools, which are unified in a single integrated platform.

End-to-end agility. With a rich set of DM functions in one integrated toolset, developers can very quickly on-board data, profile it, and iteratively prototype, in the spirit of today’s agile methods.

End-to-end DM solutions. With multiple tools integrated in one platform, users can design single solutions that bring to bear multiple DM disciplines.

End-to-end range of use cases. With a feature-rich tool platform and equally diverse user skills, organizations can build solutions for diverse use cases, including data warehousing, analytics, data migrations, and data sync across applications.

End-to-end data governance. When all or most DM functions flow through one platform, governance, stewardship, compliance, and data standards are greatly simplified.

End-to-end enterprise scope. End-to-end DM draws a big picture that enables the design and maintenance of enterprise-scope data architecture and DM infrastructure.

What is the point of E2E DM?

End-to-end (E2E) data management (DM) is all about being comprehensive and agile:

  • Comprehensive -- All data management functions are integrated for development and deployment, with extras for diverse data structures and business-to-DM collaboration.
  • Agile -- Developers can very quickly on-board diverse data and profile it, and business and technical people can iteratively prototype and collaborate, in today’s agile spirit.

What’s an integrated tool platform? What’s it for?

An integrated platform supports many DM tool types, but with tight integration across them. The end-to-end functionality seen in an integrated DM platform typically has a data integration and/or data quality tool at its core, with additional tools for master data management, metadata management, stewardship, changed data capture, replication, event processing, data exchange, data profiling, and so on.

An integrated platform supports modern DM architectures. For example, the old way of architecting a DM solution is to create a plague of small jobs, then integrate and deploy them via scheduling. The new way (which requires an integrated toolset) architects fewer but more complex solutions, where a single data flow calls many different tools and DM functions in a controlled and feature-rich fashion.
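One way to picture a single data flow that calls many DM functions is a composed pipeline, as in the hedged Python sketch below; the step functions are simplified stand-ins for capabilities such as profiling, cleansing, and masking, not real tool APIs.

```python
# Conceptual sketch: one data flow composed of multiple DM functions,
# instead of many small, separately scheduled jobs. All step logic is illustrative.

def profile(records):
    # e.g., count missing values per field so developers see problems early
    null_counts = {field: sum(1 for r in records if not r.get(field)) for field in ("id", "email")}
    print("profile:", null_counts)
    return records

def cleanse(records):
    # e.g., standardize email case and drop records with no id
    return [dict(r, email=(r.get("email") or "").lower()) for r in records if r.get("id")]

def mask(records):
    # e.g., mask emails before data leaves the governed platform
    return [dict(r, email="***@" + r["email"].split("@")[-1] if "@" in r["email"] else "***")
            for r in records]

def load(records):
    print("loading", len(records), "records to the target")
    return records

def run_flow(records, steps):
    # A single orchestrated flow: each step hands its output to the next.
    for step in steps:
        records = step(records)
    return records

run_flow(
    [{"id": 1, "email": "Ann@Example.com"}, {"id": None, "email": "bob@example.com"}],
    steps=[profile, cleanse, mask, load],
)
```

Because the whole flow runs as one unit, monitoring, governance, and reuse wrap a single artifact instead of dozens of independently scheduled jobs.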

An integrated tool platform supports many, diverse use cases. Furthermore, the multiple integrated tools of the end-to-end platform support the agile reuse of people, skills, and development artifacts across use cases. Important use cases include: data warehousing, analytics, application modernization, data migration, complete customer views, right-time data, and real-time data warehousing.

How does an integrated toolset empower agile methods?

Support for multiple data disciplines in one integrated toolset means that developers can design one data flow (instead of dozens of jobs) that includes operations for integration, quality, master data, federation, and more.

The reuse of development artifacts is far more likely with one integrated toolset than working with tools from multiple vendors.

Daily collaboration between a business subject-matter expert and a technical developer is the hallmark of agile development; an integrated DM platform supports this.

Feature-rich metadata management propels collaboration between a business person (acting as a data steward) and a data management professional, and it also underpins self-service access to data.

Self-service data access and data prep presented in a visual environment (as seen in mature integrated toolsets) can likewise propel the early prototyping and iterative development assumed of agile methods.

Automated testing and data validation can accelerate development. Manual testing distracts from the true mission, which is to build custom DM solutions that support the business.

Develop once, deploy at any latency. Reuse development artifacts, but deploy them at the speed required by specific business processes, whether batch, trickle feed, or real time.

Reinventing the wheel bogs down development. Mature integrated toolsets include rich libraries of pre-built interfaces, mappings, and templates that plug and play to boost developer productivity and agility.

What’s the role of self-service in agile development methods?

Self-service data access for business users. For example, think of a business person who also serves as a data steward and therefore needs to browse data. Or consider a business analyst who is capable of ad hoc queries, when given the right tools.

Data prep for business users, analytics, and agility. Users want to work fast and independently – at the speed of thought – without need for time-consuming data management development. To enable this new best practice, the tools and platforms that support self-service data access now also support data prep, which is a form of data integration, but trimmed down for reasons of agility, usability, and performance.

Self-service and data prep for technical users. For example, self-service data exploration can be a prelude to the detailed data profiling of new data. As another example, the modern, agile approach to requirements gathering involves a business person (perhaps a steward) and a data professional, working side-by-side to explore data and decide how best to get business value from the data.

What’s the role of metadata in self-service and agile functionality?

We need complete, trusted metadata to accomplish anything in DM, and DM is not agile when development time is burned up creating metadata. Hence, a comprehensive E2E DM platform must support multiple forms of metadata (a brief illustrative sketch follows the list below):

  • Technical metadata – documents properties of data for integrity purposes. Required for computerized processes and their interfaces.
  • Business metadata – describes data in ways business people understand. Absolutely required for self-service data access, team collaboration, and development agility.
  • Operational metadata – records access by users and apps. Provides an audit trail for assuring compliance, privacy, security, and governance relative to data.
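As a rough illustration of how the three kinds of metadata might sit side by side for a single dataset, here is a small, hedged Python sketch; the dataset, fields, and names are hypothetical and do not reflect any product’s catalog schema.

```python
# Hypothetical catalog entry showing technical, business, and operational metadata
# side by side for one dataset. Field names are illustrative only.
customer_orders_metadata = {
    "technical": {                        # properties a computerized process needs
        "location": "warehouse.sales.customer_orders",
        "columns": {"order_id": "BIGINT", "customer_id": "BIGINT", "amount": "DECIMAL(12,2)"},
        "primary_key": ["order_id"],
    },
    "business": {                         # descriptions a business person understands
        "friendly_name": "Customer Orders",
        "description": "One row per order placed by a customer, net of cancellations.",
        "steward": "Jane Doe (Sales Operations)",   # hypothetical steward
    },
    "operational": {                      # audit trail of access and processing
        "last_loaded": "2016-06-29T02:15:00Z",
        "recent_access": [
            {"user": "analytics_app", "action": "read", "at": "2016-06-29T08:00:00Z"},
        ],
    },
}

# A self-service tool might surface only the business metadata to end users:
print(customer_orders_metadata["business"]["description"])
```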

If you’d like to hear more, please click here to replay the Informatica Webinar.

Posted on June 30, 2016


Highlights from Informatica World 2016

Bigger than ever, with more user speakers and an impressive executive vision for product R&D

By Philip Russom, Senior Research Director for Data Management, TDWI

I just spent three days attending and speaking at Informatica World 2016 in San Francisco’s Moscone Center. Compared to previous years, this year’s event was bigger than ever, with over three thousand people in attendance and five or more simultaneous break-out tracks.

The change this year that I like most is the increased number of user case study speakers – almost double last year! To be honest, that’s my favorite part of any event, although I also like hearing executives explain their product vision and direction. With that in mind, allow me to share some highlights in those two areas, based on sessions I was able to attend at Informatica World 2016.

User Case Studies

I had the honor of sharing the stage with data integration veteran Tom Kato of Republic Services. Based on my research at TDWI, I talked about users’ trends toward integrated platforms that include tools for many data disciplines from a single vendor, as opposed to siloed tools from multiple vendors. Tom talked about how an integrated tool strategy has played out successfully for his team at Republic Services. By adopting a comprehensive end-to-end toolset from Informatica, it was easier for them to design a comprehensive data architecture, with information lifecycle management that extends from data creation to purge.

I heard great tips by a speaker from Siemens about how their data lake is successful due to policies governing who can put data in the lake, what kind of data is allowed, and how the data is tagged and cataloged. “We saved six to twelve months by using simple flat schema in the data lake,” he said. “Eventually, we’ll add virtual dimensional models to some parts of the data lake to make it more like a data warehouse.”

A speaker from Harvard Business Publishing described a three-year migration and consolidation project, where they moved dozens of applications and datasets to clouds, both on premises and off-site (including AWS). They feel that Informatica Cloud and PowerCenter helped them move to clouds very quickly, which reduced the time that old and new systems ran concurrently with synchronization, which in turn reduced the costs and risks of migration.

Red Hat’s data warehouse architect explained his strategy for data warehouse modernization, based on modern data platforms, hybrid mixtures of clouds, complete views of customers, virtual technologies, and agile methods. Among those, clouds are the secret sauce – including Informatica Cloud, AWS, Redshift, and EC2 – because they provide the elasticity and performance Red Hat needs for the variety of analytic, reporting, and virtual workloads they run.

A dynamic duo from Verizon’s data warehouse team laid out their methods for success with clickstream analytics. They follow Gartner’s Bimodal IT approach, where old and new systems coexist and integrate. New tools capture and process clickstreams, and these are correlated with historic data in the older data warehouse. This is enabled by a hybrid architecture that integrates a mature Teradata implementation and a new Hadoop cluster, via data integration infrastructure by Informatica.

Another dynamic duo explained why and how they use Informatica Data Integration Hub (or simply DI Hub). “As a best practice, a data integration hub should connect four key entities,” said one of the Humana reps. “Those are source applications, publications of data, people who subscribe to the data, and a catalog of topics represented in the data.” Humana chose Informatica DI Hub because it suits their intended best practice, plus it supports additional requirements for a data fabric, virtual views, canonic model, data audit, and self service.

Executive Vision for Product R&D

The general sessions mostly featured keynote addresses by executives from Informatica and leading partner firms. For example, Informatica’s CEO Anil Chakravarthy discussed how Informatica technology is supporting Data 3.0, an emerging shift in data’s sources, types, technical management, and business use.

All the executive speakers were good, but I got the most out of the talk by Amit Walia, Informatica’s Chief Product Officer. It was like drinking from the proverbial fire hose. Walia announced one new product, release, or capability after the next, including new releases of Informatica Cloud, Big Data Management, Data Integration Hub, and Master Data Management (with a cloud edition). Platform realignments are seen in Informatica Intelligent Data Platform (with Hadoop as a compute engine, controlled by a new Smart Executor) and Informatica Intelligent Streaming (based on Hadoop, Spark, Kafka, and Blaze); these reveal a deep commitment to modern open source software (OSS) in Informatica’s tool development strategy. One of Walia’s biggest announcements was the new Live Data Map, which will provide a large-scale framework for complex, multi-platform data integration, as is increasingly the case with modern data ecosystems.

That’s just a sample of what Amit Walia rolled out, and yet it’s a tsunami of new products and releases. So, what’s up with that? Well, to me it means that the acquisition of Informatica last year (which made it a private company) gave Informatica back the mojo that made it famous, namely a zeal and deep financial commitment to product research and development (R&D). Informatica already has a broad and comprehensive integrated platform, which addresses just about anything you’d do in traditional data management. But, with the old mojo for R&D back, I think we’ll soon see that portfolio broaden and deepen to address new requirements around big data, machine data, analytics, IoT, cloud, mobile, social media, hubs, open source, and security.

Informatica customers have always been the sort to keep growing into more data disciplines, more data types and sources, and the business value supported by those. In the near future, those users will have even more options and possibilities to grow into.

Further Learning

To get a feel for Informatica World 2016, start with a one-minute overview video.

However, I strongly recommend that you “drink from the fire hose” by hearing Amit Walia’s 40-minute keynote, which includes his amazing catalog of new products and releases.

You might also go to www.YouTube.com and search for “Informatica World 2016,” where you’ll find many useful speeches and sessions that you can replay. For something uplifting, search for Jessica Jackley’s keynote about micro loans in the third world.

Posted on May 31, 2016


Modernizing Business-to-Business Data Exchange

Keep pace with evolving data and data management technologies, plus the evolving ecosystem of firms with whom you do business.

By Philip Russom, TDWI Research Director for Data Management

Earlier this week, I spoke in a webinar run by Informatica Corporation, along with Informatica’s Daniel Rezac and Alan Lundberg. Dan, Alan, and I talked about trends and directions in a very interesting data management discipline, namely business-to-business (B2B) data exchange (DE). Like all data management disciplines, B2B DE is modernizing to keep pace with evolving data types, data platforms, and data management practices, as well as evolving ways that businesses leverage exchanged data to onboard new partners and clients, build up accounts, improve operational efficiency, and analyze supply quality, partner profitability, procurement costs, and so on.

In our webinar, we answered a number of questions pertinent to the modernization of B2B DE. Allow me to summarize those for you:

What is business-to-business (B2B) data exchange (DE)?

It is the exchange of data among operational processes and their applications, whether in one enterprise or across multiple ones. A common example would be a manufacturing firm and the ecosystem of supplier and distributor companies around it. In such examples, many enterprises are involved. However, large firms with multiple, independent business units often practice B2B DE as part of their inter-unit communications within a single enterprise. Hence, B2B DE scales up to global partner ecosystems, but it also scales down to multiple business units of the same enterprise.

B2B DE integrates data across two or more businesses, whether internal or external. But it also integrates an ecosystem of organizations as it integrates data. Therefore, B2B DE is a kind of multi-organizational collaboration. And the collaboration is enabled by the transfer of datasets, documents, and files that are high quality, trusted, and standardized. Hence, there’s more than data flowing through B2B data exchange infrastructure. Your business flows through it, as well.

What are common industries and use cases for B2B DE?

The business ecosystems enabled by B2B DE are often industry specific, as with a manufacturer and its suppliers. The manufacturing ecosystem becomes quite complex, when we consider that it can include several manufacturers (who may work together on complex products, like automobiles) and that many suppliers are also manufacturers. Then there are financiers, insurers, contractors, consultants, distributors, shippers, and so on. The data and documents shared via B2B DE are key to establishing these diverse business relationships, then growing and competing within the business ecosystem.

The retail ecosystem is equally complex. A retailer does daily business with wholesalers and distributors, plus may buy goods directly from manufacturers. All these partners may also work with other retailers. A solid hub for B2B DE can provide communications and integration infrastructure for all.

Other examples of modern business practices relying on B2B DE include subrogation in insurance, trade exchanges in various industries, and the electronic medical record, HL7 standards, and payer activities in healthcare.

Why is B2B DE important?

In the industries and use cases referenced above, much of the business is flowing through B2B DE; therefore users should lavish upon it ample resources and modernization. Furthermore, B2B DE involves numerous technical interfaces, but it also is a metaphorical interface to the companies with whom you need to do business.

What’s the state of B2B DE?

There are two main problems with the current state:

B2B DE is still low-tech or no-tech in many firms. It involves paper, faxes, FedEx packages, poorly structured flat files, and ancient interfaces like electronic data interchange (EDI) and file transfer protocol (FTP). These are all useful, but they should not be the primary media. Instead, a modern B2B DE solution is online and synchronous, ideally operating in real time or close to it, while handling a wide range of data and document formats. Without these modern abilities, B2B relationships are slow to onboard and inflexible over time.

B2B DE is still too siloed. Whether packaged or home-grown, applications for supply chain and procurement are usually designed to be silos, with little or no interaction with other apps. One way to modernize these apps is to deploy a fully functional data integration (DI) infrastructure that integrates data from supply chain, procurement, and related apps with other enterprise applications, whether for operations or analytics. With a DI foundation, modernized B2B DE can contribute information to other apps (for a more complete view of partners, supplies, etc.) and analytic data (for insights into B2B relationships and activities).

What’s driving users to modernize B2B DE?

Business ecosystems create different kinds of “peer pressure.” For example, if your partners and clients are modernizing, you must too, so you can keep doing business with them and grow their accounts. Likewise, if competitors in the ecosystem are modernizing, you must too, to prevent them from stealing your business. Similarly, data standards and technical platforms for communicating data and documents evolve over time. To continue to be a “player” in an ecosystem, you must modernize to keep pace with the evolution.

Cost is also an important driver. This is why many firms are scaling down their dependence on expensive EDI-based legacy applications and the value-added networks (VANs) they often require. The consensus is that systems built around XML, JSON, and other modern standards are more feature-rich, more agile, and easier to integrate with the rest of the enterprise.

Note that some time-sensitive business practices aren’t possible without B2B DE operating in near time, such as just-in-time inventory in the retail industry and outsourced material management in manufacturing. For this reason, the goal of many modernizations is to add more real-time functions to a B2B DE solution.

Self-service is a driver, too. Business people who are domain experts in supply chain, procurement, material management, manufacturing, etc. need self-service access, so they can browse orders, negotiations, shipments, build plans, and more, as represented in B2B documents and data. Those documents and datasets are infamous for data quality problems, noncompliance with standards, and other issues demanding human intervention; so domain experts need to remediate, onboard, and route them in a self-service fashion.

Why are data standards and translations key to success with B2B DE?

The way your organization models data is probably quite different from how your partners and clients do it. For this reason, B2B DE is regularly accomplished via an exchange data model and/or document type. Many of these are industry specific, as with SWIFT for financial services and HL7 for healthcare. Many are “de jure” in that they are adjudicated by a standards body, such as the American National Standards Institute (ANSI) or the International Organization for Standardization (ISO). However, it’s equally common that partners come together and design their own ad hoc standards.

With all that in mind, your platform for B2B DE should support as many de jure standards as possible, out of the box. But it must also have a development environment where you can implement ad hoc standards. In addition, translating between multiple standards can be a critical success factor; so your platform should include several pre-built translators, as well as development tools for creating ad hoc translations.
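To make the idea of a translation layer concrete, here is a hedged Python sketch that maps a hypothetical internal order record onto a partner’s agreed field names; real exchanges would use industry formats such as EDI, HL7, or SWIFT messages rather than this toy mapping.

```python
# Toy illustration of translating between two data standards.
# The internal and partner field names are hypothetical, not real industry schemas.

INTERNAL_TO_PARTNER = {
    "order_id": "PONumber",
    "sku": "ItemCode",
    "qty": "QuantityOrdered",
    "ship_date": "RequestedShipDate",
}

def translate_order(internal_order: dict) -> dict:
    """Map an internal order record onto the partner's agreed (ad hoc) standard."""
    missing = [f for f in INTERNAL_TO_PARTNER if f not in internal_order]
    if missing:
        # In practice, a domain expert would remediate rejected records via self-service.
        raise ValueError(f"order is missing required fields: {missing}")
    return {partner_field: internal_order[internal_field]
            for internal_field, partner_field in INTERNAL_TO_PARTNER.items()}

print(translate_order({"order_id": "A-1001", "sku": "WID-7", "qty": 250, "ship_date": "2016-05-02"}))
```

Supporting another partner’s ad hoc standard then becomes largely a matter of adding another mapping, which is the kind of pre-built and custom translation a B2B DE platform should make easy.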

What are some best practices and critical success factors for B2B DE?

  • Business-to-business data exchange is critical to your business. So give it ample business and technical resources, and modernize it to remain competitive in your business ecosystem.
  • Remember that B2B DE is not just about you. Balance the requirements of clients, partners, competitors, and (lastly) your organization.
  • Poll the ecosystem you operate in to keep up with its changes. As partners, clients, and competitors adopt new standards and tools, consider doing the same.
  • Mix old and new B2B technologies and practices. Older low-tech and EDI-based systems will linger. But you should still build new solutions on more modern platforms and data standards. The catch is to integrate old and new, so you support all parties, regardless of the vintage of tech they require.
  • Build a business case for B2B data exchange. To get support for modernization, identify a high-value use case (e.g., enterprise integration, real time, pressure from partners and competition), and find a business sponsor who also sees the value.

If you’d like to hear more of my discussion with Informatica’s Daniel Rezac and Alan Lundberg, you can replay the Informatica Webinar.

Posted on April 29, 2016


Modernizing Data Integration and Data Warehousing with Data Hubs

As data and its management continue to evolve, users should consider a variety of modernization strategies, including data hubs.

By Philip Russom, TDWI Research Director for Data Management

This week, I spoke in a webinar run by Informatica Corporation, sharing the stage with Informatica’s Scott Hedrick. Scott and I had an interactive conversation where we discussed modernization trends and options, as faced today by data management professionals and the organizations they serve. Since data hubs are a common strategy for capturing modern data and for modernizing data integration architectures, we included a special focus on hubs in our conversation. We also drilled into how modern hubs can boost analytics as well as data integration across operational applications.

Scott and I organized the webinar around a series of questions. Please allow me to summarize the webinar by posing the questions with brief answers:

What is data management modernization?

It’s the improvement of tools, platforms, and solutions for data integration and other data management disciplines, plus the modernization of both technical and business users’ skills for working with data. Modernization is usually selective, in that it may focus on server upgrades, new datasets, new data types, or how all the aforementioned satisfy new data-driven business requirements for new analytics, complete views, and data integration across multiple operational applications.

What trends in data management drive modernization?

Just about everything in and around data management is evolving. Data itself is evolving into more massive volumes of greater structural diversity, coming from more sources than ever and generated faster and more frequently than ever. The way we capture and manage data is likewise evolving, with new data platforms (appliances, columnar databases, Hadoop, etc.) and new techniques (data exploration, discovery, prep, lakes, etc.). Businesses are evolving, too, as they seek greater business value and organizational advantage from growing and diversifying data – often through analytics.

What is the business value of modernizing data management?

A survey run by TDWI in late 2015 asked users to identify the top benefits of modernizing data. In priority order, they noted improvements in analytics, decision making (both strategic and operational), real-time reporting and analytics, operational efficiency, agile tech and nimble business, competitive advantage, new business requirements, and complete views of customers and other important business entities.

What are common challenges to modernizing data management?

The TDWI survey mentioned above uncovered the following challenges (in priority order): poor stewardship or governance, poor quality data or metadata, inadequate staffing or skills, funding or sponsorship, and the growing complexity of data management architectures.

What are the best practices for modernizing data management?

First and foremost, everyone must assure that the modernization of data management aligns with the stated goals of the organization, which in turn assures sponsorship and a return on the investment. Replace, update, or redesign one component of data management infrastructure at a time, to avoid a risky big bang project. Don’t forget to modernize your people by training them in new skills and officially supporting new competencies on your development team. Modernization may lead you to embrace best practices that are new to you. Common ones today include: agile development, light-weight data prep, right-time data movement, multiple ingestion techniques, non-traditional data, and new data platform types.

As a special case, TDWI sees various types of data hubs playing substantial roles in data management modernization, because they can support a wide range of datasets (from landing to complete views to analytics) and do so with better and easier data governance, audit trail, and collaboration. Plus, modernizing your data management infrastructure by adding a data hub is an incremental improvement, instead of a risky, disruptive rip-and-replace project.

What’s driving users toward the use of modern data hubs?

Data integration based on a data hub eliminates two of the biggest problems in data management design and development: point-to-point interfaces (which limit reuse and standards, plus are impossible to maintain or optimize) and traditional waterfall or other development methods (which take months to complete and are difficult to keep aligned with business goals).

What functions and benefits should users expect from a vendor-built data hub?

Vendor-built data hubs support advanced functions that are impossible for most user organizations to build themselves. These functions include: controlled and governable publish and subscribe methods; the orchestration of workflows and data flows across multiple systems; easy-to-use GUIs and wizards that enable self-service data access; and visibility and collaboration for both technical and business people across a range of data.

Data hubs are great for analytics. But what about data hubs for operational applications and their data?

Instead of consolidating large operational applications in a multi-month or multi-year project, some users integrate and modernize them quickly at the data level via a shared data hub, perhaps on a cloud. For organizations with multiple customer-facing applications for customer relationship management (CRM) and salesforce automation (SFA), a data hub can provide a single, trusted version of customer data, which is replicated and synchronized across all these applications. A data hub also adds functions that users of operational applications can use to extend their jobs, namely self-service data access and collaboration over operational data.

What does a truly modern data hub offer as storage options?

Almost all home-grown data hubs and most vendor-built hubs are based on one brand of relational database management system, despite the fact that data’s schema, formats, models, structures, and file types are diversifying aggressively. A modern data hub must support relational databases (because these continue to be vital for data management), but also support newer databases, file systems, and – very importantly – Hadoop.

If you’d like to hear more of my discussion with Informatica’s Scott Hedrick, please click here to replay the Informatica Webinar.

Posted on March 29, 2016


Big Themes under the Big Tent

By David Stodder, Senior Research Director for Business Intelligence, TDWI

Hard to believe, but the New Year is over a month old now and moving by fast. TDWI just finished its first Conference of the year in Las Vegas, which included the co-located Executive Summit chaired by me and my TDWI colleague, Research Director Fern Halper. The Summit was fantastic; many thanks to our great speakers, sponsors, and attendees. Other industry events focused on TDWI’s core topics are coming up, including the TDWI Solution Summit in Savannah, Strata and Hadoop World, and Gartner Business Intelligence & Analytics Summit. So, it’s time to check the condition of my shoes, luggage, and lumbar vertebrae (have to stop carrying that heavy computer bag) because they are all about to get a workout.

These events and others later in the year will no doubt highlight some of the major themes that TDWI Research is seeing as top concerns among leadership at user organizations. Here are three themes that we expect to be top of mind at conferences the rest of this year:

Theme #1: “Governed” self-service analytics and data discovery. At the Summit, several attendees and speakers observed that the pendulum in organizations could be swinging toward stronger data governance. As organizations supply users with self-service visual analytics and data discovery tools and ease constraints on data access and data blending, they are becoming increasingly concerned about data quality, management, and security. TDWI Research advises that the best approach to expanding self-service analytics and data discovery is a balanced one that includes data governance. Our research finds that this is largely IT’s responsibility, but governance is better tailored to users’ needs if the business side is closely involved, such as through establishment of a committee that includes stakeholders from business and IT. Governance and other steps organizations can take to improve their "analytics culture" will be a key topic at TDWI and other events.

Theme #2: Self-service data preparation. One of the hot trends in the industry is the technology evolution in data preparation toward self-service data blending, data wrangling, and data munging. I heard a great deal about this at Strata in 2015 and expect to again this year. Not only business users but data scientists working with Hadoop data lakes need technologies that can support easier, faster, and more standardized processes for data access, cataloging, integration, and transformation. I will be researching and writing a TDWI Best Practices Report on this topic in the first half of this year; look for the research survey to be launched at the end of February. I expect that this will be a major topic at the aforementioned events as organizations try to improve the productivity and satisfaction of business users and data scientists.

Theme #3: The maturing Hadoop ecosystem. Within the past few years, developers across the Hadoop landscape have made progress in taking what has been a disparate collection of open source projects and technologies and moving it toward a more coherent ecosystem. To be sure, most organizations still need to work with vendors’ platforms to achieve the level of integration and management they need. What will be interesting to see at TDWI's Savannah Solution Summit and at Strata and Hadoop World is how the pendulum is swinging in the Hadoop environment between the tradition of freewheeling development focused on innovation and the use of more tightly integrated systems based on frameworks, governance, and management processes.

As we move forward in 2016, I hope to see members of the TDWI and greater business intelligence and analytics community at these events. I also look forward to hearing your thoughts about how these major themes will play out during the course of this year.

Posted on February 10, 2016


Seven Recommendations for Becoming Big Data Ready

New big data sources and data types – and the need to get business value from new data – are forcing organizations to evolve their data management practices.

By Philip Russom, TDWI Research Director for Data Management

I recently participated as a core speaker in the Informatica Big Data Ready Virtual Summit, sharing a session with Amit Walia, the Chief Product Officer at Informatica Corporation. Amit and I had an interactive conversation where we discussed one of the most pressing questions in data management today, namely: How should an organization get ready to capture and leverage big data? This is an important question, because many organizations in many industries are facing big data, with its new data sources, data types, large volumes, and fast generation rates. Organizations need to modernize their data integration (DI) infrastructure, so they can capture and leverage the new data for new business insights and analytics.

Amit Walia and I boiled down this complex issue to seven recommendations, which I will now summarize:

Achieve agility and autonomy, as required of big data and analytics. The creation of data management solutions must keep up with the pace of business by adopting agile and lean development methods. New tool functions that assist with agility and autonomy include those for data exploration and profiling, self-service data access, and rapid dataset prototyping (or “data prep”).

Govern big data, as you would any enterprise data asset. Big data has a bit of a “hall pass” today, because it’s new and exotic. But eventually, it will be assimilated as yet another category of enterprise data. Prepare for that day by assuming that new data demands governance, stewardship, privacy, security, quality, and standards.

Include Hadoop in your data integration infrastructure. Hadoop can replace some of the database management systems and file systems you’re using today, while scaling at a reasonable cost and handling new data types. Modern users’ DI architectures already include Hadoop for landing, staging, push-down processing, archiving, hubs, and lakes.

Integrate fit-for-purpose data to enable data exploration and profiling. The trend is to integrate big data in its raw, original state, into a big data platform, such as Hadoop or a large relational MPP implementation. That way, users can explore and profile new big data to determine its business value. Later, users can repurpose discovered data many ways, sometimes at runtime, as new requirements arise for analytics or operations.

Embrace real-time data ingestion, as required by some forms of big data and analytics. A modern DI infrastructure supports many speeds and frequencies of data ingestion, because diverse data sources and business processes have diverse requirements relative to time. A new challenge for DI is to capture and process streaming data in real time, to enable near-time analytics and business operations (see the streaming sketch at the end of these recommendations).

Prepare to integrate big data by upgrading skills and team structures. TDWI surveys say that a lack of skill is the biggest barrier to success with new big data. Data management professionals need training for Hadoop, NoSQL, natural language processing, and new data types (e.g., JSON, social media, streams). These competencies should be added to those of existing DI competency centers.

Modernize data management solution development by combining agile, stewardship, and collaborative methods. Both agile and stewardship methods recommend the use of a pair of specialists, working together closely: a data specialist and a business representative (or steward). This “dynamic duo” accelerates requirements gathering, ensures data-to-business alignment, and delivers solutions faster than ever.
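As noted in the recommendation on real-time ingestion above, here is a minimal sketch of micro-batched streaming ingestion in Python; the simulated event source, fields, and batching window are hypothetical, and a production solution would rely on a dedicated streaming or change-data-capture tool rather than a hand-rolled loop.

```python
# Minimal micro-batching sketch for streaming ingestion (illustrative only).
import time
from datetime import datetime, timezone

def simulated_sensor_events():
    # Stand-in for a real stream (message queue, CDC feed, sensor gateway, etc.).
    for i in range(10):
        yield {"device_id": f"dev-{i % 3}", "reading": 20.0 + i,
               "at": datetime.now(timezone.utc).isoformat()}
        time.sleep(0.1)

def load_batch(batch):
    # Stand-in for appending to Hadoop, a staging table, or an analytics sandbox.
    print(f"loaded batch of {len(batch)} events")

def ingest(stream, batch_seconds=0.35):
    # Collect events into small time-based batches, then hand each batch downstream.
    batch, window_start = [], time.monotonic()
    for event in stream:
        batch.append(event)
        if time.monotonic() - window_start >= batch_seconds:
            load_batch(batch)
            batch, window_start = [], time.monotonic()
    if batch:
        load_batch(batch)   # flush the final partial batch

ingest(simulated_sensor_events())
```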

If you’d like to hear more of my discussion with Informatica’s Amit Walia (and hear other expert speakers in the Informatica Big Data Ready Virtual Summit, too), please replay the Informatica Webinar by clicking here.

Posted on January 6, 2016


Trip Report: What I Learned at Informatica World 2015

Inspirational User Case Studies and Educational Product Demonstrations

By Philip Russom, TDWI Research Director for Data Management

When I attend a user group meeting or a vendor’s conference, my top two priorities are (1) to hear case studies from successful users and (2) to see practical demonstrations of the vendor’s products. I got both of those in spades last week, when I spent three days attending Informatica World 2015 in Las Vegas.

It was a huge conference, with about 2,500 people attending and five or more tracks running simultaneously. I couldn’t attend all these sessions, so I decided to focus on the keynotes and the Data Integration Track. To give you a taste of the conference, allow me to share highlights from what I was able to attend, with a stress on case studies and demos.

User Case Studies

An enterprise architect at MasterCard discussed their implementation of an enterprise data hub. The hub gives data analysts the data they need in a timely fashion, provides self-service data access for a variety of users, and serves as a unified platform for both internal and external data exchange.

Tom Tshontikidis explained why and how Kaiser Permanente migrated its large collection of data integration solutions from a legacy product (heavily extended via hand coding) to PowerCenter and other Informatica tools.

Two representatives from Cleveland Clinic spoke of their journey from quantity-based metrics for performance management (which mostly laid blame on employees for missed targets) to quality-based predictive analytics (which now sets realistic goals for helping their patients).

Dr. John Frenzel is the chief medical information officer at the MD Anderson Cancer Center. At Informatica World, he discussed how big data analytics is accelerating clinical research. Among the many great tips he shared, Frenzel described how data scientists at MD Anderson work like consultants, traveling among multiple teams, to share their expertise.

An IT systems architect at a major telecommunications company told the story of how the company needed to simplify operations so it could transform into a better integrated – and hence more nimble – global organization. In support of those business goals, IT replaced hundreds of systems, mostly with six primary ones. This gargantuan consolidation project was mostly powered by Informatica tools.

Tom Kato of Mototak Consulting spoke in a few sessions. In one, he described how to manage data from cradle to grave, using best practices and leading tools for Information Lifecycle Management (ILM). In another, he explained his use of the Informatica Data Validation Option (DVO) in an early phase of the merger between American Airlines and US Airways.

John Racer from Discount Tire explained why validating data is important to assuring that data arrives where it’s supposed to be and in the condition intended. He discussed practical applications in cross-platform data flows, application migrations, and data migrations, involving tools from Informatica and other providers.

Product Demonstrations

Some of the coolest demos were presented by users. For example, I saw a management dashboard built by folks at a major energy company, using a visualization tool and data from PowerCenter. The dashboard enables business users to do pipeline capacity management and related operational tasks, many with near time data.

The Informatica Data Validation Option (DVO) kept coming up in presentations by both Informatica employees and customers. I was glad to see this, because I’ve long felt that data integration users do not validate data as often as they should. For example, validation should be part of most ETL testing and all data migration projects.

For a variety of reasons, I was glad to see Secure@Source demo’d. The demo clarified that this is not a security tool, per se, although it can guide your security and other efforts. Instead, Secure@Source provides analytics for assessing data-oriented risks relevant to security, privacy, compliance, governance, and so on. Essentially, you create policies and other business rules (typically inspired by your compliance and governance policies), and Secure@Source helps you identify risks and quantify compliance.

Informatica’s Krupa Natarajan spent most of a session demonstrating Informatica Cloud. This product has been in production since 2006, so there’s a lot of robust functionality to look at. Long story short, Informatica Cloud comes across as a full-featured integration tool, not some after-thought hastily ported to a cloud (as too many cloud-based products are). Although Krupa didn’t say it explicitly, the demo brought home to me the point that data integration with a cloud-based tool is pretty much the same as with traditional tools. That good news should help users get more comfortable with clouds in general, as well as the potential use of cloud-based data management tools.

Further Learning

If you go to www.YouTube.com and search for “Informatica World 2015” you’ll find many useful speeches and sessions that you can replay. Here are a couple of links to get you started:

Keynote by Informatica’s CEO, Sohaib Abbasi. This is a “must see,” if you care about Informatica’s vision for the future, especially in the context of the proposed acquisition of Informatica.

Interviews filmed on site by theCUBE. All the interviews are good. But I especially like the interviews with my analyst friends: John Myers and Mark Smith.

Posted by Philip Russom, Ph.D. on May 18, 2015


Successful Application and Data Migrations and Consolidations

Minimizing Risk with the Best Practices for Data Management

By Philip Russom, TDWI Research Director for Data Management

I recently broadcast a really interesting Webinar with Rob Myers – a technical delivery manager at Informatica – talking about the many critical success factors in projects that migrate or consolidate applications and data. Long story short, we concluded that the many risks and problems associated with migrations and consolidations can be minimized or avoided by following best practices in data management and other IT disciplines. Please allow me to share some of the points Rob and I discussed:

There are many business and technology reasons for migrating and consolidating applications and data.
  • Mergers and Acquisitions (M&As) – Two firms involved in an M&A don’t just merge companies; they also merge applications and data, since these are required for operating the modern business in a unified manner. For example, cross-selling between the customer bases of the two firms is a common business goal in a merger, and this is best done with merged and consolidated customer data.
  • Reorganizations (reorgs) – Some reorgs restructure departments and business units, which in turn can require the restructuring of applications and data. 
  • Redundant Applications – For example, many firms have multiple applications for customer relationship management (CRM) and sales force automation (SFA), as the result of M&As or departmental IT budgets. These are common targets for migration and consolidation, because they work against valuable business goals, such as the single view of the customer and multi-channel customer marketing. In these cases, it’s best to migrate required data, archive the rest of the data, and retire legacy or redundant applications.
  • Technology Modernization – These range from upgrades of packaged applications and database management systems to replacing old platforms with new ones.
  • All the above, repeatedly – In other words, data or app migrations and consolidations are not one-off projects. New projects pop up regularly, so users are better off in the long run if they staff, tool, and develop these projects with the future in mind.

Migration and consolidation projects affect more than applications and data:
  • Business Processes – The purpose of enterprise software is to automate business processes, to give the organization greater efficiency, speed, accuracy, customer service, and so on. Hence, migrating software is tantamount to migrating business processes, and a successful project executes without disrupting business processes.
  • Users of applications and data – These vary from individual people to whole departments and sometimes beyond the enterprise to customers and partners. A successful project defines steps for switching over users without disrupting their work.

Application or data migrations and consolidations are inherently risky. This is due to their large size and complexity, numerous processes and people affected, cost of the technology, and (even greater) the cost of failing to serve the business on time and on budget. If you succeed, you’re a hero or heroine. If you fail, the ramifications are dire for you personally and the organization you work for.

Succeed with app/data migrations and consolidations. Success comes from combining the best practices of data management, solution development, and project management. Here are some of the critical success factors Rob and I discussed in the Webinar:
  • Go into the project with your eyes wide open – Realize there’s no simple “forklift” of data, logic, and users from one system to the next, because application logic and data structures often need substantial improvements to be fit for a new purpose on a new platform. Communicate the inherent complexities and risks, in a factual and positive manner, without sounding like a “naysayer.”
  • Create a multi-phased plan for the project – Avoid a risky “big bang” approach by breaking the project into manageable steps. Pre-plan by exploring and profiling data extensively. Follow a develop-test-deploy methodology. Coordinate with multi-phased plans from outside your data team, including those for applications, process, and people migration. Expect that old and new platforms must run concurrently for a while, as data, processes, and users are migrated in orderly groups.
  • Use vendor tools – Programming (or hand coding) is inherently non-productive as the primary development method for either applications or data management solutions. Furthermore, vendor tools enable functions that are key to migrations, such as data profiling, develop-test-deploy methods, full-featured interfaces to all sources and targets, collaboration for multi-functional teams, repeatability across multiple projects, and so on.
  • Template-ize your project and staff for repeatability – In many organizations, migrations and consolidations recur regularly. Put extra work into projects, so their components are easily reused, thereby assuring consistent data standards, better governance, and productivity boosts over time.
  • Staff each migration or consolidation project with diverse people – Be sure that multiple IT disciplines are represented, especially those for apps, data, and hardware. You also need line-of-business staff to coordinate processes and people. Consider staff augmentation via consultants and system integrators.
  • Build a data management competency center or similar team structure – From one center, you can staff data migrations and consolidations, as well as related work for data warehousing, integration, quality, database administration, and so on.

If you’d like to hear more of my discussion with Informatica’s Rob Myers, please replay the Webinar from the Informatica archive.

Posted by Philip Russom, Ph.D. on March 11, 2015


Q&A RE: The State of Big Data Integration

It’s still early days, but users are starting to integrate big data with enterprise data, largely for business value via analytics.

By Philip Russom, TDWI Research Director for Data Management

A journalist from the IT press recently sent me an e-mail containing several very good questions about the state of big data relative to integrating it with other enterprise data. Please allow me to share the journalist’s questions and my answers:

How far along are enterprises in their big data integration efforts?

According to my survey data, approximately 38% of organizations don’t even have big data, in any definition, so they’ve no need to do anything. See Figure 1 in my 2013 TDWI report Managing Big Data. Likewise, 23% have no plans for managing big data with a dedicated solution. See Figure 5 in that same report.

Even so, some organizations have big data, and they are already managing it actively. Eleven percent have a solution in production today, with another 61% coming in the next three years. See Figure 6.

Does data integration now tend to be haphazard, or one-off projects, in many enterprises, or are architectural strategies emerging?

I see all the above, whether with big data or the usual enterprise data. Many organizations have consolidated most of their data integration efforts into a centralized competency center, along with a centrally controlled DI architecture, whereas a slight majority tend to staff and fund DI on a per-application or per-department basis, without an enterprise strategy or architecture. Personally, I’d like to see more of the former and less of the latter.

What are the best approaches for big data integration architecture?

It depends on many things, including what kind of big data you have (relational, other structures, human language text, XML docs, etc.) and what you’ll do with it (analytics, reporting, archiving, content management). Multiple big data types demand multiple data platforms for storing big data, whereas multiple applications consuming big data require multiple processing types to prepare big data for those applications. For these reasons, in most cases, managing big data and getting business use from it involves multiple data management platforms (from relational DBMSs to Hadoop to NoSQL databases to clouds) and multiple integration tools (from ETL to replication to federation and virtualization).

Furthermore, capturing and integrating big data can be challenging from a data integration viewpoint. For example, the streaming big data that comes from sensors, devices, vehicles, and other machines requires special event-processing technologies to capture, triage, and route time-sensitive data—all in a matter of milliseconds. As with all data, you must transform big data as you move it from a source to a target, and the transformations may be simple (moving a click record from a Web log to a sessionization database) or complex (deducing a fact from human language text and generating a relational record from it).
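The sessionization example mentioned above can be illustrated with a short, hedged Python sketch: it groups click records from a web log into per-user sessions whenever the gap between clicks exceeds a timeout (the 30-minute timeout and the record fields are hypothetical).

```python
# Illustrative sessionization of web-log click records (hypothetical fields, 30-minute timeout).
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(clicks):
    """Group clicks per user into sessions; a gap over the timeout starts a new session."""
    sessions = []
    current = None
    for click in sorted(clicks, key=lambda c: (c["user_id"], c["ts"])):
        new_session = (
            current is None
            or click["user_id"] != current["user_id"]
            or click["ts"] - current["last_ts"] > SESSION_TIMEOUT
        )
        if new_session:
            current = {"user_id": click["user_id"], "start": click["ts"],
                       "last_ts": click["ts"], "clicks": 0}
            sessions.append(current)
        current["clicks"] += 1
        current["last_ts"] = click["ts"]
    return sessions

clicks = [
    {"user_id": "u1", "ts": datetime(2014, 1, 20, 9, 0)},
    {"user_id": "u1", "ts": datetime(2014, 1, 20, 9, 10)},
    {"user_id": "u1", "ts": datetime(2014, 1, 20, 11, 0)},   # > 30-minute gap -> new session
    {"user_id": "u2", "ts": datetime(2014, 1, 20, 9, 5)},
]
for s in sessionize(clicks):
    print(s["user_id"], s["start"], s["clicks"], "clicks")
```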

What "traditional" approaches are being updated with new capabilities and connectors?

The most common data platforms used for capturing, storing, and managing big data today are relational databases, whether based on MPP, SMP, appliance, or columnar architectures. See Figure 16 in the Managing Big Data report. This makes sense, given that in a quarter of organizations big data is mostly or exclusively structured data. Even in organizations that have diverse big data types, structured and relational types are still the most common. See Figure 1.

IMHO, we’re fortunate that vendors’ relational database management systems (RDBMSs) (from the old brands to the new columnar and appliance-based ones) have evolved to scale up to tens and hundreds of terabytes of relational and otherwise structured data. Data integration tools have likewise evolved. Hence, scalability is NOT a primary barrier to managing big data.

If we consider how promising Hadoop technologies are for managing big data, it’s no surprise that vendors have already built interfaces, semantic layers, and tool functionality for accessing a broad range of big data managed in the Hadoop Distributed File System (HDFS). This includes tools for data integration, reporting, analysis, and visualization, plus some RDBMSs.

What are the enterprise "deliverables" coming from users’ efforts with big data (e.g., analytics, business intelligence)?

Analytics is the top priority and hence a common deliverable from big data initiatives. Some reports also benefit from big data. A few organizations are rethinking their archiving and content management infrastructures, based on big data and the potential use of Hadoop in these areas.

How is the role of data warehousing evolving to meet the emergence of Big Data?

Big data is a huge business opportunity, with few technical challenges or downsides. See figures 2 through 4 in the report Managing Big Data. Conventional wisdom says that the opportunity for business value is best seized via analytics. So the collection, integration, and management of big data is not an academic exercise in a vacuum. It is foundational to enabling the analytics that give an organization new and broader insights. Any calculus for the business return on managing big data should be based largely on the benefits of new analytics applied to big data.

On April 1, 2014, TDWI will publish my next big report on Evolving Data Warehouse Architectures in the Age of Big Data. At that time, anyone will be able to download the report for free from www.tdwi.org.

How are the new platforms (such as Hadoop) getting along with traditional platforms such as data warehouses?

We say “data warehouse” as if it’s a single monolith. That’s convenient, but not very accurate. From the beginning, data warehouses have been environments of multiple platforms. It’s common that the core warehouse, data marts, operational data stores, and data staging areas are each on their own standalone platforms. The number of platforms increased early this century, as data warehouse appliances and columnar RDBMSs arrived. It’s now increasing again, as data warehouse environments now fold in new data platforms in the form of the Hadoop Distributed File System (HDFS) and NoSQL databases. The warehouse has always evolved to address new technology requirements and business opportunities; it’s now evolving again to assure that big data is managed appropriately for the new high-value analytic applications that many businesses need.

For an exhaustive discussion of this, see my 2013 TDWI report Integrating Hadoop into Business Intelligence and Data Warehousing.

Posted by Philip Russom, Ph.D. on January 22, 2014