The Cloud Can Be a Massive Data Source for Analytics
Don't overlook cloud-resident open data sources for populating your data warehouses.
- By Mike Schiff
- September 12, 2018
Almost everyone realizes the many benefits of cloud computing as an alternative or adjunct to in-house computing power, storage, and software. Cloud computing was once considered to be a low-cost way for smaller companies to acquire computing power without having to invest in a "glass house" data center with its associated capital expenditures, operations staff, and ongoing maintenance. Larger organizations soon realized that cloud-based resources could augment their in-house computing power, accommodate seasonal demand, and provide processing power to remote locations on an as-needed basis.
Furthermore, many software companies started to offer their software through the cloud as application service providers rather than through traditional software licenses for their clients' on-premises computing platforms. Additionally, many organizations realized that the cloud was also suitable for backup and failover purposes.
That said, I believe the cloud is more than just an off-site computing platform. It is also a massive lake of structured and unstructured data and a potential data source for data mining and other analytics. Of course, permitting access to a private data cloud will likely raise concerns about privacy and security. After all, no organization would allow its competitors to access its customer files or non-public financial and sales data.
Cloud-Based Open Data Sets
As data warehouse practitioners, we certainly recognize that some of the data our users require is external to our in-house data centers or our own cloud storage. Our organizations likely store some of their proprietary data in the cloud, but there are many thousands of cloud-resident public or "open" data sets that can be accessed for free and could prove valuable. This is not to say that you should necessarily deploy clouds for your data warehousing platforms; rather, you should consider cloud-resident data as an additional data source.
For example, cloud-resident medical databases such as those available at the National Institutes of Health's National Library of Medicine website or from various disease-specific research sites should provide a rich source of data that could facilitate medical diagnosis and treatments, especially when augmented with technology such as artificial intelligence and machine learning capabilities. Although there certainly will be privacy concerns, including HIPAA regulations and even the European Union's General Data Protection Regulation, these could be minimized by ensuring that individuals' identities are anonymized.
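As a rough illustration of the identity-masking step mentioned above, the following Python sketch drops direct identifiers from a record and replaces the record key with a keyed-hash token. The field names (`patient_id`, `name`, `address`) and both function names are hypothetical, chosen only for this example. Strictly speaking, this technique is pseudonymization rather than full anonymization, and whether it is sufficient under HIPAA or the GDPR depends on the data set and how it is used.

```python
import hashlib
import hmac


def pseudonymize(identifier, secret_key):
    """Return a stable keyed-hash token (hex string) for a direct identifier."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def strip_identity(record, secret_key, id_field="patient_id",
                   drop_fields=("name", "address")):
    """Copy a record without its direct identifiers.

    Field names are hypothetical; adjust them to the actual schema.
    """
    # Remove fields that identify the person outright.
    cleaned = {k: v for k, v in record.items() if k not in drop_fields}
    # Replace the key with a token so rows for one person can still be joined.
    cleaned[id_field] = pseudonymize(str(record[id_field]), secret_key)
    return cleaned
```

Because the token is derived with a secret key, the same individual maps to the same token across loads, which preserves joins in the warehouse, while anyone without the key cannot recompute the token from a known identifier.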
Other examples of these open data sets include the multitude of government databases catalogued and linked to at Data.gov, which is managed and hosted by the United States General Services Administration (GSA). According to its website, "Data.gov is the central clearinghouse for federal open data, including hosting the Public Data Listings required under the 2013 Federal Open Data Policy, but Data.gov also hosts state, local, and tribal government sources voluntarily." I strongly recommend browsing this website to discover links to databases you didn't previously even know existed that could be of use to your organization.
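Browsing a catalog like this can also be scripted when you are scouting candidate sources for a warehouse. The sketch below assumes the catalog exposes a CKAN-style search endpoint at the URL shown and that responses follow the usual CKAN layout (`result.results[].resources[].format`); both are assumptions to verify against the site's current documentation. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the catalog offers a CKAN-style search action at this address.
CATALOG_SEARCH_URL = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(query, rows=5):
    """Return the search URL for a free-text query, limited to `rows` results."""
    return CATALOG_SEARCH_URL + "?" + urlencode({"q": query, "rows": rows})


def summarize_results(payload):
    """Extract (title, sorted resource formats) pairs from a CKAN-style response."""
    summaries = []
    for dataset in payload.get("result", {}).get("results", []):
        # Collect the distinct, non-empty file formats offered for this data set.
        formats = sorted({
            resource["format"]
            for resource in dataset.get("resources", [])
            if resource.get("format")
        })
        summaries.append((dataset.get("title", ""), formats))
    return summaries


def search_catalog(query, rows=5, timeout=30):
    """Query the catalog over the network and summarize matching data sets."""
    with urlopen(build_search_url(query, rows), timeout=timeout) as response:
        return summarize_results(json.load(response))
```

Calling `search_catalog("hospital readmissions")` would return a short list of data set titles with the formats each one offers, which is often enough to decide whether a source is worth a closer look before building an extract job for it.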
In addition to open data sets, various industry and trade organizations provide member access to cloud-based data, often at minimal or no cost. These, too, can be a rich source of useful data.
A Final Word
We recognize the usefulness of the cloud for providing computational, software, and data storage capabilities, but we should not overlook how the cloud can also be a potential data source for data mining and other analytics. Although our current data sources almost certainly include our own organizations' cloud-resident data, we should expand our potential sources to include external cloud-resident public data sources as well.
If you have not already done so, ask your user communities if they are aware of any public or private sources that you might not currently be aware of that would benefit your organization.
Recognize that the cloud can be a massive data lake; let's not be afraid of taking a dip in it.
Michael A. Schiff is founder and principal analyst of MAS Strategies, which specializes in formulating effective data warehousing strategies. With more than four decades of industry experience as a developer, user, consultant, vendor, and industry analyst, Mike is an expert in developing, marketing, and implementing solutions that transform operational data into useful decision-enabling information.
His prior experience as an IT director and systems and programming manager provides him with a thorough understanding of the technical, business, and political issues that must be addressed for any successful implementation. With Bachelor and Master of Science degrees from MIT's Sloan School of Management and as a certified financial planner, Mike can address both the technical and financial aspects of data warehousing and business intelligence.