Is Data Lake merging into Big Data Sea ~ Future of CIO

Sunday, December 7, 2014

Is Data Lake merging into Big Data Sea

11:11 PM Pearl Zhu

Whether it is a Big Data Sea or a Data Lake, both are means to an end: fitting the business purpose.

What is a data lake? A repository for large quantities and varieties of data, both structured and unstructured. The lake can serve as a staging area for the data warehouse, which holds more carefully “treated” data for reporting and analysis in batch mode.

The information contained in the lake is structurally varied. It is also likely contextually varied. In fact, the lake is premised on contextual diversity. For example, if you have a large amount of transit data pertaining to vehicle operations, contextualization can be achieved in part by associating this data with contextual metrics such as delays, accidents, and complaints. But there is also a need to distinguish between events pertaining to intervention, outcome, operations (things done to support the business), users (those who interact with the business), external intervention (implemented by outsiders), and the environment (such as traffic conditions). The data has to retain, to some extent, this type of contextual reality (the transpositional differences latent in the phenomena conveyed by the data). Otherwise, there might simply be an utter lack of sophistication, rendering large amounts of data, and the lake itself, superfluous.
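To make the transit example concrete, here is a minimal Python sketch of tagging events with the context categories listed above and grouping them by context. The `Context`, `Event`, and `group_by_context` names, the field layout, and the sample records are all hypothetical, chosen only for illustration; they do not come from any particular product.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Context(Enum):
    """The six context categories named in the paragraph above."""
    INTERVENTION = "intervention"
    OUTCOME = "outcome"
    OPERATIONS = "operations"
    USERS = "users"
    EXTERNAL_INTERVENTION = "external intervention"
    ENVIRONMENT = "environment"


@dataclass(frozen=True)
class Event:
    # Hypothetical record layout for one piece of transit data.
    vehicle_id: str
    kind: str          # e.g. "delay", "accident", "complaint"
    context: Context   # the context the event belongs to


def group_by_context(events):
    """Return a dict mapping each context seen to its list of events."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event.context].append(event)
    return dict(grouped)


# Made-up sample data, for illustration only.
events = [
    Event("bus-12", "delay", Context.OUTCOME),
    Event("bus-12", "road closure", Context.ENVIRONMENT),
    Event("bus-07", "complaint", Context.USERS),
    Event("bus-07", "delay", Context.OUTCOME),
]
by_context = group_by_context(events)
```

Keeping the context as an explicit field on each record is one simple way to preserve the distinction between, say, an outcome and an environmental condition once the data lands in the lake.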

Many innovative companies that build a data lake also index the data with a general-purpose search engine. Once you index and rank the data within a data lake, and make it navigable through faceting and metadata enrichment, users find it easier to gain insights from the disparate data sources inside the lake. However, there is a huge difference between indexing data and actually managing it. Finding something is not the same as having robust control over where it came from, who touched it and how, how long it is stored, who can see or modify it, and how it relates to the rest of the enterprise's data. Data lakes are a great way to make Big Data accessible to regular users, rather than circling the enterprise data warehouse wagons around an outdated approach to the explosion of disparate, high-velocity data sources. Those sources contain key insights companies can use to grow revenue, cut costs, and improve customer experience.
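As a rough illustration of indexing plus faceting, the following Python sketch builds a tiny in-memory keyword index with facet filters. It is a toy under stated assumptions, not how any real enterprise search engine works: the `FacetedIndex` class, the facet names (`source`, `year`), and the sample documents are all hypothetical, and ranking is left out entirely.

```python
from collections import defaultdict


class FacetedIndex:
    """Toy inverted index with facet filtering (illustrative only)."""

    def __init__(self):
        self._terms = defaultdict(set)    # word -> ids of docs containing it
        self._facets = defaultdict(set)   # (facet name, value) -> doc ids

    def add(self, doc_id, text, **facets):
        """Index a document's words and attach facet metadata to it."""
        for word in text.lower().split():
            self._terms[word].add(doc_id)
        for name, value in facets.items():
            self._facets[(name, value)].add(doc_id)

    def search(self, term, **facets):
        """Return sorted ids of docs containing `term` and matching all facets."""
        hits = set(self._terms.get(term.lower(), set()))
        for name, value in facets.items():
            hits &= self._facets.get((name, value), set())
        return sorted(hits)


# Made-up documents from two hypothetical sources in the lake.
index = FacetedIndex()
index.add("d1", "bus delay on route 5", source="transit_log", year=2014)
index.add("d2", "customer complaint about delay", source="crm", year=2014)
index.add("d3", "delay caused by traffic", source="transit_log", year=2013)

results = index.search("delay", source="transit_log")
```

Note what the sketch does not record: where a document came from, who changed it, how long it should be kept, or who may see it. That gap is exactly the difference between finding data and managing it.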

Smart enterprises will balance innovation (data lakes) with governance (semantic consistency, etc.), and not stunt the growth of one in pursuit of the other. Not paying adequate attention to the demands of governance and proper life cycle management when needed is, frankly, reckless. It is hard to build governance structures and semantic consistency when the underlying data sources are inherently turbulent. Indexing and enriching data in a data lake through enterprise search technology is the key first step to building higher-order applications that allow users to query, navigate, and discover relationships between disparate data stores. In addition, the definition of a data lake varies. Essentially, every vendor in this space (large, small, whatever) says, "What we sell is a data lake." This leads to a fair amount of confusion. People looking to escape the perceived tyranny of the data warehouse with a data lake are going to experience information management anarchy. The lack of data curation in a data lake, at least as commonly understood, severely limits the audience able to use it. Because data lakes lack semantic consistency and governed metadata, the concept also assumes that everyone in that audience is highly skilled at data manipulation and analysis.

Big Data is still a big puzzle in many organizations, and one size does not fit all. The best approach is what is being called Fit For Purpose: match the problem and its requirements to the emerging technology best suited to solve it.