Wednesday, February 5, 2014

Big Data Signal vs. Noise

Big Data Leads Us Further into a "Knowable Future".


Big Data is moving from ‘hype’ to the production stage. The optimistic forecast is that the vast quantities of data that humans and machines will be creating by the year 2020 could enhance productivity, improve organizational transparency, and expand the frontier of the “knowable future.” However, it takes time to mature the technologies and develop analytics talent, and there are still many concerns and heated debates about how to manage Big Data effectively in order to amplify signal and capture insight from the growing mountains of data. Here is an interesting debate: does noise grow faster than the information you get from Big Data?

It's all about the dimensions of analysis. One challenge is that we do not have a clear definition of noise. Without one, people end up looking at and considering a range of different contexts. Wide-ranging data from a variety of disparate sources naturally widens the window of observation, which in turn increases the number of dimensions from which you can approach a problem. More data, particularly more diverse data that can be analyzed intelligently, supports the basic scientific method. Whether your data is good or bad depends on what you are looking for and the context in which you are looking. If you want to understand the impact of factor X on something, then the data you add needs to have factor X varied across an appropriate range in the area where you want to understand it.

Noise is an issue if you are using a single data source as ground truth, but it is often inappropriate to use a single data source as ground truth for Big Data. Big Data is about the availability of more data points, from which correlations previously hidden from view can be derived and put to valuable use. The better approach is to fuse and integrate many independent data sources from which the same underlying model of reality can be derived. This essentially creates a voting protocol among the unrelated data sources about the underlying facts of reality. This type of data fusion and integration is critical because it is the only robust and scalable mechanism for error correction, bias reduction, and noise removal in non-trivial data models.
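To make the "voting protocol" idea concrete, here is a minimal sketch in Python. It assumes each source reports its own reading of the same fact; the function names and the sample readings are hypothetical, chosen only for illustration. Categorical facts are settled by majority vote and numeric facts by the median, so one noisy source cannot drag the fused value far.

```python
from collections import Counter
from statistics import median


def fuse_categorical(reports):
    """Return the majority value among source reports, or None if there is no clear winner."""
    counts = Counter(reports).most_common()
    if not counts:
        return None  # no source reported anything
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: the sources disagree too much to decide
    return counts[0][0]


def fuse_numeric(reports):
    """Return the median, which a single wildly wrong source cannot move far."""
    return median(reports)


# Hypothetical readings of the same facts from three unrelated sources.
city_reports = ["Boston", "Boston", "Bostn"]   # one source has a typo
temperature_reports = [21.4, 21.9, 98.0]       # one source is badly off

print(fuse_categorical(city_reports))
print(fuse_numeric(temperature_reports))
```

Real data fusion also has to match records across sources and weigh how reliable each source is; this sketch leaves both out.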

Connect the dots. True genius is found in our ability to connect data dots in ways that help advance various objectives. The more dots and the more combinations of connections, the greater the potential insight and diversity of perspectives you may capture. The noise problem is always there, no matter how or when you look. We are in the Big Data era now, and of course that means additional noise from different sources, but the most important part is that you can run analyses you couldn't do before and find something new in data that was ignored or treated as "noise" in the past. Reducing noise efficiently is why data scientists are needed.

Data analysis should be a multi-disciplinary exercise that benefits from data scientists/artists, business leaders, psychologists, philosophers, cognitive scientists, economists and others, depending on the analytical goals. There will always be more data than "information", more noise than usable signal. But we don't know what we don't know, and the value of "one unit of new information" in the form of meaningful "signals" will likely far surpass the cost of weeding through the rapidly expanding noise from which it is gleaned. So what counts as signal in Big Data really depends on the analytical objective at hand, and a multi-disciplinary approach is necessary. What looks like noise in your signal might be the signal in someone else's noise!

Data quality and data management are critical issues. This is really where the "variety" part of the 3Vs comes in. The noise problem is a reflection of the data management approach rather than the size of the data per se. There is no requirement in Big Data that anyone use a single data source as ground truth, and that approach does have many problems. If you integrate a variety of data sources into data models and use them to corroborate each other with respect to ground truth, then bigger data makes it much easier to construct a high-fidelity model of whatever phenomena you might be interested in. However, most Big Data platforms today still handle variety and data fusion quite poorly, so the tools do not exactly encourage reducing noise in this way, and much still depends on people’s ability to analyze the data wisely.

Big Data talent: The challenge lies in understanding the data, separating the meaningful business insight from the rest, and providing clear information for executives to act upon. There are few people around who can do this, and even fewer who can successfully build such teams within an organization’s ecosystem. In short, today's hyper-digital, high-speed world offers all disciplines a new perspective that can give rise to new questions, fed by data that has historically been hidden from view. The creative data scientist/analyst/artist should frame the important new questions and use both creativity and logic to amplify signals from noise. Many companies still do not know what to look for, and have no idea how complex the work is or how to assemble their teams of experts.

‘Success formula’ with logical steps: Quality data + analytical objectives + appropriate analysis guided by highly skilled Big Data talent/subject matter experts = signal in the context of the mission at hand. The same data examined with different objectives by different experts will yield different signals and filter out different noise. So besides the success formula, the logical steps are a) really know what you are doing (the multiple comparisons problem, etc.) or b) formulate very precisely what exactly you want to know from your data BEFORE looking at it.
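The multiple comparisons problem mentioned above can be illustrated with a short Python sketch. The functions and p-values here are hypothetical, and the family-wise error formula assumes the tests are independent. The sketch shows how quickly the chance of at least one false "signal" grows as more hypotheses are tested, and applies the Bonferroni correction, which is one common (and conservative) remedy, not the only one.

```python
def familywise_error_rate(alpha, num_tests):
    """Chance of at least one false positive across independent tests run at level alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value that survives the Bonferroni-corrected threshold alpha / m."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Testing 20 unrelated hypotheses at the usual 5% level:
print(round(familywise_error_rate(0.05, 20), 2))

# Hypothetical p-values from four exploratory tests on the same data set.
p_values = [0.001, 0.02, 0.04, 0.30]
print(bonferroni_significant(p_values))
```

With four tests the corrected threshold is 0.05 / 4 = 0.0125, so only the first p-value still counts as signal. This is why deciding what you want to know before looking at the data matters.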


Big Data is not just an IT or marketing project; it is an innovation experiment that demands out-of-the-box thinking and a philosophical perspective, not as the only perspective, but certainly a critical one. Anything less leaves the view with blind spots or tunnel vision.

2 comments:


The concept of "Signal vs. Noise" in the context of big data refers to the challenge of distinguishing meaningful information (signal) from irrelevant or distracting data (noise). Here’s a breakdown of this concept:

Signal:
  Definition: The signal represents valuable, actionable insights derived from data that can inform decisions or strategies.
  Characteristics:
    Relevant trends or patterns that provide insights into customer behavior, operational efficiency, etc.
    Data that contributes to understanding complex phenomena, leading to improved outcomes.
  Examples: Customer purchasing trends, operational bottlenecks, predictive analytics in supply chain management.

Noise:
  Definition: Noise refers to irrelevant, redundant, or random data that can obscure the signal and complicate analysis.
  Characteristics:
    Data that adds little to no value or might mislead decision-making.
    Outliers or anomalies that do not represent broader trends.
  Examples: Erroneous data entries, irrelevant social media mentions, or non-actionable metrics.

Importance in Big Data:
  Data Overload: With the vast amounts of data generated, distinguishing between signal and noise is crucial to avoid information overload.
  Effective Analysis: Focusing on the signal enables more effective data analysis, leading to better business insights and decisions.
  Resource Efficiency: Helps in allocating resources towards data that truly matters, improving efficiency in data processing and analytics efforts.

Strategies to Enhance Signal vs. Noise:
  Data Cleaning: Implement robust data cleaning processes to remove irrelevant or erroneous data.
  Advanced Analytics: Use machine learning and statistical methods to identify patterns and extract meaningful insights.
  Visualization: Leverage data visualization tools to highlight important trends and insights, making it easier to discern signal from noise.
  Domain Expertise: Involve subject matter experts to contextualize data and filter out noise effectively.
