Big Data is moving from ‘hype’ to the production stage. The
optimistic forecast is that the vast quantities of data that humans and machines
will be creating by the year 2020 could enhance productivity,
improve organizational transparency, and expand the frontier of the
“knowable future.” However, it takes time to mature the technologies and
develop analytics talent, and there are still many concerns and heated debates about how
to manage Big Data effectively in order to amplify signal and capture insight from the growing mountains of
data. Here is an interesting debate: does noise rise faster than the information
you get from Big Data?
It's all about the
dimensions of analysis. One challenge is that we have no clear definition of
noise. Without one, a range of different contexts ends up being looked at and
considered. Wide-ranging data from a
variety of disparate sources naturally widens the window of observation,
which in turn increases the number of dimensions from which you can approach the analysis. More
data, particularly more diverse data that can be intelligently analyzed,
supports the basic scientific method.
Whether your data is good or bad depends on what you are looking for and
the context in which you are looking. If you want to understand the
impact of factor X on something, then the data you add needs to have factor X
varied across an appropriate range in the area where you want to
understand it.
Noise is an issue if
you are using a single data source as ground truth, but it is often
inappropriate to use a single data source as ground truth for Big Data. Big Data
is about the availability of more data points, from which correlations previously hidden from view can be derived
and put to valuable use. The better approach is to fuse and integrate many
independent data sources from which the same underlying model of reality can be
derived. This essentially creates a voting protocol among the unrelated data
sources about the underlying facts of reality. This type of data fusion and
integration is critical because it is the only robust and scalable error
correction, bias reduction, and noise removal mechanism for non-trivial data
models.
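The "voting protocol" idea above can be illustrated with a minimal sketch. It assumes each independent source reports its own value for the same fact; the function names, source names, and values are hypothetical, and majority vote and the median are just two simple ways to let sources outvote a bad reading, not a complete data fusion method.

```python
from collections import Counter
from statistics import median


def fuse_categorical(reports):
    """Majority vote over one categorical fact reported by several sources.

    Returns (value, agreement), where agreement is the share of sources
    that reported the winning value. Ties go to the value seen first.
    """
    if not reports:
        raise ValueError("need at least one report")
    counts = Counter(reports.values())
    value, votes = counts.most_common(1)[0]
    return value, votes / len(reports)


def fuse_numeric(readings):
    """Median of one numeric fact; a minority of bad sources cannot move it far."""
    if not readings:
        raise ValueError("need at least one reading")
    return median(readings.values())


# Hypothetical sources and values, for illustration only.
city_reports = {"crm": "Boston", "billing": "Boston", "web_form": "Bostn"}
temperature_readings = {"sensor_a": 21.4, "sensor_b": 21.6, "sensor_c": 85.0}

city, agreement = fuse_categorical(city_reports)
fused_temperature = fuse_numeric(temperature_readings)

print(city, round(agreement, 2))  # Boston 0.67
print(fused_temperature)          # 21.6
```

The misspelled entry and the faulty sensor are each outvoted by the other two sources, which is the sense in which adding independent sources can reduce noise rather than increase it. The agreement score also shows how much the sources disagree, which is useful information in its own right.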
Connect Dots. True
genius is found in our ability to connect data dots in ways that help advance
various objectives. The more dots and the more combinations of connections, the
greater the potential insight and diversity of perspectives you may capture. The noise
problem is always there, no matter how or when. We are in the big data era now,
and of course that means additional noise from different sources, but the most
important part is that you can analyze what you couldn't before and find something
new in data that was ignored or treated as "noise" in the past. Reducing
noise efficiently is why data scientists are needed.
Data analysis should
be a multi-disciplinary exercise that benefits from data scientists/artists,
business leaders, psychologists, philosophers, cognitive scientists, economists,
and others, depending on the analytical goals. There will always be more
data than "information", more noise than usable signal. But we
don't know what we don't know, and the value of "one unit of new
information" in the form of meaningful "signals" will likely
far surpass the cost of weeding through the rapidly expanding noise from which it
is gleaned. So what counts as signal in Big Data really depends on the analytical objective at hand, and a multi-disciplinary approach is necessary. What looks like noise in your signal might be the signal in someone else's noise!
Data quality and data
management are critical issues. This is really where the
"variety" part of the 3Vs comes in. The noise problem is a reflection
of the data management approach rather than the size of the data per se. There
is no requirement in Big Data that anyone use a single data source as ground
truth, and that approach does have many problems. If you integrate a variety of
data sources into data models and use them to corroborate each other with
respect to ground truth, then bigger data makes it much easier to construct a
high-fidelity model of whatever phenomena you might be interested in. However,
most Big Data platforms today still handle variety and data fusion quite
poorly, so the tools do not exactly encourage reducing noise in this way, and much
still depends on people’s ability to analyze the data wisely.
Big Data talent: The
challenge lies in understanding the data, separating the meaningful
business insight from the rest, and providing clear information for executives
to act upon. There are few people around who can do this, and even fewer
who can successfully build such teams within an organization’s ecosystem. In
short, today's hyper-digital, high-speed world offers all disciplines a new
perspective that can give rise to new questions, fed by data that has
historically been hidden from view. The creative data scientist/analyst/artist
should frame the important new questions and use both creativity and logic to amplify signal over noise. Many
companies still do not know what to look for, and have no idea how complex the
work is or how to assemble their teams of experts.
‘Success formula’ with
logic steps: quality data + analytical objectives + appropriate analysis
guided by highly skilled Big Data talent/subject matter experts = signal in the context of the
mission at hand. The same data, viewed with different objectives by
different experts, will render different signals and filter out different noise.
So besides the success formula, the logic steps are: a) really know what you are
doing (multiple comparison problems, etc.), or b) formulate very precisely what
exactly you want to know from your data BEFORE looking at it.
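To make the multiple comparison problem concrete: if you test many factors against the same data, some will look significant by chance alone. The sketch below uses the Bonferroni correction, which is one standard (and conservative) remedy and not necessarily the one meant here. The function name, factor names, and p-values are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the names of tests that remain significant after Bonferroni correction.

    Each p-value is compared against alpha divided by the number of tests,
    which keeps the chance of any false positive at or below alpha.
    """
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [name for name, p in p_values.items() if p <= threshold]


# Hypothetical p-values from testing five factors on the same data set.
p_values = {
    "price": 0.001,
    "weekday": 0.03,
    "weather": 0.04,
    "browser": 0.2,
    "region": 0.6,
}

# Naive reading: every p-value below 0.05 counts as a finding.
naive = [name for name, p in p_values.items() if p <= 0.05]
# Corrected reading: the threshold becomes 0.05 / 5 = 0.01.
corrected = bonferroni_significant(p_values)

print(naive)      # ['price', 'weekday', 'weather']
print(corrected)  # ['price']
```

Two of the three naive "signals" do not survive the correction. Formulating the question before looking at the data, as in option b), avoids the problem from the other side by reducing the number of comparisons made in the first place.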
Big Data is not just an IT or marketing project; it is an innovation experiment that demands out-of-the-box thinking and a philosophical perspective, not as the only perspective, but
certainly a critical one. Anything less leaves the view limited by blind spots
or tunnel vision.
The concept of "Signal vs. Noise" in the context of big data refers to the challenge of distinguishing meaningful information (signal) from irrelevant or distracting data (noise). Here’s a breakdown of this concept:
Signal:
Definition: The signal represents valuable, actionable insights derived from data that can inform decisions or strategies.
Characteristics:
Relevant trends or patterns that provide insights into customer behavior, operational efficiency, etc.
Data that contributes to understanding complex phenomena, leading to improved outcomes.
Examples: Customer purchasing trends, operational bottlenecks, predictive analytics in supply chain management.
Noise:
Definition: Noise refers to irrelevant, redundant, or random data that can obscure the signal and complicate analysis.
Characteristics:
Data that adds little to no value or might mislead decision-making.
Outliers or anomalies that do not represent broader trends.
Examples: Erroneous data entries, irrelevant social media mentions, or non-actionable metrics.
Importance in Big Data:
Data Overload: With the vast amounts of data generated, distinguishing between signal and noise is crucial to avoid information overload.
Effective Analysis: Focusing on the signal enables more effective data analysis, leading to better business insights and decisions.
Resource Efficiency: Helps in allocating resources towards data that truly matters, improving efficiency in data processing and analytics efforts.
Strategies to Enhance Signal vs. Noise:
Data Cleaning: Implement robust data cleaning processes to remove irrelevant or erroneous data.
Advanced Analytics: Use machine learning and statistical methods to identify patterns and extract meaningful insights.
Visualization: Leverage data visualization tools to highlight important trends and insights, making it easier to discern signal from noise.
Domain Expertise: Involve subject matter experts to contextualize data and filter out noise effectively.