Tuesday, February 24, 2015

Do you Understand the Big Data Pipeline?

Big Data is neither the fancy new technologies nor a huge database; it is a full step-by-step process pipeline.
Big Data, as a big phenomenon, attracts big attention; however, people are getting confused about the difference between the tools and the knowledge in what everyone now calls Big Data. Data analysis, and Big Data, is not Hadoop or SQL or Python or R. These are just tools that help you analyze data. Getting Big Data right is about more than the size of your database or the fancy tools; it is the full step-by-step process PIPELINE, from collecting and storing data to analyzing and visualizing it, in order to extract business insight and foresight.

Big Data is not just a big database; it is multiple databases that have to be connected and work together, databases created by different people, even different companies, in different formats. All of these have to work together. The first problem Big Data poses is how to handle so much data. It arises because, for example, a fast-growing user table cannot simply be pruned on the user's behalf, so you need storage that still supports fast querying and updating. To address this, the tech community introduced Hadoop, a distributed server architecture that makes it easy to scale out as existing data volume grows. Then you need to understand the data, run machine learning algorithms or statistical tools, and finally visualize the results to gain insight, if you did it right and you are lucky.
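The stages above (collect, store, analyze, visualize) can be sketched end to end in a few lines. This is only a toy illustration, not anything from the article: the two source formats, the `spend` schema, and the in-memory SQLite store are all assumptions chosen to show how differently shaped inputs get normalized, stored, and summarized.

```python
import sqlite3
import statistics

# --- Stage 1: collect ---------------------------------------------------
# In practice this would be a crawler or log ingester; here we simulate
# records arriving from two sources "with different formats".
def collect():
    source_a = [{"user": "alice", "spend": 12.0}, {"user": "bob", "spend": 30.0}]
    source_b = [("carol", 18.0)]  # same data, different shape
    # Normalize both formats into one schema before storing.
    rows = [(r["user"], r["spend"]) for r in source_a]
    rows += [(u, s) for u, s in source_b]
    return rows

# --- Stage 2: store -----------------------------------------------------
def store(rows):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE spend (user TEXT, amount REAL)")
    db.executemany("INSERT INTO spend VALUES (?, ?)", rows)
    return db

# --- Stage 3: analyze ---------------------------------------------------
def analyze(db):
    amounts = [a for (a,) in db.execute("SELECT amount FROM spend")]
    return {"n": len(amounts), "mean": statistics.mean(amounts)}

# --- Stage 4: "visualize" (a text report stands in for a real chart) ----
def report(summary):
    return f"{summary['n']} users, mean spend {summary['mean']:.2f}"

summary = analyze(store(collect()))
print(report(summary))  # 3 users, mean spend 20.00
```

The point is the shape of the pipeline, not the tools: swap the crawler for Kafka, SQLite for Hadoop, and the text report for a dashboard, and the steps stay the same.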

Make your Big Data "fit-for-purpose." Data analysis is not Hadoop; it is much more than the simple technical tools. Data analysis is a process: connect to databases or a data provider, or build the database yourself with a crawler, then analyze and visualize the data. Data storage is only one part of the whole Big Data business logic. Too many people seem to think that if you aren't using Hadoop you are not doing Big Data; many ask whether Big Data = Hadoop and vice versa. That is why the emphasis has to be on "fit-for-purpose." Big Data is data collection (crawling, for example) + data storage + data analysis and visualization. On the analysis side, an RDBMS can serve as the data warehouse after ETL over distributed data sources; data analysis, as the final stage, is about the dimensions of the data, not just its amount. Selecting enough samples smooths out random errors so that the result is stable.

There are pros and cons to the various approaches to Big Data, and Big Data is "bigger than the size of your database." Solving a problem and optimizing an algorithm often do require collecting a huge amount of data: the more data you collect, the better the results you get. However, reality rarely gives you all the data you want, so there will be times when you need to be selective and choose the best proxy available. Also, remember that users would rather have a prediction that is 80% right now than wait for a perfect one. The criteria for selecting among platforms and tools are speed and scalability, although it is sometimes difficult to make a real apples-to-apples comparison.

The five Vs of Big Data (volume, velocity, variety, veracity, and value) are its defining characteristics. Still, they are a means to an end, achieving business value, not the end itself. The goal of getting Big Data right is to ensure that data, as the lifeblood of modern business, can nourish the whole body of the business and keep it fit, energetic, and resilient enough to adapt to the accelerating pace of digital dynamics.

