by Patricia Ames | 7/14/15
It took a Song of the Day club and over 5 million Instagram followers for me to meet “Big Data.” I’d like you to meet him too. The video comes from Viacom Velocity, an integrated marketing firm that created its own ad to demonstrate how it uses big data to understand and solve its clients’ problems.
The opening lines come from a character called “Hadoop” (played by singer and dancer Todrick Hall) and aptly describe the average office environment: “I know you’re struggling – I can feel your pain.” What business these days doesn’t feel the pain of the information explosion erupting from every portal?
I thought it would be fun to use this crazy video to walk us through some of the complex terminology that makes up the landscape we commonly refer to as “Big Data.”
First of all, what is Hadoop? Apache Hadoop is an open source software library that provides a framework for the distributed processing of large data sets across clusters of servers. It is scalable and highly fault-tolerant, and it is designed to run on anything from a single server to thousands.
Open source software makes its source code publicly available to anyone for modification and enhancement. It is typically collectively designed by many people and can be freely used, changed, and shared by anyone.
Open source is important for several reasons:
· Robust – there are many people working on it, optimizing it – open source code is typically fixed, updated and upgraded rapidly because there are so many authors
· Secure – with the source code open, it tends to be thoroughly tested and any holes are quickly patched
· Flexible – it can be tweaked for uses other than the original intent
MapReduce is a way of processing big data. It breaks the work down into two parts: a mapping job that filters and sorts the data, and then a reducing job that summarizes or counts it. This system enlists clusters of servers to run the various tasks in parallel. MapReduce is disk-based and Hadoop supports it.
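To make the map-then-reduce idea concrete, here is a minimal sketch in plain Python that counts words. It runs on one machine and does not use Hadoop at all. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample sentences are made up for illustration; a real framework would spread these steps across many servers.

```python
from collections import defaultdict

def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group the emitted values by key, which a framework does between the two steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    """Reduce step: summarize all values for one key, here by adding them up."""
    return word, sum(counts)

def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())

# Hypothetical sample input, not taken from the video.
result = word_count(["big data knows", "big data is big"])
print(result)
```

Each line can be mapped independently of the others, and each word can be reduced independently too, which is what lets a cluster run the tasks in parallel.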
Spark is a different way of processing big data. It works in-memory, so it can run many times faster than disk-based MapReduce. Hadoop also supports Spark.
NoSQL or “Not only SQL” is a term that became prevalent with the rise of Facebook and Google and the storage needs of Web 2.0 companies. This class of databases is designed for more flexible data models, horizontal scaling and, for some workloads, faster processing than SQL databases, because it stores and retrieves data differently from relational databases. NoSQL databases are increasingly used in big data and real-time web applications.
Pig (NOT the farm animal): Pig is a platform for analyzing large data sets. Once your big data is crunched and queried, you will need to analyze and evaluate the results in a program like Pig. Pig, also by Apache, is open source and is known for its ease of programming, its ability to automatically optimize tasks, and its extensibility: users can create their own functions to do special-purpose processing.
Which leads me to the ZooKeeper – ZooKeeper is an open source project that provides a centralized infrastructure and services that enable synchronization across a cluster. Managing a database cluster with just a few servers is already a challenge, so imagine the task with the large clusters required for big data processing. ZooKeeper provides centralized management of the entire cluster, including configuration management, naming, synchronization and group services, among other things.
Petabytes: A petabyte is 1,000 terabytes (TB) or 1 million gigabytes (GB). The prefix “peta” indicates the fifth power of 1,000, or 10^15, and therefore 1 petabyte is one quadrillion (short scale) bytes, or 1 billiard (long scale) bytes. The unit symbol for the petabyte is PB.
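If you want to check that arithmetic yourself, a few lines of Python will do it. This is just a sketch using decimal (SI) units, where each step up is a factor of 1,000; the helper name `convert_petabytes` is made up for illustration.

```python
# Decimal (SI) byte units: each step up is a factor of 1,000.
GB = 1000 ** 3
TB = 1000 ** 4
PB = 1000 ** 5

def convert_petabytes(petabytes, unit_bytes):
    """Return how many of the given unit make up the given number of petabytes."""
    return petabytes * PB // unit_bytes

print(convert_petabytes(1, TB))  # terabytes in one petabyte
print(convert_petabytes(1, GB))  # gigabytes in one petabyte
print(PB == 10 ** 15)            # "peta" is the fifth power of 1,000
```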
So now we know how “Big Data” knows so much. As “Hadoop” so eloquently put it – “Big data is not only in the … house – he knows everything about the … house.”
There are some great messages in the video. They ask if you know what type of data you are collecting — “Is your data tiny or large? Do you know who is buying what and what you should charge? Does your data really know? Or is it confused?”
They also ask what you are doing with your data once you collect it: “Do you find yourself selling beer to babies, diapers to teens? Do you barely even know what your customer needs? Maybe you should check what kind of data you’re using and whether it illuminates what everybody is choosing.”
And that’s the priceless key to harnessing the power of your data – getting the right tools to store, archive, index and mine that information.
Peace out – I have more stuff to learn.
Want to know more about the man behind “Big Data”? So did I. Check out his interview with Katie Couric: https://www.youtube.com/watch?v=fUKBvNgqoKc
Also check out our Workflow magazine — we write about big data a lot: www.WorkflowOTG.com
Patricia Ames is an analyst at BPO Media, which publishes The Imaging Channel and Workflow magazines. Ames has lived and worked in the United States, Southeast Asia and Europe and enjoys being a part of a global industry and community. Follow her on Twitter at @OTGPublisher or contact her by email at patricia@BPOMedia.com.