Big Data - Step by Step

Big Data Step by Step.Part 1

Getting started with Big Data can be a challenge. There is a lot of information out there, which makes it easy to be overloaded with things to do and learn. There is no lack of articles on the opportunities and great potential of Big Data technologies. The aim of this upcoming series of articles is to provide a step-by-step introduction to the technical and analytical side of the subject, with easy to reproduce examples and instructions. A hands-on introduction for the upcoming Big Data Engineer or Data Scientist.

Some of the subjects of this blog series will lean more towards the job description of Big Data Engineers. Others will speak more to Data Scientists. I will mention the target audience at the start of every article, although the Engineering articles may well be interesting to Scientists and vice versa.

This first article introduces the hardware and software that will be used throughout the series. It is mainly a Big Data Engineer subject, but as mentioned, possibly useful as background to Data Scientists too.

Arthur Vogels

11 april 2016

Plaats reactie

Parts in this series:

Big Data Step by Step. Part 1: Getting Started
Big Data Step by Step. Part 2: Setting up the Cluster
Big Data Step by Step. Part 3: Experiments with MapReduce
Big Data Step by Step. Part 4: Experiments with HBase

The Hadoop ecosystem
The Apache Hadoop family is by far the most widely used group of Big Data products today. For this reason it is the technology of choice for this series. It is a large family of Big Data related applications and components, developed mainly as open source projects under the Apache flag. There are too many components to describe them all in this article, but here are some of the main ones.

HDFS: This is the Hadoop Distributed File System. As implied by the name, this is a file system used to store data for use by other ecosystem components. It is not a file system like NTFS or Ext4 that resides directly on block devices, but rather sits on top of these “normal” filesystems as an additional layer. It is distributed over the nodes of the cluster. Data is chopped into chunks that can be replicated over several nodes to mitigate data loss from disk failure. This RAID-like redundancy mechanism allows for HDFS to work reliably with cheap consumer-grade storage devices. Furthermore, Hadoop is built with the idea to bring processing to the data, as opposed to bringing the data to the processing (like traditional databases do). Chopping data into chunks and distributing these chunks over several nodes enables multiple nodes to work on parts of the same file simultaneously, thus speeding up processing.
MapReduce: MapReduce is the first and best known processing framework in the Hadoop ecosystem. It allows for distributed processing of large data sets from HDFS, by using Mappers to create Key-Value pairs, sorting these pairs by key and using Reducers to process the bags of sorted Key-Value pairs into an end result.
The best known example of a MapReduce job is the wordcount. The Mapper is used to cut lines from an input text file into separate words and to emit Key-Value pairs, where the word itself is the key and a ‘1’ is the value (indicating 1 occurence of the word). The native MapReduce sorting mechanism ensures that all pairs with the same key end up in the same Reducer. The Reducer then increments a counter for every occurrence of the word, resulting in a list of words with the amount of occurences of that word in the input file.
The power of MapReduce is very fast parallel processing of huge data files. Its weakness is the fact that, even though processing is parallel, file access is very much sequential and batch-oriented. This means that every time the job is executed, it will run through the entire input file. There is no random access to data, which makes MapReduce processing an extremely high latency business.
MapReduce jobs are usually written in Java. A future article in this series will demonstrate the structure and some examples.
HBase: HBase is the column store database of the Hadoop family. As mentioned, MapReduce is sequential in nature and will not allow random access. This means that finding something specific in a big data file will introduce enormous latency, which is the time to read and process the entire file. While this may be fine for batch jobs that run overnight, it is not acceptable in situations where instant access is desired. HBase uses HDFS to store large amounts of data, but makes it randomly accessible by storing data as columns of Key-Value pairs. By using smart hashing algorithms for calculating the key, it is possible to store and search very sparse data sets (data sets with large “gaps”) in an efficient way. HBase is not a database in the classic sense. It does not provide full ACID (Atomicity, Consistency, Isolation, Durability) in the way a relational database does. More on this subject can be found here. The main purpose of HBase is the ability to have fast random access to massive data sets.
HBase can be accessed from Java code. A future article in this series will demonstrate some examples.
Spark: Spark can be seen as the successor and an improvement of MapReduce. Like MapReduce, it is a distributed processing framework, built to work with distributed data on HDFS. The difference is found in how the processing is done. Where MapReduce is high latency disk based and works with intermediate results on HDFS, Spark will do most work in memory and will try to execute operations only once the results are needed. In general it can be used for the same tasks as MapReduce, but will execute them faster. It is expected to replace MapReduce completely in time.
Spark jobs can be written in Python, Java and Scala, the latter being the natural choice, since Spark itself is written in Scala. A future article in this series will demonstrate some examples.
Hue: Created as the “Hadoop User Experience”, Hue is the web application that moved command line tasks like managing HDFS and moving data files into and out of HDFS to the web browser. It also provides functionality to browse HBase tables and manage job orchestration via Apache Oozie. It just makes life easier.

Picking a distribution
Almost all Hadoop related software is developed in open source projects under the Apache flag. The source code can be taken straight from these projects, compiled and run on any compatible system. However, due to the complexity of the ecosystem this is not an easy choice. Development of components runs at a fast pace, new versions come in rapid succession and not all versions of all components play nice with each other. Mixing and matching to get all the components and versions right is a big task that needs to be repeated on every upgrade. For this reason, Apache started the BigTop project. BigTop delivers a nicely matched set of Hadoop ecosystem components, guaranteed to work together. In general, it still requires users to compile the product for themselves, although for some operating systems repositories with pre-packaged versions of BigTop are available.

In corporate environments, running straight from open source is often seen as cumbersome and usually even considered a risk. This is where commercial Hadoop distributions like HortonWorks and Cloudera find their markets. In essence, these products are similar to Apache BigTop, but with some additional user interfaces, support contracts, certification schemes and the like. A fitting analogy would be the Red Hat distribution of GNU/Linux, offering a supported distribution of an otherwise freely available open source project.

Since all three distributions are built on mostly the same codebase, BigTop, HortonWorks and Cloudera are very similar in structure and use. For this series of articles we will go with Cloudera, due to its very accessible installation, web interface and integration with the Oracle Big Data Discovery product, which will be used in articles to come. However, do not let this article limit your experimentation with other distributions.

Cluster setup
Hadoop environments can be run in various distribution modes.

First, there is the standalone mode. This setup is ideal for small scale education and experimentation, mostly in self-contained virtual machines that can be made available and shared easily. Everything runs in a single Java Virtual Machine (JVM) process and local filesystem is used for data storage instead of the distributed HDFS file system.

The second option is pseudo-distributed mode. This is logically the same setup as a fully distributed cluster, only all daemons and components run on the same host. It is useful for development and testing, where logical similarity to production environments is important, but with limited hardware resources available.

Finally, fully distributed mode is what makes Hadoop a very powerful platform. By adding more hosts, processing power will scale almost linearly. Twice the amount of (similar) machines will yield nearly twice the processing speed in many usage scenarios. Components run on separate machines, taking full advantage of all processing power available. Storage of data is done with the HDFS distributed file system.

For obvious reasons, fully distributed is the mode most prevalent in production scenarios. The purpose of our experimentation is to gain experience, to better be able to handle practical tasks later. Therefore we choose fully distributed mode as the basis for our experiments.

Hardware
Hadoop was created for use on clusters of cheap and readily available commodity hardware. We can use this fact to our advantage. A number of unused office computers with roughly 8GB of RAM and a quad core CPU each, or even a stack of no longer used laptops are enough to build a cluster. It will probably not be the most powerful or energy efficient solution, but for the purposes of this experiment it serves well. Sizing of production environments naturally requires some more consideration and some beefier hardware.

It is important to note that Hadoop and HDFS were built to cope with disk and node failures without disrupting operations on the cluster. On worker nodes it is not needed to use enterprise class harddisks. Also RAID controllers and other enterprise class storage systems are unnecessary. Simple and inexpensive consumer class SATA drives suffice. The choice of SSD drives versus normal mechanical harddisks depends on the amount of random access in the typical workload. MapReduce jobs are very sequential by nature and will rarely benefit from SSD storage. However, Key-Value databases like HBase could generate a lot of random I/O, in which case SSD storage is helpful. Consider your usage scenarios carefully before spending the additional funds, since price per GB is still considerably higher for SSD than for HDD.

Master nodes like the HDFS namenode are single points of failure. As opposed to slave nodes, master nodes will benefit from enterprise class RAID controllers and reliable enterprise harddisks in critical environments.

Never use network- or shared storage for the nodes of your cluster. Using shared storage (like a SAN) will nullify many of the benefits of Hadoop. Since Hadoop speeds things up by bringing the processing to the data (in the worker nodes), moving the data away from the processing into a SAN is counterproductive and on top of that a lot more expensive. Furthermore, sharing disks between nodes will not only share disk performance but also load disks with much higher levels of random access, which is I/O expensive.

For memory the choice is easy. More is better, add as much as you can afford. If it is not needed for the daemons, Hadoop will use it for things like disk caching to speed up performance.

The choice of CPU is less straightforward. Faster is not always better. Since Hadoop scales out instead of up, the main consideration for CPU should be performance per watt of energy consumption. It is very well possible that a cluster of 50 power efficient machines will be cheaper to operate in the long run than a cluster of just 25 high powered machines with the same total performance.

Hardware sizing and planning is well described in the HortonWorks cluster planning guide.

For the experimentation cluster in this series of articles, we will assume a quad core machine with 16GB of RAM as a master machine and three quad core 8GB machines as processing nodes, each equipped with a 2TB SATA mechanical harddisk. Of course all the steps described can also be performed on more powerful hardware. The main consideration for our experimentation cluster is to have enough RAM available to run all required components simultaneously.

Part 2 of this series will provide a detailed step-by-step instruction for installing a Hadoop cluster on our hardware. Stay tuned!

[ps2id id=’commentaren’ target=”/]