Hadoop is a computing environment built on top of a distributed, clustered file system designed specifically for large-scale data. Hadoop's approach is to distribute the data across a collection of commonly available servers, each with its own inexpensive disk drives. The core idea behind "big data" is that a dataset can be too large for any single master machine, so the single-machine approach fails; instead, the work is distributed across thousands of low-cost machines. With Hadoop, redundancy is built into the environment: data is stored in multiple places across the cluster. Not only is the data replicated across the cluster, but the programming model assumes that failures will happen and resolves them by re-running portions of the program on different servers in the cluster.
Hadoop has two main parts:
- HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput access to that data. The design of HDFS is based on GFS (the Google File System). Files are stored redundantly across multiple machines to ensure durability against failure and high availability to highly parallel applications. There is one NameNode and multiple DataNodes. As Facebook's Under the Hood post explains: "HDFS clients perform filesystem metadata operations through a single server known as the Namenode, and send and retrieve filesystem data by communicating with a pool of Datanodes. Data is replicated on multiple datanodes, so the loss of a single Datanode should never be fatal to the cluster or cause data loss." See also IBM's Big Data Analytics HDFS material and Facebook's Realtime Hadoop paper.
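To make the replication idea concrete, here is a toy sketch of how a file might be split into fixed-size blocks, each copied to several DataNodes. This is an illustrative simplification, not HDFS's actual placement policy (real HDFS uses rack-aware placement, and the block size and DataNode names below are made up for the example):

```python
import itertools

def place_blocks(file_size_mb, block_size_mb=128, replication=3,
                 datanodes=("dn1", "dn2", "dn3", "dn4")):
    """Split a file into fixed-size blocks and assign each block
    to `replication` distinct DataNodes (simple round-robin placement)."""
    n_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    nodes = itertools.cycle(datanodes)
    placement = {}
    for block_id in range(n_blocks):
        # Each block lives on `replication` different DataNodes,
        # so losing one DataNode never loses the block.
        placement[block_id] = [next(nodes) for _ in range(replication)]
    return placement

# A 300 MB file with 128 MB blocks needs 3 blocks, each stored on 3 DataNodes.
print(place_blocks(300))
```

The point of the sketch is the invariant, not the policy: every block exists on multiple machines, so the NameNode can route reads around a dead DataNode and re-replicate its blocks from the surviving copies.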
- MapReduce, the programming model. It is a programming paradigm that allows for massive scalability across the many servers in a Hadoop cluster. MapReduce performs two separate tasks. The first is the map job, which takes a set of data and converts it into another set of data in which individual elements are broken down into key/value pairs. The reduce job takes the output of a map as its input and combines those key/value pairs into a smaller set of key/value pairs.
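The map/reduce pipeline described above can be sketched in plain Python with the classic word-count example. This is a single-process simulation for clarity; in a real Hadoop job the shuffle step is performed by the framework, and the map and reduce functions run in parallel on many servers:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: convert each input line into (word, 1) key/value pairs.
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key (Hadoop does this between map and reduce).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into a smaller set of key/value pairs.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle(map_phase(lines))))
```

Because every (key, value) pair is independent, the map phase can run on whichever server holds each block of input, and the reduce phase only needs the grouped values for its keys, which is what makes the model scale across a cluster.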