A Techno Blog, mainly about Java

Learning Big Data

From the post on Crowdsourcing:

The Wikipedia article on Big Data says it requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times.

Big data is making us think of ways to harness the enormous amount of unstructured data that is generated every day. It is no surprise, then, that we have seen the introduction of many new Big Data technologies.

It was Carlo Strozzi who coined the term NoSQL in 1998, referring to a lightweight database that did not expose a SQL interface; the “Not only SQL” reading of the name came later. NoSQL databases typically offer horizontal scalability, fault tolerance, high availability, and a design-friendly lack of a fixed schema. For example, Oracle has a NoSQL offering, and Globals is an open source NoSQL database that supports a Java API.

My first experience with a non-relational database of this kind was with Google App Engine, whose datastore uses Bigtable-like concepts. (The App Engine datastore and Bigtable are not the same thing: the datastore is built on top of the lower-level Bigtable and adds extra capabilities.) Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, a column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
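To make that "multidimensional sorted map" description a bit more concrete, here is a minimal in-memory sketch in Java. It only models the shape of the data (row key, then column key, then timestamp, mapping to raw bytes); it is not distributed or persistent, and it is not Google's API. The class name `SortedCellMap`, its methods, and the sample row and column keys are my own assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model of the Bigtable data shape: (row, column, timestamp) -> bytes. Hypothetical names. */
final class SortedCellMap {

    // row key -> column key -> timestamp (newest first) -> uninterpreted bytes
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> rows =
            new TreeMap<>();

    /** Stores one version of a cell. Only cells that are written take up space (sparse). */
    void put(String row, String column, long timestamp, byte[] value) {
        rows.computeIfAbsent(row, r -> new TreeMap<String, NavigableMap<Long, byte[]>>())
            .computeIfAbsent(column, c -> new TreeMap<Long, byte[]>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value.clone());
    }

    /** Returns the newest version of a cell, or null if the cell was never written. */
    byte[] getLatest(String row, String column) {
        NavigableMap<String, NavigableMap<Long, byte[]>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, byte[]> versions = columns.get(column);
        if (versions == null) {
            return null;
        }
        // Timestamps are sorted in descending order, so the first entry is the newest.
        return versions.firstEntry().getValue().clone();
    }

    /** Because rows are kept sorted by key, a range of rows can be read in order. */
    List<String> scanRowKeys(String startInclusive, String endExclusive) {
        return new ArrayList<>(rows.subMap(startInclusive, true, endExclusive, false).keySet());
    }

    private static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        SortedCellMap table = new SortedCellMap();
        table.put("com.cnn.www", "contents:", 1L, bytes("<html>v1</html>"));
        table.put("com.cnn.www", "contents:", 2L, bytes("<html>v2</html>"));
        table.put("com.bbc.www", "contents:", 1L, bytes("<html>bbc</html>"));

        byte[] latest = table.getLatest("com.cnn.www", "contents:");
        System.out.println(new String(latest, StandardCharsets.UTF_8));
        System.out.println(table.scanRowKeys("com.a", "com.d"));
    }
}
```

The nested `TreeMap`s are just one convenient way to get sorted keys at every level; a real system would spread the rows across many machines and persist them, which this sketch makes no attempt to do.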

Likewise, “The guts of Google are where concepts such as key-value pairs and MapReduce have been brought to the everyday user, albeit transparently. To finish the thought, NoSQL is a database-like storage engine for key-value pairs and Hadoop is an open-source implementation of MapReduce, among other things. Together, they enable mountains and mountains of data to be used purposefully.”
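For anyone who has not seen MapReduce before, the idea in the quote can be sketched in plain Java with the classic word count. This is a single-process toy, assuming nothing about the Hadoop API: the class `WordCountSketch` and its `map`, `shuffle`, `reduce` and `run` methods are names I made up for illustration. Hadoop's contribution is running these same phases across a cluster.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Single-process word count showing the map, shuffle and reduce phases. Hypothetical names. */
final class WordCountSketch {

    /** Map phase: turn one input line into a list of (word, 1) key-value pairs. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Shuffle phase: group every emitted value under its key. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: collapse all values for one key into a single result. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs the three phases in order over the given lines. */
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(emitted).entrySet()) {
            counts.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints {big=2, data=3, is=2}
        System.out.println(run(Arrays.asList("big data is big", "data is data")));
    }
}
```

Because each call to `map` looks at one line only and each call to `reduce` looks at one key only, the work can be split across many machines, which is where a framework like Hadoop comes in.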

The Primer on Big Data defines the four “V’s” of data (volume, velocity, variety and veracity):

Volume: The sheer amount of data being digitized, maintained, secured, and then used. Knowing the organization’s current needs and having a plan for its growth is fundamental.

Velocity: The speed at which data must be moved, stored, transformed, managed, analyzed or reported on in order to maintain competitiveness. This will vary by organization and application or usage.

Variety: The different types of data, from source (origin) to storage and usage, must be well understood because competitiveness requires access to the right types of data more than ever. From aged flat files to spatial and unstructured data, a plan must be in place.

Veracity: The truthfulness or quality of data can either lead to poor understanding and decisions that belie progress or deliver a powerful jolt of reality that fuels new insight and ideas. Ultimately, data quality may be the most important frontier.

I first heard the Hadoop and Cassandra buzzwords bounced around when people talked about the technology behind Google and Facebook.

More on Hadoop:

“Traditional databases have columns and structures: name, rank, serial number, date of entry, date of departure,” Kay said. “In a Hadoop cluster, it’s unstructured. You don’t know what the structure is.”

“Hadoop was created by computer scientist Doug Cutting, who developed the platform based on data-indexing research from Google Inc. Cutting, now Cloudera’s chief architect, named the technology after his son’s yellow stuffed-elephant toy, which went on to become the platform’s logo.”

Now, VMware is bringing Hadoop under its SpringSource umbrella.


June 28, 2012