A Techno Blog, mainly about Java

Learning Big Data

From the post on Crowdsourcing:

The Wikipedia article on Big Data says it requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times.

Big data is making us think of ways to harness the enormous amount of unstructured data that is generated on a daily basis. It is no surprise, then, that we have seen the introduction of many new Big Data technologies.

Carlo Strozzi coined the term NoSQL in 1998 as the name of his lightweight database, which did not expose a SQL interface; the term was later reinterpreted as “Not only SQL”. NoSQL databases promise horizontal scalability, fault tolerance, high availability, and a design-friendly lack of schema. For example, Oracle has a NoSQL offering, and Globals is an open source NoSQL database that supports a Java API.

My first experience with an unstructured-style database was with Google App Engine, which uses Bigtable-like concepts. (The App Engine datastore and Bigtable are not the same thing: the datastore is built on top of the lower-level Bigtable and adds extra capabilities.) Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
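To make that definition concrete, here is a minimal in-memory sketch of such a map in plain Python. It is only a toy model of the data model described above, not Bigtable's or App Engine's actual API; the class name `SparseSortedMap`, its methods, and the sample row and column keys are all made up for illustration.

```python
from typing import Dict, Iterator, Optional, Tuple

# A cell is addressed by (row key, column key, timestamp).
CellKey = Tuple[str, str, int]


class SparseSortedMap:
    """Toy model of a sparse, sorted, multidimensional map (hypothetical)."""

    def __init__(self) -> None:
        # Only cells that were written are stored, which is what makes it sparse.
        self._cells: Dict[CellKey, bytes] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        # Values are uninterpreted bytes; the map never looks inside them.
        self._cells[(row, column, timestamp)] = value

    def get_latest(self, row: str, column: str) -> Optional[bytes]:
        """Return the value with the highest timestamp for (row, column)."""
        versions = [(ts, value) for (r, c, ts), value in self._cells.items()
                    if r == row and c == column]
        if not versions:
            return None
        return max(versions, key=lambda pair: pair[0])[1]

    def scan(self) -> Iterator[Tuple[CellKey, bytes]]:
        """Yield cells in sorted key order (row, then column, then timestamp)."""
        for key in sorted(self._cells):
            yield key, self._cells[key]


table = SparseSortedMap()
table.put("com.example/index", "contents:html", 1, b"<html>v1</html>")
table.put("com.example/index", "contents:html", 2, b"<html>v2</html>")
table.put("com.example/about", "anchor:home", 1, b"About")
```

Keeping the timestamp in the key is what lets several versions of the same cell coexist; `get_latest` simply picks the newest one.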

Likewise, “The guts of Google are where concepts such as key-value pairs and MapReduce have been brought to the everyday user, albeit transparently. To finish the thought, NoSQL is a database-like storage engine for key-value pairs and Hadoop is an open-source implementation of MapReduce, among other things. Together, they enable mountains and mountains of data to be used purposefully.”
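The key-value and MapReduce ideas in that quote can be sketched in a few lines of plain Python. This is not Hadoop's API, just the shape of the technique on a single machine: a map step emits key-value pairs, a shuffle step groups them by key, and a reduce step collapses each group. The function names and the word-count task are my own choices for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) key-value pair for every word in the document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents: List[str]) -> Dict[str, int]:
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "big table"])
```

In a real cluster the map and reduce calls run on many machines at once; the structure of the computation stays the same.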

The Primer on Big Data defines the four “V’s” of data: volume, velocity, variety, and veracity:

Volume: The sheer amount of data being digitized, maintained, secured, and then used. Knowing the organization’s current needs and having a plan for its growth is fundamental.

Velocity: The speed at which data must be moved, stored, transformed, managed, analyzed or reported on in order to maintain competitiveness. This will vary by organization and application or usage.

Variety: The different types of data, from source (origin) to storage and usage, must be well understood because competitiveness requires access to the right types of data more than ever. From aged flat files to spatial and unstructured data, a plan must be in place.

Veracity: The truthfulness or quality of data can either lead to poor understanding and decisions that belie progress or deliver a powerful jolt of reality that fuels new insight and ideas. Ultimately, data quality may be the most important frontier.

I first heard the Hadoop and Cassandra buzzwords bounced around when people talked about the technology behind Google and Facebook.

More on Hadoop:

“Traditional databases have columns and structures — name, rank, serial number, date of entry, date of departure,” Kay said. “In a Hadoop cluster, it’s unstructured. You don’t know what the structure is.”

“Hadoop was created by computer scientist Doug Cutting, who developed the platform based on data-indexing research from Google Inc. Cutting, now Cloudera’s chief architect, named the technology after his son’s yellow stuffed-elephant toy, which went on to become the platform’s logo.”

Now, VMware is bringing Hadoop under its SpringSource umbrella.


Posted June 28, 2012 | Uncategorized

SAS Proc Summary and Means joined at hip

I have been working with SAS PROC SUMMARY, and I now know that PROC MEANS provides almost exactly the same functionality (legend has it that two different people were designing procs that did the same thing). If you go to the documentation for either, it shows the same syntax. The core difference is in the defaults: PROC MEANS sends its results to the Output window, whereas PROC SUMMARY prints nothing and is typically used to create a SAS data set.

Class statement: CLASS variable(s) </ option(s)>;

Specifies the variables whose values define the subgroup combinations for the analysis.

Id statement: ID variable(s);

Includes additional variables in the output data set.

Output statement: OUTPUT OUT=SAS-data-set;

  • statistics: the statistical analyses PROC MEANS will generate
  • _FREQ_ is the count of the number of observations available for use
  • _TYPE_ is a numeric flag which indicates the subgroup of the CLASS variables summarized by that observation in the output data set
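The _TYPE_ and _FREQ_ columns are easier to follow with a small emulation. The sketch below is plain Python, not SAS, and only mimics the shape of the output data set: one row per combination of class values, with _TYPE_ built as a bit flag in which the rightmost class variable owns the lowest bit. The function name `summarize`, the sample data, and the choice of the mean as the only statistic are my own assumptions, and the sketch assumes there are no missing values.

```python
from collections import defaultdict


def summarize(rows, class_vars, analysis_var):
    """Mean of analysis_var for every combination of the class variables."""
    n = len(class_vars)
    groups = defaultdict(list)
    for row in rows:
        for type_flag in range(2 ** n):
            # A set bit means "this class variable defines the subgroup";
            # a cleared bit means the variable is summarized over (None).
            key = tuple(
                row[var] if type_flag & (1 << (n - 1 - i)) else None
                for i, var in enumerate(class_vars)
            )
            groups[(type_flag, key)].append(row[analysis_var])
    # Keys that share a _TYPE_ have None in the same positions, so they sort safely.
    return [
        {"_TYPE_": type_flag, **dict(zip(class_vars, key)),
         "_FREQ_": len(values), "mean": sum(values) / len(values)}
        for (type_flag, key), values in sorted(groups.items())
    ]


rows = [
    {"region": "East", "product": "A", "sales": 10},
    {"region": "East", "product": "B", "sales": 20},
    {"region": "West", "product": "A", "sales": 30},
]
result = summarize(rows, ["region", "product"], "sales")
```

With two class variables, _TYPE_ 0 is the grand total, 1 is by product only, 2 is by region only, and 3 is the full region by product breakdown.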

Here is an example. Nice presentation here.

Posted June 22, 2012 | Uncategorized

SAS BI Tasks

I am using the SAS Enterprise Guide and trying out a few things.

My samples directory is C:\Program Files\SASHome\x86\SASEnterpriseGuide\4.3\Sample\Data

  • Tools -> SAS Enterprise Guide Explorer
  • Tools -> Update Library Metadata
  • File -> Open ->
  • Right-click the SAS program and click Create Stored Process

Posted June 20, 2012 | Uncategorized

More BI Learnings

SAS Snippets

Posted June 17, 2012 | Uncategorized

Pre-assign Library in SAS Management Console (SMC)

We created a new SAS library (which is just a name that points to a directory in the system), but the SAS BI Dashboard was not recognizing it. It turns out you have to pre-assign the library.

We had to reset the tool for SAS BI Dashboard to take notice. Simply checking the library was not good enough.

Posted June 17, 2012 | Uncategorized

Stored Processes

A SAS stored process is a SAS program that is hosted on a server and described by metadata.

Because a stored process is basically a SAS program, it can access any SAS data source or external file as input and can create multiple types of output, such as new data sets, files, and report output in a variety of formats.

The code is not embedded in client applications; it is stored on a server.

SAS stored processes can be hosted by either the SAS Stored Process Server or the SAS Workspace Server.

Any SAS program can be a stored process.

More on Stored Processes…

Posted June 17, 2012 | Uncategorized

SAS data manipulations

SAS Transpose: “So many times we need to take our data and turn it around. One of the reasons that this is done is that it is more efficient to store your data in a vertical format and processing the data is easier in a horizontal format.”

  • Changes the variables (columns) into observations (rows)
  • Eliminates the need to write a complex data step
  • Horizontal data has many variables with few rows. There will be empty cells in the data if there are any missing values
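The vertical versus horizontal distinction is easy to see in a tiny example. The sketch below is plain Python rather than PROC TRANSPOSE, and it goes from the vertical layout (one row per student and test) to the horizontal one (one row per student, one column per test). The function name, the column names, and the sample scores are invented for illustration.

```python
def transpose_long_to_wide(rows, id_var, name_var, value_var):
    """Turn one row per (id, name) pair into one row per id."""
    # Every distinct name becomes a column in the horizontal layout.
    names = sorted({row[name_var] for row in rows})
    wide = {}
    for row in rows:
        # New records start with every column empty (None).
        record = wide.setdefault(
            row[id_var], {id_var: row[id_var], **dict.fromkeys(names)}
        )
        record[row[name_var]] = row[value_var]
    return list(wide.values())


long_rows = [
    {"student": "Ann", "test": "quiz1", "score": 90},
    {"student": "Ann", "test": "quiz2", "score": 85},
    {"student": "Bob", "test": "quiz1", "score": 70},
]
wide_rows = transpose_long_to_wide(long_rows, "student", "test", "score")
```

Bob never took quiz2, so his horizontal row carries an empty cell for it, which is exactly the missing-value effect mentioned in the last bullet.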

SAS Summary/SAS Means: In SAS, you can use the UNIVARIATE, MEANS, or SUMMARY procedure to obtain summary statistics such as the median, skewness, and kurtosis.

By definition, a median is a statistical term identifying a piece of data (number) that divides numerically ordered data into two equal halves. In easier terms, the median is the middle piece of data when those data are placed in numerical order.
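As a quick check of that definition, here is a short Python sketch (not SAS) that sorts the values and picks the middle one, averaging the two middle values when the count is even. The function name `median` is my own; Python's standard `statistics.median` follows the same rule and is shown for comparison.

```python
import statistics


def median(values):
    """Middle value of the sorted data; mean of the two middle values if even."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("median of empty data")
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = median([5, 1, 3])      # sorted: 1, 3, 5
even_case = median([4, 1, 3, 2])  # sorted: 1, 2, 3, 4
same_as_stdlib = even_case == statistics.median([4, 1, 3, 2])
```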

Posted June 15, 2012 | Uncategorized

Play the InfoMap Game

An information map bridges the gap between:

  • the physical data warehouse
  • the business user who views or builds reports from the data

An example:

  • information architects see a physical data source that might be an array of interconnected tables and columns
  • business users see a simple list of business terms

An information map:

  • gives business users access to the most current data that is needed for business reporting
  • hides the details of the physical data view from business users

Information maps are user-friendly metadata definitions of physical data sources.

An information map contains two basic elements:

Data items (a table column, an OLAP hierarchy, or an OLAP measure):

  • are used for building queries and can represent either physical data or a calculation
  • are usually customized in order to present the physical data in a form that is relevant and meaningful to a business user

Filters are criteria that subset the data (e.g. a WHERE clause).

SAS Information Maps contain metadata to describe the following elements:

Metadata about the data sources: An information map can be based on SAS data sets, SAS OLAP cubes, or a third-party database such as Oracle, Teradata, DB2, or Microsoft Excel.

Metadata about relationships: Multiple relational data tables can be combined or joined to enable optimized queries, regardless of the data source.

Metadata about the appearance and usage of data items: This controls the display of data items through labels and formats. It can also control the usage of the data items. For example, you can decide that a certain data item should not be used in a sort or to compute statistics.

Metadata about business rules: Standard calculations and filters can be predefined so that business users do not need to re-create them every time that they are needed.

Metadata Repository pane on left: display of the information maps

Presentation tab: physical data sources, data items, and filters

Physical data on left : shows physical data sources

Information Map: details panel displays the Information Map data items and filters

Relationship Tab: Tables and their relationships to others

You cannot use both a table and a cube in the same information map.

By inserting a data source for your information map, you have made the data available to the information map.

Now, you must create data items in order to include them in the information map.

A business user sees data from the data source only if you create a data item to represent it.

  • A data item can be a logical view of a field in the physical data.
  • A data item can be calculated from an expression.

To create data items, you use the Presentation tab of the main SAS Information Map Studio window.

To create a data item that is a logical view of the physical data field, you select an item in the Physical Data pane, and you use the Insert button to create a corresponding data item in the Information Map pane.

Data items are listed in the Information Map pane with an icon indicating their classification. There are two classes of data items:

Category: a distinct value that is used to group or summarize measure data items.

Measure: a value that is measured and can be used in expressions.

Each data item has metadata to describe its properties.

You can view and edit a data item’s properties in the Data Item Properties window. To open it, right-click the data item in the Information Map pane of the Presentation tab and select Properties.

Filters: A filter can also be based on a physical data column or on an expression. The expression that you use in a filter can reference data items, physical data columns, or both.

Filter on a category data item:

  • query code uses a WHERE clause
  • the clause is evaluated for individual records prior to any aggregations

Filter on a measure data item:

  • query code uses a HAVING clause
  • the clause is evaluated on summarized information after aggregating the data
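The difference between the two kinds of filter comes down to when the condition runs. The sketch below is plain Python, not the query code that SAS Information Map Studio generates: `where` sees individual records before they are summed, and `having` sees only the per-region totals. The function name `total_sales_by_region` and the sample data are hypothetical.

```python
from collections import defaultdict


def total_sales_by_region(rows, where=None, having=None):
    """Sum sales per region, filtering before and/or after aggregation."""
    totals = defaultdict(int)
    for row in rows:
        # WHERE-style filter: applied to each record before aggregating.
        if where is None or where(row):
            totals[row["region"]] += row["sales"]
    # HAVING-style filter: applied to the summarized totals.
    return {region: total for region, total in totals.items()
            if having is None or having(total)}


rows = [
    {"region": "East", "sales": 10},
    {"region": "East", "sales": 20},
    {"region": "West", "sales": 30},
    {"region": "West", "sales": 5},
]
east_only = total_sales_by_region(rows, where=lambda row: row["region"] == "East")
big_totals = total_sales_by_region(rows, having=lambda total: total > 32)
```

Here `east_only` keeps a region by inspecting each record, while `big_totals` can only be decided once the totals (30 for East, 35 for West) exist.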

To open the New Filter window, click the New Filter tool.

In the Data item box, you specify the data item, physical column, or expression to which this filter applies.

In the Condition box, you specify the condition that is used to filter the data. For relational filters, the list of available conditions is based on the data item that is selected in the Data item box.

In the Value(s) box, you specify the unformatted values that the condition uses to filter the data items.

Click Combinations to display or hide the elements that enable you to create compound (AND/OR) expressions for the filter.

The Edit button on the Definition tab enables you to open the Expression Editor window, where you can specify an expression for the new data item.


Posted June 12, 2012 | Uncategorized

OLAP Sandwich

I am learning about OLAP cubes… [concepts] [more]

Cubes are logical, multidimensional models that consist of the following elements:

  • One or more dimensions
  • One or more levels
  • One or more hierarchies
  • Members

Dimension – a collection of closely related hierarchies that group data into natural categories (i.e. variables)

Level – a level of detail within a dimension

Each dimension has a top level.

Hierarchy – the order of levels in a dimension, based on parent-child relationships

Member – an individual value within a level

Star Schema – The set of tables includes a single fact table and one or more dimension tables. “It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of fact table and the points of the star are the dimension tables.”

Fact Table – “The fact tables contain fields for the individual facts as well as foreign key fields relating the facts to the dimension tables.”

Aggregate Tables – “are special fact tables in a data warehouse that contain new metrics derived from one or more aggregate functions (AVERAGE, COUNT, MIN, MAX, etc..) or from other specialized functions that output totals derived from a grouping of the base data.”
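To tie the star schema, fact table, and aggregate table definitions together, here is a small Python sketch using dictionaries and lists in place of real tables. All table names, keys, and figures are invented for the example; a real warehouse would do this in SQL or in an OLAP server.

```python
from collections import defaultdict

# Dimension tables: one entry per member, keyed by a surrogate key.
product_dim = {
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Gadget", "category": "Hardware"},
    3: {"name": "Manual", "category": "Books"},
}
date_dim = {
    10: {"year": 2012, "month": 5},
    11: {"year": 2012, "month": 6},
}

# Fact table: foreign keys into each dimension plus the numeric fact.
sales_fact = [
    {"product_id": 1, "date_id": 10, "amount": 100.0},
    {"product_id": 2, "date_id": 10, "amount": 50.0},
    {"product_id": 1, "date_id": 11, "amount": 200.0},
    {"product_id": 3, "date_id": 11, "amount": 20.0},
]


def build_aggregate(facts):
    """Roll the facts up to (category, year) with a SUM and a COUNT."""
    aggregate = defaultdict(lambda: {"total_amount": 0.0, "sale_count": 0})
    for fact in facts:
        # Follow the foreign keys from the fact row out to the dimensions.
        category = product_dim[fact["product_id"]]["category"]
        year = date_dim[fact["date_id"]]["year"]
        cell = aggregate[(category, year)]
        cell["total_amount"] += fact["amount"]
        cell["sale_count"] += 1
    return dict(aggregate)


aggregate_table = build_aggregate(sales_fact)
```

The aggregate table holds one row per category and year, so a report at that level never has to touch the detailed fact rows.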

Posted June 7, 2012 | Uncategorized