“The Wikipedia article on Big Data says it requires exceptional technologies to efficiently process large quantities of data within tolerable elapsed times.”
Big data is making us think of ways to harness the excessive amount of unstructured data that is generated on a daily basis, so it is no surprise that we have seen the introduction of many new Big Data technologies.
It was Carlo Strozzi who coined the term NoSQL (“Not only SQL”) in 1998, referring to a lightweight database that did not expose a SQL interface. NoSQL databases provide near-infinite scalability, fault tolerance, high availability, and a design-friendly lack of schema. For example, Oracle has a NoSQL offering, and Globals is an open-source NoSQL database that supports a Java API.
My first experience with an unstructured, database-like store was with Google App Engine, which uses BigTable-like concepts (the App Engine datastore and BigTable are not the same thing: the datastore is built on top of the lower-level BigTable and adds extra capabilities). BigTable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
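A toy Python sketch of that data model helps make it concrete: a sparse map from (row key, column key, timestamp) to raw bytes. This is purely illustrative; the class and keys are invented, and a real BigTable is distributed and persistent.

```python
import bisect

class TinyBigTable:
    """Toy sketch of BigTable's data model: a sparse, sorted map from
    (row key, column key, timestamp) to an uninterpreted byte string."""

    def __init__(self):
        self._cells = {}   # (row, column, timestamp) -> bytes
        self._keys = []    # sorted list of (row, column, timestamp)

    def put(self, row, column, timestamp, value):
        key = (row, column, timestamp)
        if key not in self._cells:
            bisect.insort(self._keys, key)  # keep the map sorted by key
        self._cells[key] = value            # value stays raw, uninterpreted bytes

    def get(self, row, column):
        """Return the newest value for (row, column), or None if the cell is empty."""
        versions = [k for k in self._keys if k[0] == row and k[1] == column]
        if not versions:
            return None
        return self._cells[max(versions, key=lambda k: k[2])]

t = TinyBigTable()
t.put("com.example/index", "contents:html", 1, b"<html>v1</html>")
t.put("com.example/index", "contents:html", 2, b"<html>v2</html>")
print(t.get("com.example/index", "contents:html"))  # newest timestamp wins
```

Note how sparseness falls out for free: cells that were never written simply do not exist in the map.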
Likewise, “The guts of Google are where concepts such as key-value pairs and MapReduce have been brought to the everyday user, albeit transparently. To finish the thought, NoSQL is a database-like storage engine for key-value pairs and Hadoop is an open-source implementation of MapReduce, among other things. Together, they enable mountains and mountains of data to be used purposefully.”
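As a rough illustration of the MapReduce idea (a toy, single-process Python sketch, not Hadoop's actual API): map emits key-value pairs, the framework shuffles them by key, and reduce collapses each key's values.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # map: emit a (key, value) pair for every word in the document
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # reduce: collapse all values for one key into a single result
    return key, sum(values)

documents = ["big data big ideas", "data data everywhere"]
pairs = list(chain.from_iterable(map_phase(d) for d in documents))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 3, 'ideas': 1, 'everywhere': 1}
```

The point of the real thing is that map and reduce run in parallel across a cluster; the per-key grouping is what makes that parallelism safe.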
The Primer on Big Data defines the four “V’s” of data: volume, velocity, variety, and veracity:
Volume: The sheer amount of data being digitized, maintained, secured, and then used. Knowing the organization’s current needs and having a plan for its growth is fundamental.
Velocity: The speed at which data must be moved, stored, transformed, managed, analyzed or reported on in order to maintain competitiveness. This will vary by organization and application or usage.
Variety: The different types of data, from source (origin) to storage and usage, must be well understood because competitiveness requires access to the right types of data more than ever. From aged flat files to spatial and unstructured data, a plan must be in place.
Veracity: The truthfulness or quality of data can either lead to poor understanding and decisions that belie progress or deliver a powerful jolt of reality that fuels new insight and ideas. Ultimately, data quality may be the most important frontier.
More on Hadoop:
“Traditional databases have columns and structures — name, rank, serial number, date of entry, date of departure,” Kay said. “In a Hadoop cluster, it’s unstructured. You don’t know what the structure is.”
“Hadoop was created by computer scientist Doug Cutting, who developed the platform based on data-indexing research from Google Inc. Cutting, now Cloudera’s chief architect, named the technology after his son’s yellow stuffed-elephant toy, which went on to become the platform’s logo.”
Now, VMware is bringing Hadoop into its SpringSource umbrella.
I have been working with SAS PROC SUMMARY, and now know that PROC MEANS provides almost exactly the same functionality (as the legend goes, two different people designed procs that did the same thing). The documentation for either shows the same syntax. The core difference is that, by default, PROC MEANS sends its results to the Output window, while PROC SUMMARY, by default, creates a SAS data set.
CLASS statement: CLASS variable(s) </ option(s)>;
– specifies the variables whose values define the subgroup combinations for the analysis
ID statement: ID variable(s);
– includes additional variables in the output data set
OUTPUT statement: OUTPUT OUT=SAS-data-set <statistic(s)>;
- statistic(s): the statistical analyses PROC MEANS will generate
- _FREQ_ is the count of the number of observations available for use
- _TYPE_ is a numeric flag which indicates the subgroup of the CLASS variables summarized by that observation in the output data set.
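To make the _TYPE_ and _FREQ_ semantics concrete, here is a toy Python sketch (not SAS; the data is invented): with CLASS variables region and sex, _TYPE_ runs from 0 (the overall summary) to 3 (both variables), treated as a bit mask with the leftmost CLASS variable as the high-order bit.

```python
def proc_summary(rows, class_vars, analysis_var):
    """Toy sketch of PROC SUMMARY output: one record per subgroup
    combination, with _TYPE_ as a bit mask over the CLASS variables
    (leftmost CLASS variable = most significant bit) and _FREQ_ as
    the count of observations used."""
    out = []
    k = len(class_vars)
    for _type in range(2 ** k):
        # which CLASS variables participate at this _TYPE_ level
        active = [v for i, v in enumerate(class_vars)
                  if _type & (1 << (k - 1 - i))]
        groups = {}
        for row in rows:
            key = tuple(row[v] for v in active)
            groups.setdefault(key, []).append(row[analysis_var])
        for key, vals in sorted(groups.items()):
            rec = dict(zip(active, key))
            rec.update(_TYPE_=_type, _FREQ_=len(vals), mean=sum(vals) / len(vals))
            out.append(rec)
    return out

rows = [{"region": "East", "sex": "F", "sales": 10},
        {"region": "East", "sex": "M", "sales": 20},
        {"region": "West", "sex": "F", "sales": 30}]
for rec in proc_summary(rows, ["region", "sex"], "sales"):
    print(rec)
```

Here _TYPE_=0 is the grand summary (_FREQ_=3), _TYPE_=1 summarizes by sex alone, _TYPE_=2 by region alone, and _TYPE_=3 by every region-by-sex combination, mirroring what the output data set looks like without an NWAY option.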
I am using the SAS Enterprise Guide and trying out a few things.
My samples directory is C:\Program Files\SASHome\x86\SASEnterpriseGuide\4.3\Sample\Data
- Tools -> SAS Enterprise Guide Explorer
- Tools -> Update Library Metadata
- File -> Open ->
- Right click on SAS program and click create stored process
We created a new SAS library (which is just a name that points to a directory on the system), but the SAS BI Dashboard was not recognizing it. It turns out you have to pre-assign the library. We had to reset the tool for the SAS BI Dashboard to take notice; simply checking the library was not enough.
A SAS stored process is a SAS program that is hosted on a server and described by metadata.
Because a stored process is basically a SAS program, it can access any SAS data source or external file as input
and can create multiple types of output, such as new data sets, files, and report output in a variety of formats.
The code is not embedded in client applications; it is stored on a server.
SAS stored processes can be hosted by either the SAS Stored Process Server or the SAS Workspace Server.
More on Stored Processes… http://www.bi-notes.com/2012/05/stored-process-edit-modify-change/
An information map bridges the gap between:
- the physical data warehouse, where the information architect sees a physical data source that might be an array of interconnected tables and columns
- the business user, who views or builds reports from the data and sees only a simple list of business terms
An information map:
- gives business users access to the most current data that is needed for business reporting
- hides the details of the physical data view from business users
Information maps are user-friendly metadata definitions of physical data sources.
An information map contains two basic elements:
A data item (a table column, an OLAP hierarchy, or an OLAP measure):
- is used for building queries and can represent either physical data or a calculation
- is usually customized in order to present the physical data in a form that is relevant and meaningful to a business user
A filter is criteria that subset the data (i.e., a WHERE clause).
SAS Information Maps contain metadata to describe the following elements:
Metadata about the data sources: An information map can be based on SAS data sets, SAS OLAP cubes, or a third-party database such as Oracle, Teradata, DB2, or Microsoft Excel.
Metadata about relationships: Multiple relational data tables can be combined or joined to enable optimized queries, regardless of the data source.
Metadata about the appearance and usage of data items: You can control the display of data items through labels and formats, and also control the usage of the data items. For example, you can decide that a certain data item should not be used in a sort or to compute statistics.
Metadata about business rules : Standard calculations and filters can be predefined so that business users do not need to re-create them every time that they are needed.
Metadata Repository pane on the left: displays the information maps
Presentation tab: physical data sources, data items, and filters
Physical Data pane on the left: shows physical data sources
Information Map pane: the details panel displays the information map's data items and filters
Relationships tab: tables and their relationships to other tables
You cannot use both a table and a cube in the same information map
By inserting a data source for your information map, you have made the data available to the information map.
Now, you must create data items in order to include them in the information map.
A business user sees data from the data source only if you create a data item to represent it.
- A data item can be a logical view of a field in the physical data.
- A data item can be calculated from an expression.
To create data items, you use the Presentation tab of the main SAS Information Map Studio window.
To create a data item that is a logical view of the physical data field, you select an item in the Physical Data pane, and you use the Insert button to create a corresponding data item in the Information Map pane.
Data items are listed in the Information Map pane with an icon indicating their classification. There are two classes of data items:
category: a distinct value that is used to group or summarize measure data items
measure: a value that is measured and can be used in expressions
Each data item has metadata to describe its properties
You can view and edit a data item’s properties in the Data Item Properties window: right-click the data item in the Information Map pane of the Presentation tab and select Properties.
Filters : A filter can also be based on a physical data column or on an expression. The expression that you use in a filter can reference data items, physical data columns, or both.
category data item:
- query code uses a WHERE clause
- the clause is evaluated for individual records prior to any aggregations
measure data item:
- query code uses a HAVING clause
- the clause is evaluated on summarized information after aggregating the data
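That WHERE-versus-HAVING distinction maps directly to SQL. A minimal sketch using Python's built-in sqlite3 module (the orders table and its values are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("East", 100), ("East", 50), ("West", 300)])

# Filter on a category data item -> WHERE clause: evaluated on
# individual records, before any aggregation happens.
where_q = """SELECT region, SUM(amount) FROM orders
             WHERE region = 'East' GROUP BY region"""

# Filter on a measure data item -> HAVING clause: evaluated on the
# summarized values, after the data has been aggregated.
having_q = """SELECT region, SUM(amount) FROM orders
              GROUP BY region HAVING SUM(amount) > 200"""

print(con.execute(where_q).fetchall())   # [('East', 150.0)]
print(con.execute(having_q).fetchall())  # [('West', 300.0)]
```

The WHERE filter never even looks at the West rows, while the HAVING filter only sees the per-region totals.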
To open the New Filter window, you click the New Filter tool
In the Data item box, you specify the data item, physical column, or expression to which this filter applies.
In the Condition box, you specify the condition that is used to filter the data. For relational filters, the list of available conditions is based on the data item that is selected in the Data item box.
In the Value(s) box, you specify the unformatted values that the condition uses to filter the data items
You click Combinations to display or hide the elements that enable you to create compound (AND/OR) expressions for the filter.
The Edit button on the Definition tab enables you to open the Expression Editor window, where you can specify an expression for the new data item.
Cubes are logical, multidimensional models that consist of the following elements:
- One or more dimensions
- One or more levels
- One or more hierarchies
Dimension – a collection of closely related hierarchies that group data into natural categories (i.e., variables)
Level – a level of detail within a dimension
Each dimension has a top level
Hierarchy – the order of levels in a dimension, based on parent-child relationships
Member – an individual value within a level
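These pieces can be illustrated with a toy Time dimension in Python (the level names and members are invented; a real OLAP engine stores this as cube metadata):

```python
# Toy Time dimension: the hierarchy orders levels from the top level
# down, and parent-child relationships link members across levels.
time_dimension = {
    "name": "Time",
    "hierarchy": ["Year", "Quarter", "Month"],   # top level first
    "members": {
        "Year":    ["2023"],
        "Quarter": ["2023-Q1", "2023-Q2"],
        "Month":   ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    },
    "parent": {"2023-Q1": "2023", "2023-Q2": "2023",
               "Jan": "2023-Q1", "Feb": "2023-Q1", "Mar": "2023-Q1",
               "Apr": "2023-Q2", "May": "2023-Q2", "Jun": "2023-Q2"},
}

def rollup(member, dim):
    """Walk a member up the parent-child links to the top level."""
    path = [member]
    while path[-1] in dim["parent"]:
        path.append(dim["parent"][path[-1]])
    return path

print(rollup("Feb", time_dimension))  # ['Feb', '2023-Q1', '2023']
```

Drilling down is just walking the same links in the other direction, which is exactly what a cube viewer does when you expand a level.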
Star Schema – The set of tables includes a single fact table and one or more dimension tables. “It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of the fact table and the points of the star are the dimension tables.”
Fact Table – “The fact tables contain fields for the individual facts as well as foreign key fields relating the facts to the dimension tables.”
Aggregate Tables – “are special fact tables in a data warehouse that contain new metrics derived from one or more aggregate functions (AVERAGE, COUNT, MIN, MAX, etc.) or from other specialized functions that output totals derived from a grouping of the base data.”
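A minimal star schema can be sketched with Python's built-in sqlite3 module (the table names and data are invented for illustration): a fact table of measures with foreign keys pointing at a dimension table, plus an aggregate-style query over the base facts.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- dimension table: descriptive attributes, one row per product
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    -- fact table: a measure plus a foreign key into the dimension
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product, amount REAL);
""")
con.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "widget", "hardware"), (2, "manual", "books")])
con.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                [(1, 9.5), (1, 4.5), (2, 20.0)])

# The kind of result an aggregate table would pre-store: totals
# derived from grouping the base facts by a dimension attribute.
agg = con.execute("""
    SELECT p.category, COUNT(*), SUM(f.amount)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.category ORDER BY p.category
""").fetchall()
print(agg)  # [('books', 1, 20.0), ('hardware', 2, 14.0)]
```

In a warehouse you would materialize that GROUP BY result as its own aggregate table so reports do not rescan the full fact table.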