Module1 BDA
Module 1 - Introduction to Big Data Analytics
• Data science is a multidisciplinary science.
• Its main objective is to perform data analysis to generate
knowledge that can be used for decision making.
• Knowledge can be in the form of similar patterns or
predictive planning models.
• A data science application collects data and information from
multiple heterogeneous sources; cleans, integrates,
processes, and analyses the data using various tools; and
presents information and knowledge in various visual
forms.
Big Data Overview
• Data is created constantly, and at an ever-increasing
rate.
• The challenge is to identify meaningful patterns and extract
useful information.
• Examples of organizations that generate such data: credit card
companies, mobile phone companies, LinkedIn, Facebook, etc.
Attributes Defining Big Data Characteristics
• Huge Volume of data
• Complexity of data types and structures
• Speed of new data creation and growth.
Different Sources of Big Data
Structured data:
Data containing a defined data type, format, and structure (for
example, transaction data, online analytical processing [OLAP] data cubes,
traditional RDBMS, CSV files, and even simple spreadsheets)
Semi-structured data:
Textual data files with a discernible pattern that enables parsing
(such as Extensible Markup Language [XML] data files that are
self-describing and defined by an XML schema)
Quasi-structured data:
Textual data with erratic data formats that can
be formatted with effort, tools, and time (for instance,
web clickstream data that may contain inconsistencies in
data values and formats)
Unstructured data:
Data that has no inherent structure, which may
include text documents, PDFs, images, and video.
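The difference between the first three categories can be illustrated with a minimal Python sketch. All sample values below (transaction IDs, the XML order, the clickstream lines) are hypothetical and exist only for illustration; real sources would be far larger and messier.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: fixed columns with a known type per column (hypothetical sample).
csv_text = "txn_id,amount\n1001,25.50\n1002,99.00\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
amounts = [float(row["amount"]) for row in rows]

# Semi-structured: self-describing tags make the record parseable.
xml_text = "<order id='1001'><item>book</item><qty>2</qty></order>"
order = ET.fromstring(xml_text)
order_summary = (order.get("id"), order.findtext("item"), int(order.findtext("qty")))

# Quasi-structured: clickstream lines with inconsistent date formats and
# letter case need pattern matching and cleanup before they are usable.
click_lines = [
    "2020-01-05 10:02:11 GET /products?id=7",
    "05/01/2020 10:02 get /Products?id=7",
]
url_pattern = re.compile(r"\bget\s+(\S+)", re.IGNORECASE)
paths = [url_pattern.search(line).group(1).lower() for line in click_lines]

print(amounts)
print(order_summary)
print(paths)
```

Unstructured data (images, video, free text) has no comparable shortcut; it typically needs dedicated processing before any fields can be extracted.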
Statistical Data Types
• Categorical or qualitative
    • Nominal – only categorized
    • Ordinal – categorized and ranked
• Quantitative – defines the scale of data
    • Discrete – countable
    • Continuous – measurable
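As a rough illustration of why the data type matters, the sketch below applies to each type the kind of operation that is meaningful for it. The records and field names (`blood_group`, `grade`, `siblings`, `height_cm`) are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical student records, one field per data type.
records = [
    {"blood_group": "O", "grade": "B", "siblings": 2, "height_cm": 155.0},
    {"blood_group": "A", "grade": "A", "siblings": 0, "height_cm": 160.5},
    {"blood_group": "O", "grade": "C", "siblings": 1, "height_cm": 179.2},
]

# Nominal: categories without order, so counting is the meaningful operation.
blood_counts = Counter(r["blood_group"] for r in records)

# Ordinal: categories with a rank, so sorting is meaningful.
grade_rank = {"A": 1, "B": 2, "C": 3}
grades_sorted = sorted((r["grade"] for r in records), key=grade_rank.get)

# Discrete: countable whole numbers.
total_siblings = sum(r["siblings"] for r in records)

# Continuous: measured values on a scale, so averaging is meaningful.
mean_height = mean(r["height_cm"] for r in records)

print(blood_counts, grades_sorted, total_siblings, round(mean_height, 1))
```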
Measurement Scale of Data
Population and Sample
State of Practice in Analytics
Data Analytics Lifecycle
• Phase 1—Discovery:
The team assesses the resources available to support the project in
terms of people, technology, time, and data.
• Phase 2—Data preparation:
The team needs to execute extract, load, and transform (ELT).
• Phase 3—Model planning:
The team determines the methods, techniques, and workflow it intends to
follow for the subsequent model building phase.
• Phase 4—Model building:
The team develops datasets for testing, training, and production
purposes. The team builds and executes models based on the work done
in the model planning phase.
• Phase 5—Communicate results:
The team should identify key findings, quantify the business value, and
develop a narrative to summarize and convey findings to stakeholders.
• Phase 6—Operationalize:
The team delivers final reports, briefings, code, and technical documents.
Phase 1: Discovery
• Learning the business domain
• Resources
• Framing the Problem
• Identifying Key Stakeholders
• Interviewing the Analytics Sponsor
• Developing Initial Hypotheses
• Identifying Potential Data Sources
Phase 2: Data Preparation
• Preparing the Analytic Sandbox
• Performing ETLT
• Learning About the Data
• Data Conditioning
• Survey and Visualize
• Common Tools for the Data Preparation Phase
1. Hadoop
2. Alpine Miner
3. OpenRefine
4. Data Wrangler
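Data conditioning can be sketched in a few lines of plain Python. This is only an illustrative sketch, not how any of the tools above work internally: the rows are hypothetical, deliberately dirtied values, and `condition` is a made-up helper name.

```python
# Hypothetical raw rows with stray whitespace, mixed case, and a missing value.
raw_rows = [
    {"id": " S20200001 ", "gender": "f", "height": "155"},
    {"id": "S20200002", "gender": "F", "height": ""},
    {"id": "S20200003", "gender": " M", "height": "179"},
]

def condition(row):
    """Return a cleaned copy of row, or None if any value is missing."""
    cleaned = {key: value.strip() for key, value in row.items()}
    if not all(cleaned.values()):
        return None  # drop rows with empty fields
    cleaned["gender"] = cleaned["gender"].upper()  # normalize category labels
    cleaned["height"] = int(cleaned["height"])     # convert text to a number
    return cleaned

clean_rows = [row for row in map(condition, raw_rows) if row is not None]
print(clean_rows)
```

Dropping incomplete rows is only one possible policy; depending on the project, imputing or flagging missing values may be more appropriate.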
Phase 3: Model Planning
• Data Exploration and Variable Selection
• Model Selection
• Common Tools for the Model Planning Phase
1. R
2. SQL Analysis Services
3. SAS/ACCESS
Phase 4: Model Building
• Common Tools for the Model Building Phase
• Commercial tools
1. SAS Enterprise Miner
2. SPSS Modeler
3. MATLAB
4. Alpine Miner
5. STATISTICA and Mathematica
• Open source tools
1. R and PL/R
2. Python
3. SQL
4. Octave
5. WEKA
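The model building phase relies on separate datasets for training and testing. A minimal sketch of such a split using only the standard library is shown here; the function name `train_test_split`, the 25% test fraction, and the seed are assumptions chosen for illustration, and the tools listed above provide their own equivalents.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(20))  # stand-in for 20 hypothetical records
train, test = train_test_split(data)
print(len(train), len(test))
```

Shuffling before splitting avoids a test set that contains only the last records collected, which could bias the evaluation.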
Basic Methods of Data Analytics
• Descriptive Analysis
• Exploratory Analysis
• Inferential Analysis
• Predictive Analysis
Descriptive Analysis
• Is used to present basic summaries about data
• Example: Summarize the given data

Enrolment Number   Gender   Height
S20200001          F        155
S20200002          F        160
S20200003          M        179

• Descriptive summary of categorical data (columns: Gender, Frequency, Proportion, Percentage)
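A frequency, proportion, and percentage summary for the Gender column can be computed as in the following sketch. It uses only the three sample rows that appear in this example; the full dataset may contain more rows, so the resulting figures are illustrative.

```python
from collections import Counter

# The three sample rows from the example (enrolment number, gender, height).
students = [
    ("S20200001", "F", 155),
    ("S20200002", "F", 160),
    ("S20200003", "M", 179),
]

gender_counts = Counter(gender for _, gender, _ in students)
total = sum(gender_counts.values())

# One summary entry per category: frequency, proportion, percentage.
summary = {
    gender: {
        "frequency": count,
        "proportion": count / total,
        "percentage": 100 * count / total,
    }
    for gender, count in gender_counts.items()
}

for gender, stats in summary.items():
    print(f"{gender}  {stats['frequency']}  "
          f"{stats['proportion']:.2f}  {stats['percentage']:.1f}%")
```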
• Outliers can strongly affect the mean, but the median is largely unaffected.
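The effect of an outlier can be checked with a small sketch. The three heights come from the sample rows in the example; the value 1790 is a hypothetical data-entry error added to act as the outlier.

```python
from statistics import mean, median

heights = [155, 160, 179]
with_outlier = heights + [1790]  # hypothetical typo for 179.0

mean_before, median_before = mean(heights), median(heights)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(f"mean:   {mean_before:.1f} -> {mean_after:.1f}")
print(f"median: {median_before:.1f} -> {median_after:.1f}")
```

In this sketch the mean more than triples, while the median shifts only slightly, which is why the median is often preferred for data with extreme values.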
2. Describing the spread of the data (includes the
following measures):
• Range: Minimum to Maximum
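• Variance:
Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
• Standard deviation
Sample: $s = \sqrt{s^2}$
• 5-Point Summary (minimum, Q1, median, Q3, maximum) and Interquartile Range (IQR = Q3 - Q1)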
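The spread measures listed above can be computed with Python's standard `statistics` module, as in this sketch. The dataset is hypothetical. Note that quartile conventions differ between textbooks and tools; `statistics.quantiles` uses its default "exclusive" method here, so hand-computed quartiles may differ slightly.

```python
from statistics import pvariance, quantiles, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

# Range: minimum to maximum.
data_range = (min(data), max(data))

# Variance: sample version divides by n - 1, population version by N.
sample_var = variance(data)
population_var = pvariance(data)

# Sample standard deviation: square root of the sample variance.
sample_sd = stdev(data)

# Five-point summary and interquartile range.
q1, q2, q3 = quantiles(data, n=4)
five_point = (min(data), q1, q2, q3, max(data))
iqr = q3 - q1

print(data_range, round(sample_var, 3), population_var, round(sample_sd, 3))
print(five_point, iqr)
```

The sample variance is larger than the population variance for the same values because dividing by n - 1 corrects for estimating the mean from the sample itself.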