Module 1 BDA

The document provides an overview of Big Data Analytics, emphasizing the importance of data science in generating knowledge for decision-making through various data types and analytics methods. It outlines the data analytics life cycle, which includes phases such as discovery, data preparation, model planning, model building, communicating results, and operationalizing findings. Additionally, it discusses data preparation techniques, data types, and basic methods of data analytics, including descriptive, exploratory, inferential, and predictive analyses.

ITDX45 - Big Data Analytics
Module 1- Introduction to Big Data Analytics
• Data science is a multidisciplinary field.
• Its main objective is to perform data analysis to generate
knowledge that can be used for decision making.
• Knowledge can take the form of recurring patterns or
predictive planning models.
• A data science application collects data and information from
multiple heterogeneous sources; cleans, integrates,
processes, and analyses the data using various tools; and
presents information and knowledge in various visual forms.
Big Data Overview
• Data is created constantly, and at an ever-increasing
rate.
• Challenge is to identify meaningful patterns and extract
useful information.
• Examples: credit card companies, mobile phone companies,
LinkedIn, Facebook, etc.
Attributes Defining Big Data Characteristics
• Huge Volume of data
• Complexity of data types and structures
• Speed of new data creation and growth.
Different sources of Big data
Structured data:
Data containing a defined data type, format, and structure (that
is, transaction data, online analytical processing [OLAP] data cubes,
traditional RDBMS, CSV files, and even simple spreadsheets)
Semi-structured data:
Textual data files with a discernible pattern that enables parsing
(such as Extensible Markup Language [XML] data files that are
self-describing and defined by an XML schema)
Quasi-structured data:
Textual data with erratic data formats that can
be formatted with effort, tools, and time (for instance,
web clickstream data that may contain inconsistencies in
data values and formats)
Unstructured data:
Data that has no inherent structure, which may
include text documents, PDFs, images, and video.
Statistical Data types
• Categorical or qualitative
• Nominal – only categorized
• Ordinal – categorized and ranked
• Quantitative – define the scale of data
• Discrete - countable
• Continuous - measurable
Measurement Scale of Data
Population and Sample
State of Practice in Analytics
Data Analytics Life Cycle
• Phase 1—Discovery:
The team assesses the resources available to support the project in
terms of people, technology, time, and data.
• Phase 2—Data preparation:
The team needs to execute extract, load, and transform (ELT).
• Phase 3—Model planning:
The team determines the methods, techniques, and workflow it intends to
follow for the subsequent model building phase.
• Phase 4—Model building:
The team develops datasets for testing, training, and production
purposes. The team builds and executes models based on the work done
in the model planning phase.
• Phase 5—Communicate results:
The team should identify key findings, quantify the business value, and
develop a narrative to summarize and convey findings to stakeholders.
• Phase 6—Operationalize:
The team delivers final reports, briefings, code, and technical documents.
Phase 1 : Discovery
• Learning the business domain
• Resources
• Framing the Problem
• Identifying Key Stakeholders
• Interviewing the Analytics Sponsor
• Developing Initial Hypotheses
• Identifying Potential Data Sources
Phase 2: Data Preparation
• Preparing the Analytic Sandbox
• Performing ETLT
• Learning About the Data
• Data Conditioning
• Survey and Visualize
• Common Tools for the Data Preparation Phase
1. Hadoop
2. Alpine Miner
3. OpenRefine
4. Data Wrangler
Phase 3: Model Planning
• Data Exploration and Variable Selection
• Model Selection
• Common Tools for the Model Planning Phase
1. R
2. SQL Analysis services
3. SAS/ACCESS
Phase 4: Model Building
• Common Tools for the Model Building Phase
• Commercial tools
1. SAS Enterprise Miner
2. SPSS Modeler
3. Matlab
4. Alpine Miner
5. STATISTICA and Mathematica
• Open source tools
1. R and PL/R
2. Python
3. SQL
4. Octave
5. WEKA
Basic Methods of Data Analytics
• Descriptive Analysis
• Exploratory Analysis
• Inferential Analysis
• Predictive Analysis
Descriptive Analysis
• is used to present basic summaries about data
• Example: Summarize the given data

Enrolment Number   Gender   Height
S20200001          F        155
S20200002          F        160
S20200003          M        179
S20200004          F        175
S20200005          M        173
S20200006          M        160
S20200007          M        180
S20200008          F        178
S20200009          F        167
S20200010          M        173

Descriptive of categorical data (Gender):

Gender       Frequency   Proportion   Percentage
Female (F)   5           0.5          50%
Male (M)     5           0.5          50%
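The frequency table above can be reproduced in a few lines of standard-library Python (Python appears later in this module's tool list); a minimal sketch using the example gender values:

```python
# Frequency, proportion, and percentage of a categorical variable (Gender),
# using the ten example records above.
from collections import Counter

genders = ["F", "F", "M", "F", "M", "M", "M", "F", "F", "M"]

counts = Counter(genders)  # absolute frequency per category
n = len(genders)

for category in sorted(counts):
    freq = counts[category]
    proportion = freq / n
    print(f"{category}: frequency={freq}, proportion={proportion}, "
          f"percentage={proportion:.0%}")
# F: frequency=5, proportion=0.5, percentage=50%
# M: frequency=5, proportion=0.5, percentage=50%
```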
Descriptive of Quantitative Data:
• The height is a quantitative variable. The descriptives of
quantitative data are given in the following two ways:
1. Describing the central tendencies of the data
2. Describing the spread of the data

1. Describing the central tendencies of quantitative data:
• Mean
• Median
• Mode
Mean, Median & Mode
• The mean is the arithmetic average of the data values.
• The median of the data is the middle value of the sorted data.
• The mode is the most frequently occurring value.
• Outliers can affect the mean value, but not the median.
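These central tendencies can be computed for the example height data with the standard-library statistics module; a minimal sketch:

```python
# Central tendencies of the example height data (quantitative variable),
# using Python's standard-library statistics module.
import statistics

heights = [155, 160, 179, 175, 173, 160, 180, 178, 167, 173]

print(statistics.mean(heights))       # 170
print(statistics.median(heights))     # 173.0 (mid value of the sorted data)
print(statistics.multimode(heights))  # [160, 173] (each occurs twice)

# Outliers affect the mean but not the median:
with_outlier = heights + [999]
print(statistics.mean(with_outlier))    # jumps to about 245
print(statistics.median(with_outlier))  # still 173
```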
2. Describing the spread of the data (includes the following
measures):
• Range: minimum to maximum
• Variance:
Sample variance: s^2 = Sum(x_i - x_bar)^2 / (n - 1)
Population variance: sigma^2 = Sum(x_i - mu)^2 / N
• Standard deviation:
Sample: s = sqrt(s^2)
• 5-Point Summary and Interquartile Range (IQR):
Minimum value (Min)
1st quartile, Q1 (<= 25% of values)
2nd quartile is the median (M)
3rd quartile, Q3 (<= 75% of values)
Maximum value (Max)
IQR is the difference between the 3rd and 1st quartile
values: IQR = Q3 - Q1.
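The spread measures above can be sketched in standard-library Python; the quartiles here use the median-of-halves convention (other quartile conventions give slightly different values):

```python
# Spread of the example height data: variance, standard deviation,
# 5-point summary, and IQR (quartiles via the median-of-halves convention).
import statistics

def five_point_summary(data):
    s = sorted(data)
    mid = len(s) // 2
    lower = s[:mid]               # half below the median position
    upper = s[mid + len(s) % 2:]  # half above the median position
    return (s[0], statistics.median(lower), statistics.median(s),
            statistics.median(upper), s[-1])

heights = [155, 160, 179, 175, 173, 160, 180, 178, 167, 173]

print(statistics.variance(heights))   # sample variance, divides by n - 1
print(statistics.pvariance(heights))  # population variance, divides by n
print(statistics.stdev(heights))      # sample standard deviation

mn, q1, med, q3, mx = five_point_summary(heights)
print((mn, q1, med, q3, mx))  # (155, 160, 173.0, 178, 180)
print(q3 - q1)                # IQR = 18
```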
Inferential Analysis
• answers the question of how probable it is that the results
obtained from an analysis of a sample generalize to the
larger population.
Data preparation for Analysis
• NEED FOR DATA PREPARATION – Data quality factors
Data preprocessing
Data cleaning
• Missing Values
• Ignore the tuple
• Manually enter the omitted value
• Fill up the blank with a global constant
• To fill in the missing value, use a measure of the attribute's
central tendency (such as the mean or median)
• For all samples that belong to the same class as the specified
tuple, use the mean or median
• Fill in the blank with the value that is most likely to be there
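One of the strategies listed above, filling missing values with the attribute's median, can be sketched in plain Python (the data values here are illustrative):

```python
# Fill missing values (None) with the median of the observed values,
# one of the central-tendency imputation strategies listed above.
import statistics

ages = [25, None, 31, 28, None, 40, 33]  # hypothetical attribute values

observed = [v for v in ages if v is not None]
median_age = statistics.median(observed)  # median is robust to outliers
filled = [median_age if v is None else v for v in ages]

print(filled)  # [25, 31, 31, 28, 31, 40, 33]
```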
• Noisy Data
• Noise is the variance or random error in a measured variable.
• Binning - smoothing by bin means, smoothing by bin medians,
smoothing by bin boundaries
• Regression - fitting the data values to a function; may also
be used to smooth out the data.
• Outlier analysis - clustering can be used to identify outliers
• Data discretization - a data transformation and data
reduction technique, and an extensively used data smoothing
technique.
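Smoothing by bin means can be sketched as follows (equal-frequency bins of size 3; the data values are illustrative):

```python
# Smooth noisy data by equal-frequency binning: sort the values, split them
# into bins, and replace each value with its bin mean.
import statistics

values = sorted([15, 21, 24, 4, 25, 28, 34, 8, 21])
bin_size = 3

smoothed = []
for i in range(0, len(values), bin_size):
    bin_values = values[i:i + bin_size]
    bin_mean = statistics.mean(bin_values)
    smoothed.extend([bin_mean] * len(bin_values))

print(values)    # [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smoothed)  # [9, 9, 9, 22, 22, 22, 29, 29, 29]
```

Smoothing by bin medians or bin boundaries follows the same loop, replacing the bin mean with the bin median or the nearest bin endpoint.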
Data Integration
Things to consider during data integration
• Entity Identification Problem (including metadata)
• Redundancy and Correlation Analysis
• Tuple Duplication
• Data Value Conflict Detection and Resolution
Data Reduction
• Dimensionality reduction
• Numerosity reduction (compact forms of data
representation for the original data volume)
• Transformations are used in data compression
• Data discretization - alters numerical data by converting
values to interval or concept labels; it is both a data
transformation and a data reduction technique.
Data Transformation
used to change the data into formats that are suited to the
analytical process.
• Smoothing
• Attribute construction (or feature construction)
• Aggregation (e.g., daily sales data may be combined to
produce monthly or yearly sales)
• Normalization (where the attribute data is rescaled to
fit a narrower range: -1.0 to 1.0, or 0.0 to 1.0)
• Discretization
• Concept hierarchy creation using nominal data
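Min-max normalization, one of the transformations listed above, can be sketched as follows (the function name and income values are illustrative):

```python
# Min-max normalization: linearly rescale attribute values into a
# narrower target range, here the default [0.0, 1.0].
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo)
            for v in values]

incomes = [12000, 35000, 58000, 98000]  # hypothetical attribute values
normalized = min_max_normalize(incomes)
print(normalized)  # smallest value maps to 0.0, largest to 1.0
```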