Big Data
1. Executing a Data Analysis Project
▪ In the era of big data, firms treat data as they do other important assets.
▪ Traditionally, financial forecasting relied on various financial and
accounting numbers, ratios, and metrics coupled with statistical
or mathematical models.
▪ However, with the proliferation of textual big data (e.g., online news articles, internet financial forums, social networking platforms), such unstructured data have been shown to offer insights faster (as they are available in real time) and to enhance predictive power.
1. Executing a Data Analysis Project
▪ ML model building steps for structured data – Data that are
organized in a systematic format that is readily searchable and
readable by computer operations:
▪ 1. Conceptualization of the modeling task: Define the
problem
▪ 2. Data collection: The researcher has to determine which
sources (internal or external) to use to collect this data.
▪ 3. Data preparation and wrangling:
▪ Data preparation (Cleansing): Clean the data set and
prepare it for the model. Cleaning the data set includes
addressing any missing values or verification of any
out-of-range values.
1. Executing a Data Analysis Project
▪ 3. Data preparation and wrangling:
▪ Data wrangling (Preprocessing): Preprocessing data
may involve aggregating, filtering, or extracting
relevant variables.
▪ 4. Data exploration: This step involves feature selection and
engineering as well as initial (i.e., exploratory) data analysis.
▪ 5. Model training: Select an appropriate ML algorithm, evaluate the algorithm using a training data set, and tune the model.
1. Executing a Data Analysis Project
▪ ML model building steps for unstructured data (text data) – Data
are not organized into any systematic format that can be
processed by computers directly:
▪ 1. Text problem formulation: Determine the problem and
identify the exact inputs and output of the model.
▪ 2. Data curation (collection): Determine the sources of data to be used (e.g., web scraping, specific social media sites). If using supervised learning, annotating a reliable target variable is also necessary.
1. Executing a Data Analysis Project
▪ 3. Text preparation and wrangling. Convert unstructured
data into structured data.
▪ 4. Text exploration. This involves text visualization as well
as text feature selection and engineering.
▪ 5. Model training: Select an appropriate ML algorithm, evaluate the algorithm using a training data set, and tune the model.
2. Data Preparation and Wrangling
▪ Data preparation and wrangling involve cleansing and
organizing raw data into a consolidated format. The resulting
dataset is suitable to use for further analyses and training a
machine learning (ML) model.
▪ This stage involves two important tasks: data preparation (cleansing) and data wrangling (preprocessing).
2. Data Preparation and Wrangling
▪ Data preparation (Cleansing): Data cleansing is the process of
examining, identifying, and mitigating errors in raw data.
Normally, the raw data are neither sufficiently complete nor
sufficiently clean to directly train the ML model.
▪ Data wrangling (Preprocessing): This task performs transformations and critical processing steps on the cleansed data to make them ready for ML model training. The data need to be processed by dealing with outliers, extracting useful variables from existing data points, and scaling the data.
2. Data Preparation and Wrangling – Structured Data
▪ The table below is an example of structured data.
2. Data Preparation and Wrangling – Structured Data
▪ Data preparation (Cleansing): In structured data, data errors can be in the form of incomplete, invalid, inaccurate, inconsistent, non-uniform, and duplicate data observations. The data cleansing process mainly deals with identifying and mitigating all such errors:
▪ Incompleteness error: Data are missing → omitted or replaced with "NA" for deletion or substitution. Example: rows 4 (ID 3), 5 (ID 4), 6 (ID 5), and 7 (ID 6).
▪ Invalidity error: The data are outside of a meaningful range → verify against other administrative data records. Example: Date of birth in row 5.
2. Data Preparation and Wrangling – Structured Data
▪ Inaccuracy error: The data are not a measure of the true value → check with the help of business records and administrators. Example: Credit card in row 5.
▪ Inconsistency error: The data conflict with the corresponding data points or with reality → the contradiction should be eliminated by clarifying with another source. Example: Gender in row 3 (ID 2).
▪ Non-uniformity error: The data are not presented in an identical format → convert the data points into a preferable standard format. Example: Monetary unit in Salary and Other Income.
▪ Duplication error: Duplicate observations are present → remove the duplicate entries. Example: Row 6 and row 3 are identical.
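The cleansing steps above can be sketched in Python with pandas. This is a minimal illustration only; the column names (id, date_of_birth, salary) and the valid ranges are hypothetical, not taken from the table referenced in the slides.

import pandas as pd

# Hypothetical raw data containing the error types described above.
raw = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "date_of_birth": ["1985-03-12", "1990-07-01", "1990-07-01", "2150-01-01"],
    "salary": [54000, None, None, 61000],
})

# Duplication error: drop exact duplicate rows.
clean = raw.drop_duplicates().copy()

# Incompleteness error: identify missing values, then delete or substitute them.
clean["salary"] = clean["salary"].fillna(clean["salary"].median())

# Invalidity error: keep only rows whose date of birth lies in a meaningful range.
dob = pd.to_datetime(clean["date_of_birth"], errors="coerce")
clean = clean[(dob >= "1900-01-01") & (dob <= "2010-12-31")]

print(clean)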
2. Data Preparation and Wrangling – Structured Data
▪ Data wrangling (Preprocessing) primarily includes transformations
and scaling of the data.
▪ The following transformations are common in practice:
▪ Extraction (e.g., extracting number of years employed based on
dates provided).
▪ Aggregation, which involves consolidating two related
variables into one, using appropriate weights.
▪ Filtration, which involves removing irrelevant observations.
▪ Selection, which involves removing features (i.e., data
columns) not needed for processing.
▪ Conversion of data of diverse types (e.g., nominal, ordinal).
2. Data Preparation and Wrangling – Structured Data
▪ Outliers may be present in the data. Any outliers that are present must
first be identified.
▪ Detect:
▪ Use standard deviation: a data value that lies more than 3 standard deviations from the mean may be considered an outlier.
▪ Use interquartile range: the IQR is the difference between the 75th and the 25th percentile values of the data; values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR may be flagged as outliers.
▪ Handle (see the sketch below):
▪ Trimming: The highest and lowest x% of observations are excluded.
▪ Winsorization: Extreme values are replaced with the maximum (or minimum) values of the non-outlier data points.
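A minimal sketch of IQR-based outlier detection followed by trimming and winsorization, assuming a single numeric pandas Series (the values are hypothetical):

import pandas as pd

values = pd.Series([8, 9, 10, 10, 11, 12, 13, 95])  # 95 is an obvious outlier

# Detect outliers with the interquartile range rule.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Trimming: exclude the flagged observations.
trimmed = values[~is_outlier]

# Winsorization: cap extreme values at the allowable bounds instead of dropping them.
winsorized = values.clip(lower=lower, upper=upper)

print(is_outlier.sum(), trimmed.tolist(), winsorized.tolist())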
2. Data Preparation and Wrangling – Structured Data
▪ Scaling is a process of adjusting the range of a feature by shifting
and changing the scale of data:
▪ Normalization scales variable values between 0 and 1 using X_normalized = (X − X_min) / (X_max − X_min), as sketched below.
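A minimal min–max normalization sketch applying the formula above to a hypothetical pandas Series:

import pandas as pd

x = pd.Series([10, 20, 35, 50, 100])  # hypothetical feature values

# Min-max normalization: rescale every value into the [0, 1] range.
x_normalized = (x - x.min()) / (x.max() - x.min())

print(x_normalized.tolist())  # [0.0, 0.111..., 0.277..., 0.444..., 1.0]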
2. Data Preparation and Wrangling – Unstructured Data
▪ Steps in Text Preparation (Cleansing):
▪ 1. Remove HTML tags. Text collected from web pages has
embedded HTML tags, which may need to be removed before
processing. A regular expression (regex) is a text string used to
identify characters in a particular order.
▪ 2. Remove punctuations. Text analysis usually does not need
punctuations, so these need to be removed as well. Some
punctuations (e.g., %, $, ?) may be needed for analysis, and if so,
they are replaced with annotations (i.e., textual expressions) for
model training.
2. Data Preparation and Wrangling – Unstructured Data
▪ Steps in Text Preparation (Cleansing):
▪ 3. Remove numbers. Numbers are removed or replaced with
annotations to let the ML program know that a number is
present, but its value is not important in the analysis. If the
value of a number is important for analysis, such values are
first extracted via text applications.
▪ 4. Remove white spaces. Extra formatting-related white spaces
(e.g., tabs, indents) do not serve any purpose in text processing
and are removed.
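The four cleansing steps above can be sketched with Python's re module. This is a minimal illustration; the sample sentence and the annotation tokens (/percentSign/, /number/) are hypothetical choices, not prescribed by the source.

import re

raw = "<p>Revenue grew   12% in 2023!</p>"

# 1. Remove HTML tags with a regular expression.
text = re.sub(r"<[^>]+>", "", raw)

# 2. Remove punctuation, first replacing meaningful symbols with annotations.
text = text.replace("%", " /percentSign/ ")
text = re.sub(r"[^\w\s/]", "", text)

# 3. Replace numbers with an annotation so the model knows a number was present.
text = re.sub(r"\d+", "/number/", text)

# 4. Collapse extra white spaces (tabs, repeated spaces) into single spaces.
text = re.sub(r"\s+", " ", text).strip()

print(text)  # Revenue grew /number/ /percentSign/ in /number/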
2. Data Preparation and Wrangling – Unstructured Data
▪ Steps in Text Wrangling (Preprocessing):
▪ 1. Tokenization: In text wrangling, a token is a word, and
tokenization is the process of splitting a sentence into tokens.
2. Data Preparation and Wrangling – Unstructured Data
▪ Steps in Text Wrangling (Preprocessing):
▪ 2. Normalization:
▪ Lowercasing, so as not to discriminate between "market" and "Market".
▪ Removal of stop words. In some ML applications, stop
words such as the, is, and so on do not carry any semantic
meaning; hence, they are removed to reduce the number
of tokens in the training data.
2. Data Preparation and Wrangling – Unstructured Data
▪ 2. Normalization:
▪ Stemming. The process of converting inflected forms of a
word into its base word (known as stem). For example,
the stem of the words “analyzed” and “analyzing” is
“analyz.” While stemming makes the text confusing for
human processing, it is ideally suited for machines.
▪ Lemmatization. This involves the conversion of inflected forms of a word into its lemma (i.e., morphological root). Lemmatization is similar to stemming but is more computationally advanced and resource intensive. For example, "analyzed," "analyzing," and "analyzes" are all converted to "analyze."
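A minimal tokenization and normalization sketch using NLTK (an assumed library choice; the source does not name a tool). It requires the punkt, stopwords, and wordnet resources to be downloaded first.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time download of the required NLTK resources.
for resource in ["punkt", "stopwords", "wordnet"]:
    nltk.download(resource, quiet=True)

sentence = "The analysts are analyzing the market carefully"

# 1. Tokenization and lowercasing.
tokens = [t.lower() for t in word_tokenize(sentence)]

# 2. Stop word removal.
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]

# 3. Stemming versus lemmatization of the remaining tokens.
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])                   # e.g., "analyzing" -> "analyz"
print([lemmatizer.lemmatize(t, pos="v") for t in tokens])  # e.g., "analyzing" -> "analyze"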
2. Data Preparation and Wrangling – Unstructured Data
▪ Steps in Text Wrangling (Preprocessing):
▪ 3. Create a bag-of-words (BOW): After the text is cleansed and normalized, a bag-of-words procedure is applied, which simply collects all the words or tokens without regard to the sequence of occurrence.
2. Data Preparation and Wrangling – Unstructured Data
▪ Steps in Text Wrangling (Preprocessing):
▪ 4. Create a document term matrix (DTM): DTM is then used to
convert the unstructured data into structured data. In this
matrix, each text document is a row, and the columns are
represented by tokens. The cell value represents the number of
occurrences of a token in a document (i.e., row).
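A minimal sketch of building the BOW vocabulary and the document term matrix with scikit-learn's CountVectorizer (an assumed tool choice; the two documents are hypothetical):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "the market rallied as earnings beat expectations",
    "weak earnings dragged the market lower",
]

# Fitting the vectorizer collects the tokens: its vocabulary is the bag-of-words.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(documents)

# Document term matrix: one row per document, one column per token, and each
# cell holds the number of occurrences of that token in that document.
dtm = pd.DataFrame(counts.toarray(), columns=vectorizer.get_feature_names_out())
print(dtm)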
2. Data Preparation and Wrangling – Unstructured Data
▪ The table below is an example of DTM.
3. Data Exploration Objectives and Methods
▪ Data exploration is a crucial part of big data projects. The prepared
data are explored to investigate and comprehend data distributions
and relationships.
▪ Data exploration involves three important tasks: exploratory data
analysis, feature selection, and feature engineering.
3. Data Exploration Objectives and Methods
▪ Exploratory graphs, charts, and other visualizations, such as heat
maps and word clouds, are designed to summarize and observe data.
▪ Data can also be summarized and examined using quantitative
methods, such as descriptive statistics and central tendency
measures.
▪ An important objective of exploratory data analysis is to understand data properties, distributions, and other characteristics; to find patterns or relationships; and to evaluate basic questions and hypotheses.
3. Data Exploration Objectives and Methods
▪ Feature selection is a process whereby only pertinent features from
the dataset are selected for ML model training. Selecting fewer
features decreases ML model complexity and training time.
▪ Feature engineering is a process of creating new features by
changing or transforming existing features. Model performance
heavily depends on feature selection and engineering.
3. Data Exploration – Structured Data
▪ Exploratory Data Analysis:
▪ Summary statistics, such as mean, median, quartiles, ranges,
standard deviations, skewness, and kurtosis, of a feature can
be computed.
▪ One-dimension visualization summarizes each feature in the
dataset. The basic one-dimension exploratory visualizations
are histograms, bar charts, scatterplots …
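A minimal EDA sketch with pandas and matplotlib, assuming a hypothetical numeric feature named salary (the names and values are illustrative only):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"salary": [42, 48, 51, 55, 58, 60, 63, 70, 72, 95]})  # in $ thousands

# Summary statistics: mean, quartiles, standard deviation, skewness, kurtosis.
print(df["salary"].describe())
print("skewness:", df["salary"].skew(), "kurtosis:", df["salary"].kurt())

# One-dimension visualization: a histogram of the feature.
df["salary"].plot(kind="hist", bins=5, title="Salary distribution")
plt.show()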
3. Data Exploration – Structured Data
▪ Feature Selection:
▪ We try to select only the features that contribute to the out-of-
sample predictive power of the model.
▪ A parsimonious model (i.e., a model with fewer features)
reduces feature-induced noise and improves the model’s
prediction accuracy.
▪ Feature selection must not be confused with the data
preprocessing steps during data preparation. Feature selection
requires a good understanding of the business environment
and the interrelationships among the features identified in the
EDA.
3. Data Exploration – Structured Data
▪ Feature Engineering:
▪ Feature engineering (FE) involves optimizing and improving the selected features, either by decomposing a feature into multiple features or by converting an existing feature into a new feature.
3. Data Exploration – Unstructured Data
▪ Exploratory Data Analysis: Unstructured text can be tokenized, and
summary statistics include:
▪ Term frequency: number of times the word appears in the text.
▪ Co-occurrence: where two or more words appear together.
▪ A word cloud: a visual representation of all the words in a
BOW, such that words with higher frequency have a larger font
size. This allows the analyst to determine which words are
contextually more important. Figure below shows an example
of a word cloud.
3. Data Exploration – Unstructured Data
▪ Feature Selection: Involves selecting a subset of tokens in the BOW.
Reduction in BOW size makes the model more parsimonious and
reduces feature-induced noise.
▪ Noisy features are both the most frequent and most sparse (or rare)
tokens in the dataset. Noisy features should be removed.
▪ High-frequency words tend to be stop words or common vocabulary words and are typically present in most of the texts across the dataset.
▪ Low-frequency words may be irrelevant and are present in only a few texts.
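One practical way to drop the noisiest tokens is through CountVectorizer's max_df and min_df parameters in scikit-learn; the thresholds below are hypothetical choices, not values from the source.

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "the market rallied on strong earnings",
    "the market fell on weak earnings guidance",
    "the central bank held rates steady",
]

# max_df drops tokens appearing in more than 90% of documents (too frequent);
# min_df drops tokens appearing in fewer than 2 documents (too rare).
vectorizer = CountVectorizer(max_df=0.9, min_df=2)
dtm = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # e.g., ['earnings' 'market' 'on']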
3. Data Exploration – Unstructured Data
▪ Feature Engineering: The following are some techniques for feature
engineering:
▪ Numbers. Tokens with standard lengths are identified and
converted into a token such as /numberX/. Four-digit numbers
may be associated with years and are assigned a value of
/number4/.
▪ N-grams. These are multiword patterns, and if they are useful, the word order is preserved. For example, the words "monetary policy" may be best kept as a sequence rather than broken into two different tokens, and therefore would be replaced by a single token, monetary_policy.
3. Data Exploration – Unstructured Data
▪ Feature Engineering: The following are some techniques for feature
engineering:
▪ Named entity recognition (NER). NER algorithms search for token values, in the context in which they are used, against their internal library and assign an NER tag to the token. For example, Microsoft would be assigned an NER tag of ORG, and Europe would be assigned an NER tag of PLACE.
▪ Parts of speech (POS). This uses language structure dictionaries to contextually assign a POS tag to each token. For example, "market" could be a verb, as in "to market," or a noun, as in "in the market."
4. Model Training
▪ The three tasks of ML model training are method selection,
performance evaluation, and tuning.
4. Model Training – Method Selection
▪ Method selection is the art and science of choosing the appropriate
ML method (i.e., algorithm) given the objectives and data
characteristics. Method selection is based on the following factors:
▪ Supervised or unsupervised learning. Supervised learning is
used when the training data contains the ground truth or the
known outcome (i.e., the target variable). Unsupervised
learning occurs when there is no target variable.
4. Model Training – Method Selection
▪ Type of data. For numerical data (e.g., predicting earnings) we
may use classification and regression tree (CART) methods.
For text data, we can use generalized linear models (GLMs) and
SVMs. For image data, neural networks and deep learning
methods can be employed.
▪ Size of data. Wide data sets with many features but fewer observations can be handled well with SVMs, whereas neural networks tend to work better with a large number of observations but fewer features.
4. Model Training – Method Selection
▪ Once a method is selected, the researcher has to specify appropriate hyperparameters, for example, the number of neighbors k in the k-nearest neighbors (KNN) algorithm.
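As a concrete illustration of setting a hyperparameter, here is a minimal scikit-learn sketch that fixes k for a KNN classifier before training (the toy features and labels are hypothetical):

from sklearn.neighbors import KNeighborsClassifier

# Toy training data: two numeric features and a binary target (e.g., 1 = default).
X_train = [[0.2, 1.0], [0.4, 0.9], [0.8, 0.2], [0.9, 0.1]]
y_train = [0, 0, 1, 1]

# The hyperparameter n_neighbors (k) is chosen by the researcher, not learned from data.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

print(model.predict([[0.5, 0.5]]))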
▪ Class imbalance occurs when one class has a large number of observations relative to other classes. For example, in a model for predicting customer default, if the data set contains mostly high-credit-score customers, then the model would be more likely to predict nondefault for any new customer → the training data set should contain a variety of customers so that it has enough diversity to make correct predictions.
4. Model Training – Method Selection
▪ One way to overcome class imbalance is to undersample the
overrepresented class and oversample the underrepresented class.
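A minimal sketch of undersampling the overrepresented class and oversampling the underrepresented class with pandas (the DataFrame, class labels, and target counts are hypothetical):

import pandas as pd

# Hypothetical imbalanced data: many nondefaulters (0), few defaulters (1).
df = pd.DataFrame({"feature": range(10), "default": [0] * 8 + [1] * 2})

majority = df[df["default"] == 0]
minority = df[df["default"] == 1]

# Undersample the majority class without replacement; oversample the minority with replacement.
majority_down = majority.sample(n=4, replace=False, random_state=42)
minority_up = minority.sample(n=4, replace=True, random_state=42)

balanced = pd.concat([majority_down, minority_up])
print(balanced["default"].value_counts())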
4. Model Training – Performance Evaluation
▪ It is important to measure the model training performance or
goodness of fit for validation of the model.
▪ For classification problems, we can use error analysis.
▪ For predicting a continuous variable (like a stock price), the root mean square error (RMSE) can be used.
▪ Errors in classification problems can be false positives (Type I errors) or false negatives (Type II errors). A confusion matrix summarizes the results of a classification problem. Tip: think of the confusion matrix like a Covid test: 0 = no Covid, 1 = has Covid.
4. Model Training – Performance Evaluation
▪ For example, if someone actually has Covid (actual label = 1) but the test kit returns "Negative" (predicted label = 0, i.e., no Covid), then the test result is a false negative, or a Type II error.
4. Model Training – Performance Evaluation
▪ Metrics such as precision (the ratio of true positives to all predicted
positives) and recall (the ratio of TPs to all actual positives) can be
used.
▪ While FPs and FNs are both errors, they may not be equally important; which one matters more is a business decision. For example, a lender may want to avoid lending to potential defaulters, and so will want to maximize recall.
▪ High precision is valued when the cost of a type I error is large,
while high recall is valued when the cost of a type II error is large.
4. Model Training – Performance Evaluation
▪ Trading off precision and recall is subject to business decisions and
model application → additional evaluation metrics that provide the
overall performance of the model are generally used.
▪ Accuracy is the percentage of correctly predicted classes out of total
predictions. F1 score is the harmonic mean of precision and recall.
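A minimal sketch of accuracy, precision, recall, and F1 derived from a confusion matrix, using scikit-learn on hypothetical actual and predicted labels:

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: 1 = positive class (e.g., has Covid), 0 = negative class.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)

# Precision = TP / (TP + FP); Recall = TP / (TP + FN);
# Accuracy = (TP + TN) / total; F1 = harmonic mean of precision and recall.
print("precision:", precision_score(actual, predicted))
print("recall:", recall_score(actual, predicted))
print("accuracy:", accuracy_score(actual, predicted))
print("f1:", f1_score(actual, predicted))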
4. Model Training – Performance Evaluation
▪ For predicting a continuous variable (like a stock price), the root mean square error (RMSE) can be used. This is useful for predictions that are continuous, such as those from regression models. The RMSE is a single metric summarizing the prediction error in a sample.
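A minimal RMSE sketch, where RMSE = sqrt(mean((predicted − actual)^2)), applied to hypothetical predictions of a continuous target:

import math

# Hypothetical actual and predicted values (e.g., stock prices).
actual = [100.0, 102.5, 98.0, 101.0]
predicted = [101.0, 101.5, 99.5, 100.0]

# RMSE: square the errors, average them, then take the square root.
squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

print(round(rmse, 4))  # 1.1456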
4. Model Training – Model Tuning
▪ After model evaluation, the model needs to be revised until it
reaches an acceptable performance level. It is necessary to find an
optimum trade-off between bias and variance errors, such that the
model is neither underfitting nor overfitting.
▪ Tuning involves altering the hyperparameters until a desirable level
of model performance is achieved.