Big Data
1. Executing a Data Analysis Project
▪ In the era of big data, firms treat data as they do other important assets.
▪ Traditionally, financial forecasting relied on various financial and
accounting numbers, ratios, and metrics coupled with statistical
or mathematical models.
▪ However, with the proliferation of textual big data (e.g., online news articles, internet financial forums, social networking platforms), such unstructured data have been shown to offer insights faster (as they are available in real time) and to enhance predictive power.
1. Executing a Data Analysis Project
▪ ML model building steps for structured data – Data that are
organized in a systematic format that is readily searchable and
readable by computer operations:
▪ 1. Conceptualization of the modeling task: Define the
problem
▪ 2. Data collection: The researcher has to determine which
sources (internal or external) to use to collect this data.
▪ 3. Data preparation and wrangling:
▪ Data preparation (Cleansing): Clean the data set and
prepare it for the model. Cleaning the data set includes
addressing any missing values or verification of any
out-of-range values.
1. Executing a Data Analysis Project
▪ 3. Data preparation and wrangling:
▪ Data wrangling (Preprocessing): Preprocessing data
may involve aggregating, filtering, or extracting
relevant variables.
▪ 4. Data exploration: This step involves feature selection and
engineering as well as initial (i.e., exploratory) data analysis.
▪ 5. Model training: Select an appropriate ML algorithm, evaluate the algorithm using a training data set, and tune the model.
1. Executing a Data Analysis Project
▪ ML model building steps for unstructured data (text data) – Data
are not organized into any systematic format that can be
processed by computers directly:
▪ 1. Text problem formulation: Determine the problem and
identify the exact inputs and output of the model.
▪ 2. Data curation (collection): Determine the sources of data to be used (e.g., web scraping, specific social media sites). If using supervised learning, annotating a reliable target variable is also necessary.
1. Executing a Data Analysis Project
▪ 3. Text preparation and wrangling. Convert unstructured
data into structured data.
▪ 4. Text exploration. This involves text visualization as well
as text feature selection and engineering.
▪ 5. Model training: Select an appropriate ML algorithm, evaluate the algorithm using a training data set, and tune the model.
2. Data Preparation and Wrangling
▪ Data preparation and wrangling involve cleansing and
organizing raw data into a consolidated format. The resulting
dataset is suitable to use for further analyses and training a
machine learning (ML) model.
▪ This stage involves two important tasks: data preparation (cleansing) and data wrangling (preprocessing).
2. Data Preparation and Wrangling
▪ Data preparation (Cleansing): Data cleansing is the process of
examining, identifying, and mitigating errors in raw data.
Normally, the raw data are neither sufficiently complete nor
sufficiently clean to directly train the ML model.
▪ Data wrangling (Preprocessing): This task performs transformations and critical processing steps on the cleansed data to make them ready for ML model training. The data need to be processed by dealing with outliers, extracting useful variables from existing data points, and scaling the data.
2. Data Preparation and Wrangling – Structured Data
▪ The table below is an example of structured data.
2. Data Preparation and Wrangling – Structured Data
▪ Data preparation (Cleansing): In structured data, data errors can be in the form of incomplete, invalid, inaccurate, inconsistent, non-uniform, and duplicate data observations. The data cleansing process mainly deals with identifying and mitigating all such errors:
▪ Incompleteness error: Data are missing → omitted or replaced with "NA" for deletion or substitution. Example: rows 4 (ID 3), 5 (ID 4), 6 (ID 5), and 7 (ID 6).
▪ Invalidity error: The data are outside of a meaningful range → verify against other administrative data records. Example: Date of birth in row 5.
2. Data Preparation and Wrangling – Structured Data
▪ Inaccuracy error: The data are not a measure of the true value → check with the help of business records and administrators. Example: Credit card in row 5.
▪ Inconsistency error: The data conflict with the corresponding data points or with reality → the contradiction should be eliminated by clarifying with another source. Example: Gender in row 3 (ID 2).
▪ Non-uniformity error: The data are not presented in an identical format → convert the data points into a preferable standard format. Example: Monetary unit in Salary and Other Income.
▪ Duplication error: Duplicate observations are present → remove the duplicate entries. Example: Row 6 and row 3 are identical.
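The cleansing steps above can be sketched in Python with pandas. This is a minimal illustration only; the column names (id, date_of_birth, salary) and the valid ranges are hypothetical, not taken from the table referenced in the slides.

import pandas as pd

# Hypothetical raw data containing the error types described above.
raw = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "date_of_birth": ["1985-03-12", "1990-07-01", "1990-07-01", "2150-01-01"],
    "salary": [54000, None, None, 61000],
})

# Duplication error: drop exact duplicate rows.
clean = raw.drop_duplicates().copy()

# Incompleteness error: identify missing values, then delete or substitute them.
clean["salary"] = clean["salary"].fillna(clean["salary"].median())

# Invalidity error: keep only rows whose date of birth lies in a meaningful range.
dob = pd.to_datetime(clean["date_of_birth"], errors="coerce")
clean = clean[(dob >= "1900-01-01") & (dob <= "2010-12-31")]

print(clean)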
2. Data Preparation and Wrangling – Structured Data
▪ Data wrangling (Preprocessing) primarily includes transformations
and scaling of the data.
▪ The following transformations are common in practice:
▪ Extraction (e.g., extracting number of years employed based on
dates provided).
▪ Aggregation, which involves consolidating two related
variables into one, using appropriate weights.
▪ Filtration, which involves removing irrelevant observations.
▪ Selection, which involves removing features (i.e., data
columns) not needed for processing.
▪ Conversion of data of diverse types (e.g., nominal, ordinal).
2. Data Preparation and Wrangling – Structured Data
▪ Outliers may be present in the data. Any outliers that are present must
first be identified.
▪ Detect:
▪ Use standard deviation: a data value that lies more than 3 standard deviations from the mean may be considered an outlier.
▪ Use interquartile range: the IQR is the difference between the 75th and the 25th percentile values of the data; values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR may be flagged as outliers.
▪ Handle (see the sketch below):
▪ Trimming: The highest and lowest x% of observations are excluded.
▪ Winsorization: Extreme values are replaced with the maximum (or minimum) values of the non-outlier data points.
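A minimal sketch of IQR-based outlier detection followed by trimming and winsorization, assuming a single numeric pandas Series (the values are hypothetical):

import pandas as pd

values = pd.Series([8, 9, 10, 10, 11, 12, 13, 95])  # 95 is an obvious outlier

# Detect outliers with the interquartile range rule.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Trimming: exclude the flagged observations.
trimmed = values[~is_outlier]

# Winsorization: cap extreme values at the allowable bounds instead of dropping them.
winsorized = values.clip(lower=lower, upper=upper)

print(is_outlier.sum(), trimmed.tolist(), winsorized.tolist())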
2. Data Preparation and Wrangling – Structured Data
▪ Scaling is a process of adjusting the range of a feature by shifting
and changing the scale of data:
▪ Normalization scales variable values between 0 and 1 using X_normalized = (X − X_min) / (X_max − X_min), as sketched below.
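A minimal min–max normalization sketch applying the formula above to a hypothetical pandas Series:

import pandas as pd

x = pd.Series([10, 20, 35, 50, 100])  # hypothetical feature values

# Min-max normalization: rescale every value into the [0, 1] range.
x_normalized = (x - x.min()) / (x.max() - x.min())

print(x_normalized.tolist())  # [0.0, 0.111..., 0.277..., 0.444..., 1.0]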
2. Data Preparation and Wrangling – Unstructured Data
▪ Steps in Text Preparation (Cleansing):
▪ 1. Remove HTML tags. Text collected from web pages has
embedded HTML tags, which may need to be removed before
processing. A regular expression (regex) is a text string used to
identify characters in a particular order.
▪ 2. Remove punctuations. Text analysis usually does not need
punctuations, so these need to be removed as well. Some
punctuations (e.g., %, $, ?) may be needed for analysis, and if so,
they are replaced with annotations (i.e., textual expressions) for
model training.
2. Data Preparation and Wrangling – Unstructured Data
▪ Steps in Text Preparation (Cleansing):
▪ 3. Remove numbers. Numbers are removed or replaced with
annotations to let the ML program know that a number is
present, but its value is not important in the analysis. If the
value of a number is important for analysis, such values are
first extracted via text applications.
▪ 4. Remove white spaces. Extra formatting-related white spaces
(e.g., tabs, indents) do not serve any purpose in text processing
and are removed.
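The four cleansing steps above can be sketched with Python's re module. This is a minimal illustration; the sample sentence and the annotation tokens (/percentSign/, /number/) are hypothetical choices, not prescribed by the source.

import re

raw = "<p>Revenue grew   12% in 2023!</p>"

# 1. Remove HTML tags with a regular expression.
text = re.sub(r"<[^>]+>", "", raw)

# 2. Remove punctuation, first replacing meaningful symbols with annotations.
text = text.replace("%", " /percentSign/ ")
text = re.sub(r"[^\w\s/]", "", text)

# 3. Replace numbers with an annotation so the model knows a number was present.
text = re.sub(r"\d+", "/number/", text)

# 4. Collapse extra white spaces (tabs, repeated spaces) into single spaces.
text = re.sub(r"\s+", " ", text).strip()

print(text)  # Revenue grew /number/ /percentSign/ in /number/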
2. Data Preparation and Wrangling – Unstructured Data
▪ Steps in Text Wrangling (Preprocessing):
▪ 1. Tokenization: In text wrangling, a token is a word, and
tokenization is the process of splitting a sentence into tokens.
2. Data Preparation and Wrangling – Unstructured Data
▪ Steps in Text Wrangling (Preprocessing):
▪ 2. Normalization:
▪ Lowercasing, so as not to discriminate between "market" and "Market".
▪ Removal of stop words. In some ML applications, stop
words such as the, is, and so on do not carry any semantic
meaning; hence, they are removed to reduce the number
of tokens in the training data.
2. Data Preparation and Wrangling – Unstructured Data
▪ 2. Normalization:
▪ Stemming. The process of converting inflected forms of a
word into its base word (known as stem). For example,
the stem of the words “analyzed” and “analyzing” is
“analyz.” While stemming makes the text confusing for
human processing, it is ideally suited for machines.
▪ Lemmatization. This involves the conversion of inflected forms of a word into its lemma (i.e., morphological root). Lemmatization is similar to stemming but is more computationally advanced and resource intensive. For example, "analyzed," "analyzing," and "analyzes" are all converted to "analyze."
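A minimal tokenization and normalization sketch using NLTK (an assumed library choice; the source does not name a tool). It requires the punkt, stopwords, and wordnet resources to be downloaded first.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time download of the required NLTK resources.
for resource in ["punkt", "stopwords", "wordnet"]:
    nltk.download(resource, quiet=True)

sentence = "The analysts are analyzing the market carefully"

# 1. Tokenization and lowercasing.
tokens = [t.lower() for t in word_tokenize(sentence)]

# 2. Stop word removal.
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]

# 3. Stemming versus lemmatization of the remaining tokens.
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])                   # e.g., "analyzing" -> "analyz"
print([lemmatizer.lemmatize(t, pos="v") for t in tokens])  # e.g., "analyzing" -> "analyze"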
2. Data Preparation and Wrangling – Unstructured Data
▪ Steps in Text Wrangling (Preprocessing):
▪ 3. Create a bag-of-words (BOW): After the text is cleansed and normalized, a bag-of-words procedure is applied, which simply collects all the words or tokens without regard to the sequence of occurrence.
2. Data Preparation and Wrangling – Unstructured Data
▪ Steps in Text Wrangling (Preprocessing):
▪ 4. Create a document term matrix (DTM): DTM is then used to
convert the unstructured data into structured data. In this
matrix, each text document is a row, and the columns are
represented by tokens. The cell value represents the number of
occurrences of a token in a document (i.e., row).
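A minimal sketch of building the BOW vocabulary and the document term matrix with scikit-learn's CountVectorizer (an assumed tool choice; the two documents are hypothetical):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "the market rallied as earnings beat expectations",
    "weak earnings dragged the market lower",
]

# Fitting the vectorizer collects the tokens: its vocabulary is the bag-of-words.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(documents)

# Document term matrix: one row per document, one column per token, and each
# cell holds the number of occurrences of that token in that document.
dtm = pd.DataFrame(counts.toarray(), columns=vectorizer.get_feature_names_out())
print(dtm)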
2. Data Preparation and Wrangling – Unstructured Data
▪ The table below is an example of DTM.
3. Data Exploration Objectives and Methods
▪ Data exploration is a crucial part of big data projects. The prepared
data are explored to investigate and comprehend data distributions
and relationships.
▪ Data exploration involves three important tasks: exploratory data
analysis, feature selection, and feature engineering.
3. Data Exploration Objectives and Methods
▪ Exploratory graphs, charts, and other visualizations, such as heat
maps and word clouds, are designed to summarize and observe data.
▪ Data can also be summarized and examined using quantitative
methods, such as descriptive statistics and central tendency
measures.
▪ An important objective of exploratory data analysis is to understand data properties, distributions, and other characteristics; to find patterns or relationships; and to evaluate basic questions and hypotheses.
3. Data Exploration Objectives and Methods
▪ Feature selection is a process whereby only pertinent features from
the dataset are selected for ML model training. Selecting fewer
features decreases ML model complexity and training time.
▪ Feature engineering is a process of creating new features by
changing or transforming existing features. Model performance
heavily depends on feature selection and engineering.
3. Data Exploration – Structured Data
▪ Exploratory Data Analysis:
▪ Summary statistics, such as mean, median, quartiles, ranges,
standard deviations, skewness, and kurtosis, of a feature can
be computed.
▪ One-dimension visualization summarizes each feature in the
dataset. The basic one-dimension exploratory visualizations
are histograms, bar charts, scatterplots …
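A minimal EDA sketch with pandas and matplotlib, assuming a hypothetical numeric feature named salary (the names and values are illustrative only):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"salary": [42, 48, 51, 55, 58, 60, 63, 70, 72, 95]})  # in $ thousands

# Summary statistics: mean, quartiles, standard deviation, skewness, kurtosis.
print(df["salary"].describe())
print("skewness:", df["salary"].skew(), "kurtosis:", df["salary"].kurt())

# One-dimension visualization: a histogram of the feature.
df["salary"].plot(kind="hist", bins=5, title="Salary distribution")
plt.show()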
3. Data Exploration – Structured Data
▪ Feature Selection:
▪ We try to select only the features that contribute to the out-of-
sample predictive power of the model.
▪ A parsimonious model (i.e., a model with fewer features)
reduces feature-induced noise and improves the model’s
prediction accuracy.
▪ Feature selection must not be confused with the data
preprocessing steps during data preparation. Feature selection
requires a good understanding of the business environment
and the interrelationships among the features identified in the
EDA.
3. Data Exploration – Structured Data
▪ Feature Engineering:
▪ Feature engineering (FE) involves optimizing and improving the selected features, either by decomposing a feature into multiple features or by converting an existing feature into a new feature.
3. Data Exploration – Unstructured Data
▪ Exploratory Data Analysis: Unstructured text can be tokenized, and
summary statistics include:
▪ Term frequency: number of times the word appears in the text.
▪ Co-occurrence: where two or more words appear together.
▪ A word cloud: a visual representation of all the words in a
BOW, such that words with higher frequency have a larger font
size. This allows the analyst to determine which words are
contextually more important. Figure below shows an example
of a word cloud.
3. Data Exploration – Unstructured Data
▪ Feature Selection: Involves selecting a subset of tokens in the BOW.
Reduction in BOW size makes the model more parsimonious and
reduces feature-induced noise.
▪ Noisy features are both the most frequent and most sparse (or rare)
tokens in the dataset. Noisy features should be removed.
▪ High-frequency words tend to be stop words or common vocabulary words and are typically present in most of the texts across the dataset.
▪ Low-frequency words may be irrelevant and are present in only a few texts.
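One practical way to drop the noisiest tokens is through CountVectorizer's max_df and min_df parameters in scikit-learn; the thresholds below are hypothetical choices, not values from the source.

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "the market rallied on strong earnings",
    "the market fell on weak earnings guidance",
    "the central bank held rates steady",
]

# max_df drops tokens appearing in more than 90% of documents (too frequent);
# min_df drops tokens appearing in fewer than 2 documents (too rare).
vectorizer = CountVectorizer(max_df=0.9, min_df=2)
dtm = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # e.g., ['earnings' 'market' 'on']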
3. Data Exploration – Unstructured Data
▪ Feature Engineering: The following are some techniques for feature
engineering:
▪ Numbers. Tokens with standard lengths are identified and
converted into a token such as /numberX/. Four-digit numbers
may be associated with years and are assigned a value of
/number4/.
▪ N-grams. These are multiword patterns, and if they are useful, the word order is preserved. For example, the words "monetary policy" may be best kept as a sequence rather than broken into two different tokens, and therefore would be replaced by a single token, monetary_policy.
3. Data Exploration – Unstructured Data
▪ Feature Engineering: The following are some techniques for feature
engineering:
▪ Named entity recognition (NER). NER algorithms search for token values, in the context in which they are used, against their internal library and assign an NER tag to the token. For example, Microsoft would be assigned an NER tag of ORG, and Europe would be assigned an NER tag of PLACE.
▪ Parts of speech (POS). This uses language structure dictionaries to contextually assign a POS tag to each token. For example, "market" could be a verb, as in "to market," or a noun, as in "in the market."
4. Model Training
▪ The three tasks of ML model training are method selection,
performance evaluation, and tuning.
4. Model Training – Method Selection
▪ Method selection is the art and science of choosing the appropriate
ML method (i.e., algorithm) given the objectives and data
characteristics. Method selection is based on the following factors:
▪ Supervised or unsupervised learning. Supervised learning is
used when the training data contains the ground truth or the
known outcome (i.e., the target variable). Unsupervised
learning occurs when there is no target variable.
4. Model Training – Method Selection
▪ Type of data. For numerical data (e.g., predicting earnings) we
may use classification and regression tree (CART) methods.
For text data, we can use generalized linear models (GLMs) and
SVMs. For image data, neural networks and deep learning
methods can be employed.
▪ Size of data. Wide data sets with many features but fewer observations can be handled well with SVMs, whereas neural networks tend to work better with a large number of observations but fewer features.
4. Model Training – Method Selection
▪ Once a method is selected, the researcher has to specify appropriate hyperparameters, for example, the number of neighbors k in the k-nearest neighbors (KNN) algorithm.
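As a concrete illustration of setting a hyperparameter, here is a minimal scikit-learn sketch that fixes k for a KNN classifier before training (the toy features and labels are hypothetical):

from sklearn.neighbors import KNeighborsClassifier

# Toy training data: two numeric features and a binary target (e.g., 1 = default).
X_train = [[0.2, 1.0], [0.4, 0.9], [0.8, 0.2], [0.9, 0.1]]
y_train = [0, 0, 1, 1]

# The hyperparameter n_neighbors (k) is chosen by the researcher, not learned from data.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

print(model.predict([[0.5, 0.5]]))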
▪ Class imbalance occurs when one class has a large number of observations relative to other classes. For example, in a model for predicting customer default, if the data set contains mostly high-credit-score customers, then the model would be more likely to predict nondefault for any new customer → the training data set should contain a variety of customers so that it has enough diversity to make correct predictions.
4. Model Training – Method Selection
▪ One way to overcome class imbalance is to undersample the
overrepresented class and oversample the underrepresented class.
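A minimal sketch of undersampling the overrepresented class and oversampling the underrepresented class with pandas (the DataFrame, class labels, and target counts are hypothetical):

import pandas as pd

# Hypothetical imbalanced data: many nondefaulters (0), few defaulters (1).
df = pd.DataFrame({"feature": range(10), "default": [0] * 8 + [1] * 2})

majority = df[df["default"] == 0]
minority = df[df["default"] == 1]

# Undersample the majority class without replacement; oversample the minority with replacement.
majority_down = majority.sample(n=4, replace=False, random_state=42)
minority_up = minority.sample(n=4, replace=True, random_state=42)

balanced = pd.concat([majority_down, minority_up])
print(balanced["default"].value_counts())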
4. Model Training – Performance Evaluation
▪ It is important to measure the model training performance or
goodness of fit for validation of the model.
▪ For classification problems, we can use error analysis.
▪ For predicting a continuous variable (like a stock price), the root mean square error (RMSE) can be used.
▪ Errors in classification problems can be false positives (Type I errors) or false negatives (Type II errors). A confusion matrix summarizes the results of a classification problem. Tip: think of the confusion matrix like a Covid test: 0 = no Covid, 1 = has Covid.
4. Model Training – Performance Evaluation
▪ For example, if someone actually has Covid (actual label = 1) but the test kit returns "Negative" (predicted label = 0, i.e., no Covid), then the test result is a false negative, or a Type II error.
4. Model Training – Performance Evaluation
▪ Metrics such as precision (the ratio of true positives to all predicted
positives) and recall (the ratio of TPs to all actual positives) can be
used.
▪ While FPs and FNs are both errors, they may not be equally important; which one matters more is a business decision. For example, a lender may want to avoid lending to potential defaulters, and so will want to maximize recall.
▪ High precision is valued when the cost of a type I error is large,
while high recall is valued when the cost of a type II error is large.
4. Model Training – Performance Evaluation
▪ Trading off precision and recall is subject to business decisions and
model application → additional evaluation metrics that provide the
overall performance of the model are generally used.
▪ Accuracy is the percentage of correctly predicted classes out of total
predictions. F1 score is the harmonic mean of precision and recall.
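A minimal sketch of accuracy, precision, recall, and F1 derived from a confusion matrix, using scikit-learn on hypothetical actual and predicted labels:

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: 1 = positive class (e.g., has Covid), 0 = negative class.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()
print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)

# Precision = TP / (TP + FP); Recall = TP / (TP + FN);
# Accuracy = (TP + TN) / total; F1 = harmonic mean of precision and recall.
print("precision:", precision_score(actual, predicted))
print("recall:", recall_score(actual, predicted))
print("accuracy:", accuracy_score(actual, predicted))
print("f1:", f1_score(actual, predicted))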
4. Model Training – Performance Evaluation
▪ For predicting a continuous variable (like a stock price), the root mean square error (RMSE) can be used. This is useful for predictions that are continuous, such as those from regression models. The RMSE is a single metric summarizing the prediction error in a sample.
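A minimal RMSE sketch, where RMSE = sqrt(mean((predicted − actual)^2)), applied to hypothetical predictions of a continuous target:

import math

# Hypothetical actual and predicted values (e.g., stock prices).
actual = [100.0, 102.5, 98.0, 101.0]
predicted = [101.0, 101.5, 99.5, 100.0]

# RMSE: square the errors, average them, then take the square root.
squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

print(round(rmse, 4))  # 1.1456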
4. Model Training – Model Tuning
▪ After model evaluation, the model needs to be revised until it
reaches an acceptable performance level. It is necessary to find an
optimum trade-off between bias and variance errors, such that the
model is neither underfitting nor overfitting.
▪ Tuning involves altering the hyperparameters until a desirable level
of model performance is achieved.