Sia2206 Data Analytics Notes
Topics Covered:
TOPIC 1.
Qualitative Analysis
Involves analyzing non-numerical data like text, audio, or video. It’s used to
interpret meanings, opinions, and themes.
Examples:
NVivo
Manual coding
Thematic analysis
Content analysis
Quantitative Analysis
Focuses on numeric data, using statistical or mathematical models to
analyze and interpret measurable variables.
Common Techniques:
Regression analysis
Probability distributions
Hypothesis testing
Data Democratization: Making data more accessible and usable across all
departments.
TOPIC 2.
Types of Data
1. Structured Data
Organized in a fixed schema of rows and columns, as in a table.
Examples:
Relational databases
Spreadsheets
2. Semi-Structured Data
Doesn’t conform to rigid table formats but has organizational markers (tags,
keys).
Examples:
NoSQL databases
Web logs
3. Unstructured Data
No predefined structure.
Examples:
Emails
Images
Videos
Audio recordings
ETL (Extract, Transform, Load)
1. Extract – Pulling raw data from various sources (databases, APIs, files).
2. Transform – Cleaning and formatting data to make it usable. This
includes:
Removing duplicates
Normalizing values
3. Load – Writing the transformed data into its destination, such as a data
warehouse or BI tool.
Example:
Extract product data from Shopify → Clean in Python (remove missing prices)
→ Load into Power BI for visualization.
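The Extract → Transform → Load flow above can be sketched in a few lines of Python. The product records here are hypothetical stand-ins, not output from a real Shopify API:

```python
# Minimal ETL sketch with hypothetical product records (not a real Shopify
# API call): extract, transform (drop missing prices, deduplicate), load.

# Extract: in practice this data would come from an API or database export.
raw_products = [
    {"sku": "A1", "price": 19.99},
    {"sku": "A1", "price": 19.99},   # duplicate record
    {"sku": "B2", "price": None},    # missing price
    {"sku": "C3", "price": 4.50},
]

def transform(records):
    """Remove rows with missing prices, then drop duplicate SKUs."""
    seen, clean = set(), []
    for r in records:
        if r["price"] is None:
            continue                  # drop rows with missing prices
        if r["sku"] in seen:
            continue                  # drop duplicate SKUs
        seen.add(r["sku"])
        clean.append(r)
    return clean

# Load: here we simply collect the clean rows; a real pipeline would write
# them to a warehouse or a BI tool such as Power BI.
loaded = transform(raw_products)
print(loaded)
```

The same shape holds at any scale: only the extract and load ends change when real connectors replace the in-memory lists.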
TOPIC 3.
Types:
Example: Simulating customer arrival times and service times at a call center
to optimize staffing.
Classification:
Deterministic Models: The outcome is fully determined by the initial
conditions, and there is no randomness involved.
Stochastic Models: The outcome incorporates randomness, so repeated runs
with the same inputs can produce different results (as in the call-center
example above).
Simulation Tools
1. Monte Carlo Tools: Tools like @Risk and Crystal Ball can be used to
perform Monte Carlo simulations for risk analysis and decision support.
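A Monte Carlo simulation can be sketched with the standard library alone; dedicated tools like @Risk and Crystal Ball wrap the same idea. The cost model and its parameters below are illustrative assumptions:

```python
import random

# Monte Carlo sketch: estimate the risk that an uncertain project cost
# exceeds a budget. The distributions and figures are illustrative
# assumptions, not taken from @Risk or Crystal Ball.
random.seed(42)  # fixed seed so the run is reproducible

BUDGET = 120_000
N_TRIALS = 10_000

over_budget = 0
for _ in range(N_TRIALS):
    labour = random.gauss(60_000, 8_000)        # uncertain labour cost
    materials = random.uniform(30_000, 50_000)  # uncertain materials cost
    if labour + materials > BUDGET:
        over_budget += 1

risk = over_budget / N_TRIALS
print(f"Estimated probability of exceeding budget: {risk:.2f}")
```

More trials tighten the estimate; the output is a probability, which is exactly the kind of risk figure these tools feed into decision support.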
TOPIC 4
Sampling
Definition: Sampling is the process of selecting a subset of individuals or
observations from a larger population to estimate characteristics of the
whole population.
Why Sampling?
Time-saving: Sampling allows for quicker results since collecting data from
the entire population can be time-consuming.
Types of Sampling:
Systematic Sampling – Example: Selecting every 10th customer from a queue to
survey about their experience.
Cluster Sampling – Example: Randomly selecting a few schools from a district
and surveying all students in those schools.
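The every-10th-customer example, alongside simple random sampling for contrast, can be sketched as follows (customer IDs are illustrative):

```python
import random

# Sketch of two sampling schemes over a hypothetical queue of 100 customers.
customers = list(range(1, 101))  # customer IDs 1..100

# Systematic sampling: every 10th customer in order of arrival.
systematic = customers[9::10]

# Simple random sampling: 10 customers chosen uniformly at random.
random.seed(0)
simple_random = random.sample(customers, 10)

print(systematic)      # every 10th ID: 10, 20, ..., 100
print(simple_random)
```

Both schemes return samples of the same size, but systematic sampling is deterministic once the starting point is fixed, while simple random sampling changes from run to run unless seeded.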
Data Collection
2. Interviews:
4. Existing Data:
5. Experiments:
1. Data Cleaning:
Detecting and correcting errors, duplicates, and missing values in the data.
Exploratory Data Analysis (EDA):
Using visual and statistical methods to explore and understand the data
before applying complex modeling techniques.
4. Predictive Analysis:
Example: Using past sales data to forecast future demand for a product.
5. Descriptive Analysis:
1. Bar Charts:
Example: A bar chart showing the number of sales for each product category.
2. Histograms:
Similar to bar charts but used for continuous data. They show the distribution
of a single variable.
4. Line Charts:
Used to display data trends over time, where each point represents a data
value at a specific time.
Example: A line chart showing the stock price of a company over the past
year.
5. Scatter Plots:
7. Box Plots:
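A box plot is drawn from a five-number summary (minimum, Q1, median, Q3, maximum). Those statistics can be computed with the standard library; the data values below are illustrative:

```python
import statistics

# Five-number summary behind a box plot, computed on illustrative data.
data = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]

# statistics.quantiles with n=4 returns the three quartile cut points
# (default "exclusive" method).
q1, q2, q3 = statistics.quantiles(data, n=4)

print("min:", min(data))
print("Q1:", q1)
print("median:", q2)
print("Q3:", q3)
print("max:", max(data))
```

The box spans Q1 to Q3 (the interquartile range), the line inside it marks the median, and the whiskers extend toward the minimum and maximum.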
TOPIC 5.
Statistical Analysis
Sampling Distributions
Central Limit Theorem (CLT): The Central Limit Theorem states that the
distribution of the sample mean will be approximately normal (bell-shaped) if
the sample size is sufficiently large, regardless of the shape of the population
distribution. This is fundamental in statistical inference.
Example: If you take many random samples from a population and calculate
the mean for each sample, the distribution of those sample means will
approach a normal distribution as the sample size increases.
Standard Error: The standard error is the standard deviation of the sampling
distribution. It represents the variability of the sample statistic.
Formula: SE = σ / √n
Where:
σ = population standard deviation
n = sample size
Example: If the population mean income is $50,000 with a standard
deviation of $5,000, and a sample of 100 people is taken, the standard error
of the mean is SE = 5,000 / √100 = $500.
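The CLT and the standard-error formula can be checked empirically: draw many samples, compute each sample mean, and compare the spread of those means with σ / √n. The population parameters mirror the income example and are illustrative:

```python
import math
import random
import statistics

# Empirical check of the standard-error formula: the standard deviation of
# many sample means should be close to sigma / sqrt(n).
random.seed(1)
MU, SIGMA, N, N_SAMPLES = 50_000, 5_000, 100, 2_000

sample_means = [
    statistics.mean(random.gauss(MU, SIGMA) for _ in range(N))
    for _ in range(N_SAMPLES)
]

theoretical_se = SIGMA / math.sqrt(N)          # 5,000 / 10 = 500
empirical_se = statistics.stdev(sample_means)  # should be close to 500
print(round(theoretical_se), round(empirical_se))
```

Increasing N_SAMPLES pulls the empirical value ever closer to the theoretical 500, and a histogram of sample_means would look approximately normal, as the CLT predicts.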
1. Random Variable:
A variable whose value is determined by the outcome of a random process.
Types:
Discrete Random Variable: Can take on only countable values. E.g., the
number of customers arriving in an hour.
Continuous Random Variable: Can take on any value within a given range.
E.g., height of a person or temperature at a specific location.
2. Probability Distribution:
A probability distribution describes the likelihood of each possible outcome of
a random variable.
Types of Distributions:
Binomial Distribution (for discrete variables): Used when there are two
possible outcomes, such as success/failure in trials.
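The binomial probability of exactly k successes in n trials follows the standard formula P(X = k) = C(n, k) · pᵏ · (1 − p)ⁿ⁻ᵏ, which is short to compute directly:

```python
import math

# Binomial pmf: probability of exactly k successes in n independent trials
# with success probability p.
def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 5 heads in 10 fair coin flips.
print(round(binomial_pmf(5, 10, 0.5), 4))  # 0.2461
```

Summing the pmf over k = 0..n gives 1, since the outcomes are exhaustive and mutually exclusive.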
3. Statistical Functions:
Mean: The average of all values.
Formula: x̄ = (Σxᵢ) / n
Variance: The average of the squared deviations from the mean.
Formula: σ² = Σ(xᵢ − x̄)² / n
Standard Deviation: The square root of the variance; it gives a sense of how
spread out the data is.
Formula: σ = √(Σ(xᵢ − x̄)² / n)
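These three functions can be computed by hand from their formulas and checked against the statistics module; the data values are illustrative:

```python
import statistics

# Mean, population variance, and standard deviation computed by hand and
# verified against the statistics module.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

mean = sum(data) / n
variance = sum((x - mean) ** 2 for x in data) / n  # population variance
std_dev = variance ** 0.5

assert mean == statistics.mean(data)
assert variance == statistics.pvariance(data)
print(mean, variance, std_dev)  # 5.0 4.0 2.0
```

Note that statistics.pvariance divides by n (population variance), while statistics.variance divides by n − 1 (sample variance); the by-hand formula above matches the population version.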
Statistical Inference
1. Point Estimation:
2. Confidence Intervals:
Formula: CI = x̄ ± z · (σ / √n)
3. Hypothesis Testing:
Steps:
1. State the null hypothesis (H₀) and the alternative hypothesis (H₁).
2. Choose a significance level (α), commonly 0.05.
3. Compute the test statistic from the sample data.
4. Find the p-value for the test statistic.
5. Decision: If the p-value is less than the significance level (α), reject the
null hypothesis.
Example: Testing whether a new drug is effective. Null hypothesis: The drug
has no effect. Alternative hypothesis: The drug is effective. If the p-value is
less than 0.05, reject the null hypothesis.
4. Types of Tests:
Z-test: Used when the sample size is large (n > 30) and the population
variance is known.
T-test: Used when the sample size is small (n < 30) and the population
variance is unknown.
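A one-sample z-test (large n, known population variance) can be sketched with the standard library, using the error function for the normal CDF. The sample figures below are illustrative:

```python
import math

# One-sample, two-sided z-test sketch (population variance known, large n).
def z_test_p_value(sample_mean, mu0, sigma, n):
    """Two-sided p-value for H0: population mean == mu0."""
    se = sigma / math.sqrt(n)
    z = (sample_mean - mu0) / se
    # Standard normal CDF via the error function.
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return 2 * (1 - phi)

# Sample of 100 with mean 51,000 vs. a claimed mean of 50,000, sigma 5,000.
p = z_test_p_value(51_000, 50_000, 5_000, 100)
print(round(p, 4))  # z = 2.0, p ≈ 0.0455
```

Since p ≈ 0.0455 < 0.05, the null hypothesis would be rejected at the 5% significance level.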
______________________________________________________________________________
TOPIC 6.
Text Analytics
Goal of NLP (Natural Language Processing): To convert human language into a
form that machines can process, and then extract useful insights from that
data.
1. Tokenization:
The process of splitting text into individual units, such as words or phrases,
known as tokens.
2. Stop Words:
Common words (such as “the”, “and”, “is”) that do not carry significant
meaning and are often removed in text analysis to focus on the more
important terms.
3. Stemming:
The process of crudely chopping words down to a root form. For example,
“running” becomes “run”; the resulting stems need not be real words.
4. Lemmatization:
Reducing words to their dictionary base form (lemma) using vocabulary and
grammar, so “better” becomes “good” and “running” becomes “run”.
5. Part-of-Speech (POS) Tagging:
Labeling each token with its grammatical role.
Example: In the sentence “The dog runs fast”, the POS tags would be:
[(“The”, “DT”), (“dog”, “NN”), (“runs”, “VBZ”), (“fast”, “RB”)].
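Tokenization, stop-word removal, and stemming can be illustrated with a toy pipeline. Real projects would use a library such as NLTK or spaCy; the stop-word list and suffix-stripping stemmer here are simplified stand-ins:

```python
# Toy preprocessing pipeline: tokenization, stop-word removal, and a crude
# suffix-stripping stemmer. The stop-word list is a small illustrative set.
STOP_WORDS = {"the", "and", "is", "a", "of"}

def tokenize(text):
    """Split text into lowercase word tokens."""
    return text.lower().split()

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Strip a few common suffixes; stems need not be real words."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = remove_stop_words(tokenize("The dog is running fast"))
print([stem(t) for t in tokens])  # ['dog', 'runn', 'fast']
```

The output shows why stemming is considered crude: “running” becomes the non-word “runn”, whereas a lemmatizer would return the dictionary form “run”.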
Text Analytics Methods
Definition: Text analytics methods are the techniques and algorithms used to
analyze and extract meaningful patterns and insights from text data.
1. Sentiment Analysis:
2. Topic Modeling:
3. Text Classification:
4. Named Entity Recognition (NER):
Identifying and classifying named entities such as people, organizations,
and dates in text.
Example: “Apple Inc. was founded by Steve Jobs in 1976” would result in
entities like “Apple Inc.” (organization), “Steve Jobs” (person), and “1976”
(date).
5. Text Summarization:
Key Applications:
Text analytics can be applied to legal documents, contracts, and case law to
identify key terms, clauses, and precedents that are relevant to a case.
Definition: Text analytics tools are software platforms that facilitate the
process of analyzing, processing, and extracting insights from textual data.
2. spaCy:
A fast and efficient NLP library for Python, often used for large-scale text
processing tasks such as tokenization, named entity recognition, and
dependency parsing.
3. TextBlob:
4. Apache OpenNLP:
An open-source machine learning-based toolkit for processing natural
language text. It supports various tasks such as tokenization, POS tagging,
and parsing.
Example: Analyzing customer feedback from surveys using the Google Cloud
Natural Language API.
TOPIC 7.
Predictive Analytics
1. Customer Segmentation:
o Businesses use predictive analytics to segment customers into
groups based on their likelihood to respond to certain offers,
make purchases, or engage with the brand.
o Example: Retailers predicting which customers are most likely to
buy specific products during a sale.
2. Risk Management:
o Predictive analytics helps organizations assess and manage risks
by forecasting potential threats or financial losses.
o Example: Banks using predictive models to assess the likelihood
of loan defaults based on borrower history.
4. Fraud Detection:
o Predictive analytics models can analyze patterns in transactional
data to identify anomalies that may indicate fraudulent activities.
o Example: Credit card companies predicting fraudulent
transactions by analyzing spending behavior patterns.
Predictive analytics relies on various models, each suited for different types
of problems. Some common predictive models include:
1. Linear Regression:
o A statistical method used to model the relationship between a
dependent variable and one or more independent variables. It is
often used for predicting continuous outcomes.
o Example: Predicting the sales revenue of a product based on its
advertising spend.
2. Logistic Regression:
o A type of regression used for binary classification problems,
where the outcome is one of two categories (e.g., yes/no,
win/lose).
o Example: Predicting whether a customer will purchase a product
(yes/no) based on past behavior and demographics.
3. Decision Trees:
o A machine learning model that splits data into subsets based on
the value of input features to make predictions. It’s a popular
method for classification and regression tasks.
o Example: Predicting whether a loan application will be approved
based on features such as credit score, income, and loan
amount.
4. Random Forest:
o An ensemble method that uses multiple decision trees to
improve accuracy and reduce overfitting. It aggregates the
predictions from multiple trees to make a final prediction.
o Example: Predicting whether a customer will churn using various
behavioral and demographic features.
6. Neural Networks:
o A model inspired by the human brain, neural networks consist of
layers of nodes that process data through activation functions.
They are particularly useful for handling complex patterns and
large datasets.
o Example: Predicting stock market trends based on historical
prices and technical indicators.
8. Ensemble Methods:
o These methods combine multiple models to improve predictive
performance. Popular ensemble techniques include boosting,
bagging, and stacking.
o Example: Using an ensemble of decision trees and logistic
regression models to predict customer behavior.
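Majority voting, the simplest ensemble aggregation, can be sketched with toy rule-based “models”. In practice the voters would be trained models such as decision trees or a logistic regression; the rules and customer fields here are illustrative assumptions:

```python
from collections import Counter

# Toy majority-vote ensemble: combine predictions from several simple
# rule-based "models" into one final prediction.
def model_a(x):
    # Predicts churn for customers with low engagement.
    return "churn" if x["visits"] < 3 else "stay"

def model_b(x):
    # Predicts churn for long-tenure customers with few visits.
    return "churn" if x["visits"] < 5 and x["tenure"] > 24 else "stay"

def model_c(x):
    # A deliberately optimistic baseline.
    return "stay"

def ensemble_predict(x, models):
    """Return the majority vote across all models."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

customer = {"visits": 2, "tenure": 36}
print(ensemble_predict(customer, [model_a, model_b, model_c]))  # churn
```

Two of the three models vote “churn”, so the ensemble predicts churn; boosting and stacking refine this idea by weighting or learning how to combine the individual predictions.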