
Week 7

The document outlines the data life cycle, which includes phases from planning to destruction of data. It discusses various types of data bias, such as sampling and observer bias, and emphasizes the importance of data ethics, including ownership, consent, and privacy. Additionally, it covers exploratory data analysis (EDA) techniques for analyzing data patterns and relationships.

Uploaded by

Sobica Noor
Copyright: © All Rights Reserved

Data Science and Big Data Analytics

Week 7
Data life cycle
“Data Life Cycle refers to the journey of data from its inception
to its eventual destruction.”
Phases:
● Plan: Decide what kind of data is needed, how it will be
managed, and who will be responsible for it.
● Capture: Collect from a variety of different sources.
● Manage: Care for and maintain the data, including how and where it is stored.
● Analyze: Use the data to solve problems, make
decisions, and support business goals.
● Archive: Keep relevant data stored for long-term and
future reference.
● Destroy: Remove data from storage and delete any
shared copies of the data.
01 Data Bias
Data bias is a type of error that systematically skews
results in a certain direction.
Sampling Bias
Sampling bias is when a sample isn't representative of the population as a
whole.
● To avoid it, choose the sample at random so that every member of the
population has an equal chance of being selected.
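The randomization step can be sketched in a few lines of Python. This is only an illustration: the population of customer IDs, the sample size, and the helper name `simple_random_sample` are assumptions, not part of the original slides.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k members without replacement; every member is equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(population, k)

# Hypothetical population: 1,000 customer IDs ordered by sign-up date.
population = list(range(1000))

# Biased: taking the first 100 rows only ever sees the earliest sign-ups.
convenience_sample = population[:100]

# Randomized: any ID can be picked, so the sample is more likely to be
# representative of the whole population.
random_sample = simple_random_sample(population, 100, seed=42)
```

Simple random sampling is only one option. When subgroups matter, a stratified sample may be a better fit.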
Observer Bias
Observer bias, sometimes referred to as experimenter bias or research bias,
is the tendency for different people to observe the same things differently.
How to minimize observer bias
Interpretation Bias
The tendency to always interpret ambiguous situations in a positive or
negative way.
● Interpretation bias can lead to two people seeing or hearing the exact
same thing and interpreting it in different ways because they have
different backgrounds and experiences.
Confirmation Bias
Tendency to search for, or interpret information in a way that confirms
preexisting beliefs
● People see what they want to see.
02 Good Data
Quality Standard
If you follow the "R-O-C-C-C" method, you will have an organized strategy for
locating and selecting acceptable data sources, which may improve your
decision-making and analysis.

● R: Reliable
● O: Original
● C: Comprehensive
● C: Current
● C: Cited
What, then, would bad data be?
03 Data Ethics
Data Ethics
Data ethics covers the ethical and moral obligations of collecting, sharing, and
using data, focused on ensuring that data is used fairly, for good.

● Ownership
● Transaction transparency
● Consent
● Currency
● Privacy
● Openness
Ownership:
Who owns data?
It isn't the organization that invested time and money collecting, storing,
processing, and analyzing it. Individuals own the raw data they provide, and
they have primary control over its usage, how it's processed, and how it's shared.

Transaction Transparency:

The idea that all data processing activities and algorithms should be completely
explainable and understood by the individual who provides their data.
Consent:
This is an individual's right to know explicit details about how and why their data
will be used before agreeing to provide it.
They should know the answers to questions like:
● Why is the data being collected?
● How will it be used?
● How long will it be stored?

Currency:
Individuals should be aware of financial transactions resulting from the use of their
personal data and the scale of these transactions. If your data is helping to fund a
company's efforts, you should know what those efforts are all about and be given
the opportunity to opt out.
Privacy:
Privacy means preserving a data subject's information and activity any time a data
transaction occurs.

This means someone like you or me should have protection from unauthorized
access to our private data, freedom from inappropriate use of our data, the right to
inspect, update, or correct our data, the ability to give consent to the use of our
data, and the legal right to access our data.

Openness:
When referring to data, openness refers to free access, usage and sharing
of data.
Data Anonymization
Data anonymization is the process of protecting people's private or sensitive data
by eliminating personally identifiable information.

-> Involves blanking, hashing, or masking personal information, often by using
fixed-length codes to represent data columns, or hiding data with altered values.

● Personally identifiable information, or PII, is information that can
be used by itself or with other data to track down a person's identity.
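The three techniques named above (blanking, hashing, and masking) can be illustrated with a small Python sketch. The record layout, the salt value, and the helper names `hash_value`, `mask_value`, and `anonymize` are hypothetical and chosen only for illustration.

```python
import hashlib

def hash_value(value, salt):
    """Hashing: replace a value with a fixed-length (64-character) SHA-256 code."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

def mask_value(value, visible=4, fill="*"):
    """Masking: hide every character except the last `visible` ones."""
    if len(value) <= visible:
        return fill * len(value)
    return fill * (len(value) - visible) + value[-visible:]

def anonymize(record, salt):
    """Return a copy of the record with its PII fields altered."""
    return {
        "name": "",                                   # blanking
        "email": hash_value(record["email"], salt),   # hashing
        "phone": mask_value(record["phone"]),         # masking
        "purchase_total": record["purchase_total"],   # not PII, kept as-is
    }

# Hypothetical record used only for this example.
record = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "phone": "555-0100",
    "purchase_total": 42.5,
}
anonymized = anonymize(record, salt="hypothetical-salt")
```

A salted hash of a short or guessable value can still be reversed by trying candidates, so a sketch like this is a starting point rather than a guarantee of anonymity.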
De-identified data

● Telephone numbers
● Names
● License plates and license numbers
● Social security numbers
● IP addresses
● Medical records
● Email addresses
● Photographs
● Account numbers
Balancing Data Security and Access

Data security means protecting data from unauthorized access or
corruption by putting safety measures in place.

● Encryption
● Tokenization
● Version Control
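Tokenization replaces a sensitive value with a randomly generated stand-in (a token), while the mapping back to the original is kept in a separate, protected store. The sketch below shows the idea in Python. The `TokenVault` class is hypothetical, and a real system would keep the mapping in secured storage rather than in memory.

```python
import secrets

class TokenVault:
    """Hypothetical in-memory vault mapping tokens to original values."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        """Return the token for a value, creating a random one if needed."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = secrets.token_hex(8)  # 16 hex characters, unrelated to the value
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token):
        """Look up the original value; only code with vault access can do this."""
        return self._token_to_value[token]

vault = TokenVault()
token = vault.tokenize("4111-1111-1111-1111")  # made-up account number
```

Unlike a hash, the token carries no information about the original value, so it can be shared with analysts while the vault stays restricted.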
04 Exploratory Data Analysis
EDA
Exploratory Data Analysis (EDA) is an analysis approach used to investigate
data sets, identify general patterns in the data, and summarize their
main characteristics, often employing data visualization methods.

● Distribution of Data
● Graphical Representations
● Outlier Detection
● Correlation Analysis
● Handling Missing Values
● Summary Statistics
● Testing Assumptions
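A few of these steps (handling missing values, summary statistics, and outlier detection) can be sketched with the Python standard library. The measurements are made up, dropping missing values is only one possible strategy, and the 1.5 × IQR rule is a common convention that these slides do not prescribe.

```python
import statistics

# Hypothetical measurements; None marks a missing value.
raw = [12.0, 14.5, None, 13.2, 15.1, 98.0, 13.8, None, 14.0, 12.9]

# Handling missing values: here they are simply dropped.
values = [v for v in raw if v is not None]
missing_count = len(raw) - len(values)

# Summary statistics.
summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),
}

# Outlier detection with the 1.5 * IQR rule.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]
```

In this sample the mean is pulled well above the median by a single extreme value, which is the kind of pattern EDA is meant to surface before any modeling.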
Univariate Analysis:
Focuses on analyzing a single variable at a time.
● Purpose: To understand the variable's distribution, central tendency, and
spread.
● Techniques:
○ Descriptive statistics (mean, median, mode, variance, standard
deviation).
○ Visualizations (histograms, bar charts).
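As a minimal sketch of univariate analysis, the snippet below computes the descriptive statistics listed above for one variable and prints a text histogram. The `scores` data and the helpers `bin_counts` and `text_histogram` are assumptions made for this example.

```python
import statistics
from collections import Counter

# Hypothetical exam scores: the single variable under study.
scores = [55, 62, 62, 70, 71, 74, 78, 81, 85, 92]

central_tendency = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "mode": statistics.mode(scores),
}
spread = {
    "variance": statistics.variance(scores),  # sample variance (divides by n - 1)
    "stdev": statistics.stdev(scores),
}

def bin_counts(values, bin_width):
    """Count how many values fall in each bin, keyed by the bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

def text_histogram(values, bin_width):
    """Render the bin counts as one row of '#' marks per bin."""
    return "\n".join(
        f"{start:>3}-{start + bin_width - 1:<3} {'#' * n}"
        for start, n in bin_counts(values, bin_width).items()
    )

print(text_histogram(scores, 10))
```

A plotting library would normally draw the histogram. The text version is used here only to keep the example dependency-free.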
Bivariate Analysis:
Examines the relationship between two variables.
● Purpose: To understand how one variable affects or is associated with another.
● Techniques:
○ Correlation coefficients (Pearson, Spearman).
○ Cross-tabulations and contingency tables.
○ Visualizations (line plots, scatter plots, pair plots).
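The correlation coefficients and cross-tabulation mentioned above can be sketched in plain Python. The data are invented, and the helpers `pearson`, `ranks`, and `spearman` are written out by hand here for illustration. In practice a statistics library would usually provide them.

```python
from collections import Counter
from math import sqrt

def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: the relationship is monotonic but not linear.
hours = [1, 2, 3, 4, 5, 6]
score = [1, 4, 9, 16, 25, 36]
linear_strength = pearson(hours, score)
monotonic_strength = spearman(hours, score)

# Cross-tabulation of two categorical variables (made-up observations).
observations = [
    ("studied", "pass"), ("studied", "pass"), ("studied", "fail"),
    ("did not study", "fail"), ("did not study", "pass"), ("did not study", "fail"),
]
crosstab = Counter(observations)
```

Spearman works on ranks, so it reaches 1 for any strictly increasing relationship, while Pearson stays below 1 when the relationship is curved.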
Multivariate Analysis

Investigates interactions between three or more variables.


● Purpose: To understand the complex relationships and interactions in
the data.
● Techniques:
○ Multivariate plots (pair plots, parallel coordinates plots).
○ Dimensionality reduction techniques (PCA, t-SNE).
○ Cluster analysis.
○ Heatmaps and correlation matrices.
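A correlation matrix is the simplest of these techniques to show without extra libraries. The sketch below builds one for three made-up variables. The column names and the helper `correlation_matrix` are assumptions, and a heatmap would typically be drawn from the resulting matrix with a plotting library.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    """Pairwise Pearson correlations as a nested dict: matrix[a][b]."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical data set with three variables.
data = {
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [50, 58, 69, 77, 90],
    "commute_min": [30, 10, 45, 20, 35],
}
matrix = correlation_matrix(data)

# Print the matrix as a small table, one row per variable.
for row_name, row in matrix.items():
    cells = " ".join(f"{value:6.2f}" for value in row.values())
    print(f"{row_name:<12} {cells}")
```

Dimensionality reduction (PCA, t-SNE) and cluster analysis are not shown because they normally rely on numerical libraries beyond the standard library.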
