Module 2
DATA COLLECTION
• Why collect data? Data collection (DC) is defined as the “process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes.”
• There are numerous reasons for data collection
DATA COLLECTION
• To identify the dependent variables (DV) and independent variables (IV)
• To study the relationships between the DV and the IVs, and among the variables themselves
• As part of data collection, this step also involves eliminating NULL values and outliers
DATA COLLECTION
• Step 1 in the Data Science process
• Consumes 50-55% of project time
• Methods :
• Questionnaire
• Survey
• Interview
• Observations
• Focus group discussion
QUESTIONNAIRE
• “What is your favorite product?”
• “Why did you purchase this product?”
• “How satisfied are you with [product]?”
• "Would you recommend [product] to a friend?"
• “Would you recommend [company name] to a
friend?”
SURVEY
• How to Write Good Survey Questions
• Write questions that are simple and to the point.
• Use words with clear meanings.
• Limit the number of ranking options.
• In a multiple choice question, cover all options
without overlapping.
• Avoid double-barreled questions.
• Offer an “out” for questions that don't apply.
• Avoid offering too few or too many options.
INTERVIEW
• Tell me about yourself.
• What are your strengths?
• What are your weaknesses?
• Why do you want this job?
• Where would you like to be in your career five
years from now?
• What's your ideal company?
• What attracted you to this company?
• Why should we hire you?
OBSERVATION
• Observe, describe, question.
FOCUS GROUP DISCUSSION
• A Focus Group Discussion (FGD) is a qualitative
research method and data collection technique in
which a selected group of people discusses a
given topic or issue in-depth, facilitated by a
professional, external moderator. This method
serves to solicit participants’ attitudes and
perceptions, knowledge and experiences, and
practices, shared in the course of interaction with
different people
DATA COLLECTION
• Tools (Real Time)
• Kafka
• Azure Event Hubs
• AWS Kinesis
• Google Cloud Dataflow
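For illustration, here is a minimal sketch of pushing one collected record into such a real-time pipeline, assuming a local Apache Kafka broker, a topic named "events", and the kafka-python package (none of these are specified on the slide):

import json
from kafka import KafkaProducer

# Assumes a Kafka broker listening on localhost:9092
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

# Send one collected data point to the "events" topic
producer.send("events", {"user_id": 42, "action": "page_view"})
producer.flush()  # block until the message is actually delivered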
DATA COLLECTION - SKILLS
• SQL
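As a small illustration of how SQL is used when collecting data in Python (the database file, table, and columns below are hypothetical; the slide only lists the skill):

import sqlite3
import pandas as pd

# Pull collected records out of a (hypothetical) SQLite database into pandas
conn = sqlite3.connect("collected_data.db")
df = pd.read_sql_query("SELECT respondent_id, answer FROM survey_responses", conn)
print(df.head())
conn.close()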
DATA CLEANING
• Pythonic Data Cleaning With Pandas and NumPy
• Dropping Columns in a DataFrame.
• Changing the Index of a DataFrame.
• Tidying up Fields in the Data.
• Combining str Methods with NumPy to Clean Columns.
• Cleaning the Entire Dataset Using the applymap Function.
• Renaming Columns and Skipping Rows.
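A short sketch of the first few of these operations; the file and column names are made up, since the slide does not fix a dataset:

import pandas as pd

df = pd.read_csv("books.csv")  # hypothetical input file

# Dropping columns that are not needed for the analysis
df = df.drop(columns=["Edition Statement", "Corporate Author"])

# Changing the index to a unique identifier column
df = df.set_index("Identifier")

# Renaming columns to shorter, consistent labels
df = df.rename(columns={"Place of Publication": "place", "Date of Publication": "year"})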
SOURCES OF MISSING VALUES
• Here are some typical reasons why data is missing:
• User forgot to fill in a field.
• Data was lost while transferring manually from a
legacy database.
• There was a programming error.
• Users chose not to fill out a field tied to their beliefs
about how the results would be used or interpreted.
DATA CLEANING - AN EXAMPLE
DATA CLEANING – PYTHON
CODE
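The code pictured on this slide is not reproduced here; a minimal reconstruction of the loading step could look like the following, where the file name and the list of missing-value markers are assumptions:

import pandas as pd

# Treat common placeholder strings as missing values while reading the file
missing_markers = ["n/a", "na", "--"]
df = pd.read_csv("property_data.csv", na_values=missing_markers)
print(df.head())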
LOOKING FOR NULL VALUES
• # Looking at the OWN_OCCUPIED column
print(df['OWN_OCCUPIED'])
print(df['OWN_OCCUPIED'].isnull())
SUMMARISING MISSING VALUES
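The summary output pictured on this slide is not reproduced here; the usual pandas calls for summarising missing values (continuing with the df loaded above) are:

# Count missing values per column
print(df.isnull().sum())

# Total number of missing values in the whole DataFrame
print(df.isnull().sum().sum())

# Quick check: are there any missing values at all?
print(df.isnull().values.any())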
REPLACING MISSING VALUES
• Fill in missing values with a single value.
• # Replace missing values with a number
df['ST_NUM'].fillna(125, inplace=True)
REPLACING MISSING VALUES
• Location-based imputation. Here’s how you would do that:
• # Location based replacement
df.loc[2,'ST_NUM'] = 125
DATA MUNGING – RECAP OF THE NEED
• Data wrangling, sometimes referred to as data
munging, is the process of transforming and
mapping data from one "raw" data form into
another format with the intent of making it more
appropriate and valuable for a variety of
downstream purposes such as analytics.
DATA MUNGING - STEPS
• Read the data
• df = pd.read_csv('data/ava-all.csv')
• Find data types
• df.dtypes
• Check the shape of the data
• df.shape
(number of rows and columns)
DATA MUNGING - STEPS
• Fill ‘na’ values with zero
• df['Buried - Fully:'].fillna(0).describe()
DATA MUNGING - STEPS
• Here is a function that takes a string as input and tries to coerce it to a number of inches:
>>> import re
>>> def to_inches(orig):
...     txt = str(orig)
...     if txt == 'nan':
...         return orig
...     reg = r'''(((\d*\.)?\d*)')?(((\d*\.)?\d*)")?'''
...     mo = re.search(reg, txt)
...     feet = mo.group(2) or 0
...     inches = mo.group(5) or 0
...     return float(feet) * 12 + float(inches)
DATA MUNGING - STEPS
• The to_inches function returns NaN unchanged if that is what comes in as the orig parameter.
• Otherwise, it looks for optional feet (numbers
followed by a single quote) and optional inches
(numbers followed by a double quote).
• It casts these to floating point numbers and
multiplies the feet by twelve.
• Finally, it returns the sum.
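For illustration (these calls are not on the original slide), a quick check of the function on two strings:
>>> to_inches("6'2\"")   # 6 feet 2 inches -> 6 * 12 + 2
74.0
>>> to_inches('49"')     # inches only, no feet part
49.0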
Note
• Regular expressions could fill up a book on their
own.
• A few things to note. We use raw strings to specify them (they have an r at the front), as raw strings don't interpret the backslash as an escape character.
• This is important because the backslash has special meaning in regular expressions: \d means match a digit.
Note
• The parentheses are used to specify groups. After invoking the search function, we get a match object as the result (mo in the code above).
• The .group method pulls out the match inside of the group. mo.group(2) looks for the second left parenthesis and returns the match inside of those parentheses. mo.group(5) looks for the fifth left parenthesis and returns the match inside of it.
• Normally Python is zero-based, where we start counting from zero, but in the case of regular expression groups, we start counting at one.
• The first left parenthesis indicates where the first group starts, group
one, not zero.
WEB SCRAPING
• Web “scraping” (also called “web harvesting,”
“web data extraction,” or even “web data
mining”), can be defined as “the construction of
an agent to download, parse, and organize data
from the web in an automated manner.”
WEB SCRAPING
• Or, in other words: instead of a human end user
clicking away in a web browser and copy-pasting
interesting parts into, say, a spreadsheet, web
scraping offloads this task to a computer
program that can execute it much faster, and
more correctly, than a human can.
Why Web Scraping for Data
Science?
• When surfing the web using a normal web browser,
you’ve probably encountered multiple sites where you
considered the possibility of gathering, storing, and
analysing the data presented on the site’s pages.
• Especially for data scientists, whose “raw material” is
data, the web exposes a lot of interesting
opportunities:
• There might be an interesting table on a Wikipedia page (or
pages) you want to retrieve to perform some
statistical analysis.
Why Web Scraping for Data
Science?
• Perhaps you want to get a list of reviews from a movie site to
perform text mining, create a recommendation
engine, or build a predictive model to spot fake
reviews.
• You might wish to get a listing of properties on a real-estate site to build an appealing geo-visualization.
• You’d like to gather additional features to enrich your
data set based on information found on the web, say,
weather information to forecast, for example, soft
drink sales.
Why Web Scraping for Data
Science?
• You might be wondering about doing social network
analytics using profile data found on a web forum.
• It might be interesting to monitor a news site for
trending new stories on a particular topic of interest.
FROM WEB SCRAPING TO WEB CRAWLING
• The difference between “web scraping” and “web
crawling” is relatively vague, as many authors and
programmers will use both terms interchangeably.
• In general terms, the term “crawler” indicates a program’s
ability to navigate web pages on its own, perhaps even
without a well-defined end goal or purpose, endlessly
exploring what a site or the web has to offer.
• Web crawlers are heavily used by search engines like
Google to retrieve contents for a URL, examine that page
for other links, retrieve the URLs for those links, and so
on…
Process of Web Scraping
• Make a request to the target URL with the requests module.
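A minimal sketch of this step and a typical follow-up parse, assuming the requests and beautifulsoup4 packages and an illustrative URL (the slide itself only mentions the requests call):

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Data_science"  # illustrative URL
response = requests.get(url)
response.raise_for_status()  # stop early on HTTP errors

# Parse the HTML and pull out something simple: the page title and a few links
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.get_text())
for link in soup.find_all("a", href=True)[:5]:
    print(link["href"])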
Dimensionality Reduction
• Dimensionality Reduction (DR) is a pre-processing step that removes redundant features and noisy, irrelevant data in order to improve learning accuracy and reduce training time.
Benefits of Dimension Reduction
• It helps in compressing the data and reducing the storage space required
• It reduces computation time.
• It shortens the time required for performing the same computations. Fewer dimensions lead to less computing, and fewer dimensions also allow the use of algorithms unfit for a large number of dimensions
• It takes care of multi-collinearity, which improves model performance.
• It removes redundant features. For example: there is no point in
storing a value in two different units (meters and inches).
• Reducing the dimensions of data to 2D or 3D may allow us to plot
and visualize it precisely. You can then observe patterns more
clearly.
Feature Selection
• Feature Selection is the process where you automatically or manually select those features which contribute most to the prediction variable or output in which you are interested.
FEATURE SELECTION
• Feature selection can be done in multiple ways, but broadly there are 3 categories of it (a small filter-method sketch follows the list):
1. Filter Method
2. Wrapper Method
3. Embedded Method
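A hedged sketch of the filter method using scikit-learn; the library and the iris data set are assumptions, since the slide only names the three categories:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Filter method: score each feature against the target and keep the best k
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)   # univariate ANOVA F-scores, one per feature
print(X_selected.shape)   # (150, 2): two features retained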
Feature Extraction
• Feature extraction (FE) methods derive new features from the original dataset.
• It is very beneficial when we want to decrease the number of resources required for processing without losing relevant information from the dataset.
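For instance, PCA (introduced below) is a common feature extraction technique; a minimal scikit-learn sketch on made-up data (both the library choice and the data are assumptions):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # toy data: 100 samples, 5 original features

# Extract 2 new features (principal components) from the 5 original ones
pca = PCA(n_components=2)
X_new = pca.fit_transform(X)
print(X_new.shape)  # (100, 2)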
Principal Component Analysis
(PCA)
Principal component analysis (PCA) is a dimensionality reduction technique that enables you to identify correlations and patterns in a data set so that it can be transformed into a data set of significantly lower dimension with little loss of important information.
Steps of Principal Component Analysis
• Standardization of the data
• Computing the covariance matrix
• Calculating the eigenvectors and eigenvalues
• Computing the Principal Components
• Reducing the dimensions of the data set
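These steps can be followed directly with NumPy; a small sketch on made-up data, keeping 2 dimensions (both choices are assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # toy data: 100 samples, 3 variables

# 1. Standardization of the data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Computing the covariance matrix
cov = np.cov(X_std, rowvar=False)

# 3. Calculating the eigenvectors and eigenvalues
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Computing the principal components: order eigenvectors by decreasing eigenvalue
order = np.argsort(eigvals)[::-1]
eigvecs = eigvecs[:, order]

# 5. Reducing the dimensions: project onto the top-2 eigenvectors
X_reduced = X_std @ eigvecs[:, :2]
print(X_reduced.shape)  # (100, 2)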
Standardization of the data
Consider an example: let’s say that we have 2 variables in our data set, one with values ranging between 10 and 100 and the other with values between 1000 and 5000. In such a scenario, the output calculated by using these predictor variables is going to be biased, since the variable with the larger range will have a more obvious impact on the outcome. Standardization removes this bias by rescaling each variable to zero mean and unit variance, i.e., z = (x − mean) / standard deviation.
Computing the covariance matrix
Calculating the Eigenvectors and Eigenvalues
• An eigenvector is a vector whose direction remains
unchanged when a linear transformation is applied to it.
• For every eigenvector there is an eigenvalue. The dimensions in the data determine the number of eigenvectors that you need to calculate.
• Consider a 2-Dimensional data set, for which 2 eigenvectors
(and their respective eigenvalues) are computed. The idea
behind eigenvectors is to use the Covariance matrix to
understand where in the data there is the most amount of
variance. Since more variance in the data denotes more
information about the data, eigenvectors are used to identify
and compute Principal Components.
Computing the Principal Components
Reducing the dimensions of the data set
• The last step in performing PCA is to re-arrange the original data in terms of the final principal components, which represent the maximum and the most significant information of the data set. In order to replace the original data axes with the newly formed principal components, you simply multiply the transpose of the original data set by the transpose of the obtained feature vector.
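In code, the projection described here is (continuing the NumPy sketch above, where eigvecs[:, :2] plays the role of the feature vector):

feature_vector = eigvecs[:, :2]
# Multiply the transpose of the feature vector by the transpose of the standardized data,
# then transpose back so that rows are samples again
X_reduced = (feature_vector.T @ X_std.T).T  # identical to X_std @ feature_vector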