Module 2
DATA COLLECTION
• Why collect data? Data collection (DC) is defined as the “process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes.”
• There are numerous reasons for data collection
DATA COLLECTION
• To identify the dependent variables (DV) and independent variables (IV)
• To study the relationships between the DV and the IVs, and among the variables themselves
• As part of data collection, this step also involves eliminating NULL values and outliers
DATA COLLECTION
• Step 1 in the Data Science process
• Consumes 50-55% of project time
• Methods :
• Questionnaire
• Survey
• Interview
• Observations
• Focus group discussion
QUESTIONNAIRE
• “What is your favorite product?”
• “Why did you purchase this product?”
• “How satisfied are you with [product]?”
• "Would you recommend [product] to a friend?"
• “Would you recommend [company name] to a
friend?”
SURVEY
• How to Write Good Survey Questions
• Write questions that are simple and to the point.
• Use words with clear meanings.
• Limit the number of ranking options.
• In a multiple choice question, cover all options
without overlapping.
• Avoid double-barreled questions.
• Offer an “out” for questions that don't apply.
• Avoid offering too few or too many options.
INTERVIEW
• Tell me about yourself.
• What are your strengths?
• What are your weaknesses?
• Why do you want this job?
• Where would you like to be in your career five
years from now?
• What's your ideal company?
• What attracted you to this company?
• Why should we hire you?
OBSERVATION
• Observe, describe, question.
FOCUS GROUP DISCUSSION
• A Focus Group Discussion (FGD) is a qualitative
research method and data collection technique in
which a selected group of people discusses a
given topic or issue in-depth, facilitated by a
professional, external moderator. This method
serves to solicit participants’ attitudes and
perceptions, knowledge and experiences, and
practices, shared in the course of interaction with
different people
DATA COLLECTION
• Tools (Real Time)
• Kafka
• Azure Event Hubs
• AWS Kinesis
• Google Cloud Dataflow
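For illustration, here is a minimal sketch of pushing one collected record into such a real-time pipeline, assuming a local Apache Kafka broker, a topic named "events", and the kafka-python package (none of these are specified on the slide):

import json
from kafka import KafkaProducer

# Assumes a Kafka broker listening on localhost:9092
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

# Send one collected data point to the "events" topic
producer.send("events", {"user_id": 42, "action": "page_view"})
producer.flush()  # block until the message is actually delivered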
DATA COLLECTION - SKILLS
• SQL
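As a small illustration of how SQL is used when collecting data in Python (the database file, table, and columns below are hypothetical; the slide only lists the skill):

import sqlite3
import pandas as pd

# Pull collected records out of a (hypothetical) SQLite database into pandas
conn = sqlite3.connect("collected_data.db")
df = pd.read_sql_query("SELECT respondent_id, answer FROM survey_responses", conn)
print(df.head())
conn.close()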
DATA CLEANING
• Pythonic Data Cleaning With Pandas and NumPy
• Dropping Columns in a DataFrame.
• Changing the Index of a DataFrame.
• Tidying up Fields in the Data.
• Combining str Methods with NumPy to Clean Columns.
• Cleaning the Entire Dataset Using the applymap Function.
• Renaming Columns and Skipping Rows.
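A short sketch of the first few of these operations; the file and column names are made up, since the slide does not fix a dataset:

import pandas as pd

df = pd.read_csv("books.csv")  # hypothetical input file

# Dropping columns that are not needed for the analysis
df = df.drop(columns=["Edition Statement", "Corporate Author"])

# Changing the index to a unique identifier column
df = df.set_index("Identifier")

# Renaming columns to shorter, consistent labels
df = df.rename(columns={"Place of Publication": "place", "Date of Publication": "year"})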
SOURCES OF MISSING VALUES
• Here are some typical reasons why data is missing:
• User forgot to fill in a field.
• Data was lost while transferring manually from a
legacy database.
• There was a programming error.
• Users chose not to fill out a field tied to their beliefs
about how the results would be used or interpreted.
DATA CLEANING - AN EXAMPLE
DATA CLEANING – PYTHON
CODE
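The code pictured on this slide is not reproduced here; a minimal reconstruction of the loading step could look like the following, where the file name and the list of missing-value markers are assumptions:

import pandas as pd

# Treat common placeholder strings as missing values while reading the file
missing_markers = ["n/a", "na", "--"]
df = pd.read_csv("property_data.csv", na_values=missing_markers)
print(df.head())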
LOOKING FOR NULL VALUES
• # Looking at the OWN_OCCUPIED column
print(df['OWN_OCCUPIED'])
print(df['OWN_OCCUPIED'].isnull())
SUMMARISING MISSING VALUES
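The summary output pictured on this slide is not reproduced here; the usual pandas calls for summarising missing values (continuing with the df loaded above) are:

# Count missing values per column
print(df.isnull().sum())

# Total number of missing values in the whole DataFrame
print(df.isnull().sum().sum())

# Quick check: are there any missing values at all?
print(df.isnull().values.any())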
REPLACING MISSING VALUES
• Fill in missing values with a single value.
• # Replace missing values with a number
df['ST_NUM'].fillna(125, inplace=True)
REPLACING MISSING VALUES
• Location-based imputation. Here’s how you would do that:
• # Location based replacement
df.loc[2,'ST_NUM'] = 125
DATA MUNGING – RECAP OF THE NEED
• Data wrangling, sometimes referred to as data
munging, is the process of transforming and
mapping data from one "raw" data form into
another format with the intent of making it more
appropriate and valuable for a variety of
downstream purposes such as analytics.
DATA MUNGING - STEPS
• Read the data
• df = pd.read_csv('data/ava-all.csv')
• Find data types
• df.dtypes
• Check the shape of the data
• df.shape
(number of rows and columns)
DATA MUNGING - STEPS
• Fill ‘na’ values with zero
• df['Buried - Fully:'].fillna(0).describe()
DATA MUNGING - STEPS
• Here is a function that takes a string as input and tries to coerce it to a number of inches:
>>> import re
>>> def to_inches(orig):
...     txt = str(orig)
...     if txt == 'nan':
...         return orig
...     reg = r'''(((\d*\.)?\d*)')?(((\d*\.)?\d*)")?'''
...     mo = re.search(reg, txt)
...     feet = mo.group(2) or 0
...     inches = mo.group(5) or 0
...     return float(feet) * 12 + float(inches)
DATA MUNGING - STEPS
• The to_inches function returns NaN unchanged if that is what comes in as the orig parameter.
• Otherwise, it looks for optional feet (numbers
followed by a single quote) and optional inches
(numbers followed by a double quote).
• It casts these to floating point numbers and
multiplies the feet by twelve.
• Finally, it returns the sum.
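For illustration (these calls are not on the original slide), a quick check of the function on two strings:
>>> to_inches("6'2\"")   # 6 feet 2 inches -> 6 * 12 + 2
74.0
>>> to_inches('49"')     # inches only, no feet part
49.0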
Note
• Regular expressions could fill up a book on their
own.
• A few things to note. We use raw strings to specify them (they have an r at the front), as raw strings don't interpret the backslash as an escape character.
• This is important because the backslash has special meaning in regular expressions: \d means match a digit.
Note
• The parentheses are used to specify groups. After invoking the search function, we get a match object as the result (mo in the code above).
• The .group method pulls out the match inside of the group. mo.group(2) looks for the second left parenthesis and returns the match inside of those parentheses. mo.group(5) looks for the fifth left parenthesis and returns the match inside of it.
• Normally Python is zero-based, where we start counting from zero, but in the case of regular expression groups, we start counting at one.
• The first left parenthesis indicates where the first group starts, group
one, not zero.
WEB SCRAPING
• Web “scraping” (also called “web harvesting,”
“web data extraction,” or even “web data
mining”), can be defined as “the construction of
an agent to download, parse, and organize data
from the web in an automated manner.”
WEB SCRAPING
• Or, in other words: instead of a human end user
clicking away in a web browser and copy-pasting
interesting parts into, say, a spreadsheet, web
scraping offloads this task to a computer
program that can execute it much faster, and
more correctly, than a human can.
Why Web Scraping for Data
Science?
• When surfing the web using a normal web browser,
you’ve probably encountered multiple sites where you
considered the possibility of gathering, storing, and
analysing the data presented on the site’s pages.
• Especially for data scientists, whose “raw material” is
data, the web exposes a lot of interesting
opportunities:
• There might be an interesting table on a Wikipedia page (or
pages) you want to retrieve to perform some
statistical analysis.
Why Web Scraping for Data
Science?
• Perhaps you want to get a list of reviews from a movie site to
perform text mining, create a recommendation
engine, or build a predictive model to spot fake
reviews.
• You might wish to get a listing of properties on a real-estate site to build an appealing geo-visualization.
• You’d like to gather additional features to enrich your
data set based on information found on the web, say,
weather information to forecast, for example, soft
drink sales.
Why Web Scraping for Data
Science?
• You might be wondering about doing social network
analytics using profile data found on a web forum.
• It might be interesting to monitor a news site for
trending new stories on a particular topic of interest.
FROM WEB SCRAPING TO WEB CRAWLING
• The difference between “web scraping” and “web
crawling” is relatively vague, as many authors and
programmers will use both terms interchangeably.
• In general terms, the term “crawler” indicates a program’s
ability to navigate web pages on its own, perhaps even
without a well-defined end goal or purpose, endlessly
exploring what a site or the web has to offer.
• Web crawlers are heavily used by search engines like
Google to retrieve contents for a URL, examine that page
for other links, retrieve the URLs for those links, and so
on…
Process of Web Scraping
• Make a request to the target URL with the requests module.
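A minimal sketch of this step and a typical follow-up parse, assuming the requests and beautifulsoup4 packages and an illustrative URL (the slide itself only mentions the requests call):

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Data_science"  # illustrative URL
response = requests.get(url)
response.raise_for_status()  # stop early on HTTP errors

# Parse the HTML and pull out something simple: the page title and a few links
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.get_text())
for link in soup.find_all("a", href=True)[:5]:
    print(link["href"])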
Dimensionality Reduction
• Dimensionality Reduction (DR) is a pre-processing step that removes redundant features and noisy, irrelevant data in order to improve learning accuracy and reduce training time.
Benefits of Dimension Reduction
• It helps in compressing the data and reducing the storage space required
• It reduces computation time.
• It shortens the time required for performing the same computations. Fewer dimensions lead to less computing, and fewer dimensions also allow the use of algorithms unfit for a large number of dimensions
• It takes care of multi-collinearity, which improves model performance.
• It removes redundant features. For example: there is no point in
storing a value in two different units (meters and inches).
• Reducing the dimensions of data to 2D or 3D may allow us to plot
and visualize it precisely. You can then observe patterns more
clearly.
Feature Selection
• Feature Selection is the process where you automatically or manually select those features which contribute most to the prediction variable or output in which you are interested.
FEATURE SELECTION
• Feature selection can be done in multiple ways, but broadly there are 3 categories of it (a small filter-method sketch follows the list):
1. Filter Method
2. Wrapper Method
3. Embedded Method
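A hedged sketch of the filter method using scikit-learn; the library and the iris data set are assumptions, since the slide only names the three categories:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Filter method: score each feature against the target and keep the best k
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)   # univariate ANOVA F-scores, one per feature
print(X_selected.shape)   # (150, 2): two features retained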
Feature Extraction
• Feature extraction (FE) methods derive new features from the original dataset.
• It is very beneficial when we want to decrease the number of resources required for processing without losing relevant information from the dataset.
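For instance, PCA (introduced below) is a common feature extraction technique; a minimal scikit-learn sketch on made-up data (both the library choice and the data are assumptions):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # toy data: 100 samples, 5 original features

# Extract 2 new features (principal components) from the 5 original ones
pca = PCA(n_components=2)
X_new = pca.fit_transform(X)
print(X_new.shape)  # (100, 2)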
Principal Component Analysis
(PCA)
Principal component analysis (PCA) is a dimensionality reduction technique that enables you to identify correlations and patterns in a data set so that it can be transformed into a data set of significantly lower dimension with little loss of important information.
Steps of Principal Component Analysis
• Standardization of the data
• Computing the covariance matrix
• Calculating the eigenvectors and eigenvalues
• Computing the Principal Components
• Reducing the dimensions of the data set
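These steps can be followed directly with NumPy; a small sketch on made-up data, keeping 2 dimensions (both choices are assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # toy data: 100 samples, 3 variables

# 1. Standardization of the data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Computing the covariance matrix
cov = np.cov(X_std, rowvar=False)

# 3. Calculating the eigenvectors and eigenvalues
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Computing the principal components: order eigenvectors by decreasing eigenvalue
order = np.argsort(eigvals)[::-1]
eigvecs = eigvecs[:, order]

# 5. Reducing the dimensions: project onto the top-2 eigenvectors
X_reduced = X_std @ eigvecs[:, :2]
print(X_reduced.shape)  # (100, 2)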
Standardization of the data
Consider an example: let’s say that we have 2 variables in our data set, one with values ranging between 10 and 100 and the other with values between 1000 and 5000. In such a scenario, the output calculated by using these predictor variables is going to be biased, since the variable with the larger range will have a more obvious impact on the outcome. Standardization removes this bias by rescaling each variable to zero mean and unit variance, i.e., z = (x − mean) / standard deviation.
Computing the covariance matrix
Calculating the Eigenvectors and Eigenvalues
• An eigenvector is a vector whose direction remains
unchanged when a linear transformation is applied to it.
• For every eigenvector there is an eigenvalue. The dimensions in the data determine the number of eigenvectors that you need to calculate.
• Consider a 2-Dimensional data set, for which 2 eigenvectors
(and their respective eigenvalues) are computed. The idea
behind eigenvectors is to use the Covariance matrix to
understand where in the data there is the most amount of
variance. Since more variance in the data denotes more
information about the data, eigenvectors are used to identify
and compute Principal Components.
Computing the Principal Components
Reducing the dimensions of the data set
• The last step in performing PCA is to re-arrange the original data in terms of the final principal components, which represent the maximum and the most significant information of the data set. In order to replace the original data axes with the newly formed principal components, you simply multiply the transpose of the original data set by the transpose of the obtained feature vector.
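In code, the projection described here is (continuing the NumPy sketch above, where eigvecs[:, :2] plays the role of the feature vector):

feature_vector = eigvecs[:, :2]
# Multiply the transpose of the feature vector by the transpose of the standardized data,
# then transpose back so that rows are samples again
X_reduced = (feature_vector.T @ X_std.T).T  # identical to X_std @ feature_vector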