Python


M3 PROGRAMMING FOR ANALYTICS


LEARNING OBJECTIVES
o Installation of Anaconda Navigator, Data types – string, tuples, set, lists, dictionary, Arrays. Spyder, Importing and Exporting Files,
Data Manipulation, Descriptive Statistics and Documentation with Jupyter.

o INTRODUCTION: DATABASE MANAGEMENT SYSTEMS


o DATA DEFINITION AND MANIPULATION
o BASICS OF SAS
o PYTHON: BASICS OF PYTHON
o R PROGRAMMING
PYTHON
INSTALLATION OF ANACONDA NAVIGATOR

1. Go to the Anaconda website and choose a Python 3.x graphical installer.
2. Locate your download and double-click it. When the installer's welcome screen appears, click Next.
3. Read the license agreement and click I Agree.
4. Click Next.
5. Note your installation location, then click Next.
6. This is an important part of the installation process. The recommended approach is to leave the box for adding Anaconda to your PATH unchecked. This means you will have to use Anaconda Navigator or the Anaconda Command Prompt (located in the Start Menu under "Anaconda") when you wish to use Anaconda. You can always add Anaconda to your PATH later.
7. Click Next.
8. You can install Microsoft VSCode if you wish, but it is optional.
9. Click Finish.
Python Data Types
Python data types define the kind of value a variable holds:
1. Python Data Type – Numeric
2. Python Data Type – String
3. Python Data Type – List
4. Python Data Type – Tuple
5. Dictionary
6. Python Sets

1. Python Data Type – Numeric
Python numeric data types are used to hold numeric values:
int – holds signed integers of arbitrary size.
long – holds long integers (exists only in Python 2.x; in Python 3.x it was merged into int).
float – holds floating-point numbers, accurate to about 15 significant decimal digits.
complex – holds complex numbers.

Unlike in C or C++, we do not need to declare a data type when creating a variable in Python; we simply assign a value to it. To see what type of value a variable currently holds, we can use type().
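A minimal sketch of checking numeric types with type(). The variable names and values are hypothetical and chosen only for illustration.

```python
# Hypothetical values chosen only to show what type() reports.
count = 42
price = 19.99
z = 3 + 4j

print(type(count))   # <class 'int'>
print(type(price))   # <class 'float'>
print(type(z))       # <class 'complex'>

# Python 3 integers grow as needed, so very large values stay exact.
big = 2 ** 100
print(type(big), big)
```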
2. Python Data Type – String
You can create a Python string using single, double, or triple quotes. Each of the first three assignments below creates the same string.

mystring = "Hello"
mystring = '''Hello'''
mystring = """Hello"""
print(mystring)
Hello

mystring = 'Hello"Python"'
print(mystring)
Hello"Python"

How to extract the Nth letter or word?

mystring = 'Hi How are you?'
mystring[0]
'H'

mystring[0] refers to the first letter, as indexing in Python starts from 0. Similarly, mystring[1] refers to the second letter. To pull the last letter, you can use -1 as the index.

To get the first word

mystring = 'Hi How are you?'
mystring.split(' ')[0]
'Hi'

There should be a space between the quotes in split(' ').
1. mystring.split(' ') tells Python to use a space as the delimiter.
2. mystring.split(' ')[0] tells Python to pick the first word of the string.
3. Python Data Type – List
Unlike a string, a list can contain different types of objects such as integers, floats, and strings. Formally, a list is an ordered sequence of data written using square brackets ([]) and commas (,).

x = [142, 124, 234, 345, 465]
y = ['A', 'C', 'E', 'M']
z = ['AA', 44, 5.1, 'KK']

Get List Item
We can extract list items using indexes. The index starts at 0 and ends at (number of elements − 1).
Syntax: list[start : stop : step]
start: the starting position (included)
stop: the end position (excluded)
step: the increment value

k = [124, 225, 305, 246, 259]
k[0]
124

k[:3] returns [124, 225, 305]
k[0:3] also returns [124, 225, 305]
k[::-1] reverses the whole list and returns [259, 246, 305, 225, 124]
Combine / Join data lists
The '+' operator concatenates lists.

X = [1, 2, 3]
Y = [4, 5, 6]
Z = X + Y
print(Z)
[1, 2, 3, 4, 5, 6]

Sum of values of data list

list1 = [1, 2, 3]
list2 = [4, 5, 6]
sum_list = []
for (item1, item2) in zip(list1, list2):
    sum_list.append(item1 + item2)
print(sum_list)
[5, 7, 9]

Sum of values using numpy

import numpy as np
X = [1, 2, 3]
Y = [4, 5, 6]
Z = np.add(X, Y)
print(Z)
[5 7 9]

The '*' operator repeats a list N times.

X = [1, 2, 3]
Z = X * 3
print(Z)
[1, 2, 3, 1, 2, 3, 1, 2, 3]

Modify / Replace a list item

X = [1, 2, 3]
X[2] = 5
print(X)
[1, 2, 5]


4. Python Data Type – Tuple
Like a list, a tuple can contain mixed data. But a tuple is immutable: it cannot be changed once created, whereas a list is mutable and can be modified.

Another difference is that a tuple is created inside parentheses ( ), whereas a list is created inside square brackets [ ].

Examples:

a = (1, 2, 3, 4)
print(a)
(1, 2, 3, 4)

b = ("hello", 1, 2, 3, "go")
print(b)
('hello', 1, 2, 3, 'go')

A tuple cannot be altered

a = (1, 2, 3, 4)
a[2] = 5
TypeError: 'tuple' object does not support item assignment
5. Dictionary
A dictionary works like an address book in which you can find the address of a person by searching for the name. In this example, the name of a person is the key and the address is the value. It is important to note that keys must be unique, while values need not be. Keys should not be duplicated because, with a duplicate key, you cannot find the exact value associated with it. Keys can be of any immutable (hashable) data type, such as strings, numbers, or tuples.

Create a dictionary
It is defined in curly braces {}. Each key is followed by a colon (:) and then its value.

teams = {'Dave' : ['teamA','teamAA', 'teamAB'],
         'Tim' : ['teamB','teamBB','teamBC'],
         'Babita' : ['teamC','teamCB','teamCC']
        }

teams.keys()
dict_keys(['Dave', 'Tim', 'Babita'])

teams.values()
dict_values([['teamA', 'teamAA', 'teamAB'], ['teamB', 'teamBB', 'teamBC'], ['teamC', 'teamCB', 'teamCC']])
Find values of a particular key

teams = {'Dave' : ['teamA','teamAA', 'teamAB'],
         'Tim' : ['teamB','teamBB','teamBC'],
         'Babita' : ['teamC','teamCB','teamCC']
        }
teams['Dave']
['teamA', 'teamAA', 'teamAB']

Delete an item

del teams['Babita']
teams
{'Dave': ['teamA', 'teamAA', 'teamAB'],
 'Tim': ['teamB', 'teamBB', 'teamBC']}


6. Python Sets
A set is an unordered collection of unique items. It is defined by values separated by commas inside braces { }. Sets are mainly used to check whether an object is present in the set and to compute mathematical operations such as intersection, union, and difference.
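A minimal sketch of the set operations named above. The two sets are hypothetical and used only for illustration.

```python
# Hypothetical sets used only for illustration.
a = {1, 2, 3, 3, 4}      # the duplicate 3 is stored once
b = {3, 4, 5}

print(len(a))            # 4
print(3 in a)            # True: membership test
print(a & b)             # intersection: the items 3 and 4
print(a | b)             # union: the items 1, 2, 3, 4 and 5
print(a - b)             # difference: the items 1 and 2
```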
Click Anaconda Navigator
Click Launch under Spyder. Alternatively, if you have a terminal window open, you can launch Spyder simply by typing spyder and pressing Enter. You may get a pop-up window saying that Spyder is not the latest version, which is just because the version within Anaconda is a few revisions behind.
Spyder (Python)
File Explorer or directory listing.
In this pane, you can find files that you want to edit or create new files and folders to work with.

Script editor or File editor


In this editor, you can work on Python scripts that you want to save to re-run later on. By default, the editor opens a file called temp.py
located in Spyder’s configuration directory. This file is meant as a temporary place to try things out before you save them in a file
somewhere else on your computer.

Console
In the bottom center is the console. Like in MATLAB, the console is where you can run commands to see what they do or when you want to
debug some code. Variables created in the console are not saved if you close Spyder and open it up again. The console is technically
running IPython by default.

History
Any commands that you type in the console will be logged into the history file in the bottom right pane of the window.

Variable Explorer
Furthermore, any variables that you create in the console will be shown in the variable explorer in the top right pane.
Importing and Exporting

Importing Data
– Process of loading and reading data into Python from various sources.
– Two important properties:
  – Format
    Various formats: .csv, .json, .xlsx, ...
  – File path of dataset
    Computer: /Desktop/mydata.csv

Example: reading people.csv

import csv
# newline='' lets the csv module handle line endings itself
with open('people.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
How to import data in Python.
1. Import CSV files
It is important to note that a single backslash does not work when specifying the file path, because the backslash is an escape character in Python strings. You need to either change it to a forward slash or add one more backslash, as below.

import pandas as pd
mydata = pd.read_csv("C:\\Users\\likhita\\Documents\\file1.csv")

If there is no header (title) in the raw data file

mydata1 = pd.read_csv("C:\\Users\\likhita\\Documents\\file1.csv", header = None)
You need to include the header = None option to tell pandas that there are no column names (header) in the data.

Add Column Names
We can include column names by using the names= option.
mydata2 = pd.read_csv("C:\\Users\\likhita\\Documents\\file1.csv", header = None, names = ['ID', 'first_name', 'salary'])
The column names can also be added separately by using the following command.
mydata1.columns = ['ID', 'first_name', 'salary']
2. Import File from URL
You don't need to perform additional steps to fetch data from a URL. Simply pass the URL to the read_csv() function (applicable only for CSV files stored at a URL).
mydata = pd.read_csv("http://winterolympicsmedals.com/medals.csv")

3. Read Text File
We can use the read_table() function to pull data from a text file. We can also use read_csv() with sep="\t" to read data from a tab-separated file.
mydata = pd.read_table("C:\\Users\\likhita\\Desktop\\example2.txt")
mydata = pd.read_csv("C:\\Users\\likhita\\Desktop\\example2.txt", sep="\t")

4. Read Excel File
The read_excel() function can be used to import Excel data into Python.

mydata = pd.read_excel("https://www.eia.gov/dnav/pet/hist_xls/RBRTEd.xls", sheet_name="Data 1", skiprows=2)


5. Read delimited file
Suppose you need to import a file that is separated by white space.
mydata2 = pd.read_table("http://www.ssc.wisc.edu/~bhansen/econometrics/invest.dat", sep=r"\s+", header = None)
To include variable names, use the names= option as below.
mydata3 = pd.read_table("http://www.ssc.wisc.edu/~bhansen/econometrics/invest.dat", sep=r"\s+", names=['a', 'b', 'c', 'd'])

6. Read sample of rows and columns
By specifying nrows= and usecols=, you can fetch a specified number of rows and columns.
mydata7 = pd.read_csv("http://winterolympicsmedals.com/medals.csv", nrows=5, usecols=[1, 5, 7])
nrows=5 means you want to import only the first 5 rows, and usecols= lists the columns you want to import (here by zero-based position).

7. Skip rows while importing
Suppose you want to skip the first 5 rows and read data from the 6th row (the 6th row would be the header row).
mydata8 = pd.read_csv("http://winterolympicsmedals.com/medals.csv", skiprows=5)
Printing the dataframe in Python
- df prints the entire dataframe (not recommended for large datasets).
- df.head(n) shows the first n rows of the data frame.
- df.tail(n) shows the bottom n rows of the data frame.

Exporting a Pandas dataframe to CSV
- Preserve progress at any time by saving the modified dataset to a CSV file.
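The export command itself is not preserved in this text. One common way to save a data frame is pandas' DataFrame.to_csv; the sketch below assumes that method, and the data frame and file name are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data frame standing in for a modified dataset.
df = pd.DataFrame({"ID": [1, 2], "first_name": ["Asha", "Ben"], "salary": [1000, 1200]})

# Write to a temporary folder so the sketch does not touch real files.
out_path = os.path.join(tempfile.mkdtemp(), "mydata_clean.csv")

# index=False leaves out the row index, so the file holds only the data columns.
df.to_csv(out_path, index=False)

# Reading the file back confirms that the export round-trips.
restored = pd.read_csv(out_path)
print(restored)
```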
Data Manipulation
- Melt
- Pivot
- Cross tab
- Cut
- Merge
- Concat
- Unique
Descriptive Statistics

Mean Example (Python)

Data
Marital_status: Whether the applicant is married ("Yes") or not ("No").
Dependents: Number of dependents of the applicant.
Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No").
Income: Annual Income of the applicant (in USD).
Loan_amount: Loan amount (in USD) for which the application was submitted.
Term_months: Tenure of the loan (in months).
Credit_score: Whether the applicant's credit score was good ("Satisfactory") or not ("Not_satisfactory").
Age: The applicant’s age in years.
Sex: Whether the applicant is female (F) or male (M).
approval_status: Whether the loan application was approved ("Yes") or not ("No").
Data vs Information

Measures of Central Tendency


Measures of central tendency describe the center of the data, and are often represented by the mean, the median, and the mode.

Mean
Arithmetic Mean (called mean) is defined as the sum of all observations in a data set divided by the total number of
observations.
Mean Example

The inner diameters of a particular grade of tire, based on 5 sample measurements, are as follows (figures in millimeters):

565, 570, 572, 568, 585

We get mean = (565 + 570 + 572 + 568 + 585) / 5 = 2860 / 5 = 572

Caution: The arithmetic mean is affected by extreme values or fluctuations in sampling. It is not the best average to use when the data set contains extreme values (very high or very low values).

Mean Example (Python)
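The original code for this slide is not preserved. A minimal sketch using Python's standard statistics module on the tire measurements from the worked example; the extra value of 900 is hypothetical and only illustrates the caution about extreme values.

```python
from statistics import mean

# Inner diameters in millimeters, from the worked mean example.
diameters = [565, 570, 572, 568, 585]

avg = mean(diameters)
print(avg)  # 572

# One hypothetical extreme reading pulls the mean well above the typical value.
with_outlier = diameters + [900]
print(mean(with_outlier))
```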


Median
The median is the middle-most observation when you arrange the data in ascending order of magnitude. It is such that 50% of the observations are above the median and 50% are below it.

The median is a very useful measure for ranked data in the context of consumer preferences and ratings. It is not affected by extreme values (greater resistance to outliers).

For an odd number of observations, median = the ((n + 1)/2)th value in the ordered data, where n = number of observations in the sample. For an even n, it is the average of the two middle values.

Median example
Marks obtained by 7 students in a Computer Science exam are given below. Compute the median.

45 40 60 80 90 65 55

Arranging the data in order gives

90 80 65 60 55 45 40

Median = ((n + 1)/2)th value in this set = ((7 + 1)/2)th observation = 4th observation = 60

Hence, median = 60 for this problem.



Median example(Python)
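The original code for this slide is not preserved. A minimal sketch with the standard statistics module, using the marks from the worked median example.

```python
from statistics import median

# Marks of the 7 students from the worked median example.
marks = [45, 40, 60, 80, 90, 65, 55]

print(sorted(marks))    # [40, 45, 55, 60, 65, 80, 90]
middle = median(marks)  # median() sorts the data internally
print(middle)           # 60
```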
Mode
The mode is the value that occurs most often, that is, the value with the maximum frequency of occurrence. The mode is also resistant to outliers.

The mode is a very useful measure when, for example, you want to stock the most popular shirt collar size during the festival season.

Caution: In some real-life problems there is more than one mode (bimodal or multi-modal data). In these cases the mode cannot be uniquely determined.

Mode example
The lives, in hours, of 10 flashlight batteries are as follows. Find the mode.

340 350 340 340 320 340 330 330 340 350

340 occurs five times. Hence, mode = 340.



Mode example (Python)
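The original code for this slide is not preserved. A minimal sketch with the standard statistics module, using the battery data from the worked mode example. The short list passed to multimode is hypothetical and shows the bimodal case.

```python
from statistics import mode, multimode

# Battery lives in hours, from the worked mode example.
hours = [340, 350, 340, 340, 320, 340, 330, 330, 340, 350]

most_common = mode(hours)
print(most_common)              # 340
print(hours.count(most_common)) # 5

# multimode returns every most-frequent value, which helps with bimodal data.
both = multimode([1, 1, 2, 2, 3])
print(both)                     # [1, 2]
```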


Measure of Dispersion
In simple terms, measures of dispersion indicate how large the spread of the distribution is around the central tendency. They answer the question "What is the magnitude of departure from the average value for different groups having identical averages?".

Range
Range is the simplest of all measures of dispersion. It is calculated as the difference between the maximum and minimum values in the data set.

Range = X_max − X_min


Range Example

The following data represent the percentage return on investment per annum for 10 mutual funds. Calculate the range.

12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9

Range = 18 − 9 = 9

Caution: If either component of the range, the maximum or the minimum value, is an extreme value, then the range should not be used.
Inter Quartile Range

IQR = range computed on the middle 50% of the observations, after eliminating the highest and lowest 25% of observations in a data set that is arranged in ascending order.

IQR is less affected by outliers.

IQR = Q3 − Q1

IQR Example
The following data represent the percentage return on investment per annum for 9 mutual funds. Calculate the interquartile range.

Data Set: 12, 14, 11, 18, 10.5, 12, 14, 11, 9

Arranging in ascending order, the data set becomes

9, 10.5, 11, 11, 12, 12, 14, 14, 18

With n = 9, Q1 is the (n + 1)/4 = 2.5th value, (10.5 + 11)/2 = 10.75, and Q3 is the 3(n + 1)/4 = 7.5th value, (14 + 14)/2 = 14.

IQR = Q3 − Q1 = 14 − 10.75 = 3.25
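A minimal sketch of the range and IQR for the nine mutual fund returns, using the standard statistics module. Quartile conventions differ between tools; the "exclusive" method is chosen here because it reproduces the hand calculation, and other libraries may report slightly different quartiles for the same data.

```python
from statistics import quantiles

# Percentage returns for the 9 mutual funds in the IQR example.
returns = [12, 14, 11, 18, 10.5, 12, 14, 11, 9]

value_range = max(returns) - min(returns)

# method="exclusive" places quartiles at the (n + 1) * p-th ordered value,
# which matches the hand calculation; other conventions give other answers.
q1, q2, q3 = quantiles(returns, n=4, method="exclusive")
iqr = q3 - q1

print(value_range)   # 9
print(q1, q3, iqr)   # 10.75 14.0 3.25
```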

Standard Deviation
Standard deviation forms the cornerstone for Inferential Statistics. To define standard deviation, you need to define another term
called variance.

In simple terms, standard deviation is the square root of variance.

Example
The following data represent the percentage return on investment for 10 mutual funds per annum.

Calculate the sample standard deviation.


12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9

Standard Deviation example (Python)
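The original code for this slide is not preserved. A minimal sketch with the standard statistics module, using the ten mutual fund returns from the standard deviation example. Both functions use the sample (n − 1) denominator.

```python
from statistics import mean, stdev, variance

# Percentage returns for the 10 mutual funds in the example.
returns = [12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9]

# stdev and variance use the sample (n - 1) denominator.
s = stdev(returns)
var = variance(returns)

print(round(mean(returns), 2))  # 12.28
print(round(var, 2))            # 6.33
print(round(s, 2))              # 2.52
```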



Variance example (Python)


Variance is another measure of dispersion. It is the square of the standard deviation and the covariance of the random variable with itself. The line of code below
prints the variance of all the numerical variables in the dataset. The interpretation of the variance is similar to that of the standard deviation.
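The code line referred to is not preserved in this text. A sketch of what it could look like, assuming pandas; the rows are hypothetical and only the column names come from the data dictionary in this deck.

```python
import pandas as pd

# Hypothetical rows; the column names follow the data dictionary in this deck.
df = pd.DataFrame({
    "Income": [30000, 45000, 52000, 61000],
    "Loan_amount": [10000, 15000, 12000, 20000],
    "Is_graduate": ["Yes", "No", "Yes", "Yes"],
})

# numeric_only=True skips text columns such as Is_graduate.
variances = df.var(numeric_only=True)
print(variances)

# The variance is the square of the standard deviation.
print(df["Income"].std() ** 2)
```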
Coefficient of Variation (Relative Dispersion)

Coefficient of Variation (CV) is defined as the ratio of the standard deviation to the mean.

In symbolic form,
CV = s / x̄ for the sample data and CV = σ / μ for the population data

Consider two sales persons working in the same territory. Their sales performance in selling PCs is given below. Comment on the results.

                 Mean sales (one-year average)   Standard deviation
Sales Person 1   50 units                        5 units
Sales Person 2   75 units                        25 units

The CV is 5/50 = 0.10 or 10% for Sales Person 1 and 25/75 = 0.33 or 33% for Sales Person 2.

The moral of the story is "don't get carried away by absolute numbers". Look at the scatter. Even though Sales Person 2 has achieved a higher average, his performance is not consistent and seems erratic.
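A minimal sketch of the same comparison in code. The helper function name is hypothetical; the inputs are the means and standard deviations of the two sales persons.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the standard deviation as a fraction of the mean."""
    return std_dev / mean

cv1 = coefficient_of_variation(5, 50)    # Sales Person 1
cv2 = coefficient_of_variation(25, 75)   # Sales Person 2

print(f"{cv1:.0%} {cv2:.0%}")  # 10% 33%
```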
The Boxplot
The boxplot is a graphical display of the data based on the five-number summary:

X_smallest, Q1, Median, Q3, X_largest

Skew example (python)

The skewness values can be interpreted in the following manner:

o Highly skewed distribution: If the skewness value is less than −1 or greater than +1.
o Moderately skewed distribution: If the skewness value is between −1 and −½ or between +½ and +1.
o Approximately symmetric distribution: If the skewness value is between −½ and +½.
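The original code for this slide is not preserved. A sketch assuming pandas' Series.skew, with a hypothetical income series and a hypothetical helper that applies the three cut-offs listed above.

```python
import pandas as pd

# Hypothetical incomes: most values are close together, one is far larger.
income = pd.Series([30, 32, 35, 36, 38, 40, 41, 150])
skewness = income.skew()

def describe_skew(value):
    """Label a skewness value using the cut-offs listed above."""
    if abs(value) > 1:
        return "highly skewed"
    if abs(value) >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"

print(round(skewness, 2), describe_skew(skewness))
```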
Data Exploration

1. Steps of Data Exploration and Preparation


o Variable Identification
o Univariate Analysis
o Bi-variate Analysis
o Missing values treatment
o Outlier treatment
o Variable transformation
o Variable creation
1. Variable Identification
First, identify the predictor (input) and target (output) variables. Next, identify the data type and category of each variable.

Let's understand this step more clearly with an example.

Example: Suppose we want to predict whether students will play cricket or not (refer to the data set below). Here you need to identify the predictor variables, the target variable, the data type of each variable, and the category of each variable.
Univariate Analysis
At this stage, we explore variables one by one. The method used for univariate analysis depends on whether the variable is categorical or continuous. Let's look at the methods and statistical measures for categorical and continuous variables individually:

Continuous Variables: For continuous variables, we need to understand the central tendency and spread of the variable. These are measured using various statistical metrics and visualization methods, as shown below:

Note: Univariate analysis is also used to highlight missing and outlier values. In the upcoming part of this series, we will look at methods to handle missing and outlier values.

Categorical Variables: For categorical variables, we use a frequency table to understand the distribution of each category. We can also read it as the percentage of values under each category. It can be measured using two metrics, Count and Count%, against each category. A bar chart can be used for visualization.
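A minimal sketch of univariate analysis in pandas. The rows are hypothetical; the column names follow the loan data dictionary in this deck.

```python
import pandas as pd

# Hypothetical sample; column names follow the loan data dictionary in this deck.
df = pd.DataFrame({
    "Income": [30000, 45000, 52000, 61000, 58000],
    "Sex": ["F", "M", "M", "F", "M"],
})

# Continuous variable: central tendency and spread in one call.
summary = df["Income"].describe()
print(summary)

# Categorical variable: Count and Count% per category.
counts = df["Sex"].value_counts()
percent = df["Sex"].value_counts(normalize=True) * 100
print(pd.DataFrame({"Count": counts, "Count%": percent}))
```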

Bi-variate Analysis
Bi-variate Analysis finds out the relationship between two variables. Here, we look for association and disassociation between variables at a pre-defined
significance level. We can perform bi-variate analysis for any combination of categorical and continuous variables. The combination can be: Categorical &
Categorical, Categorical & Continuous and Continuous & Continuous. Different methods are used to tackle these combinations during analysis process.

Let’s understand the possible combinations in detail:

Continuous & Continuous: While doing bi-variate analysis between two continuous variables, we should look at scatter plot. It is a nifty way to find out the
relationship between two variables. The pattern of scatter plot indicates the relationship between variables. The relationship can be linear or non-linear.
A scatter plot shows the relationship between two variables but does not indicate the strength of the relationship between them. To find the strength of the relationship, we use correlation. Correlation varies between -1 and +1.

-1: perfect negative linear correlation
+1: perfect positive linear correlation
0: no correlation

Correlation can be derived using the following formula:

Correlation = Covariance(X, Y) / SQRT(Var(X) * Var(Y))

Various tools have functions to identify the correlation between variables. In Excel, the function CORREL() returns the correlation between two variables, and SAS uses the procedure PROC CORR. These functions return the Pearson correlation value to identify the relationship between two variables.

In the above example, we have a good positive relationship (0.65) between the two variables X and Y.
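A minimal sketch of the correlation formula in NumPy, checked against NumPy's built-in Pearson correlation. The paired observations are hypothetical and are not the data behind the 0.65 figure.

```python
import numpy as np

# Hypothetical paired observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Correlation = Covariance(X, Y) / SQRT(Var(X) * Var(Y)), with sample (n - 1) estimates.
cov_xy = np.cov(x, y)[0, 1]
r_manual = cov_xy / np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))

# NumPy's built-in Pearson correlation should agree with the manual value.
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```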
Categorical & Categorical: To find the relationship between two categorical variables, we can use the following methods:

Two-way table: We can start analyzing the relationship by creating a two-way table of count and count%. The rows represent the categories of one variable and the columns represent the categories of the other variable. We show the count or count% of observations available in each combination of row and column categories.
Stacked Column Chart: This method is more of a visual form of the two-way table.

Chi-Square Test: This test is used to derive the statistical significance of the relationship between the variables. It also tests whether the evidence in the sample is strong enough to generalize the relationship to a larger population. Chi-square is based on the difference between the expected and observed frequencies in one or more categories of the two-way table. It returns the probability for the computed chi-square statistic with the given degrees of freedom.
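A sketch of a two-way table and the chi-square statistic, computed by hand from observed and expected counts. The survey data and column names are hypothetical, and only pandas is assumed.

```python
import pandas as pd

# Hypothetical survey data.
df = pd.DataFrame({
    "Sex": ["M", "M", "M", "M", "F", "F", "F", "F"],
    "Plays_cricket": ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"],
})

# Two-way table of counts, and of row percentages.
observed = pd.crosstab(df["Sex"], df["Plays_cricket"])
row_percent = pd.crosstab(df["Sex"], df["Plays_cricket"], normalize="index") * 100

# Expected counts if the two variables were independent:
# (row total * column total) / grand total.
row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
expected = row_totals[:, None] * col_totals[None, :] / observed.to_numpy().sum()

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = ((observed.to_numpy() - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(observed, row_percent, chi_square, dof, sep="\n\n")
```

To turn the statistic into a probability, one option is scipy.stats.chi2_contingency, which is left out here so that the sketch depends only on pandas. By default that function applies a continuity correction to 2x2 tables, so its statistic can differ from the uncorrected value computed above.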
Categorical & Continuous: While exploring the relationship between categorical and continuous variables, we can draw box plots for each level of the categorical variable. If the levels are small in number, the box plots will not show statistical significance. To look at statistical significance, we can perform a Z-test, T-test, or ANOVA.

Z-Test / T-Test: Either test assesses whether the means of two groups are statistically different from each other.
If the probability of Z is small, then the difference between the two averages is more significant.
The T-test is very similar to the Z-test, but it is used when the number of observations in both categories is less than 30.

ANOVA: It assesses whether the averages of more than two groups are statistically different.

Example: Suppose we want to test the effect of five different exercises. For this, we recruit 20 men and assign one type of exercise to each group of 4 men (5 groups). Their weights are recorded after a few weeks. We need to find out whether the effect of these exercises on them is significantly different or not. This can be done by comparing the weights of the 5 groups of 4 men each.

So far, we have covered the first three stages of data exploration: variable identification, univariate analysis, and bi-variate analysis. We also looked at various statistical and visual methods to identify the relationship between variables.

Now, we will look at the methods of missing value treatment. More importantly, we will also look at why missing values occur in our data and why treating them is necessary.
2. Missing Value Treatment
Why is missing value treatment required?
Missing data in the training data set can reduce the power or fit of a model, or can lead to a biased model, because we have not analyzed the behavior of and relationship with other variables correctly. It can lead to wrong predictions or classifications.

Notice the missing values in the image shown above: in the left scenario, we have not treated missing values. The inference from this data set is that the chances of playing cricket are higher for males than for females. On the other hand, if you look at the second table, which shows the data after treatment of missing values (based on gender), we can see that females have higher chances of playing cricket compared to males.
Methods to treat missing values
Deletion: It is of two types: List Wise Deletion and Pair Wise Deletion

Mean / Mode / Median Imputation:
o Generalized Imputation: In this case, we calculate the mean or median of all non-missing values of the variable, then replace the missing values with that mean or median. In the above table, the variable "Manpower" has missing values, so we take the average of all non-missing values of "Manpower" (28.33) and replace the missing values with it.
o Similar case Imputation: In this case, we calculate the average of the non-missing values separately for gender "Male" (29.75) and "Female" (25), then replace each missing value based on gender. For "Male", we replace missing values of manpower with 29.75, and for "Female" with 25.
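A minimal sketch of both imputation styles in pandas. The table is hypothetical, so its averages differ from the 28.33, 29.75, and 25 quoted above; the column names are assumptions.

```python
import pandas as pd

# Hypothetical table; the values are not those of the table referenced above.
df = pd.DataFrame({
    "Gender": ["M", "M", "M", "F", "F", "F"],
    "Manpower": [30.0, None, 28.0, 24.0, 26.0, None],
})

# Generalized imputation: one overall mean for every missing value.
overall = df["Manpower"].fillna(df["Manpower"].mean())

# Similar case imputation: the mean of the same gender group.
group_mean = df.groupby("Gender")["Manpower"].transform("mean")
by_gender = df["Manpower"].fillna(group_mean)

print(overall.tolist())    # [30.0, 27.0, 28.0, 24.0, 26.0, 27.0]
print(by_gender.tolist())  # [30.0, 29.0, 28.0, 24.0, 26.0, 25.0]
```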
3. Techniques of Outlier Detection and Treatment
What is an Outlier?
"Outlier" is a term commonly used by analysts and data scientists, as outliers need close attention; otherwise they can result in wildly wrong estimations. Simply speaking, an outlier is an observation that appears far away and diverges from the overall pattern in a sample.

Let's take an example: we do customer profiling and find that the average annual income of customers is $0.8 million. But there are two customers with annual incomes of $4 million and $4.2 million. These two customers' annual incomes are much higher than those of the rest of the population. These two observations will be seen as outliers.
What is the impact of Outliers on a dataset?
Outliers can drastically change the results of data analysis and statistical modeling. There are numerous unfavourable impacts of outliers in a data set:

o They increase the error variance and reduce the power of statistical tests.
o If the outliers are non-randomly distributed, they can decrease normality.
o They can bias or influence estimates that may be of substantive interest.
o They can also violate the basic assumptions of regression, ANOVA, and other statistical models.

To understand the impact more deeply, let's take an example to check what happens to a data set with and without outliers.
How to detect Outliers?
The most commonly used method to detect outliers is visualization. We use various visualization methods, such as box plots, histograms, and scatter plots (above, we have used a box plot and a scatter plot for visualization).

How to remove Outliers?

Most of the ways to deal with outliers are similar to the methods for missing values: deleting observations, transforming them, binning them, treating them as a separate group, imputing values, and other statistical methods. Here, we will discuss the common techniques used to deal with outliers:

Deleting observations: We delete outlier values if they are due to data entry errors or data processing errors, or if the outlier observations are very small in number. We can also use trimming at both ends to remove outliers.

Transforming and binning values: Transforming variables can also eliminate outliers. Taking the natural log of a value reduces the variation caused by extreme values. Binning is also a form of variable transformation. The decision tree algorithm deals with outliers well because it bins variables. We can also use the process of assigning weights to different observations.
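A sketch of detection, deletion, and log transformation on hypothetical income data. The 1.5 × IQR fence is the usual box-plot convention and is an assumption here; the text above does not state a numeric rule.

```python
import math
from statistics import quantiles

# Hypothetical annual incomes in millions of dollars; the last two are extreme.
incomes = [0.5, 0.6, 0.65, 0.7, 0.75, 0.8, 0.8, 0.85, 0.9, 0.95, 1.0, 1.1, 4.0, 4.2]

# Box-plot rule (a common convention, assumed here): flag values beyond
# 1.5 * IQR from the quartiles.
q1, _, q3 = quantiles(incomes, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in incomes if v < low or v > high]
print(outliers)  # [4.0, 4.2]

# Deleting observations: keep only values inside the fences.
trimmed = [v for v in incomes if low <= v <= high]

# Transforming: the natural log compresses the gap caused by extreme values.
logged = [math.log(v) for v in incomes]
print(max(incomes) - min(incomes), max(logged) - min(logged))
```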
4. The Art of Feature Engineering

What is Feature Engineering?


Feature engineering is the science (and art) of extracting more information from existing data. You are not adding any new data here, but you are actually
making the data you already have more useful.

For example, let’s say you are trying to predict foot fall in a shopping mall based on dates. If you try and use the dates directly, you may not be able to extract
meaningful insights from the data. This is because the foot fall is less affected by the day of the month than it is by the day of the week. Now this information
about day of week is implicit in your data. You need to bring it out to make your model better.
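A minimal sketch of bringing out the day of the week with pandas. The dates, foot-fall counts, and column names are hypothetical.

```python
import pandas as pd

# Hypothetical daily foot-fall counts.
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-07", "2024-01-08"]),
    "foot_fall": [1200, 2500, 2700, 900],
})

# New features derived from the existing date column; no new data is added.
df["day_of_week"] = df["date"].dt.day_name()
df["is_weekend"] = df["date"].dt.dayofweek >= 5   # Monday = 0, Sunday = 6

print(df)
```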
