Python

The document provides information on installing Anaconda Navigator and describes various Python data types including numeric, string, list, tuple, dictionary and set. It discusses how to define and manipulate each data type in Python. For lists, it demonstrates how to access list items, combine and modify lists. For tuples, it notes they are immutable unlike lists. The document defines dictionaries as mapping types that contain keys and values, and shows how to access values using keys. Finally, it briefly introduces sets as unordered collections of unique items.

Uploaded by

Rohit

M3 PROGRAMMING FOR ANALYTICS

LEARNING OBJECTIVES
o Installation of Anaconda Navigator, Data types – string, tuples, set, lists, dictionary, Arrays. Spyder, Importing and Exporting Files,
Data Manipulation, Descriptive Statistics and Documentation with Jupyter.

o INTRODUCTION: DATABASE MANAGEMENT SYSTEMS

o DATA DEFINITION AND MANIPULATION
o BASICS OF SAS
o PYTHON: BASICS OF PYTHON
o R PROGRAMMING
PYTHON
INSTALLATION OF ANACONDA NAVIGATOR

1. Go to the Anaconda website and choose a Python 3.x graphical installer.
2. Locate your download and double-click it. When the installer's first screen appears, click Next.
3. Read the license agreement and click I Agree.
4. Click Next.
5. Note your installation location and then click Next.
6. This is an important part of the installation process. The recommended approach is to leave the box for adding Anaconda to your PATH unchecked. This means you will have to use Anaconda Navigator or the Anaconda Command Prompt (located in the Start Menu under "Anaconda") when you wish to use Anaconda. You can always add Anaconda to your PATH later.
7. Click Next.
8. You can install Microsoft VSCode if you wish, but it is optional.
9. Click Finish.
Python Data Types
Python data types define the kind of value a variable holds.
1. Python Data Type – Numeric
2. Python Data Type – String
3. Python Data Type – List
4. Python Data Type – Tuple
5. Dictionary
6. Python Sets

1. Python Data Type – Numeric
Python numeric data types hold numeric values:
int – holds signed integers of unlimited length.
long – holds long integers (exists in Python 2.x only; Python 3.x removed it and uses int instead).
float – holds floating-point numbers, accurate to about 15 significant decimal digits.
complex – holds complex numbers.

Unlike C or C++, Python does not require declaring a data type when creating a variable; we simply assign a value. To see what type of value a variable currently holds, we can use type().
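A minimal sketch of checking types with type(). The variable names and values here are illustrative and do not come from the slides.

```python
# Illustrative variables; Python infers the type from the assigned value.
count = 42
price = 19.99
z = 3 + 4j

print(type(count))   # <class 'int'>
print(type(price))   # <class 'float'>
print(type(z))       # <class 'complex'>

# Python 3 integers are not limited to a fixed width.
big = 2 ** 100
print(type(big))     # <class 'int'>
```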
2. Python Data Type – String
You can create a Python string using single, double, or triple quotes.

mystring = "Hello"
print(mystring)
Hello

mystring = '''Hello'''
print(mystring)
Hello

mystring = """Hello"""
print(mystring)
Hello

mystring = 'Hello"Python"'
print(mystring)
Hello"Python"

How to extract the Nth letter?

mystring = 'Hi How are you?'
mystring[0]
'H'

mystring[0] refers to the first letter, as indexing in Python starts from 0. Similarly, mystring[1] refers to the second letter. To pull the last letter, you can use -1 as the index.

To get the first word

mystring = 'Hi How are you?'
mystring.split(' ')[0]
'Hi'

1. mystring.split(' ') tells Python to use a space as the delimiter (there should be a space between the quotes).
2. mystring.split(' ')[0] picks the first word of the string.
3. Python Data Type – List
Unlike a string, a list can contain different types of objects such as integers, floats, and strings. Formally, a list is an ordered sequence of data written using square brackets ([]) and commas (,).

x = [142, 124, 234, 345, 465]
y = ['A', 'C', 'E', 'M']
z = ['AA', 44, 5.1, 'KK']

Get List Item

We can extract list items using indexes. The index starts at 0 and ends at (number of elements - 1).
Syntax: list[start : stop : step]
start: the starting position (included)
stop: the end position (excluded)
step: the increment value

k = [124, 225, 305, 246, 259]
k[0]
124

k[:3] returns [124, 225, 305]
k[0:3] also returns [124, 225, 305]
k[::-1] reverses the whole list and returns [259, 246, 305, 225, 124]
Combine / Join lists
The '+' operator concatenates lists.

X = [1, 2, 3]
Y = [4, 5, 6]
Z = X + Y
print(Z)
[1, 2, 3, 4, 5, 6]

Sum of values of two lists (element by element)

list1 = [1, 2, 3]
list2 = [4, 5, 6]
sum_list = []
for (item1, item2) in zip(list1, list2):
    sum_list.append(item1 + item2)
print(sum_list)
[5, 7, 9]

Sum of values using numpy

import numpy as np
X = [1, 2, 3]
Y = [4, 5, 6]
Z = np.add(X, Y)
print(Z)
[5 7 9]

The '*' operator repeats a list N times.

X = [1, 2, 3]
Z = X * 3
print(Z)
[1, 2, 3, 1, 2, 3, 1, 2, 3]

Modify / Replace a list item

X = [1, 2, 3]
X[2] = 5
print(X)
[1, 2, 5]

4. Python Data Type – Tuple
Like a list, a tuple can contain mixed data. However, a tuple is immutable: it cannot be changed once created, whereas a list can be modified.

Another difference is that a tuple is written inside parentheses ( ), whereas a list is written inside square brackets [ ].

Examples:

a = (1, 2, 3, 4)
print(a)
(1, 2, 3, 4)

b = ("hello", 1, 2, 3, "go")
print(b)
('hello', 1, 2, 3, 'go')

A tuple cannot be altered

a = (1, 2, 3, 4)
a[2] = 5
TypeError: 'tuple' object does not support item assignment
5. Dictionary
A dictionary works like an address book in which you find a person's address by searching for their name. In this example, the name of the person is the key and the address is the value. Keys must be unique, while values need not be: if a key were duplicated, you could not tell which value belongs to it. Keys can be of any immutable (hashable) data type, such as strings, numbers, or tuples.

Create a dictionary
A dictionary is defined in curly braces {}. Each key is followed by a colon (:) and then its value.

teams = {'Dave' : ['teamA','teamAA', 'teamAB'],
         'Tim' : ['teamB','teamBB','teamBC'],
         'Babita' : ['teamC','teamCB','teamCC']
        }

teams.keys()
dict_keys(['Dave', 'Tim', 'Babita'])

teams.values()
dict_values([['teamA', 'teamAA', 'teamAB'], ['teamB', 'teamBB', 'teamBC'], ['teamC', 'teamCB', 'teamCC']])
Find values of a particular key

teams = {'Dave' : ['teamA','teamAA', 'teamAB'],
         'Tim' : ['teamB','teamBB','teamBC'],
         'Babita' : ['teamC','teamCB','teamCC']
        }
teams['Dave']
['teamA', 'teamAA', 'teamAB']

Delete a key

del teams['Babita']
teams
{'Dave': ['teamA', 'teamAA', 'teamAB'],
 'Tim': ['teamB', 'teamBB', 'teamBC']}

6. Python Sets
A set is an unordered collection of unique items, written as comma-separated values inside braces { }. Sets are mainly used to check whether an object is present and to compute mathematical operations such as intersection, union, and difference.
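The set operations named above can be sketched as follows. The two example sets are hypothetical and are not taken from the slides.

```python
# Hypothetical example sets.
a = {1, 2, 3, 3, 4}      # the duplicate 3 is stored only once
b = {3, 4, 5}

print(len(a))            # 4
print(3 in a)            # True (membership test)

union = a | b            # items in either set
intersection = a & b     # items in both sets
difference = a - b       # items in a but not in b

print(union, intersection, difference)
```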
Click Anaconda Navigator.
Click Launch for Spyder. Alternatively, if you have a terminal window open, you can launch Spyder by typing spyder and pressing Enter. You may get a pop-up window saying that Spyder is not the latest version; this is just because the version bundled with Anaconda is a few revisions behind.
Spyder (Python)
File Explorer or directory listing.
In this pane, you can find files that you want to edit or create new files and folders to work with.

Script editor or File editor

In this editor, you can work on Python scripts that you want to save to re-run later on. By default, the editor opens a file called temp.py
located in Spyder’s configuration directory. This file is meant as a temporary place to try things out before you save them in a file
somewhere else on your computer.

Console
In the bottom center is the console. Like in MATLAB, the console is where you can run commands to see what they do or when you want to
debug some code. Variables created in the console are not saved if you close Spyder and open it up again. The console is technically
running IPython by default.

History
Any commands that you type in the console will be logged into the history file in the bottom right pane of the window.

Variable Explorer
Furthermore, any variables that you create in the console will be shown in the variable explorer in the top right pane.
Importing and Exporting

Importing Data
– The process of loading and reading data into Python from various sources.
– Two important properties:
– Format
Various formats: .csv, .json, .xlsx, ...
– File path of dataset
Computer: /Desktop/mydata.csv

Example: reading people.csv

import csv
with open('people.csv', 'r', newline='') as file:
    reader = csv.reader(file)  # yields each row as a list of strings
    for row in reader:
        print(row)
How to import data in Python.
1. Import CSV files
Note that a single backslash does not work when specifying the file path, because a backslash starts an escape sequence inside a regular string. You need to either change it to a forward slash or add one more backslash, as below.

import pandas as pd
mydata = pd.read_csv("C:\\Users\\likhita\\Documents\\file1.csv")

If there is no header (title) row in the raw data file

mydata1 = pd.read_csv("C:\\Users\\likhita\\Documents\\file1.csv", header = None)
You need to include the header = None option to tell pandas that the data has no column names (header).

Add Column Names

We can include column names by using the names= option.
mydata2 = pd.read_csv("C:\\Users\\likhita\\Documents\\file1.csv", header = None, names = ['ID', 'first_name', 'salary'])
The column names can also be added separately with the following command.
mydata1.columns = ['ID', 'first_name', 'salary']
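To try header=None and names= without a file on disk, here is a minimal sketch that reads from an in-memory string. The three data rows are hypothetical and only stand in for file1.csv.

```python
import io
import pandas as pd

# Hypothetical raw data with no header row, standing in for file1.csv.
raw = "1,Asha,52000\n2,Ben,48000\n3,Chen,61000\n"

# header=None stops pandas from treating the first data row as column names.
mydata1 = pd.read_csv(io.StringIO(raw), header=None)
print(list(mydata1.columns))    # [0, 1, 2]

# names= supplies the column names at read time.
mydata2 = pd.read_csv(io.StringIO(raw), header=None,
                      names=['ID', 'first_name', 'salary'])
print(mydata2.shape)            # (3, 3)

# Column names can also be assigned afterwards.
mydata1.columns = ['ID', 'first_name', 'salary']
print(mydata1.equals(mydata2))  # True
```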
2. Import File from URL
You don't need to perform additional steps to fetch data from a URL. Simply pass the URL to the read_csv() function (applicable only to CSV files stored at a URL).
mydata = pd.read_csv("http://winterolympicsmedals.com/medals.csv")

3. Read Text File

We can use the read_table() function to pull data from a text file. We can also use read_csv() with sep="\t" to read data from a tab-separated file.
mydata = pd.read_table("C:\\Users\\likhita\\Desktop\\example2.txt")
mydata = pd.read_csv("C:\\Users\\likhita\\Desktop\\example2.txt", sep="\t")

4. Read Excel File

The read_excel() function can be used to import Excel data into Python. Current pandas versions spell the sheet option sheet_name (the older sheetname spelling is no longer accepted).

mydata = pd.read_excel("https://www.eia.gov/dnav/pet/hist_xls/RBRTEd.xls", sheet_name="Data 1", skiprows=2)

5. Read delimited file
Suppose you need to import a file that is separated by white space.
mydata2 = pd.read_table("http://www.ssc.wisc.edu/~bhansen/econometrics/invest.dat", sep=r"\s+", header = None)
To include variable names, use the names= option as below.
mydata3 = pd.read_table("http://www.ssc.wisc.edu/~bhansen/econometrics/invest.dat", sep=r"\s+", names=['a', 'b', 'c', 'd'])

6. Read sample of rows and columns

By specifying nrows= and usecols=, you can fetch a specified number of rows and columns.
mydata7 = pd.read_csv("http://winterolympicsmedals.com/medals.csv", nrows=5, usecols=[1, 5, 7])
nrows=5 means you want to import only the first 5 rows, and usecols= lists the zero-based positions of the columns you want to import.

7. Skip rows while importing

Suppose you want to skip the first 5 rows and read data from the 6th row (the 6th row would be the header row).
mydata8 = pd.read_csv("http://winterolympicsmedals.com/medals.csv", skiprows=5)
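The effect of skiprows=, nrows= and usecols= can be checked offline with a small in-memory file. The text below is made up for illustration; it is not the medals data set.

```python
import io
import pandas as pd

# Hypothetical CSV text: 2 junk lines, then a header row and 4 data rows.
raw = (
    "exported by some tool\n"
    "do not edit\n"
    "year,city,sport,medal\n"
    "1924,Chamonix,Skating,Gold\n"
    "1924,Chamonix,Skating,Silver\n"
    "1928,St. Moritz,Skiing,Gold\n"
    "1928,St. Moritz,Skiing,Bronze\n"
)

# skiprows=2 drops the first two lines, so the third line becomes the header.
full = pd.read_csv(io.StringIO(raw), skiprows=2)
print(full.shape)              # (4, 4)

# nrows=2 keeps only the first two data rows; usecols picks columns by position.
sample = pd.read_csv(io.StringIO(raw), skiprows=2, nrows=2, usecols=[0, 3])
print(list(sample.columns))    # ['year', 'medal']
print(len(sample))             # 2
```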
Printing the dataframe in Python
- df prints the entire dataframe (not recommended for large datasets).
- df.head(n) to show the first n rows of data frame.
- df.tail(n) shows the bottom n rows of data frame.
Exporting a Pandas dataframe to CSV
- Preserve progress anytime by saving modified dataset using
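The code from this slide is not preserved in the text. One common way to save a data frame, offered here as an assumption rather than the author's exact call, is pandas DataFrame.to_csv(). The data and the file name mydata_clean.csv are hypothetical.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3], 'salary': [52000, 48000, 61000]})

# Hypothetical output location; a temporary folder keeps the sketch self-contained.
out_path = os.path.join(tempfile.mkdtemp(), 'mydata_clean.csv')

# index=False leaves the row index out of the file.
df.to_csv(out_path, index=False)

# Reading the file back confirms the round trip.
restored = pd.read_csv(out_path)
print(restored.head(2))   # first 2 rows
print(restored.tail(1))   # last row
```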
Data Manipulation

Melt
Pivot
Cross tab
Cut
Merge
Concat
Unique
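The slides for these operations contain only screenshots. As a hedged sketch, the pandas functions that usually go by these names are shown below on small made-up tables; the data and variable names are assumptions, not the slides' examples.

```python
import pandas as pd

# Hypothetical wide table: one row per student, one column per subject.
wide = pd.DataFrame({'name': ['Ann', 'Bob'],
                     'math': [90, 70],
                     'art': [65, 85]})

# melt: wide -> long (one row per name/subject pair).
long = pd.melt(wide, id_vars='name', var_name='subject', value_name='score')

# pivot: long -> wide again.
back = long.pivot(index='name', columns='subject', values='score')

# crosstab: frequency table of two categorical columns.
people = pd.DataFrame({'sex': ['F', 'M', 'F', 'M'],
                       'approved': ['Yes', 'No', 'Yes', 'Yes']})
table = pd.crosstab(people['sex'], people['approved'])

# cut: bin a continuous variable into labelled intervals.
ages = pd.Series([22, 37, 58])
groups = pd.cut(ages, bins=[0, 30, 50, 100], labels=['young', 'mid', 'senior'])

# merge: SQL-style join on a key column.
left = pd.DataFrame({'ID': [1, 2], 'first_name': ['Ann', 'Bob']})
right = pd.DataFrame({'ID': [1, 2], 'salary': [52000, 48000]})
merged = pd.merge(left, right, on='ID')

# concat: stack data frames on top of each other.
stacked = pd.concat([left, left], ignore_index=True)

# unique: distinct values of a column.
distinct = stacked['first_name'].unique()

print(long, back, table, groups, merged, stacked, distinct, sep='\n\n')
```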
Descriptive Statistics

Mean Example (Python)

Data
Marital_status: Whether the applicant is married ("Yes") or not ("No").
Dependents: Number of dependents of the applicant.
Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No").
Income: Annual Income of the applicant (in USD).
Loan_amount: Loan amount (in USD) for which the application was submitted.
Term_months: Tenure of the loan (in months).
Credit_score: Whether the applicant's credit score was good ("Satisfactory") or not ("Not_satisfactory").
Age: The applicant’s age in years.
Sex: Whether the applicant is female (F) or male (M).
approval_status: Whether the loan application was approved ("Yes") or not ("No").
Data vs Information
Measures of Central Tendency

Measures of central tendency describe the center of the data, and are often represented by the mean, the median, and the mode.

Mean
Arithmetic Mean (called mean) is defined as the sum of all observations in a data set divided by the total number of
observations.
Mean Example

The inner diameters of a particular grade of tire, based on 5 sample measurements, are as follows (in millimeters):

565, 570, 572, 568, 585

We get mean = (565 + 570 + 572 + 568 + 585) / 5 = 2860 / 5 = 572

Caution: Arithmetic Mean is affected by extreme values or fluctuations in sampling. It is not the best average to use when the data set contains extreme
values (Very high or very low values).
Mean Example (Python)
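The code from this slide is not preserved in the text. As a minimal sketch, the tire example above can be checked with plain Python and the standard statistics module.

```python
import statistics

diameters_mm = [565, 570, 572, 568, 585]   # tire measurements from the example

# Mean = sum of observations / number of observations.
mean_by_hand = sum(diameters_mm) / len(diameters_mm)
mean_stdlib = statistics.mean(diameters_mm)

print(mean_by_hand)   # 572.0
print(mean_stdlib)    # 572
```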

Median
Median is the middle most observation when you arrange data in ascending order of magnitude. Median is such that 50% of the
observations are above the median and 50% of the observations are below the median.

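Median is a very useful measure for ranked data in the context of consumer preferences and ratings. It is not affected by extreme values (greater resistance to outliers).

For an odd number of observations, Median = the ((n+1)/2)th value of the ordered data, where n = number of observations in the sample. For an even n, it is the average of the two middle values.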

Median example
Marks obtained by 7 students in a Computer Science exam are given below. Compute the median.

45 40 60 80 90 65 55

Arranging the data in ascending order gives

40 45 55 60 65 80 90

Median = ((n+1)/2)th value in this set = ((7+1)/2)th observation = 4th observation = 60

Hence Median = 60 for this problem.

Median example(Python)
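The code from this slide is not preserved in the text. A minimal sketch of the marks example, first by position and then with the standard statistics module:

```python
import statistics

marks = [45, 40, 60, 80, 90, 65, 55]   # marks from the example

# By hand: sort, then take the ((n+1)/2)th value (n is odd here).
ordered = sorted(marks)
n = len(ordered)
middle = ordered[(n + 1) // 2 - 1]      # 4th value; list indexes start at 0

print(middle)                           # 60
print(statistics.median(marks))         # 60
```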
Mode
Mode is that value which occurs most often. It has the maximum frequency of occurrence. Mode also has resistance to outliers.

Mode is a very useful measure when, for example, you want to stock the most popular shirt collar size during the festival season.

Caution: In some real-life problems there is more than one mode (bimodal or multi-modal data). In these cases the mode cannot be uniquely determined.

Mode example
The lives, in hours, of 10 flashlight batteries are as follows. Find the mode.

340 350 340 340 320 340 330 330 340 350

340 occurs five times. Hence, mode=340.

Mode example (Python)
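The code from this slide is not preserved in the text. A minimal sketch of the battery example using the standard library; the multimode() call on a second, made-up list illustrates the bimodal caution above.

```python
import statistics
from collections import Counter

hours = [340, 350, 340, 340, 320, 340, 330, 330, 340, 350]   # from the example

battery_mode = statistics.mode(hours)
top = Counter(hours).most_common(1)

print(battery_mode)   # 340
print(top)            # [(340, 5)]

# Hypothetical bimodal data: multimode() returns every most-frequent value.
modes = statistics.multimode([1, 1, 2, 2, 3])
print(modes)          # [1, 2]
```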

Measure of Dispersion
In simple terms, measures of dispersion indicate how large the spread of the distribution is around the central tendency. It answers unambiguously the question
" What is the magnitude of departure from the average value for different groups having identical averages?".

Range
Range is the simplest of all measures of dispersion. It is calculated as the difference between maximum and minimum value in the data set.

Range = $X_{\max} - X_{\min}$

Range Example

The following data represent the percentage return on investment per annum for 10 mutual funds. Calculate the range.

12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9

Range = 18 - 9 = 9

Caution: If either component of the range, the maximum or the minimum value, is an extreme value, then the range should not be used.
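A minimal sketch of the range calculation for the mutual fund example:

```python
returns = [12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9]   # % returns from the example

# Range = largest value minus smallest value.
data_range = max(returns) - min(returns)
print(data_range)   # 9
```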
Inter Quartile Range

IQR= Range computed on middle 50% of the observations after eliminating the highest and lowest 25% of observations in a data set that is arranged in
ascending order.

IQR is less affected by outliers.

IQR = Q3 -Q1

IQR Example
The following data represent the percentage return on investment for 9 mutual funds per annum. Calculate interquartile range.

Data Set: 12, 14, 11, 18, 10.5, 12, 14, 11, 9

Arranging in ascending order, the data set becomes

9, 10.5, 11, 11, 12, 12, 14, 14, 18

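Q1 lies at position (n+1)/4 = 2.5, midway between 10.5 and 11, so Q1 = 10.75. Q3 lies at position 3(n+1)/4 = 7.5, midway between 14 and 14, so Q3 = 14.

IQR = Q3 - Q1 = 14 - 10.75 = 3.25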
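A minimal sketch of the IQR example. It uses statistics.quantiles() with method='exclusive', which places quartiles at positions based on (n+1) and so reproduces the hand calculation; other tools (for example NumPy's default percentile method) interpolate differently and can give a different IQR for the same data.

```python
import statistics

returns = [12, 14, 11, 18, 10.5, 12, 14, 11, 9]   # % returns from the example

# n=4 asks for quartiles; the function sorts the data internally.
q1, q2, q3 = statistics.quantiles(returns, n=4, method='exclusive')
iqr = q3 - q1

print(q1, q3)   # 10.75 14.0
print(iqr)      # 3.25
```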
Standard Deviation
Standard deviation forms the cornerstone for Inferential Statistics. To define standard deviation, you need to define another term
called variance.

In simple terms, standard deviation is the square root of variance.

Example
The following data represent the percentage return on investment for 10 mutual funds per annum.

Calculate the sample standard deviation.

12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9
Standard Deviation example (Python)
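The code from this slide is not preserved in the text. A minimal sketch for the mutual fund example, computing the sample variance (divisor n - 1) step by step and then taking its square root:

```python
import math
import statistics

returns = [12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9]   # % returns from the example

mean = sum(returns) / len(returns)

# Sample variance: sum of squared deviations divided by (n - 1).
squared_dev = sum((x - mean) ** 2 for x in returns)
sample_var = squared_dev / (len(returns) - 1)

# Standard deviation is the square root of the variance.
sample_sd = math.sqrt(sample_var)

print(round(sample_sd, 2))                   # 2.52
print(round(statistics.stdev(returns), 2))   # 2.52
```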

Variance example (Python)

Variance is another measure of dispersion. It is the square of the standard deviation and the covariance of the random variable with itself. The line of code below
prints the variance of all the numerical variables in the dataset. The interpretation of the variance is similar to that of the standard deviation.
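The line of code from this slide is not preserved in the text. A minimal sketch, assuming a pandas DataFrame that borrows two column names from the data description (Income, Loan_amount) with made-up values:

```python
import pandas as pd

# Hypothetical rows; only the column names come from the data description.
df = pd.DataFrame({'Income': [30000, 45000, 60000],
                   'Loan_amount': [10000, 12000, 20000],
                   'Sex': ['F', 'M', 'F']})

# numeric_only=True skips text columns such as Sex.
variances = df.var(numeric_only=True)
print(variances)

# Variance is the square of the standard deviation.
print(df['Income'].std() ** 2)
```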
Coefficient of Variation (Relative Dispersion)

Coefficient of Variation (CV) is defined as the ratio of the standard deviation to the mean.

In symbolic form,
$CV = s / \bar{x}$ for the sample data and $CV = \sigma / \mu$ for the population data.

Consider two sales persons working in the same territory. Their sales performance in selling PCs is given below. Comment on the results.

Sales Person 1: mean sales (one-year average) = 50 units, standard deviation = 5 units
Sales Person 2: mean sales (one-year average) = 75 units, standard deviation = 25 units
The CV is 5/50 =0.10 or 10% for the Sales Person1 and 25/75=0.33 or 33% for sales Person2.

The moral of the story is "don't get carried away by absolute number". Look at the scatter. Even though, Sales Person2 has achieved a higher average, his
performance is not consistent and seems erratic.
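The comparison above can be sketched in a few lines. The helper name coefficient_of_variation is illustrative, not from the slides.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the standard deviation as a fraction of the mean."""
    return std_dev / mean

cv1 = coefficient_of_variation(5, 50)    # Sales Person 1
cv2 = coefficient_of_variation(25, 75)   # Sales Person 2

print(f"{cv1:.0%}")   # 10%
print(f"{cv2:.0%}")   # 33%

# The lower CV marks the more consistent performer.
more_consistent = 'Sales Person 1' if cv1 < cv2 else 'Sales Person 2'
print(more_consistent)   # Sales Person 1
```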
The Boxplot
The boxplot is a graphical display of the data based on the five-number summary:

Xsmallest, Q1, Median, Q3, Xlargest
Skew example (python)

The skewness values can be interpreted in the following manner:

o Highly skewed distribution: If the skewness value is less than −1 or greater than +1.
o Moderately skewed distribution: If the skewness value is between −1 and −½ or between +½ and +1.
o Approximately symmetric distribution: If the skewness value is between −½ and +½.
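The code from this slide is not preserved in the text. A minimal sketch, assuming pandas Series.skew() as the skewness measure and applying the interpretation rules above; the helper name describe_skew and both data sets are hypothetical.

```python
import pandas as pd

def describe_skew(value):
    """Map a skewness value onto the three bands listed above."""
    if value < -1 or value > 1:
        return 'highly skewed'
    if abs(value) >= 0.5:
        return 'moderately skewed'
    return 'approximately symmetric'

# Hypothetical incomes with one very large value (long right tail).
incomes = pd.Series([20, 22, 23, 25, 27, 30, 95])
skew = incomes.skew()
print(describe_skew(skew))      # highly skewed

# A perfectly symmetric series has zero skewness.
symmetric = pd.Series([1, 2, 3, 4, 5])
print(describe_skew(symmetric.skew()))   # approximately symmetric
```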
Data Exploration

1. Steps of Data Exploration and Preparation

o Variable Identification
o Univariate Analysis
o Bi-variate Analysis
o Missing values treatment
o Outlier treatment
o Variable transformation
o Variable creation

1. Variable Identification
First, identify Predictor (Input) and Target (output) variables. Next, identify the data type and category of the variables.

Let’s understand this step more clearly by taking an example.

Example: Suppose we want to predict whether students will play cricket or not. Here you need to identify the predictor variables, the target variable, the data type of each variable, and the category of each variable.

Univariate Analysis
At this stage, we explore variables one by one. The method used for univariate analysis depends on whether the variable is categorical or continuous. Let's look at the methods and statistical measures for each type:

Continuous Variables: For continuous variables, we need to understand the central tendency and spread of the variable. These are measured using various statistical metrics and visualization methods.

Note: Univariate analysis is also used to highlight missing and outlier values. In the upcoming part of this series, we will look at methods to handle missing and outlier values.

Categorical Variables: For categorical variables, we use a frequency table to understand the distribution of each category. It can also be read as the percentage of values under each category. It can be measured using two metrics, Count and Count%, against each category. A bar chart can be used for visualization.
Bi-variate Analysis
Bi-variate Analysis finds out the relationship between two variables. Here, we look for association and disassociation between variables at a pre-defined
significance level. We can perform bi-variate analysis for any combination of categorical and continuous variables. The combination can be: Categorical &
Categorical, Categorical & Continuous and Continuous & Continuous. Different methods are used to tackle these combinations during analysis process.

Let’s understand the possible combinations in detail:

Continuous & Continuous: While doing bi-variate analysis between two continuous variables, we should look at scatter plot. It is a nifty way to find out the
relationship between two variables. The pattern of scatter plot indicates the relationship between variables. The relationship can be linear or non-linear.
A scatter plot shows the relationship between two variables but does not indicate the strength of that relationship. To find the strength of the relationship, we use correlation. Correlation varies between -1 and +1.

-1: perfect negative linear correlation
+1: perfect positive linear correlation
0: no linear correlation

Correlation can be derived using the following formula:

Correlation = Covariance(X,Y) / SQRT( Var(X) * Var(Y) )

Various tools have functions to compute the correlation between variables. In Excel, the function CORREL() returns the correlation between two variables, and SAS uses the procedure PROC CORR. These functions return the Pearson correlation value, which identifies the relationship between two variables.

In the slide's example, there is a good positive relationship (0.65) between two variables X and Y.
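The formula can be checked numerically. This is a minimal sketch with NumPy on made-up data (not the slide's example), comparing the formula with NumPy's built-in Pearson correlation.

```python
import numpy as np

# Hypothetical paired observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Correlation = Covariance(X, Y) / SQRT(Var(X) * Var(Y)),
# using the same divisor (n - 1) for covariance and variances.
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r = cov_xy / np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))

# np.corrcoef returns the Pearson correlation matrix.
r_builtin = np.corrcoef(x, y)[0, 1]

print(round(r, 4), round(r_builtin, 4))
```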
Categorical & Categorical: To find the relationship between two categorical variables, we can use the following methods:

Two-way table: We can start analyzing the relationship by creating a two-way table of count and count%. The rows represent the categories of one variable and the columns represent the categories of the other variable. We show the count or count% of observations available in each combination of row and column categories.
Stacked Column Chart: This method is more of a visual form of the two-way table.

Chi-Square Test: This test is used to derive the statistical significance of the relationship between the variables. It also tests whether the evidence in the sample is strong enough to generalize the relationship to a larger population. Chi-square is based on the difference between the expected and observed frequencies in one or more categories of the two-way table. It returns the probability for the computed chi-square statistic with the given degrees of freedom.
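A minimal sketch of a two-way table and the chi-square statistic computed from observed and expected counts. The survey data are hypothetical, and the statistic is the plain Pearson chi-square without a continuity correction.

```python
import pandas as pd

# Hypothetical survey: gender vs. whether the person plays cricket.
df = pd.DataFrame({
    'gender': ['M'] * 10 + ['F'] * 10,
    'plays':  ['Yes'] * 8 + ['No'] * 2 + ['Yes'] * 3 + ['No'] * 7,
})

# Two-way table of counts and of row percentages (count%).
counts = pd.crosstab(df['gender'], df['plays'])
row_pct = pd.crosstab(df['gender'], df['plays'], normalize='index') * 100

# Expected count for each cell = row total * column total / grand total.
observed = counts.to_numpy()
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-square = sum of (observed - expected)^2 / expected over all cells.
chi_square = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(counts)
print(row_pct)
print(round(chi_square, 3), dof)
```

Turning the statistic into a probability requires the chi-square distribution; if SciPy is available, scipy.stats.chi2_contingency performs the whole test on the table of counts.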
Categorical & Continuous: While exploring the relationship between categorical and continuous variables, we can draw box plots for each level of the categorical variable. If the levels are small in number, the plots will not show statistical significance. To look at statistical significance, we can perform a Z-test, T-test, or ANOVA.

Z-Test / T-Test: Either test assesses whether the means of two groups are statistically different from each other.
If the probability of Z is small, then the difference between the two averages is more significant.
The T-test is very similar to the Z-test, but it is used when the number of observations in both categories is less than 30.

ANOVA: It assesses whether the averages of more than two groups are statistically different.

Example: Suppose, we want to test the effect of five different exercises. For this, we recruit 20 men and assign one type of exercise to 4 men (5 groups). Their
weights are recorded after a few weeks. We need to find out whether the effect of these exercises on them is significantly different or not. This can be done by
comparing the weights of the 5 groups of 4 men each.

So far, we have covered the first three stages of data exploration: Variable Identification, Univariate Analysis, and Bi-variate Analysis. We also looked at various statistical and visual methods to identify the relationship between variables.

Now, we will look at the methods of missing value treatment. More importantly, we will also look at why missing values occur in our data and why treating them is necessary.
2. Missing Value Treatment
Why is missing value treatment required?
Missing data in the training data set can reduce the power / fit of a model or can lead to a biased model, because we have not analyzed the behavior of and relationships with other variables correctly. It can lead to wrong predictions or classifications.

Notice the missing values in the image shown above: In the left scenario, we have not treated missing values. The inference from this data set is that the chances
of playing cricket by males is higher than females. On the other hand, if you look at the second table, which shows data after treatment of missing values (based
on gender), we can see that females have higher chances of playing cricket compared to males.
Methods to treat missing values
Mean/ Mode/ Median Imputation:
o Generalized Imputation: In this case, we calculate the mean or median of all non-missing values of the variable and then replace the missing values with it. In the table above, the variable "Manpower" has missing values, so we take the average of all non-missing values of "Manpower" (28.33) and replace the missing values with it.
o Similar case Imputation: In this case, we calculate the average of the non-missing values separately for gender "Male" (29.75) and "Female" (25), then replace each missing value based on gender. For "Male", we replace missing values of Manpower with 29.75, and for "Female" with 25.
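Both imputation styles can be sketched with pandas. The table below is hypothetical; it reuses the column names Gender and Manpower but does not reproduce the slide's values, so the averages differ from 28.33, 29.75 and 25.

```python
import pandas as pd

# Hypothetical data; None marks a missing Manpower value.
df = pd.DataFrame({'Gender': ['Male', 'Male', 'Female', 'Female', 'Male'],
                   'Manpower': [30, None, 25, None, 28]})

# Generalized imputation: fill every gap with the overall mean.
overall_mean = df['Manpower'].mean()
generalized = df['Manpower'].fillna(overall_mean)

# Similar case imputation: fill each gap with the mean of the same gender.
group_mean = df.groupby('Gender')['Manpower'].transform('mean')
similar_case = df['Manpower'].fillna(group_mean)

print(generalized.tolist())
print(similar_case.tolist())   # [30.0, 29.0, 25.0, 25.0, 28.0]
```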
3. Techniques of Outlier Detection and Treatment
What is an Outlier?
"Outlier" is a term commonly used by analysts and data scientists, because outliers need close attention; otherwise they can result in wildly wrong estimates. Simply speaking, an outlier is an observation that appears far away from, and diverges from, the overall pattern in a sample.

Let's take an example: we do customer profiling and find that the average annual income of customers is $0.8 million. But there are two customers with annual incomes of $4 million and $4.2 million. These two customers' annual incomes are much higher than those of the rest of the population. These two observations will be seen as outliers.
What is the impact of Outliers on a dataset?
Outliers can drastically change the results of data analysis and statistical modeling. Outliers have numerous unfavourable impacts on a data set:

o They increase the error variance and reduce the power of statistical tests.
o If the outliers are non-randomly distributed, they can decrease normality.
o They can bias or influence estimates that may be of substantive interest.
o They can also violate the basic assumptions of regression, ANOVA, and other statistical models.

To understand the impact more deeply, let's take an example to check what happens to a data set with and without outliers.
How to detect Outliers?
The most commonly used method to detect outliers is visualization. We use various visualization methods, such as box plots, histograms, and scatter plots.

How to remove Outliers?

Most of the ways to deal with outliers are similar to the methods for missing values: deleting observations, transforming them, binning them, treating them as a separate group, imputing values, and other statistical methods. Here, we discuss the common techniques used to deal with outliers:

Deleting observations: We delete outlier values if they are due to a data entry error or a data processing error, or if the outlier observations are very few in number. We can also use trimming at both ends to remove outliers.

Transforming and binning values: Transforming variables can also eliminate outliers. Taking the natural log of a value reduces the variation caused by extreme values. Binning is also a form of variable transformation. The decision tree algorithm deals with outliers well because it bins variables. We can also assign weights to different observations.
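Two of the techniques above, trimming at both ends and the natural log transformation, can be sketched as follows. The income figures and the choice of trimming two values per end are assumptions made for illustration.

```python
import math

# Hypothetical annual incomes in $ millions, with two extreme values.
incomes = [0.6, 0.7, 0.8, 0.8, 0.9, 1.0, 4.0, 4.2]

# Trimming at both ends: drop the k smallest and k largest observations.
k = 2
trimmed = sorted(incomes)[k:-k]
print(trimmed)            # [0.8, 0.8, 0.9, 1.0]

# Natural log transformation compresses the gap to the extreme values.
logged = [math.log(v) for v in incomes]
spread_raw = max(incomes) - min(incomes)
spread_log = max(logged) - min(logged)
print(round(spread_raw, 2), round(spread_log, 2))
```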
4. The Art of Feature Engineering

What is Feature Engineering?

Feature engineering is the science (and art) of extracting more information from existing data. You are not adding any new data here, but you are actually
making the data you already have more useful.

For example, let’s say you are trying to predict foot fall in a shopping mall based on dates. If you try and use the dates directly, you may not be able to extract
meaningful insights from the data. This is because the foot fall is less affected by the day of the month than it is by the day of the week. Now this information
about day of week is implicit in your data. You need to bring it out to make your model better.
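The day-of-week idea can be sketched with pandas. The dates and foot fall counts are hypothetical, and the column names day_of_week and is_weekend are illustrative.

```python
import pandas as pd

# Hypothetical daily foot fall for a shopping mall.
visits = pd.DataFrame({
    'date': ['2024-01-05', '2024-01-06', '2024-01-07', '2024-01-08'],
    'foot_fall': [1200, 2100, 2300, 900],
})

# Parse the text dates, then derive new features from them.
visits['date'] = pd.to_datetime(visits['date'])
visits['day_of_week'] = visits['date'].dt.day_name()
visits['is_weekend'] = visits['date'].dt.dayofweek >= 5   # Monday=0 ... Sunday=6

print(visits)
```

No new data was added; the two extra columns only make information already implicit in the dates available to a model.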

