Python
INSTALLATION OF ANACONDA NAVIGATOR
1. Go to the Anaconda website and choose a Python 3.x graphical installer (A).
String indexing
mystring = 'Hi How are you?'
mystring[0]
'H'
mystring[0] refers to the first letter, because indexing in Python starts from 0.
Similarly, mystring[1] refers to the second letter.
To pull the last letter, you can use -1 as the index.
Splitting a string
mystring = 'Hi How are you?'
mystring.split(' ')[0]
'Hi'
Note: there should be a space between the quotes in split(' ').
1. mystring.split(' ') tells Python to use a space as the delimiter.
2. mystring.split(' ')[0] picks the first word of the string.
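The indexing and splitting steps above can be run together as a short script. This is a minimal sketch that reuses `mystring` from the text; the other variable names are illustrative only.

```python
# String indexing and splitting, using the example string from the text.
mystring = 'Hi How are you?'

first_letter = mystring[0]     # indexing starts at 0
second_letter = mystring[1]
last_letter = mystring[-1]     # -1 counts from the end of the string

words = mystring.split(' ')    # split on a single space
first_word = words[0]

print(first_letter, second_letter, last_letter)
print(words)
print(first_word)
```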
Python Data Types
3. Python Data Type – List
Unlike a string, a list can contain different types of objects such as integers, floats, and strings. Formally, a list is an ordered sequence of data
written using square brackets ([]) and commas (,).
The '*' operator repeats a list N times.
X = [1, 2, 3]
Z = X * 3
print(Z)
Modify / replace a list item
X = [1, 2, 3]
X[2] = 5
print(X)
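The two list operations above can be checked in one script. A minimal sketch; the `mixed` list is a hypothetical example added to show that a list may hold several types.

```python
# A list may mix integers, floats, and strings (hypothetical example values).
mixed = [1, 2.5, 'three']

X = [1, 2, 3]
Z = X * 3      # repeat the list three times; Z is a new list
print(Z)

X[2] = 5       # lists are mutable: replace the third item in place
print(X)
```

Because `X * 3` builds a new list, changing `X` afterwards does not change `Z`.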
Another difference is that a tuple is created inside parentheses ( ), whereas a list is created inside square brackets [ ].
Examples:
a = (1, 2, 3, 4)
print(a)
(1, 2, 3, 4)
b = ("hello", 1, 2, 3, "go")
print(b)
('hello', 1, 2, 3, 'go')
A tuple cannot be modified after it is created:
a = (1, 2, 3, 4)
a[2] = 5
TypeError: 'tuple' object does not support item assignment
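The error above can be reproduced safely by catching it. A minimal sketch; the `try`/`except` wrapper and the `message` variable are additions for illustration.

```python
a = (1, 2, 3, 4)
b = ("hello", 1, 2, 3, "go")
print(a)
print(b)

# Tuples are immutable, so item assignment raises TypeError.
try:
    a[2] = 5
except TypeError as err:
    message = str(err)
    print("TypeError:", message)
```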
5. Dictionary
A dictionary works like an address book in which you can find a person's address by searching for the name. In this example, the name of a person is
the key and the address is the value. Keys must be unique, while values need not be. Keys should not be
duplicated because, with a duplicate key, you cannot find the exact value associated with it. Keys can be of any immutable data type, such as strings,
numbers, or tuples.
Create a dictionary
A dictionary is defined in curly braces {}. Each key is followed by a colon (:) and then its value.
teams.keys()
teams.values()
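A small sketch of creating a dictionary and reading its keys and values. The contents of `teams` are hypothetical, since the original definition is not shown in these notes.

```python
# Hypothetical dictionary: person name (key) -> team (value).
teams = {'Dave': 'team A', 'Tim': 'team B', 'Babita': 'team C'}

keys = list(teams.keys())       # all keys, in insertion order
values = list(teams.values())   # all values, in the same order
print(keys)
print(values)

# Look up a value by its key, like finding an address by name.
print(teams['Tim'])

# Keys are unique: assigning to an existing key replaces its value.
teams['Tim'] = 'team D'
print(teams)
```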
Console
In the bottom center is the console. Like in MATLAB, the console is where you can run commands to see what they do or when you want to
debug some code. Variables created in the console are not saved if you close Spyder and open it up again. The console is technically
running IPython by default.
History
Any commands that you type in the console will be logged into the history file in the bottom right pane of the window.
Variable Explorer
Furthermore, any variables that you create in the console will be shown in the variable explorer in the top right pane.
Importing and Exporting.
Importing Data
– Process of loading and reading data into Python from various resources.
– Two important properties:
– Format
Various formats: .csv, .json, .xlsx, ...
– File path of dataset
Computer: /Desktop/mydata.csv
people.csv
import csv

# Open people.csv for reading and print each row as a list of strings
with open('people.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
How to import data in Python.
1. Import CSV files
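It is important to note that a single backslash does not work reliably when specifying the file path, because Python treats a backslash in an ordinary string as the start of an escape sequence. You need to either change it to a forward slash or
add one more backslash, as shown below: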
import pandas as pd
mydata= pd.read_csv("C:\\Users\\likhita\\Documents\\file1.csv")
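The path spellings can be compared without reading any file. This sketch uses the path from the example above; the raw-string form (`r"..."`) is an additional option not mentioned in the text, and `PureWindowsPath` is used here only to compare paths.

```python
from pathlib import PureWindowsPath

# Three ways to write the same Windows path.
double_backslash = "C:\\Users\\likhita\\Documents\\file1.csv"
forward_slash = "C:/Users/likhita/Documents/file1.csv"
raw_string = r"C:\Users\likhita\Documents\file1.csv"  # raw string: backslashes are kept literally

# The doubled-backslash and raw-string forms are the same text.
same_text = double_backslash == raw_string

# On Windows, forward and backward slashes name the same location.
same_path = PureWindowsPath(forward_slash) == PureWindowsPath(double_backslash)

print(same_text, same_path)
```

Any of the three strings could be passed to `pd.read_csv` in place of the path shown above.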
Data Manipulation
– Melt
– Pivot
– Cross tab
– Cut
– Merge
– Concat
– Unique
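The operations named above map onto pandas functions. The sketch below is one possible illustration, assuming pandas is installed; all data frames and column names are hypothetical and invented for this example.

```python
import pandas as pd

# Hypothetical data, invented for this sketch.
wide = pd.DataFrame({
    'name': ['Ann', 'Bob'],
    'math': [80, 65],
    'science': [90, 70],
})

# Melt: wide -> long (one row per name/subject pair).
long_df = pd.melt(wide, id_vars=['name'], var_name='subject', value_name='score')

# Pivot: long -> wide again.
back = long_df.pivot(index='name', columns='subject', values='score')

people = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cid', 'Dee'],
    'sex': ['F', 'M', 'M', 'F'],
    'approved': ['Yes', 'No', 'Yes', 'Yes'],
    'age': [25, 41, 67, 33],
})

# Cross tab: counts for each sex/approved combination.
table = pd.crosstab(people['sex'], people['approved'])

# Cut: bin a continuous column into labelled intervals (right edge included).
people['age_group'] = pd.cut(people['age'], bins=[0, 30, 60, 100],
                             labels=['young', 'middle', 'senior'])

# Merge: join two tables on a shared key column.
merged = pd.merge(people, wide, on='name', how='inner')

# Concat: stack tables with the same columns on top of each other.
stacked = pd.concat([wide, wide], ignore_index=True)

# Unique: distinct values in order of first appearance.
sexes = people['sex'].unique()

print(long_df, back, table, merged, stacked, sexes, sep='\n\n')
```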
Descriptive Statistics
Data
Marital_status: Whether the applicant is married ("Yes") or not ("No").
Dependents: Number of dependents of the applicant.
Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No").
Income: Annual Income of the applicant (in USD).
Loan_amount: Loan amount (in USD) for which the application was submitted.
Term_months: Tenure of the loan (in months).
Credit_score: Whether the applicant's credit score was good ("Satisfactory") or not ("Not_satisfactory").
Age: The applicant’s age in years.
Sex: Whether the applicant is female (F) or male (M).
approval_status: Whether the loan application was approved ("Yes") or not ("No").
Data vs Information
Mean
Arithmetic Mean (called mean) is defined as the sum of all observations in a data set divided by the total number of
observations.
Mean Example
The inner diameters of a particular grade of tire, based on 5 sample measurements, are as follows (figures in millimeters):
Caution: The arithmetic mean is affected by extreme values and by fluctuations in sampling. It is not the best average to use when the data set contains extreme
values (very high or very low values).
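A short sketch of the mean and of the caution about extreme values. The five diameters are hypothetical, because the measurements from the original example are not reproduced in these notes; the extreme value 900 is likewise invented.

```python
from statistics import mean

# Hypothetical inner diameters in millimeters (not the original sample).
diameters = [565, 570, 572, 568, 585]

average = mean(diameters)                      # sum of observations / number of observations
by_hand = sum(diameters) / len(diameters)
print(average, by_hand)

# One extreme (hypothetical) measurement pulls the mean far from the typical value.
with_extreme = diameters + [900]
print(mean(with_extreme))
```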
Median
The median is the middlemost observation when you arrange the data in ascending order of magnitude. It is such that 50% of the
observations are above the median and 50% are below it.
The median is a very useful measure for ranked data in the context of consumer preferences and ratings. It is not affected by extreme values (greater resistance to
outliers).
Median example
The marks obtained by 7 students in a Computer Science exam are given below. Compute the median.
45 40 60 80 90 65 55
Arranged in order of magnitude (here from highest to lowest): 90 80 65 60 55 45 40
The middle (4th) observation is 60, so the median is 60.
Median example (Python)
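The original code for this slide is not reproduced here, so the following is a minimal sketch of one way to compute the median of the seven marks, using the standard library `statistics` module.

```python
from statistics import median

marks = [45, 40, 60, 80, 90, 65, 55]

# By hand: sort, then take the middle observation (odd number of values).
ordered = sorted(marks)
middle = ordered[len(ordered) // 2]

# With the standard library.
result = median(marks)

print(ordered)
print(middle, result)
```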
Mode
The mode is the value that occurs most often, that is, the value with the maximum frequency of occurrence. The mode is also resistant to outliers.
The mode is a very useful measure when, for example, you want to stock the most popular shirt collar size in the inventory during the festival season.
Caution: In some real-life problems there is more than one mode (bimodal or multi-modal data). In these cases the mode cannot be uniquely
determined.
Mode example
The lives, in hours, of 10 flashlight batteries are as follows. Find the mode.
340 350 340 340 320 340 330 330 340 350
The value 340 occurs 5 times, more often than any other value, so the mode is 340.
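The battery example can be checked with the standard library. This is a minimal sketch; the `bimodal` list is a hypothetical data set added to illustrate the caution about multiple modes (`multimode` requires Python 3.8 or later).

```python
from collections import Counter
from statistics import mode, multimode

battery_life = [340, 350, 340, 340, 320, 340, 330, 330, 340, 350]

counts = Counter(battery_life)       # frequency of each value
most_common = mode(battery_life)     # value with the highest frequency
print(counts)
print(most_common)

# Hypothetical bimodal data: two values tie for the highest frequency.
bimodal = [1, 1, 2, 2, 3]
modes = multimode(bimodal)           # returns every mode, not just one
print(modes)
```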
Measure of Dispersion
In simple terms, measures of dispersion indicate how widely the distribution is spread around the central tendency. They answer the question:
"What is the magnitude of departure from the average value for different groups having identical averages?"
Range
The range is the simplest of all measures of dispersion. It is calculated as the difference between the maximum and minimum values in the data set.
Range Example
The following data represent the percentage return on investment for 10 mutual funds per annum. Calculate the range.
Range = 18 - 9 = 9
Caution: If either component of the range, the maximum or the minimum value, is an extreme value, then the range should not be used.
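A sketch of the range calculation. The ten values of this example are not reproduced in these notes, so the sketch borrows the nine returns listed in the IQR example, which have the same maximum (18) and minimum (9); the extreme value 60 is hypothetical.

```python
# Nine mutual-fund returns taken from the IQR example in these notes.
returns = [12, 14, 11, 18, 10.5, 12, 14, 11, 9]

data_range = max(returns) - min(returns)
print(data_range)

# A single (hypothetical) extreme return inflates the range.
with_extreme = returns + [60]
inflated_range = max(with_extreme) - min(with_extreme)
print(inflated_range)
```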
IQR Example
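The following data represent the percentage return on investment for 9 mutual funds per annum. Calculate the interquartile range.
Data Set: 12, 14, 11, 18, 10.5, 12, 14, 11, 9
Sorted: 9, 10.5, 11, 11, 12, 12, 14, 14, 18. With the (n + 1) method, Q1 lies at position 2.5 (halfway between 10.5 and 11, giving 10.75) and Q3 at position 7.5 (between 14 and 14, giving 14).
IQR = Q3 - Q1 = 14 - 10.75 = 3.25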
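The IQR above can be reproduced with `statistics.quantiles`. A minimal sketch: the default `method='exclusive'` uses the (n + 1) positioning rule, which matches the quartiles in the example; other quartile conventions can give slightly different values.

```python
from statistics import quantiles

returns = [12, 14, 11, 18, 10.5, 12, 14, 11, 9]

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = quantiles(returns, n=4)
iqr = q3 - q1

print(q1, q2, q3)
print(iqr)
```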
Standard Deviation
Standard deviation forms the cornerstone for Inferential Statistics. To define standard deviation, you need to define another term
called variance.
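As a sketch of how the two terms relate, using the standard definitions: variance is the average squared deviation from the mean (dividing by n - 1 for a sample and by n for a population), and standard deviation is its square root. The data are the nine returns from the IQR example in these notes, reused here for illustration.

```python
from math import sqrt
from statistics import mean, pvariance, stdev, variance

returns = [12, 14, 11, 18, 10.5, 12, 14, 11, 9]

m = mean(returns)
squared_deviations = sum((x - m) ** 2 for x in returns)

sample_variance = squared_deviations / (len(returns) - 1)   # divide by n - 1
population_variance = squared_deviations / len(returns)     # divide by n
sample_sd = sqrt(sample_variance)                           # standard deviation

# The standard library functions give the same results.
print(sample_variance, variance(returns))
print(population_variance, pvariance(returns))
print(sample_sd, stdev(returns))
```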
Example
The following data represent the percentage return on investment for 10 mutual funds per annum.
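In symbolic form, with CV expressed as a percentage:
CV = \frac{s}{\bar{x}} \times 100 for the sample data and CV = \frac{\sigma}{\mu} \times 100 for the population data
Sales Person 2
Mean sales (one-year average): 75 units
Standard deviation: 25 units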
The moral of the story is "don't get carried away by absolute numbers". Look at the scatter. Even though Sales Person 2 has achieved a higher average, his
performance is not consistent and seems erratic.
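The comparison can be made concrete with the coefficient of variation, taken here as standard deviation divided by mean, times 100. In this sketch the figures for Sales Person 2 come from the text, while those for Sales Person 1 are hypothetical because they are not reproduced in these notes; the function name is also illustrative.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Relative spread: standard deviation as a percentage of the mean."""
    return std_dev / mean_value * 100

# Sales Person 2: mean 75 units, standard deviation 25 units (from the text).
cv_person_2 = coefficient_of_variation(25, 75)

# Sales Person 1: hypothetical figures (lower average, much smaller spread).
cv_person_1 = coefficient_of_variation(5, 50)

print(round(cv_person_1, 2), round(cv_person_2, 2))
# A lower CV means more consistent performance relative to the average.
```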
The Boxplot
The boxplot is a graphical display of the data based on the five-number summary:
Xsmallest, Q1, Median, Q3, Xlargest
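The five numbers behind a boxplot can be computed directly. A minimal sketch using the nine returns from the IQR example in these notes; the quartiles follow the default (n + 1) rule of `statistics.quantiles`, which is one of several conventions.

```python
from statistics import median, quantiles

returns = [12, 14, 11, 18, 10.5, 12, 14, 11, 9]

q1, _, q3 = quantiles(returns, n=4)

# (Xsmallest, Q1, Median, Q3, Xlargest)
five_number_summary = (min(returns), q1, median(returns), q3, max(returns))
print(five_number_summary)
```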
– Highly skewed distribution: the skewness value is less than −1 or greater than +1.
– Moderately skewed distribution: the skewness value is between −1 and −½ or between +½ and +1.
– Approximately symmetric distribution: the skewness value is between −½ and +½.
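The rules of thumb above can be sketched as a small classifier. Two assumptions are made here: skewness is computed as the moment coefficient m3 / m2^1.5 (other definitions exist), and values exactly at ±½ or ±1 are assigned to the "moderately skewed" band, since the text does not say which side the boundaries belong to. Function names and data are illustrative.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (assumed definition)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

def classify_skewness(value):
    """Apply the three bands from the text (boundary handling is an assumption)."""
    if abs(value) > 1:
        return 'highly skewed'
    if abs(value) >= 0.5:
        return 'moderately skewed'
    return 'approximately symmetric'

symmetric = [1, 2, 3, 4, 5]      # hypothetical data
right_tail = [1, 1, 1, 1, 10]    # hypothetical data with one large value

for data in (symmetric, right_tail):
    s = skewness(data)
    print(round(s, 2), classify_skewness(s))
```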
Data Exploration
1. Variable Identification
First, identify the predictor (input) and target (output) variables. Next, identify the data type and category of each variable.
Example: Suppose we want to predict whether the students will play cricket or not (refer to the data set below). Here you need to identify the predictor variables, the target
variable, the data type of each variable, and the category of each variable.
Univariate Analysis
At this stage, we explore variables one by one. The method used for univariate analysis depends on whether the variable is categorical or continuous.
Let's look at the methods and statistical measures for categorical and continuous variables individually:
Continuous Variables: For continuous variables, we need to understand the central tendency and spread of the variable. These are measured using various
statistical metrics and visualization methods, as shown below:
Note: Univariate analysis is also used to highlight missing and outlier values. In the upcoming part of this series, we will look at methods to handle missing and
outlier values.
Categorical Variables: For categorical variables, we use a frequency table to understand the distribution of each category. We can also read it as the percentage of values
under each category. It can be measured using two metrics, Count and Count%, against each category. A bar chart can be used for visualization.
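A frequency table with Count and Count% can be built with the standard library. A minimal sketch; the `marital_status` values are hypothetical and only borrow the column name from the loan data set described earlier.

```python
from collections import Counter

# Hypothetical categorical observations.
marital_status = ['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'No']

counts = Counter(marital_status)
total = len(marital_status)

# Category -> (Count, Count%)
frequency_table = {
    category: (count, 100 * count / total)
    for category, count in counts.items()
}

for category, (count, percent) in frequency_table.items():
    print(f"{category}: count={count}, count%={percent:.1f}")
```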
Bi-variate Analysis
Bi-variate analysis finds out the relationship between two variables. Here, we look for association and disassociation between variables at a pre-defined
significance level. We can perform bi-variate analysis for any combination of categorical and continuous variables. The combinations can be Categorical &
Categorical, Categorical & Continuous, and Continuous & Continuous. Different methods are used to tackle these combinations during the analysis process.
Continuous & Continuous: When doing bi-variate analysis between two continuous variables, we should look at a scatter plot. It is a handy way to find out the
relationship between two variables. The pattern of the scatter plot indicates the relationship between the variables, which can be linear or non-linear.
A scatter plot shows the relationship between two variables but does not indicate the strength of the relationship between them. To find the strength of the
relationship, we use correlation. Correlation varies between -1 and +1: -1 is a perfect negative linear correlation, +1 is a perfect positive linear correlation, and 0 means no linear correlation.
Various tools have functions to identify the correlation between variables. In Excel, the function CORREL() returns the correlation between
two variables, and SAS uses the procedure PROC CORR. These functions return the Pearson correlation value to identify the relationship
between two variables:
In the above example, we have a good positive relationship (0.65) between the two variables X and Y.
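The Pearson correlation can also be computed by hand in Python. A minimal sketch with a hypothetical function name and hypothetical data; it implements the usual formula (covariance divided by the product of the standard deviations).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Hypothetical data: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 60, 70]

r = pearson(hours, scores)
print(round(r, 3))
```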
Categorical & Categorical: To find the relationship between two categorical variables, we can use the following methods:
Two-way table: We can start analyzing the relationship by creating a two-way table of count and count%. The rows represent the categories of one variable and
the columns represent the categories of the other variable. We show the count or count% of observations available in each combination of row and column
categories.
Stacked Column Chart: This method is more of a visual form of the two-way table.
Chi-Square Test: This test is used to derive the statistical significance of the relationship between the variables. It also tests whether the evidence in the sample is
strong enough to generalize the relationship to a larger population. Chi-square is based on the difference between the expected and observed
frequencies in one or more categories of the two-way table. It returns the probability for the computed chi-square statistic with the given degrees of freedom.
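The chi-square statistic can be sketched from a two-way table of counts. The observed counts below are hypothetical; the sketch computes only the statistic and the degrees of freedom, and turning these into a probability would additionally require the chi-square distribution from a statistics library.

```python
# Hypothetical two-way table of observed counts
# (rows: categories of one variable, columns: categories of the other).
observed = [
    [10, 20],
    [30, 40],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Sum of (observed - expected)^2 / expected over every cell.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

degrees_of_freedom = (len(row_totals) - 1) * (len(col_totals) - 1)

print(expected)
print(round(chi_square, 4), degrees_of_freedom)
```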
Categorical & Continuous: While exploring the relationship between categorical and continuous variables, we can draw box plots for each level of the categorical
variable. If the levels are small in number, the plots will not show the statistical significance. To look at the statistical significance we can perform a Z-test, T-test or
ANOVA.
Z-Test / T-Test: Either test assesses whether the means of two groups are statistically different from each other.
If the probability of Z is small, then the difference between the two averages is more significant.
The T-test is very similar to the Z-test, but it is used when the number of observations in both categories is less than 30.
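The formula from the original slide is not reproduced in these notes, so the sketch below assumes the common two-sample form: the difference of the group means divided by sqrt(s1²/n1 + s2²/n2), with a two-sided probability from the standard normal distribution. The function name and data are hypothetical, and `NormalDist` requires Python 3.8 or later.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def two_sample_z(group_a, group_b):
    """Return (z, two-sided p-value) for the difference of two group means."""
    standard_error = sqrt(variance(group_a) / len(group_a)
                          + variance(group_b) / len(group_b))
    z = (mean(group_a) - mean(group_b)) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical groups with 30 observations each.
group_a = list(range(30))
group_b = [x + 20 for x in group_a]

z, p_value = two_sample_z(group_a, group_b)
print(round(z, 2), p_value)   # a small p-value suggests the means differ
```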
ANOVA: It assesses whether the averages of more than two groups are statistically different.
Example: Suppose we want to test the effect of five different exercises. For this, we recruit 20 men and assign one type of exercise to each group of 4 men (5 groups). Their
weights are recorded after a few weeks. We need to find out whether the effect of these exercises on them is significantly different or not. This can be done by
comparing the weights of the 5 groups of 4 men each.
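The comparison described above can be sketched as a one-way ANOVA F statistic: the variation between group means divided by the variation within groups. The weights are hypothetical and the function name is illustrative; converting F into a probability would additionally need the F distribution from a statistics library.

```python
def one_way_anova_f(groups):
    """F statistic: mean square between groups / mean square within groups."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    ss_between = sum(len(group) * (m - grand_mean) ** 2
                     for group, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2
                    for group, m in zip(groups, group_means)
                    for v in group)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical weights (kg): 5 exercise groups of 4 men each.
weights = [
    [70, 72, 68, 71],
    [75, 77, 74, 76],
    [69, 70, 71, 68],
    [80, 82, 79, 81],
    [73, 72, 74, 75],
]

f_statistic = one_way_anova_f(weights)
print(round(f_statistic, 2))   # a large F suggests the group means differ
```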
So far, we have covered the first three stages of Data Exploration: Variable Identification, Uni-variate Analysis, and Bi-variate Analysis. We also looked at various
statistical and visual methods to identify the relationship between variables.
Now, we will look at the methods of Missing Value Treatment. More importantly, we will also look at why missing values occur in our data and why treating
them is necessary.
2. Missing Value Treatment
Why is missing value treatment required?
Missing data in the training data set can reduce the power or fit of a model, or can lead to a biased model, because we have not analyzed the behavior of the variable and its
relationship with other variables correctly. It can lead to wrong predictions or classifications.
Notice the missing values in the image shown above: In the left scenario, we have not treated missing values. The inference from this data set is that the chances
of playing cricket are higher for males than for females. On the other hand, if you look at the second table, which shows the data after treatment of missing values (based
on gender), we can see that females have a higher chance of playing cricket than males.
Methods to treat missing values
Deletion: It is of two types: List Wise Deletion and Pair Wise Deletion
Mean / Mode / Median Imputation:
– Generalized Imputation: In this case, we calculate the mean or median of all non-missing values of the variable and then replace each missing value with that mean or
median. In the above table, the variable "Manpower" has missing values, so we take the average of all non-missing values of "Manpower" (28.33) and replace the missing
values with it.
– Similar Case Imputation: In this case, we calculate the average of the non-missing values separately for gender "Male" (29.75) and "Female" (25), and then replace each
missing value based on gender. For "Male", we replace missing values of "Manpower" with 29.75, and for "Female" with 25.
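Both imputation styles can be sketched with plain Python. The records below are hypothetical (the table from the original slide is not reproduced here, so the resulting means differ from the 28.33, 29.75 and 25 quoted above); `None` marks a missing value.

```python
from statistics import mean

# Hypothetical records; None marks a missing "manpower" value.
records = [
    {'gender': 'Male', 'manpower': 30},
    {'gender': 'Male', 'manpower': None},
    {'gender': 'Female', 'manpower': 24},
    {'gender': 'Male', 'manpower': 28},
    {'gender': 'Female', 'manpower': None},
    {'gender': 'Female', 'manpower': 26},
]

# Generalized imputation: one mean over all non-missing values.
overall_mean = mean(r['manpower'] for r in records if r['manpower'] is not None)
generalized = [
    r['manpower'] if r['manpower'] is not None else overall_mean
    for r in records
]

# Similar case imputation: a separate mean for each gender.
group_means = {
    gender: mean(r['manpower'] for r in records
                 if r['gender'] == gender and r['manpower'] is not None)
    for gender in {r['gender'] for r in records}
}
similar_case = [
    r['manpower'] if r['manpower'] is not None else group_means[r['gender']]
    for r in records
]

print(generalized)
print(similar_case)
```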
3. Techniques of Outlier Detection and Treatment
What is an Outlier?
Outlier is a term commonly used by analysts and data scientists, because outliers need close attention; otherwise they can result in wildly wrong estimations. Simply
speaking, an outlier is an observation that appears far away from, and diverges from, the overall pattern in a sample.
Let's take an example: we do customer profiling and find that the average annual income of customers is $0.8 million. But there are two customers with
annual incomes of $4 million and $4.2 million. These two customers' annual incomes are much higher than those of the rest of the population. These two observations will be seen as
outliers.
What is the impact of Outliers on a dataset?
Outliers can drastically change the results of data analysis and statistical modeling. They have numerous unfavourable impacts on a data set:
– They increase the error variance and reduce the power of statistical tests.
– If the outliers are non-randomly distributed, they can decrease normality.
– They can bias or influence estimates that may be of substantive interest.
– They can also violate the basic assumptions of regression, ANOVA, and other statistical models.
To understand the impact more deeply, let's take an example and check what happens to a data set with and without outliers.
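A sketch of such a comparison, using a hypothetical data set to which a single extreme value (300) is added. It shows how the mean and standard deviation move sharply while the median and mode barely change.

```python
from statistics import mean, median, mode, stdev

# Hypothetical data set, with and without one extreme observation.
without_outlier = [4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7]
with_outlier = without_outlier + [300]

for label, data in (('without', without_outlier), ('with', with_outlier)):
    print(label,
          'mean =', round(mean(data), 2),
          'median =', median(data),
          'mode =', mode(data),
          'sd =', round(stdev(data), 2))
```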
How to detect Outliers?
The most commonly used method to detect outliers is visualization. We use various visualization methods, such as box plots, histograms, and scatter plots (above, we have
used a box plot and a scatter plot for visualization).
Deleting observations: We delete outlier values if they are due to a data entry error or a data processing error, or if the outlier observations are very few in number. We can
also use trimming at both ends to remove outliers.
Transforming and binning values: Transforming variables can also eliminate outliers. Taking the natural log of a value reduces the variation caused by extreme values.
Binning is also a form of variable transformation. The Decision Tree algorithm deals with outliers well because it bins variables. We can also use the
process of assigning weights to different observations.
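Detection, deletion and a log transform can be sketched numerically. The detection step here assumes the common box-plot convention of flagging values more than 1.5 × IQR beyond the quartiles; that threshold is an assumption and is not stated in the text. The data are hypothetical.

```python
from math import log
from statistics import quantiles

# Hypothetical data with one extreme value.
values = [4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 300]

# Detection: box-plot style fences (1.5 * IQR is an assumed convention).
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

# Treatment 1: delete (trim) the flagged observations.
trimmed = [v for v in values if lower_fence <= v <= upper_fence]

# Treatment 2: natural log transform, which compresses extreme values.
logged = [log(v) for v in values]

print(outliers)
print(trimmed)
print(round(max(logged) - min(logged), 2), 'vs', max(values) - min(values))
```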
4. The Art of Feature Engineering
For example, let's say you are trying to predict footfall in a shopping mall based on dates. If you try to use the dates directly, you may not be able to extract
meaningful insights from the data. This is because footfall is less affected by the day of the month than by the day of the week. This information
about the day of the week is implicit in your data. You need to bring it out to make your model better.
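Bringing out the day of the week can be sketched with the standard library. The dates and the feature names (`day_of_week`, `is_weekend`) are hypothetical choices for illustration.

```python
from datetime import date

DAY_NAMES = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

# Hypothetical visit dates.
visit_dates = [date(2024, 1, 1), date(2024, 1, 6), date(2024, 1, 7)]

# Derive explicit features from each raw date.
features = [
    {
        'date': d.isoformat(),
        'day_of_month': d.day,
        'day_of_week': DAY_NAMES[d.weekday()],   # weekday(): Monday is 0
        'is_weekend': d.weekday() >= 5,
    }
    for d in visit_dates
]

for row in features:
    print(row)
```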