
Data Mining and Predictive Modelling Assignment

*For practice purpose

Uploaded by

rkumar25022000
Copyright
© All Rights Reserved

DMDW Lab using PYTHON

5th Semester
Department of Computer Science and
Engineering
GIET University, Gunupur
ASSIGNMENT 1
MEASURES OF CENTRAL TENDENCY
A measure of central tendency describes a set of data by
identifying the central position around which the other
values are clustered.
The three common measures of central tendency are:
1. Mean
2. Median
3. Mode
MEAN

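The mean (average) is the sum of all values divided by the number of values.
Because it uses every value, it is pulled towards outliers, such as the two
large salaries in the data below.

Staff   1    2    3    4    5    6    7    8    9    10
Salary  15k  18k  16k  14k  15k  15k  12k  17k  90k  95k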
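As a quick check on the salary data above, the following minimal sketch computes the mean and, for comparison, the median. The variable names are illustrative and not part of the original slides; salaries are entered in thousands.

```python
from statistics import median

# Salaries of the ten staff members, in thousands (taken from the table above).
salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

# Mean: sum of all values divided by the number of values.
mean_salary = sum(salaries) / len(salaries)

# Median: middle of the sorted data (average of the two middle values here).
median_salary = median(salaries)

print(f"Mean salary:   {mean_salary}k")
print(f"Median salary: {median_salary}k")
```

Eight of the ten salaries are 18k or less, yet the mean is 30.7k because of the two large values, while the median stays at 15.5k.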
MEDIAN
The median is the middle value of a data set that has been
arranged in order of magnitude.
The median is less affected by outliers and skewed data than the mean.
To calculate the median, suppose we have the data below:
Ex-1) 65 55 89 56 35 14 56 55 87 45 92
We first rearrange the data in order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89 92
The median is the middle value, in this case 56 (the 6th of the 11 values).
Ex-2) 65 55 89 56 35 14 56 55 87 45

We again rearrange the data in order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89
With an even number of values there is no single middle value, so we take
the 5th and 6th values (55 and 56) and average them to get a median of 55.5.
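The two worked examples can be checked with the standard `statistics` module. This is only a verification sketch; the names `ex1` and `ex2` are illustrative.

```python
from statistics import median

ex1 = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]  # 11 values (odd count)
ex2 = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]      # 10 values (even count)

# Show the sorted data next to the median of each example.
print(sorted(ex1), "->", median(ex1))
print(sorted(ex2), "->", median(ex2))
```

The first call returns 56 and the second returns 55.5, matching the hand calculation.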
MODE
The mode is the most frequent value in our data set. On a bar chart or
histogram it corresponds to the highest bar, as in Fig-1.

Fig-1 Fig-2 Fig-3

Normally, the mode is used for categorical data where we wish to know
which category is the most common, as illustrated in Fig-2.
However, one problem with the mode is that it is not always unique: two or
more values may share the highest frequency, as in Fig-3.
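The two situations described above (a single most common category, and several values tied for the highest frequency) can be sketched with `statistics.multimode`, available in Python 3.8 and later. The data sets here are made up for illustration and do not come from the figures.

```python
from statistics import multimode

# Hypothetical categorical data with one clear mode.
colours = ["red", "blue", "red", "green", "blue", "red"]

# Hypothetical numeric data in which two values share the highest frequency.
scores = [1, 2, 2, 3, 3, 4]

print(multimode(colours))  # list holding the single most common category
print(multimode(scores))   # list holding every value tied for the top count
```

`multimode` always returns a list, so a tie shows up as a list with more than one element instead of an arbitrary single value.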
SKEWED DISTRIBUTIONS
An example of a normally distributed set of data is presented
below.

• In a symmetrical distribution with a single peak, such as the normal
distribution, the mean, median and mode are equal.
• For such data the mean is widely preferred as the measure of central
tendency because its calculation includes every value in the data set.
SKEWED DISTRIBUTIONS (CONTD.)
However, when our data is skewed, for example, as with the right-skewed
data set below:

• The median is generally considered to be the best representative of the
central location of the data.
• The more skewed the distribution, the greater the difference between
the median and the mean.
• The greater the skew, the more emphasis should be placed on the median
as opposed to the mean.
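One way to put a number on the gap between mean and median is Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation. This coefficient is not mentioned in the slides; it is added here only as an illustration, the data sets are made up, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, median, pstdev

def pearson_skewness(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std dev."""
    return 3 * (mean(values) - median(values)) / pstdev(values)

symmetric = [1, 2, 3, 4, 5]                  # mean equals median
right_skewed = [2, 3, 3, 4, 4, 4, 5, 5, 30]  # one large value drags the mean up

for name, data in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    print(name,
          "mean =", round(mean(data), 2),
          "median =", median(data),
          "skewness =", round(pearson_skewness(data), 2))
```

A value of zero means the mean and median coincide; a positive value indicates a right skew, where the mean lies above the median.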
SUMMARY OF WHEN TO USE THE MEAN, MEDIAN AND
MODE

The following table summarises the best measure of central tendency
for each type of variable.

Type of Variable               Best measure of central tendency
Nominal                        Mode
Ordinal                        Median
Interval/Ratio (not skewed)    Mean
Interval/Ratio (skewed)        Median
VARIANCE AND STANDARD DEVIATION
EXAMPLE
The ages of you and your friends are 25, 26, 27, 30, and 32.
First, we must find the mean age: (25 + 26 + 27 + 30 + 32) / 5 =
28.
Then, we need to calculate the differences from the mean for
each of the 5 friends.
25 – 28 = -3
26 – 28 = -2
27 – 28 = -1
30 – 28 = 2
32 – 28 = 4
Next, to calculate the variance, we square each difference from
the mean and then average the results.
Variance = ((-3)² + (-2)² + (-1)² + 2² + 4²) / 5
= (9 + 4 + 1 + 4 + 16) / 5 = 34 / 5 = 6.8
The variance is 6.8. Because we divide by the number of values (5), this is
the population variance. The standard deviation is the square root of the
variance: √6.8 ≈ 2.61.
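The worked example above can be reproduced without any library function. This is a minimal sketch that follows the same steps; the variable names are illustrative.

```python
# Ages from the worked example.
ages = [25, 26, 27, 30, 32]
n = len(ages)

# Step 1: mean.
mean_age = sum(ages) / n

# Step 2: difference of each value from the mean.
deviations = [x - mean_age for x in ages]

# Step 3: population variance = average of the squared differences.
variance = sum(d ** 2 for d in deviations) / n

# Step 4: standard deviation = square root of the variance.
std_dev = variance ** 0.5

print("Mean:", mean_age)
print("Differences:", deviations)
print("Variance:", variance)
print("Standard deviation:", round(std_dev, 2))
```

Dividing by `n - 1` instead of `n` would give the sample variance (8.5 for this data), which is what some library functions compute by default.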
PRACTICE-1
Write Python code for the following statistical
operations, with and without library functions:
✔ Mean
✔ Median
✔ Mode
✔ Standard Deviation
✔ Variance
MEAN WITHOUT LIBRARY FUNCTION

# Mean without using library


n_num = [1, 2, 3, 4, 5]
n = len(n_num)
get_sum = sum(n_num)
mean = get_sum / n
print("Mean / Average is: " + str(mean))
MEDIAN WITHOUT LIBRARY FUNCTION
# Median without using library
n_num = [1, 2, 3, 4, 5]
n = len(n_num)
n_num.sort()
if n % 2 == 0:
    # Even count: average the two middle values
    median1 = n_num[n//2]
    median2 = n_num[n//2 - 1]
    median = (median1 + median2)/2
else:
    # Odd count: take the single middle value
    median = n_num[n//2]
print("Median is: " + str(median))
MODE WITHOUT LIBRARY FUNCTION
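# Python program to print mode of elements
# Counter is part of the standard library; no third-party package is needed
from collections import Counter
n_num = [1, 2, 3, 4, 5, 5]
n = len(n_num)
data = Counter(n_num)
# Highest frequency found in the data
max_count = max(data.values())
# Every value that occurs max_count times is a mode
mode = [k for k, v in data.items() if v == max_count]
if len(mode) == n:
    # Every value occurs exactly once, so there is no mode
    get_mode = "No mode found"
else:
    get_mode = "Mode is / are: " + ', '.join(map(str, mode))
print(get_mode)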
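MEAN, MEDIAN, STANDARD DEVIATION AND VARIANCE WITH LIBRARY FUNCTION
import numpy
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
x = numpy.mean(speed)    # arithmetic mean
y = numpy.median(speed)  # middle value of the sorted data
s = numpy.std(speed)     # population standard deviation (ddof=0 by default)
v = numpy.var(speed)     # population variance (ddof=0 by default)
print(x)
print(y)
print(s)
print(v)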
MODE WITH LIBRARY FUNCTION
from scipy import stats
speed = [99,86,87,88,111,86,103,87,94,78,77,85,86]
# stats.mode returns the most frequent value together with its count
x = stats.mode(speed)
print(x)
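If NumPy and SciPy are not installed, the same five measures can be obtained from Python's standard `statistics` module. This is an alternative sketch and not part of the original slides; `pvariance` and `pstdev` are chosen on the assumption that the population versions are wanted, which matches NumPy's default behaviour.

```python
import statistics

speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

mean_speed = statistics.mean(speed)
median_speed = statistics.median(speed)
mode_speed = statistics.mode(speed)
var_speed = statistics.pvariance(speed)  # population variance (divides by n)
std_speed = statistics.pstdev(speed)     # population standard deviation

print("Mean:", mean_speed)
print("Median:", median_speed)
print("Mode:", mode_speed)
print("Variance:", var_speed)
print("Standard deviation:", std_speed)
```

For the sample versions, which divide by n - 1, `statistics.variance` and `statistics.stdev` can be used instead.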
ANACONDA PLATFORM
Anaconda Individual Edition is the world's most
popular Python distribution platform with over 20 million users
worldwide.

• Anaconda Navigator is a desktop graphical user interface (GUI) included in
the Anaconda distribution that allows users to launch applications and manage
conda packages, environments and channels without using command-line
commands.
Anaconda Installation
Steps
BENEFITS OF USING PYTHON ANACONDA
• It is free and open-source.
• It provides more than 1500 Python/R data science packages.
• It creates an environment that is easily manageable for
deploying any project.
• Libraries, dependencies and environments are managed with
conda.
• ML and deep learning models can be built and trained with
scikit-learn, TensorFlow and Theano.
• Data can be analysed scalably and fast with Dask, NumPy,
Pandas and Numba.
• Visualizations can be produced with Matplotlib, Bokeh,
Datashader and Holoviews.
THE JUPYTER NOTEBOOK
The Jupyter Notebook is an open-source web application that
allows you to create and share documents that contain live code,
equations, visualizations and narrative text.
Uses include: data cleaning and transformation, numerical
simulation, statistical modeling, data visualization, machine
learning, and much more.
INTRODUCTION TO GOOGLE-COLAB
Colaboratory, or 'Colab' for short, allows you to write and execute Python in
your browser, with:
✔ Zero configuration required
✔ Free access to GPUs
✔ Easy sharing
Advantages
It runs the same code that a Jupyter Notebook runs, using Python 3
(Colab has dropped support for Python 2).
It is the Google Docs of code. A notebook can be shared and edited in
real time by different team members, who can add comments, see the edit history
and go back to previous versions, as in Google Docs.
No more Anaconda. It is all cloud-based and does not require any local setup
or installation. If a library that you want to use is not available on Colab,
install it with pip as usual; it is installed in the notebook's virtual environment.
Personalization. Add your own shortcuts, night/light/adaptive mode, and fonts.
Playground mode. With two clicks you can open a new notebook that won't be
saved, and try different code options without affecting your original code.
ASSIGNMENT-1 QUESTION
1. Write Python code for finding the mean, median
and mode, with and without using library
functions.
2. Write Python code for calculating the variance and
standard deviation of a set of elements, with
and without using library functions.
3. Practice some basic Python programs with List,
Tuple, Dictionary and String.
