Ss Project With Python

This document is a mid-term examination report for a student named Bhola Kamble enrolled in the third semester of the BCA (DS) program at Ajeenkya D Y Patil University, Pune for the academic year 2022-23. It discusses statistical concepts like data representation and manipulation using Python libraries like pandas and scipy.stats. It provides examples of reading data from CSV files into DataFrames, selecting and grouping data, plotting scatter plots, and performing hypothesis tests like the student's t-test.

Uploaded by

Bhola Kamble

School of Engineering

Ajeenkya D Y Patil University, Pune

MID-TERM EXAMINATION
REPORT

STUDENT NAME: BHOLA KAMBLE

URN: 2021-B-01082003A
COURSE NAME: STATISTICAL SCIENCE

SEMESTER: III
PROGRAM & SPECIALIZATION: BCA (DS)
ACADEMIC YEAR: 2022-23
Statistics in Python

Data representation and interaction

Data as a table
The setting that we consider for statistical analysis is that of multiple observations or samples described by a set
of different attributes or features. The data can then be seen as a 2D table, or matrix, with columns giving the
different attributes of the data, and rows the observations. For instance, the data contained in
examples/brain_size.csv:

"";"Gender";"FSIQ";"VIQ";"PIQ";"Weight";"Height";"MRI_Count"
"1";"Female";133;132;124;"118";"64.5";816932
"2";"Male";140;150;124;".";"72.5";1001121
"3";"Male";139;123;150;"143";"73.3";1038437
"4";"Male";133;129;128;"172";"68.8";965353
"5";"Female";137;132;134;"147";"65.0";951545

The pandas data-frame

We will store and manipulate this data in a pandas.DataFrame, from the pandas module. It is the Python equivalent
of the spreadsheet table. It differs from a 2D numpy array in that it has named columns, can hold a different
data type in each column, and has elaborate selection and pivoting mechanisms.
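
A minimal sketch of these properties, assuming a tiny hand-built table (the values are the first three rows of the sample file; the names frame and males are illustrative and not part of the report):

```python
import pandas

# Three observations with one text column and two numeric columns.
frame = pandas.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "FSIQ": [133, 140, 139],
    "Height": [64.5, 72.5, 73.3],
})

# Each named column keeps its own data type.
print(frame.dtypes)

# Columns are addressed by name, and rows can be selected with a condition.
males = frame[frame["Gender"] == "Male"]
print(males[["FSIQ", "Height"]].mean())
```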

Creating dataframes: reading data files or converting arrays

Separator

It is a CSV file, but the separator is “;”

Reading from a CSV file: Using the above CSV file that gives observations of brain size and weight and IQ
(Willerman et al. 1991), the data are a mixture of numerical and categorical values:

>>> import pandas
>>> data = pandas.read_csv('examples/brain_size.csv', sep=';', na_values=".")
>>> data
   Unnamed: 0  Gender  FSIQ  VIQ  PIQ  Weight  Height  MRI_Count
0           1  Female   133  132  124   118.0    64.5     816932
1           2    Male   140  150  124     NaN    72.5    1001121
2           3    Male   139  123  150   143.0    73.3    1038437
3           4    Male   133  129  128   172.0    68.8     965353
4           5  Female   137  132  134   147.0    65.0     951545
...

Missing values

The weight of the second individual is missing in the CSV file. If we don’t specify the missing value (NA = not
available) marker, we will not be able to do statistical analysis.
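
The effect of the marker can be sketched as follows. The first rows of the file are inlined through io.StringIO so that the snippet runs without the CSV on disk; the names csv_text, raw and clean are illustrative assumptions:

```python
import io
import pandas

# First rows of the sample file, inlined so the sketch needs no file on disk.
csv_text = '''"";"Gender";"FSIQ";"VIQ";"PIQ";"Weight";"Height";"MRI_Count"
"1";"Female";133;132;124;"118";"64.5";816932
"2";"Male";140;150;124;".";"72.5";1001121
"3";"Male";139;123;150;"143";"73.3";1038437
'''

# Without na_values, the "." keeps the whole Weight column as text.
raw = pandas.read_csv(io.StringIO(csv_text), sep=";")
print(raw["Weight"].tolist())

# With na_values=".", the marker becomes NaN and the column is numeric.
clean = pandas.read_csv(io.StringIO(csv_text), sep=";", na_values=".")
print(clean["Weight"].dtype)
print(clean["Weight"].mean())  # NaN is skipped by default
```

In the first case the column cannot be averaged as numbers; in the second, pandas skips the missing entry when computing the mean.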

Creating from arrays: A pandas.DataFrame can also be seen as a dictionary of 1D ‘series’, e.g. arrays or lists. If
we have 3 numpy arrays:

>>> import numpy as np
>>> t = np.linspace(-6, 6, 20)
>>> sin_t = np.sin(t)
>>> cos_t = np.cos(t)

We can expose them as a pandas.DataFrame:

>>> pandas.DataFrame({'t': t, 'sin': sin_t, 'cos': cos_t})
          t       sin       cos
0 -6.000000  0.279415  0.960170
1 -5.368421  0.792419  0.609977
2 -4.736842  0.999701  0.024451
3 -4.105263  0.821291 -0.570509
4 -3.473684  0.326021 -0.945363
5 -2.842105 -0.295030 -0.955488
6 -2.210526 -0.802257 -0.596979
7 -1.578947 -0.999967 -0.008151
8 -0.947368 -0.811882  0.583822
...

Other inputs: pandas can also read data from SQL databases, Excel files, and other formats. See the pandas documentation.
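
As one hedged illustration of a non-CSV input, the sketch below loads a table from an in-memory SQLite database with pandas.read_sql_query. The table name subjects and its two columns are made up for the example and are not part of the brain-size data set:

```python
import sqlite3
import pandas

# Build a small throwaway database in memory (hypothetical table and values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subjects (gender TEXT, viq INTEGER)")
conn.executemany(
    "INSERT INTO subjects VALUES (?, ?)",
    [("Female", 132), ("Male", 150), ("Male", 123)],
)
conn.commit()

# The query result comes back as a regular DataFrame.
from_sql = pandas.read_sql_query("SELECT gender, viq FROM subjects", conn)
conn.close()

print(from_sql)
```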

Manipulating data
data is a pandas.DataFrame, that resembles R’s dataframe:

>>> data.shape    # 40 rows and 8 columns
(40, 8)
>>> data.columns  # It has columns
Index([u'Unnamed: 0', u'Gender', u'FSIQ', u'VIQ', u'PIQ', u'Weight', u'Height',
       u'MRI_Count'], dtype='object')
>>> print(data['Gender'])  # Columns can be addressed by name
0    Female
1      Male
2      Male
3      Male
4    Female
...
>>> # Simpler selector
>>> data[data['Gender'] == 'Female']['VIQ'].mean()
109.45

Note

For a quick view on a large dataframe, use its describe method: pandas.DataFrame.describe().
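
A short sketch of describe on a small table, assuming only the five VIQ and Height values listed earlier (the names sample and summary are illustrative):

```python
import pandas

# VIQ and Height for the five subjects shown at the top of the file.
sample = pandas.DataFrame({
    "VIQ": [132, 150, 123, 129, 132],
    "Height": [64.5, 72.5, 73.3, 68.8, 65.0],
})

# describe() returns count, mean, std, min, quartiles and max per numeric column.
summary = sample.describe()
print(summary)

# The result is itself a DataFrame, so single statistics can be looked up.
print(summary.loc["mean", "VIQ"])
```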

groupby: splitting a dataframe on values of categorical variables:

>>> groupby_gender = data.groupby('Gender')
>>> for gender, value in groupby_gender['VIQ']:
...     print((gender, value.mean()))
('Female', 109.45)
('Male', 115.25)

groupby_gender is a powerful object that exposes many operations on the resulting group of dataframes:

>>> groupby_gender.mean()
        Unnamed: 0   FSIQ     VIQ     PIQ      Weight     Height  MRI_Count
Gender
Female       19.65  111.9  109.45  110.45  137.200000  65.765000   862654.6
Male         21.35  115.0  115.25  111.60  166.444444  71.431579   954855.4

Use tab-completion on groupby_gender to find more. Other common grouping functions are median, count (useful for
checking the number of missing values in different subsets) and sum. Groupby evaluation is lazy: no work is done
until an aggregation function is applied.
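
The use of count to spot missing values can be sketched on the five sample rows, assuming they are typed in by hand with the missing weight entered as NaN (the names sample, grouped, counts and medians are illustrative):

```python
import numpy as np
import pandas

# The five sample rows, with the missing weight of subject 2 as NaN.
sample = pandas.DataFrame({
    "Gender": ["Female", "Male", "Male", "Male", "Female"],
    "VIQ": [132, 150, 123, 129, 132],
    "Weight": [118.0, np.nan, 143.0, 172.0, 147.0],
})

# Creating the groupby object does no computation yet.
grouped = sample.groupby("Gender")

# count() ignores NaN, so a lower count in Weight reveals a missing value.
counts = grouped.count()
print(counts)

# median() is applied per group and per column.
medians = grouped.median()
print(medians)
```

Here the Male group has three VIQ values but only two Weight values, which points at the missing entry.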

Exercise

• What is the mean value for VIQ for the full population?

• How many males/females were included in this study?
  Hint: use ‘tab completion’ to find out the methods that can be called, instead of ‘mean’ in the above
  example.

• What is the average value of MRI counts expressed in log units, for males and females?

Note

groupby_gender.boxplot can be used to draw box plots of the columns for each group.

Plotting data
Pandas comes with some plotting tools (pandas.plotting, using matplotlib behind the scenes) to
display statistics of the data in dataframes:

Scatter matrices:

>>> from pandas import plotting
>>> plotting.scatter_matrix(data[['Weight', 'Height', 'MRI_Count']])

>>> plotting.scatter_matrix(data[['PIQ', 'VIQ', 'FSIQ']])

Two populations

The IQ metrics are bimodal, as if there are 2 sub-populations.

Exercise

Plot the scatter matrix for males only, and for females only. Do you think that the 2 sub-populations correspond
to gender?
Hypothesis testing: comparing two groups
For simple statistical tests, we will use the scipy.stats sub-module of scipy:

>>> from scipy import stats

Student’s t-test: the simplest statistical test

1-sample t-test: testing the value of a population mean

scipy.stats.ttest_1samp() tests if the population mean of data is likely to be equal to a given value
(technically, if observations are drawn from a Gaussian distribution of given population mean). It returns the T
statistic and the p-value (see the function’s help):

>>> stats.ttest_1samp(data['VIQ'], 0)
Ttest_1sampResult(statistic=30.088099970..., pvalue=1.32891964...e-28)

With a p-value of 10^-28 we can claim that the population mean for the IQ (VIQ measure) is not 0.
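
To make the T statistic less of a black box, the sketch below recomputes it by hand as (sample mean - hypothesised mean) / (sample standard deviation / sqrt(n)) and compares it with the scipy result. It uses only the five VIQ values listed earlier, and the hypothesised mean of 100 is an assumption chosen for illustration:

```python
import numpy as np
from scipy import stats

# VIQ of the five subjects shown at the top of the file.
viq = np.array([132.0, 150.0, 123.0, 129.0, 132.0])
popmean = 100.0  # hypothesised population mean (illustrative assumption)

result = stats.ttest_1samp(viq, popmean)

# Same statistic by hand; ddof=1 gives the unbiased sample standard deviation.
standard_error = viq.std(ddof=1) / np.sqrt(len(viq))
t_manual = (viq.mean() - popmean) / standard_error

print(result.statistic, result.pvalue)
print(t_manual)
```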

2-sample t-test: testing for difference across populations

We have seen above that the mean VIQ in the male and female populations was different. To test if this is
significant, we do a 2-sample t-test with scipy.stats.ttest_ind():

>>> female_viq = data[data['Gender'] == 'Female']['VIQ']
>>> male_viq = data[data['Gender'] == 'Male']['VIQ']
>>> stats.ttest_ind(female_viq, male_viq)
Ttest_indResult(statistic=-0.77261617232..., pvalue=0.4445287677858...)
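
The p-value of about 0.44 is far above a conventional threshold such as 0.05, so this test gives no evidence that the mean VIQ differs between the two groups. A minimal sketch of that decision step follows. The two arrays are hypothetical stand-ins for the full columns, alpha = 0.05 is an assumed threshold, and the Welch variant (equal_var=False) is shown as an optional alternative that the report itself does not use:

```python
import numpy as np
from scipy import stats

# Hypothetical VIQ values standing in for the two groups.
female_viq = np.array([132.0, 132.0, 90.0, 136.0, 96.0])
male_viq = np.array([150.0, 123.0, 129.0, 93.0, 114.0])
alpha = 0.05  # assumed significance threshold

# Student's t-test assumes equal variances in both groups.
student = stats.ttest_ind(female_viq, male_viq)
# Welch's t-test drops that assumption.
welch = stats.ttest_ind(female_viq, male_viq, equal_var=False)

for name, res in [("Student", student), ("Welch", welch)]:
    verdict = "reject" if res.pvalue < alpha else "fail to reject"
    print(name, res.statistic, res.pvalue, verdict, "the null hypothesis")
```

With equally sized groups, as here, both variants give the same T statistic and differ only in the degrees of freedom used for the p-value.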
