Data Science Notes

Uploaded by

niranjankv05
Copyright © All Rights Reserved

Starting: - (better to do this in Windows PowerShell rather than in Command Prompt)

● Download and install Python
● Install pip and pandas
● Check if Python is working properly
● Go to the Windows PowerShell terminal
● cd Desktop (Enter), then cd into the Python folder (Enter)
● Type ls to list the contents of the folder

Date - 2nd August 2023


Python: -

1. Open the unzipped folder in the terminal (the prompt line should end with the folder name) - from the abalone folder/iris folder/excel folder (.xlsx - file download nutrition)
2. Start Python (type python in PowerShell)
3. import pandas as pd
4. abalone=pd.read_csv('abalone.data',header=None)
5. abalone.head(no. of rows we want to see)
6. The data will be displayed with numbered column headers (0, 1, 2, …) because of header=None in point no. 4
7. If our data already has headers, we do not need the "header=None" argument inside the brackets.
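
A minimal sketch of what header=None does. The three rows below are made-up stand-in values (an assumption, not the real abalone.data), read from memory so that the example runs without the file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a headerless file such as abalone.data
raw = io.StringIO("M,0.455,0.365,15\nF,0.530,0.420,9\nI,0.330,0.255,7\n")

# header=None: treat the first line as data and label the columns 0, 1, 2, ...
abalone = pd.read_csv(raw, header=None)

print(abalone.head(2))          # first 2 rows
print(list(abalone.columns))    # the automatic integer column labels
```

Without header=None, pandas would use the first data row as the column names, and that row would be lost from the data.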

Page 4, Chapter 1 of the DSML textbook - try the commands - ignore the first line of code, as we have already downloaded the file.
Check this out for error correction

9th August 2023


For Excel files: -

1. Open the excel folder in the terminal
2. Install openpyxl and xlrd
3. Open Python
4. import pandas as pd
5. nutri=pd.read_excel('nutrition_elderly.xlsx')
6. nutri.head(no. of rows we want to see)
7. Press Enter and see the result

nutri.info() gives the list of all column variables - their types and other info.
The 1st column is gender
Assign labels to its values (1 = Male, 2 = Female), thus changing the type of the data from integer to category

Assignment - page no. 6 (2 sets of commands) and page no. 7 (another set of commands) in the DSML book - run these commands in Python
For displaying maximum columns,
● nutri=pd.read_excel('file_name')
● nutri.head(no. of rows we want to see)
● pd.set_option('display.max_columns',5)
● nutri.head(5)

For changing the type of the data (from integer to category)

● DICT={1:'Male',2:'Female'}
● nutri['gender']=nutri['gender'].replace(DICT).astype('category')
● nutri.head(5)
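
The same steps as a runnable sketch. The small table below is a made-up stand-in for the nutrition file (an assumption), so that the commands can be tried without nutrition_elderly.xlsx.

```python
import pandas as pd

# Hypothetical stand-in for the nutrition data: gender coded as 1 and 2
nutri = pd.DataFrame({"gender": [1, 2, 2, 1], "height": [151, 162, 162, 154]})

DICT = {1: "Male", 2: "Female"}

# replace() swaps the codes for words; astype('category') marks the column as categorical
nutri["gender"] = nutri["gender"].replace(DICT).astype("category")

print(nutri["gender"].dtype)
print(nutri.head(5))
```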

10th August 2023


Open and import nutri csv file
Mean of height calculate:

For getting quantiles:

For other quantities: -

For range = Max-min =


For variance, rounded to 2 decimal places:
For standard deviation, with precision up to 4 decimal places:
To find the statistical description of any column in the data: -
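
The commands for this class are not recorded in the notes, so the following is only a sketch of one way to get each quantity with pandas. The five height values are made up (an assumption), and the column name height is taken from the line about the mean of height.

```python
import pandas as pd

# Hypothetical heights in cm; the real values come from the nutri file
nutri = pd.DataFrame({"height": [150.0, 160.0, 170.0, 180.0, 190.0]})
h = nutri["height"]

mean_height = h.mean()                        # mean of height
quartiles = h.quantile(q=[0.25, 0.5, 0.75])   # quantiles
value_range = h.max() - h.min()               # range = max - min
variance = round(h.var(), 2)                  # sample variance, 2 decimal places
std_dev = round(h.std(), 4)                   # sample standard deviation, 4 decimal places

print(mean_height, value_range, variance, std_dev)
print(h.describe())                           # statistical description of the column
```

Note that var() and std() in pandas divide by n-1 (sample versions) by default.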

Using gnuplot - plotting package - for data visualisation


Download from the website, install by double clicking - click all applicable boxes and
circles.

18th August 2023

Installing gnuplot tutorials


Test data download - given by ma’am (5 test data files downloaded from google
classroom) [test 1 test 2 test 3 test 4 test 5]

Simple commands after opening gnuplot


plot x+2
plot sin(x)
After plotting each function, type reset. Otherwise gnuplot will use the old settings to plot the new function

Double asterisk (**) = exponent, single asterisk (*) = multiplication
For plotting 2 functions at once, separate them with a comma; a backslash \ continues the command on the next line
To change the range of the x and y axes (from what value to what value), use the following commands
Earlier, the range was -1 to 1 on the y axis and -10 to 10 on the x axis

To rename the axis: -

To plot our test 1 data

plot 'test1.dat' using 1:2 with lines (NOTE: open the terminal from the folder and THEN use the command)

The result should be like this for test 1

● Open the terminal from the folder
● Type the above command carefully, with the exact spaces and words (it is case-sensitive)
● Just like when we used Python
22 August 2023

Opening gnu plot…


3 columns - 2 input variables, X and Y, and the errors associated with Y in the 3rd column (Z)…
We should plot the errors

The error bar shows how long the error is - it spans along the y axis on either side of the plotted point
Error bars in y are vertical. Error bars in x are horizontal.
Test data 2: X, Y, Z1, Z2 - the errors are not uniform on both sides of the point this time
reset

Test 3 3d plot

splot = 3D plotting instead of just 'plot' - surface plot

Test 4 now

For giving the change in labelling to gnuplot:


meaning replace 0 with 100, replace 200 with 2 and so on…

Install numpy, matplotlib and pandas - IMPORTANT - pip install …

After that -
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

25th August 2023

Here, the x coordinates are just the positions for the tick labels

plt.xticks(x,situation_counts.index)
([<matplotlib.axis.XTick object at 0x00000213F9C5BB50>, <matplotlib.axis.XTick
object at 0x00000213F9C5A550>, <matplotlib.axis.XTick object at
0x00000213F9C4FCD0>], [Text(0.0, 0, '2'), Text(0.8, 0, '1'), Text(1.6, 0, '3')])
>>> plt.show()
This plots the counts of the situation column
29th August 2023

Scatter plots
First enter what the variables are…
Writing a code now…

After running code

Anaconda and Python tutorial - PhD scholars - 30th August 2023

Google Colab -
- Jupyter notebook

1st September 2023


Mushroom data set

mush.info() - gives the number of features in the data set

23 quoted names for the 23 columns, including the columns which we want to change.
We want to change only 2 column names, but this method replaces the names of all the columns at once.
To change the attributes within the column: - edible and poisonous

Rename the odor elements also
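
As an alternative to retyping every name, pandas also has rename(), which changes only the columns listed. The sketch below is not the command used in class: the 3-column table is a made-up stand-in for the 23-column mushroom data, and the new names edibility and odor are assumptions.

```python
import pandas as pd

# Hypothetical stand-in; the real mushroom data has 23 columns
mush = pd.DataFrame({0: ["e", "p", "e"], 1: ["x", "b", "x"], 5: ["n", "f", "a"]})

# rename() takes a mapping, so only the listed columns change
mush = mush.rename(columns={0: "edibility", 5: "odor"})

# Replace the one-letter codes inside the columns with words
mush["edibility"] = mush["edibility"].replace({"e": "edible", "p": "poisonous"})
mush["odor"] = mush["odor"].replace({"n": "none", "f": "foul", "a": "almond"})

print(mush)
```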

Solve question 1 in the exercises in the DSML book
Whatever you have learnt - do it on the mushroom dataset - DONE ON 5th SEPTEMBER 2023

In nutri.info(), the values still come in as short forms - use the DICT mapping to make them appear in their original form as words.
Which odors of mushroom should be avoided for consumption? - analyse from the crosstab
What proportion of mushrooms is edible? - 96.6%
Crosstab - important
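
A small sketch of a crosstab, using six made-up rows (an assumption, not the real mushroom data). pd.crosstab counts how often each pair of values occurs together, which is what lets us see which odors go with poisonous mushrooms.

```python
import pandas as pd

# Hypothetical rows; column names edibility and odor are assumptions
mush = pd.DataFrame({
    "edibility": ["edible", "poisonous", "edible", "poisonous", "edible", "edible"],
    "odor": ["none", "foul", "almond", "foul", "none", "none"],
})

# Rows = odor, columns = edibility, cells = number of mushrooms
table = pd.crosstab(mush["odor"], mush["edibility"])
print(table)

# Proportion of edible mushrooms in this toy table
share_edible = (mush["edibility"] == "edible").mean()
print(share_edible)
```

In this toy table every foul-smelling mushroom is poisonous, so "foul" would be the odor to avoid.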
BIG Data SETS 14th September 2023 (refer PPT)
https://www.internetlivestats.com/
BIG data is big in terms of ROWS, not columns - far more records than columns.
e.g. 1,00,000 * 20 (rows * columns)
Combination of many models
We can't fit a single curve/model to LARGE data sets - we need multiple models to represent the data accurately.

BIG Data as bad data -

1) Massive data sets are typically the result of opportunity, instead of design
2) Social media - not the result of design - no question in mind, but data still keeps coming in - we can't control the flow of data.
3) Data scientists find ways to make money out of the data set.
Wonderful resources - but set back by biases - missing out on certain observations - unrepresentative participation
Spam - machine-generated content - misleading and unwanted - has to be taken care of and blocked. Paid reviews are also misleading…
Spam filtering is essential

Too much redundancy - replicas/repetition. Remove the duplicates

Susceptibility to temporal biases.

Key points: -
Big data is the data we have. Good data is the result of design - appropriate to the challenge at hand (only then is it good data)

Factors that define a big data set - 3 Vs

VOLUME: amount of data. More sophisticated analysis infrastructure and computational techniques are needed.
VARIETY: MANY different types of data. Integrating everything is the big task at hand.
VELOCITY: Data coming in live. Collecting, indexing and visualising through dashboard systems - dealing with the speed of data and with REAL-time results. Changing every second, with every change in information.

THE FOURTH V: Veracity (trustworthiness of data)

Algorithmics for Big data.


3 basic problems.

Which types of movies are more likely to succeed? Range of movie gross in the US.

15th September 2023


Variety - heterogeneous data
Algorithms for big data sets - Refer PPT
Random Access Machine - RAM - abstract computer - the no. of steps involved measures the time the algorithm will take (small sets)
Input size - n

A FOR loop runs over an array. (loop - something repeated in the same fashion)
e.g. to calculate BMI - 10 values of height, 10 values of weight -
Run a for loop [i = 1 to 10]
{r = (BMI formula for every i)}
Plot r
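
The loop above written out in Python, as a sketch. The height and weight values are made up (an assumption), and only 4 pairs are used instead of 10 to keep it short. BMI = weight in kg divided by the square of height in metres.

```python
# Hypothetical measurements
heights_m = [1.50, 1.60, 1.70, 1.80]
weights_kg = [45.0, 64.0, 72.25, 81.0]

r = []
for i in range(len(heights_m)):                   # one pass over the array: n steps
    bmi = weights_kg[i] / heights_m[i] ** 2       # BMI formula for every i
    r.append(bmi)

print(r)
```

The list r can then be passed to plt.plot(r) to plot it.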
Nested loop - inside a loop, there is another loop. We can have multiple loops inside one loop, along with if, else and while statements
An algorithm gets more complicated the more nested it is
A single loop runs n times
A nested loop runs n*n times (if 2 loops are nested)
2 sequential loops (one loop after another) take n+n = 2n operations.
Need to learn basic loop structure algorithms (PPT)
Find the nearest neighbour of a point p (in a data set) - subtract p from every other number in the array - the no. for which the absolute difference is minimum is the closest.
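
A sketch of the nearest-neighbour search for plain numbers. The function name nearest_neighbour and the sample values are made up for illustration. It is a single loop, so it takes n steps.

```python
def nearest_neighbour(p, values):
    """Return the number in values that is closest to p."""
    best = None
    best_gap = float("inf")
    for v in values:                 # single loop: n steps
        gap = abs(v - p)             # distance between v and p
        if gap < best_gap:
            best, best_gap = v, gap
    return best

closest = nearest_neighbour(7, [1, 4, 9, 12])
print(closest)
```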

Finding the closest pair of points in a set


O(d * n*n) operations (d = number of dimensions) - quadratic-time algorithm
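
A sketch of the brute-force closest-pair search, with made-up points (an assumption). The two nested loops compare every pair, and each distance costs d operations, which is where the d * n*n comes from.

```python
import math

def closest_pair(points):
    """Compare every pair of points and keep the pair with the smallest distance."""
    best = (math.inf, None, None)
    n = len(points)
    for i in range(n):                   # outer loop
        for j in range(i + 1, n):        # inner loop: every later point
            dist = math.dist(points[i], points[j])   # d operations per pair
            if dist < best[0]:
                best = (dist, points[i], points[j])
    return best

pts = [(0, 0), (5, 5), (1, 1), (9, 0)]
distance, first, second = closest_pair(pts)
print(distance, first, second)
```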
Matrix Multiplication

While Loops - analysis gets more complicated


Binary search
Mergesort
Algorithms running on big data sets must be linear or near linear. Quadratic
algorithms become impossible to contemplate for n>10,000
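
Binary search is an example of a while loop that is much faster than linear. A sketch with made-up data: each step halves the part of the sorted list that is still being searched, so 1000 values need at most about 10 steps.

```python
def binary_search(sorted_values, target):
    """Return (index of target or -1, number of steps taken)."""
    lo, hi = 0, len(sorted_values) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_values[mid] == target:
            return mid, steps
        if sorted_values[mid] < target:
            lo = mid + 1        # target can only be in the upper half
        else:
            hi = mid - 1        # target can only be in the lower half
    return -1, steps

data = list(range(0, 2000, 2))      # 1000 sorted even numbers
index, steps = binary_search(data, 1998)
print(index, steps)
```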

Hashing - turning quadratic algorithms into linear algorithms - cutting down the size
Every value is mapped to an integer value
Applications of hashing
e.g. Dictionary maintenance
Frequency counting
Assigning hash functions and hash tables
Duplicate removal
Canonization - the same thing being called by different names/case-sensitive names
Cryptographic hashing
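
Python's dict, set and Counter are all hash tables, so three of the applications above can be sketched in a few lines. The city names are made up (an assumption). Each step is one pass over the data instead of comparing every pair.

```python
from collections import Counter

names = ["Delhi", "delhi ", "Mumbai", "DELHI", "Pune", "mumbai"]

# Canonization: bring every spelling to one standard form
canonical = [name.strip().lower() for name in names]

# Frequency counting with a hash table (Counter is a kind of dict)
counts = Counter(canonical)

# Duplicate removal in one pass, keeping the first occurrence of each name
seen = set()
unique = []
for key in canonical:
    if key not in seen:          # set lookup takes about constant time
        seen.add(key)
        unique.append(key)

print(counts)
print(unique)
```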

19th September 2023


Storage Hierarchy
We can't use a machine to store data forever
As soon as we get data - do the analysis - as soon as it is finished - throw the data away.
Big data sets - more storage bound than algorithm bound
(PPT contd.)
Temporary storage of data before it is thrown away - several types of devices
Levels of storage
Cache memory - working data stored for analysis (measured in MB)
Main memory - size measured in GB - where big data is kept for some time - especially for backup.
Main memory in another machine
But access time is slower
Disk storage - measured in terabytes
Process files and data structures in streams (sequentially)
Think big files instead of directories - organise all the data in one large file instead of making a separate file for each small data set - keep the file sorted
Pack data concisely - compressed format - pay the cost of compressing the data rather than sending the BIG data directly - saves money and size, and increases speed

Filtering and sampling of BIG data sets


Data scientists spend 60% of their time on CLEANING THE DATA SET
If we have sufficient volume - we can afford to throw away some amount of data
Filtering: selecting a relevant subset of data based on specific criteria.
But this has some drawbacks
Sampling
And the types and steps in the procedure
2 approaches

Viz.wtf website
If there are very few points - do not connect them with lines - let them remain as a dot plot.

22nd September 2023

PPT continued…
Sampling by truncation - use the first n elements as a sample and then stop
Deterministic sampling algorithms
Side effect - temporal biases. Outdated sampling
Lexicographic biases.
Numerical biases
Truncation in general is a bad idea
UNIFORM sampling
Advantages of uniform sampling
Even then, there are temporal biases here - a little bit complex
Thus the best way is to do randomised and streamed sampling - (impt.)
But they're not reproducible.
No specific criteria for selecting the sample
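
One standard way to do randomised, streamed sampling is reservoir sampling. This is a sketch of that technique, not necessarily the one in the PPT. The function name reservoir_sample and the seed argument are made up for illustration. It reads the stream once and keeps only k items in memory.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Pick k items uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)        # random position from 0 to i (both included)
            if j < k:
                sample[j] = item         # replace an old item with probability k/(i+1)
    return sample

picked = reservoir_sample(range(1000), k=5, seed=42)
print(picked)
```

Passing a fixed seed makes the run repeatable, which gets around the reproducibility problem at least on the same machine and Python version.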

Parallelism
Doing analysis side by side instead of sequentially
Parallel processing
Distributed processing - happens on many machines using network communication
Loosely coupled - not strongly related
Complexity of parallelism
Challenges of both parallel and distributed computing…

Big data ethics - (last topic of big data)


Bigger the size, bigger the trouble
Results will be used to influence public opinion and decisions
May appear in courts and proceedings
Be an ethical data scientist

Transparency and ownership

Maintain security and privacy

26th September 2023

Tukey’s 5 number summary for all the 4 sheets given in the GDP Data Set
To do it in python

Do it for all the six sheets
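
A sketch of the five-number summary (minimum, Q1, median, Q3, maximum) with NumPy. The values are made up (an assumption), and the function name five_number_summary is only for illustration. The quartiles here use NumPy's default interpolation, which can differ slightly from Tukey's hinges on some data.

```python
import numpy as np

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) of the values."""
    x = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return x.min(), q1, median, q3, x.max()

gdp = [2.0, 4.0, 6.0, 8.0, 10.0]     # hypothetical values from one sheet
summary = five_number_summary(gdp)
print(summary)
```

For a real sheet, the list would be replaced by one column of the data frame read with pd.read_excel.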


Do the plots for all the sheets in python - make line/bar plots to show trend - do not
make scatter plots
Use matplotlib.pyplot in python terminal

29th September 2023

For engineering data set

Use the drop function to drop the columns. Then, run info() again

To check if there are any duplicated rows:
engg.duplicated().sum() If there are no duplicates, the result will be 0.
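
A sketch of the duplicate check on a made-up table (the column names ID, Salary and Notes are assumptions). duplicated() marks every row that repeats an earlier row, and sum() counts them.

```python
import pandas as pd

# Hypothetical stand-in for the engineering data set
engg = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "Salary": [300, 450, 450, 500],
    "Notes": ["a", "b", "b", "c"],
})

engg = engg.drop(columns=["Notes"])        # drop a column we do not need
n_before = engg.duplicated().sum()         # number of repeated rows

engg = engg.drop_duplicates()              # remove the repeated rows
n_after = engg.duplicated().sum()          # should now be 0

print(n_before, n_after)
```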

3rd October 2023

There are certain columns which we can drop, as they're not necessary for determining things related to salary.
To check if there are any duplicates - the result of the command should be 0

To know the specializations of students - decreasing order

For storing all the values in another variable

To choose single value specialisations


Commands to practise and see to increase knowledge about python

Coming to arrays
Research more on this
Come up with more operations
Square brackets usually used when defining array and vectors
To know the dimension of the matrix

To know the datatype of the variable

Float = numbers with decimals - fractions, irrational numbers - basically not integers


Do ‘help’ command

6th October 2023


Making a linear regression for our data
Model fitting
reshape(1,-1) = single row, many cols
reshape(-1,1) = multiple rows, single col
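
A short sketch of the two reshapes with NumPy, on a made-up array. The -1 means "work out this size from the length of the data".

```python
import numpy as np

x = np.array([1, 2, 3, 4])

row = x.reshape(1, -1)     # 1 row, as many columns as needed
col = x.reshape(-1, 1)     # as many rows as needed, 1 column

print(row.shape)
print(col.shape)
```

The column form is the one scikit-learn expects for a single input variable in fit().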
Data science project
Aastha, archana, me, mithalia, sai keerthi, sirsha
Kaggle - warehouse of data sets.
Search keyword - student
Select a unique data set, not chosen by another group
What we have to do - download the data set - perform exploratory data analysis - USE ONLY PYTHON to analyse the data set - all visualisation, statistics, description, info, graphical interpretation - present your results in a Microsoft PPT. Describe the data set - sourcing - statistical and graphical analysis - results, conclusions, future insights, limitations - everything about the data set.
Deadline: TUESDAY

After returning to python, import everything, call the file, define the variables and
then give this command.

Commands: -
reg=LinearRegression() #this creates a linear regression model
reg.fit(x_train,y_train)
LinearRegression() #this line is the output shown by the terminal after fit
y_pred=reg.predict(x_test)
plt.figure(figsize=(10,10))
plt.scatter(x,y,c='Red')
plt.plot(x_test,y_pred,c='Black',linewidth=2)
Give the x label and title and everything (see previous commands)
plt.show()

13th October 2023


General python commands
two dimensional

Or

Row and column: element in 1st row, 1st col

2nd row, 3rd column

For transposing the matrix

Square root of all the elements of x

Squares
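
The commands themselves are not recorded in the notes, so this is a sketch of one way to do each of the above with NumPy, on a made-up 2 x 3 matrix.

```python
import numpy as np

# Two dimensional array: 2 rows, 3 columns
x = np.array([[1, 4, 9], [16, 25, 36]])

print(x.shape)        # dimension of the matrix
print(x[0, 0])        # element in 1st row, 1st col (indexing starts from 0)
print(x[1, 2])        # element in 2nd row, 3rd col
print(x.T)            # transpose

roots = np.sqrt(x)    # square root of all the elements of x
squares = x ** 2      # squares of all the elements
print(roots)
print(squares)
```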

17th October 2023

This is for random selection - in data sampling


loc=50 means the values are centred on a mean of 50, so their deviations from 50 average out to about 0 - basically jiggling around 50
Correlation in NumPy

Another way to do the same:-

3 different ways of generating the variance


More statistics:-

last 2 commands give the same


result
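
The exact commands from class are not recorded, so the following is a sketch under assumptions: np.random.default_rng is used for the random numbers, and the scale, size and the relation between x and y are made up. It shows loc, the correlation, and three ways of getting the variance.

```python
import numpy as np

rng = np.random.default_rng(0)                       # fixed seed so the run repeats
x = rng.normal(loc=50, scale=5, size=1000)           # values jiggling around 50
y = 2 * x + rng.normal(loc=0, scale=1, size=1000)    # y depends strongly on x

r = np.corrcoef(x, y)[0, 1]      # correlation between x and y

var_a = np.var(x)                          # way 1: NumPy function
var_b = x.var()                            # way 2: array method
var_c = np.mean((x - x.mean()) ** 2)       # way 3: straight from the definition

print(x.mean(), r)
print(var_a, var_b, var_c)
```

All three variances here divide by n. Passing ddof=1 to np.var gives the sample variance instead.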

For generating a sequence of numbers - (start no., ending no., size of the sequence)

The arange command (a-range) stops one number before the ending no.
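
A sketch of the two commands, assuming the first one is np.linspace, since its arguments are start, stop and number of values.

```python
import numpy as np

a = np.linspace(0, 1, 5)     # 5 evenly spaced values from 0 to 1, both ends included
b = np.arange(0, 5)          # 0, 1, 2, 3, 4 - stops one before 5

print(a)
print(b)
```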


Slicing is for strings
H=0, a=1, … and so on - the characters are numbered in order
For array: -

To index the numbers (Python starts indexing from 0): 0th row and 0th column

--- this is for the 2nd row and 3rd column

Extracting 2 independent rows from the matrix

Double sq brackets

2nd row and 4th row…
Selecting specific columns in the dataset: -

Extracting a submatrix from the bigger matrix - selecting 4 separate elements in the matrix, for example
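
A sketch of each kind of selection on a made-up 4 x 5 matrix. The class commands are not recorded, so the np.ix_ call for the submatrix is one possible way, not necessarily the one used.

```python
import numpy as np

m = np.arange(1, 21).reshape(4, 5)    # 4 x 5 matrix holding 1 to 20

first = m[0, 0]                       # 0th row and 0th column
elem = m[1, 2]                        # 2nd row and 3rd column
rows = m[[1, 3]]                      # 2nd and 4th rows - double square brackets
cols = m[:, [0, 2]]                   # all rows, 1st and 3rd columns
sub = m[np.ix_([0, 2], [1, 3])]       # 4 separate elements: rows 1 and 3, cols 2 and 4

print(first, elem)
print(rows)
print(cols)
print(sub)
```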

NEW THING
GNUPLOT
Ma'am sent txt files in Google Classroom - saved in the Practice database folder - open the folder in the terminal.

w l = with lines

This is a dot plot. Now plotting it with lines

18th October 2023


Training data and test data sets in model fitting
Gnuplot continued….
Three step exercise in gnuplot for model fitting
1) Assign the function - a polynomial in this case - the coefficients and other letters in the expression can change, but everything should be in terms of x. x cannot be changed: -
2) Provide initial estimates for the parameters: - (they can be any estimate, but give them with thought. For a 5th order polynomial, the value of f should be smaller, as its exponent is the highest)

3) Give the command to fit the plot in gnuplot

Plotting the 1st column against the 2nd column, via the parameters a,b,c,d,e,f

4) These are the final results
Good fit
This is a Gaussian - bell shaped
a,b,c - height of the peak (y axis), centre point (x axis) and width at half maximum

To be continued in the next class…


27th October 2023
1) Define the model function
2) Give initial estimates for the parameters
3) Final command to gnuplot - fit the model to the dataset
This is the Gaussian function: -

Now give initial estimates

a = peak of the Gaussian, b = position of the peak, c = width at half maximum
a=0.0406, b=133.12, c=55
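
The exact formula typed into gnuplot is not recorded in the notes. The sketch below assumes the common form a*exp(-(x-b)^2/(2*sigma^2)) and shows how a width at half maximum c relates to sigma in that form: c = 2*sqrt(2*ln 2)*sigma, about 2.355*sigma. The function names are made up for illustration.

```python
import numpy as np

def gaussian(x, a, b, sigma):
    """Bell curve with peak height a, centre b and standard deviation sigma."""
    return a * np.exp(-((x - b) ** 2) / (2 * sigma ** 2))

def fwhm_to_sigma(fwhm):
    """Convert full width at half maximum to sigma."""
    return fwhm / (2 * np.sqrt(2 * np.log(2)))

a, b, c = 0.0406, 133.12, 55           # initial estimates from the notes
sigma = fwhm_to_sigma(c)

peak = gaussian(b, a, b, sigma)            # value at the centre
half = gaussian(b + c / 2, a, b, sigma)    # value half a width away from the centre

print(peak, half)
```

If the class formula used c directly in place of sigma, then c is the standard deviation rather than the width at half maximum, but either reading works as an initial estimate.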

Get the result


Now plot the model and plot the original data

The variable should be x, but the name of the fit function can be anything
yfit(x), mno(x), hari(x), golmaal(x) and many others of the sort…
After our analysis - before plotting the model - we should make sure that our fit converges, e.g. the message "after 9 iterations, the fit converged" should come

But the fit we got is not a good fit, even though our error estimate is low. Need to improve the fit
The tails of the Gaussian don't fit well; to make them fit, we'll add a polynomial
Further improving
Add +dx^3 + ex^4 to the fit function

Give these values for a,b,c to further improve the initial estimates
Random values for d and e

The values of the d and e parameters shouldn't be radically big, like 100 or something - they should be sensibly chosen - trial and error method
This is our new fit

With these commands

It is slightly better when it comes to the tails

Add more terms and experiment with it
This is the way we can do model fitting/curve fitting in gnuplot

Go back to Python
Download the Auto dataset - save it in a folder - open the Python terminal from there
Eliminating the whitespace in the file

In the Auto dataset, there is a column called horsepower

We need to change its datatype - it should be a float/integer

There may be missing values
There are question marks in the column - hence the datatype is given as object
Note the formula (numpy)
Assign the question mark to something
Mark all ? as invalid - NA values

To drop the NA values and confirm that they are dropped

Better to look at the data in a notebook or on any other platform, to check the data details before working on it in Python
This is a float column now…
Datatype has changed to float
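
A sketch of these cleaning steps on three made-up rows (an assumption, not the real Auto file), read from memory so that it runs without the download. The class commands are not recorded, so the replace/dropna/astype route below is one possible way.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Auto data, with one missing horsepower value
raw = io.StringIO("mpg,horsepower,name\n18.0,130,chevrolet\n25.0,?,ford\n22.0,95,toyota\n")
auto = pd.read_csv(raw)
print(auto["horsepower"].dtype)          # not a number type, because of the "?"

auto["horsepower"] = auto["horsepower"].replace("?", np.nan)   # mark ? as NA
auto = auto.dropna()                                           # drop rows with NA
n_missing = auto["horsepower"].isna().sum()                    # confirm: should be 0

auto["horsepower"] = auto["horsepower"].astype(float)          # now a float column
print(n_missing, auto["horsepower"].dtype)
```

Another option is to pass na_values='?' to pd.read_csv, which marks the question marks as NA while the file is being read.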
For better convenience in typing: -

To see the names of the columns

To see the first 3 rows - this is the slicing method

To select a subset of the data set - boolean expression

Selecting only the subset of the dataset where the manufacturing year > 80
Doing the same thing for columns: - doubling the square brackets
Indexing nos.
The length has reduced, but the index has not changed
Setting the index again
There is no name column here, because the name column has been set as the index column.
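
A sketch of the subset steps on a made-up table (the values and car names are assumptions). It shows that the old row numbers stay after filtering, and two ways of setting the index again.

```python
import pandas as pd

# Hypothetical stand-in for the Auto data
auto = pd.DataFrame({
    "name": ["car a", "car b", "car c", "car d"],
    "year": [79, 81, 82, 80],
    "mpg": [18.0, 30.0, 27.0, 22.0],
})

print(auto.columns)                       # names of the columns
print(auto[:3])                           # first 3 rows - slicing

recent = auto[auto["year"] > 80]          # boolean expression: year > 80
two_cols = recent[["name", "mpg"]]        # columns: double square brackets
print(list(recent.index))                 # old row numbers are kept

renumbered = recent.reset_index(drop=True)   # index starts from 0 again
by_name = auto.set_index("name")             # name becomes the index, not a column
print(by_name.loc["car b"])                  # select a row by its name
```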

Select the particular rows which we want

Strings - have to be in square brackets and within quotation marks

Slicing - extracting rows and columns from this matrix

4th and 5th row; 1st, 3rd and 4th col
INDEXING IN PYTHON STARTS FROM 0 - 0,1,2,3,4,5…
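
A sketch of this selection with iloc on a made-up table. Because indexing starts from 0, the 4th and 5th rows are positions 3 and 4, and the 1st, 3rd and 4th columns are positions 0, 2 and 3.

```python
import pandas as pd

# Hypothetical table with 6 rows and 4 columns
df = pd.DataFrame({
    "a": range(0, 6),
    "b": range(10, 16),
    "c": range(20, 26),
    "d": range(30, 36),
})

# 3:5 means positions 3 and 4 (the end position 5 is not included)
block = df.iloc[3:5, [0, 2, 3]]
print(block)
```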

Portion -
After the 1st internal - Python and gnuplot - we need to know the relevance, uses and other details of the commands which we learnt in Python. No need to do the things on the laptop.

31st October 2023


Selecting subsets of the datasets. Be careful of indentation errors
Loop functions

Do the Google Classroom assignment sent by ma'am.

plt.plot(x,ans,'--',color='blue',label='optimised data') - input this command in the above code to get lines - defining the model

3rd November 2023 - Last class

We’ll do as much as we can with the curve fitting

Generate 40 random nos. from 0 to 1 - this is our x array.

Forty randomly selected x values

Now calculating the y values: -

Define the model/test function

a,b = parameters
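
The model used in class is not recorded, so this sketch assumes a straight line a*x + b, made-up true values a=3 and b=2 with a little noise, and scipy.optimize.curve_fit as the fitting routine. The function name model is also only for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)          # fixed seed so the run repeats
x = rng.random(40)                      # 40 random numbers from 0 to 1

# y values: a known line plus small random noise (assumed, for demonstration)
y = 3.0 * x + 2.0 + rng.normal(0, 0.05, size=40)

def model(x, a, b):
    """Model function with parameters a and b."""
    return a * x + b

params, covariance = curve_fit(model, x, y)   # find the best a and b
a_fit, b_fit = params
ans = model(x, a_fit, b_fit)                  # optimised y values

print(a_fit, b_fit)
```

The fitted values should come out close to 3 and 2. ans is what would be passed to plt.plot to draw the fitted line.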
Doing analysis on the longley dataset

Last command - to take all the columns as vectors (as there are relatively more columns) - it will take the values of the columns into the dataset
The data set has 7 columns
-1 is the last column
It is not taking the first set of values, as we've not used header=None
We've defined x and y for our data

Import another numpy package

Indent space is very very very very important
Dots should come in this - review commands
Now try to fit a polynomial model
Doing it for a polynomial
Now try fitting a 5th degree polynomial
Fitting a 2nd degree polynomial to a sine wave
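
A sketch of polynomial fitting with np.polyfit, which may not be the routine used in class. The sine wave on 0 to pi and the 50 points are assumptions. It fits a 2nd degree and a 5th degree polynomial and compares the largest errors.

```python
import numpy as np

x = np.linspace(0, np.pi, 50)
y = np.sin(x)                             # the sine wave to be fitted

coeffs2 = np.polyfit(x, y, deg=2)         # 2nd degree polynomial
fit2 = np.polyval(coeffs2, x)

coeffs5 = np.polyfit(x, y, deg=5)         # 5th degree polynomial
fit5 = np.polyval(coeffs5, x)

err2 = np.max(np.abs(fit2 - y))           # largest error of each fit
err5 = np.max(np.abs(fit5 - y))
print(err2, err5)
```

The 5th degree fit follows the sine wave more closely than the 2nd degree one on this range.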
