Data Science Notes
1. Open a terminal in the unzipped folder (the prompt line should end with the folder name) - from
the abalone folder / iris folder / excel folder (.xlsx - nutrition file download)
2. Start Python (type python in PowerShell)
3. import pandas as pd
4. abalone = pd.read_csv('abalone.data', header=None)
5. abalone.head(n), where n is the number of rows we want to see
6. The data is displayed with integer column labels (0, 1, 2, ...) because of header=None in point no. 4
7. If our data already has headers, we do not need the header=None argument inside the
brackets.
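A minimal sketch of steps 3 to 7, assuming pandas is installed. The two rows below are made up and read from memory instead of from abalone.data, so the snippet runs anywhere; only read_csv, header=None and head come from the notes.

```python
import io
import pandas as pd

# Two made-up rows in the style of a CSV file without a header line.
raw = io.StringIO("M,0.455,0.365\nF,0.530,0.420\n")

# header=None tells pandas there is no header row, so the columns are labelled 0, 1, 2.
sample = pd.read_csv(raw, header=None)
print(sample.head(2))
```

Without header=None, the first data row would be used as the column names instead.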
Page 4 Chapter 1 - DSML textbook - try the commands - Ignore the first line of code as we
have already downloaded it.
Check this out for error correction
nutri.info() gives the list of all column variables - their types and other info.
The 1st column is gender.
Assign labels to its values (male = 1, female = 2), thus changing the type of the column
from integer to categorical.
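A small sketch of the recoding described above, assuming pandas. The frame nutri_demo and the labels "Male" and "Female" are illustrative stand-ins, not the real nutri data.

```python
import pandas as pd

# Hypothetical stand-in for the nutri data: gender stored as integer codes.
nutri_demo = pd.DataFrame({"gender": [1, 2, 2, 1]})

# A dict maps each code to a label; astype("category") makes the column categorical.
nutri_demo["gender"] = nutri_demo["gender"].map({1: "Male", 2: "Female"}).astype("category")
print(nutri_demo["gender"].dtype)
```

After this, info() reports the column as category instead of an integer type.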
plot 'test1.dat' using 1:2 with lines (NOTE: open the terminal from the folder and
THEN use the command)
The error bar shows how large the error is - it spans along the y axis on either side of the plotted point
Error bars in y are vertical. Error bars in x are horizontal.
Test data 2: X, Y, Z1, Z2 - the errors are not uniform on both sides of the point this time
reset
Test 3: 3D plot
Test 4 now
After that -
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.xticks(x,situation_counts.index)
([<matplotlib.axis.XTick object at 0x00000213F9C5BB50>, <matplotlib.axis.XTick
object at 0x00000213F9C5A550>, <matplotlib.axis.XTick object at
0x00000213F9C4FCD0>], [Text(0.0, 0, '2'), Text(0.8, 0, '1'), Text(1.6, 0, '3')])
>>> plt.show()
This is for the counts of the situation column
29th August 2023
Scatter plots
First enter what the variables are…
Writing a code now…
Google Colab -
- Jupyter notebook
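Assigning new column names needs 23 quoted names for the 23 columns, including the columns which we do not want to change.
We want to change 2 column names only, but assigning to the columns attribute requires
the names of all the columns at once (rename with a dict can change only selected columns).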
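A short sketch of the two ways of renaming, assuming pandas. The frame and the column names (c0, c1, c2, class, odor) are placeholders, with 3 columns instead of 23.

```python
import pandas as pd

# Small made-up frame; the column names are placeholders.
df = pd.DataFrame({"c0": ["e", "p"], "c1": ["x", "b"], "c2": ["n", "a"]})

# Option 1: assign to df.columns, which needs one name for every column.
df.columns = ["class", "c1", "c2"]

# Option 2: rename only the columns listed in the dict.
df = df.rename(columns={"c2": "odor"})
print(list(df.columns))
```

Option 2 is usually less typing when only a couple of names change.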
To change the attributes within the column: - edible and poisonous
SEPTEMBER 2023
In nutri.info(), the values still come in as short forms - use a dict (dictionary) mapping to
make them appear in their original form as words.
Mushrooms with which odor should be avoided for consumption? - analyse from the
crosstab
Which proportion of mushrooms are edible? - 96.6%
Crosstab - important
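A toy sketch of a crosstab, assuming pandas. The five rows and the letter codes (e, p, n, f, a) are made up for illustration, so the counts here say nothing about the real mushroom data.

```python
import pandas as pd

# Tiny made-up sample; the letter codes and their meanings are assumptions.
mush = pd.DataFrame({
    "class": ["e", "p", "e", "p", "e"],
    "odor": ["n", "f", "a", "f", "n"],
})
# Replace the short codes in the class column with full words.
mush["class"] = mush["class"].map({"e": "edible", "p": "poisonous"})

# Rows are odor values, columns are classes, cells are counts.
table = pd.crosstab(mush["odor"], mush["class"])
print(table)

# Share of edible mushrooms in this toy sample.
edible_share = (mush["class"] == "edible").mean()
print(edible_share)
```

An odor whose row has counts only in the poisonous column is one to avoid.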
BIG DATA SETS 14th September 2023 (refer PPT)
https://www.internetlivestats.com/
BIG data is big in terms of ROWS, not columns - a massive number of records compared to columns.
e.g. 1,00,000 x 20 (rows x columns)
Combination of many models
Can’t fit a single curve/ model for LARGE data sets - we need multiple models to
represent the data accurately.
Key points: -
Big data is the data we have - result of design - appropriate to the challenge at hand
(only then good data)
Which type of movies more likely to succeed? Range of movie gross in the US.
A FOR loop runs over an array. (loop - something repeated in the same fashion)
e.g. to calculate BMI - 10 values of height, 10 values of weight -
Run a for loop [i = 1 to 10]
{ r[i] = (BMI formula for every i) }
Plot r
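The BMI loop above could look like this in Python. The heights and weights are made up, and only 3 values are used instead of 10 to keep it short; BMI is taken as weight in kg divided by height in metres squared.

```python
# Made-up heights (metres) and weights (kilograms) for illustration.
heights = [1.60, 1.75, 1.82]
weights = [55.0, 70.0, 90.0]

bmi = []
for i in range(len(heights)):
    # BMI = weight / height squared, computed for every i
    bmi.append(weights[i] / heights[i] ** 2)

print([round(r, 1) for r in bmi])
```

The list bmi can then be passed to a plotting command.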
Nested loop - inside a loop, there is another loop. We can have multiple loops
inside one loop, along with if, else and while statements.
The more nested an algorithm is, the more complicated it is.
A single loop runs n times.
A nested loop runs n*n times (for 2 nested loops).
2 sequential loops (one loop after another) take n+n = 2n operations.
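A quick way to check these counts is to count the operations directly. This is only a sketch with n = 5; the counter names are arbitrary.

```python
n = 5

single = 0
for i in range(n):
    single += 1            # runs n times

nested = 0
for i in range(n):
    for j in range(n):
        nested += 1        # runs n * n times

sequential = 0
for i in range(n):
    sequential += 1
for j in range(n):
    sequential += 1        # n + n = 2n in total

print(single, nested, sequential)  # 5 25 10
```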
Need to learn Basic Loop structure algorithms (PPT)
Find the nearest neighbour of a point p (in a data set) - subtract p from every other number
in the array - the number for which the absolute difference is minimum is the closest.
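A minimal sketch of this search for plain numbers. The function name nearest and the data are made up, and it assumes p itself is not one of the values being compared.

```python
def nearest(p, values):
    """Return the element of values closest to p (one pass, so linear time)."""
    best = values[0]
    for v in values[1:]:
        # Keep v if its absolute difference from p is smaller than the best so far.
        if abs(v - p) < abs(best - p):
            best = v
    return best

data = [12, 3, 27, 8, 19]   # made-up numbers
print(nearest(9, data))     # 8
```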
Hashing - turning quadratic algorithms into linear algorithms - Cutting down the size
Every value mapping to different integer value
Applications of Hashing
eg) Dictionary maintenance
Frequency Counting
Assigning hash functions and hash tables
Duplication Removal
Canonization - calling a single event by different names/ case sensitive names.
Cryptographic Hashing
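Three of the applications listed above can be sketched with Python's built-in dict and set, which are hash based. The word list is made up, and lower-casing is just one possible choice of canonical form.

```python
words = ["Apple", "banana", "apple", "Cherry", "BANANA", "apple"]  # made-up data

# Canonization: map different spellings of the same thing to one form.
canonical = [w.lower() for w in words]

# Frequency counting with a dict (a hash table): one pass over the data.
counts = {}
for w in canonical:
    counts[w] = counts.get(w, 0) + 1

# Duplicate removal with a set (also hash based); sorted only for stable output.
unique = sorted(set(canonical))

print(counts)
print(unique)
```

Each lookup in the dict or set takes roughly constant time on average, which is why one pass is enough instead of comparing every pair.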
Viz.wtf website
If there are very few points - do not connect them with lines - let them remain as dot
plot.
PPT continued…
Sampling by truncation - use first n elements and then stop - as a sample
Deterministic sampling algorithms
Side effect - temporal biases. Outdated sampling
Lexicographic biases.
Numerical biases
Truncation in general is a bad idea
UNIFORM Sampling
Advantages of uniform sampling
Even then, there are temporal biases here - a little bit complex
Thus the best way is to do randomised and streamed sampling (important)
But they're not reproducible.
No specific criteria for selecting the sample
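A rough sketch comparing the three kinds of sampling, using only the standard library. The list records stands in for a data set in file order; fixing a seed, as done here, is one common way to make a random draw repeatable and is an addition to what the notes say.

```python
import random

records = list(range(100))   # stand-in for a data set, in file order
n = 10

# Truncation: first n elements (inherits any ordering bias in the file).
truncated = records[:n]

# Uniform sampling: every (len // n)-th element.
step = len(records) // n
uniform = records[::step]

# Random sampling; a fixed seed makes the same sample come out on every run.
rng = random.Random(42)
randomised = rng.sample(records, n)

print(truncated)
print(uniform)
print(randomised)
```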
Parallelism
Doing analysis side by side instead of sequentially
Parallel processing
Distributed processing - happens on many machines using network communication
Loosely coupled - not strongly related
Complexity of parallelism
Challenges of both parallel and distributed computing…
Tukey’s 5 number summary for all the 4 sheets given in the GDP Data Set
To do it in python
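One possible way to get the five numbers (minimum, lower quartile, median, upper quartile, maximum) with numpy. The values are made up, not the GDP sheets, and np.percentile interpolates, so for some data the quartiles can differ slightly from Tukey's hinges.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])  # made-up data, not the GDP sheets

# Minimum, lower quartile, median, upper quartile, maximum.
summary = np.percentile(values, [0, 25, 50, 75, 100])
print(summary)
```

For the GDP data set, the same call would be repeated on the relevant column of each of the 4 sheets.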
Coming to arrays
Research more on this
Come up with more operations
Square brackets usually used when defining array and vectors
To know the dimension of the matrix
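A small sketch of defining arrays with square brackets and checking their dimensions, assuming numpy. The names v and m are arbitrary.

```python
import numpy as np

v = np.array([1, 2, 3])            # vector
m = np.array([[1, 2, 3],
              [4, 5, 6]])          # matrix with 2 rows and 3 columns

print(v.shape)   # (3,)
print(m.shape)   # (2, 3)
print(m.ndim)    # 2
```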
After returning to python, import everything, call the file, define the variables and
then give this command.
Commands: -
from sklearn.linear_model import LinearRegression
reg = LinearRegression()  # creates a linear regression model
reg.fit(x_train, y_train)  # the interpreter echoes LinearRegression()
y_pred = reg.predict(x_test)
plt.figure(figsize=(10, 10))
plt.scatter(x, y, c='red')
plt.plot(x_test, y_pred, c='black', linewidth=2)
Give the x label, title and everything else (see previous commands)
plt.show()
Or
For generating sequence of the numbers - (start no., ending no., size of the
sequence)
To index the numbers (python starts indexing from 0) 0th Row and 0th Column
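A sketch of both points, assuming numpy. np.linspace is assumed here because its arguments match (start, end, size of the sequence); the numbers are arbitrary.

```python
import numpy as np

# start no., ending no., size of the sequence
seq = np.linspace(0, 1, 5)
print(seq)

m = np.array([[10, 20],
              [30, 40]])
print(m[0, 0])   # 0th row, 0th column -> 10
print(m[1, 0])   # 1st row, 0th column -> 30
```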
NEW THING
GNUPLOT
Ma’am sent txt files in google classroom- saved in Practice database - open the
folder in terminal.
Plotting the 1st column with the 2nd column, through the parameters
a,b,c,d,e,f
4)
These are the final results
Good fit
This is a Gaussian - bell shaped
a, b, c - height (peak, y axis), centre point (x axis) and width (measured at half the peak
height, i.e. width at half max)
The variable should be x, but the name of the fit can be anything
yfit(x), mno(x), hari(x), golmaal(x) and many others of the sort…
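The role of a, b and c can be checked numerically with a short Python sketch. The exact fit function is not written in these notes, so the standard form a * exp(-(x - b)^2 / (2 c^2)) is an assumption; in that form c is the standard deviation and the full width at half maximum is 2 * sqrt(2 ln 2) * c.

```python
import math

def gaussian(x, a, b, c):
    """Assumed form: a * exp(-(x - b)^2 / (2 c^2))."""
    return a * math.exp(-((x - b) ** 2) / (2 * c ** 2))

a, b, c = 2.0, 5.0, 1.5          # arbitrary illustrative values

peak = gaussian(b, a, b, c)      # value at the centre equals the height a
fwhm = 2 * math.sqrt(2 * math.log(2)) * c   # full width at half maximum for this form
half = gaussian(b + fwhm / 2, a, b, c)      # half of the peak height
print(peak, fwhm, half)
```

If the fit function is written differently, the link between c and the width at half max changes accordingly.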
After our analysis - before plotting the model, we should make sure that our fit
converges, e.g. a message like "After 9 iterations the fit converged" should appear.
But the fit we got is not a good fit, even though our error estimate is low. We need to improve the
fit.
The tails of the Gaussian don't fit well; to make them fit, we'll add a polynomial
Further improving
Add + d*x**3 + e*x**4 to the fit function
Give these values for a, b, c to further improve the initial estimates
Random values for d and e
The values of the d and e parameters shouldn't be radically big, like 100 - they
should be chosen sensibly, by trial and error
This is our new fit
GO Back to python
Download the Auto datasets - save it in folder - open python terminal from there
Eliminating the whitespace in the file
In the Auto dataset, there is a column called horsepower
Better to see the data in notebook or in any other platform - to check the data details
before working on it in python
This is a float column now…
Datatype has changed to float
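A sketch of how the horsepower column could end up as float, assuming pandas. The three rows and the column names are made up, and the use of '?' as a marker for a missing value is an assumption about the file.

```python
import io
import pandas as pd

# Made-up whitespace-separated rows; '?' as a missing-value marker is an assumption.
raw = io.StringIO("18.0  8  130.0\n25.0  4  ?\n22.0  6  95.00\n")

# sep=r"\s+" splits on any run of whitespace between the values.
auto = pd.read_csv(raw, sep=r"\s+", header=None,
                   names=["mpg", "cylinders", "horsepower"])

# Non-numeric entries become NaN, so the column ends up as float.
auto["horsepower"] = pd.to_numeric(auto["horsepower"], errors="coerce")
print(auto.dtypes)
```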
For better convenience in typing:-
4th and 5th row, 1st col and 3rd col and 4th col
INDEXING IN PYTHON STARTS FROM 0 - 0,1,2,3,4,5…
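The selection above can be sketched with iloc, assuming pandas. The 5 x 4 table is made up; note how the 4th row becomes position 3 because counting starts from 0.

```python
import pandas as pd

# Made-up 5 x 4 table.
df = pd.DataFrame([[r * 10 + c for c in range(4)] for r in range(5)],
                  columns=["a", "b", "c", "d"])

# 4th and 5th rows -> positions 3, 4; 1st, 3rd and 4th columns -> positions 0, 2, 3.
subset = df.iloc[[3, 4], [0, 2, 3]]
print(subset)
```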
Portion -
after the 1st internal - python and gnuplot - we need to know the relevance, uses and
other details of the commands which we learnt in python. No need to do the things
on the Laptop.
Last command - to take all the columns as vectors (as there is a relatively large
number of columns): it takes the values of the columns from the dataset
Data set has 7 columns
-1 is the last column
It is not taking the first row of values because we did not pass header=None (the first row is read as the header)
We’ve defined x and y for our data
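A possible version of defining x and y with -1 as the last column, assuming pandas. The 7-column table is made up, and taking x as every column except the last is an assumption about the command used in class.

```python
import pandas as pd

# Made-up table with 7 columns.
data = pd.DataFrame([[1, 2, 3, 4, 5, 6, 7],
                     [8, 9, 10, 11, 12, 13, 14]])

x = data.iloc[:, :-1].values   # every column except the last
y = data.iloc[:, -1].values    # -1 selects the last column
print(x.shape, y.shape)        # (2, 6) (2,)
```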