Exercise and Experiment 3


WORKING ON DATASETS

Definition of Dataset:

 A dataset is a collection of data that machine learning/AI developers use to train a
model.
 In a dataset, the rows represent individual data points and the columns represent the
features of the dataset (see the sketch below).
 They are mostly used in fields like machine learning, business, and government to gain
insights, make informed decisions, or train algorithms.
 Datasets vary in size and complexity, and they usually require cleaning and
preprocessing to ensure data quality and suitability for analysis or modeling.
 Datasets can be stored in multiple formats. The most common ones are CSV, Excel,
JSON (JavaScript Object Notation), and zip files for large datasets such as image datasets.
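
As an illustration, here is a minimal sketch (the column names and values are made up for
this example) showing rows as data points and columns as features in a pandas DataFrame:

Eg :
import pandas as pd
# each row is one data point; each column is one feature
data = pd.DataFrame({
    "Temperature": [21.5, 23.0, 19.8],        # numerical feature
    "Humidity": [60, 55, 70],                 # numerical feature
    "Weather": ["Sunny", "Cloudy", "Rainy"]   # categorical feature
})
print(data.shape)  # (3, 3) -> 3 data points (rows), 3 features (columns)
print(data)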

Types of datasets:

1. Numerical Dataset: These contain numerical data points on which mathematical
operations can be performed, such as temperature, humidity, marks and so on.
2. Categorical Dataset: These include categories such as colour, gender, occupation, games,
sports and so on.
3. Web Dataset: These include datasets created by calling APIs using HTTP requests and
populating them with values for data analysis. They are mostly stored in JSON (JavaScript
Object Notation) format.
4. Time Series Dataset: These include data collected over a period of time, for example,
changes in geographical terrain over time.
5. Image Dataset: A dataset consisting of images. This is mostly used in tasks such as
differentiating types of diseases, heart conditions and so on.
6. Ordered Dataset: These datasets contain data ordered in ranks, for example,
customer reviews, movie ratings and so on.
7. Partitioned Dataset: These datasets have data points segregated into different members
or different partitions.
8. File-Based Datasets: These datasets are stored in files, for example as .csv or .xlsx files.
9. Bivariate Dataset: In this dataset, exactly two features are directly correlated with each
other. For example, height and weight in a dataset are directly related to each other.
10. Multivariate Dataset: In these datasets, as the name suggests, more than two features
are correlated with each other. For example, attendance and assignment grades are
directly correlated with a student's overall grade.

Note : Typically, 70% of the data in a dataset is used for training and the remaining 30%
for testing the model. For the 150-row iris dataset used later, this gives 105 training
rows and 45 test rows.
Python libraries

Numpy:
 NumPy is a Python library.
 NumPy is used for working with arrays.
 NumPy is short for "Numerical Python"
Eg :
import numpy as np # numpy is the library and "as" introduces the alias name np
arr = np.array([1, 2, 3, 4, 5]) # array() creates an array of elements of the same type
print(arr)

Output :
[1 2 3 4 5]

Pandas:
 Pandas is a Python library used for working with data sets.
 It has functions for analyzing, cleaning, exploring, and manipulating data.
 Pandas allows us to analyze big data and draw conclusions based on
statistical theories.
 Pandas can clean messy data sets, and make them readable and relevant.
The main functionality of pandas includes (see the sketch below):

a) Reading CSV (Comma Separated Values) files
b) Checking the correlation between two or more columns
c) Computing the average value
d) Computing the maximum value
e) Computing the minimum value
f) Computing the standard deviation
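
Here is a minimal sketch of these functions on a small, made-up DataFrame (the column
names and marks are hypothetical):

Eg :
import pandas as pd
df = pd.DataFrame({
    "Maths": [78, 85, 92, 66],
    "Science": [74, 88, 95, 60]
})
print(df.corr())           # correlation between the two columns
print(df["Maths"].mean())  # average value
print(df["Maths"].max())   # maximum value
print(df["Maths"].min())   # minimum value
print(df["Maths"].std())   # standard deviation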

Let us consider a CSV file

Eg :

import pandas as pd

a = pd.read_csv('eg.csv')

print(a.to_string())

# to_string() to print the entire DataFrame.

Output :
Matplotlib:

 Matplotlib is a low-level graph plotting library in Python that serves as a visualization
utility.
 Most of the Matplotlib utilities lie under the pyplot submodule, and they are usually
imported under the plt alias.
Eg :
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([0, 6])
ypoints = np.array([0, 250])
plt.plot(xpoints, ypoints)
plt.show()
Output : (a straight-line plot connecting the points (0, 0) and (6, 250))
Installing Jupyter Notebook
The Jupyter Notebook is the original web application for creating and sharing
computational documents. It offers a simple, streamlined, document-centric experience.
The steps below install JupyterLab, the newer interface, which also runs notebooks.

Procedure to install JupyterLab

Install JupyterLab with pip:

pip install jupyterlab

Run JupyterLab using the command below at the command prompt:

jupyter lab
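
If you prefer the classic Notebook interface instead of JupyterLab, it can be installed
and started the same way:

pip install notebook

jupyter notebook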
Exercise :
Load the iris dataset and compute classification accuracy using the KNN algorithm in JupyterLab

Code snippet : (in JupyterLab)

#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

#Read the dataset


iris=pd.read_csv('iris.csv')
iris.head() # head() displays the first five rows of the dataset

Output :

iris=pd.read_csv('iris.csv')
iris.tail() # tail() displays the last five rows of the dataset

Output :
#To get the count of each species
iris['Species'].value_counts() # value_counts() counts the occurrences of each value in the
column
Output :

#To get the names of all columns


iris.columns
Output :

#To get all the values in the dataset as an array


iris.values
Output :

#To get summary information about the dataset


iris.info()
Output :
#To get a statistical description of the dataset
iris.describe()
Output :

#To get a statistical description of all columns, including non-numeric ones
iris.describe(include='all')

Output :

#Let us take SepalLengthCm, SepalWidthCm, PetalLengthCm and PetalWidthCm as X, and
#Species as Y
X=iris.iloc[:,1:5]# : selects all rows; 1:5 selects the four measurement columns,
# skipping the Id column that the Kaggle iris.csv carries in position 0
X.head()#head() displays the first five rows
Output :
#iloc means index location; the syntax is iloc[rows,columns], where : selects all
#Taking Species as Y
Y=iris.iloc[:,-1]# -1 selects the last column
Y.head()
Output :
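
As a quick illustration of iloc slicing on a small, made-up DataFrame (not part of the
exercise):

Eg :
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
print(df.iloc[0, 0])   # single value: row 0, column 0 -> 1
print(df.iloc[:, :2])  # all rows, first two columns (A and B)
print(df.iloc[:, -1])  # all rows, last column (C)
print(df.iloc[0:2, :]) # first two rows, all columns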

# Train_test_split
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.3,random_state=1)
Y_test.shape # with 150 rows and test_size=0.3, this is (45,)

#Now make a model for Training and Predicting

knnmodel=KNeighborsClassifier(n_neighbors=3)
knnmodel.fit(X_train,Y_train)
Y_predict1=knnmodel.predict(X_test)# predict the labels for X_test

#Accuracy
from sklearn.metrics import accuracy_score
acc=accuracy_score(Y_test,Y_predict1)
acc
Output :
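
accuracy_score() simply returns the fraction of test samples whose predicted label matches
the true label; here is a sketch of the equivalent computation by hand (using the
variables above):

import numpy as np
manual_acc = np.mean(Y_predict1 == Y_test.values) # correct predictions / total predictions
print(manual_acc) # same value as accuracy_score(Y_test, Y_predict1)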
Experiment 3 : Implement k-nearest neighbors classification using Python
Aim : To implement the KNN algorithm using Python
Software environment used: Python 3.12
Procedure to work with the KNN algorithm
 In this experiment, the scikit-learn module is used
About scikit-learn module:
 Scikit-learn is an open-source Python library that implements a range of machine
learning, pre-processing, cross-validation, and visualization algorithms using a
unified interface (see the sketch below).
 It provides a plethora of tools for various machine-learning tasks such as
Classification, Regression, Clustering, and many more.
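
A minimal sketch of that unified interface: two different classifiers are trained and
scored through the same fit()/score() methods (this is only an illustration, not part of
the experiment):

Eg :
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# every scikit-learn estimator exposes the same fit / predict / score methods
for model in (KNeighborsClassifier(n_neighbors=3), DecisionTreeClassifier()):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))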
Installation of scikit-learn module
pip install scikit-learn
In this code, we are going to use the iris dataset, split into a training set (70%) and a
test set (30%).

The iris dataset contains the following features

---> sepal length (cm)
---> sepal width (cm)
---> petal length (cm)
---> petal width (cm)

A sample data point in the iris dataset has the format [5.4 3.4 1.7 0.2]

Where 5.4 ---> sepal length (cm)
3.4 ---> sepal width (cm)
1.7 ---> petal length (cm)
0.2 ---> petal width (cm)
Code snippet :

# Import necessary modules


from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import random # used to generate random numbers

# Loading data
data_iris = load_iris()

# To get list of target names


label_target = data_iris.target_names
print()
print("Sample Data from Iris Dataset")
print("*"*30)

# to display the sample data from the iris dataset


for i in range(10):
    rn = random.randint(0,120) # pick a random row index
    print(data_iris.data[rn],"=>",label_target[data_iris.target[rn]])

# Create feature and target arrays


X = data_iris.data
y = data_iris.target

# Split into training and test set


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=1)
print("The Training dataset length: ",len(X_train))
print("The Testing dataset length: ",len(X_test))
try:
    nn = int(input("Enter number of neighbors :"))
    knn = KNeighborsClassifier(n_neighbors=nn)
    knn.fit(X_train, y_train)

    # to display the score
    print("The Score is :", knn.score(X_test, y_test))

    # To get test data from the user
    test_data = input("Enter Test Data :").split(",")
    for i in range(len(test_data)):
        test_data[i] = float(test_data[i])

    print()
    v = knn.predict([test_data]) # predict the class of the user-supplied sample
    print("Predicted output is :", label_target[v])
except:
    print("Please supply valid input......")

# except is a keyword used in control-flow statements to handle exceptions that may arise
during program execution. It is used in a try-except block to catch specific types of
exceptions and run alternative statements if an exception occurs.
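
For example, here is a sketch of a more targeted version that catches only the ValueError
raised by invalid numeric input (illustrative, not part of the experiment code):

try:
    nn = int(input("Enter number of neighbors :"))
except ValueError:
    print("Please enter a whole number, e.g. 3")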

Output (of the experiment) :
