Exercise and Experiment 3
Definition of Dataset:
A dataset is a collection of related data that machine learning/AI developers use to
train a model.
In a tabular dataset, each row represents one data point and each column represents one
feature of the dataset.
Datasets are used in fields such as machine learning, business, and government to gain
insights, make informed decisions, or train algorithms.
Datasets vary in size and complexity, and most require cleaning and preprocessing to
ensure data quality and suitability for analysis or modeling.
Datasets can be stored in multiple formats. The most common are CSV, Excel, and JSON
(JavaScript Object Notation); large datasets such as image datasets are often packed into zip files.
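The row and column picture can be made concrete with a minimal sketch that parses a small CSV text using Python's built-in csv module. The column names and values here are made up purely for illustration.

```python
import csv
import io

# Illustrative CSV text (made-up values): one header row, then one row per data point.
csv_text = """temperature,humidity,label
30.5,70,hot
18.2,85,cold
25.0,60,mild
"""

reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)            # each row is one data point
columns = reader.fieldnames    # each column is one feature (or the label)

n_rows = len(rows)
n_columns = len(columns)
print("data points (rows):", n_rows)
print("columns:", columns)
```

Note that the csv module reads every value as a string, which is one reason datasets usually need a preprocessing step before modeling.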
Types of datasets:
1. Numerical Dataset: Contains numerical data points that can be measured and used in
calculations, such as temperature, humidity, and marks.
2. Categorical Dataset: Contains values that fall into categories, such as colour, gender,
occupation, games, and sports.
3. Web Dataset: Created by calling APIs using HTTP requests and collecting the returned
values for data analysis. These are mostly stored in JSON (JavaScript Object Notation) format.
4. Time Series Dataset: Contains observations recorded over a period of time, for example,
changes in geographical terrain over time.
5. Image Dataset: Consists of images. It is often used for tasks such as distinguishing
types of diseases or heart conditions.
6. Ordered Dataset: Contains data that are ordered in ranks, for example, customer reviews
and movie ratings.
7. Partitioned Dataset: Has data points segregated into different members or different
partitions.
8. File-Based Dataset: Stored in files, for example .csv files or Excel .xlsx files.
9. Bivariate Dataset: Contains two variables (features) whose relationship to each other is
studied. For example, height and weight in a dataset are related to each other.
10. Multivariate Dataset: As the name suggests, contains three or more variables whose
relationships are studied together. For example, attendance and assignment grades are
related to a student's overall grade.
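As a rough illustration of the bivariate case, the sketch below computes the Pearson correlation coefficient between height and weight. The numbers are invented for this example, and the helper name pearson is hypothetical; a value near 1 indicates a strong positive linear relationship.

```python
import math

# Made-up heights (cm) and weights (kg), for illustration only.
heights = [150, 160, 170, 180, 190]
weights = [50, 58, 66, 77, 85]

def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r = pearson(heights, weights)
print(round(r, 3))
```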
Note: A common convention is to use 70% of the data in the dataset for training and the
remaining 30% for testing the model.
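A 70/30 split can be sketched in plain Python as follows. The function name split_70_30 and the seed value are assumptions made for this example; in practice a library routine such as scikit-learn's train_test_split is normally used.

```python
import random

def split_70_30(items, seed=1):
    """Shuffle a copy of items and return (train, test) with a 70/30 split."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split repeatable
    cut = len(shuffled) * 7 // 10            # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]

data_points = list(range(10))   # stand-in for ten data points
train, test = split_70_30(data_points)
print(len(train), len(test))
```

Shuffling before cutting matters because many files are sorted by class, and an unshuffled split could leave a whole class out of the training set.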
Python libraries
NumPy:
NumPy is a Python library.
NumPy is used for working with arrays.
NumPy is short for "Numerical Python"
Eg :
import numpy as np  # numpy is the library; "as np" gives it the alias np
arr = np.array([1, 2, 3, 4, 5])  # array() creates an array whose elements all have the same type
print(arr)
Output :
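Beyond creating an array, a short sketch of what "working with arrays" typically looks like: inspecting the type and shape, and applying arithmetic to every element at once. The marks values are illustrative only.

```python
import numpy as np

marks = np.array([72, 85, 90, 64, 78])   # illustrative values

print(marks.dtype)      # an integer dtype (the exact name depends on the platform)
print(marks.shape)      # one dimension with five elements
scaled = marks / 100    # element-wise division, no explicit loop needed
print(scaled)
average = marks.mean()  # arithmetic mean of all elements
print(average)
```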
Pandas:
Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.
Pandas allows us to analyze large data sets and draw conclusions based on
statistical theory.
Pandas can clean messy data sets, and make them readable and relevant.
The main functionality of pandas is
Eg :
import pandas as pd
a = pd.read_csv('eg.csv')  # read the CSV file into a DataFrame
print(a.to_string())
Output :
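The cleaning role of pandas can be sketched without any external file by building a small DataFrame in memory. The names and marks below are made up; the example removes a duplicate row and a row with a missing value.

```python
import pandas as pd

# Made-up records with one duplicate row and one missing value.
raw = pd.DataFrame({
    "name": ["Asha", "Ben", "Ben", "Chitra"],
    "marks": [82.0, 75.0, 75.0, None],
})

# drop_duplicates() removes repeated rows; dropna() removes rows with missing values.
clean = raw.drop_duplicates().dropna().reset_index(drop=True)
print(clean)
mean_marks = clean["marks"].mean()
print(mean_marks)
```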
Matplotlib:
Matplotlib is a low-level graph plotting library in Python that serves as a visualization
utility.
Most of the Matplotlib utilities lie under the pyplot submodule, which is usually imported
under the plt alias.
Eg :
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([0, 6])
ypoints = np.array([0, 250])
plt.plot(xpoints, ypoints)
plt.show()
Output :
Installing Jupyter Notebook
The Jupyter Notebook is the original web application for creating and sharing
computational documents. It offers a simple, streamlined, document-centric experience.
Procedure to install and launch JupyterLab
pip install jupyterlab
jupyter lab
Exercise :
Load the iris dataset and calculate the accuracy of a KNN classifier on it in JupyterLab
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
Output :
iris=pd.read_csv('iris.csv')
iris.tail()  # tail() returns the last five rows of the dataset
Output :
# To get the number of instances of each species
iris['Species'].value_counts()  # value_counts() counts how often each value occurs in the column
Output :
# To get a statistical summary of all columns in the dataset
iris.describe(include='all')
Output :
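# Separate the features (X) from the target labels (Y)
# Assumption: every column except 'Species' is a feature; also drop any ID column if your CSV has one
X = iris.drop(columns=['Species'])
Y = iris['Species']
# Train_test_split: 70% for training, 30% for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
Y_test.shape
knnmodel = KNeighborsClassifier(n_neighbors=3)
knnmodel.fit(X_train, Y_train)
Y_predict1 = knnmodel.predict(X_test)  # predict the labels of the test set
# Accuracy
from sklearn.metrics import accuracy_score
acc = accuracy_score(Y_test, Y_predict1)
acc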
Output :
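To show what a KNN classifier does conceptually, here is a minimal pure-Python sketch: it measures the Euclidean distance from a query point to every training point, keeps the k closest, and returns the majority label. The function name knn_predict and the tiny two-class dataset are hypothetical and exist only for illustration; this is not how scikit-learn implements KNeighborsClassifier internally.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    distances = [
        (math.dist(point, query), label)   # Euclidean distance to each training point
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Tiny made-up 2-D dataset, for illustration only.
points = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (1.1, 1.0)))
print(knn_predict(points, labels, (5.0, 5.0)))
```

An odd k is a common choice for two-class problems because it avoids tied votes.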
Experiment 3 : Implement k-nearest neighbors classification using Python
Aim : To implement the KNN algorithm using Python
Software environment used: Python 3.12
Code snippet:
Procedure to work with the KNN algorithm
This experiment uses the scikit-learn module.
About scikit-learn module:
Scikit-learn is an open-source Python machine-learning library that implements a range of
machine learning, pre-processing, cross-validation, and visualization algorithms using a
unified interface.
It provides a plethora of tools for various machine-learning tasks such as classification,
regression, clustering, and many more.
Installation of scikit-learn module
pip install scikit-learn
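In this code, we are going to use the iris dataset. The dataset is split into a training
set (70%) and a test set (30%).
A sample data point in the iris dataset has the form [5.4 3.4 1.7 0.2], that is, sepal
length, sepal width, petal length, and petal width in centimetres.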
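# Loading data
from sklearn.datasets import load_iris
data_iris = load_iris()
print()
# ... the steps that split the data, train the classifier knn, and read test_data are not shown here ...
# label_target is assumed to hold the species names, for example data_iris.target_names
try:
    v = knn.predict([test_data])
    print("Predicted output is :", label_target[v])
except Exception:
    print("Please supply valid input......")
# except is a keyword used in a try-except block to handle exceptions that may arise during
# program execution. It can catch specific types of exceptions, and the statements under it
# run only if an exception occurs in the try block.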
Output :
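Because the listing above omits several steps, the following is one possible end-to-end sketch rather than the original program. The values n_neighbors=3 and random_state=1 are borrowed from the earlier exercise as assumptions, the helper predict_species is hypothetical, and the sample point from this section is passed in as a string instead of being read from the keyboard.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the built-in iris data (150 samples, 4 features, 3 species).
data_iris = load_iris()
label_target = data_iris.target_names

# 70% training, 30% testing; random_state=1 is an assumed seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    data_iris.data, data_iris.target, test_size=0.3, random_state=1
)

knn = KNeighborsClassifier(n_neighbors=3)  # k=3 is an assumption
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)       # fraction of test samples classified correctly
print("Test accuracy:", accuracy)

def predict_species(text):
    """Parse four whitespace-separated numbers and return the predicted species name."""
    try:
        test_data = [float(value) for value in text.split()]
        v = knn.predict([test_data])
        return str(label_target[v][0])
    except ValueError:
        # Raised for non-numeric text and for a wrong number of features.
        return "Please supply valid input......"

print("Predicted output is :", predict_species("5.4 3.4 1.7 0.2"))
print(predict_species("not numbers"))
```

Catching ValueError specifically, rather than every exception, keeps genuine programming errors visible while still handling malformed input.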