

APTECHSOFT
Task 1

Submitted By: Syeda Fatima Sajid
Submitted To: Mentor Summayyea Salahuddin
Date: 29 July, 2024


Notebook 1:
Overview of the Iris Dataset
Dataset:
In machine learning and AI, a dataset is a collection of data used to train and test algorithms and models,
playing a crucial role in their development. Datasets can be structured, like those in spreadsheets or
databases, or unstructured, such as text or images, which require additional processing. These datasets can
be public, proprietary, or generated for specific training and testing purposes, enabling researchers and
developers to evaluate and enhance AI systems effectively.
Iris Dataset:
The Iris dataset is a classic dataset in the field of machine learning and statistics. It was first used by Sir R.A. Fisher, and the data are taken from Fisher's paper. It contains measurements of iris flowers from three different species: Iris setosa, Iris versicolor, and Iris virginica, arranged in 150 rows and 5 columns.

Import the scikit-learn library:


To access the Iris dataset, first import the scikit-learn library. The Iris dataset is included in the datasets module of scikit-learn. It can be loaded using datasets.load_iris(), which returns a dictionary-like object with keys such as 'data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', and 'data_module'. The dataset is provided by the sklearn.datasets.data module, and its filename is accessible via the 'filename' key.
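
A minimal sketch of these steps (assuming scikit-learn is installed):

from sklearn import datasets

data = datasets.load_iris()   # returns a dictionary-like Bunch object
print(data.keys())            # dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
print(data["filename"])       # path to the CSV file bundled with scikit-learn
print(data["DESCR"])          # full description of the dataset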
Dataset Structure:
A description of the Iris dataset can be obtained using print(data["DESCR"]).
The Iris dataset contains 150 samples, each representing a single iris flower, with 50 instances in each class. The target values (species) are encoded as integers:
 0: Iris setosa
 1: Iris versicolor
 2: Iris virginica
Each feature and the corresponding target value help in classifying the type of iris flower.


Features
The dataset includes the following features:
 Sepal Length (cm)
 Sepal Width (cm)
 Petal Length (cm)
 Petal Width (cm)
Each flower in the dataset is described by these four features.

Summary:
In this notebook, I learned about the following:
 The Iris dataset, its classes, and the number of instances it contains.
 How to import the scikit-learn datasets module to access various built-in datasets for machine
learning.
 How to import the Iris dataset from the scikit-learn library.
 How to print the keys of the data dictionary.
 How to print the filename of the Iris dataset.
 How to print a description of the Iris dataset.

Notebook 2
Sections


This analysis consists of three main sections:


1. Importing Required Libraries
2. Loading the Dataset and Exploring Features and Targets
3. Creating Pandas DataFrame and Visualizing Data
Data Visualization:
Data visualization is the graphical representation of information and data using visual elements like
charts, graphs, and maps. It helps to see and understand trends, outliers, and patterns in data.
Importance: Data visualization makes data accessible and understandable, and supports data-driven decisions. It is valuable across many fields, enhancing communication and understanding of data among non-technical audiences.
Data Visualization and Big Data: In the era of Big Data, visualization is crucial for analyzing large
amounts of information. Effective data visualization balances form and function, telling a story by
highlighting useful information and removing noise.
Visualization of the Iris Dataset
To visualize the Iris dataset, we first import the necessary libraries:
 Scikit-learn: Provides a wide range of machine learning algorithms and tools for evaluating and
tuning models.
 Pandas: Powerful for data analysis, particularly useful for working with tabular data.
 Matplotlib: Comprehensive library for plotting graphs and visualizing data.
 Seaborn: Builds on top of Matplotlib, providing a more concise and expressive way to create
statistical plots
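
A minimal sketch of the imports used throughout this notebook:

from sklearn import datasets       # built-in datasets and ML tools
import pandas as pd                # tabular data analysis
import matplotlib.pyplot as plt    # plotting
import seaborn as sns              # statistical plots built on Matplotlib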

Load the Dataset, check features, and targets


The Iris dataset from scikit-learn was loaded and inspected, revealing four features:
1. sepal length (cm)
2. sepal width (cm)
3. petal length (cm)
4. petal width (cm)
The target values represent three iris species:
 Setosa
 Versicolor
 Virginica
The target array consists of 150 entries. By examining the dataset's features, target values,
and the shape of the target array, one gains an understanding of the data structure. This
understanding allows for the creation of visualizations, such as scatter plots, to show the
relationships between different features and how they distinguish the three iris species.
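
A sketch of this loading and inspection step, assuming the imports above:

data = datasets.load_iris()
print(data["feature_names"])   # the four feature names listed above
print(data["target_names"])    # ['setosa' 'versicolor' 'virginica']
print(data["target"].shape)    # (150,)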


Creating Pandas DataFrame and Visualizing Data


DATAFRAME:
A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data
structure with labeled axes (rows and columns). It’s a concept used in several programming
environments and libraries, most notably in Python’s pandas library.

Creating a DataFrame:
df = pd.DataFrame(data["data"], columns=data["feature_names"])
Explanation:
pd.DataFrame(...): This function from the pandas library creates a DataFrame, a 2-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled rows and columns.
data["data"]: The actual data of the Iris dataset, a 2D array with 150 rows and 4 columns containing the feature values for each sample.
columns=data["feature_names"]: Specifies the column names for the DataFrame; data["feature_names"] is a list containing the names of the features.

Summary of the DataFrame:


df.info()
Explanation:
The df.info() method in pandas provides a concise summary of a DataFrame, including the index range,
column names, non-null counts, data types, and memory usage. This summary helps understand the
DataFrame’s structure and data completeness at a glance. This information is essential for understanding
the dataset before performing further analysis.
Output: the summary shows a RangeIndex of 150 entries and four float64 columns, each with 150 non-null values.

Generating Summary Statistics:


df.describe()
The describe() method returns a description of the data in the DataFrame. If the DataFrame contains numerical data, the description contains the following information for each column:
 count – the number of non-empty values
 mean – the average (mean) value
 std – the standard deviation
 min – the minimum value
 25% – the 25th percentile*
 50% – the 50th percentile*
 75% – the 75th percentile*
 max – the maximum value
*Percentile meaning: how many of the values are less than the given percentile.
Output: a table of these statistics for each of the four feature columns.

Visualizing Attributes Using Matplotlib:

The figure shows a 2x2 grid of line graphs representing the four features of the Iris dataset over 150 samples. Each subplot displays the variation of a specific feature: sepal length and sepal width fluctuate between 4.5-8 cm and 2.0-4.5 cm respectively, while petal length and petal width increase significantly around the 50th sample, highlighting patterns and differences among the samples.
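
A minimal sketch that could produce such a grid, assuming the DataFrame df created above:

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, column in zip(axes.flat, df.columns[:4]):
    ax.plot(df[column])        # one line per feature, indexed by sample number
    ax.set_title(column)
    ax.set_xlabel("sample")
plt.tight_layout()
plt.show()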
Histograms


A histogram is a chart that plots the distribution of a numeric variable’s values as a series of bars. Each
bar typically covers a range of numeric values called a bin or class; a bar’s height indicates the frequency
of data points with a value within the corresponding bin.

Observations from the histograms:


 Sepal length frequency is highest between 5.5 and 6.
 Sepal width frequency is highest around 3.0 and 3.5.
 Petal length frequency is highest between 1 and 2.
 Petal width frequency is highest between 0.0 and 0.5.
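
A sketch that could produce the 2x2 grid of histograms behind these observations:

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, column in zip(axes.flat, df.columns[:4]):
    ax.hist(df[column], bins=20)   # distribution of each feature
    ax.set_title(column)
plt.tight_layout()
plt.show()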

Add Target Column and Count Unique Values

df["target"] = data["target"]
df["target"].value_counts()
The code adds a "target" column to the DataFrame with species labels from the Iris dataset and then
counts the occurrences of each species.
Output: each of the three species (0, 1, 2) appears 50 times.

Visualizing the target column:


The bar chart shows an equal count of samples for each Iris species: Iris-setosa, Iris-versicolor, and Iris-
virginica, with each species having 50 samples.
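
A minimal sketch for this bar chart:

df["target"].value_counts().plot(kind="bar")
plt.xlabel("species (0 = setosa, 1 = versicolor, 2 = virginica)")
plt.ylabel("count")
plt.show()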
Relation between variables:

Observations:
 Setosa has smaller sepal lengths but larger sepal widths.
 Versicolor lies in the middle of the other two species in terms of sepal length and width.
 Virginica has larger sepal lengths but smaller sepal widths.
 Setosa has smaller petal lengths and widths.
 Versicolor lies in the middle of the other two species in terms of petal length and width.
 Virginica has the largest petal lengths and widths.
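
A sketch of the scatter plots these observations are drawn from (column names as in the DataFrame above):

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
for label, name in enumerate(data["target_names"]):
    subset = df[df["target"] == label]
    ax1.scatter(subset["sepal length (cm)"], subset["sepal width (cm)"], label=name)
    ax2.scatter(subset["petal length (cm)"], subset["petal width (cm)"], label=name)
ax1.set_xlabel("sepal length (cm)")
ax1.set_ylabel("sepal width (cm)")
ax2.set_xlabel("petal length (cm)")
ax2.set_ylabel("petal width (cm)")
ax1.legend()
plt.show()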

Summary of DataFrame Structure and Integrity:


The DataFrame consists of five columns: 'sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal
width (cm)', and 'target', all of which are numerical with no categorical columns present. The data types of
the columns are predominantly float64, except for the 'target' column which is int32. There are no missing


values in any of the columns. The first and last five rows of the DataFrame can be displayed using
df.head() and df.tail() respectively.
Handling Correlation:
The Pearson correlation method is used to calculate the correlation matrix of the DataFrame, revealing the
pairwise correlation between numeric attributes. The correlation values with respect to the target column
are extracted, and the top four correlations, both positive and negative, are identified. The features 'petal
width (cm)' and 'petal length (cm)' show the highest positive correlations with the target, at 0.9565 and
0.9490 respectively, indicating that these features are the most significant for predicting the class label.
Additionally, 'sepal length (cm)' has a positive correlation of 0.7826, while 'sepal width (cm)' has a
negative correlation of -0.4267 with the target.
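
A sketch of this correlation analysis; the commented values are those reported above:

corr = df.corr(method="pearson")              # pairwise Pearson correlation matrix
print(corr["target"].sort_values(ascending=False))
# target               1.0000
# petal width (cm)     0.9565
# petal length (cm)    0.9490
# sepal length (cm)    0.7826
# sepal width (cm)    -0.4267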

Notebook 3
Machine Learning with Iris Dataset
Machine learning (ML) is a field of study in artificial intelligence concerned with the development and
study of statistical algorithms that can learn from data and generalize to unseen data and thus perform
tasks without explicit instruction.
Steps:
To apply ML to the Iris dataset, follow these steps:


1. Import required libraries


2. Load the Dataset, check features, and targets.
3. Make Pandas DataFrame and visualise it using Matplotlib
4. Create a Training/Testing Dataset and Peek into it.
5. Scale features between the range [0,1] and Peek into it
6. Apply Machine Learning Algorithms on Scaled Dataset
7. Optimize the Results
8. Run Predictions on Optimized Model

Create a Training/Testing Dataset and Peek into it:


Prepare the dataset for training and testing by first creating X, a DataFrame containing the four features (obtained by removing the "target" column from the original DataFrame), and y, a Series representing the target class or label. Then split the data into training and testing sets using train_test_split, with 60% of the data for training and 40% for testing, ensuring reproducibility with a random seed of 42. Verify that the training set (y_train) contains 90 samples and the testing set (y_test) contains 60 samples. Finally, print the feature values of the training set (X_train) to peek into the training data.
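
A sketch of this split, assuming the DataFrame df built earlier:

from sklearn.model_selection import train_test_split

X = df.drop(columns=["target"])   # the four feature columns
y = df["target"]                  # species labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)
print(y_train.shape, y_test.shape)   # (90,) (60,)
print(X_train.head())                # peek into the training data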

Scale features between the range [0,1] and Peek into it

Feature Scaling:
“Feature scaling is a technique used to standardize the range of independent variables or features
in a dataset. It is also called data normalization and is usually done as part of data
preprocessing.”
We use the MinMaxScaler to scale the feature values to a range between 0 and 1, improving the
performance of many machine learning algorithms. First, we create an instance of the MinMaxScaler.
Then, we fit the scaler on the training data (X_train) and transform it to the specified range. Next, we use
the same scaling parameters to transform the testing data (X_test), ensuring consistent scaling.
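
A minimal sketch of this scaling step:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                         # scales each feature to [0, 1]
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same scaling parameters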


Apply Machine Learning Algorithms on Scaled Dataset


1. Logistic Regression:
“Logistic regression is a supervised machine learning algorithm that accomplishes binary classification
tasks by predicting the probability of an outcome, event, or observation. The model delivers a binary or
dichotomous outcome limited to two possible outcomes: yes/no, 0/1, or true/false.”
The logistic regression model is applied to the Iris dataset as follows:
1. Create Model Instance: An instance of the LogisticRegression class is created with the
multi_class parameter set to "multinomial" to handle multi-class classification problems.
2. Train Model: The model is trained using the training data (X_train and y_train), allowing it to
learn the relationship between features and target labels.
3. Make Predictions: The trained model predicts labels for the test data (X_test), generating
predictions for evaluation.
4. Calculate Accuracy: The accuracy of the model is calculated by comparing predicted labels
(pred) with actual labels (y_test), resulting in an accuracy score of 91.67%.
5. Confusion Matrix: A confusion matrix is generated to show counts of true positive, true
negative, false positive, and false negative predictions for each class, providing a detailed
performance breakdown.
6. Classification Report: A classification report is printed, including precision, recall, F1-score, and
support for each class, offering a comprehensive assessment of model performance.
Overall, the model achieves high accuracy and performs well across the three classes in the Iris dataset,
with detailed metrics showing excellent performance for Class 0, reasonable performance for Class 1, and
slightly less accuracy for Class 2.
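
A sketch of these six steps; the names X_train_scaled and X_test_scaled follow the scaling step above:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

model = LogisticRegression(multi_class="multinomial")  # multi-class setting, as described above
model.fit(X_train_scaled, y_train)                     # train on the scaled training data
pred = model.predict(X_test_scaled)                    # predict labels for the test data
print(accuracy_score(y_test, pred))                    # 91.67% reported above
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))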


2. Support Vector Machine (SVM)


To evaluate a Support Vector Machine (SVM) for classifying the Iris dataset, we start by creating a Linear
Support Vector Classifier (LinearSVC) to find the best boundary separating different classes. We train this
model with our training data and then use it to predict labels for the test data. We measure the model's
performance by calculating and printing its accuracy, confusion matrix, and classification report. Initially,
the model achieved 90% accuracy, which improved to 98.33% after fine-tuning with Grid Search. The
SVM is particularly useful for complex datasets with many features.
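
A minimal sketch of this evaluation, reusing the metrics imports above:

from sklearn.svm import LinearSVC

svm = LinearSVC()                         # linear support vector classifier
svm.fit(X_train_scaled, y_train)
pred = svm.predict(X_test_scaled)
print(accuracy_score(y_test, pred))       # about 90% before tuning
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))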
3. Random Forest:
“Random forest is a machine learning algorithm that creates an ensemble of multiple decision trees to
reach a singular, more accurate prediction or result.”

 Creating the model using scikit-learn's RandomForestClassifier.


 Training it with data, where X_train holds features and y_train holds labels, to understand the link
between features and the target labels.
 Using the trained model to predict labels for new data (X_test).
 Evaluating the model's performance through accuracy (percentage of correct predictions), a
confusion matrix (detailing true/false positives/negatives), and a classification report (providing
precision, recall, and F1-score for each class).
NOTE:
 Achieved the highest accuracy of approximately 98.33%.
 An ensemble learning method using multiple decision trees to improve classification
performance.
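
A sketch of the steps listed above:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()             # an ensemble of decision trees
rf.fit(X_train_scaled, y_train)
pred = rf.predict(X_test_scaled)
print(accuracy_score(y_test, pred))       # approximately 98.33% reported above
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))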
Results Optimization
Optimization was performed using Grid Search on the Linear SVC model, improving its accuracy from
90% to 98.33%.
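
A hedged sketch of this tuning step; the parameter grid below is an assumption, since the notebook does not list the exact values searched:

from sklearn.model_selection import GridSearchCV

param_grid = {"C": [0.01, 0.1, 1, 10, 100]}        # hypothetical candidate values
grid = GridSearchCV(LinearSVC(), param_grid, cv=5)  # 5-fold cross-validated search
grid.fit(X_train_scaled, y_train)
print(grid.best_params_)
print(grid.score(X_test_scaled, y_test))            # improved to 98.33% as reported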
Summary
Among the three machine learning techniques applied, Random Forest outperformed the others with the
highest accuracy, making it the most effective model for this dataset.
References:
https://en.wikipedia.org/wiki/Machine_learning
https://www.mygreatlearning.com/blog/what-is-machine-learning/
https://www.atoti.io/articles/when-to-perform-a-feature-scaling/
https://www.spiceworks.com/tech/artificial-intelligence/articles/what-is-logistic-regression/
https://www.spiceworks.com/tech/big-data/articles/what-is-support-vector-machine/
https://www.tableau.com/learn/articles/data-visualization
https://www.javatpoint.com/python-pandas-dataframe
https://www.w3schools.com/python/pandas/ref_df_describe.asp
https://www.atlassian.com/data/charts/histogram-complete-guide
https://www.geeksforgeeks.org/exploratory-data-analysis-on-iris-dataset/
https://encord.com/glossary/datasets-definition/

DEEP LEARNING
 History of AI
o Early attempts at AI aimed to achieve human-level intelligence but were limited by
computational resources.
o Neural networks, inspired by the human brain, started in the 1950s but were initially outperformed by conventional computers based on the Von Neumann architecture.
 Expert Systems
o These were complex systems built by many engineers, with rules programmed by
humans.
o They had limitations, as the computer could only do as much as a human could program.

o Deep learning uses large datasets to learn patterns on its own, similar to how children
learn.
 The Deep Learning Revolution


o Two major factors: availability of data and increased computing power (GPUs).

o Deep learning flips traditional programming by letting the model learn the rules instead
of being explicitly programmed.
 When to Choose Deep Learning
o Use traditional programming for clear and straightforward tasks.

o Use deep learning for complex tasks where rules are hard to define.

Applications of Deep Learning


 Computer Vision
o Object detection, self-driving cars, robotics, and manufacturing.

 Natural Language Processing


o Real-time translation, voice recognition, virtual assistants.

 Recommender Systems
o Content curation, targeted advertising, shopping recommendations.

 Reinforcement Learning
o AI beating human experts in games like Go and video games.

Review Questions:
1. What Python libraries are used in this notebook for data analysis and visualization?
2. How do you display the first five rows of a DataFrame?
3. What is the shape of the target array in the Iris dataset?
4. Write the code to create a Pandas DataFrame from the Iris dataset with feature names as column names.
5. What is the Pearson correlation coefficient, and how is it useful in this analysis?
6. How can you generate summary statistics for each numeric column in a DataFrame?
7. Write a code snippet to display a 2x2 grid of histograms for each of the four attributes of the Iris dataset.
8. Explain the significance of the correlation values between the features and the target in the Iris dataset.
9. How can you create a pairplot of all columns in the DataFrame, distinguishing different target classes using Seaborn?
10. Based on the scatterplots of Sepal Length vs. Sepal Width and Petal Length vs. Petal Width, what can you infer about the relationship between these features and the different species of the Iris dataset?
