Task 1
APTECHSOFT
Notebook 1:
Overview of the Iris Dataset
Dataset:
In machine learning and AI, a dataset is a collection of data used to train and test algorithms and models,
playing a crucial role in their development. Datasets can be structured, like those in spreadsheets or
databases, or unstructured, such as text or images, which require additional processing. These datasets can
be public, proprietary, or generated for specific training and testing purposes, enabling researchers and
developers to evaluate and enhance AI systems effectively.
Iris Dataset:
The Iris dataset is a classic dataset in machine learning and statistics. It was first used by Sir R.A.
Fisher and is taken from his paper. It contains measurements of iris flowers from three species:
Iris setosa, Iris versicolor, and Iris virginica. It has 150 rows and 5 columns: four feature
measurements and the species label.
Features
The dataset includes the following features:
Sepal Length (cm)
Sepal Width (cm)
Petal Length (cm)
Petal Width (cm)
Each flower in the dataset is described by these four features.
Summary:
In this notebook, I learned about the following:
The Iris dataset, its classes, and the number of instances it contains.
How to import the scikit-learn datasets module to access various built-in datasets for machine
learning.
How to import the Iris dataset from the scikit-learn library.
How to print the keys of the data dictionary.
How to print the filename of the Iris dataset.
How to print a description of the Iris dataset.
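The steps listed above might look like the following minimal sketch. It assumes scikit-learn is installed; the variable name data is chosen to match the later notebooks, and the exact set of keys can vary between scikit-learn versions.

```python
from sklearn import datasets

# Load the built-in Iris dataset; the returned object behaves like a dictionary.
data = datasets.load_iris()

# Keys of the data dictionary (for example "data", "target", "feature_names").
print(list(data.keys()))

# Name of the file the data was loaded from.
print(data["filename"])

# Human-readable description of the dataset.
print(data["DESCR"])
```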
Notebook 2
Sections
Creating a DataFrame:
df = pd.DataFrame(data["data"], columns=data["feature_names"])
Explanation:
pd.DataFrame(…): This is the pandas constructor used to create a DataFrame, which is a 2-dimensional,
size-mutable, and potentially heterogeneous tabular data structure with labeled rows and columns.
data["data"]: This refers to the actual data of the Iris dataset. It is a 2D array with 150 rows and
4 columns that holds the feature values for each sample.
columns=data["feature_names"]: This specifies the column names for the DataFrame.
data["feature_names"] is a list containing the names of the features.
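For reference, a self-contained version of the DataFrame step could look like the sketch below. It assumes pandas and scikit-learn are installed and that data comes from load_iris(), as in Notebook 1.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()

# Rows are samples; columns are the four measured features.
df = pd.DataFrame(data["data"], columns=data["feature_names"])

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first five rows
```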
The image shows a 2x2 grid of line graphs representing the four features of the Iris dataset over 150
samples. Each subplot displays the variation of a specific feature: sepal length and sepal width fluctuate
between 4.5-8 cm and 2.0-4.5 cm respectively, while petal length and petal width increase sharply
around the 50th sample, highlighting patterns and differences among the samples.
Histograms
A histogram is a chart that plots the distribution of a numeric variable’s values as a series of bars. Each
bar typically covers a range of numeric values called a bin or class; a bar’s height indicates the frequency
of data points with a value within the corresponding bin.
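To make the idea of bins and frequencies concrete, here is a small sketch that computes the counts a histogram would draw. It uses NumPy rather than a plotting library, and the values array is made up for illustration; it is not taken from the notebook.

```python
import numpy as np

# Illustrative measurements only (hypothetical values).
values = np.array([4.9, 5.0, 5.1, 5.4, 5.8, 6.0, 6.3, 6.4, 7.1, 7.7])

# Split the range 4.0 to 8.0 into four equal-width bins and count the values in each.
counts, bin_edges = np.histogram(values, bins=4, range=(4.0, 8.0))

# Each printed line corresponds to one bar: its bin range and its height.
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {count}")
```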
df["target"] = data["target"]
df["target"].value_counts()
The code adds a "target" column to the DataFrame with species labels from the Iris dataset and then
counts the occurrences of each species.
Output:
The bar chart shows an equal count of samples for each Iris species: Iris-setosa, Iris-versicolor, and Iris-
virginica, with each species having 50 samples.
Relation between variables:
Observations:
Setosa has smaller sepal lengths but larger sepal widths.
Versicolor lies in the middle of the other two species in terms of sepal length and width.
Virginica has larger sepal lengths but smaller sepal widths.
Setosa has smaller petal lengths and widths.
Versicolor lies in the middle of the other two species in terms of petal length and width.
Virginica has the largest petal lengths and widths.
values in any of the columns. The first and last five rows of the DataFrame can be displayed using
df.head() and df.tail() respectively.
Handling Correlation:
The Pearson correlation method is used to calculate the correlation matrix of the DataFrame, revealing the
pairwise correlation between numeric attributes. The correlation values with respect to the target column
are extracted, and the top four correlations, both positive and negative, are identified. The features 'petal
width (cm)' and 'petal length (cm)' show the highest positive correlations with the target, at 0.9565 and
0.9490 respectively, indicating that these features are the most significant for predicting the class label.
Additionally, 'sepal length (cm)' has a positive correlation of 0.7826, while 'sepal width (cm)' has a
negative correlation of -0.4267 with the target.
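The correlation step described above could be reproduced roughly as follows. This is a sketch that assumes the DataFrame is built from scikit-learn's load_iris() with an added target column; the name target_corr is illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data["data"], columns=data["feature_names"])
df["target"] = data["target"]

# Pairwise Pearson correlation between all numeric columns.
corr_matrix = df.corr(method="pearson")

# Correlation of each feature with the target, strongest positive first.
target_corr = corr_matrix["target"].drop("target").sort_values(ascending=False)
print(target_corr)
```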
Notebook 3
Machine Learning with Iris Dataset
Machine learning (ML) is a field of study in artificial intelligence concerned with the development and
study of statistical algorithms that can learn from data and generalize to unseen data and thus perform
tasks without explicit instruction.
Steps:
To apply ML to the Iris dataset, follow these steps:
Feature Scaling:
“Feature scaling is a technique used to standardize the range of independent variables or features
in a dataset. It is also called data normalization and is usually done as part of data
preprocessing.”
We use the MinMaxScaler to scale the feature values to a range between 0 and 1, improving the
performance of many machine learning algorithms. First, we create an instance of the MinMaxScaler.
Then, we fit the scaler on the training data (X_train) and transform it to the specified range. Next, we use
the same scaling parameters to transform the testing data (X_test), ensuring consistent scaling.
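The scaling procedure above can be sketched as follows. The train/test split shown here, including test_size and random_state, is an assumption made so the example is self-contained; the notebook's actual split may differ.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)

# Hypothetical split parameters, chosen only for illustration.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Learn each feature's minimum and maximum from the training data only.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training minimum and maximum on the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

Because the scaler is fitted on the training data only, scaled test values can fall slightly outside the 0 to 1 range.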
DEEP LEARNING
History of AI
o Early attempts at AI aimed to achieve human-level intelligence but were limited by
computational resources.
o Neural networks, inspired by human brains, started in the 1950s but were initially
outperformed by Von Neumann Architecture.
Data Augmentation and Deployment
Expert Systems
o These were complex systems built by many engineers, with rules programmed by
humans.
o They had limitations, as the computer could only do as much as a human could program.
o Deep learning uses large datasets to learn patterns on its own, similar to how children
learn.
Pre-Trained Models
The Deep Learning Revolution
o Two major factors: availability of data and increased computing power (GPUs).
o Deep learning flips traditional programming by letting the model learn the rules instead
of being explicitly programmed.
When to Choose Deep Learning
o Use traditional programming for clear and straightforward tasks.
o Use deep learning for complex tasks where rules are hard to define.
Recommender Systems
o Content curation, targeted advertising, shopping recommendations.
Reinforcement Learning
o AI beating human experts in games like Go and video games.
1. What Python libraries are used in this notebook for data analysis and visualization?
2. How do you display the first five rows of a DataFrame?
3. What is the shape of the target array in the Iris dataset?
4. Write the code to create a Pandas DataFrame from the Iris dataset with feature names as column
names.
5. What is the Pearson correlation coefficient, and how is it useful in this analysis?
6. How can you generate summary statistics for each numeric column in a DataFrame?
7. Write a code snippet to display a 2x2 grid of histograms for each of the four attributes of the Iris
dataset.
8. Explain the significance of the correlation values between the features and the target in the Iris dataset.
9. How can you create a pairplot of all columns in the DataFrame, distinguishing different target classes
using Seaborn?
10. Based on the scatterplots of Sepal Length vs. Sepal Width and Petal Length vs. Petal Width, what can
you infer about the relationship between these features and the different species of the Iris dataset?