
04 DS 2023

The document describes experiments conducted on various datasets to perform data preparation, exploratory data analysis, data modeling and validation. The experiments involve tasks like data partitioning, outlier analysis, data visualization using plots and tables, model training and validation using statistical tests.

Uploaded by

GolDeN Maniac

Don Bosco Institute of Technology, (DBIT), Mumbai - 400070.

Lab Journal
by
Mr. Suyog Avhad
Roll no : 04
Seat No : 3604

for the Subject


ITC605: DS using Python Lab
T.E. / Sem VI / Jan – May 2023.

Department of Information Technology


INDEX

S. N.  Experiment Name                        Date
1      Data Preparation                       26-02-2023
2      Data Visualization – EDA               08-03-2023
3      Data Modeling & Hypothetical Testing   08-03-2023
4      Mini Project on “ Title “              25-04-2023
5      Assignment - 01                        13-03-2023
6      Assignment - 02                        20-03-2023

Experiment - 01
Data preparation using NumPy and Pandas

Problem Statement:
a. Derive an index field and add it to the data set.

b. Find out the missing values.
c. Obtain a listing of all records that are outliers according to any field. Print
out a listing of the 10 largest values for that field.
d. Do the following for any field. i. Standardize the variable. ii. Identify how
many outliers there are and identify the most extreme outlier.

Platform Used: Kaggle


Name of Dataset: Superstore marketing campaign dataset (.csv file)
Theory:
1. What is DS, NumPy, Pandas, a DataFrame, and EDA?

● Data science is the study of data to extract meaningful insights for business.
● NumPy is a Python library for numerical computing that provides n-dimensional arrays and
linear algebra routines.
● Pandas is a Python library which contains high-level data structures and manipulation
tools designed to make data analysis fast and easy in Python.
● A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered
collection of columns, each of which can be a different value type (numeric, string,
boolean, etc.).
● Exploratory data analysis (EDA) is used by data scientists to analyze and investigate
data sets and summarize their main characteristics, often employing data visualization
methods.

2. How to read_csv file (answer this for the platform which you are using)
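
A minimal sketch of reading a CSV file with pandas. On Kaggle, attached datasets are normally found under /kaggle/input; the exact folder and file name below are placeholders, and the in-memory CSV (with made-up rows and an assumed Id column) only keeps the example runnable outside Kaggle.

```python
import io
import pandas as pd

# On Kaggle the call would look like this (hypothetical path, adjust to your dataset):
#   df = pd.read_csv("/kaggle/input/<dataset-folder>/<file>.csv")
# A small in-memory CSV with made-up rows stands in for the real file here.
csv_text = """Id,Year_Birth,Marital_Status,Income
1826,1970,Divorced,84835
1,1961,Single,57091
10476,1958,Married,67267
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (number of rows, number of columns)
print(df.head())   # first rows of the DataFrame
```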

3. How to create dataframe

4. How to select one/multiple column/s from dataset
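
Questions 3 and 4 can be illustrated with a short sketch. The rows are made up; only the column names Year_Birth, Marital_Status and Income are taken from this journal.

```python
import pandas as pd

# Create a DataFrame from a dictionary of columns (made-up values).
df = pd.DataFrame({
    "Year_Birth": [1970, 1961, 1958],
    "Marital_Status": ["Divorced", "Single", "Married"],
    "Income": [84835, 57091, 67267],
})

income = df["Income"]                        # one column name -> Series
subset = df[["Marital_Status", "Income"]]    # list of column names -> DataFrame

print(type(income).__name__, type(subset).__name__)
```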

5. What is index, how to assign index


An index is like an address: it is how any data point in a dataframe or series can be accessed.
The set_index() method is used to assign an index. Use set_index('col') for a single index and
set_index(['col1', 'col2']) for a multi-level index.
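
A small sketch of deriving an index field and assigning single and multi-level indexes. The column name record_id and the data values are hypothetical, chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Year_Birth": [1970, 1961, 1958],
    "Marital_Status": ["Divorced", "Single", "Married"],
    "Income": [84835, 57091, 67267],
})

# Derive an index field from the row position and add it as a column (hypothetical name).
df["record_id"] = range(1, len(df) + 1)

single = df.set_index("record_id")                        # single-level index
multi = df.set_index(["Marital_Status", "Year_Birth"])    # two-level MultiIndex

print(single.loc[2, "Income"])    # look up a row by its index label
print(multi.index.nlevels)        # number of index levels
```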

6. Use of df.columns, head(), tail(), df.dtypes, value_counts(), isnull(), sum(), df.size,
df.shape, len(df), hasnans, dropna(), df.count(), astype(int), describe(), max(), mean(),
median(), std(), unique()
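
The inspection methods listed above can be tried on a tiny made-up DataFrame. This is only a sketch of typical usage, not the notebook code from the screenshots.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Marital_Status": ["Single", "Married", "Married", "Divorced"],
    "Income": [57091.0, 67267.0, np.nan, 84835.0],
})

print(df.columns.tolist())                    # column names
print(df.dtypes)                              # data type of each column
print(df.shape, df.size, len(df))             # (rows, cols), total cells, row count
print(df.isnull().sum())                      # missing values per column
print(df["Income"].hasnans)                   # True if the Series contains NaN
print(df["Marital_Status"].value_counts())    # frequency of each category
print(df["Marital_Status"].unique())          # distinct values
print(df.count())                             # non-null values per column
print(df["Income"].describe())                # count, mean, std, min, quartiles, max

clean = df.dropna()                           # drop rows with any missing value
clean = clean.astype({"Income": int})         # cast is safe once the NaN rows are gone
print(clean["Income"].mean(), clean["Income"].median(), clean["Income"].max())
```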

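7. What is quantile, percentile, meaning of df.count(axis=1).head()

Quantiles: Cut points that divide sorted data into equal-sized groups. Pandas expresses them as
fractions from 0 to 1, e.g. quantile(0.25) is the first quartile.
Percentiles: Quantiles expressed on a scale from 0 to 100, e.g. the 25th percentile equals the
0.25 quantile.
df.count(axis=1).head() counts the non-null values in each row and displays those counts for the
first 5 rows of the dataframe.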
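
A short sketch with made-up numbers, showing that the 0.25 quantile and the 25th percentile are the same cut point, and what count(axis=1) returns.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

q1 = s.quantile(0.25)         # quantile given as a fraction in [0, 1]
p25 = np.percentile(s, 25)    # the same cut point given as a percentile in [0, 100]
print(q1, p25)

df = pd.DataFrame({"a": [1, np.nan, 3], "b": [4, 5, np.nan], "c": [7, 8, 9]})
row_counts = df.count(axis=1)   # non-null cells in each row
print(row_counts.head())        # head() shows the first five of those counts
```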

8. Syntax of replacing values in column?
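
Two common ways to replace values in a column, sketched on made-up data: a value mapping with replace(), and a conditional assignment with .loc. The category "Alone" and the cap of 350 are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"Marital_Status": ["Single", "Alone", "Married", "Alone"],
                   "Income": [100, 200, 300, 400]})

# replace() maps old values to new ones; values not in the mapping stay unchanged.
df["Marital_Status"] = df["Marital_Status"].replace({"Alone": "Single"})

# Conditional replacement: select rows with a boolean mask, then assign.
df.loc[df["Income"] > 350, "Income"] = 350
print(df)
```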

9. What is np.nan and NaN, iloc()

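np.nan is the floating-point constant that represents a missing value (NaN, Not a Number). To
check whether a value is NaN, use np.isnan() or pd.isna().

iloc[] selects rows and columns by integer position (0, 1, 2, …, n-1). It is useful when the index
labels of a data frame are something other than the default series 0, 1, 2, 3, …, or when the
user doesn’t know the index labels.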
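
A brief sketch of how NaN behaves and how iloc differs from label-based access. The string index labels are made up.

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)    # False: NaN never compares equal, not even to itself
print(np.isnan(np.nan))    # True: use np.isnan() or pd.isna() to test for NaN

df = pd.DataFrame({"Income": [100.0, np.nan, 300.0]}, index=["a", "b", "c"])
missing = pd.isna(df["Income"]).tolist()

first_by_position = df.iloc[0]["Income"]    # iloc: integer position, ignores labels
first_by_label = df.loc["a", "Income"]      # loc: index label
print(missing, first_by_position, first_by_label)
```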

10. fillna(), bfill, ffill
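
The three filling strategies can be compared on a small made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

constant = s.fillna(0)    # replace every NaN with a fixed value
forward = s.ffill()       # forward fill: carry the last valid value forward
backward = s.bfill()      # backward fill: pull the next valid value backward

print(constant.tolist(), forward.tolist(), backward.tolist())
```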

11. What is an outlier ?

Outliers in this case are defined as the observations that lie below Q1 − 1.5 × IQR (the lower
boxplot whisker) or above Q3 + 1.5 × IQR (the upper boxplot whisker), where IQR = Q3 − Q1.
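
A minimal sketch of the IQR rule on made-up income values, separating outliers from the remaining records:

```python
import pandas as pd

income = pd.Series([30, 35, 40, 42, 45, 48, 50, 55, 60, 400], name="Income")

q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # the two whisker limits

is_outlier = (income < lower) | (income > upper)
outliers = income[is_outlier]     # records outside the whiskers
inliers = income[~is_outlier]     # records excluding outliers
print(lower, upper, outliers.tolist())
```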

12. nlargest(), nsmallest(), How to find most frequent value in column
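
A sketch of nlargest(), nsmallest() and two ways to find the most frequent value, using made-up rows:

```python
import pandas as pd

df = pd.DataFrame({"Marital_Status": ["Single", "Married", "Married", "Divorced", "Married"],
                   "Income": [57091, 67267, 21474, 84835, 62513]})

top = df.nlargest(2, "Income")         # rows holding the 2 largest incomes
bottom = df["Income"].nsmallest(2)     # the 2 smallest values of one column

# mode() can return several values when there is a tie, so take the first one.
most_frequent = df["Marital_Status"].mode()[0]
same_answer = df["Marital_Status"].value_counts().idxmax()
print(top, bottom, most_frequent, sep="\n")
```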

13. Ways to standardize -> z-score / scipy.stats / StandardScaler / MinMaxScaler
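
The z-score and min-max formulas can be written directly in pandas, which is what scipy.stats.zscore and scikit-learn's StandardScaler and MinMaxScaler compute for a single column. The data are made up, and the |z| > 3 cut-off for outliers is a common rule of thumb assumed here, not necessarily the one used in the notebook.

```python
import pandas as pd

income = pd.Series(list(range(30, 68, 2)) + [400], dtype=float)   # made-up values

# Z-score: subtract the mean and divide by the population standard deviation (ddof=0).
z = (income - income.mean()) / income.std(ddof=0)

# Min-max scaling: map the smallest value to 0 and the largest to 1.
minmax = (income - income.min()) / (income.max() - income.min())

# Assumed rule of thumb: a record is an outlier when |z| > 3.
n_outliers = int((z.abs() > 3).sum())
most_extreme = income[z.abs().idxmax()]    # value with the largest |z|
print(n_outliers, most_extreme)
```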

Link of Execution: https://www.kaggle.com/code/suyog045/ai-ds-exp1

Screenshots of Code with Output:
a. Derive an index field and add it to the data set.

b. Find out the missing values.

c. Obtain a listing of all records that are outliers according to any field. Print out a listing of the
10 largest values for that field.

● To find records excluding outliers

● Outliers

d. Do the following for any field. i. Standardize the variable. ii. Identify how many outliers there
are and identify the most extreme outlier.

Conclusion:
Data preparation has been done using NumPy and Pandas.

Experiment - 02

Title:
Exp 2 : Data Visualization / Exploratory Data Analysis for the selected data set using Matplotlib
and Seaborn

Problem Statement:
a. Create a bar graph, contingency table using any 2 variables.
b. Create a normalized histogram.
c. Describe what this graph and tables indicate?

Platform Used: Kaggle


Name of Dataset: Superstore marketing campaign dataset (.csv file)
Theory:
Attach screenshot wherever necessary
1. What is visualization
2. Explain what is and use of seaborn and matplotlib
3. Screenshots 2-4 of practice of using seaborn and matplotlib
4. List down Data Visualization Plots
5. Explain what is bar graph, histogram and contingency table and heatmap, mention when
to use these plots also uni/bi/multi variant. Add screenshot.

Link of Execution:
https://www.kaggle.com/code/suyog045/ai-ds2
Screenshots of Code with Output:
a. Create a bar graph, contingency table using any 2 variables.

● Bar Graph

x axis = Marital_Status , y axis = Income

● Contingency table
Variable 1 = Year_Birth
Variable 2 = Marital_Status

● Verifying Contingency Table

b. Create a normalized histogram.

● Normalization of the data, i.e. the variable whose histogram is to be plotted

● Creating Histogram

c. Describe what this graph and tables indicate?

● The bar graph compares a numeric value across the categories of a categorical variable.

Eg: in our data, we have plotted Marital_Status against Income. As Marital_Status is categorical,
each bar represents the Income associated with one Marital_Status category.

● The contingency table displays the joint frequency distribution of two variables: each cell
counts the records that have a particular combination of values.
● A histogram represents the distribution of numerical data. Looking at a histogram, we can
decide whether the values are normally distributed (a bell-shaped curve), skewed to the
right or skewed left. A histogram of residuals is useful to validate important assumptions
in regression analysis.

Eg: in our data, we have plotted the histogram for Year_Birth: the values are grouped into bins
and the height of each bar shows how often a Year_Birth in that bin occurs.
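
The numbers behind these three displays can be sketched without plotting. The rows are made up; the use of pd.crosstab, a group mean for the bars, and np.histogram with density=True are assumptions about one reasonable way to do it, not a copy of the notebook.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Year_Birth": [1970, 1961, 1970, 1958, 1961, 1970],
    "Marital_Status": ["Single", "Married", "Married", "Single", "Married", "Single"],
    "Income": [84835, 57091, 67267, 32474, 21474, 71691],
})

# Contingency table: each cell counts records with that pair of values.
table = pd.crosstab(df["Year_Birth"], df["Marital_Status"], margins=True)
print(table)

# Bar-graph data: mean Income per category (seaborn's barplot shows the mean by default).
bar_heights = df.groupby("Marital_Status")["Income"].mean()
print(bar_heights)

# Normalized histogram: density=True rescales the bars so that their total area is 1.
density, edges = np.histogram(df["Year_Birth"], bins=3, density=True)
area = float(np.sum(density * np.diff(edges)))
print(density, area)
```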

Conclusion:
Data Visualization / Exploratory Data Analysis for the selected data set was performed using
Matplotlib and Seaborn.

Experiment - 03

Title:
Exp 3 : Data Modeling
Problem Statement:

Screenshots of Code with Output:

a. Partition the data set, for example 75% of the records are included in the training data set and
25% are included in the test data set. Use a bar graph to confirm your proportions.

Dataset : drug200.csv

1. Loading Data set

2. Dividing Data set into 2 parts i.e. x = features and y = target

3. Partitioning dataset into training and testing

4. Marking the x_train and x_test records in a separate new column called ‘isstrain’ by
assigning the values 1 and 0 respectively

5. Plotting it into bar graph

b. Identify the total number of records in the training data set.
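
A minimal sketch of parts a and b. It uses pandas' sample() as a dependency-free stand-in for scikit-learn's train_test_split, and 200 made-up records in place of drug200.csv (the columns Age and Drug are assumed for illustration). The flag column reuses the journal's name 'isstrain'.

```python
import numpy as np
import pandas as pd

# Stand-in data: 200 made-up records.
rng = np.random.default_rng(0)
df = pd.DataFrame({"Age": rng.integers(15, 75, size=200),
                   "Drug": rng.choice(["drugA", "drugB", "drugX"], size=200)})

train = df.sample(frac=0.75, random_state=42)    # random 75% of the rows
test = df.drop(train.index)                      # the remaining 25%

# 1 marks a training record, 0 a test record.
df["isstrain"] = df.index.isin(train.index).astype(int)
counts = df["isstrain"].value_counts()
print(counts)    # these two counts are the heights a bar graph would show

# Three equivalent ways to get the number of training records.
print(len(train), train.shape[0], int(df["isstrain"].sum()))
```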

c. Validate your partition by performing a two‐sample Z‐test.

Dataset : marketing_AB.csv

1. Loading Dataset

2. Dividing data into input and output variable

3. Splitting Dataset into training and testing

4. Calculating mean and standard deviation for x_train and x_test

5. Calculating length of x_train and x_test & calculating difference in their means

6. Calculating z_score

7. Calculating p_value

p_value for converted = 0.1077656
p_value for total ads = 0.03929745
p_value for most ads hour = 0.66603449

If p_value < 0.05

We reject the null hypothesis, i.e. the means of the variable in the two datasets are not equal

else

we fail to reject the null hypothesis, i.e. there is no significant evidence that the means of the
variable in the two datasets differ

Here the p_value for converted and most ads hour is > 0.05, so for these variables we fail to
reject the null hypothesis: their training and test means are not significantly different. The
p_value for total ads is < 0.05, so its null hypothesis is rejected, signifying that the training
and test means of total ads differ significantly.
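
The steps above (means, standard deviations, lengths, z-score, p-value) can be sketched as one function. This is an illustrative implementation on made-up data, not the notebook's code; the p-value is computed from the standard normal distribution via math.erf so that no extra library is needed.

```python
import math
import numpy as np

def two_sample_z_test(a, b):
    """Return (z, two-sided p-value) for H0: mean(a) == mean(b)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Standard error of the difference in means, from the two sample variances.
    se = math.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = (a.mean() - b.mean()) / se
    # Standard normal CDF: Phi(x) = (1 + erf(x / sqrt(2))) / 2.
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return z, p

rng = np.random.default_rng(1)
values = rng.normal(loc=25.0, scale=5.0, size=1000)    # made-up numeric variable
train, test = values[:750], values[750:]               # 75% / 25% partition

z, p = two_sample_z_test(train, test)
print(f"z = {z:.3f}, p = {p:.3f}")
print("reject H0" if p < 0.05 else "fail to reject H0")
```

A p-value below 0.05 would suggest that the partition is not representative for that variable, in which case the split is usually repeated with a different random seed.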

Platform Used: Kaggle


Name of Dataset: Drug classification - drug200.csv & marketing_AB Testing dataset
Theory:
Attach screenshot wherever necessary
1. What is machine learning, supervised learning and unsupervised learning with example
2. What is the need of partitioning dataset
3. What is feature selection, any rules to select appropriate features? One example of
selecting features and targets.
4. Use of Sklearn (scikit-learn)
5. Different ways to identify the total number of rows in training dataset
6. What is barplot, what is countplot
7. Explain two sample z tests in detail from need till formula.

Link of Execution:
https://www.kaggle.com/code/suyog045/ai-ds-exp3-1
https://www.kaggle.com/code/suyog045/ai-ds-exp3-2

Conclusion:
Data Modelling is performed

AIML Mini - Project

Topic: MNIST Generator (Generate Monet Style Paintings)

Introduction:
The problem that this project aims to address is the generation of new paintings in the style of the
famous artist Claude Monet, which can be a time-consuming and challenging task for human
artists. Generative Adversarial Networks (GANs) can be used to generate new images that mimic
Monet's style, which can save time and effort in creating new paintings in this style.
Additionally, the generated paintings can serve as a source of inspiration for artists, as well as
being used in various applications, such as interior design, fashion, and advertising. However,
there are challenges in training the GAN model to accurately capture the intricate details of
Monet's style and produce high-quality paintings that are visually appealing and aesthetically
pleasing. This project aims to address these challenges and create a GAN model that can
generate high-quality Monet-style paintings. Since our group is taking this project forward as our
final-year project, this semester we have implemented the concept of DCGAN and generated
Handwritten digits using the MNIST dataset. Through this project, we have learned about the
basic concepts of GANs, and we can now use that knowledge to generate Monet-style paintings.

Algorithms:
There are several algorithms that can be used for image generation in machine learning. Some
popular ones are:

1. Generative Adversarial Networks (GANs): GANs consist of two neural networks, a generator
and a discriminator, that work together to generate new images that look similar to the training
data. The generator learns to create new images, while the discriminator learns to differentiate
between real and fake images.
2. Convolutional Neural Networks (CNNs): CNNs are a type of neural network that are often
used for image classification, but they can also be used for image generation. CNNs can learn to
generate new images by learning the patterns in the input data and generating new images that
follow the same patterns.
3. DCGAN: DCGAN (Deep Convolutional Generative Adversarial Network) is a variant of the
generative adversarial network (GAN) architecture, specifically designed for generating
high-quality images. DCGAN uses convolutional neural networks (CNNs) in both the generator and
discriminator networks. The generator network takes a random noise vector as input and
generates an image, while the discriminator network takes an image as input and classifies it as
real or fake. The two networks are trained together in an adversarial manner, where the generator
tries to generate more realistic images to fool the discriminator, and the discriminator tries to
accurately classify real and fake images.
There are also many other algorithms and techniques for image generation in machine learning,
and the choice of algorithm depends on the specific task and dataset.

Data Set Specification:


MNIST is a dataset of square 28×28 pixel grayscale images of handwritten single digits between 0
and 9. It contains 60,000 training images and 10,000 test images.

Monet Dataset:

Architecture:
The proposed architecture for this project to create a GAN to generate Monet-style paintings
would involve the following components:

1. Generator network: The generator network would be responsible for generating Monet-style
paintings from random noise vectors. It would consist of multiple layers of convolutional,
upsampling, and activation functions, and would output an image with the same dimensions as
the training images.

Figure 1 : Generator

2. Discriminator network: The discriminator network would be responsible for distinguishing
between real Monet paintings and generated Monet-style paintings. It would consist of multiple
layers of convolutional, downsampling, and activation functions, and would output a binary
classification indicating whether the input image is real or fake.

Figure 2 : Discriminator

3. Loss function: The loss function would be responsible for guiding the training process of the
GAN model. It would consist of two parts - the generator loss and the discriminator loss. The
generator loss would encourage the generator network to generate paintings that are similar to
the real Monet paintings, while the discriminator loss would encourage the discriminator
network to correctly distinguish between real and fake paintings.

Optimization algorithm: The optimization algorithm would be responsible for updating the weights
of the generator and discriminator networks during training. It would use backpropagation and
stochastic gradient descent techniques to minimize the loss function.

Figure 3 : Loss Function
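
The two-part loss can be sketched in plain NumPy. Binary cross-entropy is assumed here because it is the usual choice for a DCGAN; the discriminator outputs are made-up probabilities, and the function names are illustrative.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy between predicted probabilities and 0/1 targets."""
    pred = np.clip(pred, eps, 1 - eps)    # avoid log(0)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def discriminator_loss(d_real, d_fake):
    # Real images should be scored 1 and generated images 0.
    return 0.5 * (bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake)))

def generator_loss(d_fake):
    # The generator does well when the discriminator scores its images as real (1).
    return bce(d_fake, np.ones_like(d_fake))

# Made-up discriminator outputs: probability that each image is real.
d_real = np.array([0.9, 0.8, 0.95])
d_fake = np.array([0.1, 0.3, 0.2])
print(discriminator_loss(d_real, d_fake), generator_loss(d_fake))
```

With these numbers the discriminator loss is low and the generator loss is high, which is the typical situation early in training, when fakes are still easy to spot.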

4. Style transfer and image filtering techniques: These techniques would be used to fine-tune
and optimize the GAN model to improve the quality of the generated paintings. Style transfer
techniques would be used to transfer the style of different Monet paintings to the generated
paintings, while image filtering techniques would be used to enhance the visual quality of the
generated paintings.

5. User interface: The user interface would provide a user-friendly interface for generating
Monet-style paintings. It would allow users to select the input parameters like size and style,
display the generated paintings, and save or download the paintings.

6. Deployment platform: The deployment platform would host the GAN model and user
interface as a web application or a standalone desktop application. It would ensure the security
and reliability of the application, and optimize the performance of the application for real-time
use.

Training the Model:


The first component we will make is the generator. Instead of passing in the image dimension,
we will pass the number of image channels to the generator. This is because with DCGAN, we
use convolutions which don’t depend on the number of pixels on an image. However, the
number of channels is important to determine the size of the filters.
We will build a generator using 4 layers (3 hidden layers + 1 output layer).
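
To see how four transposed-convolution layers can grow a noise vector into a 28×28 MNIST-sized image, the output size of each layer can be computed by hand. The kernel and stride values below are hypothetical settings chosen so that the arithmetic ends at 28; they are not confirmed by the screenshots.

```python
def conv_transpose_out(size, kernel, stride, padding=0):
    """Output height/width of a transposed convolution (no dilation, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

# Hypothetical (kernel, stride) pairs for 3 hidden layers + 1 output layer.
layers = [(3, 2), (4, 1), (3, 2), (4, 2)]

size = 1    # the noise vector is treated as a 1x1 "image" with many channels
sizes = []
for kernel, stride in layers:
    size = conv_transpose_out(size, kernel, stride)
    sizes.append(size)

print(sizes)    # [3, 6, 13, 28]
```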

The second component we need to create is the discriminator.
We will use 3 layers in our discriminator's neural network.

We will train our GAN! For each epoch, we will process the entire dataset in batches. For every
batch, we will update the discriminator and generator. Then, we can see DCGAN's results!

You can see the output after every 500 steps.

Result:
This project implementing DCGAN to generate MNIST digits was successful in generating
reasonable-quality digit images. The model was able to learn the features of the training data and
generate new samples that closely resemble the real MNIST digits.
The discriminator and generator loss curves show that the model is learning and improving over
time. The discriminator loss decreases as the model becomes better at distinguishing real from
fake images, while the generator loss decreases as the model learns to generate images that better
fool the discriminator.

Figure 4: Loss Curve

Assignment no : 01

Installation of Python

● Download the current production version of Python (2.7.1) from the Python Download
site.

● Double click on the icon of the file that you just downloaded.
● Accept the default options given to you until you get to the Finish button. Your
installation is complete.

Setting up the Environment

● Starting at My Computer go to the following directory C:\Python27. In that folder you
should see all the Python files.

● Copy that address starting with C: and ending with 27 and close that window.
● Click on Start. Right Click on My Computer.
● Click on Properties. Click on Advanced System Settings or Advanced.
● Click on Environment Variables.
● Under System Variables search for the variable Path.
● Select Path by clicking on it. Click on Edit.

● Scroll all the way to the right of the field called Variable value using the right arrow.
● Add a semicolon (;) to the end and paste the path (to the Python folder) that you
previously copied. Click OK.

Writing Your First Python Program

● Create a folder called PythonPrograms on your C:\ drive. You will be storing all your
Python programs in this folder.
● Go to Start and either type Run in the Start Search box at the bottom or click on Run.
● Type in notepad in the field called Open.

● In Notepad type in the following program exactly as written:

# File: Hello.py

print("Hello World!")

● Go to File and click on Save as.


● In the field Save in browse for the C: drive and then select the folder PythonPrograms.
● For the field File name remove everything that is there and type in Hello.py.
● In the field Save as type select All Files
● Click on Save. You have just created your first Python program.

Running Your First Program

● Go to Start and click on Run.


● Type cmd in the Open field and click OK.
● A dark window will appear. Type cd C:\ and hit the key Enter.
● If you type dir you will get a listing of all folders in your C: drive. You should see the
folder PythonPrograms that you created.

● Type cd PythonPrograms and hit Enter. It should take you to the PythonPrograms folder.
● Type dir and you should see the file Hello.py.
● To run the program, type python Hello.py and hit Enter.
● You should see the line Hello World!

● Congratulations, you have run your first Python program.

Getting Started with Python Programming for Mac Users

Python comes bundled with Mac OS X. But the version that you have is quite likely an older
version. Download the latest binary version of Python that runs on both Power PC and Intel
systems and install it on your system.

Writing Your First Python Program

● Click on File and then New Finder Window.


● Click on Documents.
● Click on File and then New Folder.
● Call the folder PythonPrograms. You will be storing all class related programs there.
● Click on Applications and then TextEdit.
● Click on TextEdit on the menu bar and select Preferences.
● Select Plain Text.

● In the empty TextEdit window type in the following program, exactly as given:

# File: Hello.py

print("Hello World!")

● From the File menu in TextEdit click on Save As.


● In the field Save As: type Hello.py.
● Select Documents and the file folder PythonPrograms.
● Click Save.

Running Your First Program

● Select Applications, then Utilities and Terminal.


● In your Terminal window type ls and Return. It should give a listing of all the top level
folders. You should see the Documents folder.
● Type cd Documents and hit Return.
● Type ls and hit Return and you should see the folder PythonPrograms.
● Type cd PythonPrograms and hit Return.
● Type ls and hit return and you should see the file Hello.py.
● To run the program, type python Hello.py and hit Return.
● You should see the line Hello World!
● Congratulations, you have run your first Python program.

Starting IDLE on Mac

● In a Terminal window, type python. This will start the Python shell. The prompt for that
is >>>
● At the Python shell prompt type import idlelib.idle
● This will start the IDLE IDE

Using IDLE on either Windows or Mac

● Start IDLE

● Go to File menu and click on New Window

● Type your program in
● Go to the File menu and click on Save. Type in filename.py This will save it as a plain
text file, which can be opened in any editor you choose (like Notepad or TextEdit).
● To run your program go to Run and click Run Module

Google Colab and Drive DB connection

STEP-1: Import Libraries

# Code to read csv file into colaboratory:

!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth

from pydrive.drive import GoogleDrive

from google.colab import auth

from oauth2client.client import GoogleCredentials

STEP-2: Authenticate EMail ID

auth.authenticate_user()

gauth = GoogleAuth()

gauth.credentials = GoogleCredentials.get_application_default()

drive = GoogleDrive(gauth)

STEP-3: Get File from Drive using file-ID

# 3.1 Get the file (replace the id with the id of the file you want to access)

downloaded = drive.CreateFile({'id':'1R9vW5dmox7i8OGOoDql_r9yUrvAJgvhD'})

downloaded.GetContentFile('resources.csv')

STEP-4: Read File

# 4.1 Read file as a pandas DataFrame

import pandas as pd

xyz = pd.read_csv('resources.csv')

print(xyz.head(1))

#Repeat STEPs 3 & 4 to load as many files as you require.

Installing Python and TensorFlow

It is possible to install and run Python/TensorFlow entirely from your own computer. Google
provides TensorFlow for Windows, Mac and Linux. Previously, TensorFlow did not support
Windows. However, as of December 2016, TensorFlow supports Windows for both CPU and
GPU operation.

The first step is to install Python 3.7. As of August 2019, this is the latest version of Python 3. I
recommend using the Miniconda (Anaconda) release of Python, as it already includes many of
the data science related packages that will be needed by this class. Anaconda directly supports:
Windows, Mac and Linux. Miniconda is the minimal set of features from the very large
Anaconda Python distribution. Download Miniconda from the following URL:
· Miniconda

Dealing with TensorFlow incompatibility with Python 3.7

*Note: I will remove this section once all needed libraries add support for Python 3.7.

VERY IMPORTANT Once Miniconda has been downloaded you must create a Python 3.6
environment. Not all TensorFlow 2.0 packages currently (as of August 2019) support Python 3.7.
This is not unusual, usually you will need to stay one version back from the latest Python to
maximize compatibility with common machine learning packages. So you must execute the
following commands:
conda create -y --name tensorflow python=3.6

To enter this environment, you must use the following command (for Windows), this command
must be done every time you open a new Anaconda/Miniconda terminal window:

activate tensorflow

For Mac, do this:


source activate tensorflow

Installing Jupyter

It is easy to install Jupyter notebooks with the following command:


conda install -y jupyter

Once Jupyter is installed, it is started with the following command:


jupyter notebook

The following packages are needed for this course:


conda install -y scipy

pip install --exists-action i --upgrade scikit-learn
pip install --exists-action i --upgrade pandas
pip install --exists-action i --upgrade pandas-datareader
pip install --exists-action i --upgrade matplotlib
pip install --exists-action i --upgrade pillow
pip install --exists-action i --upgrade tqdm
pip install --exists-action i --upgrade requests
pip install --exists-action i --upgrade h5py
pip install --exists-action i --upgrade pyyaml
pip install --exists-action i --upgrade tensorflow_hub
pip install --exists-action i --upgrade bayesian-optimization
pip install --exists-action i --upgrade spacy
pip install --exists-action i --upgrade gensim
pip install --exists-action i --upgrade flask
pip install --exists-action i --upgrade boto3
pip install --exists-action i --upgrade gym
pip install --exists-action i --upgrade tensorflow==2.0.0-beta1
pip install --exists-action i --upgrade keras-rl2 --user
conda update -y --all

Notice that I am installing a specific version of TensorFlow. As of the current semester, this is
the latest version of TensorFlow. It is very likely that Google will upgrade this during this
semester. The newer version may have some incompatibilities, so it is important that we start
with this version and end with the same.

You should also link your new tensorflow environment to Jupyter so that you can choose it as a
Kernel. Always make sure to run your Jupyter notebooks from your 3.6 kernel. This is
demonstrated in the video.
python -m ipykernel install --user --name tensorflow --display-name "Python 3.6 (tensorflow)"

Python Introduction
● Anaconda v3.6 Scientific Python Distribution, including:
○ Scikit-Learn
○ Pandas
○ Others: csv, json, numpy, scipy
● Jupyter Notebooks
● PyCharm IDE
● Cx_Oracle
● MatPlotLib

Jupyter Notebooks

Space matters in Python, indent code to define blocks

Jupyter Notebooks allow Python and Markdown to coexist.

Even :

Python Versions
● If you see xrange instead of range, you are dealing with Python 2
● If you see print x instead of print(x), you are dealing with Python 2
● This class uses Python 3.6!

In [1]:

# What version of Python do you have?

3.10.0

import sys

import tensorflow.keras

import pandas as pd

import sklearn as sk

import tensorflow as tf

print(f"Tensor Flow Version: {tf.__version__}")

print(f"Keras Version: {tensorflow.keras.__version__}")

print()

print(f"Python {sys.version}")

print(f"Pandas {pd.__version__}")

print(f"Scikit-Learn {sk.__version__}")

print("GPU is", "available" if tf.test.is_gpu_available() else "NOT AVAILABLE")

Tensor Flow Version: 2.0.0-beta1

Keras Version: 2.2.4-tf

Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 13:42:17)

[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]

Pandas 0.25.0

Scikit-Learn 0.21.3

GPU is NOT AVAILABLE

Assignment no : 02

Part 1 - Data Cleaning commands

Load the data frame and check by printing it

Print the top 10 rows

To print the columns

To print dtypes

To print the shape

To print describe

To find the median and filling the missing values with this median
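
A minimal sketch of median imputation on a made-up Income column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Income": [30000.0, np.nan, 50000.0, 70000.0, np.nan]})

median_income = df["Income"].median()                 # NaN values are skipped by default
df["Income"] = df["Income"].fillna(median_income)     # fill the gaps with the median
print(median_income, df["Income"].tolist())
```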

To plot the histogram with equal-frequency bins using equalObs
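
The helper named equalObs is not shown in the text, so the version below is an assumption: a commonly used way to compute bin edges that put roughly the same number of observations in each bin. The data are made up, and np.histogram is used instead of a plot so the counts can be inspected directly.

```python
import numpy as np

def equalObs(x, nbin):
    """Bin edges such that each bin holds roughly the same number of observations."""
    nlen = len(x)
    return np.interp(np.linspace(0, nlen, nbin + 1), np.arange(nlen), np.sort(x))

x = np.arange(100) ** 2               # made-up, right-skewed data
edges = equalObs(x, 4)                # 4 bins -> 5 edges
counts, _ = np.histogram(x, bins=edges)
print(edges, counts)

# With matplotlib the same edges would be passed as: plt.hist(x, bins=edges)
```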

To find quartiles and IQR

To remove outliers and plot the frequency of a column

Rescaling data

Normalization

Binarization
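
The three transformations can be sketched in NumPy. They mirror what scikit-learn's MinMaxScaler, Normalizer and Binarizer compute; the feature matrix and the threshold are made up for illustration.

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])    # made-up feature matrix: one row per record

# Rescaling (min-max): map every column onto the range [0, 1].
rescaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Normalization: scale every row to unit Euclidean (L2) length.
normalized = X / np.linalg.norm(X, axis=1, keepdims=True)

# Binarization: values above the threshold become 1, all others 0.
threshold = 2.5                 # hypothetical threshold
binary = (X > threshold).astype(int)

print(rescaled, normalized, binary, sep="\n")
```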
