
Department of Information Technology

Assignment – I

Academic Year : 2023 - 2024 Batch : 2021 - 2025

Year/Semester/Section : III/VI/A Regulation : 2021


Course Code/Name : 21PCS02 / Exploratory Data Analysis    Total Marks : 25

Submission Date : 23.02.2024    Given Date : 04.03.2024

CO1 : Make use of modern tools to explore the data and its characteristics. (K3 -
Apply)

Name of the Student : Vidya Janani V


Roll Number : 21ITA24

Marks Split-up :

Q. No.   Description            Allotted Marks (20)   Given Marks (20)
1        Content & Report       15
2        Originality            2
3        Presentation           3
4        Communication          2
5        On-time Submission     3
         Total (25)

Student Signature Faculty Signature


Vidya Janani V – 21ITA24

1. How can modern tools assist in collecting and loading diverse datasets for analysis? Explain this with a selected real-time example of your choice.
2. For the selected real-time example,
   a. Explore and explain the different methods and tools for importing data from various sources (e.g., databases, APIs, CSV files).
   b. What tools can be employed to clean and preprocess raw data, handling missing values and outliers?
   c. Discuss the significance of data normalization and standardization using modern techniques.
Solution :
1. Collecting and Loading Diverse Datasets

Modern tools can greatly assist in collecting and loading diverse datasets for analysis. For instance, Python libraries like pandas and numpy can be used to load a dataset into a DataFrame, which provides a flexible data structure for data manipulation and analysis. In our example, the data scientist could use pandas to load the Zomato stock price dataset from a CSV file.

2. Importing Data from Various Sources

There are several methods and tools for importing data from various sources:

o Databases: Tools like SQL Workbench/J and DBeaver can be used to connect to databases and export data.
o APIs: Libraries like requests in Python or axios in JavaScript can be used to send HTTP requests to APIs and retrieve data.
o CSV Files: Tools like Microsoft Excel or programming languages like Python (with libraries such as pandas) can be used to read data from CSV files.
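As a minimal sketch of the CSV route, note that pd.read_csv accepts any file-like object, so the same call works for a local file path, a URL, or a downloaded API response body wrapped in io.StringIO. The column names and values below are made up to mirror a typical stock-price file, not taken from the actual dataset:

```python
import io
import pandas as pd

# A small in-memory CSV standing in for a downloaded file or API response body.
csv_text = """Date,Open,High,Low,Close,Volume
2023-01-02,60.0,62.5,59.5,61.2,1000000
2023-01-03,61.2,63.0,60.8,62.4,1200000
"""

# pandas accepts any file-like object; parse_dates converts 'Date' on load.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
print(df.shape)  # (2, 6)
```

The same one-liner therefore covers both the CSV and (after fetching the text with requests) the API case.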

Fig.1 – Raw IMDB Dataset – B/W


3. Cleaning and Preprocessing Raw Data

To clean and preprocess raw data from a Zomato stock price CSV file, you typically
follow several steps. These steps aim to handle missing values, outliers, and ensure the
data is in a suitable format for analysis or modeling. Here's a general guide:

1. Load the Data: Load the Zomato stock price data from the CSV file into your
preferred data analysis environment, such as Python with pandas or R.
2. Inspect the Data: Understand the structure of the data by examining the first few
rows, data types, summary statistics, and identifying any missing values.
3. Handle Missing Values: Check for missing values in the dataset and decide on an
appropriate strategy to handle them. Options include removing rows with missing
values, imputing missing values with a specific value (e.g., mean, median), or using
more advanced techniques such as interpolation.

4. Handle Outliers: Identify outliers in the data and decide how to handle them. This
could involve removing extreme values, transforming the data, or using robust
statistical methods.
5. Convert Data Types: Ensure that each column has the correct data type. For
example, dates should be in datetime format, and numerical columns should be
numeric.
6. Feature Engineering: Create new features from the existing ones if necessary. For
stock price data, this might involve calculating moving averages, percent changes, or
other indicators.
7. Normalize or Standardize Data: Depending on the analysis or modeling techniques
you plan to use, you may need to normalize or standardize the data to ensure that all
features contribute equally.
8. Check for Duplicates: Look for and remove any duplicate rows in the dataset if they
exist.
9. Check Data Integrity: Verify that the data is consistent and makes sense. For
example, ensure that dates are in chronological order and that stock prices are
realistic.
10. Save Cleaned Data: Once the data cleaning and preprocessing steps are complete,
save the cleaned dataset to a new file for further analysis or modeling.

Here's a Python example demonstrating some of these steps using pandas:

Python Code :

import pandas as pd

# Load data
data = pd.read_csv('zomato_stock_prices.csv')

# Inspect data
print(data.head())
print(data.info())
print(data.describe())

# Handle missing values (dropping rows is one strategy; imputation is another)
data.dropna(inplace=True)

# Handle outliers (e.g., using the z-score or IQR method)

# Convert data types if necessary
data['Date'] = pd.to_datetime(data['Date'])

# Feature engineering

# Normalize or standardize data if necessary

# Check for duplicates
data.drop_duplicates(inplace=True)

# Save cleaned data
data.to_csv('cleaned_zomato_stock_prices.csv', index=False)
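The "# Handle outliers" placeholder in the script can be filled in with the IQR method: compute the interquartile range and drop values outside the 1.5 × IQR fences. A sketch on made-up prices (not the real Zomato data):

```python
import pandas as pd

# Hypothetical 'Close' prices with one obvious spike.
prices = pd.Series([100, 101, 99, 102, 100, 500, 98, 103])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only values inside the IQR fences; the 500 spike is removed.
filtered = prices[(prices >= lower) & (prices <= upper)]
print(filtered.tolist())
```

The 1.5 multiplier is the conventional default; widening it keeps more borderline points.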

Explanation: Let's break down the Python code step by step:

import pandas as pd

# Load data
data = pd.read_csv('zomato_stock_prices.csv')

This imports the pandas library and loads the Zomato stock price data from the CSV file into a pandas DataFrame called data.

# Inspect data
print(data.head())       # First few rows of the DataFrame
print(data.info())       # Data types and missing-value counts
print(data.describe())   # Summary statistics for numerical columns

These lines help to understand the structure and contents of the DataFrame. head() shows the
first few rows, info() provides information about the DataFrame including data types and
missing values, and describe() gives summary statistics such as mean, min, max, etc. for
numerical columns.

# Handle missing values
data.dropna(inplace=True)  # Dropping rows is one strategy; you might impute instead

This line drops rows with any missing values (NaN) from the DataFrame. Dropping missing
values is just one strategy; you might choose to impute missing values instead.
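For example, median imputation or linear interpolation can be sketched as follows (on hypothetical values, not the actual dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Close": [100.0, np.nan, 102.0, np.nan, 104.0]})

# Median imputation keeps every row instead of dropping it.
median_filled = df["Close"].fillna(df["Close"].median())

# Time-series data is often better served by interpolating between neighbours.
interpolated = df["Close"].interpolate()
print(median_filled.tolist())  # [100.0, 102.0, 102.0, 102.0, 104.0]
print(interpolated.tolist())   # [100.0, 101.0, 102.0, 103.0, 104.0]
```

For stock prices, interpolation usually preserves the trend better than filling every gap with one constant.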

# Convert data types if necessary
data['Date'] = pd.to_datetime(data['Date'])

This converts the 'Date' column to datetime format using pd.to_datetime(). It ensures that
the 'Date' column is treated as a date object rather than a string, which makes it easier to work
with dates.

# Check for duplicates
data.drop_duplicates(inplace=True)

This line removes any duplicate rows from the DataFrame, keeping only the first occurrence
of each unique row.

# Save cleaned data
data.to_csv('cleaned_zomato_stock_prices.csv', index=False)

Finally, this saves the cleaned DataFrame to a new CSV file called
cleaned_zomato_stock_prices.csv, without including the index column. This file can now
be used for further analysis or modeling.
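The feature-engineering step left as a comment in the script could, for instance, compute a moving average and day-over-day returns. A sketch on made-up prices:

```python
import pandas as pd

close = pd.Series([100.0, 102.0, 101.0, 105.0, 107.0])

# 3-day simple moving average; the first two entries are NaN by construction.
sma3 = close.rolling(window=3).mean()

# Day-over-day percent change (simple returns).
returns = close.pct_change()
print(sma3.tolist())
print(returns.tolist())
```

Both rolling() and pct_change() align their output with the original index, so the new columns can be assigned straight back onto the DataFrame.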


4. Data Normalization and Standardization

Data normalization and standardization are crucial steps in data preprocessing, especially in machine learning. Normalization scales numeric data to a range (usually 0-1), while standardization transforms data to have a mean of 0 and a standard deviation of 1. These techniques help ensure that different features contribute equally to a model, preventing features with larger scales from dominating it. Tools like scikit-learn in Python provide functions like MinMaxScaler for normalization and StandardScaler for standardization.
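A minimal sketch with scikit-learn, assuming a single 'Close' feature (the values are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical 'Close' prices as a single-feature column vector.
close = np.array([[100.0], [110.0], [120.0], [130.0], [140.0]])

normalized = MinMaxScaler().fit_transform(close)      # scaled to the [0, 1] range
standardized = StandardScaler().fit_transform(close)  # mean 0, standard deviation 1
print(normalized.ravel())    # [0.   0.25 0.5  0.75 1.  ]
print(standardized.ravel())
```

In practice the scaler is fit on the training split only and then applied to the test split, so no test-set information leaks into the scaling parameters.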

➢ Histogram:

You can use the matplotlib library to create a histogram for visualizing the distribution of numerical data in your DataFrame. Here's how you can display a histogram for the 'Close' prices column from your Zomato stock price dataset:

import pandas as pd
import matplotlib.pyplot as plt

# Load data
data = pd.read_csv('zomato_stock_prices.csv')

# Display histogram for the 'Close' prices
plt.figure(figsize=(10, 6))
plt.hist(data['Close'], bins=20, color='skyblue', edgecolor='black')
plt.title('Distribution of Close Prices')
plt.xlabel('Close Price')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()


Explanation of the code:

1. We import the pandas and matplotlib.pyplot libraries.
2. We load the Zomato stock price data into a pandas DataFrame called data.
3. We create a histogram using plt.hist() function. We pass the 'Close' column of the
DataFrame as the data for which we want to create the histogram.
4. The bins parameter specifies the number of bins or intervals in the histogram. You
can adjust this parameter according to your preference.
5. We set the title, xlabel, and ylabel for the histogram using plt.title(), plt.xlabel(), and
plt.ylabel() functions respectively.
6. We display grid lines using plt.grid(True).
7. Finally, we display the histogram using plt.show().

This code will generate a histogram showing the distribution of 'Close' prices from your
Zomato stock price dataset. Adjustments can be made to customize the appearance of the
histogram according to your preferences.

➢ ScatterPlot:

To create a scatter plot for the Zomato stock price dataset, you can use the matplotlib
library. Here's how you can display a scatter plot for 'Date' vs 'Close' prices:

import pandas as pd
import matplotlib.pyplot as plt

# Load data
data = pd.read_csv('zomato_stock_prices.csv')

# Convert 'Date' column to datetime
data['Date'] = pd.to_datetime(data['Date'])

# Create scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(data['Date'], data['Close'], color='blue', marker='.')
plt.title('Scatter Plot of Date vs Close Price')
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.grid(True)
plt.show()


Explanation of the code:

1. We import the pandas and matplotlib.pyplot libraries.
2. We load the Zomato stock price data into a pandas DataFrame called data.
3. We convert the 'Date' column to datetime format using pd.to_datetime() to ensure
proper plotting.
4. We create a scatter plot using plt.scatter() function. We pass the 'Date' column as the
x-axis values and the 'Close' column as the y-axis values.
5. We set the title, xlabel, and ylabel for the scatter plot using plt.title(), plt.xlabel(),
and plt.ylabel() functions respectively.
6. We display grid lines using plt.grid(True).
7. Finally, we display the scatter plot using plt.show().

This code will generate a scatter plot showing the relationship between 'Date' and 'Close'
prices from your Zomato stock price dataset. Adjustments can be made to customize the
appearance of the scatter plot according to your preferences.

➢ PairPlot:

You can use the seaborn library to create a pair plot for visualizing relationships
between multiple variables in your Zomato stock price dataset. Here's how you can display a
pair plot:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load data
data = pd.read_csv('zomato_stock_prices.csv')


# Convert 'Date' column to datetime
data['Date'] = pd.to_datetime(data['Date'])

# Select numerical columns for the pair plot
numerical_columns = ['Open', 'High', 'Low', 'Close', 'Volume']

# Create pair plot
sns.pairplot(data[numerical_columns])
plt.suptitle('Pair Plot of Numerical Variables', y=1.02)
plt.show()

Explanation of the code:

1. We import the pandas, seaborn, and matplotlib.pyplot libraries.
2. We load the Zomato stock price data into a pandas DataFrame called data.
3. We convert the 'Date' column to datetime format using pd.to_datetime() to ensure
proper plotting.
4. We select the numerical columns for which we want to create the pair plot. In this
case, we select 'Open', 'High', 'Low', 'Close', and 'Volume'.
5. We create a pair plot using sns.pairplot() function. We pass the selected numerical
columns as the data for which we want to create the pair plot.
6. We set the main title for the pair plot using plt.suptitle().
7. Finally, we display the pair plot using plt.show().

This code will generate a pair plot showing the relationships between numerical variables
('Open', 'High', 'Low', 'Close', 'Volume') from your Zomato stock price dataset. Adjustments
can be made to customize the appearance of the pair plot according to your preferences.


Cleaning the dataset can significantly impact the visualizations derived from it. Here's how the visualizations might change after cleaning the Zomato dataset:

1. Histogram of Close Prices:


- Before Cleaning: The histogram may show irregular spikes or gaps due to missing
or incorrect data points.
- After Cleaning: The histogram will likely show a smoother distribution without
irregularities, providing a clearer picture of the distribution of close prices.

2. Scatter Plot of Date vs Close Prices:


- Before Cleaning: There might be outliers or missing values in the data, resulting
in gaps or unusual patterns in the scatter plot.
- After Cleaning: The scatter plot will likely show a clearer trend without outliers or
missing values, making it easier to identify any patterns or relationships between date
and close prices.
3. Pair Plot of Numerical Variables:
- Before Cleaning: There may be inconsistencies or anomalies in the data, leading
to misleading relationships between variables in the pair plot.
- After Cleaning: The pair plot will likely show more meaningful relationships
between numerical variables after removing inconsistencies or anomalies, providing
better insights into the dataset.

4. Other Visualizations:
- Line plots, box plots, and heatmaps are other common visualizations used for
analyzing stock price data.
- Before Cleaning: These visualizations may show unexpected patterns, outliers, or
missing data points.
- After Cleaning: These visualizations will provide more accurate and meaningful
insights into the dataset, enabling better analysis and decision-making.

Overall, cleaning the Zomato dataset will lead to more accurate and reliable
visualizations, allowing analysts and stakeholders to make better-informed decisions
based on the data.


Project Summary:
Title: Exploratory Data Analysis of Zomato Dataset
Objective:
The objective of this project was to perform exploratory data analysis
(EDA) on the Zomato dataset using modern tools and techniques. This
involved collecting, loading, cleaning, preprocessing, and visualizing the
data to derive insights and understand its characteristics.
Methods and Tools Used:
1. Data Collection and Loading:

The dataset was loaded into a DataFrame using Python libraries like
pandas and numpy. The Zomato dataset was chosen as an example
for analysis.
2. Data Importing from Various Sources:

• Databases: Tools like SQL Workbench/J and DBeaver were mentioned for connecting to databases.
• APIs: Libraries like requests in Python or axios in JavaScript were cited for accessing data through APIs.
• CSV Files: Python libraries such as pandas were used to read data from CSV files.
3. Data Cleaning and Preprocessing:

Tools like OpenRefine, Trifacta Wrangler, and pandas were employed for cleaning and preprocessing raw data. Techniques like handling missing values, removing outliers, and standardizing data were applied.

4. Data Normalization and Standardization:

The significance of data normalization and standardization in data preprocessing, especially for machine learning, was discussed. Tools like scikit-learn in Python were mentioned for implementing normalization and standardization techniques.
5. Visualization:

The impact of cleaning the dataset on visualizations was highlighted through examples such as histograms, scatter plots, and pair plots.
Project Outcome:
• The project resulted in a cleaned and preprocessed version of the
Zomato dataset, ready for further analysis or modeling.
• Visualizations were used to demonstrate the impact of data cleaning on the distribution and relationships within the dataset.


• The project showcased the importance of modern tools and
techniques in exploring and understanding diverse datasets for
insightful analysis.
Conclusion:
Through this project, the student demonstrated proficiency in using
modern tools and techniques for exploratory data analysis. By applying
methods for data collection, importing, cleaning, preprocessing, and
visualization, valuable insights were gained from the Zomato dataset. The
project highlighted the significance of data quality and preprocessing in
ensuring accurate analysis and modeling. Overall, it was a comprehensive
exploration of data characteristics using contemporary approaches.

