Assvid
Assignment – I
CO1 : Make use of modern tools to explore the data and its characteristics. (K3 -
Apply)
Marks Split-up :
Q. No.   Description           Allotted Marks (20)   Given Marks (20)
1        Content & Report      15
2        Originality           2
3        Presentation          3
4        Communication         2
5        On-time Submission    3
         Total                 (25)
2. Importing Data from Various Sources
There are several methods and tools for importing data from various sources:
Vidya Janani V – 21ITA24
To clean and preprocess raw data from a Zomato stock price CSV file, you typically
follow several steps. These steps handle missing values and outliers and ensure that the
data is in a suitable format for analysis or modeling. Here's a general guide:
1. Load the Data: Load the Zomato stock price data from the CSV file into your
preferred data analysis environment, such as Python with pandas or R.
2. Inspect the Data: Understand the structure of the data by examining the first few
rows, data types, summary statistics, and identifying any missing values.
3. Handle Missing Values: Check for missing values in the dataset and decide on an
appropriate strategy to handle them. Options include removing rows with missing
values, imputing missing values with a specific value (e.g., mean, median), or using
more advanced techniques such as interpolation.
4. Handle Outliers: Identify outliers in the data and decide how to handle them. This
could involve removing extreme values, transforming the data, or using robust
statistical methods.
5. Convert Data Types: Ensure that each column has the correct data type. For
example, dates should be in datetime format, and numerical columns should be
numeric.
6. Feature Engineering: Create new features from the existing ones if necessary. For
stock price data, this might involve calculating moving averages, percent changes, or
other indicators.
7. Normalize or Standardize Data: Depending on the analysis or modeling techniques
you plan to use, you may need to normalize or standardize the data so that features
measured on different scales (for example, price and volume) contribute comparably.
8. Check for Duplicates: Look for and remove any duplicate rows in the dataset if they
exist.
9. Check Data Integrity: Verify that the data is consistent and makes sense. For
example, ensure that dates are in chronological order and that stock prices are
realistic.
10. Save Cleaned Data: Once the data cleaning and preprocessing steps are complete,
save the cleaned dataset to a new file for further analysis or modeling.
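As an illustration of steps 3 and 9, the following is a minimal sketch of two imputation strategies and some basic integrity checks. It runs on a small synthetic DataFrame whose values are made up; the 'Volume' column and the choice of median imputation are assumptions for the example, not properties of the actual Zomato file.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the stock price file; all values are made up.
data = pd.DataFrame({
    "Date": ["2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05", "2023-01-06"],
    "Close": [60.0, np.nan, 62.0, 61.0, np.nan],
    "Volume": [1000.0, 1200.0, np.nan, 900.0, 1100.0],
})
data["Date"] = pd.to_datetime(data["Date"])
data = data.sort_values("Date").reset_index(drop=True)

# Step 3, option A: impute a column with its median.
data["Volume"] = data["Volume"].fillna(data["Volume"].median())

# Step 3, option B: linearly interpolate gaps that lie between known prices,
# then forward-fill the gap left at the end of the series.
data["Close"] = data["Close"].interpolate(method="linear", limit_area="inside").ffill()

# Step 9: basic integrity checks.
dates_in_order = data["Date"].is_monotonic_increasing
prices_positive = bool((data["Close"] > 0).all())

print(data)
print("Dates in order:", dates_in_order, "| Prices positive:", prices_positive)
```

Interpolation tends to suit price series better than a global mean, because it respects the ordering of the observations; which strategy is appropriate depends on how the gaps arose.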
Python Code :
import pandas as pd
# Load data
data = pd.read_csv('zomato_stock_prices.csv')
# Inspect data
print(data.head())
data.info() # info() prints its own report and returns None
print(data.describe())
# Handle missing values
data.dropna(inplace=True) # Drop rows with missing values; you might choose another strategy
# Handle outliers (e.g., using z-score or IQR method)
# Convert data types if necessary
data['Date'] = pd.to_datetime(data['Date'])
# Feature engineering
# Normalize or standardize data if necessary
# Check for duplicates
data.drop_duplicates(inplace=True)
# Save cleaned data
data.to_csv('cleaned_zomato_stock_prices.csv',
index=False)
Explanation : The provided Python code is broken down step by step below:
import pandas as pd
# Load data
data = pd.read_csv('zomato_stock_prices.csv')
This imports the pandas library and loads the Zomato stock price data from the CSV file into
a pandas DataFrame called data.
# Inspect data
print(data.head()) # Print the first few rows of the DataFrame
data.info() # Print column data types and non-null counts
print(data.describe()) # Generate summary statistics for numerical columns in the DataFrame
These lines help to understand the structure and contents of the DataFrame. head() shows the
first few rows, info() reports each column's data type and non-null count (from which
missing values can be inferred), and describe() gives summary statistics such as mean, min,
and max for numerical columns. info() prints its report itself and returns None, so it does
not need to be wrapped in print().
data.dropna(inplace=True) drops rows with any missing values (NaN) from the DataFrame.
Dropping missing values is just one strategy; you might choose to impute missing values instead.
data['Date'] = pd.to_datetime(data['Date']) converts the 'Date' column to datetime format.
It ensures that the 'Date' column is treated as a date object rather than a string, which
makes it easier to work with dates.
data.drop_duplicates(inplace=True) removes any duplicate rows from the DataFrame, keeping
only the first occurrence of each unique row.
Finally, data.to_csv('cleaned_zomato_stock_prices.csv', index=False) saves the cleaned
DataFrame to a new CSV file called cleaned_zomato_stock_prices.csv, without including the
index column. This file can now be used for further analysis or modeling.
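The script leaves the outlier, feature engineering, and normalization steps as comments. The following is a minimal sketch of one way to fill them in, assuming a 'Close' column and using the 1.5 × IQR rule, a 3-day moving average, and min-max scaling. The window size, the new column names (MA_3, Pct_Change, Close_Scaled), and the synthetic values are illustrative choices, not part of the original code.

```python
import pandas as pd

# Synthetic stand-in for the cleaned data; all values are made up.
data = pd.DataFrame({
    "Date": pd.date_range("2023-01-02", periods=8, freq="D"),
    "Close": [60.0, 61.0, 59.0, 62.0, 300.0, 61.0, 63.0, 62.0],
})

# Outliers: keep rows whose 'Close' lies within 1.5 * IQR of the quartiles.
q1, q3 = data["Close"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = data["Close"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
data = data[in_range].reset_index(drop=True)

# Feature engineering: 3-day moving average and daily percent change.
data["MA_3"] = data["Close"].rolling(window=3).mean()
data["Pct_Change"] = data["Close"].pct_change() * 100

# Normalization: min-max scaling of 'Close' to the range [0, 1].
close_min, close_max = data["Close"].min(), data["Close"].max()
data["Close_Scaled"] = (data["Close"] - close_min) / (close_max - close_min)

print(data)
```

The first rows of MA_3 and Pct_Change are NaN because a rolling window or a difference needs earlier observations. Removing a row also means the percent change after it spans more than one day, so for real price data it may be preferable to flag outliers rather than drop them.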
➢ Histogram:
You can use the matplotlib library to create a histogram for visualizing the
distribution of numerical data in your DataFrame. Here's how you can display a histogram for
the 'Close' prices column from your Zomato stock price dataset:
import pandas as pd
import matplotlib.pyplot as plt
# Load data
data = pd.read_csv('zomato_stock_prices.csv')
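The plotting calls that follow the load step can be sketched as below. This is a minimal, self-contained version that uses synthetic 'Close' values in place of the CSV so it runs anywhere; the bin count, figure size, and labels are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the 'Close' column; replace with the DataFrame loaded from the CSV.
rng = np.random.default_rng(0)
data = pd.DataFrame({"Close": 60 + rng.normal(0, 5, size=200)})

# Histogram of closing prices: each bar counts the days whose price falls in that bin.
fig, ax = plt.subplots(figsize=(8, 5))
counts, bin_edges, _ = ax.hist(data["Close"], bins=20, edgecolor="black")
ax.set_xlabel("Close price")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of Close prices")
# plt.show()  # uncomment to display the figure in an interactive session
```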
This code will generate a histogram showing the distribution of 'Close' prices from your
Zomato stock price dataset. Adjustments can be made to customize the appearance of the
histogram according to your preferences.
➢ ScatterPlot:
To create a scatter plot for the Zomato stock price dataset, you can use the matplotlib
library. Here's how you can display a scatter plot for 'Date' vs 'Close' prices:
import pandas as pd
import matplotlib.pyplot as plt
# Load data
data = pd.read_csv('zomato_stock_prices.csv')
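The plotting calls that follow the load step can be sketched as below. The example builds a synthetic 'Date' and 'Close' series so that it is self-contained; marker size, transparency, and labels are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the stock data; all values are made up.
rng = np.random.default_rng(1)
data = pd.DataFrame({
    "Date": pd.date_range("2023-01-02", periods=60, freq="B"),
    "Close": 60 + rng.normal(0, 1, size=60).cumsum(),
})
# With the real CSV, parse the dates first:
# data['Date'] = pd.to_datetime(data['Date'])

# Scatter plot of closing price against date.
fig, ax = plt.subplots(figsize=(10, 5))
points = ax.scatter(data["Date"], data["Close"], s=12, alpha=0.7)
ax.set_xlabel("Date")
ax.set_ylabel("Close price")
ax.set_title("Close price over time")
fig.autofmt_xdate()  # rotate the date labels so they do not overlap
# plt.show()  # uncomment to display the figure in an interactive session
```

Parsing 'Date' as datetime before plotting matters: if the column is left as strings, matplotlib treats each date as a separate category instead of a point on a time axis.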
This code will generate a scatter plot showing the relationship between 'Date' and 'Close'
prices from your Zomato stock price dataset. Adjustments can be made to customize the
appearance of the scatter plot according to your preferences.
➢ PairPlot:
You can use the seaborn library to create a pair plot for visualizing relationships
between multiple variables in your Zomato stock price dataset. Here's how you can display a
pair plot:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load data
data = pd.read_csv('zomato_stock_prices.csv')
This code will generate a pair plot showing the relationships between numerical variables
('Open', 'High', 'Low', 'Close', 'Volume') from your Zomato stock price dataset. Adjustments
can be made to customize the appearance of the pair plot according to your preferences.
4. Other Visualizations:
- Line plots, box plots, and heatmaps are other common visualizations used for
analyzing stock price data.
- Before Cleaning: These visualizations may show unexpected patterns, outliers, or
missing data points.
- After Cleaning: These visualizations will provide more accurate and meaningful
insights into the dataset, enabling better analysis and decision-making.
Overall, cleaning the Zomato dataset will lead to more accurate and reliable
visualizations, allowing analysts and stakeholders to make better-informed decisions
based on the data.
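As a rough illustration of the three plot types named above, the following sketch draws a line plot, a box plot, and a correlation heatmap side by side with matplotlib. The data is synthetic and the column selection, colour map, and layout are assumptions made for the example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the cleaned stock data; all values are made up.
rng = np.random.default_rng(3)
close = 60 + rng.normal(0, 1, size=120).cumsum()
data = pd.DataFrame({
    "Date": pd.date_range("2023-01-02", periods=120, freq="B"),
    "Open": close + rng.normal(0, 0.5, size=120),
    "Close": close,
    "Volume": rng.integers(1_000, 5_000, size=120).astype(float),
})

fig, (ax_line, ax_box, ax_heat) = plt.subplots(1, 3, figsize=(15, 4))

# Line plot: closing price as a continuous series over time.
ax_line.plot(data["Date"], data["Close"])
ax_line.set_title("Close price over time")

# Box plot: median, quartiles, and points beyond the whiskers.
ax_box.boxplot(data["Close"].to_numpy())
ax_box.set_title("Spread of Close prices")

# Heatmap: pairwise Pearson correlation between numerical columns.
corr = data[["Open", "Close", "Volume"]].corr()
image = ax_heat.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax_heat.set_xticks(range(len(corr.columns)))
ax_heat.set_xticklabels(corr.columns)
ax_heat.set_yticks(range(len(corr.columns)))
ax_heat.set_yticklabels(corr.columns)
ax_heat.set_title("Correlation heatmap")
fig.colorbar(image, ax=ax_heat)
fig.tight_layout()
# plt.show()  # uncomment to display the figure in an interactive session
```

Running the same plots on the raw and on the cleaned DataFrame is a simple way to see the before and after difference: the box plot in particular makes remaining outliers visible as isolated points.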
Project Summary:
Title: Exploratory Data Analysis of Zomato Dataset
Objective:
The objective of this project was to perform exploratory data analysis
(EDA) on the Zomato dataset using modern tools and techniques. This
involved collecting, loading, cleaning, preprocessing, and visualizing the
data to derive insights and understand its characteristics.
Methods and Tools Used:
1. Data Collection and Loading:
The dataset was loaded into a DataFrame using Python libraries like
pandas and numpy. The Zomato dataset was chosen as an example
for analysis.
2. Data Importing from Various Sources: