
Department of Information Technology

Assignment – I

Academic Year : 2023 - 2024 Batch : 2021 - 2025

Year/Semester/Section : III/VI/A Regulation : 2021


Course Code/Name : 21PCS02 / Exploratory Data Analysis    Total Marks : 25

Submission Date : 23.02.2024    Given Date : 04.03.2024

CO1 : Make use of modern tools to explore the data and its characteristics. (K3 -
Apply)

Name of the Student : Vidya Janani V


Roll Number : 21ITA24

Marks Split-up :

Q. No.   Description            Allotted Marks (20)   Given Marks (20)
1        Content & Report       15
2        Originality            2
3        Presentation           3
4        Communication          2
5        On-time Submission     3
         Total (25)

Student Signature Faculty Signature


Vidya Janani V – 21ITA24

1. How can modern tools assist in collecting and loading diverse datasets for analysis? Explain this with a selected real-time example of your choice.
2. For the selected real-time example,
   a. Explore and explain the different methods and tools for importing data from various sources (e.g., databases, APIs, CSV files).
   b. What tools can be employed to clean and preprocess raw data, handling missing values and outliers?
   c. Discuss the significance of data normalization and standardization using modern techniques.
Solution :
1. Collecting and Loading Diverse Datasets

Modern tools can greatly assist in collecting and loading diverse datasets for analysis. For instance, Python libraries like pandas and numpy can be used to load a dataset into a DataFrame, which provides a flexible data structure for data manipulation and analysis. In our example, the data scientist could use pandas to load the Zomato stock price dataset from a CSV file.

2. Importing Data from Various Sources

There are several methods and tools for importing data from various sources:

o Databases: Tools like SQL Workbench/J and DBeaver can be used to connect to databases and export data.
o APIs: Libraries like requests in Python or axios in JavaScript can be used to send HTTP requests to APIs and retrieve data.
o CSV Files: Tools like Microsoft Excel or programming languages like Python (with libraries such as pandas) can be used to read data from CSV files.
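As a minimal sketch of the CSV route, note that pd.read_csv accepts any file-like object, so the same call works for a local file path, a URL, or a downloaded API response body wrapped in io.StringIO. The column names and values below are made up to mirror a typical stock-price file, not taken from the actual dataset:

```python
import io
import pandas as pd

# A small in-memory CSV standing in for a downloaded file or API response body.
csv_text = """Date,Open,High,Low,Close,Volume
2023-01-02,60.0,62.5,59.5,61.2,1000000
2023-01-03,61.2,63.0,60.8,62.4,1200000
"""

# pandas accepts any file-like object; parse_dates converts 'Date' on load.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
print(df.shape)  # (2, 6)
```

The same one-liner therefore covers both the CSV and (after fetching the text with requests) the API case.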

Fig.1 – Raw IMDB Dataset – B/W


3. Cleaning and Preprocessing Raw Data

To clean and preprocess raw data from a Zomato stock price CSV file, you typically
follow several steps. These steps aim to handle missing values, outliers, and ensure the
data is in a suitable format for analysis or modeling. Here's a general guide:

1. Load the Data: Load the Zomato stock price data from the CSV file into your
preferred data analysis environment, such as Python with pandas or R.
2. Inspect the Data: Understand the structure of the data by examining the first few
rows, data types, summary statistics, and identifying any missing values.
3. Handle Missing Values: Check for missing values in the dataset and decide on an
appropriate strategy to handle them. Options include removing rows with missing
values, imputing missing values with a specific value (e.g., mean, median), or using
more advanced techniques such as interpolation.

4. Handle Outliers: Identify outliers in the data and decide how to handle them. This
could involve removing extreme values, transforming the data, or using robust
statistical methods.
5. Convert Data Types: Ensure that each column has the correct data type. For
example, dates should be in datetime format, and numerical columns should be
numeric.
6. Feature Engineering: Create new features from the existing ones if necessary. For
stock price data, this might involve calculating moving averages, percent changes, or
other indicators.
7. Normalize or Standardize Data: Depending on the analysis or modeling techniques
you plan to use, you may need to normalize or standardize the data to ensure that all
features contribute equally.
8. Check for Duplicates: Look for and remove any duplicate rows in the dataset if they
exist.
9. Check Data Integrity: Verify that the data is consistent and makes sense. For
example, ensure that dates are in chronological order and that stock prices are
realistic.
10. Save Cleaned Data: Once the data cleaning and preprocessing steps are complete,
save the cleaned dataset to a new file for further analysis or modeling.

Here's a Python example demonstrating some of these steps using pandas:

Python Code :

import pandas as pd

# Load data
data = pd.read_csv('zomato_stock_prices.csv')

# Inspect data
print(data.head())
print(data.info())
print(data.describe())

# Handle missing values (dropping rows is one strategy; imputation is another)
data.dropna(inplace=True)

# Handle outliers (e.g., using the z-score or IQR method)

# Convert data types if necessary
data['Date'] = pd.to_datetime(data['Date'])

# Feature engineering

# Normalize or standardize data if necessary

# Check for duplicates
data.drop_duplicates(inplace=True)

# Save cleaned data
data.to_csv('cleaned_zomato_stock_prices.csv', index=False)
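The "# Handle outliers" placeholder in the script can be filled in with the IQR method: compute the interquartile range and drop values outside the 1.5 × IQR fences. A sketch on made-up prices (not the real Zomato data):

```python
import pandas as pd

# Hypothetical 'Close' prices with one obvious spike.
prices = pd.Series([100, 101, 99, 102, 100, 500, 98, 103])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only values inside the IQR fences; the 500 spike is removed.
filtered = prices[(prices >= lower) & (prices <= upper)]
print(filtered.tolist())
```

The 1.5 multiplier is the conventional default; widening it keeps more borderline points.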

Explanation: Let's break down the Python code step by step:

import pandas as pd

# Load data
data = pd.read_csv('zomato_stock_prices.csv')

This imports the pandas library and loads the Zomato stock price data from the CSV file into a pandas DataFrame called data.

# Inspect data
print(data.head())       # First few rows of the DataFrame
print(data.info())       # Data types and missing-value counts
print(data.describe())   # Summary statistics for numerical columns

These lines help to understand the structure and contents of the DataFrame. head() shows the
first few rows, info() provides information about the DataFrame including data types and
missing values, and describe() gives summary statistics such as mean, min, max, etc. for
numerical columns.

# Handle missing values
data.dropna(inplace=True)  # Dropping rows is one strategy; you might impute instead

This line drops rows with any missing values (NaN) from the DataFrame. Dropping missing
values is just one strategy; you might choose to impute missing values instead.
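For example, median imputation or linear interpolation can be sketched as follows (on hypothetical values, not the actual dataset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Close": [100.0, np.nan, 102.0, np.nan, 104.0]})

# Median imputation keeps every row instead of dropping it.
median_filled = df["Close"].fillna(df["Close"].median())

# Time-series data is often better served by interpolating between neighbours.
interpolated = df["Close"].interpolate()
print(median_filled.tolist())  # [100.0, 102.0, 102.0, 102.0, 104.0]
print(interpolated.tolist())   # [100.0, 101.0, 102.0, 103.0, 104.0]
```

For stock prices, interpolation usually preserves the trend better than filling every gap with one constant.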

# Convert data types if necessary
data['Date'] = pd.to_datetime(data['Date'])

This converts the 'Date' column to datetime format using pd.to_datetime(). It ensures that
the 'Date' column is treated as a date object rather than a string, which makes it easier to work
with dates.

# Check for duplicates
data.drop_duplicates(inplace=True)

This line removes any duplicate rows from the DataFrame, keeping only the first occurrence
of each unique row.

# Save cleaned data
data.to_csv('cleaned_zomato_stock_prices.csv', index=False)

Finally, this saves the cleaned DataFrame to a new CSV file called
cleaned_zomato_stock_prices.csv, without including the index column. This file can now
be used for further analysis or modeling.
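The feature-engineering step left as a comment in the script could, for instance, compute a moving average and day-over-day returns. A sketch on made-up prices:

```python
import pandas as pd

close = pd.Series([100.0, 102.0, 101.0, 105.0, 107.0])

# 3-day simple moving average; the first two entries are NaN by construction.
sma3 = close.rolling(window=3).mean()

# Day-over-day percent change (simple returns).
returns = close.pct_change()
print(sma3.tolist())
print(returns.tolist())
```

Both rolling() and pct_change() align their output with the original index, so the new columns can be assigned straight back onto the DataFrame.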


4. Data Normalization and Standardization

Data normalization and standardization are crucial steps in data preprocessing, especially in machine learning. Normalization scales numeric data to a range (usually 0-1), while standardization transforms data to have a mean of 0 and a standard deviation of 1. These techniques help ensure that different features contribute equally to a model, preventing features with larger scales from dominating it. Tools like scikit-learn in Python provide functions like MinMaxScaler for normalization and StandardScaler for standardization.
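A minimal sketch with scikit-learn, assuming a single 'Close' feature (the values are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical 'Close' prices as a single-feature column vector.
close = np.array([[100.0], [110.0], [120.0], [130.0], [140.0]])

normalized = MinMaxScaler().fit_transform(close)      # scaled to the [0, 1] range
standardized = StandardScaler().fit_transform(close)  # mean 0, standard deviation 1
print(normalized.ravel())    # [0.   0.25 0.5  0.75 1.  ]
print(standardized.ravel())
```

In practice the scaler is fit on the training split only and then applied to the test split, so no test-set information leaks into the scaling parameters.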

➢ Histogram:

You can use the matplotlib library to create a histogram for visualizing the distribution of numerical data in your DataFrame. Here's how you can display a histogram for the 'Close' prices column from your Zomato stock price dataset:

import pandas as pd
import matplotlib.pyplot as plt

# Load data
data = pd.read_csv('zomato_stock_prices.csv')

# Display histogram for the 'Close' prices
plt.figure(figsize=(10, 6))
plt.hist(data['Close'], bins=20, color='skyblue', edgecolor='black')
plt.title('Distribution of Close Prices')
plt.xlabel('Close Price')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()


Explanation of the code:

1. We import the pandas and matplotlib.pyplot libraries.
2. We load the Zomato stock price data into a pandas DataFrame called data.
3. We create a histogram using plt.hist() function. We pass the 'Close' column of the
DataFrame as the data for which we want to create the histogram.
4. The bins parameter specifies the number of bins or intervals in the histogram. You
can adjust this parameter according to your preference.
5. We set the title, xlabel, and ylabel for the histogram using plt.title(), plt.xlabel(), and
plt.ylabel() functions respectively.
6. We display grid lines using plt.grid(True).
7. Finally, we display the histogram using plt.show().

This code will generate a histogram showing the distribution of 'Close' prices from your
Zomato stock price dataset. Adjustments can be made to customize the appearance of the
histogram according to your preferences.

➢ ScatterPlot:

To create a scatter plot for the Zomato stock price dataset, you can use the matplotlib
library. Here's how you can display a scatter plot for 'Date' vs 'Close' prices:

import pandas as pd
import matplotlib.pyplot as plt

# Load data
data = pd.read_csv('zomato_stock_prices.csv')

# Convert 'Date' column to datetime
data['Date'] = pd.to_datetime(data['Date'])

# Create scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(data['Date'], data['Close'], color='blue', marker='.')
plt.title('Scatter Plot of Date vs Close Price')
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.grid(True)
plt.show()


Explanation of the code:

1. We import the pandas and matplotlib.pyplot libraries.
2. We load the Zomato stock price data into a pandas DataFrame called data.
3. We convert the 'Date' column to datetime format using pd.to_datetime() to ensure
proper plotting.
4. We create a scatter plot using plt.scatter() function. We pass the 'Date' column as the
x-axis values and the 'Close' column as the y-axis values.
5. We set the title, xlabel, and ylabel for the scatter plot using plt.title(), plt.xlabel(),
and plt.ylabel() functions respectively.
6. We display grid lines using plt.grid(True).
7. Finally, we display the scatter plot using plt.show().

This code will generate a scatter plot showing the relationship between 'Date' and 'Close'
prices from your Zomato stock price dataset. Adjustments can be made to customize the
appearance of the scatter plot according to your preferences.

➢ PairPlot:

You can use the seaborn library to create a pair plot for visualizing relationships
between multiple variables in your Zomato stock price dataset. Here's how you can display a
pair plot:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load data
data = pd.read_csv('zomato_stock_prices.csv')


# Convert 'Date' column to datetime
data['Date'] = pd.to_datetime(data['Date'])

# Select numerical columns for the pair plot
numerical_columns = ['Open', 'High', 'Low', 'Close', 'Volume']

# Create pair plot
sns.pairplot(data[numerical_columns])
plt.suptitle('Pair Plot of Numerical Variables', y=1.02)
plt.show()

Explanation of the code:

1. We import the pandas, seaborn, and matplotlib.pyplot libraries.
2. We load the Zomato stock price data into a pandas DataFrame called data.
3. We convert the 'Date' column to datetime format using pd.to_datetime() to ensure
proper plotting.
4. We select the numerical columns for which we want to create the pair plot. In this
case, we select 'Open', 'High', 'Low', 'Close', and 'Volume'.
5. We create a pair plot using sns.pairplot() function. We pass the selected numerical
columns as the data for which we want to create the pair plot.
6. We set the main title for the pair plot using plt.suptitle().
7. Finally, we display the pair plot using plt.show().

This code will generate a pair plot showing the relationships between numerical variables
('Open', 'High', 'Low', 'Close', 'Volume') from your Zomato stock price dataset. Adjustments
can be made to customize the appearance of the pair plot according to your preferences.


Cleaning the dataset can significantly impact the visualizations derived from it. Here's how the visualizations might change after cleaning the Zomato dataset:

1. Histogram of Close Prices:


- Before Cleaning: The histogram may show irregular spikes or gaps due to missing
or incorrect data points.
- After Cleaning: The histogram will likely show a smoother distribution without
irregularities, providing a clearer picture of the distribution of close prices.

2. Scatter Plot of Date vs Close Prices:


- Before Cleaning: There might be outliers or missing values in the data, resulting
in gaps or unusual patterns in the scatter plot.
- After Cleaning: The scatter plot will likely show a clearer trend without outliers or
missing values, making it easier to identify any patterns or relationships between date
and close prices.
3. Pair Plot of Numerical Variables:
- Before Cleaning: There may be inconsistencies or anomalies in the data, leading
to misleading relationships between variables in the pair plot.
- After Cleaning: The pair plot will likely show more meaningful relationships
between numerical variables after removing inconsistencies or anomalies, providing
better insights into the dataset.

4. Other Visualizations:
- Line plots, box plots, and heatmaps are other common visualizations used for
analyzing stock price data.
- Before Cleaning: These visualizations may show unexpected patterns, outliers, or
missing data points.
- After Cleaning: These visualizations will provide more accurate and meaningful
insights into the dataset, enabling better analysis and decision-making.

Overall, cleaning the Zomato dataset will lead to more accurate and reliable
visualizations, allowing analysts and stakeholders to make better-informed decisions
based on the data.


Project Summary:
Title: Exploratory Data Analysis of Zomato Dataset
Objective:
The objective of this project was to perform exploratory data analysis
(EDA) on the Zomato dataset using modern tools and techniques. This
involved collecting, loading, cleaning, preprocessing, and visualizing the
data to derive insights and understand its characteristics.
Methods and Tools Used:
1. Data Collection and Loading:

The dataset was loaded into a DataFrame using Python libraries like
pandas and numpy. The Zomato dataset was chosen as an example
for analysis.
2. Data Importing from Various Sources:

• Databases: Tools like SQL Workbench/J and DBeaver were mentioned for connecting to databases.
• APIs: Libraries like requests in Python or axios in JavaScript were cited for accessing data through APIs.
• CSV Files: Python libraries such as pandas were used to read data from CSV files.
3. Data Cleaning and Preprocessing:

Tools like OpenRefine, Trifacta Wrangler, and pandas were employed for cleaning and preprocessing raw data. Techniques like handling missing values, removing outliers, and standardizing data were applied.

4. Data Normalization and Standardization:

The significance of data normalization and standardization in data preprocessing, especially for machine learning, was discussed. Tools like scikit-learn in Python were mentioned for implementing normalization and standardization techniques.
5. Visualization:

The impact of cleaning the dataset on visualizations was highlighted through examples such as histograms, scatter plots, and pair plots.
Project Outcome:
• The project resulted in a cleaned and preprocessed version of the
Zomato dataset, ready for further analysis or modeling.
• Visualizations were used to demonstrate the impact of data cleaning on the distribution and relationships within the dataset.


• The project showcased the importance of modern tools and
techniques in exploring and understanding diverse datasets for
insightful analysis.
Conclusion:
Through this project, the student demonstrated proficiency in using
modern tools and techniques for exploratory data analysis. By applying
methods for data collection, importing, cleaning, preprocessing, and
visualization, valuable insights were gained from the Zomato dataset. The
project highlighted the significance of data quality and preprocessing in
ensuring accurate analysis and modeling. Overall, it was a comprehensive
exploration of data characteristics using contemporary approaches.

