Dav - Lab Manual
NAME :
REGISTERNO :
VHNO :
YEAR : III
SEMESTER : VI
CERTIFICATE
NAME: ………………………………………………………………………………………………………….
Certified that this is the bonafide record of work done by the above student in 21AI65IT –
DATA ANALYSIS AND VISUALIZATION LABORATORY during the academic year
2023 – 2024.
Signature of Examiners:
To impart the attributes of global engineers to face industrial challenges with social
relevance.
activities.
To empower the students with ethical values and social responsibilities in their
profession.
PEO1: Exhibit professional skills to design, develop and test software systems for real-time needs.
PEO2: Excel as a software professional or entrepreneur.
PEO3: Demonstrate a sense of societal and ethical responsibilities in their profession.
PROGRAMME OUTCOMES (POs)
PO1: Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and
an engineering specialization to the solution of complex engineering problems.
PO2: Problem analysis: Identify, formulate, review research literature, and analyze complex engineering
problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and
engineering sciences.
PO3: Design/development of solutions: Design solutions for complex engineering problems and design
system components or processes that meet the specified needs with appropriate consideration for the public
health and safety, and the cultural, societal, and environmental considerations.
PO4: Conduct investigations of complex problems: Use research-based knowledge and research methods
including design of experiments, analysis and interpretation of data, and synthesis of the information to provide
valid conclusions.
PO5: Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering
and IT tools including prediction and modeling to complex engineering activities with an understanding of the
limitations.
PO6: The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal,
health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional
engineering practice.
PO7: Environment and sustainability: Understand the impact of the professional engineering solutions in
societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable development.
PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the
engineering practice.
PO9: Individual and team work: Function effectively as an individual, and as a member or leader in
diverse teams, and in multidisciplinary settings.
PO10: Communication: Communicate effectively on complex engineering activities with the engineering
community and with society at large, such as, being able to comprehend and write effective reports and design
documentation, make effective presentations, and give and receive clear instructions.
PO11: Project management and finance: Demonstrate knowledge and understanding of the engineering and
management principles and apply these to one’s own work, as a member and leader in a team, to manage projects
and in multidisciplinary environments.
PO12: Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.
COURSE OBJECTIVES:
PREREQUISITE:
COURSE OUTCOMES:
COURSE OUTCOMES MAPPING WITH PROGRAM OUTCOMES AND PROGRAM SPECIFIC OUTCOMES
CO No.   PO-1  PO-2  PO-3  PO-4  PO-5  PO-6  PO-7  PO-8  PO-9  PO-10  PO-11  PO-12  PSO-1  PSO-2
C604.1   2     2     1     -     3     -     -     -     -     2      -      -      2      3
C604.2   2     2     1     -     3     -     -     -     -     2      -      -      2      3
C604.3   2     2     1     -     3     -     -     -     -     2      -      -      2      3
C604.4   3     3     2     -     3     -     -     -     -     2      -      -      2      3
C604.5   3     3     2     -     3     -     -     -     -     2      -      -      2      3
AIM:
Analyze sales data using NumPy tools and arrays, including calculating basic statistics,
identifying months with above-average sales, and determining month-over-month sales
growth.
ALGORITHM:
1. Import the NumPy library.
2. Define the sample sales data array for 12 months.
3. Calculate total sales using np.sum.
4. Calculate average sales using np.mean.
5. Find maximum sales using np.max.
6. Find minimum sales using np.min.
7. Identify months with above-average sales using np.where.
8. Calculate month-over-month sales growth using np.diff.
9. Calculate the average monthly growth using np.mean.
10. Print the total sales, average monthly sales, maximum monthly sales, minimum monthly
sales, months with above-average sales, and average monthly sales growth.
PROGRAM:
import numpy as np
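Only the import line of this program survives. The following is a minimal sketch of the steps listed in the algorithm, not the original program. The sales figures are hypothetical, and growth is taken as the absolute difference between consecutive months, which is what np.diff returns.

```python
import numpy as np

# Hypothetical sales figures for 12 months (not taken from the manual)
sales = np.array([120, 150, 90, 200, 170, 130, 220, 180, 160, 210, 140, 190])

# Basic statistics
total_sales = np.sum(sales)
average_sales = np.mean(sales)
max_sales = np.max(sales)
min_sales = np.min(sales)

# np.where returns zero-based positions; add 1 to get month numbers (1 = January)
above_average_months = np.where(sales > average_sales)[0] + 1

# Month-over-month growth: difference between each month and the previous one
monthly_growth = np.diff(sales)
average_growth = np.mean(monthly_growth)

print("Total sales:", total_sales)
print("Average monthly sales:", average_sales)
print("Maximum monthly sales:", max_sales)
print("Minimum monthly sales:", min_sales)
print("Months with above-average sales:", above_average_months)
print("Average monthly sales growth:", average_growth)
```

If percentage growth is wanted instead, one option is to divide np.diff(sales) by sales[:-1].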
OUTPUT:
RESULT :
Ex. No:2 DATA VISUALIZATION BASED ON PANDAS DATA STRUCTURES
AIM:
Visualize monthly sales data using Pandas data structures and Matplotlib.
ALGORITHM:
1. Import the Pandas library as pd.
2. Import the Matplotlib library as plt.
3. Define sample sales data in a dictionary format, including months and corresponding
sales.
4. Create a DataFrame using pd.DataFrame with the sales data.
5. Plot the sales data using Matplotlib:
- Set the figure size using plt.figure(figsize=(10,6)).
- Plot the sales data using plt.plot.
- Customize the plot with markers, color, and linestyle.
- Set the title using plt.title.
- Set the labels for x and y axes using plt.xlabel and plt.ylabel.
- Enable grid using plt.grid(True).
- Rotate x-axis labels using plt.xticks(rotation=45).
- Adjust layout using plt.tight_layout().
- Display the plot using plt.show().
PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
# Sample data: monthly sales
data = {
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
    'Sales': [100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650]
}
# Create a DataFrame
df = pd.DataFrame(data)
# Plotting
plt.figure(figsize=(10, 6))
plt.plot(df['Month'], df['Sales'], marker='o', color='b', linestyle='-')
plt.title('Monthly Sales')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
OUTPUT:
RESULT:
Ex. No:3 APPLY VARIOUS FEATURES ON DATA LOADING, STORAGE AND FILE FORMATS
AIM:
Apply various features for data loading, storage, and file formats using Pandas.
ALGORITHM:
PROGRAM:
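import pandas as pd
# Sample data (assumed for illustration; the original definition is missing)
data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [25, 30, 35]}
# Create a DataFrame
df = pd.DataFrame(data)
# Save to CSV file
df.to_csv('data.csv', index=False)
# Save to Excel file (needs an Excel writer such as openpyxl installed)
df.to_excel('data.xlsx', index=False)
# Save to JSON file, one object per row
df.to_json('data.json', orient='records')
# Load the data back from each format
df_csv = pd.read_csv('data.csv')
df_excel = pd.read_excel('data.xlsx')
df_json = pd.read_json('data.json')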
RESULT:
Ex. No:4 APPLY USE OF PANDAS TOOLS FOR INTERACTING WITH WEB APIs
AIM:
Fetch data from an API endpoint and convert it into a Pandas DataFrame.
ALGORITHM:
PROGRAM:
import pandas as pd
import requests
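The rest of this program is missing. The following is a minimal sketch of the approach stated in the aim, assuming the endpoint returns a JSON array of flat records. The helper name fetch_dataframe and the sample records are hypothetical, and the demonstration builds the DataFrame from in-memory records so that it runs without network access.

```python
import pandas as pd
import requests


def fetch_dataframe(url, timeout=10):
    """Fetch a JSON array of records from `url` and return it as a DataFrame.

    Hypothetical helper: assumes the endpoint answers with a list of flat
    JSON objects that share the same keys.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on HTTP error statuses
    return pd.DataFrame(response.json())


# Offline demonstration with hypothetical records of the same shape
sample_records = [
    {'userId': 1, 'id': 1, 'title': 'first post'},
    {'userId': 1, 'id': 2, 'title': 'second post'},
]
df_sample = pd.DataFrame(sample_records)
print(df_sample.head())
```

With a real endpoint, the call would be fetch_dataframe(url), where url is the address of the API to query.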
OUTPUT:
userId id title \
RESULT:
Ex. No:5 EXPLORE VARIOUS TOOLS BASED ON DATA CLEANING AND PREPARATION
AIM:
To use Pandas, NumPy, Matplotlib, and Seaborn for data cleaning, preparation,
analysis, and visualization.
ALGORITHM:
PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Create a sample DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda', 'Tom'],
        'Age': [25, 30, 35, np.nan, 40],
        'Gender': ['M', 'F', 'M', 'F', 'M'],
        'Salary': [50000, 60000, 70000, 55000, 65000]}
df = pd.DataFrame(data)
# Data Cleaning: fill missing values with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Data Preparation: bin ages by decade; right=False makes each bin include
# its lower edge, so an age of 30 falls in '30s' and 40 in '40s'
df['Age_Category'] = pd.cut(df['Age'], bins=[20, 30, 40, 50], labels=['20s', '30s', '40s'], right=False)
age_distribution = df['Age_Category'].value_counts()
# Pie chart in the right-hand panel of a 1x2 grid
plt.subplot(1, 2, 2)
age_distribution.plot(kind='pie', autopct='%1.1f%%', colors=['lightblue', 'lightgreen', 'lightcoral'])
plt.title('Age Distribution')
plt.ylabel('')
plt.tight_layout()
plt.show()
OUTPUT:
RESULT:
Ex. No:6 USE OF DATA WRANGLING IN VISUALIZATION
AIM:
Utilize data wrangling techniques in data visualization.
ALGORITHM:
1. Data Preparation: Create a DataFrame with sample data representing sales and
expenses over years.
2. Data Wrangling: Calculate profit by subtracting Expenses from Sales and add it
as a new column.
3. Visualization: Plot Sales, Expenses, and Profit over Years.
- Each line represents a different aspect (Sales, Expenses, Profit) over the
years.
4. Enhancements:
• Labels: Add labels for the x-axis (Year) and y-axis (Amount).
• Title: Title the plot to reflect the data being visualized.
• Legend: Include a legend to differentiate between Sales, Expenses, and Profit.
• Grid: Enable grid lines to aid readability.
• X-axis Ticks: Ensure all years are shown on the x-axis for clarity.
5. Display: Show the finalized plot.
PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
# Create a DataFrame with sample data
data = {'Year': [2015, 2016, 2017, 2018, 2019],
        'Sales': [100, 150, 200, 250, 300],
        'Expenses': [80, 100, 120, 150, 200]}
df = pd.DataFrame(data)
# Data wrangling: derive Profit from Sales and Expenses
df['Profit'] = df['Sales'] - df['Expenses']
# Plotting
plt.figure(figsize=(10, 6))
# Plot Sales
plt.plot(df['Year'], df['Sales'], marker='o', label='Sales')
# Plot Expenses
plt.plot(df['Year'], df['Expenses'], marker='o', label='Expenses')
# Plot Profit
plt.plot(df['Year'], df['Profit'], marker='o', label='Profit')
# Labels, title, and legend
plt.xlabel('Year')
plt.ylabel('Amount')
plt.title('Sales, Expenses, and Profit over Years')
plt.legend()
plt.grid(True)
# Ensure all years are shown on the x-axis
plt.xticks(df['Year'])
# Show plot
plt.tight_layout()
plt.show()
OUTPUT:
RESULT:
Ex. No:7 DATA VISUALIZATION USING MATPLOTLIB
AIM:
To visualize the data using Matplotlib.
ALGORITHM:
1. Import the Matplotlib.pyplot library as plt.
2. Define sample data: months and corresponding sales.
3. Plot the data:
• Set the figure size using plt.figure(figsize=(8, 5)).
• Create a bar plot using plt.bar(months, sales, color='skyblue').
4. Add labels and title:
• Label the x-axis as 'Month' using plt.xlabel.
• Label the y-axis as 'Sales' using plt.ylabel.
• Set the title of the plot to 'Monthly Sales' using plt.title.
5. Add a grid to the plot using plt.grid(True).
6. Display the plot using plt.show().
PROGRAM:
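import matplotlib.pyplot as plt
# Sample data (the sales values are assumed; the originals are missing)
months = ['January', 'February', 'March', 'April', 'May']
sales = [100, 150, 200, 250, 300]
# Bar plot
plt.figure(figsize=(8, 5))
plt.bar(months, sales, color='skyblue')
# Adding labels and title
plt.xlabel('Month')
plt.ylabel('Sales')
plt.title('Monthly Sales')
# Adding grid
plt.grid(True)
# Show plot
plt.show()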
OUTPUT:
RESULT:
Ex. No:8 AGGREGATE ‘SUM’ AND ‘MIN’ FUNCTION ACROSS ALL THE COLUMNS IN DATA FRAME USING DATA AGGREGATION FUNCTIONS
AIM:
Aggregate data using 'sum' and 'min' functions across all columns in a DataFrame
using data aggregation functions.
ALGORITHM:
PROGRAM:
import pandas as pd
# Sample DataFrame
data = {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]}
df = pd.DataFrame(data)
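The aggregation step itself is missing from the program. One way to carry it out, sketched here with DataFrame.agg, is shown next. Renaming the row labels to 'Sum' and 'Min' is an assumption made to match the labels in the printed result.

```python
import pandas as pd

# Same sample DataFrame as in the program
data = {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]}
df = pd.DataFrame(data)

# agg applies each named function to every column and
# returns one row per function
aggregated = df.agg(['sum', 'min'])

# Assumed relabelling so the rows read 'Sum' and 'Min'
aggregated.index = ['Sum', 'Min']

print("Aggregated Data:")
print(aggregated)
```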
OUTPUT:
Aggregated Data:
      A   B   C
Sum  10  26  42
Min   1   5   9
RESULT:
Ex. No:9 DATA BASED ON TIME SERIES DATA ANALYSIS
AIM:
Conduct comprehensive analysis and visualization of time series data.
ALGORITHM:
1. Generate Data: Create a time series dataset with dates ranging from '2024-01-01'
to '2024-04-09' and random values.
2. Display Initial Data: Print the first few rows of the generated data to inspect its
structure and values.
3. Plot Time Series Data: Visualize the time series data by plotting 'Date'
against 'Value'.
• Set up the plot with appropriate labels and titles.
• Enable grid lines for clarity.
4. Basic Data Analysis: Provide basic statistics of the data using describe()
function to understand its distribution and summary metrics.
5. Calculate Rolling Mean: Compute the rolling mean of the 'Value' column
using a window size of 7 to smooth out fluctuations.
6. Plot Rolling Mean: Overlay the original data and rolling mean on the same
plot to observe trends and changes over time.
• Label the lines appropriately and include a legend for clarity.
7. Display Plots: Show both plots to visualize the time series data and its rolling
mean.
PROGRAM:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
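The data-generation and rolling-mean steps are missing from the program at this point, so df and rolling_mean are undefined in the plotting code that follows. The following is a minimal sketch of algorithm steps 1, 2, 4, and 5. The range of the random values is an assumption, since the source does not show it.

```python
import numpy as np
import pandas as pd

np.random.seed(0)

# Daily dates from 2024-01-01 to 2024-04-09 inclusive (100 days)
dates = pd.date_range(start='2024-01-01', end='2024-04-09', freq='D')

# Assumed value range: random integers from 50 up to (not including) 200
values = np.random.randint(50, 200, size=len(dates))
df = pd.DataFrame({'Date': dates, 'Value': values})

# Inspect the first rows and the summary statistics
print(df.head())
print(df['Value'].describe())

# Rolling mean over a 7-day window; the first six entries are NaN
# because a full window is not yet available
rolling_mean = df['Value'].rolling(window=7).mean()
```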
plt.figure(figsize=(10, 6))
plt.plot(df['Date'], df['Value'], label='Original Data')
plt.plot(df['Date'], rolling_mean, label='Rolling Mean (window=7)')
plt.title('Rolling Mean of Time Series Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.grid(True)
plt.show()
OUTPUT:
0 2024-01-01 97
1 2024-01-02 167
2 2024-01-03 117
3 2024-01-04 153
4 2024-01-05 59
AIM:
Explore different data preprocessing options using benchmark datasets.
ALGORITHM:
1. Import the Pandas library as pd and load the iris dataset from scikit-learn.
2. Load the iris dataset using load_iris() function from sklearn.datasets.
3. Create a DataFrame (df) using the iris dataset's data and feature names.
4. Introduce missing values into the DataFrame for demonstration purposes.
5. Fill the missing values with the mean of each column.
6. Create a new DataFrame (df_filled) with missing values filled using the fillna()
method with the mean of each column.
7. Print the first few rows of the filled DataFrame to inspect the changes.
PROGRAM:
import pandas as pd
from sklearn.datasets import load_iris
# Load the iris dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
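Steps 4 to 7 of the algorithm are missing from the program. The following is a minimal sketch of them on a small stand-in frame, so that it runs without scikit-learn. The five rows are the first iris measurements typed in by hand, and the positions of the missing values are chosen arbitrarily. The same fillna call can be applied to the full iris DataFrame.

```python
import numpy as np
import pandas as pd

# Stand-in for the iris DataFrame: two of its columns, first five rows
demo = pd.DataFrame({
    'sepal length (cm)': [5.1, 4.9, 4.7, 4.6, 5.0],
    'sepal width (cm)': [3.5, 3.0, 3.2, 3.1, 3.6],
})

# Introduce missing values for demonstration (positions chosen arbitrarily)
demo.loc[1, 'sepal length (cm)'] = np.nan
demo.loc[3, 'sepal width (cm)'] = np.nan

# demo.mean() skips NaN by default, so each gap is filled with the
# mean of the remaining values in its own column
demo_filled = demo.fillna(demo.mean())

print(demo_filled.head())
```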
OUTPUT:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
RESULT:
Ex. No:11 FORMULATE REAL BUSINESS PROBLEMS SCENARIOS TO HYPOTHESIS AND SOLVE STATISTICAL TESTING FEATURES.
AIM:
Formulate a real business problem as a hypothesis and solve it using statistical testing.
ALGORITHM:
PROGRAM:
import pandas as pd
from scipy.stats import ttest_ind
# State the business problem
print("A company wants to determine if there is a significant difference in productivity between Group A and Group B.")
# Define the hypothesis
print("\nHypothesis:")
print("Null Hypothesis (H0): There is no significant difference in productivity between Group A and Group B.")
print("Alternative Hypothesis (H1): There is a significant difference in productivity between Group A and Group B.")
# Perform t-test for independent samples
# (df must hold 'Group_A' and 'Group_B' columns; its definition is missing here)
statistic, p_value = ttest_ind(df['Group_A'], df['Group_B'])
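The program stops before the data is defined and before the decision step. The following is a self-contained sketch of the full test under stated assumptions: the productivity scores are hypothetical, and a significance level of 0.05 is assumed because the source does not state one.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical productivity scores for the two groups
df = pd.DataFrame({
    'Group_A': [81, 82, 83, 84, 85],
    'Group_B': [82, 83, 84, 85, 86],
})

alpha = 0.05  # assumed significance level

# Two-sided t-test for independent samples (equal variances assumed by default)
statistic, p_value = ttest_ind(df['Group_A'], df['Group_B'])

print("\nStatistical Testing:")
print("T-statistic:", statistic)
print("P-value:", p_value)

# Decision rule: reject H0 only when the p-value is below alpha
if p_value < alpha:
    conclusion = "Reject the null hypothesis (H0)."
else:
    conclusion = "Fail to reject the null hypothesis (H0)."

print("\nConclusion:")
print(conclusion)
```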
OUTPUT:
Statistical Testing:
T-statistic: 0.5
P-value: 0.6305360755569764
Conclusion:
Fail to reject the null hypothesis (H0). There is no significant difference in
productivity between Group A and Group B.
RESULT:
Ex. No:12 FORMULATE REAL BUSINESS PROBLEMS SCENARIOS TO
HYPOTHESIS AND SOLVE USING PANDAS.
AIM:
Solve real business problems using Pandas by formulating hypotheses and
conducting hypothesis testing.
ALGORITHM:
PROGRAM:
import pandas as pd
from scipy.stats import ttest_ind
else:
    print("\nConclusion:")
    print("Mean salaries are equal. Statistical testing is not required.")
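Most of this program is missing; only the imports and the final else branch survive. The following is a sketch of one plausible overall flow, not the original code. The salary figures are hypothetical, the significance level of 0.05 is assumed, and two values per department are used only to keep the example short.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical salary records in long format
df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'Marketing', 'Marketing'],
    'Salary': [50000, 60000, 70000, 80000],
})

alpha = 0.05  # assumed significance level

# Mean salary per department, then the raw samples for the test
mean_salaries = df.groupby('Department')['Salary'].mean()
sales = df.loc[df['Department'] == 'Sales', 'Salary']
marketing = df.loc[df['Department'] == 'Marketing', 'Salary']

if mean_salaries['Sales'] != mean_salaries['Marketing']:
    # The means differ, so test whether the difference is significant
    statistic, p_value = ttest_ind(sales, marketing)
    print("\nStatistical Testing:")
    print("T-statistic:", statistic)
    print("P-value:", p_value)
    print("\nConclusion:")
    if p_value < alpha:
        print("Reject the null hypothesis (H0).")
    else:
        print("Fail to reject the null hypothesis (H0).")
else:
    print("\nConclusion:")
    print("Mean salaries are equal. Statistical testing is not required.")
```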
OUTPUT:
Hypothesis:
Null Hypothesis (H0): There is no significant difference in salaries between the
Sales department and the Marketing department.
Alternative Hypothesis (H1): There is a significant difference in salaries
between the Sales department and the Marketing department.
Statistical Testing:
T-statistic: -2.8284271247461903
P-value: 0.1055728090000841
Conclusion:
Fail to reject the null hypothesis (H0). There is no significant difference in salaries
between the Sales department and the Marketing department.
RESULT: