Dav - Lab Manual

The document outlines the structure and objectives of the Data Analysis and Visualization Laboratory course for B. Tech students in Artificial Intelligence and Data Science. It includes details on course objectives, outcomes, program educational objectives, and various exercises involving data analysis using NumPy and Pandas. Additionally, it emphasizes the vision and mission of the institution and department, along with the expected competencies of engineering graduates.


21AI65IT – DATA ANALYSIS AND VISUALIZATION LABORATORY

NAME :

REGISTER NO :

VH NO :

BRANCH : B.Tech - Artificial Intelligence & Data Science

YEAR : III

SEMESTER : VI
CERTIFICATE

NAME: ………………………………………………………………………………………………………….

YEAR: ……………… SEMESTER: ................... BRANCH: ………….

UNIVERSITY REGISTER NO: …………………………………... VH NO: …………………

Certified that this is the bonafide record of work done by the above student in 21AI65IT –
DATA ANALYSIS AND VISUALIZATION LABORATORY during the academic year
2023 – 2024.

Signature of Head of the Department Signature of Lab In charge

Submitted for the University Practical Examination held on at


VELTECH HIGH TECH DR. RANGARAJAN DR. SAKUNTHALA ENGINEERING COLLEGE,
NO. 60, AVADI –VEL TECH ROAD, AVADI, CHENNAI – 600 062.

Signature of Examiners:

Internal: ………………………… External: ………………………….


Vision and Mission of the Institution
Vision of the Institution
Pursuit of excellence in technical education to create civic responsibility with competency.

Mission of the Institution

 To impart the attributes of global engineers to face industrial challenges with social relevance.

 To indoctrinate as front runners through moral practices.

 To attain the skills through lifelong learning.

Vision and Mission of the Department

Vision of the Department

To be a center of excellence in the field of Artificial Intelligence and Data Science.

Mission of the Department

 To provide a conducive learning environment for quality education in the field of Artificial Intelligence and Data Science.

 To pursue industry-institute interaction and promote collaborative research activities.

 To empower the students with ethical values and social responsibilities in their profession.

Programme Educational Objectives (PEOs)

PEO1: Exhibit professional skills to design, develop and test software systems for real-time needs.
PEO2: Excel as a software professional or entrepreneur.
PEO3: Demonstrate a sense of societal and ethical responsibilities in their profession.
PROGRAMME OUTCOMES (POs)

Engineering Graduates will be able to:

PO1: Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and
an engineering specialization to the solution of complex engineering problems.

PO2: Problem analysis: Identify, formulate, review research literature, and analyze complex engineering
problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and
engineering sciences.

PO3: Design/development of solutions: Design solutions for complex engineering problems and design
system components or processes that meet the specified needs with appropriate consideration for the public
health and safety, and the cultural, societal, and environmental considerations.

PO4: Conduct investigations of complex problems: Use research-based knowledge and research methods
including design of experiments, analysis and interpretation of data, and synthesis of the information to provide
valid conclusions.

PO5: Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering
and IT tools including prediction and modeling to complex engineering activities with an understanding of the
limitations.

PO6: The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal,
health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional
engineering practice.

PO7: Environment and sustainability: Understand the impact of the professional engineering solutions in
societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable development.

PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the
engineering practice.

PO9: Individual and team work: Function effectively as an individual, and as a member or leader in
diverse teams, and in multidisciplinary settings.

PO10: Communication: Communicate effectively on complex engineering activities with the engineering
community and with society at large, such as, being able to comprehend and write effective reports and design
documentation, make effective presentations, and give and receive clear instructions.

PO11: Project management and finance: Demonstrate knowledge and understanding of the engineering and
management principles and apply these to one’s own work, as a member and leader in a team, to manage projects
and in multidisciplinary environments.

PO12: Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.
COURSE OBJECTIVES:

• To create user-friendly relational and NoSQL data models.
• To create scalable and efficient data warehouses.
• To develop skills to both design and critique visualizations.
• To understand why visualization is an important part of data analysis.

PREREQUISITE:

• Basic knowledge in Data Analytics, Python and data visualization.

COURSE OUTCOMES:

CO. No.   Course Outcomes                                                              Blooms level
          On successful completion of this Course, students will be able to
C604.1    Apply the fundamental concepts of Data Analysis in real-time applications.        K3
C604.2    Identify the strengths and weaknesses of different types of databases             K2
          and data storage techniques.
C604.3    Apply data visualization techniques for result analysis.                          K3
C604.4    Manipulate data with Matplotlib and Seaborn.                                      K3
C604.5    Set up data pipeline schedules.                                                   K3

COURSE OUTCOMES MAPPING WITH PROGRAM OUTCOMES AND PROGRAM SPECIFIC OUTCOMES

CO No.    PO-1  PO-2  PO-3  PO-4  PO-5  PO-6  PO-7  PO-8  PO-9  PO-10  PO-11  PO-12  PSO-1  PSO-2

C604.1     2     2     1     -     3     -     -     -     -     2      -      -      2      3

C604.2     2     2     1     -     3     -     -     -     -     2      -      -      2      3

C604.3     2     2     1     -     3     -     -     -     -     2      -      -      2      3

C604.4     3     3     2     -     3     -     -     -     -     2      -      -      2      3

C604.5     3     3     2     -     3     -     -     -     -     2      -      -      2      3

Note: 1: Slight, 2: Moderate, 3: Substantial.


Ex. No:1 USING NUMPY TOOLS AND ARRAY FOR DATA ANALYSIS

AIM:
Analyze sales data using NumPy tools and arrays, including calculating basic statistics,
identifying months with above-average sales, and determining month-over-month sales
growth.

ALGORITHM:
1. Import the NumPy library.
2. Define the sample sales data array for 12 months.
3. Calculate total sales using np.sum.
4. Calculate average sales using np.mean.
5. Find maximum sales using np.max.
6. Find minimum sales using np.min.
7. Identify months with above-average sales using np.where.
8. Calculate month-over-month sales growth using np.diff.
9. Calculate the average monthly growth using np.mean.
10. Print the total sales, average monthly sales, maximum monthly sales, minimum monthly
sales, months with above-average sales, and average monthly sales growth.

PROGRAM:

import numpy as np

# Generate sample sales data for 12 months


sales_data = np.array([100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650])

# Calculate basic statistics


total_sales = np.sum(sales_data)
average_sales = np.mean(sales_data)
max_sales = np.max(sales_data)
min_sales = np.min(sales_data)

# Find months with above average sales
above_avg_months = np.where(sales_data > average_sales)[0]

# Calculate month-over-month sales growth


monthly_growth = np.diff(sales_data)
average_monthly_growth = np.mean(monthly_growth)
# Print results
print("Total sales:", total_sales)

print("Average monthly sales:", average_sales)


print("Maximum monthly sales:", max_sales)
print("Minimum monthly sales:", min_sales)
print("Months with above average sales:", above_avg_months)
print("Average monthly sales growth:", average_monthly_growth)

OUTPUT:

Total sales: 4500


Average monthly sales: 375.0

Maximum monthly sales: 650

Minimum monthly sales: 100


Months with above average sales: [ 6 7 8 9 10 11]
Average monthly sales growth: 50.0

RESULT :
Ex. No:2 DATA VISUALIZATION BASED ON PANDAS DATA STRUCTURES

AIM:
Visualize monthly sales data using Pandas data structures and Matplotlib.

ALGORITHM:
1. Import the Pandas library as pd.
2. Import the Matplotlib library as plt.
3. Define sample sales data in a dictionary format, including months and corresponding
sales.
4. Create a DataFrame using pd.DataFrame with the sales data.
5. Plot the sales data using Matplotlib:
- Set the figure size using plt.figure(figsize=(10,6)).
- Plot the sales data using plt.plot.
- Customize the plot with markers, color, and linestyle.
- Set the title using plt.title.
- Set the labels for x and y axes using plt.xlabel and plt.ylabel.
- Enable grid using plt.grid(True).
- Rotate x-axis labels using plt.xticks(rotation=45).
- Adjust layout using plt.tight_layout().
- Display the plot using plt.show().

PROGRAM:

import pandas as pd
import matplotlib.pyplot as plt

# Sample data: monthly sales
data = {'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
        'Sales': [100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650]}

# Create a DataFrame
df = pd.DataFrame(data)

# Plotting
plt.figure(figsize=(10, 6))
plt.plot(df['Month'], df['Sales'], marker='o', color='b', linestyle='-')
plt.title('Monthly Sales')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
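Since this exercise centers on Pandas data structures, the same chart can also be drawn through the DataFrame's own plotting interface, which wraps Matplotlib. A minimal sketch using the same data:

```python
import pandas as pd
import matplotlib.pyplot as plt

data = {'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                  'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
        'Sales': [100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650]}
df = pd.DataFrame(data)

# DataFrame.plot wraps Matplotlib, so most of the chart comes from one call
ax = df.plot(x='Month', y='Sales', marker='o', color='b', linestyle='-',
             figsize=(10, 6), title='Monthly Sales', grid=True, legend=False)
ax.set_ylabel('Sales')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```

Either approach produces the same figure; the DataFrame method simply keeps the plotting call next to the data it draws.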

OUTPUT:

RESULT:
Ex. No:3 APPLY VARIOUS FEATURES ON DATA LOADING, STORAGE AND FILE FORMATS

AIM:
Apply various features for data loading, storage, and file formats using Pandas.

ALGORITHM:

1. Import the Pandas library as pd.


2. Define sample data in a dictionary format, including Name, Age, Gender, and City.
3. Create a DataFrame using pd.DataFrame with the sample data.
4. Save the DataFrame to different file formats:
- Save to CSV file using df.to_csv('data.csv', index=False).
- Save to Excel file using df.to_excel('data.xlsx', index=False).
- Save to JSON file using df.to_json('data.json', orient='records').
5. Load data from different file formats:
- Load from CSV file using pd.read_csv('data.csv').
- Load from Excel file using pd.read_excel('data.xlsx').
- Load from JSON file using pd.read_json('data.json').
6. Print the loaded data from each file format.

PROGRAM:

import pandas as pd

# Create sample data


data = {'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
        'Age': [30, 25, 35, 28, 40],
        'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'City': ['New York', 'Los Angeles', 'Chicago', 'San Francisco', 'Boston']}

# Create a DataFrame
df = pd.DataFrame(data)

# Save data to different file formats
df.to_csv('data.csv', index=False)         # Save to CSV file
df.to_excel('data.xlsx', index=False)      # Save to Excel file
df.to_json('data.json', orient='records')  # Save to JSON file

# Load data from different file formats
df_csv = pd.read_csv('data.csv')           # Load from CSV file
df_excel = pd.read_excel('data.xlsx')      # Load from Excel file
df_json = pd.read_json('data.json')        # Load from JSON file

# Print loaded data
print("Loaded from CSV:")
print(df_csv)
print("\nLoaded from Excel:")
print(df_excel)
print("\nLoaded from JSON:")
print(df_json)
OUTPUT:

Loaded from CSV:
      Name  Age  Gender           City
0     John   30    Male       New York
1    Alice   25  Female    Los Angeles
2      Bob   35    Male        Chicago
3    Emily   28  Female  San Francisco
4  Michael   40    Male         Boston

Loaded from Excel:
      Name  Age  Gender           City
0     John   30    Male       New York
1    Alice   25  Female    Los Angeles
2      Bob   35    Male        Chicago
3    Emily   28  Female  San Francisco
4  Michael   40    Male         Boston

Loaded from JSON:
      Name  Age  Gender           City
0     John   30    Male       New York
1    Alice   25  Female    Los Angeles
2      Bob   35    Male        Chicago
3    Emily   28  Female  San Francisco
4  Michael   40    Male         Boston

RESULT:
Ex. No:4 APPLY USE OF PANDAS TOOLS FOR INTERACTING WITH WEB APIs

AIM:
Fetch data from an API endpoint and convert it into a Pandas DataFrame.

ALGORITHM:

1. Import the Pandas library as pd and the requests library.


2. Define the API endpoint URL.
3. Send a GET request to the API using requests.get(api_url) and store the response.
4. Check if the request was successful (status code 200):
- If successful:
- Convert the JSON response to a Pandas DataFrame using response.json() and
pd.DataFrame(data).
- Display the DataFrame using print(df).
- If not successful:
- Print an error message indicating failure to fetch data from the API.

PROGRAM:

import pandas as pd
import requests

# Define the API endpoint URL


api_url = 'https://jsonplaceholder.typicode.com/posts'

# Send a GET request to the API


response = requests.get(api_url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Convert the JSON response to a pandas DataFrame
    data = response.json()
    df = pd.DataFrame(data)

    # Display the DataFrame
    print("DataFrame from API response:")
    print(df)
else:
    print("Error fetching data from the API.")

OUTPUT:

DataFrame from API response:
    userId   id                                              title  \
0        1    1  sunt aut facere repellat provident occaecati e...
1        1    2                                       qui est esse
2        1    3  ea molestias quasi exercitationem repellat qui...
3        1    4                               eum et est occaecati
4        1    5                                 nesciunt quas odio
..     ...  ...                                                ...
95      10   96  quaerat velit veniam amet cupiditate aut numqu...
96      10   97         quas fugiat ut perspiciatis vero provident
97      10   98                        laboriosam dolor voluptates
98      10   99  temporibus sit alias delectus eligendi possimu...
99      10  100              at nam consequatur ea labore ea harum

                                                 body
0   quia et suscipit\nsuscipit recusandae consequu...
1   est rerum tempore vitae\nsequi sint nihil repr...
2   et iusto sed quo iure\nvoluptatem occaecati om...
3   ullam et saepe reiciendis voluptatem adipisci\...
4   repudiandae veniam quaerat sunt sed\nalias aut...
..                                                ...
95  in non odio excepturi sint eum\nlabore volupta...
96  eum non blanditiis soluta porro quibusdam volu...
97  doloremque ex facilis sit sint culpa\nsoluta a...
98  quo deleniti praesentium dicta non quod\naut e...
99  cupiditate quo est a modi nesciunt soluta\nips...

[100 rows x 4 columns]
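The posts endpoint above returns flat records, but many real APIs nest related fields inside each record. In that case `pd.json_normalize` flattens the nesting before the DataFrame step. A minimal sketch with a hardcoded payload standing in for `response.json()` (the nested shape here is illustrative, not what this endpoint actually returns):

```python
import pandas as pd

# Hypothetical nested payload standing in for response.json()
payload = [
    {'id': 1, 'title': 'first post', 'user': {'name': 'Leanne', 'city': 'Gwenborough'}},
    {'id': 2, 'title': 'second post', 'user': {'name': 'Ervin', 'city': 'Wisokyburgh'}},
]

# json_normalize flattens nested dicts into dotted column names
# such as 'user.name' and 'user.city'
df = pd.json_normalize(payload)
print(df.columns.tolist())
```

Passing a nested payload straight to pd.DataFrame would instead leave the inner dicts as whole objects in a single column.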

RESULT:
Ex. No:5 EXPLORE VARIOUS TOOLS BASED ON DATA CLEANING AND PREPARATION

AIM:
To use Pandas, NumPy, Matplotlib, and Seaborn for data cleaning, preparation,
analysis, and visualization.

ALGORITHM:

1. Import required libraries.


2. Create a DataFrame with sample data.
3. Clean data by filling missing values with the mean age.
4. Prepare data by categorizing ages.
5. Analyze data by calculating average salary by gender and age distribution.
6. Visualize data with a bar plot for average salary by gender and a pie chart for age
distribution.

PROGRAM

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Create a sample DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda', 'Tom'],
        'Age': [25, 30, 35, np.nan, 40],
        'Gender': ['M', 'F', 'M', 'F', 'M'],
        'Salary': [50000, 60000, 70000, 55000, 65000]}
df = pd.DataFrame(data)

# Data Cleaning: fill missing values with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Data Preparation: create age categories
df['Age_Category'] = pd.cut(df['Age'], bins=[20, 30, 40, 50], labels=['20s', '30s', '40s'])


# Data Analysis
avg_salary_by_gender = df.groupby('Gender')['Salary'].mean()  # Average salary by gender
age_distribution = df['Age_Category'].value_counts()          # Age distribution


# Data Visualization
plt.figure(figsize=(10, 6))
# Bar plot for average salary by gender
plt.subplot(1, 2, 1)
avg_salary_by_gender.plot(kind='bar', color=['blue', 'pink'])
plt.title('Average Salary by Gender')
plt.xlabel('Gender')
plt.ylabel('Average Salary')

# Pie chart for age distribution
plt.subplot(1, 2, 2)
age_distribution.plot(kind='pie', autopct='%1.1f%%', colors=['lightblue', 'lightgreen','lightcoral'])
plt.title('Age Distribution')
plt.ylabel('')
plt.tight_layout()
plt.show()
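Seaborn is imported in the program above but never exercised. The same average-salary comparison can be drawn with it, since barplot aggregates the values per group itself. A minimal sketch, reusing the cleaned data from above:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = {'Name': ['John', 'Anna', 'Peter', 'Linda', 'Tom'],
        'Age': [25, 30, 35, np.nan, 40],
        'Gender': ['M', 'F', 'M', 'F', 'M'],
        'Salary': [50000, 60000, 70000, 55000, 65000]}
df = pd.DataFrame(data)
df['Age'] = df['Age'].fillna(df['Age'].mean())  # same cleaning step as above

# barplot computes the mean Salary per Gender itself, so no manual groupby is needed
sns.barplot(data=df, x='Gender', y='Salary')
plt.title('Average Salary by Gender')
plt.tight_layout()
plt.show()
```

This replaces the groupby-then-plot pair with a single call; seaborn also adds an error bar showing the spread within each group.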

OUTPUT:

RESULT:
Ex. No:6 USE OF DATA WRANGLING IN VISUALIZATION

AIM:
Utilize data wrangling techniques in data visualization.

ALGORITHM:
1. Data Preparation: Create a DataFrame with sample data representing sales and
expenses over years.
2. Data Wrangling: Calculate profit by subtracting Expenses from Sales and add it
as a new column.
3. Visualization: Plot Sales, Expenses, and Profit over Years.
- Each line represents a different aspect (Sales, Expenses, Profit) over the
years.
4. Enhancements:
• Labels: Add labels for the x-axis (Year) and y-axis (Amount).
• Title: Title the plot to reflect the data being visualized.
• Legend: Include a legend to differentiate between Sales, Expenses, and Profit.
• Grid: Enable grid lines to aid readability.
• X-axis Ticks: Ensure all years are shown on the x-axis for clarity.
5. Display: Show the finalized plot.

PROGRAM:

import pandas as pd
import matplotlib.pyplot as plt

# Create a DataFrame with sample data
data = {'Year': [2015, 2016, 2017, 2018, 2019],
        'Sales': [100, 150, 200, 250, 300],
        'Expenses': [80, 100, 120, 150, 200]}
df = pd.DataFrame(data)

# Calculate profit by subtracting Expenses from Sales

df['Profit'] = df['Sales'] - df['Expenses']

# Plotting
plt.figure(figsize=(10, 6))
# Plot Sales
plt.plot(df['Year'], df['Sales'], marker='o', label='Sales')
# Plot Expenses
plt.plot(df['Year'], df['Expenses'], marker='o', label='Expenses')
# Plot Profit
plt.plot(df['Year'], df['Profit'], marker='o', label='Profit')

# Add labels and title


plt.title('Sales, Expenses, and Profit Over Years')
plt.xlabel('Year')
plt.ylabel('Amount')
plt.legend()

plt.grid(True)
plt.xticks(df['Year'])
# Ensure all years are shown on the x-axis

# Show plot
plt.tight_layout()

plt.show()

OUTPUT:

RESULT:
Ex. No:7 DATA VISUALIZATION USING MATPLOTLIB

AIM:
To visualize the data using Matplotlib.

ALGORITHM:
1. Import the matplotlib.pyplot library as plt.
2. Define sample data: months and corresponding sales.
3. Plot the data:
• Set the figure size using plt.figure(figsize=(8, 5)).
• Create a bar plot using plt.bar(months, sales, color='skyblue').
4. Add labels and title:
• Label the x-axis as 'Month' using plt.xlabel.
• Label the y-axis as 'Sales' using plt.ylabel.
• Set the title of the plot to 'Monthly Sales' using plt.title.
5. Add a grid to the plot using plt.grid(True).
6. Display the plot using plt.show().

PROGRAM:

import matplotlib.pyplot as plt

# Sample data
months = ['January', 'February', 'March', 'April', 'May']

sales = [100, 150, 200, 250, 300]

# Plotting
plt.figure(figsize=(8, 5))

# Bar plot
plt.bar(months, sales, color='skyblue')

# Adding labels and title


plt.xlabel('Month')
plt.ylabel('Sales')
plt.title('Monthly Sales')

# Adding grid
plt.grid(True)
# Show plot
plt.show()

OUTPUT:

RESULT:
Ex. No:8 AGGREGATE ‘SUM’ AND ‘MIN’ FUNCTIONS ACROSS ALL THE COLUMNS IN A DATAFRAME USING DATA AGGREGATION FUNCTIONS

AIM:
Aggregate data using 'sum' and 'min' functions across all columns in a DataFrame
using data aggregation functions.

ALGORITHM:

1. Import the Pandas library as pd.


2. Define sample data in a dictionary format with columns A, B, and C.
3. Create a DataFrame using pd.DataFrame(data).
4. Aggregate data using 'sum' and 'min' functions across all columns:
• Use the agg function on the DataFrame (df) with parameters ['sum', 'min']
to specify the aggregation functions.
• Store the aggregated data in a new DataFrame (aggregated_data).
5. Display the aggregated data using print().

PROGRAM:

import pandas as pd

# Sample DataFrame
data = {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]}

df = pd.DataFrame(data)

# Aggregate using sum and min functions

aggregated_data = df.agg(['sum', 'min'])


# Display the aggregated data
print("Aggregated Data:")
print(aggregated_data)
OUTPUT:

Aggregated Data:
      A   B   C
sum  10  26  42
min   1   5   9

RESULT:
Ex. No:9 TIME SERIES DATA ANALYSIS

AIM:
Conduct comprehensive analysis and visualization of time series data.

ALGORITHM:

1. Generate Data: Create a time series dataset with dates ranging from '2024-01-01'
to '2024-04-09' and random values.
2. Display Initial Data: Print the first few rows of the generated data to inspect its
structure and values.
3. Plot Time Series Data: Visualize the time series data by plotting 'Date'
against 'Value'.
• Set up the plot with appropriate labels and titles.
• Enable grid lines for clarity.
4. Basic Data Analysis: Provide basic statistics of the data using describe()
function to understand its distribution and summary metrics.
5. Calculate Rolling Mean: Compute the rolling mean of the 'Value' column
using a window size of 7 to smooth out fluctuations.
6. Plot Rolling Mean: Overlay the original data and rolling mean on the same
plot to observe trends and changes over time.
• Label the lines appropriately and include a legend for clarity.
7. Display Plots: Show both plots to visualize the time series data and its rolling
mean.

PROGRAM:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate sample time series data

np.random.seed(0)

dates = pd.date_range('2024-01-01', periods=100)


values = np.random.randint(50, 200, size=100)
df = pd.DataFrame({'Date': dates, 'Value': values})

# Display the first few rows of the generated data


print("First few rows of the generated data:")
print(df.head())
# Plot the generated time series data
plt.figure(figsize=(10, 6))
plt.plot(df['Date'],df['Value'])
plt.title('Generated Time Series Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.grid(True)
plt.show()

# Basic Time Series Data Analysis


print("\nBasic statistics of the generated data:")
print(df.describe())
# Calculate rolling mean and plot
rolling_mean = df['Value'].rolling(window=7).mean()

plt.figure(figsize=(10, 6))
plt.plot(df['Date'], df['Value'], label='Original Data')
plt.plot(df['Date'], rolling_mean, label='Rolling Mean (window=7)')
plt.title('Rolling Mean of Time Series Data')

plt.xlabel('Date')
plt.ylabel('Value')

plt.legend()
plt.grid(True)

plt.show()
OUTPUT:

First few rows of the generated data:
        Date  Value
0 2024-01-01     97
1 2024-01-02    167
2 2024-01-03    117
3 2024-01-04    153
4 2024-01-05     59

Basic statistics of the generated data:
                      Date       Value
count                  100  100.000000
mean   2024-02-19 12:00:00  130.150000
min    2024-01-01 00:00:00   50.000000
25%    2024-01-25 18:00:00   91.250000
50%    2024-02-19 12:00:00  133.000000
75%    2024-03-15 06:00:00  169.250000
max    2024-04-09 00:00:00  199.000000
std                    NaN   44.147452
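Beyond the rolling mean, the same daily series can be aggregated to a coarser frequency with resample, which is the other standard smoothing step in time series work. A minimal sketch on the same generated data, taking monthly means (the 'MS' alias groups by calendar month, labelled at the month start):

```python
import pandas as pd
import numpy as np

# Same generated series as in the program above
np.random.seed(0)
dates = pd.date_range('2024-01-01', periods=100)
values = np.random.randint(50, 200, size=100)
df = pd.DataFrame({'Date': dates, 'Value': values})

# resample requires a DatetimeIndex, so move 'Date' into the index first
monthly_mean = df.set_index('Date')['Value'].resample('MS').mean()
print(monthly_mean)
```

With 100 days starting 2024-01-01, the series spans January through April, so the resampled result has four monthly entries.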
RESULT:
Ex. No:10 EXPLORE VARIOUS DATA PRE-PROCESSING OPTIONS USING BENCHMARK DATASETS

AIM:
Explore different data preprocessing options using benchmark datasets.

ALGORITHM:

1. Import the Pandas library as pd and load the iris dataset from scikit-learn.
2. Load the iris dataset using load_iris() function from sklearn.datasets.
3. Create a DataFrame (df) using the iris dataset's data and feature names.
4. Introduce missing values into the DataFrame for demonstration purposes.
5. Fill the missing values with the mean of each column.
6. Create a new DataFrame (df_filled) with missing values filled using the fillna()
method with the mean of each column.
7. Print the first few rows of the filled DataFrame to inspect the changes.

PROGRAM:

import pandas as pd
from sklearn.datasets import load_iris

# Load Iris dataset
iris = load_iris()

df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Add some missing values for demonstration
df.loc[::5, 'sepal length (cm)'] = None  # introduce missing values

# Fill missing values with the mean of the column


df_filled = df.fillna(df.mean())
print(df_filled.head())
OUTPUT:

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0           5.799167               3.5                1.4               0.2
1           4.900000               3.0                1.4               0.2
2           4.700000               3.2                1.3               0.2
3           4.600000               3.1                1.5               0.2
4           5.000000               3.6                1.4               0.2
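Mean imputation is only one of the preprocessing options this exercise asks you to explore; feature scaling is another common one. A minimal sketch of min-max scaling on the same Iris columns, done directly with Pandas arithmetic rather than scikit-learn's MinMaxScaler to keep it short:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Same Iris DataFrame as in the program above
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Min-max scaling maps every column onto the [0, 1] range
df_scaled = (df - df.min()) / (df.max() - df.min())
print(df_scaled.describe().loc[['min', 'max']])
```

After scaling, every column's minimum is 0 and maximum is 1, which keeps features with large raw ranges (sepal length) from dominating features with small ones (petal width) in distance-based methods.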

RESULT:
Ex. No:11 FORMULATE REAL BUSINESS PROBLEM SCENARIOS AS HYPOTHESES AND SOLVE USING STATISTICAL TESTING

AIM:
Formulate a real business problem scenario as hypotheses and test it using statistical methods.

ALGORITHM:

1. Define sample data representing two groups.


2. Formulate a business problem scenario and hypotheses.
3. Perform a t-test for independent samples using ttest_ind.
• Interpret the results based on the calculated p-value and
significance level.
4. Conclude whether to reject or fail to reject the null hypothesis.

PROGRAM:

import pandas as pd
from scipy.stats import ttest_ind

# Load the dataset (sample data for illustration purposes)


data = {'Group_A': [10, 12, 14, 16, 18], 'Group_B': [9, 11, 13, 15, 17]}
df = pd.DataFrame(data)

# Define the business problem scenario


print("Business Problem Scenario:")
print("A company wants to determine if there is a significant difference in productivity between Group A and Group B.")

# Define the hypothesis
print("\nHypothesis:")
print("Null Hypothesis (H0): There is no significant difference in productivity between Group A and Group B.")
print("Alternative Hypothesis (H1): There is a significant difference in productivity between Group A and Group B.")

# Perform t-test for independent samples
statistic, p_value = ttest_ind(df['Group_A'], df['Group_B'])

# Interpret the results
alpha = 0.05
print("\nStatistical Testing:")
print(f"T-statistic: {statistic}")
print(f"P-value: {p_value}")

if p_value < alpha:
    print("\nConclusion:")
    print("Reject the null hypothesis (H0). There is a significant difference in productivity between Group A and Group B.")
else:
    print("\nConclusion:")
    print("Fail to reject the null hypothesis (H0). There is no significant difference in productivity between Group A and Group B.")
OUTPUT:

Business Problem Scenario:
A company wants to determine if there is a significant difference in productivity between Group A and Group B.

Hypothesis:
Null Hypothesis (H0): There is no significant difference in productivity between Group A and Group B.
Alternative Hypothesis (H1): There is a significant difference in productivity between Group A and Group B.

Statistical Testing:
T-statistic: 0.5
P-value: 0.6305360755569764

Conclusion:
Fail to reject the null hypothesis (H0). There is no significant difference in productivity between Group A and Group B.

RESULT:
Ex. No:12 FORMULATE REAL BUSINESS PROBLEM SCENARIOS AS HYPOTHESES AND SOLVE USING PANDAS

AIM:
Solve real business problems using Pandas by formulating hypotheses and
conducting hypothesis testing.

ALGORITHM:

1. Load the dataset


• Create sample data representing employee salaries and
departments.
2. Determine if there's a significant difference in salaries between the Sales and
Marketing departments.
3. Formulate Hypotheses:
• Null Hypothesis (H0): No significant difference in salaries
between Sales and Marketing.
• Alternative Hypothesis (H1): Significant difference in salaries
between Sales and Marketing.
4. Filter and calculate mean salaries for Sales and Marketing departments.
5. Use t-test (ttest_ind) to compare salaries between departments. Calculate t-
statistic and p-value.
6. Interpret Results:
• Compare p-value to significance level (alpha).
• Reject or fail to reject null hypothesis based on p-value.
7. Based on hypothesis test results, determine if there's a significant difference in
salaries between departments.

PROGRAM:

import pandas as pd
from scipy.stats import ttest_ind

# Load the dataset (sample data for illustration purposes)


data = {'Employee_ID': [1, 2, 3, 4, 5],
        'Department': ['Sales', 'Marketing', 'Sales', 'Finance', 'Marketing'],
        'Salary': [50000, 60000, 55000, 70000, 65000]}
df = pd.DataFrame(data)

# Define the business problem scenario


print("Business Problem Scenario:")
print("A company wants to determine if there is a significant difference in salaries between the Sales department and the Marketing department.")

# Define the hypothesis
print("\nHypothesis:")
print("Null Hypothesis (H0): There is no significant difference in salaries between the Sales department and the Marketing department.")
print("Alternative Hypothesis (H1): There is a significant difference in salaries between the Sales department and the Marketing department.")

# Data Analysis using pandas
sales_salaries = df[df['Department'] == 'Sales']['Salary']
marketing_salaries = df[df['Department'] == 'Marketing']['Salary']

# Calculate mean salaries for each department
sales_mean_salary = sales_salaries.mean()
marketing_mean_salary = marketing_salaries.mean()

# Print mean salaries
print("\nMean Salary for Sales department:", sales_mean_salary)
print("Mean Salary for Marketing department:", marketing_mean_salary)

# Perform hypothesis testing
alpha = 0.05

if sales_mean_salary != marketing_mean_salary:
    print("\nStatistical Testing:")
    t_statistic, p_value = ttest_ind(sales_salaries, marketing_salaries)
    print(f"T-statistic: {t_statistic}")
    print(f"P-value: {p_value}")

    if p_value < alpha:
        print("\nConclusion:")
        print("Reject the null hypothesis (H0). There is a significant difference in salaries between the Sales department and the Marketing department.")
    else:
        print("\nConclusion:")
        print("Fail to reject the null hypothesis (H0). There is no significant difference in salaries between the Sales department and the Marketing department.")
else:
    print("\nConclusion:")
    print("Mean salaries are equal. Statistical testing is not required.")

OUTPUT:

Business Problem Scenario:
A company wants to determine if there is a significant difference in salaries between the Sales department and the Marketing department.

Hypothesis:
Null Hypothesis (H0): There is no significant difference in salaries between the Sales department and the Marketing department.
Alternative Hypothesis (H1): There is a significant difference in salaries between the Sales department and the Marketing department.

Mean Salary for Sales department: 52500.0
Mean Salary for Marketing department: 62500.0

Statistical Testing:
T-statistic: -2.8284271247461903
P-value: 0.1055728090000841

Conclusion:
Fail to reject the null hypothesis (H0). There is no significant difference in salaries between the Sales department and the Marketing department.

RESULT:
