Dav - Lab Manual

The document outlines the structure and objectives of the Data Analysis and Visualization Laboratory course for B. Tech students in Artificial Intelligence and Data Science. It includes details on course objectives, outcomes, program educational objectives, and various exercises involving data analysis using NumPy and Pandas. Additionally, it emphasizes the vision and mission of the institution and department, along with the expected competencies of engineering graduates.


21AI65IT – DATA ANALYSIS AND VISUALIZATION LABORATORY

NAME :

REGISTER NO :

VH NO :

BRANCH : B.Tech - Artificial Intelligence & Data Science

YEAR : III

SEMESTER : VI
CERTIFICATE

NAME: ………………………………………………………………………………………………………….

YEAR: ……………… SEMESTER: ................... BRANCH: ………….

UNIVERSITY REGISTER NO: …………………………………... VH NO: …………………

Certified that this is the bonafide record of work done by the above student in 21AI65IT –
DATA ANALYSIS AND VISUALIZATION LABORATORY during the academic year
2023 – 2024.

Signature of Head of the Department Signature of Lab In charge

Submitted for the University Practical Examination held on at


VELTECH HIGH TECH DR. RANGARAJAN DR. SAKUNTHALA ENGINEERING COLLEGE,
NO. 60, AVADI –VEL TECH ROAD, AVADI, CHENNAI – 600 062.

Signature of Examiners:

Internal: ………………………… External: ………………………….


Vision and Mission of the Institution
Vision of the Institution
Pursuit of excellence in technical education to create civic responsibility with competency.

Mission of the Institution

 To impart the attributes of global engineers to face industrial challenges with social relevance.

 To indoctrinate as front runners through moral practices.

 To attain the skills through lifelong learning.

Vision and Mission of the Department

Vision of the Department

To be a center of excellence in the field of Artificial Intelligence and Data Science.

Mission of the Department

 To provide a conducive learning environment for quality education in the field of Artificial Intelligence and Data Science.

 To pursue industry-institute interaction and promote collaborative research activities.

 To empower the students with ethical values and social responsibilities in their profession.

Programme Educational Objectives (PEOs)

PEO1: Exhibit professional skills to design, develop and test software systems for real-time needs.
PEO2: Excel as a software professional or entrepreneur.
PEO3: Demonstrate a sense of societal and ethical responsibilities in their profession.
PROGRAMME OUTCOMES (POs)

Engineering Graduates will be able to:

PO1: Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and
an engineering specialization to the solution of complex engineering problems.

PO2: Problem analysis: Identify, formulate, review research literature, and analyze complex engineering
problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and
engineering sciences.

PO3: Design/development of solutions: Design solutions for complex engineering problems and design
system components or processes that meet the specified needs with appropriate consideration for the public
health and safety, and the cultural, societal, and environmental considerations.

PO4: Conduct investigations of complex problems: Use research-based knowledge and research methods
including design of experiments, analysis and interpretation of data, and synthesis of the information to provide
valid conclusions.

PO5: Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering
and IT tools including prediction and modeling to complex engineering activities with an understanding of the
limitations.

PO6: The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal,
health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional
engineering practice.

PO7: Environment and sustainability: Understand the impact of the professional engineering solutions in
societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable development.

PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the
engineering practice.

PO9: Individual and team work: Function effectively as an individual, and as a member or leader in
diverse teams, and in multidisciplinary settings.

PO10: Communication: Communicate effectively on complex engineering activities with the engineering
community and with society at large, such as, being able to comprehend and write effective reports and design
documentation, make effective presentations, and give and receive clear instructions.

PO11: Project management and finance: Demonstrate knowledge and understanding of the engineering and
management principles and apply these to one’s own work, as a member and leader in a team, to manage projects
and in multidisciplinary environments.

PO12: Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.
COURSE OBJECTIVES:

• To create user-friendly relational and NoSQL data models.
• To create scalable and efficient data warehouses.
• To develop skills to both design and critique visualizations.
• To understand why visualization is an important part of data analysis.

PREREQUISITE:

• Basic knowledge in Data Analytics, Python and data visualization.

COURSE OUTCOMES:

CO. No.   Course Outcomes                                                              Blooms level
          On successful completion of this Course, students will be able to
C604.1    Apply the fundamental concepts of Data Analysis in real-time applications.        K3
C604.2    Identify the strengths and weaknesses of different types of databases             K2
          and data storage techniques.
C604.3    Apply data visualization techniques for result analysis.                          K3
C604.4    Manipulate data with Matplotlib and Seaborn.                                      K3
C604.5    Set up data pipeline schedules.                                                   K3

COURSE OUTCOMES MAPPING WITH PROGRAM OUTCOMES AND PROGRAM SPECIFIC OUTCOMES

CO No.    PO-1  PO-2  PO-3  PO-4  PO-5  PO-6  PO-7  PO-8  PO-9  PO-10  PO-11  PO-12  PSO-1  PSO-2

C604.1     2     2     1     -     3     -     -     -     -     2      -      -      2      3

C604.2     2     2     1     -     3     -     -     -     -     2      -      -      2      3

C604.3     2     2     1     -     3     -     -     -     -     2      -      -      2      3

C604.4     3     3     2     -     3     -     -     -     -     2      -      -      2      3

C604.5     3     3     2     -     3     -     -     -     -     2      -      -      2      3

Note: 1: Slight, 2: Moderate, 3: Substantial.


Ex. No:1 USING NUMPY TOOLS AND ARRAY FOR DATA ANALYSIS

AIM:
Analyze sales data using NumPy tools and arrays, including calculating basic statistics,
identifying months with above-average sales, and determining month-over-month sales
growth.

ALGORITHM:
1. Import the NumPy library.
2. Define the sample sales data array for 12 months.
3. Calculate total sales using np.sum.
4. Calculate average sales using np.mean.
5. Find maximum sales using np.max.
6. Find minimum sales using np.min.
7. Identify months with above-average sales using np.where.
8. Calculate month-over-month sales growth using np.diff.
9. Calculate the average monthly growth using np.mean.
10. Print the total sales, average monthly sales, maximum monthly sales, minimum monthly
sales, months with above-average sales, and average monthly sales growth.

PROGRAM:

import numpy as np

# Generate sample sales data for 12 months


sales_data = np.array([100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650])

# Calculate basic statistics


total_sales = np.sum(sales_data)
average_sales = np.mean(sales_data)
max_sales = np.max(sales_data)
min_sales = np.min(sales_data)

# Find months with above average sales
above_avg_months = np.where(sales_data > average_sales)[0]

# Calculate month-over-month sales growth


monthly_growth = np.diff(sales_data)
average_monthly_growth = np.mean(monthly_growth)
# Print results
print("Total sales:", total_sales)

print("Average monthly sales:", average_sales)


print("Maximum monthly sales:", max_sales)
print("Minimum monthly sales:", min_sales)
print("Months with above average sales:", above_avg_months)
print("Average monthly sales growth:", average_monthly_growth)

OUTPUT:

Total sales: 4500


Average monthly sales: 375.0

Maximum monthly sales: 650

Minimum monthly sales: 100


Months with above average sales: [ 6 7 8 9 10 11]
Average monthly sales growth: 50.0

RESULT :
Ex. No:2 DATA VISUALIZATION BASED ON PANDAS DATA STRUCTURES

AIM:
Visualize monthly sales data using Pandas data structures and Matplotlib.

ALGORITHM:
1. Import the Pandas library as pd.
2. Import the Matplotlib library as plt.
3. Define sample sales data in a dictionary format, including months and corresponding
sales.
4. Create a DataFrame using pd.DataFrame with the sales data.
5. Plot the sales data using Matplotlib:
- Set the figure size using plt.figure(figsize=(10,6)).
- Plot the sales data using plt.plot.
- Customize the plot with markers, color, and linestyle.
- Set the title using plt.title.
- Set the labels for x and y axes using plt.xlabel and plt.ylabel.
- Enable grid using plt.grid(True).
- Rotate x-axis labels using plt.xticks(rotation=45).
- Adjust layout using plt.tight_layout().
- Display the plot using plt.show().

PROGRAM:

import pandas as pd
import matplotlib.pyplot as plt

# Sample data: monthly sales
data = {'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
        'Sales': [100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650]}

# Create a DataFrame
df = pd.DataFrame(data)

# Plotting
plt.figure(figsize=(10, 6))
plt.plot(df['Month'], df['Sales'], marker='o', color='b', linestyle='-')
plt.title('Monthly Sales')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
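Since this exercise centers on Pandas data structures, the same chart can also be drawn through the DataFrame's own plotting interface, which wraps Matplotlib. A minimal sketch using the same data:

```python
import pandas as pd
import matplotlib.pyplot as plt

data = {'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                  'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
        'Sales': [100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650]}
df = pd.DataFrame(data)

# DataFrame.plot wraps Matplotlib, so most of the chart comes from one call
ax = df.plot(x='Month', y='Sales', marker='o', color='b', linestyle='-',
             figsize=(10, 6), title='Monthly Sales', grid=True, legend=False)
ax.set_ylabel('Sales')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```

Either approach produces the same figure; the DataFrame method simply keeps the plotting call next to the data it draws.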

OUTPUT:

RESULT:
Ex. No:3 APPLY VARIOUS FEATURES ON DATA LOADING, STORAGE AND FILE FORMATS

AIM:
Apply various features for data loading, storage, and file formats using Pandas.

ALGORITHM:

1. Import the Pandas library as pd.


2. Define sample data in a dictionary format, including Name, Age, Gender, and City.
3. Create a DataFrame using pd.DataFrame with the sample data.
4. Save the DataFrame to different file formats:
- Save to CSV file using df.to_csv('data.csv', index=False).
- Save to Excel file using df.to_excel('data.xlsx', index=False).
- Save to JSON file using df.to_json('data.json', orient='records').
5. Load data from different file formats:
- Load from CSV file using pd.read_csv('data.csv').
- Load from Excel file using pd.read_excel('data.xlsx').
- Load from JSON file using pd.read_json('data.json').
6. Print the loaded data from each file format.

PROGRAM:

import pandas as pd

# Create sample data


data = {'Name': ['John', 'Alice', 'Bob', 'Emily', 'Michael'],
        'Age': [30, 25, 35, 28, 40],
        'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'City': ['New York', 'Los Angeles', 'Chicago', 'San Francisco', 'Boston']}

# Create a DataFrame
df = pd.DataFrame(data)

# Save data to different file formats
df.to_csv('data.csv', index=False)         # Save to CSV file
df.to_excel('data.xlsx', index=False)      # Save to Excel file
df.to_json('data.json', orient='records')  # Save to JSON file

# Load data from different file formats
df_csv = pd.read_csv('data.csv')           # Load from CSV file
df_excel = pd.read_excel('data.xlsx')      # Load from Excel file
df_json = pd.read_json('data.json')        # Load from JSON file

# Print loaded data
print("Loaded from CSV:")
print(df_csv)
print("\nLoaded from Excel:")
print(df_excel)
print("\nLoaded from JSON:")
print(df_json)
OUTPUT:

Loaded from CSV:
      Name  Age  Gender           City
0     John   30    Male       New York
1    Alice   25  Female    Los Angeles
2      Bob   35    Male        Chicago
3    Emily   28  Female  San Francisco
4  Michael   40    Male         Boston

Loaded from Excel:
      Name  Age  Gender           City
0     John   30    Male       New York
1    Alice   25  Female    Los Angeles
2      Bob   35    Male        Chicago
3    Emily   28  Female  San Francisco
4  Michael   40    Male         Boston

Loaded from JSON:
      Name  Age  Gender           City
0     John   30    Male       New York
1    Alice   25  Female    Los Angeles
2      Bob   35    Male        Chicago
3    Emily   28  Female  San Francisco
4  Michael   40    Male         Boston

RESULT:
Ex. No:4 APPLY USE OF PANDAS TOOLS FOR INTERACTING WITH WEB APIs

AIM:
Fetch data from an API endpoint and convert it into a Pandas DataFrame.

ALGORITHM:

1. Import the Pandas library as pd and the requests library.


2. Define the API endpoint URL.
3. Send a GET request to the API using requests.get(api_url) and store the response.
4. Check if the request was successful (status code 200):
- If successful:
- Convert the JSON response to a Pandas DataFrame using response.json() and
pd.DataFrame(data).
- Display the DataFrame using print(df).
- If not successful:
- Print an error message indicating failure to fetch data from the API.

PROGRAM:

import pandas as pd
import requests

# Define the API endpoint URL


api_url = 'https://jsonplaceholder.typicode.com/posts'

# Send a GET request to the API


response = requests.get(api_url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Convert the JSON response to a pandas DataFrame
    data = response.json()
    df = pd.DataFrame(data)

    # Display the DataFrame
    print("DataFrame from API response:")
    print(df)
else:
    print("Error fetching data from the API.")

OUTPUT:

DataFrame from API response:
    userId   id                                              title  \
0        1    1  sunt aut facere repellat provident occaecati e...
1        1    2                                       qui est esse
2        1    3  ea molestias quasi exercitationem repellat qui...
3        1    4                               eum et est occaecati
4        1    5                                 nesciunt quas odio
..     ...  ...                                                ...
95      10   96  quaerat velit veniam amet cupiditate aut numqu...
96      10   97         quas fugiat ut perspiciatis vero provident
97      10   98                        laboriosam dolor voluptates
98      10   99  temporibus sit alias delectus eligendi possimu...
99      10  100              at nam consequatur ea labore ea harum

                                                 body
0   quia et suscipit\nsuscipit recusandae consequu...
1   est rerum tempore vitae\nsequi sint nihil repr...
2   et iusto sed quo iure\nvoluptatem occaecati om...
3   ullam et saepe reiciendis voluptatem adipisci\...
4   repudiandae veniam quaerat sunt sed\nalias aut...
..                                                ...
95  in non odio excepturi sint eum\nlabore volupta...
96  eum non blanditiis soluta porro quibusdam volu...
97  doloremque ex facilis sit sint culpa\nsoluta a...
98  quo deleniti praesentium dicta non quod\naut e...
99  cupiditate quo est a modi nesciunt soluta\nips...

[100 rows x 4 columns]
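The posts endpoint above returns flat records, but many real APIs nest related fields inside each record. In that case `pd.json_normalize` flattens the nesting before the DataFrame step. A minimal sketch with a hardcoded payload standing in for `response.json()` (the nested shape here is illustrative, not what this endpoint actually returns):

```python
import pandas as pd

# Hypothetical nested payload standing in for response.json()
payload = [
    {'id': 1, 'title': 'first post', 'user': {'name': 'Leanne', 'city': 'Gwenborough'}},
    {'id': 2, 'title': 'second post', 'user': {'name': 'Ervin', 'city': 'Wisokyburgh'}},
]

# json_normalize flattens nested dicts into dotted column names
# such as 'user.name' and 'user.city'
df = pd.json_normalize(payload)
print(df.columns.tolist())
```

Passing a nested payload straight to pd.DataFrame would instead leave the inner dicts as whole objects in a single column.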

RESULT:
Ex. No:5 EXPLORE VARIOUS TOOLS BASED ON DATA CLEANING AND PREPARATION

AIM:
To use Pandas, NumPy, Matplotlib, and Seaborn for data cleaning, preparation,
analysis, and visualization.

ALGORITHM:

1. Import required libraries.


2. Create a DataFrame with sample data.
3. Clean data by filling missing values with the mean age.
4. Prepare data by categorizing ages.
5. Analyze data by calculating average salary by gender and age distribution.
6. Visualize data with a bar plot for average salary by gender and a pie chart for age
distribution.

PROGRAM

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Create a sample DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda', 'Tom'],
        'Age': [25, 30, 35, np.nan, 40],
        'Gender': ['M', 'F', 'M', 'F', 'M'],
        'Salary': [50000, 60000, 70000, 55000, 65000]}
df = pd.DataFrame(data)

# Data Cleaning: fill missing values with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Data Preparation: create age categories
df['Age_Category'] = pd.cut(df['Age'], bins=[20, 30, 40, 50], labels=['20s', '30s', '40s'])


# Data Analysis
avg_salary_by_gender = df.groupby('Gender')['Salary'].mean()  # Average salary by gender
age_distribution = df['Age_Category'].value_counts()          # Age distribution


# Data Visualization
plt.figure(figsize=(10, 6))
# Bar plot for average salary by gender
plt.subplot(1, 2, 1)
avg_salary_by_gender.plot(kind='bar', color=['blue', 'pink'])
plt.title('Average Salary by Gender')
plt.xlabel('Gender')
plt.ylabel('Average Salary')

# Pie chart for age distribution
plt.subplot(1, 2, 2)
age_distribution.plot(kind='pie', autopct='%1.1f%%', colors=['lightblue', 'lightgreen','lightcoral'])
plt.title('Age Distribution')
plt.ylabel('')
plt.tight_layout()
plt.show()
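Seaborn is imported in the program above but never exercised. The same average-salary comparison can be drawn with it, since barplot aggregates the values per group itself. A minimal sketch, reusing the cleaned data from above:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = {'Name': ['John', 'Anna', 'Peter', 'Linda', 'Tom'],
        'Age': [25, 30, 35, np.nan, 40],
        'Gender': ['M', 'F', 'M', 'F', 'M'],
        'Salary': [50000, 60000, 70000, 55000, 65000]}
df = pd.DataFrame(data)
df['Age'] = df['Age'].fillna(df['Age'].mean())  # same cleaning step as above

# barplot computes the mean Salary per Gender itself, so no manual groupby is needed
sns.barplot(data=df, x='Gender', y='Salary')
plt.title('Average Salary by Gender')
plt.tight_layout()
plt.show()
```

This replaces the groupby-then-plot pair with a single call; seaborn also adds an error bar showing the spread within each group.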

OUTPUT:

RESULT:
Ex. No:6 USE OF DATA WRANGLING IN VISUALIZATION

AIM:
Utilize data wrangling techniques in data visualization.

ALGORITHM:
1. Data Preparation: Create a DataFrame with sample data representing sales and
expenses over years.
2. Data Wrangling: Calculate profit by subtracting Expenses from Sales and add it
as a new column.
3. Visualization: Plot Sales, Expenses, and Profit over Years.
- Each line represents a different aspect (Sales, Expenses, Profit) over the
years.
4. Enhancements:
• Labels: Add labels for the x-axis (Year) and y-axis (Amount).
• Title: Title the plot to reflect the data being visualized.
• Legend: Include a legend to differentiate between Sales, Expenses, and Profit.
• Grid: Enable grid lines to aid readability.
• X-axis Ticks: Ensure all years are shown on the x-axis for clarity.
5. Display: Show the finalized plot.

PROGRAM:

import pandas as pd
import matplotlib.pyplot as plt

# Create a DataFrame with sample data
data = {'Year': [2015, 2016, 2017, 2018, 2019],
        'Sales': [100, 150, 200, 250, 300],
        'Expenses': [80, 100, 120, 150, 200]}
df = pd.DataFrame(data)

# Calculate profit by subtracting Expenses from Sales

df['Profit'] = df['Sales'] - df['Expenses']

# Plotting
plt.figure(figsize=(10, 6))
# Plot Sales
plt.plot(df['Year'], df['Sales'], marker='o', label='Sales')
# Plot Expenses
plt.plot(df['Year'], df['Expenses'], marker='o', label='Expenses')
# Plot Profit
plt.plot(df['Year'], df['Profit'], marker='o', label='Profit')

# Add labels and title


plt.title('Sales, Expenses, and Profit Over Years')
plt.xlabel('Year')
plt.ylabel('Amount')
plt.legend()

plt.grid(True)
plt.xticks(df['Year'])
# Ensure all years are shown on the x-axis

# Show plot
plt.tight_layout()

plt.show()

OUTPUT:

RESULT:
Ex. No:7 DATA VISUALIZATION USING MATPLOTLIB

AIM:
To visualize the data using Matplotlib.

ALGORITHM:
1. Import the matplotlib.pyplot library as plt.
2. Define sample data: months and corresponding sales.
3. Plot the data:
• Set the figure size using plt.figure(figsize=(8, 5)).
• Create a bar plot using plt.bar(months, sales, color='skyblue').
4. Add labels and title:
• Label the x-axis as 'Month' using plt.xlabel.
• Label the y-axis as 'Sales' using plt.ylabel.
• Set the title of the plot to 'Monthly Sales' using plt.title.
5. Add a grid to the plot using plt.grid(True).
6. Display the plot using plt.show().

PROGRAM:

import matplotlib.pyplot as plt

# Sample data
months = ['January', 'February', 'March', 'April', 'May']

sales = [100, 150, 200, 250, 300]

# Plotting
plt.figure(figsize=(8, 5))

# Bar plot
plt.bar(months, sales, color='skyblue')

# Adding labels and title


plt.xlabel('Month')
plt.ylabel('Sales')
plt.title('Monthly Sales')

# Adding grid
plt.grid(True)
# Show plot
plt.show()

OUTPUT:

RESULT:
Ex. No:8 AGGREGATE ‘SUM’ AND ‘MIN’ FUNCTIONS ACROSS ALL THE COLUMNS IN A DATAFRAME USING DATA AGGREGATION FUNCTIONS

AIM:
Aggregate data using 'sum' and 'min' functions across all columns in a DataFrame
using data aggregation functions.

ALGORITHM:

1. Import the Pandas library as pd.


2. Define sample data in a dictionary format with columns A, B, and C.
3. Create a DataFrame using pd.DataFrame(data).
4. Aggregate data using 'sum' and 'min' functions across all columns:
• Use the agg function on the DataFrame (df) with parameters ['sum', 'min']
to specify the aggregation functions.
• Store the aggregated data in a new DataFrame (aggregated_data).
5. Display the aggregated data using print().

PROGRAM:

import pandas as pd

# Sample DataFrame
data = {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]}

df = pd.DataFrame(data)

# Aggregate using sum and min functions

aggregated_data = df.agg(['sum', 'min'])


# Display the aggregated data
print("Aggregated Data:")
print(aggregated_data)
OUTPUT:

Aggregated Data:
      A   B   C
sum  10  26  42
min   1   5   9

RESULT:
Ex. No:9 TIME SERIES DATA ANALYSIS

AIM:
Conduct comprehensive analysis and visualization of time series data.

ALGORITHM:

1. Generate Data: Create a time series dataset with dates ranging from '2024-01-01'
to '2024-04-09' and random values.
2. Display Initial Data: Print the first few rows of the generated data to inspect its
structure and values.
3. Plot Time Series Data: Visualize the time series data by plotting 'Date'
against 'Value'.
• Set up the plot with appropriate labels and titles.
• Enable grid lines for clarity.
4. Basic Data Analysis: Provide basic statistics of the data using describe()
function to understand its distribution and summary metrics.
5. Calculate Rolling Mean: Compute the rolling mean of the 'Value' column
using a window size of 7 to smooth out fluctuations.
6. Plot Rolling Mean: Overlay the original data and rolling mean on the same
plot to observe trends and changes over time.
• Label the lines appropriately and include a legend for clarity.
7. Display Plots: Show both plots to visualize the time series data and its rolling
mean.

PROGRAM:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate sample time series data

np.random.seed(0)

dates = pd.date_range('2024-01-01', periods=100)


values = np.random.randint(50, 200, size=100)
df = pd.DataFrame({'Date': dates, 'Value': values})

# Display the first few rows of the generated data


print("First few rows of the generated data:")
print(df.head())
# Plot the generated time series data
plt.figure(figsize=(10, 6))
plt.plot(df['Date'],df['Value'])
plt.title('Generated Time Series Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.grid(True)
plt.show()

# Basic Time Series Data Analysis


print("\nBasic statistics of the generated data:")
print(df.describe())
# Calculate rolling mean and plot
rolling_mean = df['Value'].rolling(window=7).mean()

plt.figure(figsize=(10, 6))
plt.plot(df['Date'], df['Value'], label='Original Data')
plt.plot(df['Date'], rolling_mean, label='Rolling Mean (window=7)')
plt.title('Rolling Mean of Time Series Data')

plt.xlabel('Date')
plt.ylabel('Value')

plt.legend()
plt.grid(True)

plt.show()
OUTPUT:

First few rows of the generated data:
        Date  Value
0 2024-01-01     97
1 2024-01-02    167
2 2024-01-03    117
3 2024-01-04    153
4 2024-01-05     59

Basic statistics of the generated data:
                      Date       Value
count                  100  100.000000
mean   2024-02-19 12:00:00  130.150000
min    2024-01-01 00:00:00   50.000000
25%    2024-01-25 18:00:00   91.250000
50%    2024-02-19 12:00:00  133.000000
75%    2024-03-15 06:00:00  169.250000
max    2024-04-09 00:00:00  199.000000
std                    NaN   44.147452
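Beyond the rolling mean, the same daily series can be aggregated to a coarser frequency with resample, which is the other standard smoothing step in time series work. A minimal sketch on the same generated data, taking monthly means (the 'MS' alias groups by calendar month, labelled at the month start):

```python
import pandas as pd
import numpy as np

# Same generated series as in the program above
np.random.seed(0)
dates = pd.date_range('2024-01-01', periods=100)
values = np.random.randint(50, 200, size=100)
df = pd.DataFrame({'Date': dates, 'Value': values})

# resample requires a DatetimeIndex, so move 'Date' into the index first
monthly_mean = df.set_index('Date')['Value'].resample('MS').mean()
print(monthly_mean)
```

With 100 days starting 2024-01-01, the series spans January through April, so the resampled result has four monthly entries.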
RESULT:
Ex. No:10 EXPLORE VARIOUS DATA PRE-PROCESSING OPTIONS USING BENCHMARK DATASETS

AIM:
Explore different data preprocessing options using benchmark datasets.

ALGORITHM:

1. Import the Pandas library as pd and load the iris dataset from scikit-learn.
2. Load the iris dataset using load_iris() function from sklearn.datasets.
3. Create a DataFrame (df) using the iris dataset's data and feature names.
4. Introduce missing values into the DataFrame for demonstration purposes.
5. Fill the missing values with the mean of each column.
6. Create a new DataFrame (df_filled) with missing values filled using the fillna()
method with the mean of each column.
7. Print the first few rows of the filled DataFrame to inspect the changes.

PROGRAM:

import pandas as pd
from sklearn.datasets import load_iris

# Load Iris dataset
iris = load_iris()

df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Add some missing values for demonstration
df.loc[::5, 'sepal length (cm)'] = None  # introduce missing values

# Fill missing values with the mean of the column


df_filled = df.fillna(df.mean())
print(df_filled.head())
OUTPUT:

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0           5.799167               3.5                1.4               0.2
1           4.900000               3.0                1.4               0.2
2           4.700000               3.2                1.3               0.2
3           4.600000               3.1                1.5               0.2
4           5.000000               3.6                1.4               0.2
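Mean imputation is only one of the preprocessing options this exercise asks you to explore; feature scaling is another common one. A minimal sketch of min-max scaling on the same Iris columns, done directly with Pandas arithmetic rather than scikit-learn's MinMaxScaler to keep it short:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Same Iris DataFrame as in the program above
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Min-max scaling maps every column onto the [0, 1] range
df_scaled = (df - df.min()) / (df.max() - df.min())
print(df_scaled.describe().loc[['min', 'max']])
```

After scaling, every column's minimum is 0 and maximum is 1, which keeps features with large raw ranges (sepal length) from dominating features with small ones (petal width) in distance-based methods.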

RESULT:
Ex. No:11 FORMULATE REAL BUSINESS PROBLEM SCENARIOS AS HYPOTHESES AND SOLVE USING STATISTICAL TESTING

AIM:
Formulate a real business problem scenario as hypotheses and test it using statistical methods.

ALGORITHM:

1. Define sample data representing two groups.


2. Formulate a business problem scenario and hypotheses.
3. Perform a t-test for independent samples using ttest_ind.
• Interpret the results based on the calculated p-value and
significance level.
4. Conclude whether to reject or fail to reject the null hypothesis.

PROGRAM:

import pandas as pd
from scipy.stats import ttest_ind

# Load the dataset (sample data for illustration purposes)


data = {'Group_A': [10, 12, 14, 16, 18], 'Group_B': [9, 11, 13, 15, 17]}
df = pd.DataFrame(data)

# Define the business problem scenario


print("Business Problem Scenario:")
print("A company wants to determine if there is a significant difference in productivity between Group A and Group B.")

# Define the hypothesis
print("\nHypothesis:")
print("Null Hypothesis (H0): There is no significant difference in productivity between Group A and Group B.")
print("Alternative Hypothesis (H1): There is a significant difference in productivity between Group A and Group B.")

# Perform t-test for independent samples
statistic, p_value = ttest_ind(df['Group_A'], df['Group_B'])

# Interpret the results
alpha = 0.05
print("\nStatistical Testing:")
print(f"T-statistic: {statistic}")
print(f"P-value: {p_value}")

if p_value < alpha:
    print("\nConclusion:")
    print("Reject the null hypothesis (H0). There is a significant difference in productivity between Group A and Group B.")
else:
    print("\nConclusion:")
    print("Fail to reject the null hypothesis (H0). There is no significant difference in productivity between Group A and Group B.")
OUTPUT:

Business Problem Scenario:
A company wants to determine if there is a significant difference in productivity between Group A and Group B.

Hypothesis:
Null Hypothesis (H0): There is no significant difference in productivity between Group A and Group B.
Alternative Hypothesis (H1): There is a significant difference in productivity between Group A and Group B.

Statistical Testing:
T-statistic: 0.5
P-value: 0.6305360755569764

Conclusion:
Fail to reject the null hypothesis (H0). There is no significant difference in productivity between Group A and Group B.

RESULT:
Ex. No:12 FORMULATE REAL BUSINESS PROBLEM SCENARIOS AS HYPOTHESES AND SOLVE USING PANDAS

AIM:
Solve real business problems using Pandas by formulating hypotheses and
conducting hypothesis testing.

ALGORITHM:

1. Load the dataset


• Create sample data representing employee salaries and
departments.
2. Determine if there's a significant difference in salaries between the Sales and
Marketing departments.
3. Formulate Hypotheses:
• Null Hypothesis (H0): No significant difference in salaries
between Sales and Marketing.
• Alternative Hypothesis (H1): Significant difference in salaries
between Sales and Marketing.
4. Filter and calculate mean salaries for Sales and Marketing departments.
5. Use t-test (ttest_ind) to compare salaries between departments. Calculate t-
statistic and p-value.
6. Interpret Results:
• Compare p-value to significance level (alpha).
• Reject or fail to reject null hypothesis based on p-value.
7. Based on hypothesis test results, determine if there's a significant difference in
salaries between departments.

PROGRAM:

import pandas as pd
from scipy.stats import ttest_ind

# Load the dataset (sample data for illustration purposes)


data = {'Employee_ID': [1, 2, 3, 4, 5],
        'Department': ['Sales', 'Marketing', 'Sales', 'Finance', 'Marketing'],
        'Salary': [50000, 60000, 55000, 70000, 65000]}
df = pd.DataFrame(data)

# Define the business problem scenario


print("Business Problem Scenario:")
print("A company wants to determine if there is a significant difference in salaries between the Sales department and the Marketing department.")

# Define the hypothesis
print("\nHypothesis:")
print("Null Hypothesis (H0): There is no significant difference in salaries between the Sales department and the Marketing department.")
print("Alternative Hypothesis (H1): There is a significant difference in salaries between the Sales department and the Marketing department.")

# Data Analysis using pandas
sales_salaries = df[df['Department'] == 'Sales']['Salary']
marketing_salaries = df[df['Department'] == 'Marketing']['Salary']

# Calculate mean salaries for each department
sales_mean_salary = sales_salaries.mean()
marketing_mean_salary = marketing_salaries.mean()

# Print mean salaries
print("\nMean Salary for Sales department:", sales_mean_salary)
print("Mean Salary for Marketing department:", marketing_mean_salary)

# Perform hypothesis testing
alpha = 0.05

if sales_mean_salary != marketing_mean_salary:
    print("\nStatistical Testing:")
    t_statistic, p_value = ttest_ind(sales_salaries, marketing_salaries)
    print(f"T-statistic: {t_statistic}")
    print(f"P-value: {p_value}")

    if p_value < alpha:
        print("\nConclusion:")
        print("Reject the null hypothesis (H0). There is a significant difference in salaries between the Sales department and the Marketing department.")
    else:
        print("\nConclusion:")
        print("Fail to reject the null hypothesis (H0). There is no significant difference in salaries between the Sales department and the Marketing department.")
else:
    print("\nConclusion:")
    print("Mean salaries are equal. Statistical testing is not required.")

OUTPUT:

Business Problem Scenario:
A company wants to determine if there is a significant difference in salaries between the Sales department and the Marketing department.

Hypothesis:
Null Hypothesis (H0): There is no significant difference in salaries between the Sales department and the Marketing department.
Alternative Hypothesis (H1): There is a significant difference in salaries between the Sales department and the Marketing department.

Mean Salary for Sales department: 52500.0
Mean Salary for Marketing department: 62500.0

Statistical Testing:
T-statistic: -2.8284271247461903
P-value: 0.1055728090000841

Conclusion:
Fail to reject the null hypothesis (H0). There is no significant difference in salaries between the Sales department and the Marketing department.

RESULT:
