Assignment 2

This document performs several preprocessing tasks on an iris dataset: it calculates completeness, replaces special values with NA, defines validation rules in a text file, checks for rule violations, and identifies outliers in sepal length using boxplots. No rules are found to be violated.

Uploaded by

BHAVIKA MALHOTRA

Q2.

Perform the following preprocessing tasks on the dirty_iris dataset.

i) Calculate the number and percentage of observations that are complete.

ii) Replace all the special values in data with NA.

iii) Define these rules in a separate text file and read them. (Use the editfile function in R (package editrules). Use a similar function in Python.) Print the resulting constraint object.
– Species should be one of the following values: setosa, versicolor or virginica.
– All measured numerical properties of an iris should be positive.
– The petal length of an iris is at least 2 times its petal width.
– The sepal length of an iris cannot exceed 30 cm.
– The sepals of an iris are longer than its petals.

iv) Determine how often each rule is broken (violatedEdits). Also summarize and plot the result.

v) Find outliers in sepal length using boxplot and boxplot.stats.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the iris dataset (seaborn's clean copy; the assignment's dirty_iris file is not loaded here)
iris = sns.load_dataset('iris')

# i) Calculate the number and percentage of observations that are complete
complete_observations = iris.dropna()
num_complete_observations = len(complete_observations)
percentage_complete = (num_complete_observations / len(iris)) * 100

print(f"Number of complete observations: {num_complete_observations}")
print(f"Percentage of complete observations: {percentage_complete:.2f}%")

Number of complete observations: 150


Percentage of complete observations: 100.00%
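
Every row of the clean seaborn copy is complete, so the calculation is easier to check on data that actually has gaps. The following is a minimal sketch on a made-up frame; `toy` and its values are hypothetical and are not taken from dirty_iris.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with two incomplete rows (not real dirty_iris data)
toy = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 6.3, 5.8],
    "petal_length": [1.4, 4.5, np.nan, 5.1],
    "species": ["setosa", "versicolor", "virginica", "virginica"],
})

# A row is complete when none of its fields is missing
complete_mask = toy.notna().all(axis=1)
num_complete = int(complete_mask.sum())
pct_complete = 100 * num_complete / len(toy)

print(f"{num_complete} of {len(toy)} rows complete ({pct_complete:.1f}%)")
```

Counting with a boolean mask gives the same number as `len(toy.dropna())` and keeps the mask available for inspecting the incomplete rows.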

# ii) Replace all the special values in data with NA


iris.replace(['?', '!','$','^'], pd.NA, inplace=True)
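
The replacement above targets placeholder strings only. If the special values also include non-finite numbers such as Inf and -Inf (an assumption about the dirty data, since the file itself is not shown here), a sketch along these lines may be closer to what the task asks. The frame `toy` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one infinite measurement and one placeholder string
toy = pd.DataFrame({
    "sepal_length": [5.1, np.inf, 6.3],
    "petal_width": ["0.2", "?", "1.8"],
})

cleaned = toy.copy()

# Numeric column: map +/- infinity to a missing value
cleaned["sepal_length"] = cleaned["sepal_length"].replace([np.inf, -np.inf], np.nan)

# Text column: anything that is not a number (such as "?") becomes missing
cleaned["petal_width"] = pd.to_numeric(cleaned["petal_width"], errors="coerce")

print(cleaned.isna().sum())
```

Using `pd.to_numeric(..., errors="coerce")` also restores a numeric dtype for columns that were read as text because of the placeholders.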

# iii) Define rules in a separate text file and read them
# Save the rules in a text file (e.g., rules.txt)
rules_filename = 'rules.txt'
with open(rules_filename, 'w') as file:
    file.write("""
species: setosa, versicolor, virginica
numerical_properties: positive
petal_length: >= 2 * petal_width
sepal_length: <= 30
sepals_length_greater_than_petals: sepal_length > petal_length""")

# Read rules from the text file
with open(rules_filename, 'r') as file:
    rules = file.read()

# Print the resulting constraint object


print("Rules:", rules)

Rules:
species: setosa, versicolor, virginica
numerical_properties: positive
petal_length: >= 2 * petal_width
sepal_length: <= 30
sepals_length_greater_than_petals: sepal_length > petal_length
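
The rules above are read back as plain text, so nothing in the file is actually executable. One possible way to get closer to the constraint object of editrules is to store each rule as a boolean expression that `DataFrame.eval` can evaluate. The sketch below assumes that file format; the file name `rules_expr.txt`, the helper `read_rules` and the two-row frame `toy` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Assumed file format: one boolean expression per line, '#' starts a comment
RULES_TEXT = """\
# validation rules for the iris measurements
species in ['setosa', 'versicolor', 'virginica']
sepal_length > 0 and sepal_width > 0 and petal_length > 0 and petal_width > 0
petal_length >= 2 * petal_width
sepal_length <= 30
sepal_length > petal_length
"""


def read_rules(path):
    """Return the non-empty, non-comment lines of a rule file."""
    with open(path) as fh:
        lines = [line.strip() for line in fh]
    return [line for line in lines if line and not line.startswith("#")]


# Write the rules to a separate text file, then read them back
path = os.path.join(tempfile.mkdtemp(), "rules_expr.txt")
with open(path, "w") as fh:
    fh.write(RULES_TEXT)
rules = read_rules(path)

# Hypothetical rows: the first satisfies every rule, the second does not
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.0],
    "sepal_width": [3.5, 2.8],
    "petal_length": [1.4, 4.2],
    "petal_width": [0.2, 1.3],
    "species": ["setosa", "tulip"],
})

# Each rule yields one True (satisfied) / False (broken) value per row
checks = pd.DataFrame({rule: toy.eval(rule) for rule in rules})
print(checks.T)
```

Keeping the expressions in the file means the rules can be edited without touching the checking code.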

# iv) Determine how often each rule is broken
# Note: this counter adds up the violating rows across all five rules
violated_rules_count = 0

# Rule 1: Species should be one of the following values
violated_rules_count += len(iris[~iris['species'].isin(['setosa', 'versicolor', 'virginica'])])

# Rule 2: All measured numerical properties of an iris should be positive
numerical_properties = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
violated_rules_count += len(iris[(iris[numerical_properties] <= 0).any(axis=1)])

# Rule 3: The petal length of an iris is at least 2 times its petal width
violated_rules_count += len(iris[iris['petal_length'] < 2 * iris['petal_width']])

# Rule 4: The sepal length of an iris cannot exceed 30 cm


violated_rules_count += len(iris[iris['sepal_length'] > 30])

# Rule 5: The sepals of an iris are longer than its petals
violated_rules_count += len(iris[iris['sepal_length'] <= iris['petal_length']])

print(f"Number of violated rules: {violated_rules_count}")

Number of violated rules: 0
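
The single counter above cannot show which rule was broken how often, and part iv) also asks for a summary and a plot. A minimal sketch of a per-rule breakdown follows. It assumes a made-up frame `toy` whose rows are chosen so that some rules fail, and the rule labels are illustrative names, not part of any library.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical rows, chosen so that some rules fail (not real dirty_iris data)
toy = pd.DataFrame({
    "sepal_length": [5.1, -1.0, 49.0, 4.0],
    "sepal_width": [3.5, 3.0, 3.2, 2.8],
    "petal_length": [1.4, 4.5, 5.0, 4.2],
    "petal_width": [0.2, 1.5, 2.6, 1.3],
    "species": ["setosa", "versicolor", "virginica", "tulip"],
})
numeric_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# One boolean column per rule: True where the row breaks that rule
violations = pd.DataFrame({
    "valid species": ~toy["species"].isin(["setosa", "versicolor", "virginica"]),
    "positive values": (toy[numeric_cols] <= 0).any(axis=1),
    "petal length >= 2 * width": toy["petal_length"] < 2 * toy["petal_width"],
    "sepal length <= 30": toy["sepal_length"] > 30,
    "sepal longer than petal": toy["sepal_length"] <= toy["petal_length"],
})

per_rule = violations.sum()        # how often each rule is broken
per_row = violations.sum(axis=1)   # how many rules each row breaks
print(per_rule)

# Horizontal bar chart of the per-rule counts
ax = per_rule.plot.barh(title="Violations per rule")
ax.set_xlabel("Number of rows")
plt.tight_layout()
# In an interactive session, call plt.show() to display the chart
```

Comparisons against missing values evaluate to False in pandas, so rows with NA in a measured column are not counted as violations of the numeric rules in this sketch.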

# v) Find outliers in sepal length using boxplot and boxplot.stats


ax = sns.boxplot(x='sepal_length', data=iris)
plt.title('Boxplot for Sepal Length')
plt.show()

# Get boxplot statistics (matplotlib's counterpart of R's boxplot.stats)
from matplotlib.cbook import boxplot_stats

stats = boxplot_stats(iris['sepal_length'].dropna().to_numpy())[0]
print("Boxplot Statistics:")
print({key: stats[key] for key in ('whislo', 'q1', 'med', 'q3', 'whishi')})

# Points beyond the whiskers are the outliers ("fliers")
outliers = stats['fliers']
print("Outliers in Sepal Length:")
print(outliers)
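
The rule a boxplot uses to flag outliers can also be spelled out by hand: a value is an outlier when it lies more than 1.5 times the interquartile range below the first quartile or above the third. The sketch below applies that rule to made-up values with one planted outlier. The array `values` is hypothetical, and the quartiles come from `np.percentile`, which can differ slightly from the hinges that R's boxplot.stats uses.

```python
import numpy as np

# Hypothetical sepal lengths in cm; 12.0 is planted as an obvious outlier
values = np.array([4.9, 5.0, 5.1, 5.4, 5.8, 6.1, 6.4, 6.7, 12.0])

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier
outliers = values[(values < lower_fence) | (values > upper_fence)]
print("Outliers:", outliers)
```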
