Data Wrangling 1

The document outlines a laboratory exercise for a Data Science and Big Data Analytics course, focusing on data wrangling using the Iris dataset. It details steps including importing libraries, loading the dataset, preprocessing data, checking for missing values, and encoding categorical variables. The exercise emphasizes data manipulation and visualization techniques using Python's pandas, numpy, seaborn, and matplotlib libraries.


Third Year Engineering (2019 Pattern)

Course Code: 310256


Course Name: Data Science and Big Data Analytics Laboratory
Group A
1) Data Wrangling I
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Step 2: Locate an open-source dataset
# I'll use the "Iris" dataset from the UCI Machine Learning Repository
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Column names based on the dataset documentation
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width",
           "species"]

# Step 3: Load the dataset into a pandas dataframe
df = pd.read_csv(url, names=columns)

# Step 4: Data Preprocessing
print("\nBasic Statistics of the Dataset:\n")
print(df.describe())  # Provides basic statistics of numerical variables

print("\nChecking for Missing Values:\n")
print(df.isnull().sum())  # Check for missing values

print("\nDataset Dimensions (Rows, Columns):", df.shape)

# Step 5: Data Formatting and Normalization
print("\nData Types Before Conversion:\n")
print(df.dtypes)

# Convert categorical variable 'species' to a categorical data type
df['species'] = df['species'].astype('category')

print("\nData Types After Conversion:\n")
print(df.dtypes)

# Step 6: Convert Categorical Variables into Quantitative Variables
print("\nEncoding Categorical Variable 'species':\n")
df['species_encoded'] = df['species'].cat.codes
print(df.head())
Explanation of Each Step:
1. Import Libraries:
o pandas: Handles dataframes and data manipulation.
o numpy: Supports numerical operations.
o seaborn & matplotlib: Used for visualization (see the plotting sketch after this list).
2. Dataset Selection:
o The Iris dataset is a well-known dataset for classification tasks.
o It is sourced from the UCI Machine Learning Repository (Iris Dataset).
3. Loading the Dataset:
o Read the dataset directly from the web into a pandas dataframe.
o Assign column names based on the dataset documentation.
4. Data Preprocessing:
o Use .describe() to get summary statistics.
o Check for missing values using .isnull().sum() (a handling sketch follows this list).
o Print the dataset dimensions.
5. Data Formatting and Normalization:
o Check data types using .dtypes.
o Convert the categorical column species into a categorical data type (a normalization sketch follows this list).
6. Encoding Categorical Variables:
o Convert the species categorical column into a numerical format using .cat.codes (an alternative one-hot encoding sketch follows this list).
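
Note on missing values: the check in Step 4 only counts them. A minimal sketch of how missing values could be handled if any were found is shown below; it assumes the df and column names from the code above, and the Iris dataset itself contains no missing values, so these calls leave the data unchanged here.

# Illustrative handling of missing values (Iris has none, so df is unchanged)
numeric_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())  # fill numeric gaps with column means
df = df.dropna()  # drop any rows that are still incomplete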
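
Note on normalization: Step 5 is titled "Data Formatting and Normalization", but the code above only converts the data type of species. A minimal sketch of min-max normalization of the numeric columns is given below; the *_norm column names are illustrative and not part of the original exercise.

# Min-max normalization: rescale each numeric column to the range [0, 1]
numeric_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
for col in numeric_cols:
    df[col + "_norm"] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
print(df[[col + "_norm" for col in numeric_cols]].describe())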
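
Note on encoding: Step 6 assigns integer codes with .cat.codes. An alternative, equally common approach is one-hot encoding with pandas' get_dummies; the sketch below is an optional variation, not a required step of the exercise.

# One-hot encoding: one 0/1 indicator column per species value
species_dummies = pd.get_dummies(df["species"], prefix="species")
df = pd.concat([df, species_dummies], axis=1)
print(df.head())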
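
Note on visualization: seaborn and matplotlib are imported in Step 1 but not used above. A minimal sketch of the kind of visualization the exercise description refers to, a pairplot of the numeric features coloured by species, is shown below.

# Pairwise scatter plots of the numeric features, coloured by species
sns.pairplot(df, hue="species",
             vars=["sepal_length", "sepal_width", "petal_length", "petal_width"])
plt.show()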

OUTPUT-
