Dsbda Ass1
Data Analytics
Laboratory
Third Year 2019 Course
Prof.K.B.Sadafale
Assistant Professor
Computer Dept. GCOEAR, Avasari
Data Wrangling: I
➢ The aim is to make data more accessible for things like business
analytics or machine learning.
➢ Python is an interpreted language. This means Python programs are not
compiled and stored in a binary code file (e.g., an executable); instead,
the source is translated and executed by the interpreter directly (without
us seeing any executable code).
✓ Python Operators
Python Basic Input and Output
Python Output
In Python, we can simply use the print() function to print output.
For example:
print('Python is powerful')
Here,
Syntax of input()
input(prompt)
Output
Enter a number: 10
You Entered: 10
Data type of num: <class 'str'>
To convert user input into a number we can use int() or float() functions as:
num = int(input('Enter a number: '))
Here, the data type of the user input is converted from string to integer.
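The conversion can be sketched as follows. To keep the example runnable without a keyboard, a fixed string (the hypothetical name user_text) stands in for the value that input('Enter a number: ') would return; input() always returns a string.

```python
# Stand-in for: user_text = input('Enter a number: ')
user_text = '10'
print('Data type of user_text:', type(user_text))  # input() gives a str

# int() converts a whole-number string; float() handles decimals.
num = int(user_text)
price = float('6.3')

print('You Entered:', num)
print('Data type of num:', type(num))
print('Data type of price:', type(price))
```

If the text is not a valid number (for example 'ten'), int() and float() raise a ValueError.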
Python Data Types
Data Types Classes Description
num1 = 1.5
num2 = 6.3
Output
import module_name.member_name
2.Syntax:
➢ Open Data means the kind of data which is open for anyone
and everyone for access, modification, reuse, and sharing.
✓ https://www.kaggle.com
✓ https://rockcontent.com/blog/data-sources/
What is a CSV?
✓ CSV stands for “Comma Separated Values.” It is the simplest form of
storing data in tabular form as plain text.
✓ Structure of CSV:
Reading a CSV
Put the CSV file in the default working directory, e.g. “C:\Users\KBS”.
If it is in any other directory, give the full file path, e.g.
"D:\\Demo\\Salary_Data.csv"
Salary_Data.csv
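A minimal sketch of reading a CSV with an explicit close() call. Salary_Data.csv is not reproduced in these slides, so an in-memory text with hypothetical column names and values stands in for it; with a real file you would call open("Salary_Data.csv") instead of io.StringIO(...).

```python
import csv
import io

# Hypothetical contents standing in for Salary_Data.csv:
# the first line is the header, every later line is one record.
csv_text = "YearsExperience,Salary\n1.0,40000\n2.5,52000\n"

file = io.StringIO(csv_text)      # with a real file: open("Salary_Data.csv")
csvreader = csv.reader(file)
header = next(csvreader)          # first row holds the column names
rows = []
for row in csvreader:
    rows.append(row)              # each row is a list of strings
file.close()                      # must be closed by hand in this style

print(header)
print(rows)
```

Note that csv.reader returns every field as a string; numeric columns need an explicit int() or float() conversion.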
2 Implementing the above code using the with statement:
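import csv
rows = []
with open("Salary_Data.csv") as file:
    csvreader = csv.reader(file)
    header = next(csvreader)      # first row holds the column names
    for row in csvreader:
        rows.append(row)
print(header)
print(rows)
# No file.close() is needed: the with statement closes the file automatically.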
pandas
✓ Pandas is a Python library.
✓ Pandas is used to analyze data.
✓ Pandas is an open-source, BSD-licensed Python library providing
high-performance, easy-to-use data structures and data analysis tools
for the Python programming language.
✓ Python with Pandas is used in a wide range of academic and
commercial domains, including finance, economics, statistics,
analytics, etc.
Key Features of Pandas
✓ Fast and efficient DataFrame object with default and
customized indexing.
✓ Tools for loading data into in-memory data objects from
different file formats.
✓ Data alignment and integrated handling of missing data.
✓ Reshaping and pivoting of data sets.
✓ Label-based slicing, indexing and subsetting of large data
sets.
✓ Columns from a data structure can be deleted or inserted.
✓ Group by data for aggregation and transformations.
✓ High performance merging and joining of data.
✓ Time Series functionality.
Using pandas
1. Import pandas library
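A minimal sketch of this step, followed by loading CSV data into a DataFrame. The alias pd is the usual convention; the in-memory text below reuses the student rows from these slides and stands in for a file on disk, which you would pass to pd.read_csv as a path.

```python
import io
import pandas as pd   # 'pd' is the conventional alias

# In-memory stand-in for a CSV file on disk.
csv_text = "Name,M1 Score,M2 Score\nAlex,62,80\nBrad,45,56\nJoey,85,98\n"

df = pd.read_csv(io.StringIO(csv_text))   # with a real file: pd.read_csv('Students_Data.csv')
print(df.head())      # first rows of the DataFrame
print(df.dtypes)      # pandas infers numeric types for the score columns
```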
import csv
header = ['Name', 'M1 Score', 'M2 Score']
data = [['Alex', 62, 80], ['Brad', 45, 56], ['Joey', 85, 98]]
filename = 'Students_Data.csv'
with open(filename, 'w', newline="") as file:
    csvwriter = csv.writer(file)   # create a csvwriter object
    csvwriter.writerow(header)     # write the header
    csvwriter.writerows(data)      # write the rest of the data
Writing to a CSV file
2 Using .writelines()
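A minimal sketch of the .writelines() approach. Unlike csv.writer, writelines() adds neither commas nor newlines, so each row has to be turned into a complete text line first. The temporary output folder is an assumption made here so that no existing file is overwritten.

```python
import os
import tempfile

header = ['Name', 'M1 Score', 'M2 Score']
data = [['Alex', 62, 80], ['Brad', 45, 56], ['Joey', 85, 98]]

# Build one comma-separated text line per row, each ending in a newline.
lines = [','.join(str(value) for value in row) + '\n' for row in [header] + data]

# Hypothetical location: a fresh temporary folder.
filename = os.path.join(tempfile.mkdtemp(), 'Students_Data.csv')
with open(filename, 'w', newline='') as file:
    file.writelines(lines)

# Read the file back to check what was written.
with open(filename, newline='') as file:
    written = file.read()
print(written)
```

This simple join does not quote fields, so it is only safe when no value contains a comma; csv.writer handles that case for you.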
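import pandas as pd   # to_csv() is a DataFrame method; a plain list has no to_csv()
header = ['Name', 'M1 Score', 'M2 Score']
data = pd.DataFrame([['Alex', 62, 80], ['Brad', 45, 56], ['Joey', 85, 98]], columns=header)
data.to_csv('Stu_data.csv', index=False)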
Data cleaning
Data cleaning means fixing bad data in your data set.
Bad data could be:
✓ Empty cells
✓ Data in wrong format
✓ Wrong data
✓ Duplicates
data.isnull().sum()
How to Count NaN (Not a Number) values in Pandas DataFrame
✓ Using NumPy
➢ You can place np.nan each time you want to add a NaN value in
the DataFrame.
➢ For example, a single DataFrame column may contain 4 instances of
np.nan.
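A minimal sketch of such a DataFrame. The column name first_set matches the one used later in these slides; the numeric values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values; np.nan marks each missing entry (4 of them here).
data = {'first_set': [1, 2, 3, np.nan, 5, np.nan, 7, np.nan, 9, np.nan]}
df = pd.DataFrame(data)

print(df)                   # missing entries are displayed as NaN
print(df.isnull().sum())    # number of NaN values per column
```

Because NaN is a floating-point value, the whole column is stored as float even though the other entries were written as integers.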
(2) Importing a file with blank values
Product.csv
Product            Price
Desktop Computer   700
Tablet             (blank)
(blank)            500
Laptop             1200
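A minimal sketch of importing such a file. It assumes that the Tablet row has no price and that the row priced at 500 has no product name; an in-memory text stands in for Product.csv, which you would otherwise pass to pd.read_csv as a path.

```python
import io
import pandas as pd

# Stand-in for Product.csv; the two empty fields are the blank values.
csv_text = "Product,Price\nDesktop Computer,700\nTablet,\n,500\nLaptop,1200\n"

df = pd.read_csv(io.StringIO(csv_text))   # with a real file: pd.read_csv('Product.csv')
print(df)                  # blank fields are shown as NaN
print(df.isnull().sum())   # one NaN under each column
```

By default read_csv converts empty fields to NaN, so no extra arguments are needed here.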
The Example
(1) Count NaN values under a single DataFrame column
For example, let’s get the count of NaNs under the ‘first_set’
column.
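A minimal sketch of the single-column count. The data are hypothetical, and second_set is an assumed extra column added only so that the DataFrame has more than one column.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    'first_set': [1, 2, 3, np.nan, 5, np.nan, 7, np.nan, 9, np.nan],
    'second_set': ['a', 'b', np.nan, 'd', 'e', 'f', np.nan, 'h', 'i', 'j'],
})

# isna() marks each missing cell as True; sum() counts the True values.
count_first = df['first_set'].isna().sum()
print('NaNs under first_set:', count_first)
```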
(2) Count NaN values under the entire DataFrame
In that case, you may use the following syntax to get the total
count of NaNs:
df.isna().sum().sum()
(3) Count NaN values across a single DataFrame row:
✓ You’ll need to specify the index value that represents the row
needed.
✓ The index values are located on the left side of the
DataFrame (starting from 0):
➢ Let’s say that you want to count the NaN values across the
row with the index of 7:
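A minimal sketch of the row count, assuming a hypothetical two-column DataFrame in which the row with index 7 has a NaN in both columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data; row 7 is missing a value in both columns.
df = pd.DataFrame({
    'first_set': [1, 2, 3, np.nan, 5, np.nan, 7, np.nan, 9, np.nan],
    'second_set': ['a', 'b', np.nan, 'd', 'e', 'f', np.nan, np.nan, 'i', 'j'],
})

# df.loc[[7]] selects the row labelled 7 as a one-row DataFrame;
# the first sum() counts NaNs per column, the second adds them up.
row_count = df.loc[[7]].isna().sum().sum()
print('NaNs in row 7:', row_count)
```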
How to Remove Duplicates from Pandas DataFrame
df.drop_duplicates()
For example, let’s say that you have the following data about boxes,
where each box may have a different color or shape:
Color   Shape
Green   Rectangle
Green   Rectangle
Green   Square
Blue    Rectangle
Blue    Square
Red     Square
Red     Square
Red     Rectangle
You can then create a Pandas DataFrame from this data.
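A minimal sketch of building that DataFrame from the box data. The variable names boxes and df are choices made here for illustration.

```python
import pandas as pd

# The eight boxes from the table, one list per column.
boxes = {
    'Color': ['Green', 'Green', 'Green', 'Blue', 'Blue', 'Red', 'Red', 'Red'],
    'Shape': ['Rectangle', 'Rectangle', 'Square', 'Rectangle',
              'Square', 'Square', 'Square', 'Rectangle'],
}
df = pd.DataFrame(boxes, columns=['Color', 'Shape'])
print(df)
```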
Remove duplicates from Pandas DataFrame
Let’s say that you want to remove the duplicates across the two
columns of Color and Shape:
df.drop_duplicates()
Only the distinct values across the two columns remain.
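A minimal sketch of this step on the box data. Two rows are exact repeats (a second Green Rectangle and a second Red Square), so six rows should survive.

```python
import pandas as pd

df = pd.DataFrame({
    'Color': ['Green', 'Green', 'Green', 'Blue', 'Blue', 'Red', 'Red', 'Red'],
    'Shape': ['Rectangle', 'Rectangle', 'Square', 'Rectangle',
              'Square', 'Square', 'Square', 'Rectangle'],
})

# drop_duplicates() returns a new DataFrame and keeps the first
# occurrence of each repeated row; df itself is left unchanged.
df_unique = df.drop_duplicates()
print(df_unique)
```

The surviving rows keep their original index labels; call reset_index(drop=True) on the result if you want them renumbered from 0.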
✓ But what if you want to remove the duplicates on a specific
column, such as the Color column?
df.drop_duplicates(subset=['Color'])
df.drop_duplicates(subset=['Shape'])
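A minimal sketch of the subset parameter on the box data. With subset, only the listed column is compared, and the first row for each distinct value is kept.

```python
import pandas as pd

df = pd.DataFrame({
    'Color': ['Green', 'Green', 'Green', 'Blue', 'Blue', 'Red', 'Red', 'Red'],
    'Shape': ['Rectangle', 'Rectangle', 'Square', 'Rectangle',
              'Square', 'Square', 'Square', 'Rectangle'],
})

# One row per distinct color (Green, Blue, Red).
by_color = df.drop_duplicates(subset=['Color'])
print(by_color)

# One row per distinct shape (Rectangle, Square).
by_shape = df.drop_duplicates(subset=['Shape'])
print(by_shape)
```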
Get the Descriptive Statistics for Pandas DataFrame
✓ To get the descriptive statistics for a specific column in your
DataFrame:
df['dataframe_column'].describe()
✓ To get the descriptive statistics for the entire DataFrame:
df.describe(include='all')
Steps to Get the Descriptive Statistics for Pandas DataFrame
✓ Suppose you want to get the descriptive statistics for the ‘price’
field, which contains numerical data:
✓ df['price'].describe()
The output contains 6 decimal places. You may then
append astype(int) to the code to get integer values.
✓ Get the descriptive statistics for the ‘product’ field in the same
way, by calling describe() on that column.
✓ To get the descriptive statistics for the entire DataFrame:
df.describe(include='all')
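The steps above can be sketched as follows. The product names and prices are hypothetical, chosen only to give the ‘product’ and ‘price’ fields some content.

```python
import pandas as pd

# Hypothetical data for the 'product' and 'price' fields.
df = pd.DataFrame({
    'product': ['Desktop Computer', 'Tablet', 'Laptop', 'Tablet'],
    'price': [100, 200, 300, 400],
})

# Numerical column: count, mean, std, min, quartiles, max.
price_stats = df['price'].describe()
print(price_stats)

# astype(int) truncates the decimals of each statistic.
price_stats_int = df['price'].describe().astype(int)
print(price_stats_int)

# Text column: count, unique, top (most frequent value), freq.
product_stats = df['product'].describe()
print(product_stats)

# Entire DataFrame, numeric and non-numeric columns together.
all_stats = df.describe(include='all')
print(all_stats)
```

In the combined table, statistics that do not apply to a column (for example mean for ‘product’) appear as NaN.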