Dsbda Ass1

Uploaded by

ngak1214
Data Science and Big Data Analytics Laboratory
Third Year 2019 Course

Prof.K.B.Sadafale
Assistant Professor
Computer Dept. GCOEAR, Avasari
Data Wrangling: I

Perform the operations using Python on any open source dataset
Data Wrangling
➢ Data wrangling is the process of removing errors and combining
complex data sets to make them more accessible and easier to
analyze.

➢ Due to the rapid expansion of the amount of data and data
sources available today, storing and organizing large quantities of
data for analysis is becoming increasingly necessary.

➢ Data wrangling can be defined as the process of cleaning, organizing,
and transforming raw data into the desired format for analysts to use
for prompt decision-making.

➢ The aim is to make data more accessible for things like business
analytics or machine learning.

➢ Also known as data cleaning or data munging.


Importance of Data Wrangling
By a commonly cited estimate, data professionals spend almost 80% of their
time wrangling data, leaving a mere 20% for exploration and modelling.

✓ Making raw data usable. Accurately wrangled data helps ensure
that quality data enters the downstream analysis.
✓ Getting all data from various sources into a centralized location
so it can be used.
✓ Piecing together raw data according to the required format and
understanding the business context of the data.
✓ Automated data integration tools are used as data wrangling
techniques that clean and convert source data into a standard
format that can be used repeatedly according to end
requirements.
✓ Cleansing the data of noise and of flawed or missing elements.
✓ Data wrangling acts as a preparation stage for the data mining
process, which involves gathering data and making sense of it.
✓ Helping business users make concrete, timely decisions.
Introduction To PYTHON
➢ Python is a widely used general-purpose, high level programming
language.

➢ It was created by Guido van Rossum in 1991 and further developed by
the Python Software Foundation.

➢ It was designed with an emphasis on code readability, and its syntax
allows programmers to express their concepts in fewer lines of code.

➢ Python is a programming language that lets you work quickly and
integrate systems more efficiently.

➢ There are two major Python versions: Python 2 and Python 3.

➢ The two differ significantly and are not fully compatible. Python 2
reached end of life in January 2020, so new code should target Python 3.


Some examples of basic data wrangling tools are:
➢ Spreadsheets / Excel Power Query - the most basic manual data
wrangling tool

➢ OpenRefine - a data cleaning tool; its more advanced
transformations benefit from programming skills

➢ Tabula - a tool for extracting tables from PDF files

➢ Google DataPrep - a data service that explores, cleans,
and prepares data

➢ Data Wrangler - a data cleaning and transforming tool


Introduction To PYTHON
➢ It is used for:
   ➢ web development (server-side),
   ➢ software development,
   ➢ mathematics,
   ➢ system scripting.

➢ Python finds its use in developing programs for
graphics applications, text processing, and data
analysis, among others.
➢ Python is a general-purpose programming language and supports
multiple paradigms or ways of programming.

➢ We can write procedural programs (a sequence of steps) as well as
object-oriented programs (entities as objects that communicate by
passing messages) in it.

➢ It can also be used to write functional programs (applying and
composing functions), among others.

➢ Unlike C and C++, Python is interpreted.

➢ This means Python programs are not compiled ahead of time and stored
in a binary code file (e.g., an executable). Instead, the interpreter
translates the source into an intermediate form (bytecode in CPython,
the standard implementation) and executes it directly, without us
seeing any executable code.

➢ Thus, on your computer, python3 is an interpreter (and gcc is a compiler
for C programs).
➢ Python is a cross-platform programming language, which
means that it can run on multiple platforms like Windows,
macOS, Linux, and has even been ported to the Java and
.NET virtual machines.

➢ It is free and open-source.


Python Programming
Python Program to Add Two Numbers

To understand this example, you should have knowledge of
the following Python programming topics:

✓ Python Basic Input and Output

✓ Python Data Types

✓ Python Operators
Python Basic Input and Output
Python Output
In Python, we can simply use the print() function to print output.

For example:

print('Python is powerful')

# Output: Python is powerful


Syntax of print()

The full syntax of the print() function accepts 5 parameters:

print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False)

Here,

objects - value(s) to be printed
sep (optional) - the separator placed between multiple objects inside print(). Default: a single space
end (optional) - the string added after the last object, such as a new line "\n" or a tab "\t". Default: "\n"
file (optional) - where the values are printed. Its default value is sys.stdout
(screen)
flush (optional) - boolean specifying whether the output is forcibly flushed. Default:
False
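
The effect of sep, end and file can be seen in a minimal sketch. It writes to an in-memory buffer (io.StringIO) instead of the screen so that the result can be inspected; the printed values are illustrative only.

```python
import io

# Build the output in an in-memory text buffer so it can be inspected afterwards.
buffer = io.StringIO()

# sep replaces the default single space placed between the objects.
print('2024', '01', '15', sep='-', file=buffer)

# end replaces the default newline, so the next print() continues on the same line.
print('Loading', end='... ', file=buffer)
print('done', file=buffer)

captured = buffer.getvalue()
print(captured, end='')
```

Leaving out file=buffer sends the same text to sys.stdout, which is the usual case.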
Python Input
While programming, we might want to take the input from the user.

In Python, we can use the input() function.

Syntax of input()

input(prompt)

Here, prompt is the string we wish to display on the screen.
It is optional.
Example: Python User Input

# using input() to take user input


num = input('Enter a number: ')
print('You Entered:', num)
print('Data type of num:', type(num))

Output

Enter a number: 10
You Entered: 10
Data type of num: <class 'str'>

To convert user input into a number, we can use the int() or float() functions:
num = int(input('Enter a number: '))

Here, the data type of the user input is converted from string to integer.
If the text typed is not a valid integer, int() raises a ValueError.
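
Because input() always returns a string, the conversion can fail on unexpected text. The sketch below shows one possible way to guard it; the helper name to_number is hypothetical, and plain strings stand in for what a user might type.

```python
def to_number(text):
    """Convert a string, such as the value returned by input(), to int or float.

    Returns None when the text is not a valid number.
    """
    text = text.strip()
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return None


# Strings stand in for what a user might type at the prompt.
for typed in ['10', ' 6.3 ', 'ten']:
    print(repr(typed), '->', to_number(typed))
```

In an interactive program the call would be to_number(input('Enter a number: ')).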
Python Data Types
Data Types   Classes               Description
Numeric      int, float, complex   holds numeric values
String       str                   holds a sequence of characters
Sequence     list, tuple, range    holds a collection of items
Mapping      dict                  holds data in key-value pair form
Boolean      bool                  holds either True or False
Set          set, frozenset        holds a collection of unique items
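
The class of any value can be checked with the built-in type() function. A small sketch with one illustrative sample value per class:

```python
# One sample value per built-in class.
samples = [42, 3.14, 2 + 3j, 'text', [1, 2], (1, 2), range(3),
           {'key': 'value'}, True, {1, 2}, frozenset({1, 2})]

# type(value).__name__ gives the class name as a plain string.
type_names = [type(value).__name__ for value in samples]

for value, name in zip(samples, type_names):
    print(f'{value!r:>20}  {name}')
```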


Example 1: Add Two Numbers

# This program adds two numbers
num1 = 1.5
num2 = 6.3

# Add two numbers (named total so the built-in sum() is not shadowed)
total = num1 + num2

# Display the sum
print('The sum of {0} and {1} is {2}'.format(num1, num2, total))

Output

The sum of 1.5 and 6.3 is 7.8


Example 2: Add Two Numbers With User Input

# Store input numbers (input() returns strings)
num1 = input('Enter first number: ')
num2 = input('Enter second number: ')

# Convert to float and add (named total so the built-in sum() is not shadowed)
total = float(num1) + float(num2)

# Display the sum
print('The sum of {0} and {1} is {2}'.format(num1, num2, total))

Output

Enter first number: 1.5
Enter second number: 6.3
The sum of 1.5 and 6.3 is 7.8
Jupyter Notebook
➢ The Jupyter Notebook is a powerful tool for
interactively developing and presenting data science projects.

➢ The easiest way for a beginner to get started with Jupyter
Notebooks is by installing Anaconda.

➢ Anaconda is one of the most widely used Python distributions for data
science and comes pre-loaded with many of the most popular libraries
and tools.

➢ The Python libraries included in Anaconda include NumPy, pandas,
and Matplotlib; the full list of 1000+ packages is too long to
reproduce here.
Assignment No 1
Data Wrangling, I
Perform the following operations using Python on any open source dataset (e.g.,
data.csv)
1. Import all the required Python Libraries.
2. Locate an open source dataset from the web (e.g. https://www.kaggle.com).
Provide a clear description of the data and its source (i.e., URL of the web
site).
3. Load the dataset into a pandas DataFrame.
4. Data Preprocessing: check for missing values in the data using the pandas
isnull() function, and use the describe() function to get some initial
statistics. Provide variable descriptions, types of variables, etc. Check the
dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables
by checking the data types (i.e., character, numeric, integer, factor, and
logical) of the variables in the data set. If variables are not in the correct data
type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.
In addition to the codes and outputs, explain every operation that you do in the
above steps and explain everything that you do to import/read/scrape the
data set.
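
Steps 4 to 6 can be sketched as follows. This is only one possible approach: the small DataFrame and its column names are hypothetical stand-ins for a downloaded dataset, and the mapping and one-hot encoding are two common choices rather than the required ones.

```python
import pandas as pd

# Hypothetical stand-in for a downloaded dataset; a real run would use pd.read_csv().
df = pd.DataFrame({
    'name': ['Asha', 'Ravi', 'Meera', 'John'],
    'age': ['21', '22', None, '20'],          # numbers stored as text, one missing
    'gender': ['F', 'M', 'F', 'M'],
    'passed': ['yes', 'no', 'yes', 'yes'],
})

# Step 4: dimensions, missing values per column, initial statistics.
print(df.shape)
print(df.isnull().sum())
print(df.describe(include='all'))

# Step 5: inspect the data types, then convert 'age' from text to a numeric type.
print(df.dtypes)
df['age'] = pd.to_numeric(df['age'])

# Step 6: turn categorical variables into quantitative ones.
df['passed'] = df['passed'].map({'yes': 1, 'no': 0})   # binary label to 0/1
df = pd.get_dummies(df, columns=['gender'])            # one-hot encode 'gender'
print(df)
```

One-hot encoding adds one indicator column per category (here gender_F and gender_M), which avoids implying an order between categories.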
1.Import all the required Python Libraries

➢ import in Python is similar to #include header_file in
C/C++.
➢ Python modules can get access to code from another
module by importing the file/function using import.
import math
print(math.pi)
1. Syntax:

import module_name

2. Syntax:

from module_name import member_name
from module_name import *

from math import pi

In the above code, the module name math itself is not made available;
rather, just pi has been imported as a variable.
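
The import forms can be compared side by side in a short sketch. Only standard library modules are used; the alias stats is an arbitrary choice, in the same spirit as the common "import pandas as pd".

```python
import math                      # bind the module; members are accessed as math.<name>
from math import pi, sqrt        # bind selected members directly
import statistics as stats       # bind a module under a shorter alias

circle_area = math.pi * 2 ** 2   # via the module name
same_area = pi * 2 ** 2          # via the directly imported name
root = sqrt(16)
average = stats.mean([62, 45, 85])

print(circle_area, same_area, root, average)
```

Importing specific names is usually preferred over "from module_name import *", which makes it hard to see where a name came from.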
2. Locate an open source data from the web
(e.g. https://www.kaggle.com)

➢ Open data is data that anyone can access, modify, reuse, and share.

➢ The following links lead to open source datasets:

✓ https://www.kaggle.com

✓ https://rockcontent.com/blog/data-sources/
What is a CSV?
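✓ CSV stands for "Comma Separated Values." It is a simple form of storing
data in tabular form as plain text.

✓ Structure of CSV: each line is one record, the fields within a record
are separated by commas, and the first line usually holds the column names.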
Reading a CSV
Put the CSV file in the default working directory, e.g. "C:\Users\KBS".
For a file in any other directory, give the full path, e.g.
"D:\\Demo\\Salary_Data.csv"

For Ubuntu: file = open("/home/student/Desktop/Demo.csv")

import csv
file = open("Salary_Data.csv")
csvreader = csv.reader(file)
header = next(csvreader)      # the first row holds the column names
print(header)
rows = []
for row in csvreader:         # the remaining rows hold the data
    rows.append(row)
print(rows)
file.close()
Output

Salary_Data.csv
2 Implementing the above code using the with statement:

import csv
rows = []
with open("Salary_Data.csv", newline="") as file:
    csvreader = csv.reader(file)
    header = next(csvreader)
    for row in csvreader:
        rows.append(row)
print(header)
print(rows)
# No file.close() is needed: the with statement closes the file automatically.
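
To try the reader without a file on disk, the same steps can be run on an in-memory string. This is a sketch only: the column names and values below are hypothetical and are not the real contents of Salary_Data.csv.

```python
import csv
import io

# Hypothetical sample content standing in for a file such as Salary_Data.csv.
sample = "YearsExperience,Salary\n1.1,39343\n2.0,43525\n3.2,54445\n"

# io.StringIO lets csv.reader treat the string like an opened file.
with io.StringIO(sample, newline="") as file:
    csvreader = csv.reader(file)
    header = next(csvreader)          # ['YearsExperience', 'Salary']
    rows = [row for row in csvreader]

# csv.reader returns every field as a string, so convert before doing arithmetic.
salaries = [float(row[1]) for row in rows]
print(header)
print(rows)
print(sum(salaries) / len(salaries))
```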
pandas
✓ Pandas is a Python library.
✓ Pandas is used to analyze data.
✓ Pandas is an open-source, BSD-licensed Python library providing
high-performance, easy-to-use data structures and data
analysis tools for the Python programming language.
✓ Python with Pandas is used in a wide range of academic and
commercial fields, including finance, economics, statistics,
analytics, etc.
Key Features of Pandas
✓ Fast and efficient DataFrame object with default and
customized indexing.
✓ Tools for loading data into in-memory data objects from
different file formats.
✓ Data alignment and integrated handling of missing data.
✓ Reshaping and pivoting of data sets.
✓ Label-based slicing, indexing and subsetting of large data
sets.
✓ Columns from a data structure can be deleted or inserted.
✓ Group by data for aggregation and transformations.
✓ High performance merging and joining of data.
✓ Time Series functionality.
Using pandas
1. Import pandas library

2. Load CSV files to pandas using read_csv()

Basic Syntax: pandas.read_csv(filename, delimiter=',')

3. Extract the field names

.columns is used to obtain the header/field names.

4. Extract the rows

All the data of a data frame can be accessed using the
field names.
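
The four steps can be sketched as follows. An in-memory string is used in place of a file so that the example is self-contained; with a real dataset, the file path would be passed to read_csv() instead.

```python
import io
import pandas as pd

# Sample content in place of a file; with a real file, pass its path to read_csv().
sample = io.StringIO("Name,M1 Score,M2 Score\nAlex,62,80\nBrad,45,56\nJoey,85,98\n")

df = pd.read_csv(sample, delimiter=',')   # step 2: load the CSV

field_names = list(df.columns)            # step 3: header/field names
m1_scores = df['M1 Score'].tolist()       # step 4: one column, selected by field name
first_row = df.iloc[0].tolist()           # a single row, selected by position

print(field_names)
print(m1_scores)
print(first_row)
```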
Writing to a CSV file
1 Using csv.writer

Let's assume we are recording data for 3 students (Name, M1 Score, M2 Score)

import csv
header = ['Name', 'M1 Score', 'M2 Score']
data = [['Alex', 62, 80], ['Brad', 45, 56], ['Joey', 85, 98]]
filename = 'Students_Data.csv'
with open(filename, 'w', newline="") as file:
    csvwriter = csv.writer(file)   # create a csvwriter object
    csvwriter.writerow(header)     # write the header
    csvwriter.writerows(data)      # write the rest of the data
Writing to a CSV file
2 Using file.write()

header = ['Name', 'M1 Score', 'M2 Score']
data = [['Alex', 62, 80], ['Brad', 45, 56], ['Joey', 85, 98]]
filename = 'Student_scores.csv'
with open(filename, 'w') as file:
    # join the fields with commas so that no trailing separator is written
    file.write(','.join(header) + '\n')
    for row in data:
        file.write(','.join(str(x) for x in row) + '\n')
Writing to a CSV file
3. Using pandas

import pandas as pd

header = ['Name', 'M1 Score', 'M2 Score']

data = [['Alex', 62, 80], ['Brad', 45, 56], ['Joey', 85, 98]]

data = pd.DataFrame(data, columns=header)

data.to_csv('Stu_data.csv', index=False)   # index=False leaves out the row index column
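
A quick way to check what was written is to read the file back. The sketch below does this with the csv module only, in a temporary directory so that no file is left behind; the temporary location is an assumption made for the example.

```python
import csv
import os
import tempfile

header = ['Name', 'M1 Score', 'M2 Score']
data = [['Alex', 62, 80], ['Brad', 45, 56], ['Joey', 85, 98]]

# Write into a temporary directory so the check leaves no files behind.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'Students_Data.csv')
    with open(path, 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(header)
        writer.writerows(data)

    # Read the file back to confirm what was actually stored.
    with open(path, newline='') as file:
        read_back = list(csv.reader(file))

print(read_back)
```

Note that the scores come back as strings: CSV stores plain text, so numeric columns need converting again after reading.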
Data cleaning
Data cleaning means fixing bad data in your data set.

Bad data could be:

Empty cells
Data in wrong format
Wrong data
Duplicates

Sources of Missing Values


✓ User forgot to fill in a field.
✓ Data was lost while transferring manually from a legacy database.
✓ There was a programming error.
✓ Users chose not to fill out a field tied to their beliefs about how the
results would be used or interpreted.
isnull() can be called on:
dataframes
dataframe columns

The output will be an object of the same size as your
dataframe (or column) that contains boolean True/False values.

These boolean values indicate which dataframe values were
missing.
COLUMN SYNTAX

COUNT THE MISSING VALUES IN EVERY COLUMN OF A DATAFRAME

data.isnull().sum()
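
A minimal sketch of isnull() on a whole DataFrame and on a single column. The DataFrame contents and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries.
data = pd.DataFrame({
    'name': ['Alex', 'Brad', None],
    'score': [62.0, np.nan, 85.0],
})

mask = data.isnull()                      # DataFrame of booleans, same shape as data
column_mask = data['score'].isnull()      # Series of booleans for one column
missing_per_column = data.isnull().sum()  # True counts as 1 when summed

print(mask)
print(column_mask)
print(missing_per_column)
```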
How to Count NaN (Not a Number) values in Pandas DataFrame

Ways to create NaN values in a Pandas DataFrame:

✓ Using NumPy

✓ Importing a file with blank values

(1) Using NumPy

➢ You can place np.nan each time you want to add a NaN value in
the DataFrame.
➢ For example, in the code below, there are 4 instances of np.nan under a
single DataFrame column:
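
The original code is not reproduced in these notes; a sketch along the same lines is given here. The column name first_set is taken from the text, while the numeric values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values; np.nan marks each of the 4 missing entries.
df = pd.DataFrame({
    'first_set': [1, 2, 3, 4, 5, np.nan, 6, 7, np.nan, np.nan, 8, 9, 10, np.nan]
})

nan_count = df['first_set'].isna().sum()
print(df)
print(nan_count)
```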
(2) Importing a file with blank values
Product            Price
Desktop Computer   700
Tablet             (blank)
(blank)            500
Laptop             1200

Product.csv
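
When such a file is read with pandas, each blank cell becomes NaN. The sketch below uses an in-memory stand-in for Product.csv, assuming that the Tablet row has no price and that the row priced 500 has no product name.

```python
import io
import pandas as pd

# In-memory stand-in for Product.csv with two blank cells (assumed layout).
product_csv = io.StringIO(
    "Product,Price\nDesktop Computer,700\nTablet,\n,500\nLaptop,1200\n"
)

df = pd.read_csv(product_csv)
blank_cells = df.isna().sum().sum()

print(df)
print(blank_cells)
```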
The Example

Suppose you created the following DataFrame that
contains NaN values:
How to count the NaN values in the above DataFrame for the
following 3 scenarios:

✓ Under a single DataFrame column

✓ Under the entire DataFrame

✓ Across a single DataFrame row


(1) Count NaN values under a single DataFrame column

For example, let's get the count of NaNs under the 'first_set'
column:
(2) Count NaN values under the entire DataFrame

In that case, you may use the following syntax to get the total
count of NaNs:

df.isna().sum().sum()
(3) Count NaN values across a single DataFrame row:

✓ You’ll need to specify the index value that represents the row
needed.
✓ The index values are located on the left side of the
DataFrame (starting from 0):
➢ Let’s say that you want to count the NaN values across the
row with the index of 7:
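
The three scenarios can be sketched together. The DataFrame below is hypothetical; only the column name first_set and the row index 7 come from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame; only 'first_set' and row index 7 come from the text.
df = pd.DataFrame({
    'first_set':  [1, 2, 3, 4, 5, np.nan, 6, 7, np.nan, np.nan],
    'second_set': ['a', 'b', np.nan, np.nan, 'c', 'd', 'e', np.nan, np.nan, 'f'],
})

count_in_column = df['first_set'].isna().sum()   # (1) single column
count_in_frame = df.isna().sum().sum()           # (2) entire DataFrame
count_in_row = df.loc[[7]].isna().sum().sum()    # (3) row with index label 7

print(count_in_column, count_in_frame, count_in_row)
```

df.loc[[7]] selects the row by its index label and keeps it as a one-row DataFrame, so the same isna().sum().sum() pattern applies.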
How to Remove Duplicates from Pandas DataFrame

✓ You can apply the following syntax to remove duplicates from
your DataFrame:

df.drop_duplicates()
For example, let's say that you have the following data about boxes,
where each box may have a different color or shape:

Color   Shape
Green   Rectangle
Green   Rectangle
Green   Square
Blue    Rectangle
Blue    Square
Red     Square
Red     Square
Red     Rectangle

Create a Pandas DataFrame using this code:
Remove duplicates from Pandas DataFrame

To remove duplicates from the DataFrame, you may use the
following syntax:

df.drop_duplicates()

Let’s say that you want to remove the duplicates across the two
columns of Color and Shape.
As you can see, only the distinct values across the two columns remain:
✓ But what if you want to remove the duplicates on a specific
column, such as the Color column?

✓ In that case, you can specify the column name using
subset:

df.drop_duplicates(subset=['Color'])
df.drop_duplicates(subset=['Shape'])
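
A sketch that builds the boxes DataFrame from the data listed in the text and applies the three calls. The variable names are illustrative.

```python
import pandas as pd

# The box data listed in the text.
boxes = {
    'Color': ['Green', 'Green', 'Green', 'Blue', 'Blue', 'Red', 'Red', 'Red'],
    'Shape': ['Rectangle', 'Rectangle', 'Square', 'Rectangle',
              'Square', 'Square', 'Square', 'Rectangle'],
}
df = pd.DataFrame(boxes)

distinct_rows = df.drop_duplicates()                     # compare all columns
distinct_colors = df.drop_duplicates(subset=['Color'])   # compare the Color column only
distinct_shapes = df.drop_duplicates(subset=['Shape'])   # compare the Shape column only

print(distinct_rows)
print(distinct_colors)
print(distinct_shapes)
```

By default drop_duplicates() keeps the first occurrence of each duplicate and returns a new DataFrame; df itself is left unchanged unless the result is assigned back.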
Get the Descriptive Statistics for Pandas DataFrame
✓ To get the descriptive statistics for a specific column in your
DataFrame:

df['dataframe_column'].describe()

✓ To get the descriptive statistics for an entire DataFrame:

df.describe(include='all')
Steps to Get the Descriptive Statistics for Pandas DataFrame

Step 1: Collect the Data

To start, you'll need to collect the data for your DataFrame.

For example, here is a simple dataset that can be used for our DataFrame:

product   price   year
A         22000   2014
B         27000   2015
C         25000   2016
C         29000   2017
D         35000   2018
Step 2: Create the DataFrame
Here is the code to create the DataFrame for our example:
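
The original code is not reproduced in these notes; one straightforward way to build the DataFrame from the dataset in Step 1 is sketched here.

```python
import pandas as pd

# The dataset from Step 1, one list per column.
data = {
    'product': ['A', 'B', 'C', 'C', 'D'],
    'price': [22000, 27000, 25000, 29000, 35000],
    'year': [2014, 2015, 2016, 2017, 2018],
}
df = pd.DataFrame(data)
print(df)
```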
Step 3: Get the Descriptive Statistics for Pandas DataFrame

✓ To get the descriptive statistics for a single column, use the template:

df['dataframe_column'].describe()

✓ For example, to get the descriptive statistics for the 'price' field,
which contains numerical data:

df['price'].describe()

✓ The output contains 6 decimal places. You may add astype(int) to the
code to get integer values:

df['price'].describe().astype(int)

✓ To get the descriptive statistics for the 'product' field, which
contains categorical data:

df['product'].describe()

✓ To get the descriptive statistics for the entire DataFrame:

df.describe(include='all')
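
The calls from Step 3 can be run together on the example dataset. This is a self-contained sketch; the variable names holding the results are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'B', 'C', 'C', 'D'],
    'price': [22000, 27000, 25000, 29000, 35000],
    'year': [2014, 2015, 2016, 2017, 2018],
})

price_stats = df['price'].describe()        # count, mean, std, min, quartiles, max
price_stats_int = price_stats.astype(int)   # truncates the decimals
product_stats = df['product'].describe()    # count, unique, top, freq
all_stats = df.describe(include='all')      # every column; inapplicable cells are NaN

print(price_stats_int)
print(product_stats)
print(all_stats)
```

For a numeric column describe() reports count, mean, std, min, the quartiles and max, while for a text column it reports count, unique, top (the most frequent value) and freq.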
