Module-I Data Wrangling with Python
What Is Data Wrangling?
Data Wrangling is the process of cleaning, organizing, and transforming raw data into a
more usable format to prepare it for analysis. This is crucial because raw data often contains
errors, inconsistencies, or irrelevant information that can skew the results of any analysis.
Data wrangling ensures that the data is in the right structure and format to derive
meaningful insights.
Importance of Data Wrangling:
1. Improves Data Quality: It ensures that data is accurate, complete, and relevant,
minimizing errors in analysis.
2. Increases Efficiency: Properly wrangled data makes analysis faster, leading to
quicker decision-making.
3. Ensures Consistency: Cleaning the data eliminates duplicate or redundant
information.
4. Enhances Data Usability: Raw data is often unstructured and requires wrangling to
make it suitable for processing by algorithms and machine learning models.
How Is Data Wrangling Performed?
1. Data Collection: Gather raw data from various sources such as databases, APIs, or
files.
2. Data Exploration: Understand the structure, format, and potential issues in the data.
3. Data Cleaning: Remove duplicates, handle missing values, correct inconsistencies, and
deal with outliers.
4. Data Transformation: Convert the data into a format suitable for analysis, such as
normalizing, aggregating, or converting data types.
5. Data Integration: Combine multiple data sources into a cohesive dataset.
6. Data Validation: Ensure the final dataset meets the requirements and is error-free.
Tasks of Data Wrangling:
- Handling missing data (filling, interpolation, or deletion).
- Correcting data inconsistencies (format errors, incorrect data types).
- Filtering and selecting relevant data.
- Removing duplicates and irrelevant data.
- Aggregating data (grouping or summarizing).
- Transforming and normalizing data (scaling, converting categorical to numerical).
Data Wrangling Tools:
1. Python Libraries: Pandas, NumPy, Matplotlib, Seaborn.
2. R Programming: dplyr, tidyr.
3. SQL: For database querying and manipulation.
4. Excel/Google Sheets: For smaller data wrangling tasks.
5. OpenRefine: A powerful tool for data cleaning and transformation.
Introduction to Python
Python is a versatile, high-level programming language widely used for data science,
machine learning, web development, and automation. Its simplicity, readability, and rich
library ecosystem make it a popular choice for beginners and professionals alike.
Python Basics:
- Variables: Used to store data values.
```python
x = 5
name = 'John'
```
- Data Types: int, float, str, list, dict, etc.
- Control Structures: if, for, while, etc. (a short combined example follows this list).
- Functions: Used to modularize code.
```python
def greet():
    print('Hello, World!')
```
- Libraries: Python has extensive libraries for data wrangling, including pandas, numpy,
and csv.
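A minimal sketch combining several of these basics (the variable names and values below are illustrative):
```python
# Illustrative list of temperature readings (float data type stored in a list)
readings = [21.5, 23.0, 19.8, 25.1]

# A for loop with an if/else condition and an f-string for output
for value in readings:
    if value > 22:
        print(f"{value} is above the threshold")
    else:
        print(f"{value} is within range")
```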
Data Meant to Be Read by Machines
Machine-readable data refers to structured data that computers can process directly. Some
common formats include:
1. CSV (Comma-Separated Values): A simple file format for tabular data.
2. JSON (JavaScript Object Notation): A lightweight format for storing and exchanging
data, commonly used in APIs.
3. XML (eXtensible Markup Language): A markup language that defines rules for
encoding documents in a format that is both human-readable and machine-readable.
CSV Data
CSV (Comma-Separated Values) files are used to store tabular data, with each line in the
file representing a row, and each field separated by a comma.
Example:
```csv
name,age,city
John,25,New York
Jane,30,Los Angeles
```
JSON Data
JSON is used to represent structured data in a readable text format, often used in web
applications for transmitting data.
Example:
```json
{
    "name": "John",
    "age": 25,
    "city": "New York"
}
```
XML Data
XML is used to describe data in a hierarchical structure, making it useful for representing
complex data models.
Example:
```xml
<person>
  <name>John</name>
  <age>25</age>
  <city>New York</city>
</person>
```
Experiment – 1: Develop a Python Program for Reading and Writing CSV Files
Here’s a basic Python program to read and write CSV files using the csv module.
Program:
```python
import csv

# Writing to a CSV file
data = [['Name', 'Age', 'City'],
        ['John', '25', 'New York'],
        ['Jane', '30', 'Los Angeles']]

with open('people.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data)

# Reading from a CSV file
with open('people.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
```
CSV File:
```csv
Name,Age,City
John,25,New York
Jane,30,Los Angeles
```
Output:
Experiment – 2: Develop a Python Program for Reading XML Files
This Python program reads XML data using the xml.etree.ElementTree module.
Program:
```python
import xml.etree.ElementTree as ET

# Parsing an XML file
tree = ET.parse('data.xml')
root = tree.getroot()

# Iterating through the XML
for person in root.findall('person'):
    name = person.find('name').text
    age = person.find('age').text
    city = person.find('city').text
    print(f'Name: {name}, Age: {age}, City: {city}')
```
Xml File:
<?xml version="1.0"?>
<people>
<person> Xml Output:
<name>John</name>
<age>25</age>
<city>New York</city>
</person>
<person>
<name>Jane</name>
<age>30</age>
<city>Los Angeles</city>
</person>
</people>
Output:
Experiment – 3: Develop a Python Program for Reading and Writing JSON to a File
Here’s how to read and write JSON using Python’s json module.
Program:
```python
import json

# Writing multiple rows of JSON data to a file
data = [
    {"name": "John", "age": 25, "city": "New York"},
    {"name": "Jane", "age": 30, "city": "Los Angeles"},
    {"name": "Doe", "age": 40, "city": "Chicago"}
]

with open('data.json', 'w') as json_file:
    json.dump(data, json_file, indent=4)

# Reading multiple rows of JSON data from a file
with open('data.json', 'r') as json_file:
    data = json.load(json_file)
    for person in data:
        print(person)
```
Output:
Module-II: Working with Excel Files and PDFs
This module focuses on working with Excel files and PDFs using Python, key tasks for
automating and processing data efficiently. We’ll cover how to parse and manipulate these
files, how to install the required Python packages, and introduce some basic database
concepts that provide alternative data storage options. The hands-on experiments will give
you experience with real-world scenarios, such as converting files between formats and
parsing data.
1. Installing Python Packages
To work with Excel and PDF files in Python, you’ll need to install specific libraries.
Some common libraries include:
- pandas: For handling data in various formats (CSV, Excel, TSV, etc.).
- openpyxl: For reading and writing Excel files.
- pdfminer.six: For extracting text from PDFs.
Example: Installing the necessary packages
```bash
pip install pandas openpyxl pdfminer.six
```
This command installs all the required libraries. Once installed, you can start using them
in your Python scripts.
2. Parsing Excel Files
Excel files (.xlsx) are commonly used for storing tabular data, and Python offers several
packages for reading and writing Excel files.
2.1 Reading Excel Files
Library: pandas or openpyxl
- Task: You can read an Excel file and manipulate it as a DataFrame using the
pandas library.
2.2 Writing to an Excel File
- You can also write data to an Excel file using pandas.
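A minimal sketch, assuming the data is already in a pandas DataFrame (the column values and file name are illustrative):
```python
import pandas as pd

# Illustrative data; replace with your own dataset
df = pd.DataFrame({
    'Name': ['John', 'Jane'],
    'Age': [25, 30]
})

# Write the DataFrame to an Excel file (openpyxl is used for .xlsx files)
df.to_excel('people.xlsx', index=False)
```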
3. Parsing PDFs
PDF parsing can be more complex than Excel, as PDFs do not have a structured tabular
format. However, Python libraries like pdfminer.six allow for extracting text from PDFs,
which can then be further processed.
3.1 Extracting Text from PDFs
Library: pdfminer.six
Task: Extract raw text from a PDF document.
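A minimal sketch using pdfminer.six's high-level API (the file name is illustrative):
```python
from pdfminer.high_level import extract_text

# Extract all text from the PDF as a single string
text = extract_text('report.pdf')  # Illustrative file name

# Show the first 500 characters of the extracted text
print(text[:500])
```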
3.2 Converting PDF to Text and Processing
4. Converting Between File Formats
4.1 Converting a TSV File to Excel
A Tab-Separated Values (TSV) file is similar to a CSV file, but columns are
separated by tabs instead of commas. Converting TSV to Excel is straightforward
using pandas.
5. Databases: A Brief Introduction
Relational Databases:
Relational databases, like MySQL and PostgreSQL, store data in structured tables
with rows and columns. They are suitable when you need complex queries,
relationships between datasets, and strong consistency.
Non-Relational Databases (NoSQL):
Non-relational databases, such as MongoDB, store data in a flexible, document-oriented format (e.g., JSON). They are preferred when scalability and flexibility are needed, such as in large-scale web apps.
When to Use: Relational databases for structured data and complex queries; NoSQL databases for flexibility and scalability.
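As a small illustration, Python's built-in sqlite3 module provides a file-based relational database with no server setup (the table and file names below are illustrative):
```python
import sqlite3

# Create (or open) a local SQLite database file
conn = sqlite3.connect('people.db')
cur = conn.cursor()

# Create a table and insert a few rows
cur.execute('CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER, city TEXT)')
cur.executemany('INSERT INTO people VALUES (?, ?, ?)',
                [('John', 25, 'New York'), ('Jane', 30, 'Los Angeles')])
conn.commit()

# Query the table back
for row in cur.execute('SELECT name, age, city FROM people'):
    print(row)

conn.close()
```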
Experiments:
Experiment 4: Develop a Python Program for Reading an Excel File
Program:
```python
import pandas as pd

# Load the Excel file
df = pd.read_excel('sample_data.xlsx')

# Display the first 5 rows
print(df.head())
```
Excel file:
Name Age City
John Doe 28 New York
Jane Smith 34 Los Angeles
Emily Davis 22 Chicago
Output:
Experiment 5: Develop a Python Program for Converting a TSV File into Excel
Program:
```python
import pandas as pd

# Read the TSV file
df = pd.read_csv('data.tsv', sep='\t')

# Convert to Excel and save
df.to_excel('output_data.xlsx', index=False)
print('TSV file successfully converted to Excel!')
```
TSV File:
Name Age City
Alice 24 Seattle
Bob 30 Portland
Charlie 29 San Francisco
David 35 New York
Output:
Experiment 6: Develop a Python Program for Converting a PDF File into Excel
Program:
```python
from pdfminer.high_level import extract_text
import pandas as pd

# Step 1: Extract text from the PDF
text = extract_text('sample.pdf')

# Step 2: Replace (cid:9) with actual tab characters
text = text.replace('(cid:9)', '\t')

# Step 3: Split the text by lines
lines = text.strip().split('\n')

# Step 4: Inspect and parse the lines into structured data
data = []
for line in lines[1:]:
    columns = line.split('\t')
    print(f"Parsed Line: {columns}")
    data.append(columns)

# Step 5: Ensure all rows have 3 columns before proceeding
clean_data = [row for row in data if len(row) == 3]

# Step 6: Create a DataFrame with appropriate column names
df = pd.DataFrame(clean_data, columns=['Name', 'Age', 'City'])

# Step 7: Save the DataFrame as an Excel file (use a new file name to avoid permission issues)
output_excel_path = 'output_data.xlsx'
df.to_excel(output_excel_path, index=False)
print('PDF data has been successfully converted to Excel!')
```
PDF File:
Excel File:
Name Age City
John 28 New York
Jane 34 Los Angeles
Emily 22 Chicago
Output:
Module-III Data Cleanup
Why Clean Data?
Data cleanup ensures that the dataset is accurate, consistent, and usable for analysis. Dirty
data can cause incorrect models, misleading results, or failed applications. Data cleaning
involves the removal or rectification of missing values, duplicates, formatting errors, and
inconsistencies.
Data Cleanup Basics
Data cleanup involves tasks such as:
• Handling missing values: Removing or imputing empty cells.
• Correcting wrong formats: Ensuring consistency in date formats, string cases,
numerical types, etc.
• Removing outliers: Identifying and addressing extreme data points that can skew
analysis results.
Identifying Values for Data Cleanup
Before cleaning the data, it is essential to identify issues like:
• Missing or null values.
• Incorrect data types or formats.
• Outliers or erroneous data points.
• Duplicated data entries.
Formatting Data
Formatting data involves ensuring consistency in date formats, numerical types (floats or
integers), string casing (lowercase/uppercase), and handling categorical variables.
Finding Outliers and Bad Data
Outliers are extreme values that differ significantly from the majority of the dataset and can negatively affect analysis. Common methods to detect outliers include:
• Using Z-scores or the Interquartile Range (IQR).
• Visualizations like box plots.
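A minimal sketch of the Z-score approach on illustrative data (the cutoff of 2 standard deviations is a loose rule of thumb for this small sample; Experiment 10 below demonstrates the IQR method):
```python
import pandas as pd

# Illustrative data with one extreme value
values = pd.Series([10, 12, 11, 13, 12, 95, 11])

# Z-score: how many standard deviations each point lies from the mean
z_scores = (values - values.mean()) / values.std()

# Flag points more than 2 standard deviations from the mean
print(values[z_scores.abs() > 2])
```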
Finding Duplicates
Duplicates can bias your analysis, and removing them ensures the integrity of the dataset.
Python's pandas library provides methods to identify and remove duplicates.
Fuzzy Matching and RegEx Matching
• Fuzzy Matching: Useful for finding strings that are similar but not exact matches, often
applied when merging datasets.
• Regular Expressions (RegEx): Useful for pattern matching, such as identifying specific
string formats like email addresses or phone numbers.
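A minimal sketch using the standard library: difflib for a simple similarity score and re for pattern matching (the strings and the 0.8 cutoff are illustrative; dedicated packages such as thefuzz provide richer fuzzy matching):
```python
import difflib
import re

# Fuzzy matching: similarity ratio between two nearly identical names
ratio = difflib.SequenceMatcher(None, 'Jon Smith', 'John Smith').ratio()
print(f"Similarity: {ratio:.2f}")  # Could treat ratio > 0.8 as a match

# RegEx matching: check whether a string looks like an email address
pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
print(bool(re.match(pattern, 'john.doe@example.com')))  # True
print(bool(re.match(pattern, 'not-an-email')))          # False
```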
Normalizing and Standardizing Data
• Normalization: Scales data to a specific range, typically [0, 1].
• Standardization: Centers data around the mean and scales it by standard deviation. This
is useful in machine learning algorithms.
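A minimal sketch of standardization with scikit-learn (the data is illustrative; Experiment 9 below covers normalization with MinMaxScaler):
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative data
df = pd.DataFrame({'Feature1': [10, 20, 30, 40, 50]})

# Standardize: subtract the mean and divide by the standard deviation
scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_standardized)  # Resulting values have mean 0 and unit variance
```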
Saving the Data
Once data is cleaned, saving the cleaned dataset is essential for further analysis and
modeling without repeating the process.
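A minimal sketch, assuming the cleaned data is held in a DataFrame (the data and file names are illustrative):
```python
import pandas as pd

# Stand-in for a cleaned dataset
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Save the cleaned dataset so later steps can reload it without repeating the cleanup
df.to_csv('cleaned_data.csv', index=False)
df.to_excel('cleaned_data.xlsx', index=False)  # Requires openpyxl
```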
Scripting the Cleanup
Automating the cleanup process with Python scripts ensures the procedure is repeatable,
especially when new data is added to the dataset.
Experiment 7: Develop a Python Program for Cleaning Empty Cells and Correcting Wrong Formats
Program:
```python
import pandas as pd

# Sample DataFrame with missing values and wrong format
data = {
    'Name': ['Alice', 'Bob', None, 'David'],
    'Age': [25, None, 30, 'Twenty'],
    'Salary': [50000, 60000, None, 80000]
}
df = pd.DataFrame(data)

# Cleaning empty cells by filling with default values or dropping rows
df['Name'] = df['Name'].fillna('Unknown')              # No inplace=True, just assignment
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')  # Convert 'Age' to numeric, NaN for errors
df['Age'] = df['Age'].fillna(df['Age'].mean())         # Assign back after filling missing values
df = df.dropna(subset=['Salary'])                      # No inplace, assign back to df

print("Cleaned DataFrame:")
print(df)
```
Output:
Experiment 8: Develop a Python Program for Finding Duplicates in a DataFrame
Program:
```python
import pandas as pd

# Sample DataFrame with duplicates
data = {
    'Name': ['Alice', 'Bob', 'Alice', 'David', 'Alice'],
    'Age': [25, 30, 25, 40, 25]
}
df = pd.DataFrame(data)

# Finding duplicates
duplicates = df.duplicated()
print("Duplicated Rows:")
print(df[duplicates])

# Removing duplicates
df_no_duplicates = df.drop_duplicates()
print("\nDataFrame after removing duplicates:")
print(df_no_duplicates)
```
Output:
Experiment 9: Develop a Python Program for Normalizing Data
Program:
```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample DataFrame for normalization
data = {
    'Feature1': [10, 20, 30, 40, 50],
    'Feature2': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)

# Normalizing data using MinMaxScaler
scaler = MinMaxScaler()
df_normalized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print("Normalized DataFrame:")
print(df_normalized)
```
Output:
Module IV: Data Exploration and Analysis
1. Exploring Data
Exploring data involves a preliminary examination of the data to understand its characteristics.
This is where you look at basic statistics, identify data types, check for missing values, and get an
overall sense of the dataset. Key Steps include: overview of the dataset, summary statistics,
distribution checks, and identifying missing values.
2. Importing Data
Data import is the process of bringing external data into your Python environment for analysis.
Data can be imported from various file types like CSV, Excel, SQL databases, and web APIs. In
Python, methods to import data include:
• CSV Files: pandas.read_csv("filename.csv")
• Excel Files: pandas.read_excel("filename.xlsx")
• SQL Databases: Using sqlalchemy or sqlite3 for connecting and querying databases.
3. Exploring Table Functions
Table functions allow you to interact with and manipulate datasets for better understanding.
Functions include head() and tail() to inspect rows, info() for data types, and describe() for
summary statistics.
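A minimal sketch of these functions on an illustrative DataFrame:
```python
import pandas as pd

# Illustrative dataset
df = pd.DataFrame({
    'Name': ['John', 'Jane', 'Emily', 'Bob'],
    'Age': [28, 34, 22, 30],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Seattle']
})

print(df.head(2))     # First 2 rows
print(df.tail(2))     # Last 2 rows
df.info()             # Column names, data types, non-null counts
print(df.describe())  # Summary statistics for numeric columns
```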
4. Joining Numerous Datasets
Joining datasets is combining multiple datasets to create a unified dataset for analysis. This
includes Inner Join, Outer Join, Left Join, and Right Join. In Python, tools like merge() and
concat() from pandas are used.
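A minimal sketch of merge() and concat() on illustrative data:
```python
import pandas as pd

# Illustrative datasets sharing an 'ID' column
customers = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['John', 'Jane', 'Emily']})
orders = pd.DataFrame({'ID': [1, 2, 4], 'Amount': [250, 120, 90]})

# Inner join: keep only IDs present in both tables
print(pd.merge(customers, orders, on='ID', how='inner'))

# Left join: keep all customers, filling missing order data with NaN
print(pd.merge(customers, orders, on='ID', how='left'))

# Concatenate rows of two tables that share the same columns
more_customers = pd.DataFrame({'ID': [5], 'Name': ['David']})
print(pd.concat([customers, more_customers], ignore_index=True))
```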
5. Identifying Correlations
Correlation measures the statistical relationship between two variables. It can show whether
changes in one variable predict changes in another. A correlation matrix shows the relationship
between all variables. Visualizing correlations using a heatmap is common.
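A minimal sketch of a correlation matrix and heatmap on illustrative data:
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative numeric data
df = pd.DataFrame({
    'Hours_Studied': [1, 2, 3, 4, 5, 6],
    'Exam_Score': [52, 58, 63, 70, 74, 81],
    'Sleep_Hours': [8, 7, 7, 6, 6, 5]
})

# Correlation matrix (Pearson by default)
corr = df.corr()
print(corr)

# Heatmap of the correlation matrix
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
```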
6. Identifying Outliers
Outliers are data points that significantly differ from the rest of the dataset. Methods for
identifying outliers include the Standard Deviation method and Interquartile Range (IQR).
Visualization tools like box plots and histograms help in identifying outliers.
7. Creating Groupings
Grouping data involves categorizing data into different segments for analysis. The groupby()
function in pandas allows grouping data based on specific categories, and applying aggregation
methods like mean or sum.
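A minimal sketch of groupby() with aggregation on illustrative data:
```python
import pandas as pd

# Illustrative sales data
df = pd.DataFrame({
    'City': ['New York', 'New York', 'Chicago', 'Chicago', 'Seattle'],
    'Sales': [250, 300, 150, 200, 400]
})

# Group by city and aggregate
print(df.groupby('City')['Sales'].mean())  # Average sales per city
print(df.groupby('City')['Sales'].sum())   # Total sales per city
```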
8. Analyzing Data - Separating and Focusing the Data
Analyzing data involves separating relevant features for focused exploration. Methods include
filtering data using conditions or selecting specific columns for subsetting.
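A minimal sketch of filtering rows by condition and selecting a subset of columns (illustrative data):
```python
import pandas as pd

# Illustrative dataset
df = pd.DataFrame({
    'Name': ['John', 'Jane', 'Emily', 'Bob'],
    'Age': [28, 34, 22, 30],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York']
})

# Filter rows with a boolean condition
print(df[df['Age'] > 25])

# Select specific columns for focused analysis
print(df[['Name', 'City']])
```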
9. Presenting Data
After analysis, presenting data involves summarizing insights and using visuals like charts and
graphs to communicate findings clearly.
10. Visualizing the Data
Visualizations make data easier to interpret by presenting it in graphical formats. Common
visualizations include bar charts, histograms, pie charts, line charts, and scatter plots.
11. Time-Related Data
Time-related data involves handling datasets with temporal components, such as stock prices or
weather trends. Techniques like time series visualization and rolling averages are used for
analysis.
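A minimal sketch of a rolling average on illustrative monthly data (Experiment 12 below shows a full time series plot):
```python
import pandas as pd

# Illustrative monthly sales indexed by month-start dates
dates = pd.date_range(start='2024-01-01', periods=12, freq='MS')
sales = pd.Series([200, 210, 215, 220, 230, 250, 245, 260, 270, 275, 290, 300],
                  index=dates)

# A 3-month rolling average smooths short-term fluctuations
print(sales.rolling(window=3).mean())
```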
12. Maps, Interactives, Words, Images, Video, and Illustrations
Advanced visualizations include geographic maps, word clouds, and interactive charts. Python
libraries like folium, geopandas, plotly, and bokeh are used to create these visualizations.
13. Presentation Tools
Presentation tools like Tableau, Power BI, and Google Data Studio allow creating interactive
reports and dashboards for sharing results. Python libraries like matplotlib and seaborn are also
used for building visuals.
14. Publishing the Data - Open-Source Platforms
Publishing data involves sharing your analysis on platforms like GitHub or Kaggle. Tools like
Tableau Public can be used to share interactive dashboards.
Experiments
Experiment 10: Python Program for Detecting and Removing Outliers
Program:
```python
import pandas as pd
import numpy as np

# Sample data
data = {
    'Value': [10, 12, 14, 18, 90, 13, 15, 14, 300, 17, 13, 12, 16, 10]
}
df = pd.DataFrame(data)

# Detecting outliers using IQR
Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Detecting outliers
outliers = df[(df['Value'] < lower_bound) | (df['Value'] > upper_bound)]
print("Outliers detected:\n", outliers)

# Removing outliers
df_no_outliers = df[(df['Value'] >= lower_bound) & (df['Value'] <= upper_bound)]
print("Data after removing outliers:\n", df_no_outliers)
```
Output:
Experiment 11: Python Program for Drawing Bar Chart, Histogram, and Pie Chart
Program:
```python
import matplotlib.pyplot as plt

# Sample data for visualizations
categories = ['Category A', 'Category B', 'Category C', 'Category D']
values = [23, 45, 56, 78]

# Bar chart
plt.figure(figsize=(6, 4))
plt.bar(categories, values, color='blue')
plt.title('Bar Chart')
plt.show()

# Histogram
data = [10, 12, 13, 15, 18, 18, 19, 20, 23, 25, 29, 30, 31, 32, 35]
plt.figure(figsize=(6, 4))
plt.hist(data, bins=5, color='green')
plt.title('Histogram')
plt.show()

# Pie chart
plt.figure(figsize=(6, 4))
plt.pie(values, labels=categories, autopct='%1.1f%%', startangle=140,
        colors=['blue', 'orange', 'green', 'red'])
plt.title('Pie Chart')
plt.show()
```
Output:
Experiment 12: Python Program for Time Series Visualization
Program:
```python
import pandas as pd
import matplotlib.pyplot as plt

# Creating a sample time series DataFrame with 'MS' for month start
date_rng = pd.date_range(start='2024-01-01', end='2024-12-31', freq='MS')
data = {'Sales': [200, 210, 215, 220, 230, 250, 245, 260, 270, 275, 290, 300]}
df = pd.DataFrame(data, index=date_rng)

# Time series plot
plt.figure(figsize=(10, 6))
plt.plot(df.index, df['Sales'], marker='o', linestyle='-', color='blue')
plt.title('Monthly Sales Time Series')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.grid(True)
plt.show()
```
Output:
Module V: Web Scraping
Web scraping is the process of extracting data from websites. It is a critical skill in data analysis
and machine learning, especially when the required data isn't available in structured formats like
CSV or databases. This module covers various aspects of web scraping, including techniques and
tools to interact with web pages and extract meaningful information.
1. What to Scrape and How
What to Scrape: Identify the information you need from a website, such as product details, news
articles, or user reviews. Not all content on a webpage is relevant, so knowing what to scrape helps
to target the exact data needed.
How to Scrape: There are different methods like using requests for simple pages, or more advanced
tools like Selenium or Scrapy for pages that require interaction or load dynamically.
2. Analyzing a Web Page
This involves understanding the structure of a webpage by inspecting its HTML elements.
Tools like Chrome DevTools help in finding the tags (e.g., <div>, <p>, <span>) that contain the
required data. It is important to locate elements correctly before writing a scraper.
3. Network/Timeline
Network Analysis: Using browser dev tools, you can inspect network requests to understand how
data is loaded. This is especially useful for scraping dynamically loaded content, such as AJAX
calls.
Timeline: Helps to track when different elements load on a page, useful when dealing with
JavaScript-heavy websites.
4. Interacting with JavaScript
Some websites load content dynamically using JavaScript. Scrapers like Selenium can automate
interactions with JavaScript-based content, allowing you to scrape data that would otherwise be
invisible with static HTML scraping methods.
5. In-Depth Analysis of a Page
Understanding how different elements are nested and structured is crucial for writing effective
scraping scripts. This involves deep inspection of HTML tags, attributes, and JavaScript functions
that might load data dynamically.
6. Getting Pages
This involves sending HTTP requests to a URL to fetch the HTML content. Libraries like requests
in Python are commonly used for this. The response can then be parsed to extract the needed data.
7. Reading a Web Page with LXML and XPath
LXML: A powerful library for parsing XML and HTML documents. It is faster than BeautifulSoup
and allows more complex parsing.
XPath: A language used for navigating through elements and attributes in an XML/HTML
document. It allows precise selection of elements, making it effective for scraping.
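A minimal sketch of parsing HTML with lxml and selecting elements with XPath (the HTML snippet is illustrative; in practice the page content would come from an HTTP response):
```python
from lxml import html

# Illustrative HTML; in practice this would come from requests.get(url).content
page = """
<html>
  <body>
    <div class="item"><span class="name">Widget</span><span class="price">9.99</span></div>
    <div class="item"><span class="name">Gadget</span><span class="price">19.99</span></div>
  </body>
</html>
"""

tree = html.fromstring(page)

# XPath: select the text of the name and price spans inside each item div
names = tree.xpath('//div[@class="item"]/span[@class="name"]/text()')
prices = tree.xpath('//div[@class="item"]/span[@class="price"]/text()')

for name, price in zip(names, prices):
    print(name, price)
```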
8. Advanced Web Scraping - Browser-Based Parsing
Selenium: A tool that automates browsers, useful for scraping pages that require user interactions
like clicking buttons or filling forms. Selenium can simulate real user actions.
Ghost.py: A headless browser scraping tool, suitable for web scraping without opening a browser
window.
9. Screen Reading with Selenium
Screen Reading: Automating the extraction of visible elements using Selenium.
This is especially useful when you need to scrape data that requires scrolling or clicking to appear.
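A minimal sketch with Selenium (assumes Selenium 4 and a compatible browser driver are installed; the URL is a placeholder):
```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch a browser session (requires a matching browser and driver)
driver = webdriver.Chrome()
driver.get('https://example.com')  # Placeholder URL

# Read elements as they are rendered on screen, after any JavaScript has run
print('Title:', driver.title)
for p in driver.find_elements(By.TAG_NAME, 'p'):
    print(p.text)

driver.quit()
```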
10. Spidering the Web - Building a Spider with Scrapy
Scrapy: A powerful and fast web crawling framework in Python that automates the extraction and
storage of data from websites. It is ideal for large-scale scraping projects.
Building a Spider: A spider is a class in Scrapy that defines how to follow links and extract content
from the target website.
11. Crawling Whole Websites with Scrapy
Scrapy spiders can be set to follow links and scrape multiple pages on a website. This process is
called crawling. Scrapy manages the crawling efficiently and allows data storage in formats like
JSON or CSV.
Experiments:
Experiment 13: Develop a Python Program for Reading an HTML Page
Program:
```python
import requests
from bs4 import BeautifulSoup

# URL of the webpage to be scraped
url = 'https://example.com'  # Replace with the URL you want to scrape

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract and print the title of the page
    title = soup.title.string
    print("Title of the page:", title)

    # Extract all paragraphs and print their text content
    paragraphs = soup.find_all('p')
    print("\nParagraphs:")
    for i, paragraph in enumerate(paragraphs, start=1):
        print(f"Paragraph {i}: {paragraph.get_text()}")
else:
    print("Failed to retrieve the web page. Status code:", response.status_code)
```
Web Page:
Output:
Experiment 14: Develop a Python Program for Building a Spider Using Scrapy
Program:
```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']  # Replace with the target URL

    def parse(self, response):
        # Extracting the page title
        title = response.xpath('//title/text()').get()
        yield {'Title': title}

        # Extracting all paragraphs' text
        paragraphs = response.xpath('//p/text()').getall()
        for i, paragraph in enumerate(paragraphs, start=1):
            yield {f'Paragraph {i}': paragraph}
```
Web Page:
Output: