L6 and 7-Data Preprocessing-coding
Data Preprocessing
Problems in data
Missing values
Mixed data (e.g. in the 1st column, car_name is combined with the company name; in the 2nd column, Car_price holds the amount together with the unit "Lakh"; in the last column, the date is in an unstructured form).
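The problems listed above can be sketched on a tiny table. The sample rows, the new column names (company, model, price, date_parsed), and the rule that the first word of car_name is the company are assumptions made for illustration; they are not taken from the actual dataset.

```python
import pandas as pd

# Hypothetical rows that mimic the problems described above.
raw = pd.DataFrame({
    "car_name": ["Maruti Swift", "Hyundai i20", "Honda City"],
    "Car_price": ["5.5 Lakh", "7 Lakh", None],               # text unit plus a missing value
    "date": ["2021-01-15", "15 March 2020", "2019/07/04"],   # inconsistent date formats
})

# Mixed data: assume the first word is the company and the rest is the model.
parts = raw["car_name"].str.split(" ", n=1, expand=True)
raw["company"] = parts[0]
raw["model"] = parts[1]

# Price: drop the "Lakh" unit and convert to a number (1 Lakh = 100,000).
amount = raw["Car_price"].str.replace("Lakh", "", regex=False).str.strip()
raw["price"] = pd.to_numeric(amount, errors="coerce") * 100_000

# Missing values: count them before deciding how to handle them.
missing_prices = raw["price"].isna().sum()

# Dates: parse each value on its own so that different formats are accepted.
raw["date_parsed"] = raw["date"].apply(pd.to_datetime)

print(raw[["company", "model", "price", "date_parsed"]])
print("Missing prices:", missing_prices)
```

Whether a missing price should be dropped or filled in depends on the analysis; the sketch only counts it.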
Using Python- Advantages
• Simple syntax that makes code easy to understand and reasonably
fast to prototype
• Libraries designed for specific data science tasks
• A robust and varied ecosystem of libraries
• Integrates well with the majority of cloud platform service
providers
• Tight-knit integration with big data frameworks such as
Hadoop, Spark, etc.
• Supports both object-oriented and functional programming
paradigms
• Supports reading data from local files, databases, and cloud storage
Data Science using Python
• Python libraries provide key feature sets that are
essential for data science
• This requires working knowledge of:
– Python and the following basic but powerful modules or
libraries for data analysis and visualization:
• Pandas (for data manipulation and cleaning)
• Matplotlib (for general-purpose plotting)
• Seaborn (builds on Matplotlib for advanced statistical
visualizations)
• NumPy (for numerical computing with arrays)
• Example: use Pandas to read data from CSV files for cleaning/analysis.
The .csv file extension stands for "comma-separated values", and it is one of
the most common output formats of spreadsheet programs.
https://flatfile.com/demo/
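A minimal sketch of reading CSV data with Pandas follows. The CSV content is made up for illustration and is read from an in-memory string so the example is self-contained; with a real file, a path such as "data.csv" (a hypothetical name) would be passed to pd.read_csv instead.

```python
import io
import pandas as pd

# Hypothetical CSV content; the empty field in Bob's row is a missing value.
csv_text = """name,age,salary
Alice,25,50000
Bob,30,
Charlie,35,70000
"""

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())        # first rows of the table
print(df.dtypes)        # column types inferred by Pandas
print(df.isna().sum())  # number of missing values per column
```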
Example: Series (1D) and DataFrame (2D)
• Series (1D)
import pandas as pd
data = [100, 200, 300, 400]
series = pd.Series(data, index=['A', 'B', 'C', 'D'])
print(series)
Output
A    100
B    200
C    300
D    400
dtype: int64
• DataFrame (2D)
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000]
}
df = pd.DataFrame(data)
print(df)
Output
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
Matplotlib
• A plotting module used for creating static, animated, and
interactive visualizations
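As a small illustration of a static plot, the sketch below draws a bar chart of the names and salaries used in the DataFrame example. The non-interactive "Agg" backend and the in-memory PNG buffer are choices made here so the script runs without a display; they are not requirements of Matplotlib.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

names = ["Alice", "Bob", "Charlie"]
salaries = [50000, 60000, 70000]

fig, ax = plt.subplots()
bars = ax.bar(names, salaries)   # one bar per employee
ax.set_xlabel("Name")
ax.set_ylabel("Salary")
ax.set_title("Salary by employee")

# Render the figure as PNG into memory; a file name could be passed instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```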
• Renaming Columns
df.rename(columns={"OldColumn": "NewColumn"}, inplace=True)
• print(filtered_df)
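The definition of filtered_df is not shown in these notes. The sketch below shows one way renaming and row filtering might fit together; the sample data and the filter condition (Age greater than 28) are assumptions for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "OldColumn": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
})

# rename returns a new DataFrame unless inplace=True is passed.
df = df.rename(columns={"OldColumn": "NewColumn"})

# Hypothetical filter: a boolean mask keeps only rows where Age is above 28.
filtered_df = df[df["Age"] > 28]
print(filtered_df)
```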
Transforming Data
• Transforming Data
df["Salary"] = df["Salary"].apply(lambda x: x * 1.1)
# Increase salary by 10%
• Replacing Values
df["Department"] = df["Department"].replace({"HR": "Human Resources", "IT": "Tech"})
# Replacing 'Islamabad' with 'Rawalpindi'
df["City"] = df["City"].replace("Islamabad", "Rawalpindi")
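The transformations above can be tried end to end on a small table. The sample rows are assumed for illustration, including one city value with stray spaces to show why exact matching matters when replacing values.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Salary": [50000, 60000, 70000],
    "Department": ["HR", "IT", "Finance"],
    "City": [" Islamabad ", "Lahore", "Islamabad"],
})

# Vectorised equivalent of apply(lambda x: x * 1.1): increase salary by 10%.
df["Salary"] = df["Salary"] * 1.1

# A dict maps each old value to its replacement; unlisted values stay unchanged.
df["Department"] = df["Department"].replace({"HR": "Human Resources", "IT": "Tech"})

# replace matches whole values exactly, so surrounding spaces are stripped first.
df["City"] = df["City"].str.strip().replace("Islamabad", "Rawalpindi")

print(df)
```

Without the str.strip() step, the value " Islamabad " would not match "Islamabad" and would be left as it is.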
A SQL JOIN is used to combine rows from two or more tables based on
a related column between them.
Example LEFT JOIN
Returns all records from the left table (Employees, df1), and matching records from
the right table (Departments, df2).
If no match is found, NULL (NaN in pandas) is returned for the right table's columns.
left_merge = pd.merge(df1, df2, on='DepartmentID', how='left')
Note that David is still included, with NULL in the department columns, because there's no matching DepartmentID = 4 in the Departments table. With an inner join (how='inner'), David would be missing.
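RIGHT JOIN
Returns all records from the right table (Departments), and matching records from the left
table (Employees).
right_merge = pd.merge(df1, df2, on='DepartmentID', how='right')
Note that "Sales" appears with NULL in the Employees columns because no employee belongs to it, while David is dropped because DepartmentID = 4 has no match in Departments. With a full outer join (how='outer'), David would be included as well.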
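The join types can be compared side by side on small tables. The rows of df1 (Employees) and df2 (Departments) below are assumed for illustration, chosen so that David has DepartmentID = 4 with no matching department and "Sales" has no employees; the actual tables are not reproduced in these notes.

```python
import pandas as pd

# Hypothetical Employees table (left).
df1 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "DepartmentID": [1, 2, 1, 4],
})
# Hypothetical Departments table (right).
df2 = pd.DataFrame({
    "DepartmentID": [1, 2, 3],
    "DepartmentName": ["HR", "IT", "Sales"],
})

# how= selects which table's keys are kept in the result.
left_merge = pd.merge(df1, df2, on="DepartmentID", how="left")    # all employees
right_merge = pd.merge(df1, df2, on="DepartmentID", how="right")  # all departments
inner_merge = pd.merge(df1, df2, on="DepartmentID", how="inner")  # matches only
outer_merge = pd.merge(df1, df2, on="DepartmentID", how="outer")  # everything

for label, result in [("left", left_merge), ("right", right_merge),
                      ("inner", inner_merge), ("outer", outer_merge)]:
    print(label, "join:", len(result), "rows")
    print(result)
```

In pandas, the SQL NULL shows up as NaN in the unmatched columns.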
Combining Datasets (Merging, Joining, and Concatenation)
"Orders" Table
OrderID  CustomerID  OrderDate
10308    2           1996-09-18
10309    37          1996-09-19
10310    77          1996-09-20
"Customers" Table
CustomerID  CustomerName                        ContactName     Country
1           Alfreds Futterkiste                 Maria Anders    Germany
2           Ana Trujillo Emparedados y helados  Ana Trujillo    Mexico
3           Antonio Moreno Taquería             Antonio Moreno  Mexico
Notice that the "CustomerID" column in the "Orders" table refers to the "CustomerID" in the
"Customers" table. The relationship between the two tables above is the "CustomerID" column.
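The relationship through "CustomerID" can be sketched in Pandas as follows, using the rows of the two tables above. The extra order passed to pd.concat is a hypothetical row added only to illustrate concatenation.

```python
import pandas as pd

orders = pd.DataFrame({
    "OrderID": [10308, 10309, 10310],
    "CustomerID": [2, 37, 77],
    "OrderDate": pd.to_datetime(["1996-09-18", "1996-09-19", "1996-09-20"]),
})
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "CustomerName": ["Alfreds Futterkiste",
                     "Ana Trujillo Emparedados y helados",
                     "Antonio Moreno Taquería"],
    "ContactName": ["Maria Anders", "Ana Trujillo", "Antonio Moreno"],
    "Country": ["Germany", "Mexico", "Mexico"],
})

# Merging matches rows on the shared key; only CustomerID 2 is in both tables.
merged = pd.merge(orders, customers, on="CustomerID", how="inner")
print(merged[["OrderID", "CustomerName", "OrderDate"]])

# Concatenation stacks tables with the same columns instead of matching keys.
more_orders = pd.DataFrame({          # hypothetical extra order
    "OrderID": [10311],
    "CustomerID": [1],
    "OrderDate": pd.to_datetime(["1996-09-20"]),
})
all_orders = pd.concat([orders, more_orders], ignore_index=True)
print(all_orders)
```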
3. Filter the merged dataset to show only employees who earn a salary
greater than some specific value X.
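One possible approach to this step is sketched below. The earlier steps of the exercise are not shown in these notes, so the employees and departments tables, their column names, and the threshold value are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical tables standing in for the datasets merged in the earlier steps.
employees = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Salary": [50000, 60000, 70000],
    "DepartmentID": [1, 2, 1],
})
departments = pd.DataFrame({
    "DepartmentID": [1, 2],
    "DepartmentName": ["HR", "IT"],
})

merged = pd.merge(employees, departments, on="DepartmentID", how="left")

X = 55000  # assumed threshold; any specific value can be used

# The boolean mask keeps only rows whose Salary is strictly greater than X.
high_earners = merged[merged["Salary"] > X]
print(high_earners)
```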