Python & MySQL for Data Analysis
One of the key libraries that enhances Python's data manipulation capabilities
is Pandas. This open-source library provides data structures and functions
designed to facilitate the manipulation and analysis of structured data. With
its powerful DataFrame object, Pandas allows users to easily read, filter, and
aggregate data, making it a staple for data analysts. The library also
integrates seamlessly with other Python libraries, enabling users to perform
complex data operations with minimal effort.
If you prefer to install Python separately, download it from the official Python
website and follow the installation prompts.
Once Anaconda or Python is installed, you can install Pandas and Matplotlib
using pip (Python’s package installer) or through Anaconda Navigator.
Using pip:
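pip install pandas matplotlib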
Using Anaconda:
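conda install pandas matplotlib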
To work with MySQL, you need to install the server and the client. Follow
these steps:
To connect Python with MySQL, you’ll need to install the MySQL Connector
library. You can do this using pip:
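pip install mysql-connector-python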
Once installed, you can establish a connection to MySQL using the following
code snippet:
import mysql.connector

connection = mysql.connector.connect(
    host='localhost',
    user='your_username',
    password='your_password',
    database='your_database'
)

if connection.is_connected():
    print("Successfully connected to the database")
This setup provides a solid foundation for data analysis using Python,
enabling you to manipulate data with Pandas, visualize it with Matplotlib, and
manage it through MySQL.
One of the most common tasks in data analysis is to read data from CSV files.
Pandas makes this easy with the read_csv() function. For example:
import pandas as pd
data = pd.read_csv('data.csv')
This command reads the data from a specified CSV file and stores it in a
DataFrame named data . The DataFrame is a 2-dimensional labeled data
structure, similar to a spreadsheet, which allows for easy manipulation and
analysis.
CREATING DATAFRAMES
In addition to reading data from files, you can create DataFrames directly
from dictionaries or lists. For instance:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
Pandas provides powerful data selection and filtering capabilities. You can
select specific columns or rows using indexing. For example, to select the
'Name' column:
names = df['Name']
Pandas also offers quick ways to check for and handle missing values:

# Count missing values in each column
missing = df.isnull().sum()

# Drop rows that contain missing values
cleaned_data = df.dropna()
average_age = df.groupby('City')['Age'].mean()
This command groups the data by the 'City' column and computes the mean
of the 'Age' column for each group.
SAMPLE DATA
CREATING A DATAFRAME
import pandas as pd
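# Sample employee data (reconstructed from the output shown below)
data = {
    'EmployeeID': [101, 102, 103, 104],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [28, 34, 29, 45],
    'Department': ['HR', 'IT', 'Finance', 'IT']
}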
# Creating DataFrame
df = pd.DataFrame(data)
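print("Initial DataFrame:")
print(df)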
OUTPUT EXAMPLE
When you run the above code, you'll get the initial DataFrame displayed as
follows:
Initial DataFrame:
EmployeeID Name Age Department
0 101 Alice 28 HR
1 102 Bob 34 IT
2 103 Charlie 29 Finance
3 104 David 45 IT
SELECTING COLUMNS
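The selection code for this step is not shown in the original; a minimal sketch on the employee DataFrame above would be:

# Select a single column, which returns a Series
names = df['Name']
print(names)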
OUTPUT EXAMPLE
FILTERING ROWS
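Likewise, rows can be filtered with a boolean condition. The exact condition used in the original is not shown; as a sketch:

# Keep only employees older than 30
older_employees = df[df['Age'] > 30]
print(older_employees)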
OUTPUT EXAMPLE
CONCLUSION
SAMPLE DATA
import pandas as pd

data = {
    'ProductID': [1, 2, 3, 4, 5],
    'ProductName': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Printer'],
    'Category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Office'],
    'Price': [1200, 25, 45, 300, 150],
    'Stock': [50, 200, 150, 100, 80]
}

df = pd.DataFrame(data)
FILTERING DATA
Next, we'll filter the DataFrame to find products that belong to the
Electronics category and have a price greater than $200. This filtering
allows us to focus on higher-end electronic items.
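A sketch of this filter (the variable name is an assumption):

filtered_products = df[(df['Category'] == 'Electronics') & (df['Price'] > 200)]
print("\nFiltered Products (Electronics with Price > 200):")
print(filtered_products)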
OUTPUT EXAMPLE
When you run the above code, the output will display the filtered DataFrame:
FURTHER FILTERING
further_filtered_products = df[
    ((df['Category'] == 'Electronics') | (df['Category'] == 'Accessories'))
    & (df['Stock'] > 100)
]
print("\nFurther Filtered Products (Electronics or Accessories with Stock > 100):")
print(further_filtered_products)
OUTPUT EXAMPLE
CONCLUSION
SAMPLE DATA
Let’s create a sample dataset that includes some missing values to illustrate
our techniques. Our dataset will consist of information about students,
including their Name , Age , and Score .
import pandas as pd
import numpy as np
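# Hypothetical student data with missing values (the original values are not shown)
data = {
    'Name': ['Tom', 'Nick', 'Krish', 'Jack', np.nan],
    'Age': [20, 21, np.nan, 18, 22],
    'Score': [85, np.nan, 78, np.nan, 90]
}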
df = pd.DataFrame(data)
print("Initial DataFrame with Missing Values:")
print(df)
missing_values = df.isnull().sum()
print("\nMissing Values Count:")
print(missing_values)
FILLING MISSING VALUES
1. Forward Fill: This method replaces missing values with the last valid
observation.
df_ffill = df.ffill()  # forward fill; fillna(method='ffill') is deprecated in newer pandas
print("\nDataFrame after Forward Fill:")
print(df_ffill)
2. Backward Fill: This technique fills missing values with the next valid
observation.
df_bfill = df.bfill()  # backward fill; fillna(method='bfill') is deprecated in newer pandas
print("\nDataFrame after Backward Fill:")
print(df_bfill)
3. Fill with a Specific Value: We can also fill missing values with a specific
constant, such as 0 or any other value relevant to our analysis.
df_fill_zero = df.fillna(0)
print("\nDataFrame after Filling with Zero:")
print(df_fill_zero)
mean_age = df['Age'].mean()
df_fill_mean = df.fillna({'Age': mean_age})
print("\nDataFrame after Filling Missing Age with Mean:")
print(df_fill_mean)
SAMPLE DATA
import pandas as pd
data = {
    'TransactionID': [1, 2, 3, 4, 5, 6],
    'Product': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana', 'Orange'],
    'Category': ['Fruit', 'Fruit', 'Fruit', 'Fruit', 'Fruit', 'Fruit'],
    'SalesAmount': [100, 150, 200, 120, 160, 210],
    'Quantity': [10, 15, 20, 12, 18, 25]
}
df = pd.DataFrame(data)
print("Initial Sales DataFrame:")
print(df)
GROUPING DATA
To analyze total sales and quantities sold for each product, we can group the
data by Product and then apply aggregation functions like sum() to
calculate the total SalesAmount and Quantity .
grouped_data = df.groupby('Product').agg({'SalesAmount': 'sum', 'Quantity': 'sum'}).reset_index()
print("\nGrouped Sales Data by Product:")
print(grouped_data)
OUTPUT EXAMPLE
The output will display the total sales and quantities for each product:
ADDITIONAL AGGREGATIONS
additional_grouped_data = df.groupby('Product').agg(
    {'SalesAmount': ['sum', 'mean'], 'Quantity': ['sum', 'mean']}
).reset_index()
print("\nGrouped Sales Data with Multiple Aggregations:")
print(additional_grouped_data)
OUTPUT EXAMPLE
The output will show both the total and average values:
After grouping and aggregating data, you may want to filter the results based
on specific criteria. For example, let’s say we only want to see products where
the total sales are greater than $250:
filtered_grouped_data = grouped_data[grouped_data['SalesAmount'] > 250]
print("\nFiltered Grouped Sales Data (SalesAmount > 250):")
print(filtered_grouped_data)
OUTPUT EXAMPLE
import pandas as pd
import matplotlib.pyplot as plt
SAMPLE DATA
data = {
    'Month': ['January', 'February', 'March', 'April', 'May', 'June'],
    'Sales_A': [200, 300, 250, 400, 350, 450],
    'Sales_B': [150, 250, 300, 200, 500, 600]
}
df = pd.DataFrame(data)
print("Sales Data:")
print(df)
LINE PLOT
Line plots are ideal for visualizing trends over time. We can create a line plot
to show the sales performance of two products (Sales_A and Sales_B) across
the months.
plt.figure(figsize=(10, 5))
plt.plot(df['Month'], df['Sales_A'], marker='o', label='Product A', color='blue')
plt.plot(df['Month'], df['Sales_B'], marker='o', label='Product B', color='orange')
plt.title('Monthly Sales for Products A and B')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.legend()
plt.grid()
plt.show()
BAR CHART
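The bar-chart code is not shown in the original; a sketch comparing both products per month would be:

plt.figure(figsize=(10, 5))
x = range(len(df['Month']))
plt.bar([i - 0.2 for i in x], df['Sales_A'], width=0.4, label='Product A', color='blue')
plt.bar([i + 0.2 for i in x], df['Sales_B'], width=0.4, label='Product B', color='orange')
plt.xticks(x, df['Month'])
plt.title('Monthly Sales Comparison for Products A and B')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.legend()
plt.show()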
HISTOGRAM
plt.figure(figsize=(8, 5))
plt.hist(df['Sales_A'], bins=5, color='blue', alpha=0.7, edgecolor='black')
plt.title('Distribution of Sales for Product A')
plt.xlabel('Sales Amount')
plt.ylabel('Frequency')
plt.grid()
plt.show()
CONCLUSION
CONCATENATION
import pandas as pd
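# Hypothetical quarterly sales DataFrames (the originals are not shown)
df_q1 = pd.DataFrame({'Month': ['Jan', 'Feb', 'Mar'], 'Sales': [200, 300, 250]})
df_q2 = pd.DataFrame({'Month': ['Apr', 'May', 'Jun'], 'Sales': [400, 350, 450]})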
# Concatenating DataFrames
df_combined = pd.concat([df_q1, df_q2], ignore_index=True)
print("Concatenated DataFrame:")
print(df_combined)
The output will show a single DataFrame containing the sales data from both
quarters, stacked vertically.
MERGING
# Product DataFrame
data_products = {
'ProductID': [1, 2, 3],
'Product': ['A', 'B', 'C']
}
df_products = pd.DataFrame(data_products)
# Sales DataFrame
data_sales = {
'ProductID': [1, 2, 1],
'Sales': [150, 200, 180]
}
df_sales = pd.DataFrame(data_sales)
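A sketch of the merge on the shared ProductID column (the result variable name is an assumption):

# Merging sales figures with product names on ProductID
df_merged = pd.merge(df_sales, df_products, on='ProductID')
print("Merged DataFrame:")
print(df_merged)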
This will generate a DataFrame that includes product names alongside their
corresponding sales figures, effectively integrating data from both sources.
KEY DIFFERENCES
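In short, pd.concat() stacks DataFrames along an axis (typically appending rows) without matching values, so it suits datasets that share the same structure. pd.merge() combines DataFrames based on the values in one or more key columns, much like a SQL join, and is the right tool for bringing related information from different tables together.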
Let's begin by creating a sample DataFrame that we will manipulate and then
export. For this example, we will create a simple dataset representing
employee information.
import pandas as pd
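# Hypothetical employee data (the original dataset is not shown)
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [28, 34, 29, 45],
    'Department': ['HR', 'IT', 'Finance', 'IT']
}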
df = pd.DataFrame(data)
print("Initial Employee DataFrame:")
print(df)
DATA MANIPULATION
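The manipulation step is not shown in the original. As a sketch, suppose we keep only employees older than 30 (the DataFrame name and threshold are assumptions):

filtered_df = df[df['Age'] > 30]
print("\nFiltered Employee DataFrame:")
print(filtered_df)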
EXPORTING TO CSV
Now that we have our manipulated DataFrame, we can export it to a CSV file
using the to_csv() method provided by Pandas. We will specify the name
of the file and ensure that the index is not included in the output file.
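A minimal sketch of the export, using the file name referenced later in this section and the DataFrame from the sketch above:

filtered_df.to_csv('filtered_employees.csv', index=False)
print("\nData exported to 'filtered_employees.csv'")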
To ensure that our data has been exported correctly, we can read the newly
created CSV file back into a DataFrame and display its contents.
# Reading the exported CSV file
import os

if os.path.exists('filtered_employees.csv'):
    exported_df = pd.read_csv('filtered_employees.csv')
    print("\nData read from 'filtered_employees.csv':")
    print(exported_df)
CONCLUSION
IMPORTING LIBRARIES
To get started, we need to import the required libraries. Ensure you have
Pandas installed in your Python environment.
import pandas as pd
import numpy as np
Let’s create a sample time series dataset representing daily sales data over a
month. The dataset will consist of dates and corresponding sales figures.
# Create a date range
date_rng = pd.date_range(start='2023-01-01', end='2023-01-31', freq='D')
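# Hypothetical daily sales figures (the original values are not shown)
np.random.seed(42)
sales_data = np.random.randint(100, 500, size=len(date_rng))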
# Create a DataFrame
df = pd.DataFrame(data={'Date': date_rng, 'Sales': sales_data})
df.set_index('Date', inplace=True)
DATE-TIME INDEXING
With our DataFrame set up, we can leverage the date-time index to perform
various time series operations. For instance, we can easily access sales data
for specific dates or periods.
specific_date = df.loc['2023-01-15']
print("\nSales on January 15, 2023:")
print(specific_date)
RESAMPLING DATA
One of the powerful features of time series data is the ability to resample it.
We can aggregate our daily sales data to weekly sales totals.
weekly_sales = df.resample('W').sum()
print("\nWeekly Sales Summary:")
print(weekly_sales)
ROLLING STATISTICS
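The plot below references a 'Rolling Mean' column. A common rolling statistic is the moving average, which can be computed with the rolling() method; the original code for this step is not shown, but a sketch would be:

# Compute a 7-day rolling mean of the daily sales
df['Rolling Mean'] = df['Sales'].rolling(window=7).mean()
print("\nDaily Sales with 7-Day Rolling Mean:")
print(df.head(10))

# Matplotlib is needed for the plot that follows
import matplotlib.pyplot as plt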
Finally, visualizing time series data can provide insightful trends and patterns.
Using Matplotlib, we can plot the original sales data along with the rolling
mean.
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['Sales'], label='Daily Sales', color='blue', marker='o')
plt.plot(df.index, df['Rolling Mean'], label='7-Day Rolling Mean', color='orange', linewidth=2)
plt.title('Daily Sales Data with 7-Day Rolling Mean')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.grid()
plt.show()
CONCLUSION
UTILIZING SUBPLOTS
Subplots allow for the creation of multiple plots within a single figure, which
is particularly useful for comparing different datasets or visualizing various
aspects of a single dataset side by side. The plt.subplots() function
provides an efficient way to generate a grid of plots.
Let's create a figure with multiple subplots to visualize sales data for two
different products across several months. We will use the same sales data
from previous examples but display it across different plot types.
import pandas as pd
import matplotlib.pyplot as plt
# Creating subplots
fig, axs = plt.subplots(2, 2, figsize=(12, 8))
fig.suptitle('Sales Data Visualization', fontsize=16)
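The plotting calls for the individual panels are not shown in the original; a sketch, assuming the monthly sales DataFrame (Month, Sales_A, Sales_B) from the earlier example is still in scope, could be:

# Line plots for each product
axs[0, 0].plot(df['Month'], df['Sales_A'], marker='o', color='blue')
axs[0, 0].set_title('Product A Sales (Line)')
axs[0, 1].plot(df['Month'], df['Sales_B'], marker='o', color='orange')
axs[0, 1].set_title('Product B Sales (Line)')

# Bar chart and histogram for comparison
axs[1, 0].bar(df['Month'], df['Sales_A'], color='blue')
axs[1, 0].set_title('Product A Sales (Bar)')
axs[1, 1].hist(df['Sales_B'], bins=5, color='orange', edgecolor='black')
axs[1, 1].set_title('Product B Sales Distribution (Histogram)')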
EXAMPLE OF STYLING
We will modify our previous plots with specific styles to improve their
presentation:
# Applying styles (Matplotlib 3.6+ renamed the seaborn styles, e.g. 'seaborn-v0_8-darkgrid')
plt.style.use('seaborn-v0_8-darkgrid')
# Adjusting layout
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()
CONCLUSION
import pandas as pd
import numpy as np
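# Hypothetical numeric dataset (the original data is not shown)
np.random.seed(0)
data = {
    'Sales': np.random.randint(100, 500, 20),
    'Advertising': np.random.randint(10, 100, 20),
    'Profit': np.random.randint(20, 200, 20)
}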
df = pd.DataFrame(data)
print("Sample DataFrame:")
print(df)
CALCULATING THE CORRELATION MATRIX
Next, we will calculate the correlation matrix using the corr() method from
Pandas. This matrix will reveal how strongly the variables are related to each
other.
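A minimal sketch of this step:

# Compute pairwise correlations between the numeric columns
corr_matrix = df.corr()
print("\nCorrelation Matrix:")
print(corr_matrix)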
Now, we will use Seaborn to create a heatmap from the correlation matrix.
The heatmap() function allows for a visually appealing representation of
the correlation values, making it easier to identify relationships.
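A sketch of the heatmap (Seaborn and Matplotlib are assumed to be installed; the styling options are illustrative):

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
plt.show()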
Once the heatmap is generated, each cell in the matrix will provide the
correlation coefficient between the variables. A value close to 1 indicates a
strong positive correlation, while a value close to -1 indicates a strong
negative correlation. Values around 0 suggest no correlation.
CONCLUSION
INTRODUCTION TO MYSQL
MySQL is an open-source relational database management system (RDBMS)
that utilizes Structured Query Language (SQL) for managing and
manipulating data. Developed by Oracle Corporation, MySQL is widely
recognized for its reliability, flexibility, and ease of use, making it a preferred
choice for both small and large-scale applications. MySQL supports a wide
variety of platforms and can handle large databases, which is essential for
applications that require efficient data storage and retrieval.
SQL STATEMENT
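The statement below creates a database for our analysis work; the name matches the output shown later in this section:

CREATE DATABASE data_analysis_db;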
EXPECTED OUTPUT
To confirm that the database has been created successfully, you can use the
MySQL command line or a graphical interface like MySQL Workbench. After
executing the above command, you can check for the existence of the new
database by running the following command:
SHOW DATABASES;
The expected output should include the newly created database along with
any other existing databases. It will look something like this:
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| performance_schema |
| data_analysis_db |
+--------------------+
This output confirms the successful creation of the data_analysis_db
database, which is now ready to store tables and data for your data analysis
tasks.
IMPORTANT CONSIDERATIONS
When creating a database, ensure that the name you choose follows the
naming conventions and does not conflict with any existing databases.
Additionally, you should have the necessary privileges to create databases in
your MySQL server. If you encounter any errors during the creation process,
check your permissions or syntax to resolve the issue.
In this example, we will create a table named employees , which will store
information about employees in an organization. The table will include the
following columns: EmployeeID , Name , Age , Department , and
Salary . Here’s the SQL statement to create this table:
USE data_analysis_db;
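CREATE TABLE employees (
    EmployeeID INT AUTO_INCREMENT PRIMARY KEY,
    Name VARCHAR(100),
    Age INT,
    Department VARCHAR(50),
    Salary DECIMAL(10, 2)
);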
EXPECTED OUTPUT
To confirm that the employees table has been created successfully, you can
run the following command to display the structure of the table:
DESCRIBE employees;
+------------+---------------+------+-----+---------+----------------+
| Field      | Type          | Null | Key | Default | Extra          |
+------------+---------------+------+-----+---------+----------------+
| EmployeeID | int(11)       | NO   | PRI | NULL    | auto_increment |
| Name       | varchar(100)  | YES  |     | NULL    |                |
| Age        | int(11)       | YES  |     | NULL    |                |
| Department | varchar(50)   | YES  |     | NULL    |                |
| Salary     | decimal(10,2) | YES  |     | NULL    |                |
+------------+---------------+------+-----+---------+----------------+
This output confirms the successful creation of the employees table with
the specified columns and their data types, making it ready for data insertion
and future queries.
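SQL STATEMENT
The INSERT statement itself is not reproduced above; a statement matching the records in the expected output would be:

INSERT INTO employees (EmployeeID, Name, Age, Department, Salary) VALUES
(101, 'Alice', 28, 'HR', 50000.00),
(102, 'Bob', 34, 'IT', 60000.00),
(103, 'Charlie', 29, 'Finance', 55000.00),
(104, 'David', 45, 'IT', 70000.00);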
EXPECTED OUTPUT
To verify that the data has been inserted successfully, you can use the
SELECT statement to query the employees table and view the records:
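SELECT * FROM employees;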
+-------------+---------+-----+------------+---------+
| EmployeeID | Name | Age | Department | Salary |
+-------------+---------+-----+------------+---------+
| 101 | Alice | 28 | HR | 50000.00|
| 102 | Bob | 34 | IT | 60000.00|
| 103 | Charlie | 29 | Finance | 55000.00|
| 104 | David | 45 | IT | 70000.00|
+-------------+---------+-----+------------+---------+
This output confirms that the records have been successfully inserted into the
employees table. Each row corresponds to an individual employee, with
their respective attributes accurately represented in the table.
IMPORTANT CONSIDERATIONS
When inserting data, ensure that you adhere to the constraints defined in the
table schema. For example, EmployeeID must be unique because it is the
primary key. Additionally, make sure that the data types of the values being
inserted match those defined for each column. If there are any violations of
these constraints, MySQL will return an error indicating the issue.
Inserting data correctly is crucial for maintaining the integrity and usability of
your database, as it allows for accurate data retrieval and analysis in the
future.
QUERY 4: SELECTING DATA
Selecting data from a table is a fundamental operation in SQL that allows you
to retrieve information stored within a database. The SELECT statement is
used to query the database and fetch specific data from one or more tables.
In this section, we will present a SQL query to select all data from the
employees table that we created earlier in the data_analysis_db
database, along with the expected output.
SQL STATEMENT
To select all data from the employees table, you can use the following SQL
query:
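SELECT * FROM employees;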
EXPECTED OUTPUT
When you execute the above SQL statement, it will return all records stored in
the employees table. The expected output should look like this:
+-------------+---------+-----+------------+---------+
| EmployeeID | Name | Age | Department | Salary |
+-------------+---------+-----+------------+---------+
| 101 | Alice | 28 | HR | 50000.00|
| 102 | Bob | 34 | IT | 60000.00|
| 103 | Charlie | 29 | Finance | 55000.00|
| 104 | David | 45 | IT | 70000.00|
+-------------+---------+-----+------------+---------+
This output displays all the rows and columns present in the employees
table, providing a comprehensive view of the stored employee data. Each row
corresponds to an employee, with their respective attributes such as
EmployeeID , Name , Age , Department , and Salary .
ADDITIONAL NOTES
Using the SELECT statement is a powerful way to retrieve data for analysis,
reporting, or application use. You can modify this query to include specific
columns by listing them instead of using the asterisk, and you can also apply
conditions using the WHERE clause to filter the results based on specific
criteria. For example, if you wanted to select only employees in the IT
department, you could use the following query:
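SELECT * FROM employees
WHERE Department = 'IT';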
SQL STATEMENT
For this example, let's filter the employees who belong to the 'IT' department
and have a salary greater than $60,000. The SQL query would look like this:
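SELECT * FROM employees
WHERE Department = 'IT' AND Salary > 60000;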
EXPECTED OUTPUT
When you execute the above SQL statement, the expected output should
return records that meet the specified conditions. Assuming the following
data in the employees table:
+-------------+---------+-----+------------+---------+
| EmployeeID | Name | Age | Department | Salary |
+-------------+---------+-----+------------+---------+
| 101 | Alice | 28 | HR | 50000.00|
| 102 | Bob | 34 | IT | 60000.00|
| 103 | Charlie | 29 | Finance | 55000.00|
| 104 | David | 45 | IT | 70000.00|
+-------------+---------+-----+------------+---------+
+-------------+------+-----+------------+---------+
| EmployeeID | Name | Age | Department | Salary |
+-------------+------+-----+------------+---------+
| 104 | David| 45 | IT | 70000.00|
+-------------+------+-----+------------+---------+
This output indicates that only one employee, David, meets the criteria of
being in the 'IT' department with a salary higher than $60,000.
IMPORTANT CONSIDERATIONS
When filtering data, you can use various operators, such as = , > , < , >= ,
<= , and <> (not equal) to define your conditions. Additionally, you can
combine multiple conditions using logical operators like AND , OR , and
NOT to create more complex filters. Properly filtering your data is crucial for
making informed decisions based on specific subsets of your dataset,
allowing you to focus on relevant records.
UPDATE employees
SET Salary = Salary * 1.10
WHERE Department = 'IT';
Expected Output:
After executing this command, the salaries of employees in the 'IT'
department will be updated. If Bob had a salary of $60,000, it would now be
$66,000.
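To remove a record, use the DELETE statement. The original query is not shown; one way to delete Alice's record (matching on the primary key would work equally well) is:

DELETE FROM employees
WHERE Name = 'Alice';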
Expected Output:
After executing this command, Alice’s record will be removed from the
employees table, and a subsequent SELECT * FROM employees; will
show only Bob, Charlie, and David.
QUERY 8: JOINING TABLES
Joining tables allows you to combine rows from two or more tables based on
a related column. For example, if we have another table named
departments that contains department details, we can join it with the
employees table:
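A sketch of such a join; the structure of the departments table is not shown, so here it is assumed to contain a DepartmentName column matching the Department values in employees:

SELECT e.Name, e.Salary, d.DepartmentName
FROM employees e
JOIN departments d ON e.Department = d.DepartmentName;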
Expected Output:
This query will return a list of employee names, their salaries, and the names
of their departments.
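To count how many employees work in each department, group by the Department column (the query is reconstructed from the output shown below):

SELECT Department, COUNT(*) AS NumberOfEmployees
FROM employees
GROUP BY Department;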
Expected Output:
This query will return the number of employees in each department, like so:
+------------+--------------------+
| Department | NumberOfEmployees |
+------------+--------------------+
| HR | 1 |
| IT | 2 |
| Finance | 1 |
+------------+--------------------+
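To compute the average salary per department, use AVG() with GROUP BY (the original query is not shown; the alias name is an assumption):

SELECT Department, AVG(Salary) AS AverageSalary
FROM employees
GROUP BY Department;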
Expected Output:
This will return the average salary for employees in each department.
Expected Output:
This will return employees who meet both criteria, providing a refined list of
eligible employees.
To find employees whose names start with the letter 'D', you can utilize the
LIKE operator:
SELECT * FROM employees
WHERE Name LIKE 'D%';
Expected Output:
This will return David's record, as he is the only employee whose name starts
with 'D'.
To filter employees who work in either 'HR' or 'Finance', you can use the IN
operator:
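SELECT * FROM employees
WHERE Department IN ('HR', 'Finance');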
Expected Output:
This will return records for Alice and Charlie.
To sort the employees by their salary in descending order, you can use the
ORDER BY clause:
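SELECT * FROM employees
ORDER BY Salary DESC;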
Expected Output:
This will return the list of employees sorted by salary from highest to lowest.
The HAVING clause is used to filter results after aggregation. For example, to
find departments with more than one employee, you would write:
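SELECT Department, COUNT(*) AS NumberOfEmployees
FROM employees
GROUP BY Department
HAVING COUNT(*) > 1;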
Subqueries allow you to nest queries. For example, to find employees whose
salary is above the average salary of the entire table:
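SELECT * FROM employees
WHERE Salary > (SELECT AVG(Salary) FROM employees);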
Expected Output:
This will return records of employees earning more than the average salary.
To combine results from two different queries, you can use the UNION
operator. For example, to select employees from the 'IT' department and
employees with a salary greater than $60,000:
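SELECT Name FROM employees WHERE Department = 'IT'
UNION
SELECT Name FROM employees WHERE Salary > 60000;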
Expected Output:
This will return a unique list of names from both queries.
To list employee names in each department as a single string, you can use
GROUP_CONCAT :
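SELECT Department, GROUP_CONCAT(Name) AS Employees
FROM employees
GROUP BY Department;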
Expected Output:
This will return departments with a concatenated list of employee names.
To remove a table from the database entirely, you can use the DROP TABLE
statement. For example, to delete the employees table:
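DROP TABLE employees;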
Expected Output:
After executing this command, the employees table will be permanently
removed from the database.
CONCLUSION
Learning Python for data analysis and SQL for database management equips
students with essential skills that are increasingly vital in today's data-driven
landscape. Python, with its versatile libraries like Pandas and Matplotlib,
allows for efficient data manipulation, analysis, and visualization. These
capabilities enable students to transform raw data into actionable insights,
fostering a deeper understanding of complex datasets. As they become
proficient in Python, students gain the ability to automate tasks, perform
statistical analyses, and create compelling visual narratives that can influence
decision-making processes across various domains.
On the other hand, SQL provides the foundational knowledge necessary for
managing and querying relational databases. Understanding how to
effectively use SQL empowers students to interact with large datasets,
ensuring data integrity while performing operations such as data retrieval,
insertion, updates, and deletions. This skill is particularly beneficial for
students aspiring to work in roles that require database management, such
as data analysts, data scientists, and software developers. Furthermore, the
ability to extract relevant information from databases using SQL enhances the
overall data analysis process, bridging the gap between data storage and
actionable insights.
Together, proficiency in Python and SQL not only prepares students for a
variety of career opportunities but also cultivates critical thinking and
problem-solving skills. As they navigate the complexities of data analysis and
database management, students develop a toolkit that is essential for
contributing to data-driven decision-making in organizations. This
combination of skills positions them as valuable assets in an increasingly
competitive job market, where the demand for data literacy continues to
grow.