Dataquest.io-tutorial Inserting Records and DataFrames Into a SQL Database
dataquest.io/blog/sql-insert-tutorial
One of the key roles of a data scientist is to extract patterns and insights from raw data.
Since much of the world’s government and corporate data is organized in relational
databases, it makes sense that data scientists need to know how to work with these
database structures. Writing SQL queries to insert, extract, and filter data in databases
is a key skill for anyone interested in data analytics or data science.
Although it has been around for decades, learning SQL is still a critical skill for modern
data scientists because SQL is commonly used in all kinds of relational database
software, including MySQL, SQL Server, Oracle, and PostgreSQL.
In this tutorial, we’ll learn about SQL insertion operations in detail. Here is the list of
topics that we will learn in this tutorial:
SQL Insertion
Inserting records into a database
Inserting Pandas DataFrames into a database using the insert command
Inserting Pandas DataFrames into a database using the to_sql() command
Reading records from a database
Updating records in a database
Want to reach a higher level of SQL skill? Sign up for free and check out Dataquest’s
SQL courses for thorough, interactive lessons on all the SQL skills you’ll need for data
science.
SQL Insertion
SQL Insertion is an essential operation for data workers to understand. Inserting
missing data or adding new data is a major part of the data cleaning process on most
data science projects.
Insertion is also how most data gets into databases in the first place, so it’s important
anytime you’re collecting data, too. When your company gets new data on a customer,
for example, chances are that a SQL insert will be how that data gets into your existing
customer database.
In fact, whether or not you’re aware of it, data is flowing into databases using SQL
inserts all the time! When you fill out a marketing survey, complete a transaction, file a
government form online, or do any of thousands of other things, your data is likely
being inserted into a database somewhere using SQL.
Let’s dive into how we can actually use SQL to insert data into a database. We can insert
data row by row, or add multiple rows at a time.
In SQL, we use the INSERT command to add records (rows) to a table. This command
does not modify the structure of the table we’re inserting into; it just adds
data.
Let’s imagine we have a data table like the one below, which is being used to store some
information about a company’s employees.
EmployeeID  Ename  DeptID  Salary  Dname    Dlocation
1001        John   2       4000    IT       New Delhi
1002        Anna   1       3500    HR       Mumbai
1003        James  1       2500    HR       Mumbai
1004        David  2       5000    IT       New Delhi
1005        Mark   2       3000    IT       New Delhi
1006        Steve  3       4500    Finance  Mumbai
1007        Alice  3       3500    Finance  Mumbai
Now, let’s imagine we have new employees we need to put into the system.
This employee table could be created using the CREATE TABLE command, so we
could use that command to create an entirely new table. But it would be very inefficient
to create a completely new table every time we want to add data! Instead, let’s use the
INSERT command to add the new data into our existing table.
We start with the command INSERT INTO followed by the name of the table into which
we’d like to insert data. After the table name, we list the columns we’re inserting
into, inside parentheses. Then, on the next line, we use the command VALUES along
with the values we want to insert (in the same sequence, inside parentheses).
So for our employee table, if we were adding a new employee named Kabir, our
INSERT command might look like this:
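As a minimal sketch, the statement for Kabir could look like the one below. The column names and values come from the employee table in this tutorial. Python’s built-in sqlite3 module with an in-memory database is used only as a stand-in so the example runs without a MySQL server, and the column types in CREATE TABLE are assumptions.

```python
import sqlite3

# Stand-in database (assumption): in-memory SQLite instead of MySQL.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Column types are assumed for illustration.
cursor.execute("""
    CREATE TABLE employee (
        EmployeeID INTEGER, Ename TEXT, DeptID INTEGER,
        Salary INTEGER, Dname TEXT, Dlocation TEXT
    )
""")

# The INSERT statement itself: table name, column list, then VALUES.
insert_sql = """
    INSERT INTO employee (EmployeeID, Ename, DeptID, Salary, Dname, Dlocation)
    VALUES (1008, 'Kabir', 2, 5000, 'IT', 'New Delhi')
"""
cursor.execute(insert_sql)
connection.commit()

# Read the table back to confirm the row was added.
rows = cursor.execute("SELECT * FROM employee").fetchall()
print(rows)
```

The INSERT INTO and VALUES parts of the statement are standard SQL, so the same statement text should work in MySQL as well.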
Since we’re often working with our data in Python when doing data science, let’s insert
data from Python into a MySQL database. This is a common task that has a variety of
applications in data science.
Create a connection using pymysql’s connect() function with the parameters host,
user, database name, and password.
(The parameters below are for demonstration purposes only; you’ll need to fill in the
specific access details required for the MySQL database you’re accessing.)
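A connection set up along these lines might look like the sketch below. The credentials are the demonstration values used elsewhere in this tutorial; the names CONNECTION_PARAMS and open_connection are hypothetical helpers introduced here, and pymysql is imported inside the function so the snippet can be loaded even where the package is not installed.

```python
# Demonstration values only; replace them with your own access details.
CONNECTION_PARAMS = {
    "host": "localhost",
    "user": "root",
    "password": "12345",
    "database": "employee",
}


def open_connection(params=CONNECTION_PARAMS):
    """Open a MySQL connection with pymysql.connect() (hypothetical helper)."""
    import pymysql  # third-party package: pip install pymysql
    return pymysql.connect(**params)


# Needs the pymysql package and a reachable MySQL server:
# connection = open_connection()
```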
Next, we create a cursor object. This will allow us to execute the SQL query once we’ve written it.
cursor = connection.cursor()
Next, we execute the query and commit the changes using the commit() function.
Note that we create a variable called sql, assign our query’s syntax to it, and then
pass sql and the specific data we want to insert as arguments to cursor.execute().
# Create a new record
sql = ("INSERT INTO `employee` (`EmployeeID`, `Ename`, `DeptID`, `Salary`, `Dname`, "
       "`Dlocation`) VALUES (%s, %s, %s, %s, %s, %s)")
# Execute the query, passing one value per %s placeholder (example values for the new employee)
cursor.execute(sql, (1008, 'Kabir', 2, 5000, 'IT', 'New Delhi'))
# The connection is not autocommitted by default, so we must commit to save our changes.
connection.commit()
Let’s do a quick check to see if the record we wanted to insert has actually been inserted.
We can do this by querying the database for the entire contents of employee, and then
fetching and printing those results.
# Query the entire contents of the employee table
cursor.execute("SELECT * FROM `employee`")
# Fetch all the records and use a for loop to print them one line at a time
result = cursor.fetchall()
for i in result:
    print(i)
It worked! Above, we can see the new record has been inserted and is now the final row
in our MySQL database.
Now that we’re done, we should close the database connection using the close() method.
Of course, it would be better to write this code in a way that could better handle
exceptions and errors. We can do this using try to contain the body of our code and
except to print errors if any arise. Then, we can use finally to close the connection
once we’re finished, regardless of whether try succeeded or failed.
import pymysql

connection = None
try:
    # Connect to the database
    connection = pymysql.connect(host='localhost',
                                 user='root',
                                 password='12345',
                                 db='employee')
    cursor = connection.cursor()
    # Execute query
    sql = "SELECT * FROM `employee`"
    cursor.execute(sql)
    # Fetch all the records
    result = cursor.fetchall()
    for i in result:
        print(i)
except pymysql.Error as e:
    # Print any database error raised by pymysql
    print(e)
finally:
    # Close the database connection, if one was opened, using the close() method.
    if connection is not None:
        connection.close()
((1001, 'John', 2, 4000, 'IT', 'New Delhi'), (1002, 'Anna', 1, 3500, 'HR', 'Mumbai'), (1003,
'James', 1, 2500, 'HR', 'Mumbai'), (1004, 'David', 2, 5000, 'IT', 'New Delhi'), (1005, 'Mark', 2,
3000, 'IT', 'New Delhi'), (1006, 'Steve', 3, 4500, 'Finance', 'Mumbai'), (1007, 'Alice', 3, 3500,
'Finance', 'Mumbai'), (1008, 'Kabir', 2, 5000, 'IT', 'New Delhi'), (1009, 'Morgan', 1, 4000, 'HR',
'Mumbai'), (1009, 'Morgan', 1, 4000, 'HR', 'Mumbai'))
Inserting Pandas DataFrames Into Databases Using INSERT
When working with data in Python, we’re often using pandas, and we’ve often got our
data stored as a pandas DataFrame. Thankfully, we don’t need to do any conversions if
we want to use SQL with our DataFrames; we can directly insert a pandas DataFrame
into a MySQL database using INSERT.
We could also import data from a CSV or create a DataFrame in any number of other
ways, but for the purposes of this example, we’re just going to create a small DataFrame
that saves the titles and prices of some data science textbooks.
# Import pandas
import pandas as pd
# Create dataframe
data = pd.DataFrame({
    'book_id': [12345, 12346, 12347],
    'title': ['Python Programming', 'Learn MySQL', 'Data Science Cookbook'],
    'price': [29, 23, 27]
})
data
   book_id                  title  price
0    12345     Python Programming     29
1    12346            Learn MySQL     23
2    12347  Data Science Cookbook     27
Before inserting data into MySQL, we’re going to create a table in MySQL to
hold our data. If such a table already existed, we could skip this step.
We’ll use a CREATE TABLE statement to create our table, follow that with our table
name (in this case, book_details), and then list each column and its corresponding
datatype.
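A CREATE TABLE statement of this shape might look like the sketch below. The column names match the DataFrame above, but the data types are assumptions, and an in-memory SQLite database (Python’s built-in sqlite3 module) stands in for MySQL so the example can run on its own.

```python
import sqlite3

# Stand-in for MySQL (assumption): an in-memory SQLite database.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Table name, then each column with its data type (types are assumed).
create_sql = """
    CREATE TABLE book_details (
        book_id INT,
        title   VARCHAR(50),
        price   INT
    )
"""
cursor.execute(create_sql)

# List the new table's columns to confirm it exists (SQLite-specific check).
columns = [row[1] for row in cursor.execute("PRAGMA table_info(book_details)")]
print(columns)
connection.close()
```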
Step 3: Create a connection to the database
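Once we’ve created that table, we can once again create a connection to the database
from Python using pymysql.
import pymysql
# create connection (placeholder credentials; the database name is assumed, so use your own)
connection = pymysql.connect(host='localhost', user='root',
                             password='12345', db='employee')
# create cursor
cursor=connection.cursor()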
Next, we’ll create a column list and insert our dataframe rows one by one into the
database by iterating through each row and using INSERT INTO to insert that row’s
values into the database.
(It is also possible to insert the entire DataFrame at once, and we’ll look at a way of
doing that in the next section, but first let’s look at how to do it row-by-row).
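The row-by-row loop described above can be sketched as follows. This is one possible version, not necessarily the exact code the tutorial originally used: it relies on an in-memory SQLite database as a stand-in for MySQL (an assumption), so the placeholder is ? where pymysql would use %s, and the column types are guesses.

```python
import sqlite3
import pandas as pd

data = pd.DataFrame({
    'book_id': [12345, 12346, 12347],
    'title': ['Python Programming', 'Learn MySQL', 'Data Science Cookbook'],
    'price': [29, 23, 27]
})

# Stand-in for MySQL (assumption): in-memory SQLite; column types are assumed.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE book_details (book_id INT, title VARCHAR(50), price INT)")

# Build the column list and one placeholder per column.
# sqlite3 uses ? as its placeholder; pymysql uses %s instead.
cols = ", ".join(data.columns)
placeholders = ", ".join(["?"] * len(data.columns))
sql = f"INSERT INTO book_details ({cols}) VALUES ({placeholders})"

# Insert the DataFrame one row at a time.
for row in data.itertuples(index=False, name=None):
    cursor.execute(sql, row)

connection.commit()
inserted = cursor.execute("SELECT * FROM book_details").fetchall()
print(inserted)
```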
# the connection is not autocommitted by default, so we must commit to save our changes
connection.commit()
Again, let’s query the database to make sure that our inserted data has been saved
correctly.
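# Execute query
sql = "SELECT * FROM `book_details`"
cursor.execute(sql)
# Fetch all the records and print them one line at a time
result = cursor.fetchall()
for i in result:
    print(i)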
Once we’re satisfied that everything looks right, we can close the connection.
connection.close()
Inserting Pandas DataFrames Into a Database Using the to_sql() Function
The to_sql() approach accomplishes the same end result in a more direct way, and allows us to
add a whole dataframe to a MySQL database all at once.
# Import modules
import pandas as pd
# Create dataframe
data = pd.DataFrame({
    'book_id': [12345, 12346, 12347],
    'title': ['Python Programming', 'Learn MySQL', 'Data Science Cookbook'],
    'price': [29, 23, 27]
})
data
   book_id                  title  price
0    12345     Python Programming     29
1    12346            Learn MySQL     23
2    12347  Data Science Cookbook     27
Import the module sqlalchemy and create an engine with the parameters user,
password, and database name. This is how we connect and log in to the MySQL
database.
Once we’re connected, we can export the whole DataFrame to MySQL using the
to_sql() function with the parameters table name, engine name, if_exists, and
chunksize.
We’ll take a closer look at what each of these parameters refers to in a moment, but
first, take a look at how much simpler it is to insert a pandas DataFrame into a MySQL
database using this method. We can do it with just a single line of code:
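A call of this kind might look like the sketch below. To keep it runnable without a MySQL server, it passes a sqlite3 connection as con (pandas accepts one directly); with MySQL, con would be the SQLAlchemy engine from the previous step. The chunksize value is illustrative, and index=False is an addition here rather than something the tutorial specifies.

```python
import sqlite3
import pandas as pd

data = pd.DataFrame({
    'book_id': [12345, 12346, 12347],
    'title': ['Python Programming', 'Learn MySQL', 'Data Science Cookbook'],
    'price': [29, 23, 27]
})

# Stand-in for MySQL (assumption): pandas also accepts a sqlite3 connection.
# With MySQL, pass the SQLAlchemy engine instead, e.g. one created with
# create_engine('mysql+pymysql://user:password@localhost/dbname').
connection = sqlite3.connect(":memory:")

# The single line that writes the whole DataFrame.
# chunksize=1000 is an illustrative value; index=False keeps the DataFrame
# index from being written as an extra column.
data.to_sql('book_details', con=connection, if_exists='append',
            chunksize=1000, index=False)

stored = connection.execute("SELECT * FROM book_details").fetchall()
print(stored)
```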
Now let’s take a closer look at what each of these parameters is doing in our code.
book_details is the name of the table into which we want to insert our DataFrame.
con = engine provides the connection details (recall that we created engine
using our authentication details in the previous step).
if_exists = 'append' checks whether the table we specified already exists or not,
and then appends the new data (if it does exist) or creates a new table (if it
doesn’t).
chunksize writes records in batches of a given size at a time. By default, all rows
will be written at once.
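Reading Records From a Database
So far, we’ve checked our inserts by printing the entire table, but that isn’t practical
with larger tables, where you’d be printing thousands of rows (or more). So let’s take a
more in-depth look at how we can read back the records we’ve created or inserted into
our SQL database.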
We can read records from a SQL database using the SELECT command. We can select
specific columns, or use * to select everything from a given table. We can also choose to
return only records that meet a particular condition using the WHERE clause.
With larger databases, WHERE is useful for returning only the data we want to see. So if,
for example, we’ve just inserted some new data about a particular department, we could
use WHERE to specify the department ID in our query, and it would return only the
records with a department ID that matches the one we specified.
Compare, for example, the results of these two queries using our employee table from
earlier. In the first, we’re returning all the rows. In the second, we’re getting back only
the rows we’ve asked for. This may not make a big difference when our table has seven
rows, but when you’re working with seven thousand rows, or even seven million, using
WHERE to return only the results you want is very important!
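To make the comparison concrete, here is a small sketch of the two queries run against the seven-row employee table. The rows are taken from this tutorial; the in-memory SQLite database is a stand-in for MySQL, the column types are assumed, and filtering on DeptID = 2 is just one example condition.

```python
import sqlite3

# Stand-in for MySQL (assumption): in-memory SQLite; column types are assumed.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("""
    CREATE TABLE employee (
        EmployeeID INTEGER, Ename TEXT, DeptID INTEGER,
        Salary INTEGER, Dname TEXT, Dlocation TEXT
    )
""")
employees = [
    (1001, 'John', 2, 4000, 'IT', 'New Delhi'),
    (1002, 'Anna', 1, 3500, 'HR', 'Mumbai'),
    (1003, 'James', 1, 2500, 'HR', 'Mumbai'),
    (1004, 'David', 2, 5000, 'IT', 'New Delhi'),
    (1005, 'Mark', 2, 3000, 'IT', 'New Delhi'),
    (1006, 'Steve', 3, 4500, 'Finance', 'Mumbai'),
    (1007, 'Alice', 3, 3500, 'Finance', 'Mumbai'),
]
cursor.executemany("INSERT INTO employee VALUES (?, ?, ?, ?, ?, ?)", employees)

# First query: every row in the table.
all_rows = cursor.execute("SELECT * FROM employee").fetchall()

# Second query: only the rows for department 2.
it_rows = cursor.execute("SELECT * FROM employee WHERE DeptID = 2").fetchall()

print(len(all_rows), "rows without WHERE")
print(len(it_rows), "rows with WHERE DeptID = 2")
for row in it_rows:
    print(row)
```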
If we want to do this from within Python, we can use the same script we used earlier in
this tutorial to query these records. The only difference is that we’ll tell pymysql to
execute the SELECT command rather than the INSERT command we used earlier.
# Import module
import pymysql
# create connection
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='12345',
                             db='employee')
# Create cursor
my_cursor = connection.cursor()
# Execute Query
my_cursor.execute("SELECT * from employee")
# Fetch the records
result = my_cursor.fetchall()
for i in result:
    print(i)
Above, we’ve selected and printed the entire database, but if we wanted to use WHERE
to make a more careful, limited selection, the approach is the same:
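One way to write such a limited selection is to pass the condition’s value as a query parameter, as in this sketch. The function name fetch_department is hypothetical, only three of the employee rows are loaded to keep it short, and SQLite again stands in for MySQL (so the placeholder is ? rather than pymysql’s %s).

```python
import sqlite3


def fetch_department(connection, dept_id):
    """Return only the employee rows whose DeptID matches dept_id."""
    cursor = connection.cursor()
    # The value is passed separately from the query text.
    cursor.execute("SELECT * FROM employee WHERE DeptID = ?", (dept_id,))
    return cursor.fetchall()


# Stand-in for MySQL (assumption): in-memory SQLite; column types are assumed.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE employee (EmployeeID INTEGER, Ename TEXT, DeptID INTEGER, "
    "Salary INTEGER, Dname TEXT, Dlocation TEXT)"
)
connection.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1001, 'John', 2, 4000, 'IT', 'New Delhi'),
        (1002, 'Anna', 1, 3500, 'HR', 'Mumbai'),
        (1003, 'James', 1, 2500, 'HR', 'Mumbai'),
    ],
)

hr_rows = fetch_department(connection, 1)
for i in hr_rows:
    print(i)
```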
Updating Records in the Database
Often, we’ll need to modify the records in the table after creating them.
For example, imagine that an employee in our employee table got a promotion. We’d
want to update their salary data. The INSERT INTO command won’t help us here,
because we don’t want to add an entirely new row.
To modify existing records in the table, we need to use the UPDATE command.
UPDATE is used to change the contents of existing records. We can specify the
columns and values to change using SET, and we can also make conditional changes
with WHERE to apply those changes only to rows that meet that condition.
Now, let’s update the records from our employee table and display the results. In this
case, let’s say David got the promotion: we’ll write a query using UPDATE that sets
Salary to 6000 only in rows where the employee ID is 1004 (David’s ID).
Be careful: without the WHERE clause, this query would update every record in the
table, so don’t forget that!
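The update for David can be sketched like this. The UPDATE statement follows the description above; the in-memory SQLite database is a stand-in for MySQL, the column types are assumed, and only two employee rows are loaded so the effect of WHERE is easy to see.

```python
import sqlite3

# Stand-in for MySQL (assumption): in-memory SQLite; column types are assumed.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute(
    "CREATE TABLE employee (EmployeeID INTEGER, Ename TEXT, DeptID INTEGER, "
    "Salary INTEGER, Dname TEXT, Dlocation TEXT)"
)
cursor.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1004, 'David', 2, 5000, 'IT', 'New Delhi'),
        (1005, 'Mark', 2, 3000, 'IT', 'New Delhi'),
    ],
)

# SET names the column and its new value; WHERE limits the change to David's row.
cursor.execute("UPDATE employee SET Salary = 6000 WHERE EmployeeID = 1004")
connection.commit()

# Map each employee ID to its salary to check the result.
salaries = dict(cursor.execute("SELECT EmployeeID, Salary FROM employee").fetchall())
print(salaries)
```

Only the row for employee 1004 changes; Mark’s salary stays as it was because his row does not match the WHERE condition.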
Conclusion
In this tutorial, we’ve taken a look at SQL inserts and how to insert data into MySQL
databases from Python. We also learned to insert Pandas DataFrames into SQL
databases using two different methods, including the highly efficient to_sql() method.
Of course, this is just the tip of the iceberg when it comes to SQL queries. If you really
want to become a master of SQL, sign up for free and dive into one of Dataquest’s
interactive SQL courses to get interactive instruction and hands-on experience writing
all the queries you’ll need to do productive, professional data science work.