Exploring your working directory
In order to import data into Python, you should first have an idea of what files
are in your working directory.
IPython, which is running on DataCamp's servers, has a number of useful commands, including its magic commands. For example, starting a line with ! gives you complete system shell access, so the command !ls will display the contents of your current directory. Your task is to use !ls to check out the contents of your current directory and answer the following question: which of the following files is in your working directory?
!ls
Importing entire text files
In this exercise, you'll open the file moby_dick.txt as read-only, print its contents to the shell and check whether the file is closed before and after closing it.
# Open a file: file
file = open('moby_dick.txt', 'r')
# Print it
print(file.read())
# Check whether file is closed
print(file.closed)
# Close file
file.close()
# Check whether file is closed
print(file.closed)
Importing text files line by line
For large files, you may not want to print all of their content to the shell: you may wish to print only the first few lines. Enter the readline() method, which allows you to do exactly that. When a file called file is open, you can print out the first line by executing file.readline(). If you execute the same command again, the second line will print, and so on.
In the introductory video, Hugo also introduced the concept of a context manager.
He showed that you can bind a variable file by using a context manager construct:
with open('huck_finn.txt') as file:
While still within this construct, the variable file will be bound to
open('huck_finn.txt'); thus, to print the file to the shell, all the code you need
to execute is:
with open('huck_finn.txt') as file:
    print(file.readline())
# Read & print the first 3 lines
with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())
Using NumPy to import flat files
In this exercise, you're now going to load the MNIST digit recognition dataset
using the numpy function loadtxt() and see just how easy it can be:
The first argument will be the filename.
The second will be the delimiter which, in this case, is a comma.
You can find more information about the MNIST dataset here on the webpage of Yann
LeCun, who is currently Director of AI Research at Facebook and Founding Director
of the NYU Center for Data Science, among many other things.
Instructions
100 XP
Fill in the arguments of np.loadtxt() by passing file and a comma ',' for the
delimiter.
Fill in the argument of print() to print the type of the object digits. Use the
function type().
Execute the rest of the code to visualize one of the rows of the data.
# Import package
import numpy as np
# Assign filename to variable: file
file = 'digits.csv'
# Load file as array: digits
digits = np.loadtxt(file, delimiter=',')
# Print datatype of digits
print(type(digits))
# Select and reshape a row
im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))
# Plot reshaped data (matplotlib.pyplot already loaded as plt)
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
plt.show()
Customizing your NumPy import
What if there are rows, such as a header, that you don't want to import? What if
your file has a delimiter other than a comma? What if you only wish to import
particular columns?
There are a number of arguments that np.loadtxt() takes that you'll find useful:
delimiter changes the delimiter that loadtxt() is expecting, for example, you can
use ',' and '\t' for comma-delimited and tab-delimited respectively; skiprows
allows you to specify how many rows (not indices) you wish to skip; usecols takes a
list of the indices of the columns you wish to keep.
The file that you'll be importing, digits_header.txt, has a header and is tab-delimited.
Instructions
100 XP
Complete the arguments of np.loadtxt(): the file you're importing is tab-delimited,
you want to skip the first row and you only want to import the first and third
columns.
Complete the argument of the print() call in order to print the entire array that
you just imported.
# Import numpy
import numpy as np
# Assign the filename: file
file = 'digits_header.txt'
# Load the data: data
data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0,2])
# Print data
print(data)
Importing different datatypes
The file seaslug.txt has a text header consisting of strings and is tab-delimited.
The data consist of percentages of sea slug larvae that had metamorphosed in a given time period. Read more here.
Due to the header, if you tried to import it as-is using np.loadtxt(), Python would
throw you a ValueError and tell you that it could not convert string to float.
There are two ways to deal with this: firstly, you can set the data type argument
dtype equal to str (for string).
Alternatively, you can skip the first row as we have seen before, using the
skiprows argument.
Instructions
100 XP
Complete the first call to np.loadtxt() by passing file as the first argument.
Execute print(data[0]) to print the first element of data.
Complete the second call to np.loadtxt(). The file you're importing is tab-
delimited, the datatype is float, and you want to skip the first row.
Print the 10th element of data_float by completing the print() command. Be guided
by the previous print() call.
Execute the rest of the code to visualize the data.
# Assign filename: file
file = 'seaslug.txt'
# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)
# Print the first element of data
print(data[0])
# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)
# Print the 10th element of data_float
print(data_float[9])
# Plot a scatterplot of the data
plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('time (min.)')
plt.ylabel('percentage of larvae')
plt.show()
Working with mixed datatypes (1)
Much of the time you will need to import datasets which have different datatypes in
different columns; one column may contain strings and another floats, for example.
The function np.loadtxt() will freak out at this. There is another function, np.genfromtxt(), which can handle such structures. If you pass dtype=None to it, it will figure out what type each column should be.
Import 'titanic.csv' using the function np.genfromtxt() as follows:
data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)
Here, the first argument is the filename, the second specifies the delimiter , and
the third argument names tells us there is a header. Because the data are of
different types, data is an object called a structured array. Because numpy arrays
have to contain elements that are all the same type, the structured array solves
this by being a 1D array, where each element of the array is a row of the flat file
imported. You can test this by checking out the array's shape in the shell by
executing np.shape(data).
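As a quick illustration (a sketch, not part of the exercise; it assumes the DataCamp titanic.csv with columns such as 'Survived' and 'Fare'), you can pull whole columns out of a structured array by name:
# Import the Titanic data as a structured array (names=True reads the header)
import numpy as np
data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)
# data is a 1D array of records: one element per row of the flat file
print(np.shape(data))
# Columns are accessed by name, assuming a column called 'Fare' exists
print(data['Fare'][:5])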
Working with mixed datatypes (2)
You have just used np.genfromtxt() to import data containing mixed datatypes. There
is also another function np.recfromcsv() that behaves similarly to np.genfromtxt(),
except that its default dtype is None. In this exercise, you'll practice using this
to achieve the same result.
Instructions
100 XP
Import titanic.csv using the function np.recfromcsv() and assign it to the
variable, d. You'll only need to pass file to it because it has the defaults
delimiter=',' and names=True in addition to dtype=None!
Run the remaining code to print the first three entries of the resulting array d.
# Assign the filename: file
file = 'titanic.csv'
# Import file using np.recfromcsv: d
d = np.recfromcsv(file)
# Print out first three entries of d
print(d[:3])
Using pandas to import flat files as DataFrames (1)
In the last exercise, you were able to import flat files containing columns with
different datatypes as numpy arrays. However, the DataFrame object in pandas is a
more appropriate structure in which to store such data and, thankfully, we can
easily import files of mixed data types as DataFrames using the pandas functions
read_csv() and read_table().
Import the pandas package using the alias pd.
Read titanic.csv into a DataFrame called df. The file name is already stored in the
file object.
In a print() call, view the head of the DataFrame.
# Import pandas as pd
import pandas as pd
# Assign the filename: file
file = 'titanic.csv'
# Read the file into a DataFrame: df
df = pd.read_csv(file)
# View the head of the DataFrame
print(df.head())
Using pandas to import flat files as DataFrames (2)
In the last exercise, you were able to import flat files into a pandas DataFrame.
As a bonus, it is then straightforward to retrieve the corresponding numpy array
using the attribute values. You'll now have a chance to do this using the MNIST
dataset, which is available as digits.csv.
Import the first 5 rows of the file into a DataFrame using the function
pd.read_csv() and assign the result to data. You'll need to use the arguments nrows
and header (there is no header in this file).
Build a numpy array from the resulting DataFrame in data and assign to data_array.
Execute print(type(data_array)) to print the datatype of data_array.
# Assign the filename: file
file = 'digits.csv'
# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file, nrows=5, header=None)
# Build a numpy array from the DataFrame: data_array
data_array = data.values
# Print the datatype of data_array to the shell
print(type(data_array))
Customizing your pandas import
The pandas package is also great at dealing with many of the issues you will
encounter when importing data as a data scientist, such as comments occurring in
flat files, empty lines and missing values. Note that missing values are also
commonly referred to as NA or NaN. To wrap up this chapter, you're now going to import a slightly corrupted copy of the Titanic dataset, titanic_corrupt.txt, which contains comments after the character '#', is tab-delimited and encodes missing values as the string 'Nothing'.
Complete the sep (the pandas version of delimiter), comment and na_values arguments of
pd.read_csv(). comment takes characters that comments occur after in the file,
which in this case is '#'. na_values takes a list of strings to recognize as
NA/NaN, in this case the string 'Nothing'.
Execute the rest of the code to print the head of the resulting DataFrame and plot
the histogram of the 'Age' of passengers aboard the Titanic.
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Assign filename: file
file = 'titanic_corrupt.txt'
# Import file: data
data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing')
# Print the head of the DataFrame
print(data.head())
# Plot 'Age' variable in a histogram
pd.DataFrame.hist(data[['Age']])
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()
Not so flat any more
In Chapter 1, you learned how to use the IPython command !ls to explore your current working directory. You can also do this natively in Python using the library os, which consists of miscellaneous operating system interfaces.
The following code imports the library os, stores the name of the current directory in a string called wd and outputs the contents of the directory in a list to the shell.
import os
wd = os.getcwd()
os.listdir(wd)
Loading a pickled file
There are a number of datatypes that cannot be saved easily to flat files, such as
lists and dictionaries. If you want your files to be human readable, you may want
to save them as text files in a clever manner. JSONs, which you will see in a later
chapter, are appropriate for Python dictionaries.
However, if you merely want to be able to import them into Python, you can
serialize them. All this means is converting the object into a sequence of bytes,
or a bytestream.
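For context, a pickled file like the one you're about to load could have been created with pickle.dump(); here is a minimal sketch (the dictionary below is made up for illustration, not the actual contents of data.pkl):
# Serialize a Python dictionary to a bytestream and write it in binary mode ('wb')
import pickle
example = {'peaks': [1, 5, 2], 'units': 'mV'}
with open('data.pkl', 'wb') as f:
    pickle.dump(example, f)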
In this exercise, you'll import the pickle package, open a previously pickled data
structure from a file and load it.
Import the pickle package.
Complete the second argument of open() so that it is read only for a binary file.
This argument will be a string of two letters, one signifying 'read only', the
other 'binary'.
Pass the correct argument to pickle.load(); it should use the variable that is bound to the opened file (here, file).
Print the data, d.
Print the datatype of d; take your mind back to your previous use of the function
type().
# Import pickle package
import pickle
# Open pickle file and load data: d
with open('data.pkl', 'rb') as file:
    d = pickle.load(file)
# Print d
print(d)
# Print datatype of d
print(type(d))
Listing sheets in Excel files
Whether you like it or not, any working data scientist will need to deal with Excel
spreadsheets at some point in time. You won't always want to do so in Excel,
however!
Here, you'll learn how to use pandas to import Excel spreadsheets and how to list
the names of the sheets in any loaded .xlsx file.
Recall from the video that, given an Excel file imported into a variable
spreadsheet, you can retrieve a list of the sheet names using the attribute
spreadsheet.sheet_names.
Specifically, you'll be loading and checking out the spreadsheet
'battledeath.xlsx', modified from the Peace Research Institute Oslo's (PRIO)
dataset. This data contains age-adjusted mortality rates due to war in various
countries over several years.
Assign the filename to the variable file.
Pass the correct argument to pd.ExcelFile() to load the file using pandas.
Print the sheetnames of the Excel spreadsheet by passing the necessary argument to
the print() function.
# Import pandas
import pandas as pd
# Assign spreadsheet filename: file
file = 'battledeath.xlsx'
# Load spreadsheet: xls
xls = pd.ExcelFile(file)
# Print sheet names
print(xls.sheet_names)
Importing sheets from Excel files
In this exercise, you'll learn how to import any given sheet of your loaded .xlsx
file as a DataFrame. You'll be able to do so by specifying either the sheet's name
or its index.
Load the sheet '2004' into the DataFrame df1 using its name as a string.
Print the head of df1 to the shell.
Load the sheet 2002 into the DataFrame df2 using its index (0).
Print the head of df2 to the shell.
# Load a sheet into a DataFrame by name: df1
df1 = xls.parse('2004')
print(df1.head())
# Load a sheet into a DataFrame by index: df2
df2 = xls.parse(0)
print(df2.head())
Customizing your spreadsheet import
Here, you'll parse your spreadsheets and use additional arguments to skip rows,
rename columns and select only particular columns.
As before, you'll use the method parse(). This time, however, you'll add the
additional arguments skiprows, names and usecols. These skip rows, name the columns
and designate which columns to parse, respectively. All these arguments can be
assigned to lists containing the specific row numbers, strings and column numbers,
as appropriate.
Parse the first sheet by index. In doing so, skip the first row of data and name
the columns 'Country' and 'AAM due to War (2002)' using the argument names. The
values passed to skiprows and names all need to be of type list.
Parse the second sheet by index. In doing so, parse only the first column with the
usecols parameter, skip the first row and rename the column 'Country'. The argument
passed to usecols also needs to be of type list.
# Parse the first sheet and rename the columns: df1
df1 = xls.parse(0, skiprows=[0], names=['Country', 'AAM due to War (2002)'])
print(df1.head())
# Parse the first column of the second sheet and rename the column: df2
df2 = xls.parse(1, usecols=[0], skiprows=[0], names=['Country'])
# Print the head of the DataFrame df2
print(df2.head())
Importing SAS files
In this exercise, you'll import the SAS file sales.sas7bdat into a pandas DataFrame using the sas7bdat package and plot a histogram of one of its features.
# Import sas7bdat package
from sas7bdat import SAS7BDAT
# Save file to a DataFrame: df_sas
with SAS7BDAT('sales.sas7bdat') as file:
    df_sas = file.to_data_frame()
# Print head of DataFrame
print(df_sas.head())
# Plot histogram of DataFrame features (pandas and pyplot already imported)
pd.DataFrame.hist(df_sas[['P']])
plt.ylabel('count')
plt.show()
Using read_stata to import Stata files
The pandas package has been imported in the environment as pd and the file
disarea.dta is in your working directory. The data consist of disease extents for
several diseases in various countries (more information can be found here).
# Import pandas
import pandas as pd
# Load Stata file into a pandas DataFrame: df
df = pd.read_stata('disarea.dta')
# Print the head of the DataFrame df
print(df.head())
# Plot histogram of one column of the DataFrame
pd.DataFrame.hist(df[['disa10']])
plt.xlabel('Extent of disease')
plt.ylabel('Number of countries')
plt.show()
Using h5py to import HDF5 files
The file 'LIGO_data.hdf5' is already in your working directory. In this exercise,
you'll import it using the h5py library. You'll also print out its datatype to
confirm you have imported it correctly. You'll then study the structure of the file
in order to see precisely what HDF groups it contains.
You can find the LIGO data plus loads of documentation and tutorials here. There is
also a great tutorial on Signal Processing with the data here.
Import the package h5py.
Assign the name of the file to the variable file.
Load the file as read only into the variable data.
Print the datatype of data.
Print the names of the groups in the HDF5 file 'LIGO_data.hdf5'.
# Import packages
import numpy as np
import h5py
# Assign filename: file
file = 'LIGO_data.hdf5'
# Load file: data
data = h5py.File(file, 'r')
# Print the datatype of the loaded file
print(type(data))
# Print the keys of the file
for key in data.keys():
    print(key)
Extracting data from your HDF5 file
In this exercise, you'll extract some of the LIGO experiment's actual data from the
HDF5 file and you'll visualize it.
To do so, you'll need to first explore the HDF5 group 'strain'.
Assign the HDF5 group data['strain'] to group.
In the for loop, print out the keys of the HDF5 group in group.
Assign to the variable strain the values of the time series data data['strain']['Strain'] using the attribute .value.
Set num_samples equal to 10000, the number of time points we wish to sample.
Execute the rest of the code to produce a plot of the time series data in
LIGO_data.hdf5.
# Get the HDF5 group: group
group = data['strain']
# Check out keys of group
for key in group.keys():
    print(key)
# Set variable equal to time series data: strain
strain = data['strain']['Strain'].value
# Set number of time points to sample: num_samples
num_samples = 10000
# Set time vector
time = np.arange(0, 1, 1/num_samples)
# Plot data
plt.plot(time, strain[:num_samples])
plt.xlabel('GPS Time (s)')
plt.ylabel('strain')
plt.show()
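A side note, not part of the exercise: the .value attribute used above was removed in h5py 3.0, so on a recent h5py install you would slice the dataset instead. A minimal sketch, assuming the same LIGO_data.hdf5 file:
# Slicing a dataset returns the same NumPy array that .value used to
import h5py
with h5py.File('LIGO_data.hdf5', 'r') as data:
    strain = data['strain']['Strain'][:]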
Loading .mat files
In this exercise, you'll figure out how to load a MATLAB file using
scipy.io.loadmat() and you'll discover what Python datatype it yields.
The file 'albeck_gene_expression.mat' is in your working directory. This file
contains gene expression data from the Albeck Lab at UC Davis. You can find the
data and some great documentation here.
Instructions
100 XP
Import the package scipy.io.
Load the file 'albeck_gene_expression.mat' into the variable mat; do so using the
function scipy.io.loadmat().
Use the function type() to print the datatype of mat to the IPython shell.
# Import package
import scipy.io
# Load MATLAB file: mat
mat = scipy.io.loadmat('albeck_gene_expression.mat')
# Print the datatype of mat
print(type(mat))
The structure of .mat in Python
Here, you'll discover what is in the MATLAB dictionary that you loaded in the
previous exercise.
The file 'albeck_gene_expression.mat' is already loaded into the variable mat. The
following libraries have already been imported as follows:
import scipy.io
import matplotlib.pyplot as plt
import numpy as np
Once again, this file contains gene expression data from the Albeck Lab at UC Davis.
You can find the data and some great documentation here.
Instructions
100 XP
Use the method .keys() on the dictionary mat to print the keys. Most of these keys
(in fact the ones that do NOT begin and end with '__') are variables from the
corresponding MATLAB environment.
Print the type of the value corresponding to the key 'CYratioCyt' in mat. Recall
that mat['CYratioCyt'] accesses the value.
Print the shape of the value corresponding to the key 'CYratioCyt' using the numpy
function shape().
Execute the entire script to see some oscillatory gene expression data!
# Print the keys of the MATLAB dictionary
print(mat.keys())
# Print the type of the value corresponding to the key 'CYratioCyt'
print(type(mat['CYratioCyt']))
# Print the shape of the value corresponding to the key 'CYratioCyt'
print(np.shape(mat['CYratioCyt']))
# Subset the array and plot it
data = mat['CYratioCyt'][25, 5:]
fig = plt.figure()
plt.plot(data)
plt.xlabel('time (min.)')
plt.ylabel('normalized fluorescence (measure of expression)')
plt.show()
Pop quiz: The relational model
Each row or record in a table represents an instance of an entity type.
Each column in a table represents an attribute or feature of an instance.
Every table contains a primary key column, which has a unique entry for each row.
There are relations between tables.
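To make the primary-key and relation ideas concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names echo the Chinook database used below, but the schema is purely illustrative, not Chinook's actual definition:
# Create an in-memory database with two related tables
import sqlite3
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT)')
con.execute('''CREATE TABLE Album (
                   AlbumId INTEGER PRIMARY KEY,                  -- unique entry for each row
                   Title TEXT,
                   ArtistId INTEGER REFERENCES Artist(ArtistId)  -- relation to the Artist table
               )''')
con.close()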
Creating a database engine
Here, you're going to fire up your very first SQL engine. You'll create an engine
to connect to the SQLite database 'Chinook.sqlite', which is in your working
directory. Remember that to create an engine to connect to 'Northwind.sqlite', Hugo
executed the command
engine = create_engine('sqlite:///Northwind.sqlite')
Here, 'sqlite:///Northwind.sqlite' is called the connection string to the SQLite
database Northwind.sqlite. A little bit of background on the Chinook database: the
Chinook database contains information about a semi-fictional digital media store in
which media data is real and customer, employee and sales data has been manually
created.
Why the name Chinook, you ask? According to their website,
The name of this sample database was based on the Northwind database. Chinooks are
winds in the interior West of North America, where the Canadian Prairies and Great
Plains meet various mountain ranges. Chinooks are most prevalent over southern
Alberta in Canada. Chinook is a good name choice for a database that intends to be
an alternative to Northwind.
Import the function create_engine from the module sqlalchemy.
Create an engine to connect to the SQLite database 'Chinook.sqlite' and assign it
to engine.
# Import necessary module
from sqlalchemy import create_engine
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
What are the tables in the database?
In this exercise, you'll once again create an engine to connect to
'Chinook.sqlite'. Before you can get any data out of the database, however, you'll
need to know what tables it contains!
To this end, you'll save the table names to a list using the method table_names()
on the engine and then you will print the list.
Import the function create_engine from the module sqlalchemy.
Create an engine to connect to the SQLite database 'Chinook.sqlite' and assign it
to engine.
Using the method table_names() on the engine engine, assign the table names of
'Chinook.sqlite' to the variable table_names.
Print the object table_names to the shell.
# Import necessary module
from sqlalchemy import create_engine
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Save the table names to a list: table_names
table_names = engine.table_names()
print(table_names)
The Hello World of SQL Queries!
Now, it's time for liftoff! In this exercise, you'll perform the Hello World of SQL
queries, SELECT, in order to retrieve all columns of the table Album in the Chinook
database. Recall that the query SELECT * selects all columns.
Instructions
100 XP
Open the engine connection as con using the method connect() on the engine.
Execute the query that selects ALL columns from the Album table. Store the results
in rs.
Store all of your query results in the DataFrame df by applying the fetchall()
method to the results rs.
Close the connection!
# Import packages
from sqlalchemy import create_engine
import pandas as pd
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Open engine connection: con
con = engine.connect()
# Perform query: rs
rs = con.execute('SELECT * FROM Album')
# Save results of the query to DataFrame: df
df = pd.DataFrame(rs.fetchall())
# Close connection
con.close()
# Print head of DataFrame df
print(df.head())
Customizing the Hello World of SQL Queries
Now you're going to customize your query in order to:
Select specified columns from a table;
Select a specified number of rows;
Import column names from the database table.
Recall that Hugo performed a very similar query customization in the video:
engine = create_engine('sqlite:///Northwind.sqlite')
with engine.connect() as con:
    rs = con.execute("SELECT OrderID, OrderDate, ShipName FROM Orders")
    df = pd.DataFrame(rs.fetchmany(size=5))
    df.columns = rs.keys()
Packages have already been imported as follows:
from sqlalchemy import create_engine
import pandas as pd
The engine has also already been created:
engine = create_engine('sqlite:///Chinook.sqlite')
The engine connection is already open with the statement
with engine.connect() as con:
All the code you need to complete is within this context.
# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute('SELECT LastName, Title FROM Employee')
    df = pd.DataFrame(rs.fetchmany(size=3))
    df.columns = rs.keys()
# Print the length of the DataFrame df
print(len(df))
# Print the head of the DataFrame df
print(df.head())
Filtering your database records using SQL's WHERE
You can now execute a basic SQL query to select records from any table in your
database and you can also perform simple query customizations to select particular
columns and numbers of rows.
There are a couple more standard SQL query chops that will aid you in your journey
to becoming an SQL ninja.
Let's say, for example that you wanted to get all records from the Customer table
of the Chinook database for which the Country is 'Canada'. You can do this very
easily in SQL using a SELECT statement followed by a WHERE clause as follows:
SELECT * FROM Customer WHERE Country = 'Canada'
In fact, you can filter any SELECT statement by any condition using a WHERE clause.
This is called filtering your records.
In this interactive exercise, you'll select all records of the Employee table for
which 'EmployeeId' is greater than or equal to 6.
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute('SELECT * FROM Employee WHERE EmployeeId >= 6')
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()
# Print the head of the DataFrame df
print(df.head())
Ordering your SQL records with ORDER BY
You can also order your SQL query results. For example, if you wanted to get all
records from the Customer table of the Chinook database and order them in
increasing order by the column SupportRepId, you could do so with the following
query:
"SELECT * FROM Customer ORDER BY SupportRepId"
In fact, you can order any SELECT statement by any column.
In this interactive exercise, you'll select all records of the Employee table and
order them in increasing order by the column BirthDate.
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Open engine in context manager
with engine.connect() as con:
    rs = con.execute('SELECT * FROM Employee ORDER BY BirthDate')
    df = pd.DataFrame(rs.fetchall())
    # Set the DataFrame's column names
    df.columns = rs.keys()
# Print head of DataFrame
print(df.head())
Pandas and The Hello World of SQL Queries!
Here, you'll take advantage of the power of pandas to write the results of your SQL
query to a DataFrame in one swift line of Python code!
You'll first import pandas and create the SQLite 'Chinook.sqlite' engine. Then
you'll query the database to select all records from the Album table.
# Import packages
from sqlalchemy import create_engine
import pandas as pd
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Execute query and store records in DataFrame: df
df = pd.read_sql_query('SELECT * FROM Album', engine)
# Execute query and store records in DataFrame: df2
df2 = pd.read_sql_query('SELECT * FROM Employee WHERE EmployeeId >= 6 ORDER BY BirthDate', engine)
# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute('SELECT Title, Name FROM Album INNER JOIN Artist ON Album.ArtistID = Artist.ArtistID')
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()
# Execute query and store records in DataFrame: df
df = pd.read_sql_query('SELECT * FROM PlaylistTrack INNER JOIN Track ON PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000', engine)
Importing flat files from the web: your turn!
You are about to import your first file from the web! The flat file you will import
will be 'winequality-red.csv' from the University of California, Irvine's Machine
Learning repository. The flat file contains tabular data of physiochemical
properties of red wine, such as pH, alcohol content and citric acid content, along
with wine quality rating.
The URL of the file is 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'.
After you import it, you'll check your working directory to confirm that it is
there and then you'll load it into a pandas DataFrame.
Instructions
100 XP
Import the function urlretrieve from the subpackage urllib.request.
Assign the URL of the file to the variable url.
Use the function urlretrieve() to save the file locally as 'winequality-red.csv'.
Execute the remaining code to load 'winequality-red.csv' in a pandas DataFrame and
to print its head to the shell.
# Import package
from urllib.request import urlretrieve
import pandas as pd
# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
# Save file locally
urlretrieve(url, 'winequality-red.csv')
df = pd.read_csv('winequality-red.csv', sep=';')
print(df.head())
Opening and reading flat files from the web
You have just imported a file from the web, saved it locally and loaded it into a
DataFrame. If you just wanted to load a file from the web into a DataFrame without
first saving it locally, you can do that easily using pandas. In particular, you
can use the function pd.read_csv() with the URL as the first argument and the
separator sep as the second argument.
Assign the URL of the file to the variable url.
Read file into a DataFrame df using pd.read_csv(), recalling that the separator in
the file is ';'.
Print the head of the DataFrame df.
Execute the rest of the code to plot a histogram of the first feature in the DataFrame df.
# Import packages
import matplotlib.pyplot as plt
import pandas as pd
# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
# Read file into a DataFrame: df
df = pd.read_csv(url, sep=';')
print(df.head())
# Plot first column of df (using iloc; df.ix was removed from pandas)
pd.DataFrame.hist(df.iloc[:, 0:1])
plt.xlabel('fixed acidity (g(tartaric acid)/dm$^3$)')
plt.ylabel('count')
plt.show()
Importing non-flat files from the web
Congrats! You've just loaded a flat file from the web into a DataFrame without
first saving it locally using the pandas function pd.read_csv(). This function is
super cool because it has close relatives that allow you to load all types of
files, not only flat ones. In this interactive exercise, you'll use pd.read_excel()
to import an Excel spreadsheet.
Your job is to use pd.read_excel() to read in all of its sheets, print the sheet
names and then print the head of the first sheet using its name, not its index.
Note that the output of pd.read_excel() is a Python dictionary with sheet names as
keys and corresponding DataFrames as corresponding values.
Assign the URL of the file to the variable url.
Read the file in url into a dictionary xl using pd.read_excel() recalling that, in
order to import all sheets you need to pass None to the argument sheetname.
Print the names of the sheets in the Excel spreadsheet; these will be the keys of
the dictionary xl.
Print the head of the first sheet using the sheet name, not the index of the sheet!
The sheet name is '1700'
# Import package
import pandas as pd
# Assign url of file: url
url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'
# Read in all sheets of Excel file: xl
xl = pd.read_excel(url, sheetname=None)
# Print the sheetnames to the shell
print(xl.keys())
# Print the head of the first sheet (using its name, NOT its index)
print(xl['1700'].head())
Performing HTTP requests in Python using urllib
Now that you know the basics behind HTTP GET requests, it's time to perform some of
your own. In this interactive exercise, you will ping our very own DataCamp servers
to perform a GET request to extract information from our teach page,
"http://www.datacamp.com/teach/documentation".
In the next exercise, you'll extract the HTML itself. Right now, however, you are
going to package and send the request and then catch the response.
Instructions
100 XP
Import the functions urlopen and Request from the subpackage urllib.request.
Package the request to the url "http://www.datacamp.com/teach/documentation" using
the function Request() and assign it to request.
Send the request and catch the response in the variable response with the function
urlopen().
Run the rest of the code to see the datatype of response and to close the
connection!
# Import packages
from urllib.request import urlopen
from urllib.request import Request
# Specify the url
url = "http://www.datacamp.com/teach/documentation"
# This packages the request: request
request = Request(url)
# Sends the request and catches the response: response
response = urlopen(request)
# Print the datatype of response
print(type(response))
# Be polite and close the response!
response.close()
Printing HTTP request results in Python using urllib
You have just packaged and sent a GET request to
"http://www.datacamp.com/teach/documentation" and then caught the response. You saw
that such a response is an http.client.HTTPResponse object. The question remains:
what can you do with this response?
Well, as it came from an HTML page, you could read it to extract the HTML and, in
fact, such an http.client.HTTPResponse object has an associated read() method. In
this exercise, you'll build on your previous great work to extract the response and
print the HTML.
Instructions
100 XP
Send the request and catch the response in the variable response with the function
urlopen(), as in the previous exercise.
Extract the response using the read() method and store the result in the variable
html.
Print the string html.
Hit submit to perform all of the above and to close the response: be tidy!
# Import packages
from urllib.request import urlopen, Request
# Specify the url
url = "http://www.datacamp.com/teach/documentation"
# This packages the request
request = Request(url)
# Sends the request and catches the response: response
response = urlopen(request)
# Extract the response: html
html = response.read()
# Print the html
print(html)
# Be polite and close the response!
response.close()
Performing HTTP requests in Python using requests
Now that you've got your head and hands around making HTTP requests using the
urllib package, you're going to figure out how to do the same using the higher-
level requests library. You'll once again be pinging DataCamp servers for their
"http://www.datacamp.com/teach/documentation" page.
Note that unlike in the previous exercises using urllib, you don't have to close
the connection when using requests!
Import the package requests.
Assign the URL of interest to the variable url.
Package the request to the URL, send the request and catch the response with a
single function requests.get(), assigning the response to the variable r.
Use the text attribute of the object r to return the HTML of the webpage as a
string; store the result in a variable text.
Hit submit to print the HTML of the webpage.
# Import package
import requests
# Specify the url: url
url = "http://www.datacamp.com/teach/documentation"
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Extract the response: text
text = r.text
# Print the html
print(text)
Parsing HTML with BeautifulSoup
In this interactive exercise, you'll learn how to use the BeautifulSoup package to
parse, prettify and extract information from HTML. You'll scrape the data from the
webpage of Guido van Rossum, Python's very own Benevolent Dictator for Life. In the
following exercises, you'll prettify the HTML and then extract the text and the
hyperlinks.
The URL of interest is url = 'https://www.python.org/~guido/'.
Import the function BeautifulSoup from the package bs4.
Assign the URL of interest to the variable url.
Package the request to the URL, send the request and catch the response with a
single function requests.get(), assigning the response to the variable r.
Use the text attribute of the object r to return the HTML of the webpage as a
string; store the result in a variable html_doc.
Create a BeautifulSoup object soup from the resulting HTML using the function
BeautifulSoup().
Use the method prettify() on soup and assign the result to pretty_soup.
Hit submit to print the prettified HTML to your shell!
# Import packages
import requests
from bs4 import BeautifulSoup
# Specify url: url
url = 'https://www.python.org/~guido/'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Extract the response as html: html_doc
html_doc = r.text
# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc)
# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()
# Print the response
print(pretty_soup)
Turning a webpage into data using BeautifulSoup: getting the text
As promised, in the following exercises, you'll learn the basics of extracting
information from HTML soup. In this exercise, you'll figure out how to extract the
text from the BDFL's webpage, along with printing the webpage's title.
Instructions
100 XP
In the sample code, the HTML response object html_doc has already been created:
your first task is to Soupify it using the function BeautifulSoup() and to assign
the resulting soup to the variable soup.
Extract the title from the HTML soup soup using the attribute title and assign the
result to guido_title.
Print the title of Guido's webpage to the shell using the print() function.
Extract the text from the HTML soup soup using the method get_text() and assign to
guido_text.
Hit submit to print the text from Guido's webpage to the shell.
# Import packages
import requests
from bs4 import BeautifulSoup
# Specify url: url
url = 'https://www.python.org/~guido/'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Extract the response as html: html_doc
html_doc = r.text
# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc)
# Get the title of Guido's webpage: guido_title
guido_title = soup.title
# Print the title of Guido's webpage to the shell
print(guido_title)
# Get Guido's text: guido_text
guido_text = soup.get_text()
# Print Guido's text to the shell
print(guido_text)
Turning a webpage into data using BeautifulSoup: getting the hyperlinks
In this exercise, you'll figure out how to extract the URLs of the hyperlinks from
the BDFL's webpage. In the process, you'll become close friends with the soup
method find_all().
Instructions
100 XP
Use the method find_all() to find all hyperlinks in soup, remembering that
hyperlinks are defined by the HTML tag <a> but passed to find_all() without angle
brackets; store the result in the variable a_tags.
The variable a_tags is a results set: your job now is to enumerate over it, using a
for loop and to print the actual URLs of the hyperlinks; to do this, for every
element link in a_tags, you want to print() link.get('href').
# Import packages
import requests
from bs4 import BeautifulSoup
# Specify url
url = 'https://www.python.org/~guido/'
# Package the request, send the request and catch the response: r
r = requests.get(url)
html_doc = r.text
# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc)
print(soup.title)
# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')
# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))
Loading and exploring a JSON
Load the JSON 'a_movie.json' into the variable json_data, which will be a
dictionary. You'll then explore the JSON contents by printing the key-value pairs
of json_data to the shell.
Load the JSON 'a_movie.json' into the variable json_data within the context
provided by the with statement. To do so, use the function json.load() within the
context manager.
Use a for loop to print all key-value pairs in the dictionary json_data. Recall
that you can access a value in a dictionary using the syntax: dictionary[key].
# Import package
import json
# Load JSON: json_data
with open("a_movie.json") as json_file:
    json_data = json.load(json_file)
# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])
Pop quiz: Exploring your JSON
Load the JSON 'a_movie.json' into a variable, which will be a dictionary. Do so by
copying, pasting and executing the following code in the IPython Shell:
import json
with open("a_movie.json") as json_file:
json_data = json.load(json_file)
Print the values corresponding to the keys 'Title' and 'Year' and answer the
following question about the movie that the JSON describes:
Which of the following statements is true of the movie in question?
The title is 'Kung Fu Panda' and the year is 2010.
The title is 'Kung Fu Panda' and the year is 2008.
The title is 'The Social Network' and the year is 2010.
The title is 'The Social Network' and the year is 2008.
import json
with open("a_movie.json") as json_file:
json_data = json.load(json_file)
print(json_data.keys())
print(json_data.values())
print(json_data.keys(), json_data.values())
print(json_data['Title'])
API requests
Now it's your turn to pull some movie data down from the Open Movie Database (OMDB)
using their API. The movie you'll query the API about is The Social Network. Recall
that, in the video, to query the API about the movie Hackers, Hugo's query string
was 'http://www.omdbapi.com/?t=hackers' and had a single argument t=hackers.
Note: recently, OMDB has changed their API: you now also have to specify an API
key. This means you'll have to add another argument to the URL: apikey=72bc447a.
Import the requests package.
Assign to the variable url the URL of interest in order to query
'http://www.omdbapi.com' for the data corresponding to the movie The Social
Network. The query string should have two arguments: apikey=72bc447a and
t=the+social+network. You can combine them as follows:
apikey=72bc447a&t=the+social+network.
Print the text of the response object r by using its text attribute and passing the result to the print() function.
# Import requests package
import requests
# Assign URL to variable: url
url = 'http://www.omdbapi.com/?apikey=72bc447a&t=the+social+network'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Print the text of the response
print(r.text)
JSON–from the web to Python
You've just queried your first API programmatically in Python and printed the text
of the response to the shell. However, as you know, your response is actually a
JSON, so you can do one step better and decode the JSON. You can then print the
key-value pairs of the resulting dictionary. That's what you're going to do now!
Pass the variable url to the requests.get() function in order to send the relevant
request and catch the response, assigning the resultant response message to the
variable r.
Apply the json() method to the response object r and store the resulting dictionary
in the variable json_data.
Hit Submit Answer to print the key-value pairs of the dictionary json_data to the
shell.
# Import package
import requests
# Assign URL to variable: url
url = 'http://www.omdbapi.com/?apikey=72bc447a&t=social+network'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Decode the JSON data into a dictionary: json_data
json_data = r.json()
# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])
Checking out the Wikipedia API
You're doing so well and having so much fun that we're going to throw one more API
at you: the Wikipedia API (documented here). You'll figure out how to find and
extract information from the Wikipedia page for Pizza. What gets a bit wild here is
that your query will return nested JSONs, that is, JSONs with JSONs, but Python can
handle that because it will translate them into dictionaries within dictionaries.
The URL that requests the relevant query from the Wikipedia API is 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza'.
Assign the relevant URL to the variable url.
Apply the json() method to the response object r and store the resulting dictionary
in the variable json_data.
The variable pizza_extract holds the HTML of an extract from Wikipedia's Pizza page
as a string; use the function print() to print this string to the shell.
# Import package
import requests
# Assign URL to variable: url
url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Decode the JSON data into a dictionary: json_data
json_data = r.json()
# Print the Wikipedia page extract
pizza_extract = json_data['query']['pages']['24768']['extract']
print(pizza_extract)
API Authentication
The package tweepy is great at handling all the Twitter API OAuth Authentication
details for you. All you need to do is pass it your authentication credentials. In
this interactive exercise, we have created some mock authentication credentials (if
you wanted to replicate this at home, you would need to create a Twitter App as
Hugo detailed in the video). Your task is to pass these credentials to tweepy's
OAuth handler.
Import the package tweepy.
Pass the parameters consumer_key and consumer_secret to the function
tweepy.OAuthHandler().
Complete the passing of OAuth credentials to the OAuth handler auth by applying to
it the method set_access_token(), along with arguments access_token and
access_token_secret.
# Import package
import tweepy
# Store OAuth authentication credentials in relevant variables
access_token = "1092294848-aHN7DcRP9B4VMTQIhwqOYiB14YkW92fFO8k8EPy"
access_token_secret = "X4dHmhPfaksHcQ7SCbmZa2oYBBVSD2g8uIHXsp5CTaksx"
consumer_key = "nZ6EA0FxZ293SxGNg8g8aP0HM"
consumer_secret = "fJGEodwe3KiKUnsYJC3VRndj7jevVvXbK2D5EiJ2nehafRgA6i"
# Pass OAuth details to tweepy's OAuth handler
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
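As a side note (not part of the exercise, and it would fail here because the credentials above are mock values), you could sanity-check the handler locally by wrapping it in an API client; a minimal sketch:
# Verify the OAuth handler against the Twitter API using real credentials
import tweepy
api = tweepy.API(auth)
print(api.verify_credentials().screen_name)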
Streaming tweets
Now that you have set up your authentication credentials, it is time to stream some
tweets! We have already defined the tweet stream listener class, MyStreamListener,
just as Hugo did in the introductory video. You can find the code for the tweet
stream listener class here.
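The listener class itself isn't reproduced in these notes. A minimal sketch of what it might look like, assuming tweepy's pre-4.0 StreamListener API and a local output file tweets.txt (illustrative, not necessarily Hugo's exact code):
import json
import tweepy
class MyStreamListener(tweepy.StreamListener):
    def __init__(self, api=None):
        super().__init__()
        self.num_tweets = 0
        self.file = open('tweets.txt', 'w')
    def on_status(self, status):
        # Write each incoming tweet as one JSON line; stop after 100 tweets
        self.file.write(json.dumps(status._json) + '\n')
        self.num_tweets += 1
        if self.num_tweets >= 100:
            self.file.close()
            return False
        return True
    def on_error(self, status_code):
        print(status_code)
        return False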
Your task is to create the Stream object and to filter tweets according to particular keywords.
Create your Stream object with authentication by passing tweepy.Stream() the
authentication handler auth and the Stream listener l;
To filter Twitter streams, pass to the track argument in stream.filter() a list
containing the desired keywords 'clinton', 'trump', 'sanders', and 'cruz'.
# Initialize Stream listener
l = MyStreamListener()
# Create your Stream object with authentication
stream = tweepy.Stream(auth, l)
# Filter Twitter Streams to capture data by the keywords:
stream.filter(track=['clinton', 'trump', 'sanders', 'cruz'])
Load and explore your Twitter data
Now that you've got your Twitter data sitting locally in a text file, it's time to
explore it! This is what you'll do in the next few interactive exercises. In this
exercise, you'll read the Twitter data into a list: tweets_data.
Assign the filename 'tweets.txt' to the variable tweets_data_path.
Initialize tweets_data as an empty list to store the tweets in.
Within the for loop initiated by for line in tweets_file:, load each tweet into a
variable, tweet, using json.loads(), then append tweet to tweets_data using the
append() method.
Hit submit and check out the keys of the first tweet dictionary printed to the
shell.
# Import package
import json
# String of path to file: tweets_data_path
tweets_data_path = 'tweets.txt'
# Initialize empty list to store tweets: tweets_data
tweets_data = []
# Open connection to file
tweets_file = open(tweets_data_path, "r")
# Read in tweets and store in list: tweets_data
for line in tweets_file:
    tweet = json.loads(line)
    tweets_data.append(tweet)
# Close connection to file
tweets_file.close()
# Print the keys of the first tweet dict
print(tweets_data[0].keys())
Twitter data to DataFrame
Now you have the Twitter data in a list of dictionaries, tweets_data, where each
dictionary corresponds to a single tweet. Next, you're going to extract the text
and language of each tweet. The text in a tweet, t1, is stored as the value
t1['text']; similarly, the language is stored in t1['lang']. Your task is to build
a DataFrame in which each row is a tweet and the columns are 'text' and 'lang'.
Instructions
100 XP
Use pd.DataFrame() to construct a DataFrame of tweet texts and languages; to do so,
the first argument should be tweets_data, a list of dictionaries. The second
argument to pd.DataFrame() is a list of the keys you wish to have as columns.
Assign the result of the pd.DataFrame() call to df.
Print the head of the DataFrame.
# Import package
import pandas as pd
# Build DataFrame of tweet texts and languages
df = pd.DataFrame(tweets_data, columns=['text', 'lang'])
# Print head of DataFrame
print(df.head())
A little bit of Twitter text analysis
Now that you have your DataFrame of tweets set up, you're going to do a bit of text
analysis to count how many tweets contain the words 'clinton', 'trump', 'sanders'
and 'cruz'. In the pre-exercise code, we have defined the following function
word_in_text(), which will tell you whether the first argument (a word) occurs
within the 2nd argument (a tweet).
import re
def word_in_text(word, text):
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)
    if match:
        return True
    return False
You're going to iterate over the rows of the DataFrame and calculate how many
tweets contain each of our keywords! The counters for each candidate (clinton, trump, sanders and cruz) have been initialized to 0.
Instructions
100 XP
Within the for loop for index, row in df.iterrows():, the code currently increases
the value of clinton by 1 each time a tweet (text row) mentioning 'Clinton' is
encountered; complete the code so that the same happens for trump, sanders and
cruz.
# Initialize list to store tweet counts
[clinton, trump, sanders, cruz] = [0, 0, 0, 0]
# Iterate through df, counting the number of tweets in which
# each candidate is mentioned
for index, row in df.iterrows():
    clinton += word_in_text('clinton', row['text'])
    trump += word_in_text('trump', row['text'])
    sanders += word_in_text('sanders', row['text'])
    cruz += word_in_text('cruz', row['text'])
Plotting your Twitter data
Now that you have the number of tweets that each candidate was mentioned in, you
can plot a bar chart of this data. You'll use the statistical data visualization
library seaborn, which you may not have seen before, but we'll guide you through.
You'll first import seaborn as sns. You'll then construct a barplot of the data
using sns.barplot, passing it two arguments:
a list of labels and
a list containing the variables you wish to plot (clinton, trump and so on).
Import both matplotlib.pyplot and seaborn using the aliases plt and sns,
respectively.
Complete the arguments of sns.barplot: the first argument should be the labels to
appear on the x-axis; the second argument should be the list of the variables you
wish to plot, as produced in the previous exercise.
# Import packages
import matplotlib.pyplot as plt
import seaborn as sns
# Set seaborn style
sns.set(color_codes=True)
# Create a list of labels: cd
cd = ['clinton', 'trump', 'sanders', 'cruz']
# Plot the bar chart
ax = sns.barplot(x=cd, y=[clinton, trump, sanders, cruz])
ax.set(ylabel="count")
plt.show()