Importing data in python

Uploaded by saadia

Second Course:

Importing data in Python

In this course we will learn to import data from a large variety of sources, for example:
(i) flat files such as .txt and .csv files;
(ii) files native to other software, such as Excel spreadsheets and Stata, SAS and MATLAB files.

First off, we're going to learn how to import basic text files, which we can broadly classify into 2 types:
1. those containing plain text, such as the opening of Mark Twain's novel The Adventures of Huckleberry Finn;
2. those containing table data, in which each row is a record and each column is a characteristic or feature, such as gender, cabin and 'survived or not'. The latter type is known as a flat file.

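To read a plain text file, first open a connection to the file. To do so, assign the filename to a variable as a string, then pass the filename to the function open, along with the argument mode equals 'r', which opens the file for reading only.

Next, assign the text of the file to a variable text by applying the method read to the file object, and close the connection once you are done.

Now print text to check the contents.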
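
The steps above can be sketched as follows. This is a minimal illustration, not the course's own code: the file name sample.txt and its contents are hypothetical, and the file is created in a temporary directory only so that the example runs on its own.

```python
import os
import tempfile

# Hypothetical sample file, written here only so the sketch is self-contained.
filename = os.path.join(tempfile.mkdtemp(), "sample.txt")
setup = open(filename, mode="w")
setup.write("First line of plain text.\nSecond line of plain text.\n")
setup.close()

# Open a read-only connection to the file.
file = open(filename, mode="r")

# Read the whole file into a single string.
text = file.read()

# Close the connection once the text has been read.
file.close()

# Check the contents, and confirm that the connection is closed.
print(text)
print(file.closed)
```

Note that the connection has to be closed by hand here, which is easy to forget.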

It is good to know how to write data to a file, but we will not use it in this course.

You can avoid having to close the connection to the file by using a with statement, also called a context manager.

With a statement such as with open(filename, 'r') as file, you are 'binding' a variable in the context manager construct: while still within this construct, the variable file will be bound to open(filename, 'r'). It is best practice to use the with statement, as you never have to concern yourself with closing the files again.

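
A minimal sketch of the with statement, under the same assumptions as before: sample.txt and its contents are hypothetical and exist only to make the example runnable.

```python
import os
import tempfile

# Hypothetical sample file, written here only so the sketch is self-contained.
filename = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(filename, "w") as out:
    out.write("First line of plain text.\nSecond line of plain text.\n")

# Inside the with block, the variable file is bound to the open connection.
with open(filename, "r") as file:
    text = file.read()
    still_open = not file.closed

# After the block ends, the connection has been closed automatically.
print(text)
print(still_open)
print(file.closed)
```

The file is closed as soon as the indented block finishes, even if an error occurs inside it.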
The importance of flat files in data science

Flat Files:
Flat files are basic text files containing table data. In 'titanic.csv', for example, each row or record is a unique passenger onboard, and each column is a feature or attribute, such as name, gender and cabin.

It is also essential to note that a flat file can have a header, such as in 'titanic.csv': a first row that names the columns. It will be important to know whether or not your file has a header, as it may alter your data import.

File extension:

The extension .csv stands for comma-separated values: the values in each row are separated by commas. Another common extension for a flat file is .txt, which means a text file. Values in flat files can be separated by characters or sequences of characters other than commas, such as a tab, and the separating character or sequence is called a delimiter.
One example of a tab-delimited file is the famous MNIST digit recognition data, where each row contains the pixel values of a given image. Note that all fields in the MNIST data are numeric, while 'titanic.csv' also contains strings.
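
To make the idea of a delimiter concrete, here is a small sketch that splits lines by hand. The records are made up for illustration and are not taken from the real Titanic or MNIST files.

```python
# Hypothetical comma-delimited lines with a header row (strings and names).
csv_lines = ["name,gender,cabin", "Passenger A,female,C85"]

# Hypothetical tab-delimited lines with numeric fields only.
tab_lines = ["0\t128\t255", "12\t0\t64"]

# Split each comma-delimited line on the comma.
header = csv_lines[0].split(",")
record = csv_lines[1].split(",")

# Split each tab-delimited line on the tab and convert the fields to ints.
pixels = [[int(value) for value in line.split("\t")] for line in tab_lines]

print(header)
print(record)
print(pixels)
```

Real CSV files can quote fields that themselves contain commas, which a plain split does not handle. That is one reason to rely on a dedicated import function rather than splitting by hand.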

How we import flat files depends on what they contain and how we want to store them. If they consist entirely of numbers and we want to store them as a numpy array, we could use numpy. If, instead, we want to store the data in a dataframe, we could use pandas.

In the rest of this chapter, you'll learn how to import flat files that contain only numerical data, such as the MNIST data, and flat files that contain both numerical data and strings, such as 'titanic.csv'.
Importing flat files using NumPy
What if you want to import a flat file and assign it to a variable? If all the data are numerical, you can use the package numpy to import the data as a numpy array.

Why NumPy?

NumPy arrays are often essential for other packages, such as scikit-learn, a popular machine learning package for Python. NumPy itself has a number of built-in functions that make it far easier and more efficient for us to import data as arrays.

Enter the NumPy functions:
- loadtxt
- genfromtxt

To use either of these, we first need to import NumPy. We then call loadtxt and pass it the filename as the first argument, along with the delimiter as the second argument. Note that the default delimiter is any white space, so we'll usually need to specify it explicitly.
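
A minimal sketch of a loadtxt call. The file digits.csv and its values are hypothetical and are written to a temporary directory only so that the example is runnable.

```python
import os
import tempfile

import numpy as np

# Hypothetical comma-separated file containing only numbers.
filename = os.path.join(tempfile.mkdtemp(), "digits.csv")
with open(filename, "w") as out:
    out.write("0,128,255\n12,0,64\n")

# Filename first, then the delimiter, since the default is white space.
data = np.loadtxt(filename, delimiter=",")

print(data)
print(data.shape)
print(data.dtype)
```

By default loadtxt stores every entry as a float, so the result here is a 2 by 3 array of floats.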
If you only want certain columns, for example the first and third, set usecols equal to the list containing the ints 0 and 2.
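
A short sketch of usecols. The data is made up, and an in-memory io.StringIO object stands in for a file on disk. Because this hypothetical data starts with a header row of strings, the sketch also passes skiprows=1, a loadtxt argument that the text above does not cover.

```python
import io

import numpy as np

# Hypothetical data: one header row, then two numeric rows.
source = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")

# Skip the header row and keep only the first and third columns.
data = np.loadtxt(source, delimiter=",", skiprows=1, usecols=[0, 2])

print(data)
```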

You can also import different datatypes into NumPy arrays: for example, setting the argument dtype equals 'str' will ensure that all entries are imported as strings.

This is useful when the data is mixed, for example a table containing both strings and floats.
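
A sketch of importing mixed data as strings. The tab-delimited table below is hypothetical, and io.StringIO again stands in for a file on disk.

```python
import io

import numpy as np

# Hypothetical tab-delimited table mixing strings and floats.
source = io.StringIO("name\tfare\nPassenger A\t7.25\nPassenger B\t71.5\n")

# Import every entry, including the header row, as a string.
data = np.loadtxt(source, delimiter="\t", dtype=str)
print(data)

# The numeric column can be converted afterwards if needed.
fares = data[1:, 1].astype(float)
print(fares)
```

Every entry ends up as a string, numbers included, so this is workable but clumsy for mixed tables.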
Importing flat files using pandas
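The need for a data structure that can hold columns of different types, such as strings and floats, prompted Wes McKinney to develop the pandas library for Python.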

Nothing speaks to the project of pandas more than the documentation itself.

As Hadley Wickham tweeted, "A matrix has rows and columns. A data frame has observations and variables."

For all of these reasons, it is now standard and best practice in data science to use pandas to import flat files as DataFrames.

To use pandas, you first need to import it. Then, if we wish to import a CSV, in the most basic case all we need to do is call the function read_csv() and supply it with a single argument, the name of the file. Having assigned the DataFrame to the variable data, we can check the first 5 rows of the DataFrame, including the header, with the command data.head().
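
A minimal sketch of the basic case. The file passengers.csv and its rows are hypothetical stand-ins for a file such as 'titanic.csv', written to a temporary directory so that the example runs on its own.

```python
import os
import tempfile

import pandas as pd

# Hypothetical CSV with a header row and six records.
filename = os.path.join(tempfile.mkdtemp(), "passengers.csv")
with open(filename, "w") as out:
    out.write(
        "name,gender,fare\n"
        "Passenger A,female,7.25\n"
        "Passenger B,male,71.5\n"
        "Passenger C,female,8.05\n"
        "Passenger D,male,53.1\n"
        "Passenger E,female,8.46\n"
        "Passenger F,male,51.86\n"
    )

# Import the file as a DataFrame; the first row is used as the header.
data = pd.read_csv(filename)

# Show the first 5 rows together with the column names.
print(data.head())
```

Unlike a NumPy array, the DataFrame keeps the name and gender columns as strings and the fare column as floats.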
