lecture-week4
Getting the Data
In this lecture, we’ll look at finding, importing, and using different types of data.
Objectives
1. Use the Pandas read methods to import data into a DataFrame.
2. Download a file to disk before importing it into a DataFrame.
3. Unzip a zip file to access the files that it contains.
4. Use a SQL query to import database data into a DataFrame.
5. Use the metadata of a Stata file to analyze the data, and then read selected columns of the Stata data into a
DataFrame.
6. Drill down into the data in a JSON file that has more than two levels of data, convert the JSON file to a dictionary,
and then build a DataFrame from portions of the data in the dictionary.
What Kind of Data?
● Structured data
● Numerical data
● Categorical data
● Time series data
● Text data
● Spatial data
● Image data
● Audio data
Structured Data
• Our primary focus is on structured data:
• Tabular or spreadsheet-like data in which each column may be a different type (string, numeric, date, or
otherwise). This includes most kinds of data commonly stored in relational databases or tab- or
comma-separated text files.
• Multidimensional arrays (matrices).
• Multiple tables of data interrelated by key columns (primary or foreign keys for a SQL user).
• Evenly or unevenly spaced time series.
Structured Data (Cont.)
• Many datasets can be transformed into a structured form that is more suitable for analysis and
modeling.
• If not, it may be possible to extract features from a dataset into a structured form.
• E.g., a collection of news articles could be processed into a word frequency table, which could then be
used to perform sentiment analysis.
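The news-article example above can be sketched in a few lines. This is a minimal illustration using only the standard library; the two article snippets and the helper name `word_frequencies` are invented for the example, and a real pipeline would likely also remove stop words and normalize the text further.

```python
import re
from collections import Counter

# Hypothetical article snippets, invented for illustration only.
articles = {
    "article_1": "Markets rally as investors cheer strong earnings. Markets close higher.",
    "article_2": "Storm damage leaves thousands without power as repairs begin.",
}


def word_frequencies(text):
    """Count how often each lowercase word appears in a piece of text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# One entry per article, one count per word: a structured view of unstructured text.
frequency_table = {title: word_frequencies(text) for title, text in articles.items()}

for title, counts in frequency_table.items():
    print(title, counts.most_common(3))
```

A table like this could then be passed to a DataFrame constructor or to a downstream model.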
Numerical Data
• Consists of numbers, such as height, weight, or age.
• Can be continuous or discrete.
• Continuous numerical data can take any value within a certain range (e.g. temperature).
• Discrete numerical data can only take certain specific values (e.g. number of pets).
• Can be analyzed using regression analysis, correlation analysis, and hypothesis testing.
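As a small sketch of the distinction above, the following builds a DataFrame with two continuous columns and one discrete column, then runs a simple correlation analysis. The measurements and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical measurements, invented for illustration only.
people = pd.DataFrame({
    "height_cm": [160.0, 170.0, 180.0, 190.0],  # continuous: any value in a range
    "weight_kg": [55.0, 65.0, 75.0, 85.0],      # continuous
    "num_pets": [0, 2, 1, 3],                   # discrete: whole counts only
})

# Continuous columns are stored as floats, discrete counts as integers.
print(people.dtypes)

# Pearson correlation between two numerical columns.
correlation = people["height_cm"].corr(people["weight_kg"])
print(f"Correlation between height and weight: {correlation:.2f}")
```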
Categorical Data
• Consists of categories or groups, such as gender, race, or occupation. This data is often represented
using labels or strings.
• Includes categories or labels that describe certain characteristics or attributes of an object or event.
Categorical data can be further divided into two groups:
• Nominal data (no order or ranking among categories, e.g., colors) and
• Ordinal data (categories can be ranked or ordered, e.g., education level).
• Categorical data can be analyzed using contingency table analysis, logistic regression, and chi-squared
tests.
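The nominal and ordinal cases can be represented in Pandas with the categorical dtype. The sketch below uses invented values; the column names and the ordering of education levels are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical survey answers, invented for illustration only.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "education": ["BSc", "High school", "PhD", "BSc"],
})

# Nominal data: categories with no order among them.
df["color"] = df["color"].astype("category")

# Ordinal data: categories with an explicit ranking (assumed order for this example).
levels = ["High school", "BSc", "MSc", "PhD"]
df["education"] = pd.Categorical(df["education"], categories=levels, ordered=True)

# Ordered categories support comparisons and min/max; unordered ones do not.
highest = df["education"].max()
at_least_bsc = df[df["education"] >= "BSc"]

print(highest)
print(at_least_bsc)
```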
Movie Time
☺
https://www.foreseemed.com/natural-language-processing-in-healthcare
https://industrywired.com/the-era-of-computer-vision-is-here/
Audio Data
• Includes sound or voice recordings, and can be used in applications such as speech recognition and music analysis.
• Audio data can be analyzed using techniques such as signal processing and Fourier analysis.
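To give a feel for Fourier analysis, here is a minimal sketch that uses NumPy (an assumption; the slides do not name a library) to find the dominant frequency of a synthetic tone. The sample rate and the 440 Hz tone are arbitrary choices standing in for a real recording.

```python
import numpy as np

# Synthetic one-second "recording": a pure 440 Hz sine wave (assumed values).
sample_rate = 8000  # samples per second
duration = 1.0      # seconds
t = np.arange(int(sample_rate * duration)) / sample_rate
signal = np.sin(2 * np.pi * 440.0 * t)

# The real-input FFT gives the strength of each frequency component.
spectrum = np.abs(np.fft.rfft(signal))
frequencies = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)

# The strongest component should sit at the tone's frequency.
dominant_frequency = frequencies[np.argmax(spectrum)]
print(f"Dominant frequency: {dominant_frequency:.1f} Hz")
```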
Finding the Data
Common Sources of Data
• Internal datasets and databases: This can include everything from departmental
spreadsheets to any of the databases used by a corporation.
• Third-party websites: This includes the hundreds of websites that let you download data
for your own analysis.
• Kaggle: https://www.kaggle.com/datasets
• Google Dataset Search: https://datasetsearch.research.google.com/
• Registry of Open Data on AWS: https://registry.opendata.aws
Getting the Data via APIs
• On some sites, you will need to use an API (Application Programming Interface) to get the data that you want. It works like a Data Plug …
• Depending on the API, you may need to provide an API key or authenticate yourself in some
way before you can access the data.
• It requires some programming knowledge and skill, as well as an understanding of the specific
API you are working with.
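The general shape of an authenticated API call can be sketched with the standard library. Everything specific here is hypothetical: the endpoint `https://api.example.com/v1/items`, the query parameters, and the bearer-token header are placeholders, since each real API documents its own URL scheme and authentication method.

```python
import json
import urllib.parse
import urllib.request


def build_api_request(base_url, params, api_key):
    """Build a GET request with query parameters and an API key header."""
    query = urllib.parse.urlencode(params)
    request = urllib.request.Request(f"{base_url}?{query}")
    # Assumed scheme: many APIs expect the key in an Authorization header.
    request.add_header("Authorization", f"Bearer {api_key}")
    return request


def fetch_json(request):
    """Send the request and decode the JSON response into Python objects."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical endpoint and key; replace with values from the API's documentation.
request = build_api_request(
    "https://api.example.com/v1/items",
    {"q": "pandas", "limit": 10},
    "YOUR_API_KEY",
)
print(request.full_url)
# data = fetch_json(request)  # uncomment once a real endpoint and key are in place
```

Building the request separately from sending it makes it easy to inspect the URL and headers before any network traffic happens.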
Getting the Data via APIs - Twitter API
Getting the Data via APIs - YouTube API
Getting the Data via APIs - YouTube API (Cont.)
Importing the Data into a DataFrame
Direct Import vs Download
• The Pandas read methods work best on tabular files like CSV (comma-separated values) and Excel files, but they will fail on files with complex or nested data.
Importing a CSV file from a Website
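A minimal sketch of a direct CSV import follows. `pd.read_csv()` accepts a URL, a local path, or a file-like object; the helper name `import_csv` and the sample rows are invented, and an in-memory buffer stands in for the website so the example runs without a network connection.

```python
import io

import pandas as pd


def import_csv(source):
    """Read CSV data from a URL, a local path, or a file-like object."""
    return pd.read_csv(source)


# Stand-in for a remote file (made-up rows); with a real site you would pass
# the URL string instead, e.g. import_csv("https://.../data.csv").
sample = io.StringIO("state,year,cases\nCA,2020,10\nNY,2021,20\n")
df = import_csv(sample)

print(df.head())
print(df.dtypes)
```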
Importing the First Sheet of a Downloaded Excel File
Downloading a File using the urlretrieve() Method of the urllib.request Module
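The download step might look like the sketch below. `urlretrieve()` copies the resource at a URL to a local file; the helper name `download_file` and the file names are assumptions, and a `file://` URL pointing at a temporary file stands in for a real web address so the example runs offline.

```python
import tempfile
from pathlib import Path
from urllib import request


def download_file(url, destination):
    """Download the resource at url and save it to destination."""
    local_path, _headers = request.urlretrieve(url, filename=destination)
    return Path(local_path)


# Create a small local file to act as the "remote" resource (made-up content).
work_dir = Path(tempfile.mkdtemp())
source = work_dir / "source.csv"
source.write_text("id,value\n1,10\n2,20\n", encoding="utf-8")

# With a real site, pass its https:// URL instead of source.as_uri().
downloaded = download_file(source.as_uri(), work_dir / "downloaded.csv")
print(downloaded.read_text(encoding="utf-8"))
```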
Download and Extract a zip file
Download and Extract a zip file (Cont.)
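Once a zip file is on disk, the standard `zipfile` module can list and extract its contents. In this sketch the archive is created locally with made-up files so the example is self-contained; in practice the zip file would come from a download step, and `extract_zip` is a hypothetical helper name.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_zip(zip_path, target_dir):
    """Extract every file in the archive and return the list of member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Build a small archive to stand in for a downloaded zip file (made-up content).
work_dir = Path(tempfile.mkdtemp())
zip_path = work_dir / "data.zip"
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.writestr("cars.csv", "make,price\nA,100\nB,200\n")
    archive.writestr("notes.txt", "sample notes")

names = extract_zip(zip_path, work_dir / "extracted")
print(names)
```

The extracted CSV file can then be passed to a Pandas read method like any other local file.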
Getting Database Data into a DataFrame
Running Queries Against a Database
sqlite_master is a special table in SQLite that contains information about the structure of a database. It is
created automatically when a new database is created, and it contains one row for each schema object (table, index, view, or trigger).
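A short sketch of querying sqlite_master follows. It uses an in-memory database with a made-up `fires` table and index so it runs without any database file; with a real database you would pass its path to `sqlite3.connect()`.

```python
import sqlite3

# In-memory database with a hypothetical table and index, for illustration only.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE fires (id INTEGER PRIMARY KEY, state TEXT, acres REAL)"
)
connection.execute("CREATE INDEX idx_fires_state ON fires (state)")

# sqlite_master lists each schema object along with the SQL that created it.
rows = connection.execute(
    "SELECT type, name FROM sqlite_master ORDER BY type, name"
).fetchall()
connection.close()

for object_type, name in rows:
    print(object_type, name)
```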
Get the Table Information
Get the Table Information (cont.)
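One way to get table information and then import rows into a DataFrame is sketched below, using `pd.read_sql_query()` with a `sqlite3` connection. The `fires` table and its rows are invented, and `PRAGMA table_info` is one SQLite-specific option for listing columns, not necessarily the approach shown on the original slides.

```python
import sqlite3

import pandas as pd

# In-memory database with a hypothetical table, for illustration only.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE fires (id INTEGER PRIMARY KEY, state TEXT, acres REAL)"
)
connection.executemany(
    "INSERT INTO fires (state, acres) VALUES (?, ?)",
    [("CA", 120.5), ("OR", 80.0), ("CA", 300.0)],
)
connection.commit()

# Column names and types for one table.
table_info = pd.read_sql_query("PRAGMA table_info(fires)", connection)
print(table_info[["name", "type"]])

# Import only the rows and columns of interest, using a parameterized query.
fires = pd.read_sql_query(
    "SELECT state, acres FROM fires WHERE state = ?", connection, params=("CA",)
)
connection.close()
print(fires)
```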
• Most of the time, though, a JSON file has two or more levels of nesting, so it isn’t tabular.
• The data has to be downloaded to disk and read into a
Python dictionary before it can be used to build a DataFrame
object.
Downloading a JSON File
Build a DataFrame from the JSON File
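The drill-down approach can be sketched as follows: load the JSON file into a dictionary, inspect its levels, and build a DataFrame from the portion that is record-like. The nested structure, key names, and values are invented for the example, and the file is written locally first so the sketch runs without a download.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical nested data, invented for illustration only.
nested = {
    "meta": {"source": "example", "version": 1},
    "data": {
        "stations": [
            {"name": "A", "readings": {"temp_c": 21.5, "humidity": 40}},
            {"name": "B", "readings": {"temp_c": 18.0, "humidity": 55}},
        ]
    },
}
json_path = Path(tempfile.mkdtemp()) / "stations.json"
json_path.write_text(json.dumps(nested), encoding="utf-8")

# Step 1: read the file on disk into a Python dictionary.
with open(json_path, encoding="utf-8") as json_file:
    data = json.load(json_file)

# Step 2: drill down one level at a time to find the tabular portion.
print(list(data.keys()))
print(list(data["data"].keys()))
stations = data["data"]["stations"]

# Step 3: flatten each record and build the DataFrame.
df = pd.DataFrame([{"name": s["name"], **s["readings"]} for s in stations])
print(df)
```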
End of lecture …