
Getting the Data

SIT112 | Data Science Concepts


Lecture Week 4
As a data scientist you will spend a large fraction of your time acquiring, cleaning, and transforming data. In this lecture, we’ll look at finding, importing, and using different types of data.
Objectives
1. Use the Pandas read methods to import data into a DataFrame.
2. Download a file to disk before importing it into a DataFrame.
3. Unzip a zip file to access the files that it contains.
4. Use a SQL query to import database data into a DataFrame.
5. Use the metadata of a Stata file to analyze the data, and then read selected columns of the Stata data into a
DataFrame.
6. Drill down into the data in a JSON file that has more than two levels of data, convert the JSON file to a dictionary,
and then build a DataFrame from portions of the data in the dictionary.
What Kind of Data?
● Structured data
● Numerical data
● Categorical data
● Time series data
● Text data
● Spatial data
● Image data
● Audio data
Structured Data
• Our primary focus is on structured data:
• Tabular or spreadsheet-like data in which each column may be a different type (string, numeric, date, or
otherwise). This includes most kinds of data commonly stored in relational databases or tab- or
comma-separated text files.
• Multidimensional arrays (matrices).
• Multiple tables of data interrelated by key columns (primary or foreign keys for a SQL user).
• Evenly or unevenly spaced time series.
Structured Data (Cont.)
• Many datasets can be transformed into a structured form that is more suitable for analysis and
modeling.
• If not, it may be possible to extract features from a dataset into a structured form.
• E.g., a collection of news articles could be processed into a word frequency table, which could then be
used to perform sentiment analysis.
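
As a rough illustration of the news-article example above, the sketch below builds a word-frequency table with Pandas; the article strings are invented for illustration:

import pandas as pd

# Two made-up news snippets standing in for a collection of articles.
articles = [
    "Markets rally as tech stocks surge",
    "Tech stocks slide after weak earnings",
]

# Split all articles into lowercase words and count how often each word occurs.
words = pd.Series(" ".join(articles).lower().split())
word_freq = words.value_counts().rename_axis("word").reset_index(name="count")

print(word_freq.head())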
Numerical Data
• Consists of numbers, such as height, weight, or age.
• Can be continuous or discrete.
• Continuous numerical data can take any value within a certain range (e.g., temperature).
• Discrete numerical data can only take certain specific values (e.g., number of pets).

• Can be analyzed using regression analysis, correlation analysis, and hypothesis testing.
Categorical Data
• Consists of categories or groups, such as gender, race, or occupation. This data is often represented
using labels or strings.
• Includes categories or labels that describe certain characteristics or attributes of an object or event.
Categorical data can be further divided into two groups:
• Nominal data (no order or ranking among categories, e.g., colors) and
• Ordinal data (categories can be ranked or ordered, e.g., education level).

• Categorical data can be analyzed using contingency table analysis, logistic regression, and chi-squared
tests.
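
A small, hedged illustration of contingency-table analysis with Pandas (the survey data below is invented); pd.crosstab() tabulates two categorical variables against each other:

import pandas as pd

# Invented survey responses with two categorical variables.
df = pd.DataFrame({
    "education": ["High school", "Bachelor", "Bachelor", "Master", "High school"],
    "employed":  ["Yes", "Yes", "No", "Yes", "No"],
})

# Build a contingency table counting how often each combination occurs.
table = pd.crosstab(df["education"], df["employed"])
print(table)

# A chi-squared test of independence could then be run on this table,
# for example with scipy.stats.chi2_contingency(table).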
Movie Time

Working with contingency tables


Movie Time

What is logistic regression?


Time Series Data
• Data that is collected over time at regular or irregular intervals, such as stock prices or weather measurements.
• Often used to analyze trends, patterns, and seasonality in the data.
• Time series data can be analyzed using moving averages, trend analysis, and seasonal decomposition (see the sketch below).
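
A minimal sketch of the moving-average idea, using invented daily values; a 7-day rolling mean smooths out short-term fluctuations so the underlying trend is easier to see:

import numpy as np
import pandas as pd

# Invented daily time series: a random walk over 60 days.
dates = pd.date_range("2024-01-01", periods=60, freq="D")
values = pd.Series(np.random.randn(60).cumsum(), index=dates)

# 7-day moving average.
smoothed = values.rolling(window=7).mean()
print(smoothed.tail())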
What is Seasonality?

Seasonality refers to the pattern of regular and predictable fluctuations in time series data that occur at regular intervals or over a specific period of time, such as a year, a month, a week, or even a day. These fluctuations are usually caused by a combination of factors, such as weather patterns, holidays, cultural events, and business cycles.

Seasonality is commonly observed in various types of data, including sales figures, customer traffic, website traffic, and stock prices. By identifying and understanding the patterns of seasonality in the data, businesses and analysts can better predict future trends and make more informed decisions about planning, marketing, and operations.
Text Data
• Includes unstructured text, such as customer
reviews or social media posts.
• Can be analyzed using natural language processing
techniques.
• E.g., Medical NLP

https://www.foreseemed.com/natural-language-processing-in-healthcare



Spatial Data
• Includes geographic or location-based
data, such as GPS coordinates or
satellite imagery.
• Spatial data can be analyzed using
techniques such as geographic
information systems (GIS) and spatial
statistics.

Solving problems with spatial statistics


Image Data

• Includes digital images or videos, and can be used in applications such as computer vision and image recognition.
• Image data can be analyzed using techniques such as deep learning.

https://industrywired.com/the-era-of-computer-vision-is-here/
Audio Data
• Includes sound or voice
recordings, and can be used
in applications such as
speech recognition and
music analysis.
• Audio data can be analyzed
using techniques such as
signal processing and
Fourier analysis.
Finding the Data
Common Sources of Data
• Internal datasets and databases: This can include everything from departmental
spreadsheets to any of the databases used by a corporation.

• Third-party websites: This includes the hundreds of websites that let you download data
for your own analysis.
• Kaggle: https://www.kaggle.com/datasets
• Google Dataset Search: https://datasetsearch.research.google.com/
• Registry of Open Data on AWS: https://registry.opendata.aws
Getting the Data via APIs
• On some sites, you will need to use an API (Application Programming Interface) to get the data that you want. It works like a data plug.

• An API is a set of protocols, routines, and tools that specify how to interact with the data provider.

• Some websites (e.g., Twitter and YouTube) may provide public APIs that allow you to access and use their data in a controlled manner.
Getting the Data via APIs
• You need to use programming to make a request to the API and retrieve the data you need (a minimal request sketch follows below).

• Depending on the API, you may need to provide an API key or authenticate yourself in some way before you can access the data.

• Working with an API requires some programming knowledge and skill, as well as an understanding of the specific API you are working with.
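
A minimal sketch of this workflow with the requests library, assuming a hypothetical JSON API; the URL, parameters, and API key are placeholders, not a real endpoint:

import pandas as pd
import requests

url = "https://api.example.com/v1/items"            # hypothetical endpoint
params = {"limit": 100}                             # query parameters
headers = {"Authorization": "Bearer YOUR_API_KEY"}  # many APIs require a key

response = requests.get(url, params=params, headers=headers)
response.raise_for_status()                         # stop if the request failed

# Assuming the API returns a JSON list of records, load it into a DataFrame.
data = response.json()
df = pd.DataFrame(data)
print(df.head())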
Getting the Data via APIs - Twitter API
Getting the Data via APIs - YouTube API
Getting the Data via APIs - YouTube API (Cont.)
Importing the Data into a DataFrame
Direct Import vs Download

In many cases, you will be able to import data from a website directly into a DataFrame. Sometimes, you will have to download the data to your computer and/or unzip a file before you can import the data.
Common File Formats for Data
Type Extension Description Contents
CSV .csv Comma-separated values One table
TSV .tsv Tab-separated values One table
Excel .xlsx, .xls Excel spreadsheet One or more sheets
Stata .dta Stata statistical package Complex data
JSON .json JavaScript Object Notation Nested data
XML .xml Extensible Markup Language Nested data
SAS .sd7, .sd6 SAS statistical package Complex data
SPSS .sav SPSS statistical package Complex data
HDF5 .h5 Structured format for large datasets Complex data
Zip .zip Archive format One or more files
Pandas Methods for Importing Data into a DataFrame
Pandas Methods for Importing Data into a DataFrame (Cont.)
Importing Data into a DataFrame using Pandas Methods
• Pandas read methods only work when the data is in a tabular form; if the data isn’t tabular, the read method will throw an error.
• E.g., the read_json() method will fail if the JSON file has more than one level of nesting.

• As a result, Pandas read methods work best on tabular files like CSV (comma-separated values) and Excel files, but they will fail on files with complex or nested data.
Importing a CSV file from a Website
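
The original slide shows a code screenshot that isn’t reproduced here; a minimal sketch of the idea, with a placeholder URL:

import pandas as pd

# Pandas can read a CSV file directly from a URL (placeholder address).
url = "https://example.com/data/sales.csv"
df = pd.read_csv(url)
print(df.head())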
Importing the First Sheet of a Downloaded Excel File
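
Again, the slide’s code screenshot isn’t reproduced here; a minimal sketch with a placeholder filename (reading an Excel file requires an engine such as openpyxl):

import pandas as pd

# sheet_name=0 selects the first sheet (it is also the default).
df = pd.read_excel("downloads/report.xlsx", sheet_name=0)
print(df.head())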
Downloading a File using the urlretrieve() Method of the urllib.request Module
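
A minimal sketch of downloading a file to disk with urlretrieve() before importing it; the URL and local filename are placeholders:

from urllib.request import urlretrieve

import pandas as pd

url = "https://example.com/data/survey.csv"   # placeholder file location
local_path = "survey.csv"

urlretrieve(url, local_path)   # save the file to the working directory
df = pd.read_csv(local_path)   # then import it as usual
print(df.shape)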
Download and Extract a zip file
Download and Extract a zip file (Cont.)
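
A minimal sketch of downloading a zip archive and extracting its contents with the standard-library zipfile module; the URL and file names are placeholders:

import zipfile
from urllib.request import urlretrieve

import pandas as pd

urlretrieve("https://example.com/data/archive.zip", "archive.zip")

with zipfile.ZipFile("archive.zip") as z:
    print(z.namelist())              # list the files inside the archive
    z.extractall("archive_files")    # extract everything to a folder

# Then import one of the extracted files, assuming it is a CSV.
df = pd.read_csv("archive_files/data.csv")
print(df.head())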
Getting Database Data into a DataFrame
Running Queries Against a Database

sqlite_master is a special table in SQLite that contains information about the structure of a database. It is created automatically when a new database is created, and it contains one row for each table, index, view, and trigger in the database.
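
A minimal sketch of querying sqlite_master with the standard-library sqlite3 module and Pandas; the database filename is a placeholder:

import sqlite3

import pandas as pd

con = sqlite3.connect("example.db")   # placeholder database file

# List the tables defined in the database.
tables = pd.read_sql_query(
    "SELECT name, type FROM sqlite_master WHERE type = 'table'", con)
print(tables)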
Get the Table Information
Get the Table Information (cont.)

Each row contains information about a single column, including:


• cid: the column ID (a sequential integer starting from 0)
• name: the column name
• type: the column data type (e.g. INTEGER, REAL, TEXT)
• notnull: a flag indicating whether the column is defined as NOT NULL
• dflt_value: the default value of the column (if any)
• pk: a flag indicating whether the column is part of the primary key for the table
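
The columns listed above match what SQLite’s PRAGMA table_info statement returns, which is presumably what the slide uses; a minimal sketch, assuming the same placeholder database and a hypothetical table named orders:

import sqlite3

import pandas as pd

con = sqlite3.connect("example.db")

# One row per column of the (hypothetical) orders table:
# cid, name, type, notnull, dflt_value, pk.
columns = pd.read_sql_query("PRAGMA table_info(orders)", con)
print(columns)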
Import the Data from a Query into a DataFrame
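
A minimal sketch of importing query results into a DataFrame; the table and column names are placeholders:

import sqlite3

import pandas as pd

con = sqlite3.connect("example.db")

query = """
    SELECT customer_id, order_date, total
    FROM orders
    WHERE total > 100
"""
df = pd.read_sql_query(query, con)
print(df.head())

con.close()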
Get and Explore the Metadata of a Stata file
Get metadata from a Stata file
Build a DataFrame for the Column Descriptions in the Metadata
Import the Columns of the Data into a DataFrame
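
A minimal sketch of the Stata workflow described above; the filename and column names are placeholders, and the exact metadata available depends on the file and the Pandas version:

import pandas as pd

# Open the Stata file with an iterator so the metadata can be inspected
# without loading all of the data.
with pd.read_stata("survey.dta", iterator=True) as reader:
    labels = reader.variable_labels()   # {column name: description}

# Build a DataFrame for the column descriptions in the metadata.
meta = pd.DataFrame(list(labels.items()), columns=["column", "description"])
print(meta.head())

# Read only the columns of interest into a DataFrame (placeholder names).
df = pd.read_stata("survey.dta", columns=["age", "income"])
print(df.head())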
Working with a JSON File
Multi-Level JSON Files

• If a JSON file has just one level of nesting, it is tabular, so it can be imported by the Pandas read_json() method.

• Most of the time, though, a JSON file has two or more levels of nesting, so it isn’t tabular.
• The data has to be downloaded to disk and read into a Python dictionary before it can be used to build a DataFrame object.
Downloading a JSON File
Build a DataFrame from the JSON File
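
A minimal sketch of that workflow; the URL and the key names used to drill down ("results", "records") are placeholders that depend on the actual structure of the JSON file:

import json
from urllib.request import urlretrieve

import pandas as pd

# Download the JSON file to disk (placeholder URL).
urlretrieve("https://example.com/data/nested.json", "nested.json")

# Read the file into a nested Python dictionary.
with open("nested.json") as f:
    data = json.load(f)

# Drill down to the portion of the dictionary that holds the records of
# interest (placeholder keys), then build a DataFrame from it.
records = data["results"]["records"]
df = pd.DataFrame(records)
print(df.head())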
References
• Data Science from Scratch: First Principles with Python, Joel Grus, O'Reilly Media, 2019.
• Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, Wes McKinney, O'Reilly Media, 3rd edition, 2022.
• Python Data Science Handbook: Essential Tools for Working with Data, Jake VanderPlas, O'Reilly Media, 2022.
• Murach's Python for Data Analysis, Scott McCoy, Mike Murach & Associates, Incorporated, 2021.
• Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and The Cloud, Paul Deitel, Pearson Education Limited, 2021.
• ChatGPT
End of lecture …
