0% found this document useful (0 votes)
4 views

Lecture 2 - Collecting, Analyzing, and Visualizing Data with Python Part I

Uploaded by

Werd We
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views

Lecture 2 - Collecting, Analyzing, and Visualizing Data with Python Part I

Uploaded by

Werd We
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

COLLECTING, ANALYZING, AND VISUALIZING DATA WITH PYTHON PART I

DR. MICHAEL FIRE


Collecting Data
There several ways to collect data:
• Using existing datasets
• Create/Simulate your own dataset
• Using Web scraping
• Using API
Web Scraping
We can collect data using web scraping using one
of the following methods:
• Using simple tools like wget
• Using Selenium for dynamic loaded pages
• Using web scraping frameworks like Scrapy
• Writing your own code
Using Application Programming Interfaces

We can use various websites’ Application Programming Interfaces (APIs) to


collect data from various platforms, such as:
• Twitter
• Reddit
• Google Maps
• Kaggle
• Github
Recommended Read
• Python Data Science Handbook, Chapter 1 IPython: Beyond Normal Python by
Jake VanderPlas
• The Unix Shell by Software Carpentry Foundation
• Practical Introduction to Web Scraping in Python by Colin OKeefe
MANIPULATING DATA
NUMERICAL PYTHON (NUMPY)
Source: Python Data Science Handbook, Chapter 1 IPython: Beyond Normal Python by Jake VanderPlas
NumPy - The Basics
• Supports large multi-dimensional arrays and matrices
• Contains large collection of high-level mathematical
functions to operate on these arrays
• Tools for reading / writing array data to disk

Useful Reading:
• Chapter 4. NumPy Basics: Arrays and Vectorized Computation, Python for Data
Analysis
by Wes McKinney
• Chapter 2. Introduction to Numpy, Python Data Science Handbook, by Jake
VanderPlas
WORKING WITH PANDAS &
DATAFRAMES
Pandas
Pros:
• Provides flexible and expressive data
structures
• Easy to handle missing data
• Columns can easily be added and deleted

Cons:
• Good for several gigabytes of data
• Mostly single threaded
• Complex Group By operations
“My rule of thumb for pandas is that you should have 5 to 10 times as much
RAM as the size of your dataset”
Wes McKinney, 2017
Pandas Objects

DataFrame

Column

Series

Values

NumPy
Let’s move to the Notebook

You might also like