Statistical Computing With Python

The document outlines a seminar on Statistical Computing with Python, focusing on topics such as web scraping, JSON data structures, and natural language processing. It discusses methods for accessing data through APIs, the importance of tokenization and stop word removal in text processing, and introduces sentiment analysis as a supervised machine learning technique. The seminar is scheduled for October 22-24, 2020, and emphasizes practical applications of statistical computing in Python.


Statistical Computing with Python
Jason Anastasopoulos, Ph.D.

Upcoming Seminar:
October 22-24, 2020, Remote Seminar
Statistical Computing in Python
Semi-Structured Data and Databases
HTML and Markup Languages
- Tree-structured (hierarchical) format
- Elements surrounded by opening & closing tags.
- Attribute values embedded in opening tags: <tag-name attr-name="attribute"> data </tag-name>
Web Scraping Basics
- Use HTML and JSON data structures to build databases
- JSON is used for:
- Most data returned by APIs
- Data exchange with systems like databases (SQL and MongoDB)
Getting data
Easiest: JSON from APIs

HTML scraping - more difficult; a last resort when the data are not available through an API (increasingly rare)

Other options:
- Write a bot.
- Pretend to be a browser (Selenium)
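
A minimal sketch of the browser approach with Selenium, assuming the selenium package and a compatible browser driver are installed (the URL is the research page used in the webpage example later on):

from selenium import webdriver

# Launch a real browser session so the site sees ordinary browser traffic
driver = webdriver.Chrome()
driver.get("https://anastasopoulos.io/research")
html = driver.page_source   # the fully rendered HTML, after any JavaScript has run
driver.quit()
print(html[:200])           # first 200 characters of the page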
HTML
HyperText Markup Language

Used for formatting web pages

Uses tags
Example
<html>
<head>
<title>Page title here</title>
</head>
<body>
This is sample text...
<!--We use this syntax to write comments -->
<p>This is text within a paragraph.</p>
<em>I <strong>really</strong> mean that</em>
<img src="smileyface.jpg" alt="Smiley face" >
</body>
</html>
Webpage example
view-source:https://anastasopoulos.io/research
urllib package
The urllib.request module retrieves files from web and FTP servers

Connects with web servers using the HTTP protocol

Uses Request and Response data types (a sketch follows below)
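
A minimal sketch using urllib.request, with the research page from the webpage example above (any reachable URL would do):

from urllib import request

# Connect to the web server over HTTPS; urlopen returns a response object
with request.urlopen("https://anastasopoulos.io/research") as response:
    print(response.status)                  # HTTP status code, e.g. 200
    html = response.read().decode("utf-8")  # raw HTML as a string
print(html[:200])                           # first 200 characters of the page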


JSON
JavaScript Object Notation

Data interchange format

“Lightweight” format
- Data representations
- Easy for users to read
- Easy for parsers to translate
Main Structures
Object
- Uses {}; like a dictionary, with comma-separated key/value pairs.
Array
- List structure
- Uses []
- Contains values.
Value
- Lowest level.
- Values such as strings, numbers etc.
Simple JSON Sample
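An illustrative sample (not reproduced from the original slide): an object whose values include strings, a number, a Boolean, and an array:

{
  "seminar": "Statistical Computing with Python",
  "year": 2020,
  "remote": true,
  "topics": ["web scraping", "JSON", "natural language processing"]
}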
Accessing JSON Data
Data accessed via APIs is typically formatted as JSON

Easy to access using Python ‘json’ package

Data accessed as in a dictionary.
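
A minimal sketch with Python's built-in json package; the JSON string mirrors the illustrative sample above:

import json

raw = '{"seminar": "Statistical Computing with Python", "year": 2020, "topics": ["web scraping", "JSON", "NLP"]}'
data = json.loads(raw)    # parse the JSON string into Python objects
print(data["seminar"])    # objects become dictionaries
print(data["topics"][0])  # arrays become lists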


Databases
- Means of exchanging information.
- SQL: Structured Query Language.
- MongoDB: NoSQL database, uses JSON-like ways of storing data.
- A brief code demonstration follows, but each of these databases requires more time to cover fully.
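
A minimal sketch of the SQL side using Python's built-in sqlite3 module (the table and row are illustrative); MongoDB would additionally require the pymongo package and a running server:

import sqlite3

# Create a small database file and a table, then insert one row
con = sqlite3.connect("demo.db")
con.execute("CREATE TABLE IF NOT EXISTS tweets (id INTEGER PRIMARY KEY, text TEXT)")
con.execute("INSERT INTO tweets (text) VALUES (?)", ("I don't think you're the one...",))
con.commit()

# Query the table with a Structured Query Language statement
for row in con.execute("SELECT id, text FROM tweets"):
    print(row)
con.close()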
Statistical Computing in Python
Unstructured Data and Natural Language Processing
Text processing
1. Tokenization - splits the document into tokens, which can be words or n-grams (phrases).
2. Formatting - normalizing punctuation, numbers, case, and spacing (a sketch follows this list).
3. Stop word removal - removal of “stop words”
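
A minimal sketch of step 2 (formatting) using only the standard library; the example text is illustrative:

import re

text = "I don't think you're the 1st to say that!"
text = text.lower()                       # normalize case
text = re.sub(r"[0-9]+", " ", text)       # remove numbers
text = re.sub(r"[^\w\s']", " ", text)     # remove punctuation, keeping apostrophes
text = re.sub(r"\s+", " ", text).strip()  # normalize spacing
print(text)   # "i don't think you're the st to say that"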
Tokenization
“Bag of words” model - most text analysis methods treat documents as an unordered “bag” of words or terms.

Order is generally not taken into account, just word and term frequencies.

There are ways to parse documents into n-grams or words, but we'll stick with words for now.
Tokenization
Tokenized tweet (1-gram): [“I”, “don’t”, “think”, “you’re”, “the”, …]

Tokenized tweet (2-gram): [“I don’t”, “don’t think”, “think you’re”, “you’re the”, …]
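
A minimal sketch of producing those 1-grams and 2-grams with plain Python (a simple whitespace split; the rest of the tweet is invented for illustration, and packages such as nltk offer more careful tokenizers):

tweet = "I don't think you're the right person for this"
tokens = tweet.split()                                          # 1-grams (words)
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]  # 2-grams (phrases)
print(tokens[:5])   # ['I', "don't", 'think', "you're", 'the']
print(bigrams[:4])  # ["I don't", "don't think", "think you're", "you're the"]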

Stop words
Stop words are simply words that are removed during text processing.

They tend to be very common words: “the”, “and”, “is”, etc.

These common words can cause problems for machine learning algorithms and search engines because they add noise.

BEWARE: Each package defines a different list of stop words, and sometimes removal can decrease the performance of supervised machine learning classifiers.
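
A minimal sketch of stop word removal using NLTK's English stop word list, assuming nltk is installed and the 'stopwords' corpus has been downloaded (the token list is illustrative):

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)           # fetch the stop word lists once
stop = set(stopwords.words("english"))
tokens = ["the", "seminar", "is", "about", "statistical", "computing", "and", "python"]
filtered = [t for t in tokens if t not in stop]  # drop very common words
print(filtered)   # ['seminar', 'statistical', 'computing', 'python']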
Sentiment Analysis
- Sentiment analysis is a type of supervised machine learning that is used to predict the sentiment of texts.

- Without going into too much detail, we will use what is known as a pretrained sentiment analysis algorithm.

- This is basically how it works...
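
A minimal sketch using one pretrained option, NLTK's VADER sentiment analyzer (an assumption; the slides do not name a specific package). It needs nltk and the 'vader_lexicon' resource:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # fetch the pretrained lexicon once
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I really love this seminar!"))
# returns negative/neutral/positive proportions plus a compound score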

