Statistical Computing With Python

The document outlines a seminar on Statistical Computing with Python, focusing on topics such as web scraping, JSON data structures, and natural language processing. It discusses methods for accessing data through APIs, the importance of tokenization and stop word removal in text processing, and introduces sentiment analysis as a supervised machine learning technique. The seminar is scheduled for October 22-24, 2020, and emphasizes practical applications of statistical computing in Python.


Statistical Computing

with Python
Jason Anastasopoulos, Ph.D.

Upcoming Seminar:
October 22-24, 2020, Remote Seminar
Statistical Computing in
Python
Semi-Structured Data and Databases
HTML and Markup Languages
- Tree-structured (hierarchical) format
- Elements surrounded by opening & closing tags.
- Values embedded in opening tags: <tag-name attr-name="attribute"> data </tag-name>
Web Scraping Basics
- Use HTML and JSON data structures to build databases
- JSON is the format used for:
- Most data from APIs
- Data exchange systems like databases (SQL and MongoDB)
Getting data
Easiest: JSON from APIs

HTML - Very difficult, last resort if not available in APIs (rare now)

Other options:
- Write a bot.
- Pretend to be a browser (Selenium)
HTML
Hyper Text Markup Language

Formatting Web pages

Uses tags
Example
<html>
<head>
<title>Page title here</title>
</head>
<body>
This is sample text...

<p>This is text within a paragraph.</p>
I really mean that
<img src="smileyface.jpg" alt="Smiley face">
</body>
</html>
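As a rough sketch of how this tree of tags can be read programmatically, the snippet below walks a similar page with Python's standard-library `html.parser`. The `TagCollector` class and the `SAMPLE_HTML` string are illustrative names, not part of the seminar material, and scraping libraries offer more convenient interfaces than this.

```python
from html.parser import HTMLParser

# A small page modelled on the example above (the <p> tag is an assumption).
SAMPLE_HTML = """<html>
<head>
<title>Page title here</title>
</head>
<body>
<p>This is text within a paragraph.</p>
<img src="smileyface.jpg" alt="Smiley face">
</body>
</html>"""


class TagCollector(HTMLParser):
    """Record the page title and the attributes of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "img":
            # attrs arrives as a list of (name, value) pairs
            self.images.append(dict(attrs))

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that sits between <title> and </title>
        if self.in_title:
            self.title += data


collector = TagCollector()
collector.feed(SAMPLE_HTML)
print(collector.title)   # Page title here
print(collector.images)  # [{'src': 'smileyface.jpg', 'alt': 'Smiley face'}]
```

The attribute values (`src`, `alt`) come from the opening tag, while the title text comes from the data between the opening and closing tags.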
Webpage example
view-source:https://anastasopoulos.io/research
urllib package
Request package to retrieve files from FTP servers

Connect with web servers using the HTTP protocol

Use of request and response data types
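A minimal sketch of the request side, using `urllib.request` and `urllib.parse` from the standard library. The endpoint `https://api.example.com/search` and its parameters are hypothetical, so the request is built and inspected but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and query parameters, used only for illustration.
base_url = "https://api.example.com/search"
params = {"q": "statistical computing", "page": 1}

# Build a request object: URL with an encoded query string plus a header.
request = Request(
    base_url + "?" + urlencode(params),
    headers={"Accept": "application/json"},
)

print(request.get_method())  # GET
print(request.full_url)

# Sending it would look like this (commented out because the endpoint is made up):
# from urllib.request import urlopen
# with urlopen(request, timeout=10) as response:
#     body = response.read().decode("utf-8")
```

With a real endpoint, `urlopen` returns a response object whose body can be read and decoded as shown in the commented lines.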

JSON
JavaScript Object Notation

Data interchange format

“Lightweight” format
- Data representations
- Easy for users to read
- Easy for parsers to translate
Main Structures
Object
- Uses {}, like a Python dictionary: key names paired with values, with pairs separated by commas.
Array
- List structure
- Uses []
- Contains values.
Value
- Lowest level.
- Values such as strings, numbers etc.
Simple JSON Sample
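To illustrate the three structures, here is a made-up record (not the sample from the slide) parsed with Python's `json` module. The field names are assumptions chosen for the example.

```python
import json

# Made-up record for illustration: one object containing values,
# an array, and a nested object.
sample_text = """
{
  "name": "Ada",
  "age": 36,
  "languages": ["Python", "R"],
  "address": {"city": "London", "postcode": null}
}
"""

record = json.loads(sample_text)

print(type(record).__name__)               # object -> dict
print(type(record["languages"]).__name__)  # array  -> list
print(type(record["name"]).__name__)       # string value -> str
print(record["address"]["postcode"])       # JSON null -> None
```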
Accessing JSON Data
Data accessed via APIs formatted in JSON

Easy to access using Python ‘json’ package

Data accessed as in a dictionary.
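A short sketch of dictionary-style access on an API-like response. The response body and its field names (`results`, `likes`) are invented for the example; real APIs define their own layouts.

```python
import json

# Made-up API response body with hypothetical field names.
response_body = (
    '{"status": "ok", "results": ['
    '{"id": 1, "text": "first post", "likes": 3}, '
    '{"id": 2, "text": "second post"}]}'
)

data = json.loads(response_body)

# Walk the array under "results" and pull out fields by key.
rows = []
for item in data["results"]:
    # .get() supplies a default when a key is missing from a record
    rows.append((item["id"], item["text"], item.get("likes", 0)))

print(rows)  # [(1, 'first post', 3), (2, 'second post', 0)]

# json.dumps goes the other way: Python objects back to JSON text.
print(json.dumps(data["results"][0], indent=2))
```

Using `.get()` with a default is one way to cope with records that omit optional fields, which is common in semi-structured data.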

Databases
- Means of exchanging information.
- SQL: Structured Query Language.
- MongoDB: NoSQL database, uses JSON-like ways of storing data.
- Brief code demonstration, but each of these databases requires more time to cover.
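As one possible demonstration, the sketch below loads JSON records into a SQL table and queries them. It uses SQLite through the standard-library `sqlite3` module as a stand-in, which is an assumption: the seminar does not say which SQL engine it uses, and MongoDB would need a separate driver and server. Table and column names are made up.

```python
import json
import sqlite3

# Made-up JSON records, as they might arrive from an API.
records = json.loads(
    '[{"id": 1, "author": "ana", "likes": 3}, {"id": 2, "author": "ben", "likes": 7}]'
)

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT, likes INTEGER)")

# Named placeholders let each dictionary map straight onto a row.
conn.executemany(
    "INSERT INTO posts (id, author, likes) VALUES (:id, :author, :likes)",
    records,
)
conn.commit()

# Query with SQL: posts with more than 5 likes.
popular = conn.execute(
    "SELECT author, likes FROM posts WHERE likes > ? ORDER BY likes DESC", (5,)
).fetchall()
print(popular)  # [('ben', 7)]
conn.close()
```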
Statistical Computing in
Python
Unstructured Data and Natural Language Processing
Text processing
1. Tokenization - splits the document into tokens, which can be words or n-grams (phrases).
2. Formatting - punctuation, numbers, case, spacing.
3. Stop word removal - removal of “stop words”
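The three steps can be sketched in a few lines of plain Python. The `preprocess` function and the tiny `STOP_WORDS` set are illustrative assumptions; NLP packages provide their own tokenizers and much longer stop lists.

```python
import re

# Tiny illustrative stop list; real packages ship much longer ones.
STOP_WORDS = {"the", "and", "is", "a", "of", "to"}


def preprocess(document):
    # 1. Tokenization: split on whitespace into single-word tokens.
    tokens = document.split()
    # 2. Formatting: lowercase each token and strip anything that is not a letter,
    #    then drop tokens that end up empty (pure numbers or punctuation).
    cleaned = [re.sub(r"[^a-z]", "", t.lower()) for t in tokens]
    cleaned = [t for t in cleaned if t]
    # 3. Stop word removal.
    return [t for t in cleaned if t not in STOP_WORDS]


print(preprocess("The seminar is on October 22-24, 2020, and covers Python!"))
# ['seminar', 'on', 'october', 'covers', 'python']
```

Note that this simple formatting rule also strips apostrophes, so a token like "don't" would become "dont"; whether that is acceptable depends on the analysis.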
Tokenization
“Bag of words” model - most text analysis methods treat documents as a big
bunch of words or terms.

Order is generally not taken into account, just word and term frequencies.

There are ways to parse documents into n-grams or words, but we’ll stick with
words for now.
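A minimal bag-of-words sketch using `collections.Counter`. The two example documents are made up; the point is that each document is reduced to term frequencies and word order is discarded.

```python
from collections import Counter

# Two made-up documents.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# One bag (term -> frequency) per document.
bags = [Counter(doc.split()) for doc in docs]

# Shared vocabulary and a document-term matrix of counts.
vocabulary = sorted(set().union(*bags))
matrix = [[bag[term] for term in vocabulary] for bag in bags]

print(vocabulary)  # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(matrix)      # [[1, 0, 0, 1, 1, 1, 2], [0, 1, 1, 0, 1, 1, 2]]

# Shuffling the words gives the same bag, because order is ignored.
print(Counter("mat the on sat cat the".split()) == bags[0])  # True
```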
Tokenization
Tokenized tweet (1-gram): [“I”, “don’t”, “think”, “you’re”, “the”, …]

Tokenized tweet (2-gram): [“I don’t”, “don’t think”, “think you’re”, “you’re the”, …]
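The 1-gram and 2-gram lists above can be produced with a short helper. The function name `ngrams` is an illustrative choice, and only the opening words of the tweet shown on the slide are used.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Only the opening words of the tweet appear in the slide.
tokens = "I don't think you're the".split()

print(ngrams(tokens, 1))  # ['I', "don't", 'think', "you're", 'the']
print(ngrams(tokens, 2))  # ["I don't", "don't think", "think you're", "you're the"]
```

If `n` is larger than the number of tokens, the helper returns an empty list.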
Stop words
Stop words are simply words that are removed during text processing.

They tend to be very common words, such as “the”, “and”, “is”, etc.

These common words can cause problems for machine learning algorithms
and search engines because they add noise.

BEWARE: Each package defines a different list of stop words, and removal can
sometimes decrease the performance of supervised machine learning classifiers.
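A small sketch of why the choice of stop list matters. The two lists below are made up to stand in for the lists shipped by different packages; the only difference is whether "not" counts as a stop word.

```python
# Two made-up stop lists standing in for the lists of different packages.
STOP_LIST_A = {"the", "and", "is", "not"}
STOP_LIST_B = {"the", "and", "is"}


def remove_stop_words(tokens, stop_words):
    """Drop every token whose lowercase form is in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "the movie is not good".split()

print(remove_stop_words(tokens, STOP_LIST_A))  # ['movie', 'good']
print(remove_stop_words(tokens, STOP_LIST_B))  # ['movie', 'not', 'good']
```

With the first list the negation disappears and the text looks positive, which is the kind of loss that can hurt a classifier.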
Sentiment Analysis
- Sentiment analysis is a type of supervised machine learning that is used to
predict the sentiment of texts.

- Without going into too much detail, we will use what is known as a pretrained
sentiment analysis algorithm.

- This is basically how it works...
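To make the "supervised" part concrete, here is a toy sketch: a classifier is fit on labelled texts and then predicts labels for new ones. It uses a small Naive Bayes model with add-one smoothing written in plain Python, which is an assumption for illustration and not the pretrained algorithm used in the seminar. The training texts and labels are made up.

```python
import math
from collections import Counter

# Made-up labelled examples; a real pretrained model is fit on far more data.
TRAINING = [
    ("i love this great movie", "pos"),
    ("what a wonderful happy day", "pos"),
    ("i hate this terrible movie", "neg"),
    ("what an awful sad day", "neg"),
]


def train(examples):
    """Count words per label and documents per label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def predict(text, model):
    """Return the label with the highest log-probability score."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Log prior: share of training documents with this label.
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            if word in vocab:  # ignore words never seen in training
                # Add-one smoothing avoids zero probabilities.
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
print(predict("a great happy movie", model))         # pos
print(predict("this sad movie is terrible", model))  # neg
```

A pretrained model follows the same pattern, except that the training step has already been done by someone else on a large labelled corpus.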
