Practical Data Analysis in Python
Hilary Mason
@hmason
www.hilarymason.com
hilary@path101.com
Data is ubiquitous.
The ability and tools to use it are not.
(Focused) Data == Intelligence
Data Analysis on the Web
Data items change rapidly.
Data items are not independent.
There’s a lot of semi-structured data around.
There’s a LOT of data around.
==
Too many problems, few tools, and few experts.
Entity Disambiguation
This is important.
ME
UGLY HAG
Entity Disambiguation
This is important.
Company disambiguation is a very common
problem – Are “Microsoft”, “Microsoft
Corporation”, and “MS” the same company?
This is a hard problem.
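As a rough illustration of why this is hard, here is a minimal sketch of rule-based name normalization. The alias table, the suffix list, and the function name canonical_company are all hypothetical and not from the talk. String rules like these break down quickly: "MS" could just as easily mean Morgan Stanley.

```python
import re

# Hypothetical alias table; a real system would have to curate or learn this.
ALIASES = {"ms": "microsoft"}
# Illustrative list of corporate suffixes to ignore when comparing names.
SUFFIXES = {"corporation", "corp", "inc", "ltd", "llc", "co"}

def canonical_company(name):
    """Map a raw company string to a rough canonical key."""
    # Lowercase and keep only alphanumeric tokens.
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    # Drop corporate suffixes so "Microsoft Corporation" matches "Microsoft".
    tokens = [t for t in tokens if t not in SUFFIXES]
    key = " ".join(tokens)
    # Resolve known abbreviations; unknown keys pass through unchanged.
    return ALIASES.get(key, key)

names = ["Microsoft", "Microsoft Corporation", "MS"]
keys = {canonical_company(n) for n in names}
print(keys)
```

All three strings collapse to one key here only because the alias table was written with the answer in mind, which is the part that does not scale.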
SPAM sucks
Classification
Document classification.
Image recognition.
Topic recognition.
Text Parsing
Recommendation Systems
Product recommendations.
Disease predictions.
Behavior analysis.
IEEE Tag Clustering
immunity
ultrasound
medical imaging
medical devices
thermoelectric devices
fault-tolerant circuits
low power devices
Python for Data Analysis
import why_python_is_awesome
Python is readable.
Easy to transition from Matlab or R.
Numerical computing support.
Growing set of machine learning libraries.
Libraries
NLTK (Natural Language Toolkit) – www.nltk.org
mlpy (Machine Learning PY) – mlpy.fbk.eu
numpy & scipy – scipy.org
MachetEC2
An EC2 AMI provisioned with all of the toys you need:
http://blog.infochimps.org/2009/02/06/start-hacking-machetec2-released/
Practical Data Analysis in Python
Supervised Classification
[Diagram. Training: Training Data → Feature Extractor → Trained Classifier. Prediction: Text → Feature Extractor → Trained Classifier → Spam / Not Spam]
Data: Tweets
Hand-classified. For example, some spam:
| don't disrespect me. I just wanted yall to get a head start so don't feel bad when I have more followers in two days. http://xyyx.eu/a1ha |
| oh yay more new followers..hiii...if u want go to http://xyyx.eu/a1hb |
| My friend made this new tool to get more twitter followers, http://xyyx.eu/a1ht |
| Yes, Twitter is doing some Follower/Following count corrections. Get it back at: http://xyyx.eu/a1h8 |
| man if i see one more person cry about losing followers!!! http://xyyx.eu/a1h4 |
Features
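def document_features(self, document):
    # Convert the token list to a set once so each lookup is fast.
    document_words = set(document)
    features = {}
    # One boolean feature per vocabulary word: does the tweet contain it?
    for word in self.word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features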
Break tweets into lists of relevant words.
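A minimal standalone sketch of tokenizing a tweet and turning it into boolean features. The regex tokenizer and the hand-picked word_features list are assumptions for illustration; the talk does not show how either was built.

```python
import re

def tokenize(tweet):
    # Lowercase, then split into word tokens and single punctuation marks.
    # (Hypothetical tokenizer; the slides do not show the one actually used.)
    return re.findall(r"\w+|[^\w\s]", tweet.lower())

def document_features(document, word_features):
    # Same idea as the method on the slide, written as a plain function.
    document_words = set(document)
    return {'contains(%s)' % w: (w in document_words) for w in word_features}

# Illustrative vocabulary; in practice this might be the most frequent tokens
# in the training tweets.
word_features = ["followers", "twitter", "lunch", "!"]

tokens = tokenize("My friend made this new tool to get more twitter followers,")
features = document_features(tokens, word_features)
print(features)
```

Keeping punctuation as separate tokens matters for this data set, since several of the most informative features in the talk are single punctuation characters.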
Naïve Bayesian Classifier
P(A|B) = the conditional probability of A given B
http://yudkowsky.net/rational/bayes
http://blog.oscarbonilla.com/2009/05/visualizing-bayes-theorem/
classifier = nltk.NaiveBayesClassifier.train(train_set)
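To show roughly what training involves, here is a small from-scratch Naive Bayes over boolean features, using only the standard library. It is a stand-in for illustration, not NLTK's implementation, which differs in details such as smoothing. The function names, the labels 'spam' and 'not_spam', and the toy training set are all made up; the only thing taken from the talk is the shape of train_set, a list of (features, label) pairs.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(train_set):
    """train_set: list of (features_dict, label) pairs."""
    label_counts = Counter(label for _, label in train_set)
    # value_counts[label][feature_name][value] = number of training examples
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in train_set:
        for name, value in features.items():
            value_counts[label][name][value] += 1
    return label_counts, value_counts

def classify(model, features):
    label_counts, value_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in label_counts.items():
        # log P(label) + sum over features of log P(feature=value | label)
        score = math.log(n / total)
        for name, value in features.items():
            count = value_counts[label][name][value]
            # Add-one smoothing over the two boolean values (an assumption).
            score += math.log((count + 1) / (n + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy, hand-made training data for illustration only.
train_set = [
    ({'contains(followers)': True,  'contains(lunch)': False}, 'spam'),
    ({'contains(followers)': True,  'contains(lunch)': False}, 'spam'),
    ({'contains(followers)': False, 'contains(lunch)': True},  'not_spam'),
    ({'contains(followers)': False, 'contains(lunch)': False}, 'not_spam'),
]
model = train_naive_bayes(train_set)
prediction = classify(model, {'contains(followers)': True, 'contains(lunch)': False})
print(prediction)
```

The "naive" part is the inner loop: each feature contributes independently to the score, as if features were unrelated given the label.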
Classifier Accuracy
Use a hand-classified test set to measure the accuracy of the classifier:
nltk.classify.accuracy(classifier, test_set)
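Accuracy here is the fraction of test examples whose predicted label matches the hand-assigned one. A minimal sketch of that calculation, with a hypothetical one-rule classifier and a made-up test set standing in for the real ones:

```python
def accuracy(classify_fn, test_set):
    """Fraction of (features, label) pairs that classify_fn labels correctly."""
    correct = sum(1 for features, label in test_set
                  if classify_fn(features) == label)
    return correct / len(test_set)

def toy_classifier(features):
    # Hypothetical rule for illustration: any mention of followers is spam.
    return 'spam' if features['contains(followers)'] else 'not_spam'

# Made-up, hand-labelled examples; the last one is a legitimate tweet that
# happens to mention followers, so the rule gets it wrong.
test_set = [
    ({'contains(followers)': True},  'spam'),
    ({'contains(followers)': False}, 'not_spam'),
    ({'contains(followers)': False}, 'not_spam'),
    ({'contains(followers)': True},  'not_spam'),
]
score = accuracy(toy_classifier, test_set)
print(score)
```

The test set should be held out from training; scoring a classifier on tweets it has already seen overstates how well it will do on new ones.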
Feature Relevance
contains(') = True not_s : spam = 53.6 : 1.4
contains(") = True not_s : spam = 32.2 : 1.1
contains(#) = True not_s : spam = 22.0 : 1.0
contains(!) = True not_s : spam = 10.8 : 1.0
contains(*) = True spam : not_s = 7.4 : 1.0
contains(=) = True not_s : spam = 5.5 : 1.0
contains(i) = False spam : not_s = 5.2 : 1.0
contains(?) = True not_s : spam = 2.4 : 1.0
contains(:) = True spam : not_s = 2.3 : 1.0
contains(&) = True not_s : spam = 1.8 : 1.0
contains(;) = True not_s : spam = 1.6 : 1.0
contains($) = True spam : not_s = 1.5 : 1.0
contains(u) = True spam : not_s = 1.5 : 1.0
contains(2.0) = False not_s : spam = 1.4 : 1.0
contains(saw) = False not_s : spam = 1.4 : 1.0
contains(noble) = False not_s : spam = 1.4 : 1.0
contains(sound) = False not_s : spam = 1.3 : 1.0
contains(approach) = False not_s : spam = 1.3 : 1.0
contains(finally) = False not_s : spam = 1.3 : 1.0
contains(more) = False spam : not_s = 1.3 : 1.0
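Each row above can be read as a likelihood ratio: how much more often that feature value shows up under one label than under the other. A small sketch of the arithmetic, using made-up conditional probabilities (the talk reports only the ratios, not the probabilities behind them):

```python
def likelihood_ratio(p_by_label):
    """Given {label: P(feature=value | label)}, return (high, low, ratio)."""
    high = max(p_by_label, key=p_by_label.get)
    low = min(p_by_label, key=p_by_label.get)
    return high, low, p_by_label[high] / p_by_label[low]

# Hypothetical probabilities for contains(*) = True under each label.
high, low, ratio = likelihood_ratio({"spam": 0.074, "not_s": 0.010})
print("%s : %s = %.1f : 1.0" % (high, low, ratio))
```

A ratio near 1.0 means the feature says little about the label, which is why the rows toward the bottom of the list are much weaker signals than the ones at the top.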
Kitchen Sink
wash, rinse, repeat
Results
90% accuracy on spam tweets – not bad!
Other possibilities:
categorization – what do you tweet about?
human vs bot?
which celebrity tweeter are you?
<3 Data
Thank you!

Editor's Notes

  • #4: 1) Access to the data, and 2) CPU power/algorithms that are robust enough to analyze it
  • #15: NLTK – in development since 2001