Mastering Data Mining with Python – Find patterns hidden in your data

Ebook, 605 pages, 4 hours

Name: Mastering Data Mining with Python – Find patterns hidden in your data
Author: Megan Squire
ISBN: 9781785885914

About this ebook

About This Book

Dive deeper into data mining with Python – don’t be complacent, sharpen your skills!
From the most common elements of data mining to cutting-edge techniques, we’ve got you covered for any data-related challenge
Become a more fluent and confident Python data-analyst, in full control of its extensive range of libraries

Who This Book Is For

This book is for data scientists who are already familiar with some basic data mining techniques such as SQL and machine learning, and who are comfortable with Python. If you are ready to learn some more advanced techniques in data mining in order to become a data mining expert, this is the book for you!

Language: English

Publisher: Packt Publishing

Release date: Aug 29, 2016

ISBN: 9781785885914

Author

Megan Squire

Megan Squire is deputy director for analytics at the Southern Poverty Law Center.

Mastering Data Mining with Python – Find patterns hidden in your data

Credits

About the Author

About the Reviewers

www.PacktPub.com

eBooks, discount offers, and more

Why subscribe?

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. Expanding Your Data Mining Toolbox

What is data mining?

How do we do data mining?

The Fayyad et al. KDD process

The Han et al. KDD process

The CRISP-DM process

The Six Steps process

Which data mining methodology is the best?

What are the techniques used in data mining?

What techniques are we going to use in this book?

How do we set up our data mining work environment?

Summary

2. Association Rule Mining

What are frequent itemsets?

The diapers and beer urban legend

Frequent itemset mining basics

Towards association rules

Support

Confidence

Association rules

An example with data

Added value – fixing a flaw in the plan

Methods for finding frequent itemsets

A project – discovering association rules in software project tags

Summary

3. Entity Matching

What is entity matching?

Merging data

Merging datasets vertically

Merging datasets horizontally

Techniques for matching

Attribute-based similarity matching

Be careful of pairwise comparisons

Leverage rare values

Methods for matching attributes

Range-based or distance from target

String edit distance

Hamming distance

Levenshtein distance

Soundex

Leveraging disjoint sets

Context-based similarity matching

Machine learning-based entity matching

Evaluation of entity matching techniques

Efficiency – how long does it take to do the matching?

Effectiveness – how accurate are the matches that we generate?

Usefulness – how practical is the matching procedure to use?

Entity matching project

Difficulties with matching software projects

Two examples

Matching on project names

Matching on people names

Matching on URLs

Matching on topics and description keywords

The dataset

The code

The results

How many entity matches did we find?

How good are the pairs we found?

Summary

4. Network Analysis

What is a network?

Measuring a network

Degree of a network

Diameter of a network

Walks, paths, and trails in a network

Components of a network

Centrality of a network

Closeness centrality

Degree centrality

Betweenness centrality

Other measures of centrality

Representing graph data

Adjacency matrix

Edge lists and adjacency lists

Differences between graph data structures

Importing data into a graph structure

Adjacency list format

Edge list format

GEXF and GraphML

GDF

Python pickle

JSON

JSON node and link series

JSON trees

Pajek format

A real project

Exploring the data

Generating the network files

Understanding our data as a network

Generating simple network metrics

Playing with the parameters of a network

Analyzing subgraphs

Analyzing cliques and centrality in the subgraphs

Looking for change over time

Summary

5. Sentiment Analysis in Text

What is sentiment analysis?

The basics of sentiment analysis

The structure of an opinion

Document-level and sentence-level analysis

Important features of opinions

Sentiment analysis algorithms

General-purpose data collections

Hu and Liu's sentiment analysis lexicon

SentiWordNet

Vader sentiment

Sentiment mining application

Motivating the project

Data preparation

Data analysis of chat messages

Data analysis of e-mail messages

Summary

6. Named Entity Recognition in Text

Why look for named entities?

Techniques for named entity recognition

Tagging parts of speech

Classes of named entities

Building and evaluating NER systems

NER and partial matches

Handling partial matches

Named entity recognition project

A simple NER tool

Apache Board meeting minutes

Django IRC chat

GnuIRC summaries

LKML e-mails

Summary

7. Automatic Text Summarization

What is automatic text summarization?

Tools for text summarization

Naive text summarization using NLTK

Text summarization using Gensim

Text summarization using Sumy

Sumy's Luhn summarizer

Sumy's TextRank summarizer

Sumy's LSA summarizer

Sumy's Edmundson summarizer

Summary

8. Topic Modeling in Text

What is topic modeling?

Latent Dirichlet Allocation

Gensim for topic modeling

Understanding Gensim LDA topics

Understanding Gensim LDA passes

Applying a Gensim LDA model to new documents

Serializing Gensim LDA objects

Serializing a dictionary

Serializing a corpus

Serializing a model

Gensim LDA for a larger project

Summary

9. Mining for Data Anomalies

What are data anomalies?

Missing data

Locating missing data

Zero values

Fixing missing data

Ignore the problem rows

Fix the problem manually

Use a fabricated value

Use a central measure

Use Last Observation Carried Forward

Use a similar value

Use the most likely value

Data errors

Truncated fields

Data type and character set errors

Logic or semantic errors

Outliers

Visual mining for outliers

Statistical detection of outliers

Detecting outliers with modified z-scores

Detecting outliers by combining statistics and visual mining

Detecting outliers with machine learning

Summary

Index

Mastering Data Mining with Python – Find patterns hidden in your data

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: August 2016

Production reference: 1240816

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78588-995-0

www.packtpub.com

Credits

Author

Megan Squire

Reviewers

Sanjeev Jaiswal

Ron Mitsugo Zacharski

Commissioning Editor

Veena Pagare

Acquisition Editor

Lester Frias

Content Development Editor

Mamata Walkar

Technical Editor

Naveenkumar Jain

Copy Editors

Safis Editing

Sneha Singh

Project Coordinator

Shweta H Birwatkar

Proofreader

Safis Editing

Indexer

Pratik Shirodkar

Graphics

Kirk D'Penha

Production Coordinator

Shantanu N. Zagade

Cover Work

Shantanu N. Zagade

About the Author

Megan Squire is a professor of computing sciences at Elon University.

Her primary research interest is in collecting, cleaning, and analyzing data about how free and open source software is made. She is one of the leaders of the FLOSSmole.org, FLOSSdata.org, and FLOSSpapers.org projects.

About the Reviewers

Sanjeev Jaiswal is a computer graduate with 7 years of industrial experience. His work involves Perl, Python, and GNU/Linux. He is currently working on projects involving penetration testing, source code review, and security design and implementation.

He is very much interested in web and cloud security. He is also learning NodeJS and cloud security.

Sanjeev loves teaching engineering students and IT professionals. He has been teaching for the last 8 years in his free time. He founded Alien Coders (http://www.aliencoders.org), based on the learning through sharing principle for computer science students and IT professionals in 2010, which became a huge hit in India among engineering students.

You can follow him on Facebook at http://www.facebook.com/aliencoders, on Twitter at @aliencoders, and on GitHub at https://github.com/jassics.

Sanjeev wrote Instant PageSpeed Optimization and co-authored Learning Django Web Development for Packt Publishing. He has reviewed more than 5 books for Packt and looks forward to more such opportunities.

Ron Mitsugo Zacharski is a computational linguist working in the areas of information extraction and machine learning (zacharski.org). He has a BFA in music from the University of Wisconsin at Milwaukee and a PhD in computer science from the University of Minnesota, and he completed a post doctorate in linguistics at the University of Edinburgh. He authored the free online book A Programmer's Guide to Data Mining: The Ancient Art of the Numerati (www.guidetodatamining.com) and co-edited The Grammar-Pragmatics Interface: Essays in Honor of Jeanette K. Gundel, published by John Benjamins. For the majority of his academic life, he has focused on multilingual natural language processing, particularly with lesser-studied languages. Dr. Zacharski is a Zen monk in the Sōtō School lineage of Soyu Matsuoka. He lives in New Mexico.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print book customer you are entitled to a discount on the eBook copy. Get in touch with us for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

Fully searchable across every book published by Packt

Copy and paste, print, and bookmark content

On demand and accessible via a web browser

Preface

Over the past decade, cheaper data storage, faster hardware, and impressive advances in algorithms have combined to pave the way for the rapid ascent of data science as one of the most important opportunities in computing. While the term data science can include everything from cleaning and storing data to visualizing it in graphs and charts, the area that has seen the most significant gains is the development of intelligent and sophisticated algorithms for analyzing data. Using computers to find the interesting patterns buried within massive amounts of data is called data mining, an area that encompasses elements of database systems, statistics, and machine learning.

Right now there are dozens of great data mining and machine learning books available for software developers to get up to date on all these advances in the field. What most of these books have in common is that they all cover a small set of tried-and-true methods for finding patterns in data: classification, clustering, decision trees, and regression. Of course, all of these are critically important methods for any data miner to know and they are popular because they can be effective. But these same few techniques are not the whole story. Data mining is a rich field encompassing many dozens of techniques to uncover patterns and make predictions. A true master of data mining should have many tools in her toolbox, not just a few. Thus, the mission of this book, Mastering Data Mining with Python, is to introduce some of the lesser-known data mining concepts that are typically only covered in academic textbooks.

This book uses the Python programming language and a project-based approach to introduce diverse and often overlooked data mining concepts, such as association rules, entity matching, network analysis, text mining, and anomaly detection. Each chapter thoroughly illustrates the basics of one particular data mining technique, provides alternatives for evaluating its effectiveness, and then implements the technique using real-world data.

Our focus on real-world data is another feature of this book that sets it apart from many other data mining books. The true test of whether we have mastered a concept is whether we can apply a method to a new, unknown problem. In our case, this means applying each data mining method to a new problem area or a new data set. The emphasis on real data also means that our results may not always be as clean and tidy as results that come from a canned, example data set. For this reason, each chapter includes a discussion of how to critically evaluate the method. Do the results make sense? What do the results mean? How can the results be improved?

So, in many ways, this book picks up where some of the other data mining books leave off. If you want to round out your growing data mining toolbox with a set of interesting but often overlooked techniques, then read on to learn the specific topics we will cover and how they will be applied in each chapter.

What this book covers

Chapter 1, Expanding Your Data Mining Toolbox, gives an introduction to the field of data mining. In this chapter we pay special attention to how data mining relates to similar topics, such as machine learning and data science. We also review many different data mining methodologies, and talk about their various strengths and weaknesses. This foundational knowledge is important as we transition into the remaining chapters of the book, which are much more technique-oriented and focus on the application of specific data mining tools.

Chapter 2, Association Rule Mining, introduces our first data mining tool: mining for co-occurring sets of items, sometimes called frequent itemsets. We extend our understanding of frequent itemset mining to include mining for association rules, and we learn how to evaluate whether the rules we have found are helpful or not. To put our knowledge into practice, at the end of the chapter we implement a small project wherein we find association rules in the keywords chosen to describe a large set of software projects.

Chapter 3, Entity Matching, focuses on finding matching pairs of data elements that may look slightly different but are actually the same. We learn how to determine whether two items are actually the same thing by using the attributes of the data. At the end of the chapter, we implement an entity matching project where we learn to find the software projects that have moved from one hosting service to another, even after changing their names and other important attributes.

Chapter 4, Network Analysis, is a tour through the basics of network or graph analysis, as used to describe the relationships between various interconnected groups of entities. We investigate the various types of network and learn how to describe and measure them. Then we put our learning into practice to describe how a network of software developers has changed over time.

Chapter 5, Sentiment Analysis in Text, is the first of four text mining chapters in this book. This chapter serves as an introduction to the growing field of sentiment, or mood, analysis in text. After comparing various approaches to sentiment mining and learning how to evaluate the results, we practice using a machine learning classifier to determine the sentiment of a set of software developer chat logs and e-mail logs.

Chapter 6, Named Entity Recognition in Text, is about finding proper nouns and proper names in text. We spend some time learning why this task is useful, and why finding named entities can sometimes be more difficult than it sounds. At the end of the chapter we implement a named entity recognition system on several different types of real-world text data including e-mail, chat logs, and board meeting minutes. Along the way we apply different techniques for quantifying the success or failure of our results.

Chapter 7, Automatic Text Summarization, presents several strategies for automatically creating condensed summaries of text. This chapter emphasizes extractive summarization tools, which are designed to find the most important sentences in a text sample. We experiment with three different tools for this task, testing each summarization method and learning how they differ. Following the introduction of each tool, we attempt to summarize a common set of text documents and compare the results.

Chapter 8, Topic Modeling in Text, shows how to use software tools to reveal what topics or concepts are present in a given text. Can we train a computer program to infer the themes that are present in large amounts of text? In a series of experiments, we learn how to use common topic modeling libraries to reveal the topics present in software developer e-mails, and how those topics change over time.

Chapter 9, Mining for Data Anomalies, is where we learn how to use data mining and statistical techniques to improve our own data mining process. While all of the other chapters in this book deal with finding different types of patterns in data, here we focus on finding data that is anomalous or that does not match a particular pattern. Whether it is because the data is empty, missing, or just plain weird, this chapter presents strategies for finding or fixing this type of data so that the rest of your data can be mined more effectively.
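To make the vocabulary of Chapter 2 a little more concrete before moving on, here is a minimal sketch of how support and confidence could be computed for a single candidate rule. The baskets of tags below are made up, and the helper names `support` and `confidence` are assumptions chosen for illustration; this is not the code used in the chapter's project.

```python
# Toy transactions: each set holds the tags attached to one hypothetical project.
baskets = [
    {"python", "database", "web"},
    {"python", "web"},
    {"python", "database"},
    {"java", "web"},
    {"python", "database", "testing"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of the combined itemset divided by support of the antecedent."""
    both = support(set(antecedent) | set(consequent), transactions)
    base = support(antecedent, transactions)
    return both / base if base else 0.0


# Candidate rule: projects tagged "python" are also tagged "database".
rule_support = support({"python", "database"}, baskets)
rule_confidence = confidence({"python"}, {"database"}, baskets)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

In this toy data, three of the five baskets contain both tags and four contain "python", so the rule has a support of 0.60 and a confidence of 0.75. Real tag data is far larger and sparser, which is why the choice of a minimum support threshold matters.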

What you need for this book

To complete the projects in this book, you will need Python 3.5 or higher. I recommend using Anaconda Python, but any Python distribution will do as long as it is up to date and contains the following packages: NumPy, Matplotlib, NetworkX, PyMySQL, Gensim, and NLTK. In Chapter 1, Expanding Your Data Mining Toolbox, we will walk through an easy installation of Python and all these libraries, and each time a library is used later in the book, we will install it or upgrade it together.
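If you would like to check an existing installation before starting, a short script along the following lines may help. It is only a sketch: the list of import names is an assumption based on the packages named above (import names are lowercase and can differ from the distribution names), and the helper `missing_modules` is hypothetical rather than part of any library.

```python
import importlib.util
import sys

# Import names for the packages listed above. These are assumed to match
# the names used in import statements, for example "pymysql" for PyMySQL.
REQUIRED_MODULES = ["numpy", "matplotlib", "networkx", "pymysql", "gensim", "nltk"]


def missing_modules(names):
    """Return the names that cannot be located by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


python_ok = sys.version_info >= (3, 5)
missing = missing_modules(REQUIRED_MODULES)

print("Python 3.5 or higher:", python_ok)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

Anything reported as missing can then be installed with the package manager of your Python distribution.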

Because data mining is obviously data-centric, and because the data sets we are working with are sometimes large or require some type of persistent data storage, I chose to implement some of the data mining algorithms alongside a relational database system. I chose MySQL for this since it is an established piece of infrastructure that is easy to download and install. MySQL comes into play with the memory-intensive algorithms in Chapter 2, Association Rule Mining, and Chapter 3, Entity Matching. I also use MySQL for some of the examples in Chapter 9, Mining for Data Anomalies, but it is possible to go through that chapter without MySQL.

Who this book is for

If you picked up a book on mastering data mining, you are probably familiar with the basics of data analysis and you have likely experimented with machine learning techniques such as regression, decision trees, classification, and cluster analysis. If you have intermediate experience with Python, understand basic relational database terminology, have some exposure to basic statistics, and can understand the rudiments of how supervised and unsupervised machine learning techniques work, then you are ready for this book. Let's build on what you already know to learn some more exotic, unusual strategies for mining your data!

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We can include other contexts through the use of the include directive."

A block of code is set as follows:

MINSUPPORTPCT = 5

allSingletonTags = []

allDoubletonTags = set()

doubletonSet = set()

Any command-line input or output is written as follows:

conda install pymysql

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: Clicking the Next button moves you to the next screen.

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.

2. Hover the mouse pointer over the SUPPORT tab at the top.

3. Click on Code Downloads & Errata.

4. Enter the name of the book in the Search box.

5. Select the book for which you're looking to download the code files.

6. Choose from the drop-down menu where you purchased this book.

7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/megansquire/masteringDM. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, whether in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

Chapter 1. Expanding Your Data Mining Toolbox

When faced with sensory information, human beings naturally want to find patterns to explain, differentiate, categorize, and predict. This process of looking for patterns all around us is a fundamental human activity, and the human brain is quite good at it. With this skill, our ancient ancestors became better at hunting, gathering, cooking, and organizing. It is no wonder that pattern recognition and pattern prediction were some of the first tasks humans set out to computerize, and this desire continues in earnest today. Depending on the goals of a given project, finding patterns in data using computers nowadays involves database systems, artificial intelligence, statistics, information retrieval, computer vision, and any number of other various subfields of computer science, information systems, mathematics, or business, just to name a few. No matter what we call this activity – knowledge discovery in databases, data mining, data science – its primary mission is always to find interesting patterns.

Despite this humble-sounding mission, data mining has existed for long enough and has built up enough variation in how it is implemented that it has now become a large and complicated field to master. We can think of a

Book preview

Mastering Data Mining with Python – Find patterns hidden in your data - Megan Squire

Table of Contents

Mastering Data Mining with Python – Find patterns hidden in your data

Mastering Data Mining with Python – Find patterns hidden in your data

Credits

About the Author

About the Reviewers

eBooks, discount offers, and more

Why subscribe?