Mastering Data Mining with Python – Find patterns hidden in your data
Ebook, 605 pages, 4 hours

About this ebook

About This Book
  • Dive deeper into data mining with Python – don’t be complacent, sharpen your skills!
  • From the most common elements of data mining to cutting-edge techniques, we’ve got you covered for any data-related challenge
  • Become a more fluent and confident Python data-analyst, in full control of its extensive range of libraries
Who This Book Is For

This book is for data scientists who are already familiar with some basic data mining techniques such as SQL and machine learning, and who are comfortable with Python. If you are ready to learn some more advanced techniques in data mining in order to become a data mining expert, this is the book for you!

Language: English
Publisher: Packt Publishing
Release date: Aug 29, 2016
ISBN: 9781785885914
Author

Megan Squire

Megan Squire is deputy director for analytics at the Southern Poverty Law Center.

    Book preview

    Mastering Data Mining with Python – Find patterns hidden in your data - Megan Squire

    Table of Contents

    Mastering Data Mining with Python – Find patterns hidden in your data

    Credits

    About the Author

    About the Reviewers

    www.PacktPub.com

    eBooks, discount offers, and more

    Why subscribe?

    Preface

    What this book covers

    What you need for this book

    Who this book is for

    Conventions

    Reader feedback

    Customer support

    Downloading the example code

    Errata

    Piracy

    Questions

    1. Expanding Your Data Mining Toolbox

    What is data mining?

    How do we do data mining?

    The Fayyad et al. KDD process

    The Han et al. KDD process

    The CRISP-DM process

    The Six Steps process

    Which data mining methodology is the best?

    What are the techniques used in data mining?

    What techniques are we going to use in this book?

    How do we set up our data mining work environment?

    Summary

    2. Association Rule Mining

    What are frequent itemsets?

    The diapers and beer urban legend

    Frequent itemset mining basics

    Towards association rules

    Support

    Confidence

    Association rules

    An example with data

    Added value – fixing a flaw in the plan

    Methods for finding frequent itemsets

    A project – discovering association rules in software project tags

    Summary

    3. Entity Matching

    What is entity matching?

    Merging data

    Merging datasets vertically

    Merging datasets horizontally

    Techniques for matching

    Attribute-based similarity matching

    Be careful of pairwise comparisons

    Leverage rare values

    Methods for matching attributes

    Range-based or distance from target

    String edit distance

    Hamming distance

    Levenshtein distance

    Soundex

    Leveraging disjoint sets

    Context-based similarity matching

    Machine learning-based entity matching

    Evaluation of entity matching techniques

    Efficiency – how long does it take to do the matching?

    Effectiveness – how accurate are the matches that we generate?

    Usefulness – how practical is the matching procedure to use?

    Entity matching project

    Difficulties with matching software projects

    Two examples

    Matching on project names

    Matching on people names

    Matching on URLs

    Matching on topics and description keywords

    The dataset

    The code

    The results

    How many entity matches did we find?

    How good are the pairs we found?

    Summary

    4. Network Analysis

    What is a network?

    Measuring a network

    Degree of a network

    Diameter of a network

    Walks, paths, and trails in a network

    Components of a network

    Centrality of a network

    Closeness centrality

    Degree centrality

    Betweenness centrality

    Other measures of centrality

    Representing graph data

    Adjacency matrix

    Edge lists and adjacency lists

    Differences between graph data structures

    Importing data into a graph structure

    Adjacency list format

    Edge list format

    GEXF and GraphML

    GDF

    Python pickle

    JSON

    JSON node and link series

    JSON trees

    Pajek format

    A real project

    Exploring the data

    Generating the network files

    Understanding our data as a network

    Generating simple network metrics

    Playing with the parameters of a network

    Analyzing subgraphs

    Analyzing cliques and centrality in the subgraphs

    Looking for change over time

    Summary

    5. Sentiment Analysis in Text

    What is sentiment analysis?

    The basics of sentiment analysis

    The structure of an opinion

    Document-level and sentence-level analysis

    Important features of opinions

    Sentiment analysis algorithms

    General-purpose data collections

    Hu and Liu's sentiment analysis lexicon

    SentiWordNet

    Vader sentiment

    Sentiment mining application

    Motivating the project

    Data preparation

    Data analysis of chat messages

    Data analysis of e-mail messages

    Summary

    6. Named Entity Recognition in Text

    Why look for named entities?

    Techniques for named entity recognition

    Tagging parts of speech

    Classes of named entities

    Building and evaluating NER systems

    NER and partial matches

    Handling partial matches

    Named entity recognition project

    A simple NER tool

    Apache Board meeting minutes

    Django IRC chat

    GnuIRC summaries

    LKML e-mails

    Summary

    7. Automatic Text Summarization

    What is automatic text summarization?

    Tools for text summarization

    Naive text summarization using NLTK

    Text summarization using Gensim

    Text summarization using Sumy

    Sumy's Luhn summarizer

    Sumy's TextRank summarizer

    Sumy's LSA summarizer

    Sumy's Edmundson summarizer

    Summary

    8. Topic Modeling in Text

    What is topic modeling?

    Latent Dirichlet Allocation

    Gensim for topic modeling

    Understanding Gensim LDA topics

    Understanding Gensim LDA passes

    Applying a Gensim LDA model to new documents

    Serializing Gensim LDA objects

    Serializing a dictionary

    Serializing a corpus

    Serializing a model

    Gensim LDA for a larger project

    Summary

    9. Mining for Data Anomalies

    What are data anomalies?

    Missing data

    Locating missing data

    Zero values

    Fixing missing data

    Ignore the problem rows

    Fix the problem manually

    Use a fabricated value

    Use a central measure

    Use Last Observation Carried Forward

    Use a similar value

    Use the most likely value

    Data errors

    Truncated fields

    Data type and character set errors

    Logic or semantic errors

    Outliers

    Visual mining for outliers

    Statistical detection of outliers

    Detecting outliers with modified z-scores

    Detecting outliers by combining statistics and visual mining

    Detecting outliers with machine learning

    Summary

    Index

    Mastering Data Mining with Python – Find patterns hidden in your data

    Copyright © 2016 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author(s), nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: August 2016

    Production reference: 1240816

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham B3 2PB, UK.

    ISBN 978-1-78588-995-0

    www.packtpub.com

    Credits

    Author

    Megan Squire

    Reviewers

    Sanjeev Jaiswal

    Ron Mitsugo Zacharski

    Commissioning Editor

    Veena Pagare

    Acquisition Editor

    Lester Frias

    Content Development Editor

    Mamata Walkar

    Technical Editor

    Naveenkumar Jain

    Copy Editors

    Safis Editing

    Sneha Singh

    Project Coordinator

    Shweta H Birwatkar

    Proofreader

    Safis Editing

    Indexer

    Pratik Shirodkar

    Graphics

    Kirk D'Penha

    Production Coordinator

    Shantanu N. Zagade

    Cover Work

    Shantanu N. Zagade

    About the Author

    Megan Squire is a professor of computing sciences at Elon University.

    Her primary research interest is in collecting, cleaning, and analyzing data about how free and open source software is made. She is one of the leaders of the FLOSSmole.org, FLOSSdata.org, and FLOSSpapers.org projects.

    About the Reviewers

    Sanjeev Jaiswal is a computer graduate with 7 years of industrial experience. His work involves Perl, Python, and GNU/Linux. He is currently working on projects involving penetration testing, source code review, and security design and implementations.

    He is very much interested in web and cloud security. He is also learning NodeJS and cloud security.

    Sanjeev loves teaching engineering students and IT professionals. He has been teaching for the last 8 years in his free time. In 2010, he founded Alien Coders (http://www.aliencoders.org) for computer science students and IT professionals, based on the principle of learning through sharing; it became a huge hit among engineering students in India.

    You can follow him on Facebook at http://www.facebook.com/aliencoders, on Twitter at @aliencoders, and on GitHub at https://github.com/jassics.

    Sanjeev wrote Instant PageSpeed Optimization and co-authored Learning Django Web Development for Packt Publishing. He has reviewed more than 5 books for Packt and looks forward to more such opportunities.

    Ron Mitsugo Zacharski is a computational linguist working in the areas of information extraction and machine learning (zacharski.org). He has a BFA in music from the University of Wisconsin at Milwaukee and a PhD in computer science from the University of Minnesota, and he completed a post doctorate in linguistics at the University of Edinburgh. He authored the free online book A Programmer's Guide to Data Mining: The Ancient Art of the Numerati (www.guidetodatamining.com) and co-edited The Grammar-Pragmatics Interface: Essays in Honor of Jeanette K. Gundel, published by John Benjamins. For the majority of his academic life, he has focused on multilingual natural language processing, particularly with lesser-studied languages. Dr. Zacharski is a Zen monk in the Sōtō School lineage of Soyu Matsuoka. He lives in New Mexico.

    www.PacktPub.com

    eBooks, discount offers, and more

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us for more details.

    At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

    https://www2.packtpub.com/books/subscription/packtlib

    Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

    Why subscribe?

    Fully searchable across every book published by Packt

    Copy and paste, print, and bookmark content

    On demand and accessible via a web browser

    Preface

    Over the past decade, cheaper data storage, faster hardware, and impressive advances in algorithms have combined to pave the way for a rapid ascendance of data science as one of the most important opportunities in computing. While the term data science can include everything from cleaning data and storing data to visualizing it in graphs and charts, the area that has made the most significant gain is the invention of intelligent and sophisticated algorithms for analyzing data. Using computers to find the interesting patterns buried within massive amounts of data is called data mining, an area that encompasses elements of database systems, statistics, and machine learning.

    Right now there are dozens of great data mining and machine learning books available for software developers to get up to date on all these advances in the field. What most of these books have in common is that they all cover a small set of tried-and-true methods for finding patterns in data: classification, clustering, decision trees, and regression. Of course, all of these are critically important methods for any data miner to know and they are popular because they can be effective. But these same few techniques are not the whole story. Data mining is a rich field encompassing many dozens of techniques to uncover patterns and make predictions. A true master of data mining should have many tools in her toolbox, not just a few. Thus, the mission of this book, Mastering Data Mining with Python, is to introduce some of the lesser-known data mining concepts that are typically only covered in academic textbooks.

    This book uses the Python programming language and a project-based approach to introduce diverse and often overlooked data mining concepts, such as association rules, entity matching, network analysis, text mining, and anomaly detection. Each chapter thoroughly illustrates the basics of one particular data mining technique, provides alternatives for evaluating its effectiveness, and then implements the technique using real-world data.

    Our focus on real-world data is another feature of this book that sets it apart from many other data mining books. The true test of whether we have mastered a concept is whether we can apply a method to a new, unknown problem. In our case, this means applying each data mining method to a new problem area or a new data set. The emphasis on real data also means that our results may not always be as clean and tidy as results that come from a canned, example data set. For this reason, each chapter includes a discussion of how to critically evaluate the method. Do the results make sense? What do the results mean? How can the results be improved?

    So, in many ways, this book picks up where some of the other data mining books leave off. If you want to round out your growing data mining toolbox with a set of interesting but often overlooked techniques, then read on to learn the specific topics we will cover and how they will be applied in each chapter.

    What this book covers

    Chapter 1, Expanding Your Data Mining Toolbox, gives an introduction to the field of data mining. In this chapter we pay special attention to how data mining relates to similar topics, such as machine learning and data science. We also review many different data mining methodologies, and talk about their various strengths and weaknesses. This foundational knowledge is important as we transition into the remaining chapters of the book, which are much more technique-oriented and focus on the application of specific data mining tools.

    Chapter 2, Association Rule Mining, introduces our first data mining tool: mining for co-occurring sets of items, sometimes called frequent itemsets. We extend our understanding of frequent itemset mining to include mining for association rules, and we learn how to evaluate whether the rules we have found are helpful or not. To put our knowledge into practice, at the end of the chapter we implement a small project wherein we find association rules in the keywords chosen to describe a large set of software projects.
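
    As a rough illustration of the ideas named above, the following minimal sketch computes the support of an itemset and the confidence of a rule over a handful of baskets. The tag data and the function names are hypothetical, chosen only for illustration; they are not taken from the chapter's project code.

```python
# Hypothetical baskets: each set holds the tags describing one software project.
baskets = [
    {"python", "database", "web"},
    {"python", "web"},
    {"python", "database"},
    {"java", "web"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Support of the combined itemset divided by support of the antecedent.

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    combined = set(antecedent) | set(consequent)
    return support(combined, baskets) / support(antecedent, baskets)


# {python, web} appears in 2 of the 4 baskets.
print(support({"python", "web"}, baskets))
# Of the 3 baskets containing python, 2 also contain web.
print(confidence({"python"}, {"web"}, baskets))
```

    A rule such as python -> web would then be kept or discarded by comparing these two numbers against thresholds chosen by the analyst.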

    Chapter 3, Entity Matching, focuses on finding matching pairs of data elements that may look slightly different but are actually the same. We learn how to determine whether two items are actually the same thing by using the attributes of the data. At the end of the chapter, we implement an entity matching project where we learn to find the software projects that have moved from one hosting service to another, even after changing their names and other important attributes.
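
    One attribute-based way to judge whether two names might refer to the same thing is string edit distance. The sketch below is a plain-Python Levenshtein distance, written for illustration rather than taken from the chapter; the project names in the usage line are made up.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


# Hypothetical project names that differ by one inserted hyphen.
print(levenshtein("myproject", "my-project"))
```

    A small distance suggests a candidate match, but it is only evidence: short, common names can be close by accident, so other attributes should be checked as well.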

    Chapter 4, Network Analysis, is a tour through the basics of network or graph analysis, as used to describe the relationships between various interconnected groups of entities. We investigate the various types of network and learn how to describe and measure them. Then we put our learning into practice to describe how a network of software developers has changed over time.

    Chapter 5, Sentiment Analysis in Text, is the first of four text mining chapters in this book. This chapter serves as an introduction to the growing field of sentiment, or mood, analysis in text. After comparing various approaches to sentiment mining and learning how to evaluate the results, we practice using a machine learning classifier to determine the sentiment of a set of software developer chat logs and e-mail logs.

    Chapter 6, Named Entity Recognition in Text, is about finding proper nouns and proper names in text. We spend some time learning why this task is useful, and why finding named entities can sometimes be more difficult than it sounds. At the end of the chapter we implement a named entity recognition system on several different types of real-world text data including e-mail, chat logs, and board meeting minutes. Along the way we apply different techniques for quantifying the success or failure of our results.

    Chapter 7, Automatic Text Summarization, presents several strategies for automatically creating condensed summaries of text. This chapter emphasizes extractive summarization tools, which are designed to find the most important sentences in a text sample. To this end, we experiment with three different tools for accomplishing this goal, testing the summarization methods, and learning how they differ. Following the introduction of each tool, we attempt to summarize a common set of text documents and compare the results.

    Chapter 8, Topic Modeling in Text, shows how to use software tools to reveal what topics or concepts are present in a given text. Can we train a computer program to infer the themes that are present in large amounts of text? In a series of experiments, we learn how to use common topic modeling libraries to reveal the topics present in software developer e-mails, and how those topics change over time.

    Chapter 9, Mining for Data Anomalies, is where we learn how to use data mining and statistical techniques to improve our own data mining process. While all of the other chapters in this book deal with finding different types of patterns in data, here we focus on finding data that is anomalous or that does not match a particular pattern. Whether it is because the data is empty, missing, or just plain weird, this chapter presents strategies for finding or fixing this type of data so that the rest of your data can be mined more effectively.
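
    To give a flavor of statistical outlier detection, here is a minimal sketch of the modified z-score in its commonly cited form, which uses the median and the median absolute deviation (MAD) with a cutoff of 3.5. The sample values are invented, and the chapter's own implementation may differ.

```python
from statistics import median


def modified_z_scores(values):
    """Return 0.6745 * (x - median) / MAD for each value."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the values equal the median; the score is undefined.
        raise ValueError("MAD is zero; modified z-scores are undefined")
    return [0.6745 * (v - med) / mad for v in values]


def find_outliers(values, threshold=3.5):
    """Values whose absolute modified z-score exceeds the threshold."""
    scores = modified_z_scores(values)
    return [v for v, z in zip(values, scores) if abs(z) > threshold]


# Invented sample: six ordinary readings and one suspicious one.
readings = [10, 12, 11, 13, 12, 11, 95]
print(find_outliers(readings))
```

    Because the median and MAD are barely affected by extreme values, this score is less distorted by the outliers it is trying to find than a mean-based z-score would be.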

    What you need for this book

    To complete the projects in this book, you will need a version of Python 3.5 or higher. I recommend using Anaconda Python, but any Python distribution will do as long as it is updated and contains the following packages: NumPy, Matplotlib, NetworkX, PyMySQL, Gensim, and NLTK. In Chapter 1, Expanding Your Data Mining Toolbox, we will walk through an easy installation of Python and all these libraries, and each time a library is used later in the book, we will install it or upgrade it together.
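
    A quick way to see whether an existing setup meets these requirements is to check the interpreter version and look for each package by its import name. This is a small sketch using only the standard library; the function name and the list constant are illustrative, not part of the book's code.

```python
import sys
from importlib.util import find_spec

# Import names of the packages listed above.
REQUIRED = ["numpy", "matplotlib", "networkx", "pymysql", "gensim", "nltk"]


def check_environment(min_version=(3, 5), packages=REQUIRED):
    """Return (python_ok, missing), where missing lists the packages
    that cannot be found on the current import path."""
    python_ok = sys.version_info >= min_version
    missing = [name for name in packages if find_spec(name) is None]
    return python_ok, missing


python_ok, missing = check_environment()
print("Python version OK:", python_ok)
print("Missing packages:", missing)
```

    Any package reported as missing can then be installed with the distribution's package manager before starting the chapter that needs it.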

    Because data mining is obviously data-centric, and because the data sets we are working with are sometimes large or require some type of persistent data storage, I chose to implement some of the data mining algorithms alongside a relational database system. I chose MySQL for accomplishing this since it is an established, easy-to-download and install piece of infrastructure. The chapters where MySQL comes into play are in working with the memory-intensive algorithms in Chapter 2, Association Rule Mining, and Chapter 3, Entity Matching. I also use MySQL for some of the examples in Chapter 9, Mining for Data Anomalies, but it is possible to go through that chapter without MySQL.

    Who this book is for

    If you picked up a book on mastering data mining, you are probably familiar with the basics of data analysis and you have likely experimented with machine learning techniques such as regression, decision trees, classification, and cluster analysis. If you have intermediate experience with Python, understand basic relational database terminology, have some exposure to basic statistics, and can understand the rudiments of how supervised and unsupervised machine learning techniques work, then you are ready for this book. Let's build on what you already know to learn some more exotic, unusual strategies for mining your data!

    Conventions

    In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

    Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: We can include other contexts through the use of the include directive.

    A block of code is set as follows:

    MINSUPPORTPCT = 5
    allSingletonTags = []
    allDoubletonTags = set()
    doubletonSet = set()

    Any command-line input or output is written as follows:

    conda install pymysql

    New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: Clicking the Next button moves you to the next screen.

    Note

    Warnings or important notes appear in a box like this.

    Tip

    Tips and tricks appear like this.

    Reader feedback

    Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

    To send us general feedback, simply e-mail <[email protected]>, and mention the book's title in the subject of your message.

    If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

    Customer support

    Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

    Downloading the example code

    You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

    You can download the code files by following these steps:

    1. Log in or register on our website using your e-mail address and password.
    2. Hover the mouse pointer on the SUPPORT tab at the top.
    3. Click on Code Downloads & Errata.
    4. Enter the name of the book in the Search box.
    5. Select the book for which you're looking to download the code files.
    6. Choose from the drop-down menu where you purchased this book.
    7. Click on Code Download.

    You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.

    Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

    WinRAR / 7-Zip for Windows

    Zipeg / iZip / UnRarX for Mac

    7-Zip / PeaZip for Linux

    The code bundle for the book is also hosted on GitHub at https://github.com/megansquire/masteringDM. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Errata

    Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, whether in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

    To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

    Piracy

    Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

    Please contact us at <[email protected]> with a link to the suspected pirated material.

    We appreciate your help in protecting our authors and our ability to bring you valuable content.

    Questions

    If you have a problem with any aspect of this book, you can contact us at <[email protected]>, and we will do our best to address the problem.

    Chapter 1. Expanding Your Data Mining Toolbox

    When faced with sensory information, human beings naturally want to find patterns to explain, differentiate, categorize, and predict. This process of looking for patterns all around us is a fundamental human activity, and the human brain is quite good at it. With this skill, our ancient ancestors became better at hunting, gathering, cooking, and organizing. It is no wonder that pattern recognition and pattern prediction were some of the first tasks humans set out to computerize, and this desire continues in earnest today. Depending on the goals of a given project, finding patterns in data using computers nowadays involves database systems, artificial intelligence, statistics, information retrieval, computer vision, and any number of other subfields of computer science, information systems, mathematics, or business. No matter what we call this activity – knowledge discovery in databases, data mining, data science – its primary mission is always to find interesting patterns.

    Despite this humble-sounding mission, data mining has existed for long enough and has built up enough variation in how it is implemented that it has now become a large and complicated field to master. We can think of a
