
Introduction to Data Analytics
ISP610 BUSINESS DATA ANALYTICS
Prepared by: Ruhaila Maskat (PhD)
References: EMC Data Scientist Associate, Bain Research, Ingram Micro Advisor

At the end of this lesson you should know:
• The types of data.
• The value of data in the present and future.
• The importance of analytics in relation to the value of data.
• Four types of analytics.
• Analytics methods.
• The people involved in business data analytics.
• Your role in business data analytics.
• Necessary skill set to be in business data analytics.
• Questions to ask customers.

Types of Data
• 80–90% of future data growth is coming from non-structured data types (semi-structured, quasi-structured and unstructured).
• In reality, these four types of data sources can be mixed together.
• Example: you have a typical RDBMS which stores call logs for a support call centre. Here, you may have structured data such as date/time stamps, machine types, problem type and operating system, which were probably entered by the support desk person from a pull-down menu GUI. Then you will have unstructured or semi-structured data, such as free-form call log information, taken from an email ticket of the problem or an actual phone call description of a technical problem and a solution. You could also have voice logs or audio transcripts of the actual call that might be associated with the structured data.

Data Sources
• These data come from multiple sources, including:
  • Medical information, such as genomic sequencing and MRIs.
  • Increased use of broadband on the Web – including the 2 billion photos each month that Facebook users currently upload, as well as the innumerable videos uploaded to YouTube and other multimedia sites.
  • Video surveillance.
  • Increased global use of mobile devices – the torrent of texting is not likely to cease.
  • Smart devices – sensor-based collection of information from smart electric grids, smart buildings and many other public and industry infrastructure.
  • Non-traditional IT devices – including the use of RFID readers, GPS navigation systems, and seismic processing.

The Value of Data
• Everyone and everything is leaving a digital footprint. The graph shows the different forms of data being generated by new applications and the scale and growth rate of the data. By analysing these immense data, organisations can reap value.
• Industry case studies:
  • Health care – Reducing cost of care
  • Public services – Preventing pandemics
  • Life sciences – Genomic mapping

Health care
[Diagram: Data → VALUE!]

Public Services
[Diagram: Data → VALUE!]
Life Sciences
[Diagram: Data → VALUE!]

Competitive Advantage
• To a profit-making organisation, the value of data comes in the form of an advantage over its competitors.
• According to Bain Research, top-performing organisations tend to make decisions based on what their data tells them. By having a good basis to work on, these organisations tend to make decisions faster.

Competitive Advantage: Airlines


• Call centres, for instance, can be made more effective and efficient by
capitalizing on what the company can know about the caller ahead of time.
And airlines have for years been able to route premium-status fliers to
higher-level customer service representatives by recognizing their caller
IDs. Now they can do even more: By making a quick correlation between
your ID, your booked flights and the status of those flights, they may be
able to determine why you’re calling, even before the second ring. If your
next flight has just been delayed, the representative could answer the
phone with a pretty good idea of why you’re calling. More in-depth analysis
could correlate your ID with your social media presence. If you’ve just
tweeted an irate message about being booted from a flight, the rep
answering your call may have already read it.

What's Driving Analytics in Organisations?
• More than just OLTP and MIS reporting.
• Rather than doing standard reporting on these areas, organizations can apply advanced analytical techniques to optimize processes and derive more value from these typical tasks.

Analytics
• Analytics examines large amounts of data to uncover hidden patterns, correlations and other insights.
• Analytics helps organisations make more accurate decisions when faced with problems.
• Analytics helps organizations harness their data and use it to identify new opportunities. That, in turn, leads to smarter business moves, more efficient operations, higher profits and happier customers.

WHO ARE THE PEOPLE INVOLVED IN
BUSINESS DATA ANALYTICS
AND
WHAT IS YOUR ROLE?

YOU
[Diagram: the people involved in business data analytics and where YOU fit]

SKILL SET
• Quantitative skills, such as mathematics or statistics.
• Technical aptitude, such as software engineering, machine learning and programming skills.
• Sceptical. This may be a counterintuitive trait, although it is important that data scientists can examine their work critically rather than in a one-sided way.
• Curious & Creative. Must be passionate about data and finding creative ways to solve problems and portray information.
• Communicative & Collaborative. It is not enough to have strong quantitative skills or engineering skills. To make a project resonate, you must be able to articulate the business value in a clear way, and work collaboratively with project sponsors and key stakeholders.

Important Questions to Ask Your Customer
• What is the business problem you're trying to solve?
• What is your desired outcome?
• Will the focus and scope of the problem change if the following dimensions change:
  • Time
  • People
  • Risk
  • Resources
  • Size and attributes of data

Important Questions to Ask Your Customer
• What data sources do you have?
• What industry issues may impact the analysis?
• What timelines are you up against?
• Who could provide insight into the project?
• Who has the final say on the project?

Group Discussion
Think of a business process that interests you and answer the given questions:
• What value can be gained?
• Who are the people involved?
• What questions would you ask?
Unstructured Data
ISP610 BUSINESS DATA ANALYTICS
Prepared by: Ruhaila Maskat (PhD)
References: Wikipedia, SearchDataManagement, 3pillarglobal

By the end of this lesson, you should know:
• What NoSQL databases are.
• How they differ from SQL databases.
• Types of NoSQL databases.

NoSQL databases
• Non SQL, non-relational, or "Not only SQL".
• Stores and retrieves data that is not modelled in rows and columns.
• "Not only SQL" – may support SQL-like query languages.

Applications of NoSQL databases
• The NoSQL distributed database infrastructure has been the solution to handling some of the biggest data warehouses on the planet – i.e. the likes of Google, Amazon, and the CIA.
• Airbus: http://medianetwork.oracle.com/video/player/4662924811001
NoSQL vs SQL

NoSQL:
1. Non-relational model.
2. Stores data in JSON, key/value, graphs, columns.
3. New properties can be added on the fly.
4. Good for semi-structured, complex or nested data.
5. Relationships are captured by denormalising data and presenting all data for an object in a single record.
6. Dynamic/flexible schema.

SQL:
1. Relational model.
2. Stores data in a table.
3. Adding a new property may require altering schemas.
4. Good for structured data.
5. Relationships are captured in a normalised model, using joins to resolve references across tables.
6. Strict schema.

Case study: Building a social media website
• Users can post articles with related media like pictures, videos, or even music.
• Users can comment on posts and give points for ratings.
• Users can see a feed of posts.
• Users can interact with the main website.

Relational model
[Diagram: a normalised relational design for the site]

NoSQL model
[Diagram: the SQL and NoSQL designs side by side]
In general:
• One query.
• No JOINs.
• No schema is maintained.

Types of NoSQL databases
• Key-value
• Column / BigTable
• Document
• Graph
Key-value database
• The most basic and a backbone implementation of NoSQL.
• Underlying it is a hash table which consists of a unique key that points to a specific item of data.
• Works by matching keys with values, like a dictionary.
• Given a key (e.g. the_answer_to_life), it returns the matching value (e.g. 42).
• The database is a global collection of key-value pairs.
• As the volume of data increases, maintaining unique values as keys may become more difficult.
• Examples: Riak, Amazon S3 (Dynamo), Oracle NoSQL.

Example data
[Figure]

Storage
• Any reads and writes of values use the key.
• The key can be synthetic or auto-generated.
• The value can be a String, JSON, BLOB, etc.

Basic reading and writing
• Get(key): returns the value associated with the provided key.
• Put(key, value): associates the value with the key.
• Multi-get(key1, key2, …, keyN): returns the list of values associated with the list of keys.
• Delete(key): removes the entry for the key from the data store.
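The four operations map naturally onto a hash table. Below is a minimal, illustrative in-memory sketch in Python; it is not the API of any particular NoSQL product:

    # Minimal in-memory key-value store sketch illustrating the four operations.
    class KeyValueStore:
        def __init__(self):
            self._data = {}  # the underlying hash table

        def get(self, key):
            # Returns the value associated with the provided key (None if absent).
            return self._data.get(key)

        def put(self, key, value):
            # Associates the value with the key.
            self._data[key] = value

        def multi_get(self, *keys):
            # Returns the list of values associated with the list of keys.
            return [self._data.get(k) for k in keys]

        def delete(self, key):
            # Removes the entry for the key from the data store.
            self._data.pop(key, None)

    store = KeyValueStore()
    store.put("the_answer_to_life", 42)
    print(store.get("the_answer_to_life"))  # 42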
Column/BigTable
• Advances the simple nature of the key/value approach.
• Does not require a pre-structured table to work with the data.
• Works by creating collections of one or more key/value pairs.
• Two-dimensional arrays whereby each key has one or more key/value pairs attached to it.
• Two groups: column-store and column-family store.
• Column-family store: Bigtable, HBase, Hypertable, and Cassandra.
• Column-store: Sybase IQ, C-store, Vertica, VectorWise, MonetDB, ParAccel and Infobright.

[Figures: column-store layouts (position-based and rowid-based) and a column-family layout, each organised around a KEY and its VALUEs]
Column-family example
• The outermost keys 3PillarNoida, 3PillarCluj, 3PillarTimisoara and 3PillarFairfax are analogous to rows.
• 'address' and 'details' are called column families.
• The column-family 'address' has columns 'city' and 'pincode'.
• The column-family 'details' has columns 'strength' and 'projects'.
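The same structure can be sketched as nested maps, with the row key outermost, then the column family, then its columns. The values below are illustrative placeholders, not figures from the slides:

    # Column-family layout sketched as nested dicts:
    # row key -> column family -> columns.
    offices = {
        "3PillarNoida": {
            "address": {"city": "Noida", "pincode": "201301"},    # placeholder values
            "details": {"strength": 200, "projects": 15},         # placeholder values
        },
        "3PillarFairfax": {
            "address": {"city": "Fairfax", "pincode": "22030"},
            "details": {"strength": 100, "projects": 8},
        },
    }
    # Reading one column of one row:
    print(offices["3PillarNoida"]["address"]["city"])  # Noida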

Document database
• A collection of key-value pairs, but the values stored (referred to as "documents") provide some structure and encoding of the managed data, i.e. XML, JSON, BSON. A unique key is a simple identifier (string, URI, path).
• Embeds attribute metadata associated with content; this provides a way to query data based on contents. An API is used to retrieve data based on content, and also allows editing of content and metadata.
• While key-value stores require the key to access a data value, a document store has metadata which allows data access directly via an attribute instead of through a key.
• Examples: CouchDB, MongoDB.

Document database
• A document is the most basic unit of data.
• Documents are ordered sets of key-value pairs.
• Each document contains one or more name-value pairs.
• Example (a unique KEY, _id, plus NAME-VALUE pairs; Document 1):
  {
    _id : 978,
    "Title" : "The Linux Command Line",
    "Author" : "William Shotts"
  }
• Documents are gathered together in collections within the database.
• Collections should make sense, e.g. books, webstore, retail store, fruits.
• Hence, a document database is unstructured and schemaless.

Since we are so used to relational db…
[Figure: a relational database design compared with a NoSQL document database design]

Since we are so used to relational db…

Relational Databases → Document Databases
• Databases → Databases or Buckets
• Tables → Collections or Type Signifiers
• Rows → Documents
• Columns → Attributes/Names
• Index → Index

Document database
• We can store different schemas in different documents, and these documents reside in the same collection.
• Example (two documents in one collection):
  {
    _id : 1,
    "ISBN" : "978",
    "Title" : "The Linux Command Line"
  }
  {
    _id : 2,
    "ASIN" : "B00J",
    "Item" : "Cherry Barbeque Sauce"
  }
Document database
• We can have a more complicated structure.
• Example ("Author" holds a list of values):
  {
    _id : "978",
    "Title" : "Data Science",
    "Author" : ["William Jackson", "Ben Ten"]
  }

Use cases
• Event logging
• Blogs and website content management
• Web analytics or real-time analytics
• E-commerce applications, e.g. shopping cart.
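As an illustration of storing and querying such a document (assuming MongoDB with the pymongo driver running locally; the database and collection names here are hypothetical):

    from pymongo import MongoClient  # assumes a MongoDB server on localhost

    client = MongoClient()
    books = client.mydb.books  # hypothetical database/collection names

    # Documents with nested lists are stored as-is, no schema needed.
    books.insert_one({"_id": "978", "Title": "Data Science",
                      "Author": ["William Jackson", "Ben Ten"]})

    # Querying inside the list of values just works:
    print(books.find_one({"Author": "Ben Ten"})["Title"])  # Data Science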

Graph database
• Uses graph structures with edges, nodes and properties.
• Nodes are organised based on their relationships with one another.
• These relationships are represented by edges between the nodes.
• Relationships define social connectivity.
• Both nodes and relationships have defined properties.
• Example: Neo4j.

Use cases
• People who like this product usually like that product.
• Mary is friends with George. George likes pizza. George has visited Japan. Thus, we can ask who among the friends of Mary's friends like the food that Mary's friend likes but have not visited the place that Mary's friend has visited.
• You are more likely to be friends with Abu because you know Ali, since Abu is Ali's friend.
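The friends-of-friends question can be sketched with plain Python dictionaries standing in for a real graph database (a toy illustration only; node and property names are mine):

    # Toy adjacency-list graph: nodes with "friends" edges plus node properties.
    friends = {"Mary": ["George"], "George": ["Mary", "Abu"], "Abu": ["George"]}
    likes = {"George": ["pizza"], "Abu": ["pizza"]}
    visited = {"George": ["Japan"], "Abu": []}

    # Friends of Mary's friends who like what Mary's friend likes,
    # but have not visited where Mary's friend has visited.
    for friend in friends["Mary"]:                  # e.g. George
        for fof in friends[friend]:                 # friends of George
            if fof == "Mary":
                continue
            shares_taste = set(likes.get(fof, [])) & set(likes.get(friend, []))
            been_there = set(visited.get(fof, [])) & set(visited.get(friend, []))
            if shares_taste and not been_there:
                print(fof)  # Abu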
Graph database
[Figure: an example graph of nodes connected by relationship edges]

Unstructured Data
ISP610 BUSINESS DATA ANALYTICS
Prepared by: Ruhaila Maskat (PhD)
References: Wikipedia, SearchDataManagement, 3pillarglobal

By the end of this lesson, you should know:
• How to model a document NoSQL database.

NoSQL data modelling vs. Relational modelling
• NoSQL data modelling often starts from the application-specific queries, as opposed to relational modelling:
  • Relational modelling is typically driven by the structure of available data. The main design theme is "What answers do I have?"
  • NoSQL data modelling is typically driven by application-specific access patterns, i.e. the types of queries to be supported. The main design theme is "What questions do I have?"

Modelling techniques
• Referencing documents.
• Embedding documents.
• Denormalisation.
• Heterogeneous collection.

Referencing documents
• You can reference another document using the document key. This is similar to normalisation in a relational DB.
• Referencing enables document databases to cache, store and retrieve the documents independently.
• Provides better write speed/performance.
• Reading may require more round trips to the server.

Example (the speakers array references two speaker documents):
{
  sessionId : session1,
  sessionName : Document modelling,
  speakers : [{ id: 1 }, { id: 2 }]
}
{
  _id : 1,
  name : Ryan,
  thumbnailUrl : ….,
  shortProfile : ….
}
{
  _id : 2,
  name : David,
  thumbnailUrl : ….,
  shortProfile : ….
}

Referencing documents can be beneficial for…
• 1-to-many relationships (unbounded).
• Many-to-many relationships.
• Related data changes with differing volatility (speed of change or update).

1-to-many relationships (unbounded)
[Figure]

Many-to-many relationships
[Figures: referencing in both directions is not efficient, requiring two references (first to speaker documents, second to session documents); referencing by session or by speaker alone is more efficient, requiring only one reference]

Related data changes with differing volatility
[Figure: data with lower volatility kept separate from data with greater volatility]
Embedding documents
• You can embed a document in another document by simply defining an attribute to be an embedded document.
• Embedding enables document databases to cache, store and retrieve the complex document with its embedded documents as a single piece.
• Eliminates the need to retrieve two separate documents and join them.
• Provides better read speed/performance.

Example (speaker documents embedded in the session document):
{
  sessionId : session1,
  sessionName : Document modelling,
  speakers : [
    { _id : 1, name : Ryan, thumbnailUrl : …., shortProfile : …. },
    { _id : 2, name : David, thumbnailUrl : …., shortProfile : …. }
  ]
}

Embedding can be advantageous when…
• Two data items are often queried together.
• One data item is dependent on another.
• There is a 1:1 relationship.
• There is similar volatility (speed of change or update).
Two data items are often queried together
[Figure]

One data item is dependent on another
[Figure: line items dependent on their Order]

1:1 relationship
[Figure]

Similar volatility
[Figure: both email and socialIds do not change very often]
Normalised vs. Denormalised query
• Normalised query: two reads are needed.
• Denormalised query: embeds the speaker into the session with summary information; if further information about a speaker is needed, only then will it be loaded.

Normalisation vs. Denormalisation
• Normalised:
  • Requires multiple reads.
  • Doesn't align with instances.
  • Provides faster write speed.
• Denormalised:
  • Requires updates in multiple places.
  • Provides faster read speed.

Homogeneous collections
• One collection per data type: Speaker, Session, Room.
• But this would require three different queries over three different collections.
Heterogeneous collections
• Multiple types in a single collection.

Unstructured Data
ISP610 BUSINESS DATA ANALYTICS
Prepared by: Ruhaila Maskat (PhD)
References: Datasciencedojo, Wiki

By the end of this lesson, you should know:
• What web scraping is.
• Processes of web scraping.
• Techniques to web scrape.
• Ways to scrape.

Web scraping
• Also known as web harvesting or web data extraction; used for extracting data from websites.
• Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser.
• While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler.
• It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
Processes
1. Systematically find and download web pages.
2. Take the web pages and extract information from them.

Web scraping techniques
• Human copy-and-paste.
• Text pattern matching.
• HTTP programming.
• HTML parsing.
• DOM parsing.
• Vertical aggregation.
• Semantic annotation recognising.
• Computer vision web-page analysis.

Popular ways to scrape
• Web Scraper Chrome extension
• Import.io
• BeautifulSoup, Python
• APIs

Components of Web Scraping
• Manual web scraping requires knowledge of the chosen web page's structure.
• Writing code to extract data values based on that structure.
How to view a web page's structure
• Assume we are to scrape this web page: http://stackoverflow.com/questions/19957194/install-beautiful-soup-using-pip
• Open this page using Google Chrome, right mouse click and choose Inspect.
• The structure of most web pages is in HTML form. Here are good tutorials on HTML:
  • https://websitesetup.org/html-tutorial-beginners/
  • https://www.w3schools.com/tags/tag_div.asp

Web scraping with Python BeautifulSoup4
1. Install Anaconda3 in the C:\Anaconda3 folder.
2. Open the Windows command line.
3. To install BeautifulSoup, go to C:\Anaconda3\Scripts and type:
   C:\Anaconda3\Scripts>pip.exe install beautifulsoup4
   You will get:
   Requirement already satisfied: beautifulsoup4 in c:\anaconda3\lib\site-packages
4. To know if BeautifulSoup is successfully installed, type:
   C:\Anaconda3\Scripts>python
   You will get:
   Python 3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 11:57:41) [MSC v.1900 64 bit (AMD64)] on win32
   Type "help", "copyright", "credits" or "license" for more information.
   Then type:
   >>> import bs4
5. To grab a html page, type:
   >>> from urllib.request import urlopen as uReq
Web scraping with Python BeautifulSoup4
6. To parse html tags, call BeautifulSoup by typing:
   >>> from bs4 import BeautifulSoup as soup
7. To define the html page's url, type:
   >>> my_url = 'http://stackoverflow.com/questions/19957194/install-beautiful-soup-using-pip'
8. To check the contents of the my_url variable, type:
   >>> my_url
   You will get:
   'http://stackoverflow.com/questions/19957194/install-beautiful-soup-using-pip'
9. To open a connection to the web page and download it into our machine, type:
   >>> uClient = uReq(my_url)
10. To read the scraped contents, type:
    >>> page_html = uClient.read()
    Warning: do not view the contents at this point in time, because if the web page is huge, the command prompt will crash.
11. To close the connection, type:
    >>> uClient.close()
12. To parse the contents, type:
    >>> page_soup = soup(page_html, "html.parser")

Web scraping with Python BeautifulSoup4
13. To view the header of the contents, type:
    >>> page_soup.h1
    You will get:
    <h1 itemprop="name"><a class="question-hyperlink" href="/questions/19957194/install-beautiful-soup-using-pip">install beautiful soup using pip</a></h1>
14. To view any paragraph in the contents, type:
    >>> page_soup.p
    You will get:
    <p>I am trying to install BeautifulSoup using <code>pip</code> in Python 2.7. I keep getting an error message, and can't understand why.</p>
15. To know what elements and tags the webpage has, right mouse click in Chrome and choose Inspect.
16. To check tags in the content's <body>, type:
    >>> page_soup.body
    Again, not advisable unless the body is short.
17. To check what is in the <span>, type:
    >>> page_soup.body.span
    You will get:
    <span class="-img">Stack Overflow</span>
Web scraping with Python BeautifulSoup4
18. To focus on a particular part of the webpage, simply highlight the text and right mouse click to choose Inspect.
19. Identify the html class to be passed.
20. To parse a particular class into a variable, type:
    >>> containers = page_soup.findAll("div", {"class": "item-container"})
21. To check the length of a variable, type:
    >>> len(containers)
22. To read what is in the variable, type:
    >>> containers[0]
23. To grab the title from the following:
    <img alt="EVGA" class="lazy-img" data-effect="bla bla" title="EVGA">
    type:
    >>> container.div.div.a.img["title"]

Looping

Placing scraping result into a file

Closing a file
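A minimal sketch tying these three steps together, assuming the containers variable from step 20; the CSS classes and attribute paths below are hypothetical and must be adapted to the page seen via Inspect:

    # Loop over the scraped containers, write each result to a CSV file, then close it.
    # Assumes: containers = page_soup.findAll("div", {"class": "item-container"})
    filename = "products.csv"
    f = open(filename, "w")
    f.write("brand,product_name\n")  # CSV header row

    for container in containers:
        # Hypothetical selectors; adjust to the actual page structure.
        brand = container.div.div.a.img["title"]
        title_container = container.findAll("a", {"class": "item-title"})
        product_name = title_container[0].text
        # Replace commas inside a field so they don't break the CSV columns.
        f.write(brand + "," + product_name.replace(",", "|") + "\n")

    f.close()  # Closing flushes the buffered writes to disk.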

Analytics Methods
ITS480 BUSINESS DATA ANALYTICS
Prepared by: Ruhaila Maskat (PhD)
References: IngramMicroAdvisor, KDnuggets

By the end of this lesson, you should know:
• Categories of analytics methods.
• Methodology for data analytics.
• Popular analytics methods.
• Choosing analytical methods.

RECAP: Purpose of data analytics
• Support decision-making.
• Provide an advantage over competitors.
• Give insight into the future.
RECAP: Health care
[Diagram: Data → VALUE!]

Four Types of Analytics
• Prescriptive – This type of analysis reveals what actions should be taken. This is the most valuable kind of analysis and usually results in rules and recommendations for next steps.
• Predictive – An analysis of likely scenarios of what might happen. The deliverables are usually a predictive forecast.
• Diagnostic – A look at past performance to determine what happened and why. The result of the analysis is often an analytic dashboard.
• Descriptive – What is happening now based on incoming data. To mine the analytics, you typically use a real-time dashboard and/or email reports.

Prescriptive Analytics
Prescriptive analytics is really valuable, but largely not used. According to Gartner, 13 percent of organizations are using predictive analytics but only 3 percent are using prescriptive analytics. Where analytics in general sheds light on a subject, prescriptive analytics gives you a laser-like focus to answer specific questions.

For example, in the health care industry, you can better manage the patient population by using prescriptive analytics to measure the number of patients who are clinically obese, then add filters for factors like diabetes and LDL cholesterol levels to determine where to focus treatment. The same prescriptive model can be applied to almost any industry target group or problem.

Predictive Analytics
Predictive analytics use data to identify past patterns to predict the future.

For example, some companies are using predictive analytics for sales lead scoring. Some companies have gone one step further and use predictive analytics for the entire sales process, analysing lead source, number of communications, types of communications, social media, documents, CRM data, etc. Properly tuned predictive analytics can be used to support sales, marketing, or other types of complex forecasts.
Diagnostic Analytics
Diagnostic analytics are used for discovery or to determine why something happened.

For example, for a social media marketing campaign, you can use diagnostic analytics to assess the number of posts, mentions, followers, fans, page views, reviews, pins, etc. There can be thousands of online mentions that can be distilled into a single view to see what worked in your past campaigns and what didn't.

Descriptive Analytics
Descriptive analytics are valuable for uncovering patterns that offer insight.

A simple example of descriptive analytics would be assessing credit risk: using past financial performance to predict a customer's likely financial performance. Descriptive analytics can be useful in the sales cycle, for example, to categorize customers by their likely product preferences and sales cycle.

What do we search for in data analytics?
• Correlation: a technique for investigating the relationship between two quantitative, continuous variables, for example, age and blood pressure.
• Pattern: a repetitive characteristic.

Methodology for analytics, data mining, and data science projects
• Cross Industry Standard Process for Data Mining (CRISP-DM)
[Diagram: the CRISP-DM process]
Business understanding
• Understand the problem to be solved. This may require multiple iterations before an acceptable solution formulation appears.
• The design team should think carefully about the problem to be solved and about the use scenario. They must ask the questions:
  • What exactly do we want to do?
  • How exactly would we do it?
  • What parts of this use scenario constitute possible data mining models?
• It is common for a business problem to comprise several data mining tasks, whose combined results solve the problem.

Data understanding
• Data is the raw material from which the solution will be built.
• It is important to understand the strengths and limitations of the data, because rarely is there an exact match with the problem. For example, historical data are often collected for a different purpose.

Data preparation
• Often, data is not in the form that is required; hence, conversion is necessary to achieve a form that can help yield better results.
• Examples: converting data into tabular format, removing or inferring missing values, and converting data to different types.

Modelling
• The primary place where data mining techniques are applied to the data.
• Typically, the output is some sort of model or pattern capturing regularities in the data.
Evaluation
• The aim is to assess the data mining results and to gain confidence that the results are valid and reliable.
• Stakeholders would like to know if the proposed model is going to do more good than harm, or whether it would be catastrophic.
• Evaluating results of data mining includes both quantitative and qualitative assessments.
• These evaluation techniques are statistical in nature and thus not covered in this course.

Deployment
• Data mining results are put into real use in order to realise some return on investment. This involves implementing the proposed model.
• The observations from this stage may require an iteration back to the Business Understanding stage, where improvements and refinements to the model are made.

Popular analytics methods
• Classification and class probability estimation
• Regression
• Similarity matching
• Clustering
• Co-occurrence grouping
• Profiling

Case: MegaTelCo
The company has a major problem with customer retention in its wireless business. In the mid-Atlantic region, 20% of cell phone customers leave as soon as their contracts expire, and lately it has been getting increasingly difficult to acquire new customers. The cell phone market has become saturated, and telco companies are battling to attract each other's customers while retaining their own. Customers switching from one company to another is called "churn", and it is expensive all around.
Classification and class probability estimation
• Goal: to predict which class an individual belongs to.
• Question: among all the customers of MegaTelCo, which are likely to respond to a given offer?
• The individual is a customer.
• The classes are "will respond" and "will not respond".
• Classification task: a data mining model predicts which class an individual belongs to.
• Class probability estimation task: instead of predicting which class an individual belongs to, it predicts the "probability" that an individual belongs to each class. The probability comes as a score value.

[Figure: given an OFFER, one customer is scored WILL RESPOND with probability 80%, another WILL NOT RESPOND with probability 5%]

Regression
• Goal: to predict or estimate, for each individual, the numerical value of some variable.
• Question: how much will a given customer use the service?
• Task: predict the "service usage" property (variable) for a particular individual, typically by looking at other similar individuals in the population and their historical usage.

[Figure: how much service would she use? A LOT! vs. A LITTLE…]
Similarity matching
• Underlies other data mining tasks, such as classification, regression and clustering.
• Goal: to identify similar individuals based on data known about them. In other words, to find similar individuals.
• Among the most popular methods for making product recommendations (finding people who are similar to you when purchasing items).

Clustering
• Goal: to group individuals in a population together by their similarity, but not driven by any specific purpose.
• Question: do our customers form natural groups or segments?
• Useful in preliminary domain exploration, where natural groups may later suggest other data mining tasks or approaches.

[Figure: customer segments such as "texts occasionally", "calls for long hours", "only receives calls", "intensive data plan", "texts frequently", "seldom calls nor texts"]

Co-occurrence grouping
• Also known as frequent itemset mining, association rule discovery, or market-basket analysis.
• Goal: to find associations between individuals based on transactions involving them.
• Question: what items are commonly purchased together?
• Task: identify similarity of objects based on their "appearing" together in transactions.
• Example: people who bought X also bought Y.
[Figure: hungry people who bought PIZZA also bought NOODLES; therefore, always offer NOODLES to someone who bought PIZZA]

Profiling
• Also known as "behaviour description".
• Goal: to characterise the typical behaviour of an individual, group or population.
• Question: what is the typical cell phone usage of this customer segment?
• Task: requires a complex description of night and weekend airtime averages, international usage, roaming charges, text minutes, etc.

[Figures: Jane is a student; her service usage profile is recorded by her telco from January to April. Jack is a lecturer; his purchase profile is recorded by his credit card company from January to April. In April, each shows a mismatch that does not fit the profile: FRAUD ALERT!]
Which analytical method?
• Often, a data analyst must be able to propose one or multiple analytical methods to solve a business problem. However, this can be tricky. One way is by identifying if the business problem requires a supervised or an unsupervised data mining method, by determining if the question has a target/purpose for the grouping.

Q1: Do our customers naturally fall into different groups?
• Is there a target? No. Hence, use unsupervised methods: clustering, co-occurrence grouping, profiling, similarity matching.

Q2: Can we find groups of customers who have particularly high likelihoods of cancelling their service soon after their contracts expire?
• Is there a target? Yes: will a customer leave when her contract expires? Hence, use supervised methods: classification, regression.

Q2: Can we find groups of customers who have particularly high likelihoods of cancelling their service soon after their contracts expire?
• NOW you know that there is a target (will a customer leave when her contract expires?), so the task is to predict the target, which requires data labelled on the target.
• The NEXT thing that you need to know: classification or regression?
  • "Will this customer purchase service S1 if given incentive X?" – a classification question.
  • "Which service package (S1, S2 or none) will a customer likely purchase if given incentive X?" – a classification question.
  • "How much will this customer use the service?" – a regression question.
Analytics Methods
ITS480 BUSINESS DATA ANALYTICS
Prepared by: Ezzatul Akmal Kamaru Zaman

Model Evaluation
• Evaluation metrics: how can we measure accuracy? Other metrics to consider?
• Use a test set of class-labelled tuples instead of the training set when assessing accuracy.
• Methods for estimating a classifier's accuracy:
  • Holdout method, random subsampling
  • Cross-validation
  • Bootstrap
• Comparing classifiers:
  • Confidence intervals
  • Cost-benefit analysis and ROC curves

Classifier Evaluation Metrics: Accuracy & Error Rate

Confusion Matrix:
Actual class \ Predicted class | C1 | ~C1
C1 | True Positives (TP) | False Negatives (FN)
~C1 | False Positives (FP) | True Negatives (TN)

• True positives (TP): cases in which we predicted yes (they have the disease), and they do have the disease.
• True negatives (TN): we predicted no, and they don't have the disease.
• False positives (FP): we predicted yes, but they don't actually have the disease. (Also known as a "Type I error.")
• False negatives (FN): we predicted no, but they actually do have the disease. (Also known as a "Type II error.")
• Classifier accuracy, or recognition rate: the percentage of test set tuples that are correctly classified.
• Error rate: 1 – accuracy.

Example – Confusion Matrix:
• Accuracy: overall, how often is the classifier correct? (TP+TN)/total = (100+50)/165 = 0.91
• Misclassification rate: overall, how often is it wrong? (FP+FN)/total = (10+5)/165 = 0.09; equivalent to 1 minus accuracy; also known as the "error rate".
• True positive rate: when it's actually yes, how often does it predict yes? TP/actual yes = 100/105 = 0.95; also known as "sensitivity" or "recall".
• False positive rate: when it's actually no, how often does it predict yes? FP/actual no = 10/60 = 0.17
• Specificity: when it's actually no, how often does it predict no? TN/actual no = 50/60 = 0.83; equivalent to 1 minus the false positive rate.
• Precision: when it predicts yes, how often is it correct? TP/predicted yes = 100/110 = 0.91
• Prevalence: how often does the yes condition actually occur in our sample? actual yes/total = 105/165 = 0.64
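The worked example's numbers can be reproduced with a few lines of Python (a sketch, using the counts TP=100, FN=5, FP=10, TN=50 from the matrix above):

    # Compute the evaluation metrics for the worked confusion-matrix example.
    TP, FN, FP, TN = 100, 5, 10, 50
    total = TP + FN + FP + TN            # 165

    accuracy    = (TP + TN) / total      # 0.91
    error_rate  = (FP + FN) / total      # 0.09
    sensitivity = TP / (TP + FN)         # recall / true positive rate = 0.95
    specificity = TN / (TN + FP)         # 0.83
    precision   = TP / (TP + FP)         # 0.91
    prevalence  = (TP + FN) / total      # 0.64

    print(f"accuracy={accuracy:.2f}, sensitivity={sensitivity:.2f}, "
          f"precision={precision:.2f}")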
[Figures: sensitivity highlighted vs. specificity; precision highlighted vs. recall]

Equations
• sensitivity = recall = TP / P = TP / (TP + FN)
• specificity = TN / N = TN / (TN + FP)
• precision = TP / P′ = TP / (TP + FP)
(P = actual positives, N = actual negatives, P′ = predicted positives.)

Equations explanation
• Sensitivity/recall: how good a test is at detecting the positives. A test can cheat and maximize this by always returning "positive".
• Specificity: how good a test is at avoiding false alarms. A test can cheat and maximize this by always returning "negative".
• Precision: how many of the positively classified were relevant. A test can cheat and maximize this by only returning positive on the one result it's most confident in.
• The cheating is resolved by looking at both relevant metrics instead of just one. E.g. the cheating 100% sensitivity that always says "positive" has 0% specificity.
Classifier Evaluation Metrics: Sensitivity and Specificity
• Class Imbalance Problem: one class may be rare, e.g. fraud detection data, medical data; a significant majority of the negative class and a minority of the positive class.
• Sensitivity: the true positive recognition rate.
• Specificity: the true negative recognition rate.

Example:
Actual class \ Predicted class | cancer = yes | cancer = no | Total | Recognition (%)
cancer = yes | 90 (TP) | 210 (FN) | 300 (P) | 30.00 (sensitivity)
cancer = no | 140 (FP) | 9560 (TN) | 9700 (N) | 98.56 (specificity)
Total | 230 (P′) | 9770 (N′) | 10000 | 96.40 (accuracy)

• Sensitivity = 90/300 = 30%; Specificity = 9560/9700 = 98.56%
• Precision (exactness) = 90/230 = 39.13%; Recall (completeness) = 90/300 = 30.00%
• HIGH ACCURACY (>90%), yet the ability to classify the positive class is low, while the ability to classify the negative class is high.

Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods
• Holdout method
  • The given data is randomly partitioned into two independent sets: a training set (e.g., 2/3) for model construction and a test set (e.g., 1/3) for accuracy estimation. (Fig 1: Holdout Method)
  • Random subsampling: a variation of holdout; repeat holdout k times, and take accuracy = the average of the accuracies obtained.
• Cross-validation (k-fold, where k = 10 is most popular) (Fig 2: Cross-Validation Method)
  • Randomly partition the data into k mutually exclusive subsets, each of approximately equal size.
  • At the i-th iteration, use Di as the test set and the others as the training set.
  • Leave-one-out: k folds where k = the number of tuples; for small-sized data, one sample is left out for testing.
  • Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data.

Model Selection: ROC Curves
• ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models.
• Originated from signal detection theory.
• Shows the trade-off between the true positive rate and the false positive rate: the vertical axis represents the true positive rate, the horizontal axis the false positive rate, and the plot also shows a diagonal line.
• The area under the ROC curve is a measure of the accuracy of the model; a model with perfect accuracy will have an area of 1.0.
• Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list.
• The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model.
[Figure: Model 1 is better than Model 2. Why? Its curve lies further above the diagonal line, so its area under the curve is larger.]
Analytics Methods
ITS480 BUSINESS DATA ANALYTICS
Prepared by: Ruhaila Maskat (PhD)
Resource: Tableau

By the end of this lesson, you should know:
• What data visualisation is.
• Benefits of good data visualisation.
• Types of data visualisation.

What is data visualisation?
• Data visualization refers to the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization is an accessible way to see and understand trends, outliers, and patterns in data.
• In the world of Big Data, data visualization tools and technologies are essential to analyse massive amounts of information and make data-driven decisions.

Benefits of good data visualisation
• Our eyes are drawn to colors and patterns. We can quickly identify red from blue, square from circle. Our culture is visual, including everything from art and advertisements to TV and movies.
• Data visualization is another form of visual art that grabs our interest and keeps our eyes on the message. When we see a chart, we quickly see trends and outliers. If we can see something, we internalize it quickly. It's storytelling with a purpose. If you've ever stared at a massive spreadsheet of data and couldn't see a trend, you know how much more effective a visualization can be.
Benefits of good data visualisation
• As the "age of Big Data" kicks into high gear, visualization is an increasingly key tool to make sense of the trillions of rows of data generated every day. Data visualization helps to tell stories by curating data into a form easier to understand, highlighting the trends and outliers. A good visualization tells a story, removing the noise from data and highlighting the useful information.
• However, it's not simply as easy as just dressing up a graph to make it look better or slapping on the "info" part of an infographic. Effective data visualization is a delicate balancing act between form and function. The plainest graph could be too boring to catch any notice, or it may tell a powerful point; the most stunning visualization could utterly fail at conveying the right message, or it could speak volumes. The data and the visuals need to work together, and there's an art to combining great analysis with great storytelling.

Types of data visualisation
• When you think of data visualization, your first thought probably immediately goes to simple bar graphs or pie charts. While these may be an integral part of visualizing data and a common baseline for many data graphics, the right visualization must be paired with the right set of information. Simple graphs are only the tip of the iceberg. There's a whole selection of visualization methods to present data in effective and interesting ways.

General types of data visualization
• Charts
• Tables
• Graphs
• Maps
• Infographics
• Dashboards

More specific examples of methods to visualize data
• Area Chart
• Bar Chart
• Box-and-whisker Plots
• Bubble Cloud
• Bullet Graph
• Cartogram
• Circle View
• Dot Distribution Map
• Gantt Chart
• Heat Map
• Highlight Table
• Histogram
• Matrix
• Network
• Polar Area
• Radial Tree
• Scatter Plot (2D or 3D)
• Streamgraph
• Text Tables
• Timeline
• Treemap
• Wedge Stack Graph
• Word Cloud
Treemap
• Treemaps are a powerful and compact way to visualize hierarchical and part-to-whole relationships. Each branch of the tree is represented as a rectangle, with the size of a branch proportionate to a specified measure of the data. A lot of people like treemaps because they're visually attractive, so understanding how to leverage color is a plus. Color is often used to show dimensions in a treemap; heat maps work well if you want to show a spectrum.
• Example: https://public.tableau.com/views/CashonHand1/CashonHand?:embed=y&:loadOrderID=0&:display_count=yes

Learn how to build a treemap in Tableau
• https://www.tableau.com/learn/tutorials/on-demand/treemaps-word-clouds-and-bubble-charts-chart-type
Histogram
• Histograms plot the number of occurrences of a given variable in a set of data. They're a great tool for getting an overview of the entire distribution of a variable, and they take the form of a bar chart. Imagine using histograms for retail analytics, to count the number of sales of individual products by category. Or in customer analytics, to tally the range of spending in a certain demographic.

Learn how to build it in Tableau
• https://www.tableau.com/learn/tutorials/on-demand/histograms

Box plot
• A box plot (or box-and-whisker plot) is a diagram of a distribution of data best known for highlighting these values:
  • first quartile
  • median
  • third quartile
  • whiskers (1.5 times the interquartile range away from the mean)
  • outliers
• Box plots are useful for comparing sets of data, especially the variations in the data. They're a favorite of statisticians, and used commonly in statistical analytics. Tableau can plot hundreds of thousands of rows per second, so it can convey much more information than the standard box plot.

Learn how to build it in Tableau
• http://kb.tableau.com/articles/knowledgebase/box-plot-analog

Gantt chart
• Gantt charts are the enemy of procrastination, keeping those micro-deadlines between projects well in view. They're great for displaying a timeline such as project stages or a product release, to ensure you release the beta before the product.
• The viewer can instantly see when parts of a project begin and end in relation to each other, without having to cross-check between pages or sheets. Did you know? The first Gantt-type chart was developed in 1896, and was called a harmonogram. So all your departments can work in harmony.

Learn how to build it in Tableau
• http://onlinehelp.tableau.com/current/pro/online/mac/en-us/buildexamples_gantt.html

Word cloud
• Word clouds are like bubble charts in that words are sized according to some numerical measure and all packed into a designated space. They're useful for presenting data about (you guessed it) words. While word clouds are not the best for accurate interpretation, sometimes they add impact to a dashboard and encourage more people to engage with the data.

Learn how to build it in Tableau
• https://www.tableau.com/learn/tutorials/on-demand/treemaps-word-clouds-and-bubble-charts-chart-type-8

Lab work
• Go through each video on the following URL: https://www.tableau.com/learn/training

Natural Language Processing
ISP610 BUSINESS DATA ANALYTICS
Prepared by: Ezzatul Akmal Kamaru Zaman
References: Handbook Of Natural Language Processing; Coursera Basic Natural Language Processing

By the end of this lesson, you should know:
• Overview of Natural Language Processing (NLP)
• Applications of NLP
• Stages of NLP

What is Natural Language?
• Language used for everyday communication: English, Chinese, Tamil, Español, Malay.
• Not an artificial computer language (Python, C++).
• The language we use in short text messages (c u 2nite) or on tweets is also, by this definition, natural language.
What is Natural Language Processing?
• Natural Language Processing (NLP) is the study of the computational treatment of natural (human) language.
• In simpler words, teaching computers:
  • how to understand what words mean
  • how to generate human language by understanding how sentences are constructed
• Natural language evolves:
  • New words get added (google, selfie).
  • Old words lose popularity (thou).
  • Meanings of words change (words such as "learn" in Old English used to mean "teach").
  • Language rules change (the position of the verb).

Application of NLP
• Machine translation
  • Translation systems: Google Translate, Yahoo! Babel Fish.
• Database access
• Information retrieval
  • Selecting from a set of documents the ones that are relevant to a query: Gmail, search engines.
• Text categorization
  • Sorting text into fixed topic categories.

Application of NLP
• Search engines: Google, Bing, Yahoo, Ask.
• Extracting data from text: converting unstructured text into structured data.
• Spoken language control systems: natural language assistants (Apple's Siri).
• Question-answering systems, where natural language is used to query a database (for example, a query system to a personnel database).
• Spelling and grammar checkers: Grammarly.

Generic NLP Architecture
[Diagram]
November 18 6 November 18 7
Stages of NLP
Phonetics & Phonology → Morphological Analysis → Syntactic Analysis → Lexical Analysis → Semantic Analysis → Discourse Integration → Pragmatic Analysis

1) Phonetics & Phonology:
• Phonetics:
  • Pronunciation by different speakers.
  • Deals with the physical building blocks of a language's sound system.
  • Pace of speech.
  • Examples: "I ate eight cakes"; the different 'k' sounds in 'kite' and 'coat'; "That band is banned."
• Phonology:
  • Processing of speech.
  • Organization of speech sounds within a language.
  • Example: bank (finance) vs. bank (river).

2) Morphological Analysis:
• Morphology is the structure of words: the various forms of a basic word; making more words from less.
• Example: consider a word like "unhappiness". This has three parts; there are three morphemes, each carrying a certain amount of meaning. "Un" means "not", while "ness" means "being in a state or condition". "Happy" is a free morpheme because it can appear on its own (as a "word" in its own right). Bound morphemes have to be attached to a free morpheme, and so cannot be words in their own right. Thus you can't have sentences in English such as "Jason feels very un ness today".

3) Syntactic Analysis:
• Concerned with the construction of sentences.
• Indicates how the words are related to each other.
• A syntax tree is assigned by a grammar and a lexicon.
• Explanation: a sentence is constructed from a noun phrase (NP) and a verb phrase (VP). A noun phrase is constructed from an article (art) and a noun (n); a verb phrase from a verb (v) and a noun phrase.
4) Lexical Analysis:
• Obtaining the properties of a word.
• Example: for 'dog' you can easily bring up an image of a dog and its properties, like four legs, carnivore, animate. These properties also match other animals, like the lion.

5) Semantic Analysis:
• Concerned with the meaning of language.
• The first step in any semantic processing is to look up the individual words in the dictionary and extract their meanings.
• Example: the sentence "you have colorless green ideas…" would be rejected as semantically anomalous, because "colorless" and "green" together make no sense.

6) Discourse Integration:
• The meaning of any sentence depends upon the meaning of the sentence just before it.
• Example: "Bill had a red balloon. John wanted it." ("It" refers to the balloon.)

7) Pragmatic Analysis:
• Understanding text and dialogues.
• Derives knowledge from external common-sense information.
• Example: "Do you know what time it is?" does not mean the speaker is asking whether you know the time; we should understand what to do.
Ambiguity in NLP
• Ambiguity can be referred to as the ability of having more than one meaning, or of being understood in more than one way.
• Natural languages are ambiguous, so computers are not able to understand language the way people do.
• Natural Language Processing (NLP) is concerned with the development of computational models of aspects of human language processing.
• Ambiguity can occur at various levels of NLP: lexical, syntactic, semantic, pragmatic, etc.
• If an expression (word/phrase/sentence) has more than one interpretation, we can refer to it as ambiguous.
• E.g. consider the sentence "The chicken is ready to eat." The interpretations can be: the chicken (bird) is ready to be fed, or the chicken (food) is ready to be eaten.
• Consider another sentence: "There was not a single man at the party." The interpretations can be: a lack of bachelors at the party, or a lack of men altogether.

References
• http://www.ijircce.com/upload/2014/sacaim/59_Paper%2027.pdf

Text Analysis
ITS480 BUSINESS DATA ANALYTICS
Prepared by: Ruhaila Maskat (PhD)
References: EMC Data Science module; Wikipedia
By the end of this lesson, you should know:
• A specific form of text analysis, i.e. sentiment analysis/buzz tracking.
• The steps and processes in sentiment analysis.

Text analysis
• The processing and representation of data that is in text form, for the purpose of analysing and learning new models from it.
• The main challenge in text analysis is the problem of high dimensionality: every possible word in a document represents a dimension.
• Example: the book 'Green Eggs and Ham' by Dr. Seuss has just fifty different words, hence 50 dimensions.
• Another challenge of text analysis is that the data is unstructured.

Buzz tracking
• The monitoring of consumer responses to commercial services and products in order to establish the marketing buzz surrounding a new or existing offer.
• Similar to media monitoring, it is becoming increasingly popular as a base for strategic insight development alongside other forms of market research.
• Involves the checking and analysis of myriad online sources such as internet forums, blogs, and social networks.

Buzz tracking
• Implemented by businesses for a variety of reasons, namely to improve efficiency and reaction times and to identify future opportunities.
• Insights gained can help guide marketing and communications, identify positive and negative customer experiences, assess product and service demand, tackle crisis management, round off competitor analysis, establish brand equity and predict market share.
• Brand equity is the commercial value that derives from consumer perception of the brand name of a particular product or service, rather than from the product or service itself.
Buzz tracking: the steps and processes

Steps → Processes:
1. Monitor social networks and review sites for mentions of our products. → Parse the data feeds to get actual content; find and filter the raw text for product names (use regular expressions).
2. Collect the reviews. → Extract the relevant raw text; convert the raw text into a suitable document representation; index into our review corpus.
3. Sort the reviews by product. → Classification (or "topic tagging").
4. Determine the type of review (good or bad). → Classification (sentiment analysis).
5. Marketing calls up and reads selected reviews in full, for greater insight. → Search/information retrieval.

Step 1: Monitor social networks, review sites for mentions
• Parsing
  • To resolve a sentence into component parts of speech and explain the syntactical relationship (Merriam-Webster).
  • The aim is to impose structure, typically on semi-structured data, e.g. HTML pages, RSS feeds.
  • The structure must be enough to find the parts of the raw text that we really care about: the actual content of the review, titles, the date of the review.
  • The output is a collection of phrases and words that speaks of the product of interest.

Example RSS feed:
<channel>
  <title>All about Phones</title>
  <description>My Phone Review Site</description>
  <link>http://www.phones.com/link.html</link>
  <item>
    <title>bPhone: The best!</title>
    <description>I love love love my bPhone!</description>
    <link>http://www.phones.com/link.htm</link>
    <pubDate>Tue, 29 Aug 2011 09:00:-- -0400</pubDate>
  </item>
</channel>

Step 1: Monitor social networks, review sites for mentions
• Regular expressions
  • A popular technique used for finding words, strings or a particular pattern in text.
  • The basic use is to determine if the regular expression matches a string.
  • With regular expressions we can take into account capitalisation (or the lack of it), common misspellings, common abbreviations, etc.
Example:
Regular expression → Matches → Note
• B[P|p]hone → bPhone, bphone → the pipe "|" means "or"
• bEb*k → bEbook, bEbk, bEback… → "*" is a wildcard, matches anything
• ^I love → a line starting with "I love" → "^" means the start of a string
• Acme$ → a line ending with "Acme" → "$" means the end of a string

Step 2: Collect the reviews
• Extract and represent text
  • Aim: to represent our collection of phrases and words in a structured manner for downstream analysis, and to calculate the number of times a term occurs.
  • A common representation is the "bag of words": a vector with one dimension for every unique term in the space.
  • However, this results in a VERY high dimensional structure.
  • To produce a bag of words, count the occurrences of each word in the parsed text and store the word counts.

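A minimal Python sketch of producing a bag of words. The review is taken from the feed above; the small fixed vocabulary is an assumption for illustration:

    from collections import Counter

    # Fixed term space: one dimension per unique term (hypothetical here).
    vocab = ["acme", "bebook", "bphone", "fantastic",
             "love", "slow", "terrible", "terrific"]

    review = "I love love love my bPhone!"

    # Count occurrences of each word in the parsed text.
    counts = Counter(word.strip("!.,").lower() for word in review.split())

    # The bag of words: a vector of term frequencies over the vocabulary.
    bag = [counts[term] for term in vocab]
    print(dict(zip(vocab, bag)))   # e.g. {'bphone': 1, 'love': 3, ...}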
Bag of words

  Term        Frequency
  Acme        0
  Bebook      0
  Bphone      1
  Fantastic   0
  Love        2
  Slow        0
  Terrible    0
  Terrific    0

Reducing high dimensionality

• Remove "stop" words, e.g. "the", "a", etc. An example stop-word list: http://www.lextek.com/manuals/onix/stopwords1.html
• Stemming: the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form.
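A minimal sketch of both reductions in Python, assuming the NLTK library is installed for stemming; the tiny stop-word list is a stand-in for a full list such as the one linked above:

    from nltk.stem import PorterStemmer

    # Tiny illustrative stop-word list; real lists run to hundreds of words.
    stop_words = {"the", "a", "i", "my", "is", "it", "and"}
    stem = PorterStemmer().stem

    tokens = ["i", "love", "loving", "my", "bphone", "and", "the", "bebook"]

    # Drop stop words, then reduce the surviving words to their stems.
    reduced = [stem(t) for t in tokens if t not in stop_words]
    print(reduced)   # ['love', 'love', 'bphone', 'bebook']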
Step 2: Collect the reviews
• Document representation – other features
• “Feature” is anything about the document that is used for search or analysis.
• Title
• Keywords or tags
• Date information
• Source information
• Named entities
• Features help with downstream analysis in text classification.
Step 2: Collect the reviews

• Representing a corpus
  • A corpus is a collection of documents.
  • Why represent a corpus? Because we want to archive the documents yet still be able to search them for future reference and research.
  • Reverse indexing (an inverted index) provides a way of keeping track of the list of all documents that contain a specific feature, for every possible feature.
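A minimal Python sketch of a reverse (inverted) index, using terms as the features; the three short documents are hypothetical:

    from collections import defaultdict

    docs = {
        1: "i love my bphone",
        2: "the bebook is terrible",
        3: "love the bebook",
    }

    # For every feature (here, a term), track the documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(text.split()):
            index[term].add(doc_id)

    print(sorted(index["bebook"]))   # [2, 3] -- documents with 'bebook'
    print(sorted(index["love"]))     # [1, 3]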
Step 2: Collect the reviews

• Common perception
  • Documents are often only relevant in the context of a corpus, or a specific collection of documents. Hence, classifiers need to be trained on a specific set of documents, and any change to the corpus requires retraining of a classifier.
• Challenge
  • A corpus changes constantly over time: not only do new documents get added, but word distributions can change over time. This can reduce the effectiveness of classifiers and filters if they are not retrained, e.g. spam filters.

Step 3: Sort the reviews

• Once all reviews have been collected and represented, we want to sort them by the subject of interest, i.e. the product/service.
• Examples:
  “The bphone-5x has coverage everywhere. It’s much less flaky than my old bPhone-4G”.
  “While I love Acme’s bPhone series, I’ve been quite disappointed by the bEBook. The text is illegible, and it makes even the Kindle look blazingly fast”.
“The bphone-5x has coverage everywhere. It’s much less flaky than my old bPhone-4G”
Review on the bphone-5X or on the bPhone-4G?

“While I love Acme’s bPhone series, I’ve been quite disappointed by the bEBook. The text is illegible, and it makes even the Kindle look blazingly fast”
Review on the bPhone, the bEBook or the Kindle?
Text classification

• To sort reviews, we need to classify them, typically by topic tagging.
• Topic tagging often involves having a team of human users determine the classification of a review and tag it accordingly. This answers questions such as:
  • Is this review about the bPhone, the bEBook or the Kindle?
  • Is this review about the bphone-5X or the bPhone-4G?
• Some rules for topic tagging:
  • If the product is mentioned in the title, then the review is likely to be about the product.
  • If the mentions are only in the contents, the review may or may not be related to the product.
  • A tweet is more likely to be about the product than a forum post, because a forum review may compare different products.
  • More frequent mentions of the product may indicate that the review is relevant.

Step 4: Determining type of review (good or bad)

• Another text classification task is done at this step, but here it involves determining if a review is good (positive) or bad (negative).
• Commonly-used classifiers include Naïve Bayes and Support Vector Machine (SVM).
• A major bottleneck of this step is the need for tagged training data. Two approaches to overcome this:
  • Have humans identify good and bad reviews.
  • Utilise a sentiment dictionary.
Sentiment wordlists

• http://www.wjh.harvard.edu/~inquirer/homecat.htm
• http://www.wjh.harvard.edu/~inquirer/Positiv.html
• http://www.wjh.harvard.edu/~inquirer/Negativ.html
• http://provalisresearch.com/Download/WSD.zip
• http://www3.nd.edu/~mcdonald/Word_Lists.html
• http://sentiwordnet.isti.cnr.it/
• http://mpqa.cs.pitt.edu/
• http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

Training a classifier

• Aim: to determine the polarity of a piece of review.
• Polarity describes a review’s “negative”, “positive” or “neutral” content.
• Polarity confidence: how certain the classifier is of the assigned polarity.
Training a classifier: Naïve Bayes

[Figure: a tagged training set of six reviews and the word probabilities derived from it. For example, four of the six positive reviews contain the word “love”, so the probability of a positive review having the word “love” is 4/6.]
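A minimal Naïve Bayes sketch in Python. The tagged training set below is a hypothetical stand-in for the slide's figure, chosen so that four of the six positive reviews contain "love" (matching the 4/6 probability above); the Bernoulli-style word-presence model and add-one smoothing are also assumptions:

    import math
    from collections import defaultdict

    # Hypothetical tagged training set standing in for the slide's figure:
    # four of the six positive reviews contain "love", giving P = 4/6.
    train = [
        ("i love love my bphone", "pos"), ("love the screen", "pos"),
        ("just love it", "pos"),          ("love this phone", "pos"),
        ("battery lasts long", "pos"),    ("great camera", "pos"),
        ("terrible and slow", "neg"),     ("screen is terrible", "neg"),
        ("slow battery drain", "neg"),
    ]

    # Per class: number of documents, and per-word document counts.
    n_docs = defaultdict(int)
    doc_freq = defaultdict(lambda: defaultdict(int))
    for text, label in train:
        n_docs[label] += 1
        for word in set(text.split()):
            doc_freq[label][word] += 1

    print(doc_freq["pos"]["love"] / n_docs["pos"])   # 4/6, as on the slide

    def classify(text):
        """Bernoulli-style Naive Bayes with add-one smoothing."""
        total = sum(n_docs.values())
        scores = {}
        for label in n_docs:
            score = math.log(n_docs[label] / total)          # class prior
            for word in set(text.split()):
                p = (doc_freq[label][word] + 1) / (n_docs[label] + 2)
                score += math.log(p)                         # P(word | label)
            scores[label] = score
        return max(scores, key=scores.get)

    print(classify("i love the camera"))   # -> pos
    print(classify("slow and terrible"))   # -> neg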
Step 5: Search/Information retrieval
• After sentiment/buzz has been analysed, marketing and sales
personnel would then want to search and retrieve relevant reviews
from the corpus.
• From the reviews, sales and marketing personnel may gain insight
about the product/service offered.
• Search by relevance is often made possible using Term Frequency-
Inverse Document Frequency (TF-IDF).
• Each search query is parsed into terms/words, e.g. “s8 camera quality” will be parsed into “s8”, “camera” and “quality”.
Term frequency – inverse document frequency

• A weight-based metric to identify reviews/documents relevant to some query terms.
• The underlying idea of TF-IDF is that rare terms are weighted higher than common terms. In other words, rare terms are regarded as more important than common terms due to their discriminating nature.
• Consists of two parts: term frequency and inverse document frequency.
• Term frequency (tf): the number of times a term is found in a document over the total number of terms in the document.
• Document frequency (df): the number of documents with term t in them.
• Inverse document frequency (idf): the logarithm of the corpus size over the document frequency, which indicates the rarity of a term.

  idf = log( (Size of corpus) / df )
Combining TF and IDF

[Figure: a worked example scoring two documents against the query terms “this” and “example”, where “example” is the RARE term.]

While document frequency only provides information about how many documents contain a particular term, IDF provides information about how rare the term is across the document corpus. IDF therefore provides a measure of relevance that DF does not.
Scoring TF-IDF

• To know which document is more relevant to the query terms, take the sum of the TF-IDF of each searched term.
• Document 1: 0 + 0 = 0
• Document 2: 0 + 0.13 = 0.13 (more relevant)
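A minimal Python sketch that reproduces these scores under the definitions above (log base 10). The two toy documents are assumptions chosen to match the figures:

    import math

    docs = [
        "this is a a sample",                               # Document 1
        "this is another another example example example",  # Document 2
    ]
    query = ["this", "example"]

    def tf(term, doc):
        """Term count over the total number of terms in the document."""
        words = doc.split()
        return words.count(term) / len(words)

    def idf(term):
        """log10(corpus size / document frequency)."""
        df = sum(1 for d in docs if term in d.split())
        return math.log10(len(docs) / df) if df else 0.0

    # Relevance score: sum of TF-IDF over the query terms.
    # "this" occurs in both documents, so its idf (and score) is 0;
    # "example" occurs 3 times in Document 2's 7 words: (3/7) * log10(2).
    for i, doc in enumerate(docs, 1):
        score = sum(tf(t, doc) * idf(t) for t in query)
        print(f"Document {i}: {score:.2f}")   # 0.00 and 0.13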
MapReduce & Hadoop

ITS480 BUSINESS DATA ANALYTICS
Prepared by: Ruhaila Maskat (PhD)
References:
Ques10
EMC Data Scientist Associate

Hadoop Ecosystem
What is Hadoop?
• Hadoop is a framework for dealing with Big Data, but unlike other frameworks it is not a single, simple tool: it has its own family of components for processing different things, tied together under one umbrella called the Hadoop Ecosystem.
HDFS (Hadoop Distributed File System)
- A way to store data in a distributed manner in order to compute fast.
- Saves data in blocks of 64 MB (the default) or 128 MB, which is a logical splitting of the data.

SQOOP (SQL + HADOOP = SQOOP)
- Imports structured data from tables (RDBMS) into HDFS.
- A file is created in HDFS which contains the data, where it can be processed by MapReduce, Hive or Pig.
- Processed data in HDFS can be stored back into another table in the RDBMS (export).
MapReduce Framework
- A method of programming over distributed data stored in HDFS.
- Can be written using many languages, such as Java, C++ (Pipes), Python, Ruby, etc.
- Can be applied to any type of data, whether structured or unstructured. Example: word count using MapReduce.
- Suited to "embarrassingly parallel" problems, where a single task can be divided into smaller tasks and later recombined into a single output.
- The MAP function divides a big task into smaller tasks to be processed on different machines. Its output should be in the form of key, value pairs.
- In the word-count case, the MAP function counts the words in each document by placing a document on a machine. The key is a word, and the value is the count.
- The REDUCE function recombines multiple small results into a single result.
- In the word-count case, the REDUCE function takes the counts of words found in each document and totals them up to produce the overall number of words counted.
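A minimal word-count sketch of the MAP and REDUCE idea, simulated in plain Python on a single machine; this is not actual Hadoop code:

    from collections import defaultdict

    def map_phase(doc):
        """MAP: emit a (key, value) pair for every word in one document."""
        return [(word, 1) for word in doc.split()]

    def reduce_phase(pairs):
        """REDUCE: recombine the small per-word counts into one result."""
        totals = defaultdict(int)
        for word, count in pairs:
            totals[word] += count
        return dict(totals)

    docs = ["i love my bphone", "love the bebook", "my bebook is slow"]

    # In Hadoop, map outputs are shuffled by key to the reducers; here we
    # simply concatenate all the emitted pairs.
    pairs = [pair for doc in docs for pair in map_phase(doc)]
    print(reduce_phase(pairs))   # {'love': 2, 'my': 2, 'bebook': 2, ...}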
HBASE
- A non-relational (NoSQL) database that runs on top of HDFS.
- Created for large tables which have billions of rows and millions of columns, with fault-tolerance capability and horizontal scalability; based on Google Big Table.
- Hadoop itself can perform only batch processing, with data accessed in a sequential manner; HBase runs on top of HDFS to support random access.
Hive
- For SQL-literate people.
- Mainly deals with structured data stored in HDFS.
- Provides a specialised query language called HQL (Hive Query Language).
- Also runs MapReduce programs in the backend to process data in HDFS.

Pig Latin
- Also deals with structured data.
- For programmers who love scripting and don't want to use Java/Python or SQL to process data.
- A Pig Latin program is made up of a series of operations, or transformations, applied to the input data; it runs MapReduce programs in the backend to produce output.
Mahout
- An open source machine learning library from Apache, written in Java.
- The algorithms it implements fall under the broad umbrella of machine learning or collective intelligence.
- Primarily recommender engines (collaborative filtering), clustering, and classification.
- The machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine.

Oozie
- A workflow scheduler system to manage Hadoop jobs.
- A server-based workflow engine specialised in running workflow jobs with actions that run Hadoop MapReduce and Pig jobs.
- Implemented as a Java web application that runs in a Java servlet container.
- Used when a programmer wants to run many jobs in a sequential manner, e.g. the output of job A is the input to job B, the output of job B is the input to job C, and the final output is the output of job C.
Zookeeper
- A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
- In case of any partial failure, clients can connect to any node and be assured that they will receive the correct, up-to-date information.