ISP610 BUSINESS DATA ANALYTICS
Data Sources
• These data come from multiple sources, including:
• Medical information, such as genomic sequencing and MRIs.
• Increased use of broadband on the Web – including the 2 billion photos that Facebook users currently upload each month, as well as the innumerable videos uploaded to YouTube and other multimedia sites.
• Video surveillance.
• Increased global use of mobile devices – the torrent of texting is not likely to cease.
• Smart devices – sensor-based collection of information from smart electric grids, smart buildings and many other public and industry infrastructure.
• Non-traditional IT devices – including the use of RFID readers, GPS navigation systems, and seismic processing.

The Value of Data
• Everyone and everything is leaving a digital footprint. The graph shows the different forms of data being generated by new applications and the scale and growth rate of the data. By analysing these immense data, organisations can reap value.
• Industry case studies:
• Health care – Reducing cost of care
• Public services – Preventing pandemics
• Life sciences – Genomic mapping
Competitive Advantage
• To a profit-making organisation, the value of data comes in the form of an advantage over its competitors.
• According to Bain Research, top-performing organisations tend to make decisions based on what their data tells them. By having a good data basis to work on, these organisations tend to make decisions faster.
What’s Driving Analytics in Organisations?
• More than just OLTP and MIS reporting.
• Rather than doing standard reporting on these areas, organisations can apply advanced analytical techniques to optimise processes and derive more value from these typical tasks.

Analytics
• Analytics examines large amounts of data to uncover hidden patterns, correlations and other insights.
• Analytics helps organisations make more accurate decisions when faced with problems.
• Analytics helps organisations harness their data and use it to identify new opportunities. That, in turn, leads to smarter business moves, more efficient operations, higher profits and happier customers.
Skill Set
• Quantitative skills, such as mathematics or statistics.
• Technical aptitude, such as software engineering, machine learning and programming skills.
• Sceptical. This may be a counterintuitive trait, but it is important that data scientists can examine their work critically rather than in a one-sided way.
• Curious & creative. Must be passionate about data and about finding creative ways to solve problems and portray information.
• Communicative & collaborative. It is not enough to have strong quantitative or engineering skills. To make a project resonate, you must be able to articulate the business value in a clear way, and work collaboratively with project sponsors and key stakeholders.

Important Questions to Ask Your Customer
• What is the business problem you’re trying to solve?
• What is your desired outcome?
• Will the focus and scope of the problem change if the following dimensions change:
• Time
• People
• Risk
• Resources
• Size and attributes of data
Unstructured Data
ISP610 BUSINESS DATA ANALYTICS
Prepared by: Ruhaila Maskat (PhD)
References: Wikipedia, SearchDataManagement, 3pillarglobal

By the end of this lesson, you should know:
• What NoSQL databases are.
• How they differ from SQL databases.
• Types of NoSQL databases.
NoSQL vs SQL

NoSQL:
1. Non-relational model.
2. Stores data in JSON, key/value, graphs, columns.
3. New properties can be added on the fly (see the sketch after this comparison).
4. Good for semi-structured, complex or nested data.
5. Relationships are captured by denormalising data and presenting all data for an object in a single record.
6. Dynamic/flexible schema.

SQL:
1. Relational model.
2. Stores data in a table.
3. Adding a new property may require altering schemas.
4. Good for structured data.
5. Relationships are captured in a normalised model, using joins to resolve references across tables.
6. Strict schema.
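To make points 2 and 3 concrete, here is a small illustrative sketch in Python terms (the book document is borrowed from the document-database example later in these notes; this is not any particular product's API):

```python
# NoSQL-style document: a self-contained record whose properties
# can be added on the fly, with no schema change.
doc = {"_id": "978", "Title": "Data Science"}
doc["Author"] = ["William Jackson", "Ben Ten"]  # new nested property, added freely

# SQL-style row: columns are fixed by the table schema, e.g. (id, title).
# Adding an "Author" column would first require an ALTER TABLE.
row = ("978", "Data Science")
```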
NoSQL model
In general:
• One query.
• No JOINs.
• No schema is maintained.
(Figure: the same data in a SQL model vs a NoSQL model.)
Key-value database
• The most basic and a backbone implementation of NoSQL.
• Underlying it is a hash table which consists of unique keys, each pointing to a specific item of data.
• Works by matching keys with values, like a dictionary (see the sketch below).
• Give a key (e.g. the_answer_to_life) and receive the matching value (e.g. 42).
• The database is a global collection of key-value pairs.
• As the volume of data increases, maintaining unique values as keys may become more difficult.
• Examples: Riak, Amazon S3 (Dynamo), Oracle NoSQL.
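A minimal sketch of these semantics, using a plain Python dict as a stand-in (real stores such as Riak or Oracle NoSQL expose get/put operations over a network; the keys here are made up):

```python
# A key-value store behaves like a dictionary: unique keys point to values.
store = {}

# put: associate a unique key with an opaque value
store["the_answer_to_life"] = 42
store["user:1001"] = '{"name": "Ana", "plan": "unlimited"}'

# get: hand over a key, receive the matching value
print(store["the_answer_to_life"])   # -> 42

# the store imposes no schema on what the values contain
```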
Column/BigTable
• Advances the simple nature of the key/value model.
• Does not require a pre-structured table to work with the data.
• Works by creating collections of one or more key/value pairs.
• Two-dimensional arrays whereby each key has one or more key/value pairs attached to it.
• Two groups: column-store and column-family store.
• Column-family store: Bigtable, HBase, Hypertable, and Cassandra.
• Column-store: Sybase IQ, C-Store, Vertica, VectorWise, MonetDB, ParAccel and Infobright.
(Figures: position-based and rowid-based column-store layouts, and a column-family layout, each mapping KEY to VALUE.)
• The outermost keys 3PillarNoida, 3PillarCluj, 3PillarTimisoara and 3PillarFairfax are analogous to rows.
Collections should make sense, e.g. books, webstore, retail store, fruits. Hence, a document database is unstructured and schemaless.
Document database
• We can have a more complicated structure. Example (the Author field holds a list of values):

{
  _id : “978”,
  “Title” : “Data Science”,
  “Author” : [“William Jackson”, “Ben Ten”]
}

Use cases
• Event logging
• Blogs and website content management
• Web analytics or real-time analytics
• E-commerce applications, e.g. shopping cart.
Graph database

Unstructured Data
ISP610 BUSINESS DATA ANALYTICS
Prepared by: Ruhaila Maskat (PhD)
References: Wikipedia, SearchDataManagement, 3pillarglobal
Modelling techniques
• Referencing documents.
• Embedding documents.
• Denormalisation.
• Heterogeneous collection.
Referencing documents
• You can reference another document using the document key. This is similar to normalisation in a relational DB.
• Referencing enables document databases to cache, store and retrieve the documents independently.
• Provides better write speed/performance.
• Reading may require more round trips to the server.

Example (the speakers array references the two speaker documents by id):
{
  sessionId : session1,
  sessionName : Document modelling,
  speakers : [{ id: 1 }, { id: 2 }]
}
{
  _id : 1,
  name : Ryan,
  thumbnailUrl : ….,
  shortProfile : ….
}
{
  _id : 2,
  name : David,
  thumbnailUrl : ….,
  shortProfile : ….
}
1-to-many relationships (unbounded)
Many-to-many relationships
(Figures: examples of lower-volatility vs greater-volatility data.)
Embedding documents
Example (the speaker documents are embedded inside the session document):
{
  sessionId : session1,
  sessionName : Document modelling,
  speakers : [
    { _id : 1,
      name : Ryan,
      thumbnailUrl : ….,
      shortProfile : ….
    },
    { _id : 2,
      name : David,
      thumbnailUrl : ….,
      shortProfile : ….
    }]
}

Embedding can be advantageous when:
• Two data items are often queried together.
• One data item is dependent on another.
• 1:1 relationship.
• Similar volatility (speed of change or update).
Two data items are often queried together
One data item is dependent on another
(Figure: e.g. order lines are dependent on their Order.)
Normalised vs denormalised
(Figure: the same query against a normalised model and a denormalised model.)
• Normalised: the query would require three different queries over three different collections.
• Denormalised:
• Requires updates in multiple places.
• Provides faster read speed.
Heterogeneous collections
• Multiple types in a single collection.

Unstructured Data
ISP610 BUSINESS DATA ANALYTICS
Prepared by: Ruhaila Maskat (PhD)
References: Datasciencedojo, Wiki
How to view a web page’s structure
• Assume we are to scrape this web page: http://stackoverflow.com/questions/19957194/install-beautiful-soup-using-pip
• Open this page using Google Chrome, right-click and choose Inspect.
• The structure of most web pages is in HTML form. Here are good tutorials on HTML:
• https://websitesetup.org/html-tutorial-beginners/
• https://www.w3schools.com/tags/tag_div.asp
Web scraping with Python BeautifulSoup4
1. Install Anaconda3 in the C:\Anaconda3 folder.
2. Open the Windows command line.
3. To install BeautifulSoup, go to C:\Anaconda3\Scripts. Type,
C:\Anaconda3\Scripts>pip.exe install beautifulsoup4
You will get,
Requirement already satisfied: beautifulsoup4 in c:\anaconda3\lib\site-packages
4. To know if BeautifulSoup is successfully installed, type,
C:\Anaconda3\Scripts>python
You will get,
Python 3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 11:57:41) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Then type,
>>> import bs4
5. To grab an HTML page, type,
>>> from urllib.request import urlopen as uReq
6. To parse HTML tags, call BeautifulSoup by typing,
>>> from bs4 import BeautifulSoup as soup
7. To define the HTML page’s URL, type,
>>> my_url = 'http://stackoverflow.com/questions/19957194/install-beautiful-soup-using-pip'
8. To check the contents of the my_url variable, type,
>>> my_url
You will get,
'http://stackoverflow.com/questions/19957194/install-beautiful-soup-using-pip'
9. To open a connection to the web page and download it into our machine, type,
>>> uClient = uReq(my_url)
10. To read the scraped contents, type,
>>> page_html = uClient.read()
Warning: do not view the contents at this point in time, because if the web page is huge, the command prompt will crash.
11. To close the connection, type,
>>> uClient.close()
12. To parse the contents, type,
>>> page_soup = soup(page_html, "html.parser")
13. To view the header of the contents, type,
>>> page_soup.h1
You will get,
<h1 itemprop="name"><a class="question-hyperlink" href="/questions/19957194/install-beautiful-soup-using-pip">install beautiful soup using pip</a></h1>
14. To view any paragraph in the contents, type,
>>> page_soup.p
You will get,
<p>I am trying to install BeautifulSoup using <code>pip</code> in Python 2.7. I keep getting an error message, and can't understand why.</p>
15. To know what elements and tags the webpage has, right-click in Chrome and choose Inspect.
16. To check tags in the content’s <body>, type,
>>> page_soup.body
Again, not advisable unless the body is short.
17. To check what is in the <span>, type,
>>> page_soup.body.span
You will get,
<span class="-img">Stack Overflow</span>
18. To focus on a particular part of the webpage, simply highlight the text and right-click to choose Inspect.
19. Identify the HTML class to be passed.
20. To parse a particular class into a variable, type,
>>> containers = page_soup.findAll("div", {"class": "item-container"})
21. To check the length of the variable, type,
>>> len(containers)
22. To read what is in the variable, type,
>>> containers[0]
23. To grab the title from the following,
<img alt="EVGA" class="lazy-img" data-effect="blab la" title="EVGA">
type,
>>> container.div.div.a.img["title"]
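Putting steps 5–21 together, a minimal end-to-end sketch as a single script (same URL and variable names as above; note that the item-container class in step 20 comes from a product-listing page, so on the Stack Overflow page this list may simply be empty):

```python
# Consolidated version of the interactive steps above.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = ('http://stackoverflow.com/questions/19957194/'
          'install-beautiful-soup-using-pip')

uClient = uReq(my_url)        # open a connection to the web page
page_html = uClient.read()    # download the raw HTML
uClient.close()               # close the connection

page_soup = soup(page_html, "html.parser")   # parse the contents

print(page_soup.h1)           # first <h1> of the page
print(page_soup.p)            # first <p> of the page

# Step 20: grab every <div class="item-container"> (may be empty here)
containers = page_soup.findAll("div", {"class": "item-container"})
print(len(containers))
```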
Analytics Methods
ITS480 BUSINESS DATA ANALYTICS
Prepared by: Ruhaila Maskat (PhD)
References: IngramMicroAdvisor, KDnuggets
By the end of this lesson, you should know:
• Categories of analytics methods.
• Methodology for data analytics.
• Popular analytics methods.
• Choosing analytical methods.

RECAP: Purpose of data analytics
• Support decision-making.
• Provide an advantage over competitors.
• Give insight into the future.
RECAP: Health care (figure: Data → VALUE!)

Four Types of Analytics
• Prescriptive – This type of analysis reveals what actions should be taken. This is the most valuable kind of analysis and usually results in rules and recommendations for next steps.
• Predictive – An analysis of likely scenarios of what might happen. The deliverables are usually a predictive forecast.
• Diagnostic – A look at past performance to determine what happened and why. The result of the analysis is often an analytic dashboard.
• Descriptive – What is happening now based on incoming data. To mine the analytics, you typically use a real-time dashboard and/or email reports.
Diagnostic Analytics
Diagnostic analytics are used for discovery, or to determine why something happened. For example, for a social media marketing campaign, you can use descriptive analytics to assess the number of posts, mentions, followers, fans, page views, reviews, pins, etc. There can be thousands of online mentions that can be distilled into a single view to see what worked in your past campaigns and what didn’t.

Descriptive Analytics
Descriptive analytics are valuable for uncovering patterns that offer insight. A simple example of descriptive analytics would be assessing credit risk: using past financial performance to predict a customer’s likely financial performance. Descriptive analytics can be useful in the sales cycle, for example, to categorise customers by their likely product preferences and sales cycle.
Business understanding
• Understand the problem to be solved. This may require multiple iterations before an acceptable solution formulation appears.
• The design team should think carefully about the problem to be solved and about the use scenario. They must ask the questions:
• What exactly do we want to do?
• How exactly would we do it?
• What parts of this use scenario constitute possible data mining models?
• It is common for a business problem to comprise several data mining tasks whose results together solve the problem.

Data understanding
• Data is the raw material from which the solution will be built.
• It is important to understand the strengths and limitations of the data, because rarely is there an exact match with the problem. For example, historical data are often collected for a different purpose.
Data preparation
• Examples: converting data into tabular format, removing or inferring missing values, and converting data to different types.

Modelling
• Typically, the output is some sort of model or pattern capturing regularities in the data.
Evaluation
• The aim is to assess the data mining results and to gain confidence that the results are valid and reliable.
• Stakeholders would like to know if the proposed model is going to do more good than harm, or whether it would be catastrophic.
• Evaluating the results of data mining includes both quantitative and qualitative assessments.
• These evaluation techniques are statistical in nature and thus not covered in this course.

Deployment
• Data mining results are put into real use in order to realise some return on investment. This involves implementing the proposed model.
• Observations from this stage may require an iteration back to the Business Understanding stage, where improvements and refinements to the model are made.
Classification and class probability estimation
• Goal: To predict which class an individual belongs to.
• Question: Among all the customers of MegaTelecom, which are likely to respond to a given offer?
• The individual is a customer.
• The classes are “will respond” and “will not respond”.
• Classification task: A data mining model predicts which class an individual belongs to.
• Class probability estimation task: Instead of predicting which class an individual belongs to, it predicts the probability that an individual belongs to each class. The probability comes as a score value.
(Figure: given an offer, one customer scores probability 80%, WILL RESPOND; another scores 5%, WILL NOT RESPOND.)
Regression
• Goal: To predict or estimate, for each individual, the numerical value of some variable for that individual.
• Question: How much will a given customer use the service?
• Task: Predict the “service usage” property (variable) for a particular individual, typically by looking at other similar individuals in the population and their historical usage.
(Figure: how much service would she use? A LOT! or A LITTLE….)
Similarity matching
• Underlies other data mining tasks, such as classification, regression and clustering.
• Goal: To identify similar individuals based on data known about them. In other words, to find similar individuals.
• Among the most popular methods for making product recommendations (finding people who are similar to you when purchasing items).

Clustering
• Goal: To group individuals in a population together by their similarity, but not driven by any specific purpose.
• Question: Do our customers form natural groups or segments?
• Useful in preliminary domain exploration, where natural groups may later suggest other data mining tasks or approaches.
(Figure: clustering customers into segments, e.g. texts occasionally, calls for long hours, only receives calls, intensive data plan, texts frequently, seldom calls or texts.)

Co-occurrence grouping
• Also known as frequent itemset mining, association rule discovery, or market-basket analysis.
• Goal: To find associations between individuals based on transactions involving them.
• Question: What items are commonly purchased together?
• Task: Identify similarity of objects based on their “appearing” together in transactions.
• Example: people who bought X also bought Y.
(Figure: hungry people who bought PIZZA also bought NOODLES; therefore, always offer NOODLES to someone who bought PIZZA.)

Profiling
• Also known as “behaviour description”.
• Goal: To characterise the typical behaviour of an individual, group or population.
• Question: What is the typical cell phone usage of this customer segment?
• Task: Requires a complex description of night and weekend airtime averages, international usage, roaming charges, text minutes, etc.
(Figure: a customer’s January–March usage fits their profile; April does not fit the profile. A mismatch: FRAUD ALERT!)
Which analytical method?
• Often, a data analyst must be able to propose one or multiple analytical methods to solve a business problem. However, this can be tricky. One way is to identify whether the business problem requires a supervised or an unsupervised data mining method, by determining if the question has a target/purpose for the grouping.

(Figure:)
Q1: Do our customers naturally fall into different groups? Is there a target? No. Hence, use unsupervised methods: clustering, co-occurrence grouping, profiling.
Q2: Can we find groups of customers who have particularly high likelihoods of cancelling their service soon after their contracts expire? Is there a target? Yes: will a customer leave when her contract expires? Hence, use supervised methods: classification, regression.
Similarity matching appears under both.
Analytics Methods
ITS480 BUSINESS DATA ANALYTICS
Prepared by: Ezzatul Akmal Kamaru Zaman

Model Evaluation
• Evaluation metrics: How can we measure accuracy? What other metrics should we consider?
• Use a test set of class-labelled tuples, instead of the training set, when assessing accuracy.
• Methods for estimating a classifier’s accuracy:
• Holdout method, random subsampling
• Cross-validation
• Bootstrap
• Comparing classifiers:
• Confidence intervals
• Cost-benefit analysis and ROC curves
Classifier Evaluation Metrics: Accuracy & Error Rate

Confusion matrix:
Actual class \ Predicted class | C1 | ~C1
C1 | True Positives (TP) | False Negatives (FN)
~C1 | False Positives (FP) | True Negatives (TN)

• Classifier accuracy, or recognition rate: the percentage of test set tuples that are correctly classified.
• Error rate: 1 – accuracy.
• True positives (TP): cases in which we predicted yes (they have the disease), and they do have the disease.
• True negatives (TN): we predicted no, and they don’t have the disease.
• False positives (FP): we predicted yes, but they don’t actually have the disease. (Also known as a “Type I error.”)
• False negatives (FN): we predicted no, but they actually do have the disease. (Also known as a “Type II error.”)

Classifier Evaluation Metrics: Example – Confusion Matrix
• Accuracy: Overall, how often is the classifier correct? (TP+TN)/total = (100+50)/165 = 0.91
• Misclassification rate: Overall, how often is it wrong? (FP+FN)/total = (10+5)/165 = 0.09; equivalent to 1 minus accuracy; also known as “error rate”.
• True positive rate: When it’s actually yes, how often does it predict yes? TP/actual yes = 100/105 = 0.95; also known as “sensitivity” or “recall”.
• False positive rate: When it’s actually no, how often does it predict yes? FP/actual no = 10/60 = 0.17
• Specificity: When it’s actually no, how often does it predict no? TN/actual no = 50/60 = 0.83; equivalent to 1 minus false positive rate.
• Precision: When it predicts yes, how often is it correct? TP/predicted yes = 100/110 = 0.91
• Prevalence: How often does the yes condition actually occur in our sample? actual yes/total = 105/165 = 0.64
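A small sketch computing the example’s metrics from the four confusion-matrix counts above (TP=100, TN=50, FP=10, FN=5):

```python
# Confusion-matrix metrics for the worked example above.
TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN            # 165

accuracy    = (TP + TN) / total      # 0.91
error_rate  = (FP + FN) / total      # 0.09 (= 1 - accuracy)
sensitivity = TP / (TP + FN)         # 0.95, a.k.a. recall / true positive rate
fpr         = FP / (FP + TN)         # 0.17, false positive rate
specificity = TN / (FP + TN)         # 0.83 (= 1 - fpr)
precision   = TP / (TP + FP)         # 0.91
prevalence  = (TP + FN) / total      # 0.64

print(f"accuracy={accuracy:.2f}, recall={sensitivity:.2f}, "
      f"precision={precision:.2f}, specificity={specificity:.2f}")
```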
(Figures: confusion-matrix cells highlighted to contrast sensitivity with specificity, and precision with recall.)
• Sensitivity: True Positive recognition rate.
(Table fragment: predicted-class totals P’ = 230, N’ = 9770, total = 10000; accuracy 96.40.)

Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods
• Holdout method
• Given data is randomly partitioned into two independent sets:
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
• Random sampling: a variation of holdout; repeat holdout k times, accuracy = avg. of the accuracies obtained
(Fig 1: Holdout Method)
• Cross-validation (k-fold, where k = 10 is most popular)
• Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
• At the i-th iteration, use Di as the test set and the others as the training set
• Leave-one-out: k folds where k = number of tuples; for small-sized data, one sample is left out for testing
• Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
(Fig 2: Cross-Validation Method)
A sketch of the holdout split and k-fold partitioning follows the ROC notes below.

Model Selection: ROC Curves
• ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models.
• Originated from signal detection theory.
• Shows the trade-off between the true positive rate and the false positive rate.
• The area under the ROC curve is a measure of the accuracy of the model.
• Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list.
• The closer to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model.
(Figure: ROC curves for Model 1 and Model 2 with the diagonal line; the vertical axis represents the true positive rate and the horizontal axis the false positive rate; a model with perfect accuracy has an area of 1.0. Model 1 is better than Model 2. Why?)
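As promised above, a minimal sketch of the holdout split and k-fold partitioning in plain Python (no ML library; training and evaluating the model are left abstract):

```python
import random

def holdout_split(data, train_fraction=2/3, seed=42):
    """Randomly partition data into independent training and test sets."""
    shuffled = data[:]                       # copy so the input stays intact
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]    # training set, test set

def k_folds(data, k=10, seed=42):
    """Partition data into k mutually exclusive, roughly equal subsets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

data = list(range(30))                       # stand-in for labelled tuples
train, test = holdout_split(data)            # e.g. 2/3 train, 1/3 test

for i, fold in enumerate(k_folds(data, k=10)):
    test_set = fold                          # at the i-th iteration, Di is the test set
    training_set = [x for x in data if x not in fold]
    # train on training_set, evaluate accuracy on test_set ...
```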
Analytics Methods
Resource: Tableau

By the end of this lesson, you should know:
• What data visualisation is.
• Benefits of good data visualisation.
• Types of data visualisation.
Benefits of good data visualisation
• As the “age of Big Data” kicks into high gear, visualisation is an increasingly key tool to make sense of the trillions of rows of data generated every day. Data visualisation helps to tell stories by curating data into a form that is easier to understand, highlighting the trends and outliers. A good visualisation tells a story, removing the noise from data and highlighting the useful information.
• However, it’s not simply as easy as dressing up a graph to make it look better or slapping on the “info” part of an infographic. Effective data visualisation is a delicate balancing act between form and function. The plainest graph could be too boring to catch any notice, or it may make a powerful point; the most stunning visualisation could utterly fail at conveying the right message, or it could speak volumes. The data and the visuals need to work together, and there’s an art to combining great analysis with great storytelling.

Types of data visualisation
• When you think of data visualisation, your first thought probably goes to simple bar graphs or pie charts. While these may be an integral part of visualising data and a common baseline for many data graphics, the right visualisation must be paired with the right set of information. Simple graphs are only the tip of the iceberg. There’s a whole selection of visualisation methods to present data in effective and interesting ways.
Treemap
• Treemaps are a powerful and compact way to visualise hierarchical and part-to-whole relationships. Each branch of the tree is represented as a rectangle, with the size of a branch proportionate to a specified measure of the data. A lot of people like treemaps because they’re visually attractive, so understanding how to leverage colour is a plus. Colour is often used to show dimensions in a treemap; heat maps work well if you want to show a spectrum.
• Example: https://public.tableau.com/views/CashonHand1/CashonHand?:embed=y&:loadOrderID=0&:display_count=yes
Learn how to build a treemap in Tableau
• https://www.tableau.com/learn/tutorials/on-demand/treemaps-word-clouds-and-bubble-charts-chart-type

Learn how to build a histogram in Tableau
• https://www.tableau.com/learn/tutorials/on-demand/histograms
Histogram
• Histograms plot the number of occurrences of a given variable in a set
of data. They’re a great tool for getting an overview of the entire
distribution of a variable, and they take the form of a bar chart.
Imagine using histograms for retail analytics, to count the number of
sales of individual products by category. Or in customer analytics, to
tally the range of spending in a certain demographic.
Learn how to build a box plot in Tableau
• http://kb.tableau.com/articles/knowledgebase/box-plot-analog
Gantt chart
• Gantt charts are the enemy of procrastination, keeping those micro-
deadlines between projects well in view. They’re great for displaying a
timeline such as project stages or a product release—to ensure you
release the beta before the product.
• The viewer can instantly see when parts of a project begin and end in
relation to each other, without having to cross check between pages
or sheets. Did you know? The first Gantt-type chart was developed in
1896, and was called a harmonogram. So all your departments can
work in harmony.
Learn how to build a Gantt chart in Tableau
• http://onlinehelp.tableau.com/current/pro/online/mac/en-us/buildexamples_gantt.html

Word cloud
• Word clouds are like bubble charts in that words are sized according to some numerical measure and all packed into a designated space. They’re useful for presenting data about (you guessed it) words. While word clouds are not the best for accurate interpretation, sometimes they add impact to a dashboard and encourage more people to engage with the data.
Lab work
• Go through each video on the following URL.

References: Handbook of Natural Language Processing; Coursera Basic Natural Language Processing
By the end of this lesson, you should know:
• Overview of Natural Language Processing (NLP)
• Applications of NLP
• Stages of NLP

What is Natural Language?
• Language used for everyday communication: English, Chinese, Tamil, Español, Malay.
• Not an artificial computer language (Python, C++).
• The language we use in short text messages (c u 2nite) or on tweets is also, by this definition, natural language.
What is Natural Language Processing?
• Natural Language Processing (NLP) is the study of the computational treatment of natural (human) language.
• In simpler words, teaching computers:
• how to understand what words mean
• how to generate human language by understanding how sentences are constructed
• Natural language evolves:
• New words get added (google, selfie)
• Old words lose popularity (thou)
• Meanings of words change (words such as “learn” in Old English used to mean “teach”)
• Language rules change (position of the verb)

Applications of NLP
• Machine Translation
• Translation systems: Google Translate, Yahoo! Babel Fish
• Database Access
• Information Retrieval
• Selecting from a set of documents the ones that are relevant to a query (Gmail, search engines)
• Text Categorization
• Sorting text into fixed topic categories
Stages of NLP
Phonetics & Phonology → Morphological Analysis → Syntactic Analysis → Lexical Analysis → Semantic Analysis → Discourse Integration → Pragmatic Analysis

1) Phonetics & Phonology
• Phonetics:
• Pronunciation of different speakers.
• Deals with the physical building blocks of a language’s sound system.
• Pace of speech.
• Examples: “I ate eight cakes”; the different ‘k’ sounds in ‘kite’ and ‘coat’; “That band is banned.”
• Phonology:
• Processing of speech.
• Organisation of speech sounds within a language.
• Example: Bank (finance) vs. Bank (river).
Ambiguity in NLP
• Ambiguity is the property of having more than one meaning, or of being understood in more than one way. If an expression (word/phrase/sentence) has more than one interpretation, we can refer to it as ambiguous.
• Natural languages are ambiguous, so computers are not able to understand language the way people do.
• Natural Language Processing (NLP) is concerned with the development of computational models of aspects of human language processing.
• Ambiguity can occur at various levels of NLP: lexical, syntactic, semantic, pragmatic, etc.
• E.g. consider the sentence “The chicken is ready to eat.” The interpretations can be: the chicken (bird) is ready to be fed, or the chicken (food) is ready to be eaten.
• Consider another sentence: “There was not a single man at the party.” The interpretations can be: a lack of bachelors at the party, or a lack of men altogether.
References
• http://www.ijircce.com/upload/2014/sacaim/59_Paper%2027.pdf

Text Analysis
ITS480 BUSINESS DATA ANALYTICS
Prepared by: Ruhaila Maskat (PhD)
References: EMC Data Science module, Wikipedia
By the end of this lesson, you should know:
• A specific form of text analysis, i.e. sentiment analysis/buzz tracking.
• The steps and processes in sentiment analysis.

Text analysis
• The processing and representation of data that is in text form, for the purpose of analysing it and learning new models from it.
• The main challenge in text analysis is the problem of high dimensionality: every possible word in a document represents a dimension.
• Example: the book ‘Green Eggs and Ham’ by Dr. Seuss has just fifty different words, hence 50 dimensions.
• Another challenge of text analysis is that the data is unstructured.
Buzz tracking: the steps and processes
1. Monitor social networks and review sites for mentions of our products.
   Process: Parse the data feeds to get actual content. Find and filter the raw text for product names (use regular expressions).
2. Collect the reviews.
   Process: Extract the relevant raw text. Convert the raw text into a suitable document representation. Index into our review corpus.
3. Sort the reviews by product.
   Process: Classification (or “topic tagging”).
4. Determine the type of review (good or bad).
   Process: Classification (sentiment analysis).
5. Marketing calls up and reads selected reviews in full, for greater insight.
   Process: Search/information retrieval.

Step 1: Monitor social networks, review sites for mentions
• Parsing
• To resolve a sentence into component parts of speech and explain the syntactical relationship (Merriam-Webster).
• The aim is to impose structure, typically on semi-structured data, e.g. HTML pages, RSS feeds.
• The structure must be enough to find the parts of the raw text that we really care about: the actual content of the review, titles, the date of the review.
• The output is a collection of phrases and words that speak of the product of interest.
Example: regular expressions
Regular expression | Matches | Note
b[P|p]hone | bPhone, bphone | Pipe “|” means “or”
bEb*k | bEbook, bEbk, bEback… | “*” is a wildcard, matches anything
^I love | A line starting with “I love” | “^” means the start of a string
Acme$ | A line ending with “Acme” | “$” means the end of a string

Step 2: Collect the reviews
• Extract and represent text
• Aim: to represent our collection of phrases and words in a structured manner for downstream analysis, and to calculate the number of times a term occurs.
• A common representation is the “bag of words”: a vector with one dimension for every unique term in the space.
• However, this results in a VERY high-dimensional structure.
• To produce a bag of words, count the occurrences of each word in the parsed text and store the word counts.
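A short sketch of both ideas using Python’s re module and collections.Counter (the sample review string is invented for illustration):

```python
import re
from collections import Counter

review = "I love my bPhone. The bphone camera beats my old Acme"

# product-name filtering with a regular expression, as in the table above
print(re.findall(r"b[Pp]hone", review))        # -> ['bPhone', 'bphone']

# ^ and $ anchor a pattern to the start / end of a string
print(bool(re.search(r"^I love", review)))     # True: starts with "I love"
print(bool(re.search(r"Acme$", review)))       # True: ends with "Acme"

# bag of words: one count per unique term
bag = Counter(re.findall(r"\w+", review.lower()))
print(bag["bphone"])                           # -> 2
```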
Step 2: Collect the reviews (continued)
• Document representation – other features
• A “feature” is anything about the document that is used for search or analysis:
• Title
• Keywords or tags
• Date information
• Source information
• Named entities
• Features help with downstream analysis in text classification.
• Representing a corpus
• A corpus is a collection of documents.
• Why represent a corpus? Because we want to archive the documents yet be able to search them for future reference and research.
• Reverse indexing provides a way of keeping track of the list of all documents that contain a specific feature, for every possible feature.
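A toy sketch of such a reverse (inverted) index, mapping each term to the set of documents that contain it (the two document texts are invented):

```python
from collections import defaultdict

docs = {
    "doc1": "the bphone has coverage everywhere",
    "doc2": "the bebook text is illegible",
}

# reverse index: term -> set of documents containing that term
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

print(sorted(index["the"]))      # -> ['doc1', 'doc2']
print(sorted(index["bphone"]))   # -> ['doc1']
```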
Step 2: Collect the reviews (continued)
• Common perception
• Documents are often only relevant in the context of a corpus, or a specific collection of documents. Hence, classifiers need to be trained on a specific set of documents, and any change to the corpus requires retraining of a classifier.
• Challenge
• A corpus changes constantly over time: not only do new documents get added, but word distributions can change over time. This could reduce the effectiveness of classifiers and filters if they are not retrained, e.g. spam filters.

Step 3: Sort the reviews
• Once all reviews have been collected and represented, we want to sort them by the subject of interest, i.e. product/service.
• Examples:
“The bphone-5x has coverage everywhere. It’s much less flaky than my old bPhone-4G.”
“While I love Acme’s bPhone series, I’ve been quite disappointed by the bEBook. The text is illegible, and it makes even the Kindle look blazingly fast.”
Text classification
• To sort reviews, we need to classify them, typically by topic tagging.
• Topic tagging often involves having a team of human users determine the classification of a review and tag it accordingly. This answers questions such as:
• Is this review about the bPhone, the bEBook or the Kindle?
• Is this review about the bphone-5X or the bPhone-4G?
• Some rules for topic tagging:
• If the product is mentioned in the title, then the review is likely to be about the product.
• If the mentions are in the contents, the review may or may not be related to the product.
• A tweet is more likely to be about the product than a forum post, because a review may be a comparison of different products.
• More frequent mentions of the product may indicate that the review is relevant.

Step 4: Determining the type of review (good or bad)
• Another text classification task is done at this step, but here it involves determining whether a review is good (positive) or bad (negative).
• Commonly used classifiers include Naïve Bayes and Support Vector Machines (SVM).
• A major bottleneck of this step is the need for tagged training data. Two approaches to overcome this:
• Have humans identify good and bad reviews.
• Utilise a sentiment dictionary.
Training a classifier: Naïve Bayes
(Figure: a tagged training set of six reviews and its probabilities; the probability of a positive review having the word “love” is 4/6.)
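A minimal sketch of the word-given-class estimate shown in the figure (six invented positive reviews, four of which contain “love”):

```python
# Estimate P(word | class) from a tagged training set.
positive_reviews = [
    "love this bphone",
    "i love the camera",
    "love it",
    "great battery, love the screen",
    "works great",
    "very happy with it",
]

def p_word_given_class(word, reviews):
    """Fraction of the class's reviews that contain the word."""
    containing = sum(1 for review in reviews if word in review.split())
    return containing / len(reviews)

print(p_word_given_class("love", positive_reviews))  # -> 4/6, about 0.67
```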
Term frequency – inverse document frequency
• A weight-based metric to identify reviews/documents relevant to some query terms.
• The underlying idea of TF-IDF is that rare terms are weighted higher than common terms. In other words, rare terms are regarded as more important than common terms, due to their discriminating nature.
• Consists of two parts: term frequency and inverse document frequency.
• Term frequency (tf): the number of times a term is found in a document, over the total number of terms in the document.
• Document frequency (df): the number of documents with term t in them.
• Inverse document frequency (idf): the logarithm of the inverse of the document frequency, which indicates the rarity of a term:

idf = log ( (size of corpus) / df )
Scoring TF-IDF
• To know which document is more relevant to the query terms, sum the tf-idf scores of the query terms for each document.
• Document 1: 0 + 0 = 0
• Document 2: 0 + 0.13 = 0.13 ← more relevant
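A small sketch of the tf, idf and tf-idf pieces using the formula above (a two-document toy corpus; the numbers are illustrative, not the worked example from the slides):

```python
import math

corpus = {
    "doc1": "the bphone has coverage everywhere the bphone".split(),
    "doc2": "the bebook text is illegible".split(),
}

def tf(term, doc):
    """Occurrences of the term over the total number of terms in the document."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """log(size of corpus / number of documents containing the term)."""
    df = sum(1 for doc in corpus.values() if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "the" appears in every document, so its idf (and tf-idf) is 0;
# "bphone" is rare, so it scores higher and is more discriminating.
print(tf_idf("the", corpus["doc1"], corpus))      # -> 0.0
print(tf_idf("bphone", corpus["doc1"], corpus))   # -> about 0.2
```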
SQOOP: SQL + HADOOP = SQOOP
- Imports structured data from tables (RDBMS) to HDFS.
- A file is created in HDFS which contains the data, where it can be processed by MapReduce, Hive or Pig.
- Processed data in HDFS can be stored back to another table in the RDBMS (export).

HDFS (Hadoop Distributed File System)
- A technique to store data in a distributed manner in order to compute fast.
- Saves data in blocks of 64 MB (default) or 128 MB in size, which is a logical splitting of the data.
MapReduce Framework
- A method of programming over distributed data stored in HDFS.
- Can be written using many languages, like Java, C++ Pipes, Python, Ruby etc.
- Can be applied to any type of data, whether structured or unstructured. Example: word count using MapReduce.
- For “embarrassingly parallel” problems, where a single task can be divided into smaller tasks and later recombined into a single output.
- The MAP function divides a big task into smaller tasks to be processed on different units of machines. Its output should be in the form of key, value pairs.
- In a word-count case, the MAP function would count the words in each document by placing a document on a machine. The key would be a word, and the value would be the count.
- The REDUCE function recombines multiple small tasks to become a single result.
- In a word-count case, the REDUCE function would take the counts of words found in each document and total them up to produce a total count of the words (see the sketch below).
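A toy sketch of the word-count idea in plain Python (local functions standing in for the distributed MAP and REDUCE phases; the documents are made up):

```python
from collections import Counter
from functools import reduce

documents = [
    "deer bear river",
    "car car river",
    "deer car bear",
]

def map_phase(doc):
    """MAP: emit (word, count) pairs for a single document."""
    return Counter(doc.split())

def reduce_phase(a, b):
    """REDUCE: recombine partial counts into a single result."""
    return a + b

partial_counts = [map_phase(doc) for doc in documents]  # one per "machine"
total = reduce(reduce_phase, partial_counts, Counter())
print(total)   # Counter({'car': 3, 'deer': 2, 'bear': 2, 'river': 2})
```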
HBASE
- A non-relational (NoSQL) database that runs on top of HDFS.
- Was created for large tables which have billions of rows and millions of columns, with fault-tolerance capability and horizontal scalability; based on Google BigTable.
- Hadoop by itself can perform only batch processing, and data will be accessed only in a sequential manner; HBase provides random access.
Hive
- For SQL-literate people.
- Mainly deals with structured data stored in HDFS.
- Has a specialised query language called HQL (Hive Query Language).
- Also runs MapReduce programs in the backend to process data in HDFS.

Pig Latin
- Also deals with structured data.
- For programmers who love scripting and don’t want to use Java/Python or SQL to process data.
- A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data; it runs a MapReduce program in the backend to produce the output.
Mahout
- An open source machine learning library from Apache, written in Java.
- The algorithms it implements fall under the broad umbrella of machine learning or collective intelligence.
- Primarily recommender engines (collaborative filtering), clustering, and classification.
- The machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine.

Oozie
- A workflow scheduler system to manage Hadoop jobs.
- A server-based workflow engine specialised in running workflow jobs with actions that run Hadoop MapReduce and Pig jobs.
- Implemented as a Java web application that runs in a Java servlet container.
- Used when a programmer wants to run many jobs in a sequential manner, e.g. the output of job A is the input to job B, the output of job B is the input to job C, and the final output is the output of job C.
Zookeeper
- A centralised service for maintaining configuration information, naming, providing distributed synchronisation, and providing group services. In case of any partial failure, clients can connect to any node and be assured that they will receive the correct, up-to-date information.