Business Analytics Notes
Business Analytics
Business Analytics is the use of data, information technology, statistical analysis,
quantitative methods and mathematical or computer-based models to help managers gain
improved insight into their business operations and make better, fact-based decisions.
Business Analytics is supported by various tools such as Microsoft Excel, Excel add-ins,
commercial statistical software packages such as SAS or Minitab and more complex
business intelligence suites that integrate data with analytical software.
Why Analytics?
Threefold:
2. Technology: used for data capture, data storage, data preparation, data analysis and data
sharing.
Model building: analytics model building is an iterative process that aims to find the best
model.
Communication and deployment of the data analytics results.
Pharmaceutical firms use it to get life-saving drugs to market more quickly.
Airlines and hotels use it to dynamically set prices over time to maximize revenue.
Sports teams use it to determine both game strategy and optimal ticket prices.
Type of Decisions enhanced by Analytics
Pricing
Customer Segmentation
Merchandising
Location
Statistical methods - gain a richer understanding of data by finding relationships among the
data.
Data mining - better understand characteristics and patterns among variables in large
databases using a variety of statistical and analytical tools.
Simulation and risk analysis - relies on spreadsheet models and statistical analysis to
examine the impact of uncertainty in the estimates, and their potential interaction with
one another, on the output variable of interest (a small Monte Carlo sketch follows this list).
What-if analysis
Faster decisions
Better productivity
Data filters/drill-down: equipped with a range of filters, drop-downs, slicers and search
functions that make it easy for users to find the information they need quickly.
Security
Data visualization: Bring your data to life - Dashboards, charts, graphs, gauges and other
visuals.
Self-service
Mobile applications
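The simulation and risk analysis item above can be illustrated with a small Monte Carlo sketch in Python. This is a hypothetical example, not from the notes: demand and unit cost are treated as uncertain inputs and profit is the output variable of interest.

import numpy as np

# Hypothetical what-if / risk-analysis model: profit = (price - unit_cost) * demand - fixed_cost
rng = np.random.default_rng(42)
n_trials = 10_000

price = 25.0                                               # assumed selling price
demand = rng.normal(loc=1_000, scale=150, size=n_trials)   # uncertain demand
unit_cost = rng.triangular(8, 10, 13, size=n_trials)       # uncertain unit cost
fixed_cost = 5_000

profit = (price - unit_cost) * demand - fixed_cost

print(f"Mean profit    : {profit.mean():,.0f}")
print(f"5th percentile : {np.percentile(profit, 5):,.0f}")
print(f"P(loss)        : {(profit < 0).mean():.1%}")

Changing the price or the assumed distributions and re-running is exactly the kind of what-if analysis described above.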
Descriptive Analytics
The simplest form of analytics, which mainly uses simple descriptive statistics, data
visualization techniques and business-related queries to understand past data.
Used for understanding trends in past data, which can be useful for generating
insights.
Examples….
Most shoppers turn towards the right side when they enter a retail store, so retailers keep
higher-margin products on the right side of the store.
John Snow was able to come up with the hypothesis that cholera was a water-borne
disease and did not spread through the air.
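A minimal descriptive-analytics sketch in Python, assuming a small hypothetical sales table: simple summary statistics and a business query over past data.

import pandas as pd

# Hypothetical past sales data (illustration only)
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "East", "East"],
    "revenue": [1200, 1500, 900, 1100, 1300, 1250],
    "units":   [60, 75, 45, 52, 70, 64],
})

print(sales["revenue"].describe())               # count, mean, std, quartiles of past revenue
print(sales.groupby("region")["revenue"].sum())  # simple business query: revenue by region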
Predictive Analytics
It aims to predict the probability of occurrence of a future event.
It predicts by examining historical data, detecting patterns or relationships in these data, and
then extrapolating those relationships forward in time.
E.g., a skiwear manufacturer predicts next season's demand for skiwear of a specific color
and size.
A bank manager predicts the chances that a loan applicant will default.
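A toy sketch of the bank-manager example, using synthetic (made-up) applicant data: a logistic regression estimates the probability that a new applicant will default.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic historical data (hypothetical): income in thousands, credit score, default flag
rng = np.random.default_rng(0)
income = rng.normal(50, 15, 500)
score = rng.normal(650, 60, 500)
# Made-up relationship: default is more likely for low income and low credit score
p_default = 1 / (1 + np.exp(0.05 * (income - 40) + 0.02 * (score - 600)))
defaulted = rng.random(500) < p_default

X = np.column_stack([income, score])
model = LogisticRegression(max_iter=1000).fit(X, defaulted)

# Predicted default probability for a new applicant (income 35k, score 580)
new_applicant = np.array([[35, 580]])
print("Estimated default probability:", round(model.predict_proba(new_applicant)[0, 1], 2))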
Prescriptive Analytics
It assists in finding the optimal solution to a problem or in making the right decision
among several alternatives.
It uses optimization to identify the best alternative that minimizes or maximizes some
objective.
Examples:
Coca-Cola Enterprises developed an OR model that would meet several objectives, such as
improved customer satisfaction and optimal asset utilization, for their distribution network
of Coca-Cola products from 430 distribution centers to 2.4 million retail outlets.
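A minimal prescriptive sketch using linear programming; the numbers are hypothetical and much simpler than the Coca-Cola model: choose shipment quantities from two distribution centres to two retail zones so that total shipping cost is minimized.

from scipy.optimize import linprog

# Decision variables: x = [DC1->Z1, DC1->Z2, DC2->Z1, DC2->Z2]
cost = [4, 6, 5, 3]                    # shipping cost per unit (assumed)

A_ub = [[1, 1, 0, 0],                  # DC1 can ship at most its capacity
        [0, 0, 1, 1]]                  # DC2 can ship at most its capacity
b_ub = [80, 70]

A_eq = [[1, 0, 1, 0],                  # zone Z1 must receive exactly its demand
        [0, 1, 0, 1]]                  # zone Z2 must receive exactly its demand
b_eq = [60, 50]

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print("Optimal shipments:", res.x.round(1), "minimum total cost:", res.fun)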
Big data requires new technologies and techniques to capture, store, and analyze it; it is used to
enhance decision making, provide insight and discovery, and support and optimize
processes.
Cost reductions
Reduced risks
Big data helps organizations better understand and predict customer behavior and improve
customer service.
The effective use of big data has the potential to transform economies, delivering a new
wave of productivity growth and consumer surplus
Using big data will become a key basis of competition for existing companies
Understanding big data requires advanced analytics tools such as data mining and text
analytics
Slowly, cryptocurrencies are coming under the regulatory net in order to check misuse. Japan
recently became the first country to regulate cryptocurrencies; the US is quickly laying down
regulatory guidelines, the UK and Australia continue to work on the formalities while China has
recently banned Initial Coin Offerings (ICO) due to various reasons, including various ICO
scams around the world.
Though India plays a relatively small role in the global cryptocurrency market, accounting for only
about 2% of the global cryptocurrency market cap, the RBI has warned about the potential financial,
legal, customer-protection and security-related risks of cryptocurrency, amidst prevalent media
rumours of the RBI launching its own cryptocurrency, named Lakshmi. Recent surveys conducted
by the Income Tax department across the major exchanges have also triggered the issuance of
income tax demand notices to users of these exchanges.
The global cryptocurrency market cap had declined from $650 billion to $480 billion by March
2018. Thus, the rapid proliferation of the cryptocurrency ecosystem in India, with its benefit of
removing the limitations of the monetary system while remaining volatile at the same time, has left
the future of the Indian market uncertain.
Cryptocurrency
Cryptocurrency is designed from the ground up to take advantage of the internet and how it
works. Instead of relying on traditional financial institutions who verify and guarantee your
transactions, cryptocurrency transactions are verified by the user's computers logged into the
currency's network. Since the currency is protected and encrypted, it becomes impossible to
increase the money supply beyond a predefined algorithmic rate. All users are aware of this
rate, and since each algorithm has an upper limit, no cryptocurrency can be produced or
"mined" beyond that.
Since cryptocurrency exists completely in the cloud, it does not have a physical form but does
have a digital value, and it can be used as a digital equivalent of cash at a steadily increasing
number of retailers and other businesses. Bitcoin was the first cryptocurrency ever created, and
while there is a small fee for every cryptocurrency transaction, it is still considerably lower than
typical credit card processing fees.
Bitcoin is the most popular cryptocurrency which has seen a massive success. There are other
cryptocurrencies such as Ripple, Litecoin, Peercoin, etc. for people to transact in. But for every
successful cryptocurrency, there are others which have died a slow death because no one
bothered to use them, and a cryptocurrency is only as strong as its users. Some of the salient
features of Cryptocurrency include -
Cryptocurrency can be converted into other forms of currency and deposited into users'
accounts at lightning speed.
Most cryptocurrencies can be transacted anonymously and can be used as discreet online
cash anywhere in the world. Users therefore do not have to pay any currency
conversion fees.
While not 100% immune from theft, Cryptocurrency is generally safe to use and difficult
for malicious hackers to break
Bitcoin and other Cryptocurrency can be saved offline either in a "paper" wallet or on a
removable storage hard drive which can be disconnected from the internet when not in
use
Bitcoin is a digital currency designed for the current market scenario. The currency was created in
the year 2009. The idea behind the creation of the coin was set out in a white paper by a mysterious
individual, Satoshi Nakamoto, whose identity has not yet been established (Nakamoto 2013). The
name Bitcoin combines "bit", a unit of digital information, with "coin", the currency. The concept
behind the creation of Bitcoin is the easy transfer of money without paying large transaction fees.
In the viewpoint of Ron and Shamir (2013), traditional online payments involve a transaction charge
that has to be paid to the bank or other financial organization handling the transaction. As pointed
out by Jarecki et al. (2016), it must be clearly understood that there is no physical form of Bitcoin;
it is shown only as a balance in the account of the user's Bitcoin profile. The balance is maintained
with accounting records in mind, such as a ledger or balance sheet. At the time of a transaction, a
form of verification is carried out to ensure that the transfer of money has been done without any
trouble or mishandling. The coin is available in various denominations such as the millibitcoin and
microbitcoin. The total supply of bitcoin is capped at 21 million coins (McCallum 2014).
There are many reasons why the impact of Bitcoin is exceptionally relevant today, and why the
Cryptocurrency of 2018 is now here to stay. These include -
1. Reduced Remittance
Many governments around the world are implementing isolationist policies which restrict
remittances made from other countries or vice versa either by making the charges too high or by
writing new regulations. This fear of not being able to send money to family members and others
is driving more people towards digital Cryptocurrency, chief amongst them being Bitcoin.
2. Restricted Sovereign Currencies
Many sovereign currencies and their usage outside of their home country are being regulated and
restricted to an extent, thereby driving the demand for Bitcoin. For example, the Chinese
government recently made it tougher for people as well as businesses to spend the nation's
currency overseas, thereby trapping liquidity. As a result, options such as Bitcoin have gained
immense popularity in China.
3. Better Acceptance
Today, more consumers are using Bitcoins than ever before, and that is because more legitimate
businesses and companies have started accepting them as a form of payment. Today, online
shoppers and investors are using bitcoins regularly, and 2016 saw 1.1 million bitcoin wallets
being added and used.
4. Corruption Crackdown
Although unfortunate, digital cryptocurrencies such as Bitcoin are now also seeing more usage
because of the crackdown on corruption in many countries. Both India and Venezuela banned
their highest-denomination, still-circulating bank notes in order to make it tougher to pay
bribes and make accumulated black money useless. But that also boosted the demand for
Bitcoin in such countries, enabling people to send and receive money without having to answer to
the authorities.
Bitcoin – Challenges
Purely because of their digital form, Bitcoin and other types of cryptocurrencies are nowadays a
favorite mode of payment for hackers and criminals because of the air of anonymity they
lend. This instantly makes the general populace wary of using them. In 2014, Mt. Gox, the largest
Bitcoin exchange, was hacked and robbed of almost $69 million, thereby bankrupting the whole
exchange. While the people who lost money have now been paid back, it still leaves a lot of
people wary of the same thing happening again.
The cryptocurrency community is up in arms over how the blockchain will be upgraded for
future users. As the time and fees required for verifying a transaction climb to record highs,
more businesses are having a tough time accepting Bitcoins for payment. In early 2017, more
than 50 companies came together to speed up transactions, but till now the results have not yet
been felt. As a result, more users might start using normal modes of currency to overcome such
blockchain hassles.
Today, Bitcoin is not the only game in town, and while its value has increased by almost 100%
since the beginning of 2016, its share of the digital currency pile is rapidly reducing owing to
almost 700 different competitors. Its market share has reduced to 50% from 85% a year before, a
sign of the times to come.
Unrecognized by Governments
Most of the general populace doesn't understand Bitcoin, and neither do most of the world's
governments. The cost of gaining a license to set up cryptocurrency companies is sky-high, and
there are no regulations in sight which might make it easier for people looking to invest in
them. The U.S. Securities and Exchange Commission recently rejected a proposal to
run a publicly traded fund based on the digital currency, which in turn led to a big plummet in
Bitcoin's price.
In India, the government does not consider cryptocurrencies legal tender or coin and will take all
measures to eliminate use of these crypto-assets in financing illegitimate activities or as part of
the payment system. However, the government has recognized blockchain and will explore use
of blockchain technology proactively for ushering in digital economy. However, it doesn’t mean
that holding cryptocurrencies such as bitcoin is illegal or banned. The price of bitcoin peaked
above $18,500 in December 2017. Since then the price has dropped, to $8,500 in February 2018.
As of 13 April 2018, the value of 1 bitcoin was Rs 5,14,500. Bitcoin prices have been extremely
volatile in the past few months. Mint Money does not recommend investing in bitcoin or other
cryptocurrencies.
Right now, the general understanding of the term Bitcoin in India is vague. There are a lot of
people in India who are intrigued by the technology but don’t understand it well enough. With
India’s leading cryptocurrency exchanges such as Zebpay, Unocoin, Coinsecure, Coinome and
Bitxoxo among others reporting a marked increase in user interest every day, IAMAI has been
focusing on increasing user awareness through outreach programs such as educational videos and
reading material, becoming one of the first industry bodies in the world to do so.
Many have viewed the rise of bitcoin and other cryptocurrencies as a massive bubble similar to
the dotcom and other bubbles in history which saw asset prices increase without any fundamental
reason. Goldman Sachs, for instance, warned investors in February that most cryptocurrency
prices are headed to zero as they lack intrinsic value. So, to sceptics, the crash now will likely
vindicate a belief that markets eventually mark down the prices of assets that have no real value,
to zero. Cryptocurrency enthusiasts, on the other hand, view the crash as just another healthy
correction that is part of any asset’s rise over the long run. In fact, they point to similar steep
crashes in the price of cryptocurrencies in the past that turned out to be short-lived. Thus they see
the present crash as a good chance to buy cryptocurrencies cheaply before their prices begin to
rise again.
MODULE 2
The opportunity for the sector is to unlock the potential in the data through analytics and shape
the strategy for business through reliable factual insight rather than intuition. Recognising that
data is a significant corporate asset, a number of organisations are appointing chief data officers
following pressure to police the integrity of the numbers as last year's Economist Special Report,
"Data, Data Everywhere", put it. This pressure is driven by business leaders wanting more
consistency in information and by regulators becoming increasingly concerned at the quality of
data they receive during a time when regulatory demands are growing. This is made clear by the
increasing number of references to integrity of data in each new regulatory requirement. For
example, Solvency II requires insurers to have "internal processes and procedures in place to
ensure the appropriateness, completeness and accuracy of the data". These processes and
procedures could (and usually do) involve technology, but should also include data policies,
standards, roles and responsibilities to ensure data integrity is appropriately governed.
While it is crucial to ensure the integrity of data provided to executive management and
regulators, unlocking the insights in the data to better understand customers, competitors and
employees represents a significant opportunity to gain competitive advantage. While regulatory
pressure is forcing organisations to improve the integrity of the data, many financial institutions
are seeing improved data quality and the use of analytics as an opportunity to fundamentally
change the way decisions are made and to use the data for commercial gain.
Much of the current debate around big data is locked in technological advancements. This misses
the point that the real strategic value in the data is the insight it can give into what will happen in
the future. Predicting how customers and competitors' customers will behave and how that
behaviour will change is critical to tailoring and pricing products. Big data should be about
changing the way you do business to harness the real value in your data, re-shape the
interaction with the market and increase the lifetime value of your customers. Therefore, which
data is required to achieve these objectives, who needs it and how often are key pieces of the big
data puzzle.
Big data should also involve using multiple data sources, internally and externally. Geo-spatial
data, social media, voice, video and other unstructured data all have their part to play in knowing
the customer today and their future behaviours. For example, leading firms are looking at using
both internal and external data, both structured and unstructured, to develop personalised
banking products. Customers are more likely to be attracted and retained with personalised
products - hence, lifetime value goes up. Similarly, analytics have an increasingly important part
to play in the recovery of bad debt. Recoveries functions typically target based on the
delinquency status of the account. However, a better understanding of customer circumstances
can improve targeting and have an immediate impact on recovery rates while also reducing cost.
There is no doubt that harnessing the power of big data can enhance organisational performance.
However, it is not a technological question. It is a strategic one about how an organisation
derives genuine insight from their data and changes the way they interact with customers,
competitors and the market through fact-driven decision-making. Those organisations that
master this will set the trend in customer service, improve profitability and respond more rapidly
to the evolving regulatory and competitive demands of the industry.
Beyond the obvious sales and lead generation applications, marketing analytics can offer
profound insights into customer preferences and trends. Despite these compelling benefits, a
majority of organizations fail to ever realize the promises of marketing analytics. According to a
survey of senior marketing executives published in the Harvard Business Review, "more than
80% of respondents were dissatisfied with their ability to measure marketing ROI."
However, with the advent of search engines, paid search marketing, search engine optimization,
and powerful new software products from WordStream, marketing analytics is more powerful
and easier to implement than ever.
The importance of marketing analytics is obvious: if something costs more than it returns, it's not
a good long-term business strategy. In a 2008 study, the Lenskold Group found that "companies
making improvements in their measurement and ROI capabilities were more likely to report
outgrowing competitors and a higher level of effectiveness and efficiency in their marketing."
Simply put: Knowledge is power.
In search marketing in particular, one of the most powerful marketing performance metrics
comes in the form of keywords. Keywords tell you exactly what is on the mind of your current
and potential customers. In fact, the most valuable long-term benefit of engaging in paid and
natural search marketing isn't incremental traffic to your website, it's the keyword data
contained within each click, which can be utilized to inform and optimize other business
processes.
Product Design: Keywords can reveal exactly what features or solutions your customers
are looking for.
Customer Surveys: By examining keyword frequency data you can infer the relative
priorities of competing interests.
Industry Trends: By monitoring the relative change in keyword frequencies you can
identify and predict trends in customer behavior.
Customer Support: Understand where customers are struggling the most and how
support resources should be deployed.
Reports and information received from search marketing help in all areas of your business,
including offline revenue and product development.
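The keyword-frequency idea can be sketched in a few lines of Python; the search queries below are hypothetical.

from collections import Counter

# Hypothetical search queries that brought visitors to the site
queries = [
    "cheap running shoes", "waterproof running shoes", "running shoes for flat feet",
    "waterproof hiking boots", "running shoes sale", "trail running shoes",
]

# Keyword frequency: which words are on customers' minds?
words = Counter(word for q in queries for word in q.split())
print(words.most_common(5))   # e.g. reveals interest in "waterproof" as a product feature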
When implementing your search efforts, be sure to keep these five tips in mind.
Pricing Analytics
Pricing Analytics enables companies, across all industries, to dramatically improve
profitability & market share by defining optimal prices & pricing strategy.
It leverages data to understand what drives your customers’ buying decisions and
integrates this knowledge to meet your pricing needs.
Research shows that price management initiatives can increase a company’s margins by 2
to 7 percent in 12 months—yielding an ROI between 200 and 350 %.
Combine pricing with analytics, and you can create a mechanism that functions as both a catalyst
and a metrics engine for managing profitability.
Pricing analytics can also help executives more clearly understand both the internal and external
factors affecting profitability at a granular level.
More effective business decisions on tightly defined issues, such as which customers to
focus on and which products to rationalize, using hard data rather than gut instincts.
Clear feedback loops so pricing teams can assess effectiveness and adjust as needed.
The ability to run advanced scenario modeling to avoid costly mistakes or missed
opportunities—for example, the potential impact of a pricing change on demand and
profitability.
Nowadays, analytics and “big data” are everywhere: hardly a single day passes without hearing
about them in the news. However beyond the “buzz”, a lot of businesses struggle to find any
practical use for their data. As global pricing consultants delivering analytics services and tools,
we are often asked the same question: “what will I get out of your science?” Over the years, we
have identified five situations in which a business needs to make use of its data and implement
an analytical process.
In spite of the increased availability of information, a great many companies are still blind (and
deaf) when it comes to knowing their customers. For those companies, simple descriptive and
diagnostic tools, including customer segmentation, can help dramatically improve performance.
One of our clients in manufacturing saw its margins improving by 4% by simply aligning its
pricing (and especially its discounting) along the customer segments we had identified for them.
A good understanding of the reasons behind past performance can sometimes go a long way in
improving the present.
The implementation of pricing analytics tools often leads our clients to uncover “quick wins”, or
extra revenue and/or margin that can be generated over a short period of time by fixing the most
obvious cases of price misalignment or leakages. In a recent project for a client in electronics,
our team uncovered over $20 million of such “low hanging fruits” by highlighting a few large
but highly unprofitable accounts, and realigning the discounts they were being offered. The
benefit to our clients is obvious. Beyond the extra money generated in the short term, the quick
wins are very often the first building block of a longer term effort to realign prices and increase
margins.
The two needs described above are typically expressed by companies without a well-defined
pricing strategy. Companies with a pricing strategy already in place have different needs,
typically revolving around the necessity to closely monitor the market and to anticipate the
impact of a price change or a promotional campaign. Those companies usually require predictive
models, capable of reproducing what actually happened in the past, and hence to predict what
will happen in the future with great accuracy. Even though the benefits of such tools are less
obvious to measure than those stemming from the diagnostic tools described above, they
shouldn’t be underestimated. For instance, one of our clients in the consumer packaged goods
industry uses a predictive model to plan its price changes and promotions. Using the model over
time, they have learned a lot about what to do and even more about what NOT to do. Ultimately,
this saved the company millions of dollars by putting an end to inefficient promotions.
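A rough sketch of the kind of predictive pricing model described above, on made-up weekly data: a log-log regression estimates price elasticity, which can then be used to anticipate the impact of a price change.

import numpy as np

# Hypothetical weekly observations: price charged and units sold
price = np.array([10.0, 10.5, 11.0, 9.5, 12.0, 11.5, 9.0, 10.0])
units = np.array([520, 495, 470, 560, 430, 450, 600, 530])

# Log-log regression: log(units) = a + b*log(price); the slope b is the price elasticity
b, a = np.polyfit(np.log(price), np.log(units), 1)
print(f"Estimated price elasticity: {b:.2f}")

# What-if: predicted weekly demand at a new price of 11.25
new_price = 11.25
predicted_units = np.exp(a + b * np.log(new_price))
print(f"Predicted weekly units at {new_price}: {predicted_units:.0f}")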
4. Optimizing Pricing
Building upon the predictive pricing models they use to manage their prices and promotions,
some companies take it to the next level and implement full-fledged profit optimization tools.
However, it should be noted that the shift from being able to predict what will happen to actually
optimizing pricing is far from being straightforward. Companies that successfully manage the
shift typically have had a sound pricing strategy in place for several years and at the same time
excel at execution.
Generally speaking, this need applies to all companies, regardless of their industry, size or the
degree of sophistication of their pricing strategy. As every pricing manager knows too well,
getting all the internal stakeholders (whether from sales, marketing, finance or even production)
to agree upon the pricing strategy and execute it accordingly, can prove to be a major challenge.
The very fact of backing the strategy itself with “cold hard facts” and a scientifically robust
analysis of the data can help facilitate this process and get approval from the various
stakeholders more easily.
Retail analytics focuses on providing insights related to sales, inventory, customers, and other
important aspects crucial for merchants’ decision-making process.
The discipline encompasses several granular fields to create a broad picture of a retail business's
health and sales, alongside overall areas for improvement and reinforcement. Essentially, retail
analytics is used to help make better choices, run businesses more efficiently, and deliver
improved customer service.
The field of retail analytics goes beyond superficial data analysis, using techniques like data
mining and data discovery to sanitize datasets to produce actionable BI insights that can be
applied in the short-term.
Moreover, companies use these analytics to create better snapshots of their target demographics.
By harnessing sales data, retailers can identify their ideal customers according to diverse
categories such as age, preferences, buying patterns, location, and more.
Essentially, the field is focused not just on parsing data, but also defining what information is
needed, how best to gather it, and most importantly, how it will be used.
By prioritizing retail analytics basics that focus on the process and not exclusively on data itself,
companies can uncover stronger insights and be in a more advantageous position to succeed
when attempting to predict business and consumer needs.
In addition, they can optimize inventory management to emphasize products customers need,
reducing wasted space and associated overhead costs.
Apart from inventory activities, many retailers use analytics to identify customer trends and
changing preferences by combining data from different areas. By merging sales data with a
variety of factors, businesses can identify emerging trends and anticipate them better. This is
closely tied to marketing functions, which also benefit from analytics.
Companies can harness retail analytics to improve their marketing campaigns by building an
improved understanding of individual preferences and gleaning more granular insights. By
blending demographic data with information such as shopping habits, preferences, and purchase
history, companies can create strategies that focus on individuals and exhibit higher success
rates.
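A small segmentation sketch using k-means on hypothetical customer attributes; in practice the inputs would come from sales, loyalty-card and demographic data.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer features: age, average basket value, visits per month
customers = np.array([
    [22, 18, 2], [25, 22, 3], [31, 65, 1], [45, 70, 1],
    [52, 30, 6], [48, 28, 8], [60, 90, 1], [35, 40, 4],
])

X = StandardScaler().fit_transform(customers)   # put features on a comparable scale
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("Segment label per customer:", segments)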
• Within each company, the supply chain includes all functions involved in fulfilling a
customer request ( product development, marketing, operations, distribution, finance,
customer service)
Lack of real-time data visibility, with no common view across all businesses and
channels.
Irregular reviews of safety stock levels, causing frequent stock-outs or excess inventory.
Lack of flexibility in the network and distribution footprint, so that decision-makers find
it difficult to prioritize between cost to serve and customer service levels, resulting in
lower profitability.
Production line imbalance and suboptimal batch sizes, creating asset underutilization.
Inventory Optimization
Managing near-term inventory needs is relatively straightforward, but this gets
progressively more complex the farther you plan into the future.
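The safety-stock point above can be made concrete with the standard textbook reorder-point formula (not taken from these notes); all the numbers are hypothetical.

import math
from scipy.stats import norm

avg_daily_demand = 120        # units per day (assumed)
std_daily_demand = 30         # demand variability (assumed)
lead_time_days = 5
service_level = 0.95          # target probability of not stocking out

z = norm.ppf(service_level)                                     # safety factor for 95% service
safety_stock = z * std_daily_demand * math.sqrt(lead_time_days)
reorder_point = avg_daily_demand * lead_time_days + safety_stock

print(f"Safety stock  : {safety_stock:.0f} units")
print(f"Reorder point : {reorder_point:.0f} units")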
Transportation Analytics
Cost reduction and increased efficiency through more effective use of your
human resources and assets.
HR ANALYTICS
Human resource analytics (HR analytics) is an area in the field of analytics that refers to
applying analytic processes to the human resource department of an organization in the hope of
improving employee performance and therefore getting a better return on investment. HR
analytics does not just deal with gathering data on employee efficiency. Instead, it aims to
provide insight into each process by gathering data and then using it to make relevant decisions
about how to improve these processes.
What HR analytics does is correlate business data and people data, which can help establish
important connections later on. The key aspect of HR analytics is to provide data on the impact
the HR department has on the organization as a whole. Establishing a relationship between what
HR does and business outcomes – and then creating strategies based on that information – is
what HR analytics is all about.
HR has core functions that can be enhanced by applying analytics processes. These are
acquiring, optimizing, paying and developing the workforce of the organization. HR analytics
can help to dig into problems and issues surrounding these requirements, and using analytical
workflow, guide the managers to answer questions and gain insights from information at hand,
then make relevant decisions and take appropriate actions.
What is HR analytics?
Human Resource analytics (HR analytics) is about analyzing an organization's people problems.
For example, can you answer the following questions about your organization?
Do you know which employees will be the most likely to leave your company within a year?
You can only answer these questions when you use HR data. Most HR professionals can easily
answer the first question. However, answering the second question is harder. To answer the
second question, you'd need to combine two different data sources. To answer the third one,
you'd need to analyze your HR data. HR departments have long been collecting vast amounts of
HR data. Unfortunately, this data often remains unused. As soon as organizations start to analyze
their people problems by using this data, they are engaged in HR analytics.
“HR analytics is the systematic identification and quantification of the people drivers of business
outcomes (Heuvel & Bondarouk, 2016).”
In other words, it is a data-driven approach towards HR. Over the past 100 years, Human
Resource Management has changed. It has moved from an operational discipline towards a more
strategic discipline. The popularity of the term Strategic Human Resource Management (SHRM)
exemplifies this. The data-driven approach that characterizes HR analytics is in line with this
development. By using HR analytics you don’t have to rely on gut feeling anymore. Analytics
enables HR professionals to make data-driven decisions. Furthermore, analytics helps to test the
effectiveness of HR policies and different interventions. By the way, HR analytics is very similar
to people analytics but there are some subtle differences in how the terms are used.
Like marketing analytics has changed the field of marketing, HR analytics is changing HR. It
enables HR to:
Today, the majority of HR departments focus on reporting employee data. This doesn’t suffice in
today’s data-driven economy.
Just keeping records is often insufficient to add strategic value. In the words of Carly Fiorina:
“The goal is to turn data into information and information into insight”. This also applies to HR.
Doing this enables HR to become more involved in decision-making on a strategic level. A few
examples:
To get started with HR analytics, you need to combine data from different HR systems. Say you
want to measure the impact of employee engagement on financial performance. To measure this
relationship, you need to combine your annual engagement survey with your performance data.
This way you can calculate the impact of engagement on the financial performance of different
stores and departments.
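A minimal sketch of the engagement-versus-performance analysis described above, assuming two small hypothetical data sets keyed by store.

import pandas as pd

# Hypothetical annual engagement survey results per store
engagement = pd.DataFrame({
    "store": ["A", "B", "C", "D", "E"],
    "engagement_score": [7.8, 6.1, 8.4, 5.9, 7.2],
})

# Hypothetical financial performance per store
performance = pd.DataFrame({
    "store": ["A", "B", "C", "D", "E"],
    "revenue_growth_pct": [4.2, 1.1, 5.0, 0.4, 3.1],
})

# Combine the two data sources and measure the relationship
merged = engagement.merge(performance, on="store")
print(merged["engagement_score"].corr(merged["revenue_growth_pct"]))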
Key HR areas will change based on the insights gained from HR analytics. Functions like
recruitment, performance management, and learning & development will change.
Imagine that you can calculate the business impact of your learning and development budget! Or
imagine that you can predict which new hires will become your highest performers in two years.
Or that you can predict which new hires will leave your company in the first year. Having this
information will change your hiring & selection procedures and decisions.
Talent Analytics
Talent analytics is a science that focuses on the current/historical state of the workforce in the
organization. It is a cross-sectional representation of how many people work for you, who your
best performers are, and which departments are functioning most effectively, and it leverages this
current/historical state of the workforce to address deeper HR issues, taking a predictive view on
issues like predicting and controlling attrition and predicting good-quality hires in the
organization, thus improving effectiveness in recruitment.
On the other hand, workforce planning (WFP) looks at the current and the future headcount
requirements in terms of both capacity and capability, required to meet an organization’s current
and future business plans. Workforce plans look at external and internal factors influencing the
need for the number of people and the skill set required by the organization into the future.
Talent analytics is an analytics platform that produces insights into the workforce — into the
potential hiring pool and into your existing team members. These insights are used to create a
better understanding of the strengths of employees and potential employees, their weaknesses
and how these can be improved.
Each suite, of course, has its individual stamp in terms of organization and what its features are
called. That said, the functionality in a talent analytics application can be expected to be divided
roughly into three categories, says Xavier Parkhouse-Parker, co-founder and CEO of PLATO
Intelligence.
Hiring Analytics - Hiring analytics provide insights into prospective hires by analyzing their
skills. It also guides the company into making an impartial decision based on the data. One hot
topic in the industry right now is bias, Parkhouse-Parker says, and it is argued that hiring
analytics can help stem or even eliminate this from the hiring process.
Ongoing Feedback Analytics - Ongoing feedback analytics focuses on the existing workforce,
determining whether the teams in the company are performing well, whether they have the right
skill set and the right talent in the right places.
Optimization Analytics - Optimization analytics marries the data and predictions from hiring
analytics and ongoing feedback analytics to ensure the company has what it needs to make its
internal processes as robust as possible.
Examples…
One typical feature in these applications is called automated resume screening, Downes says. It
works like this: the candidate uploads his/her resume and the people analytics app uses a
machine learning algorithm that reads the words on the resume and scores the candidate on how
well he or she will do in the job. The score is based on all kinds of seemingly disparate data and
insights the system has gleaned.
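A toy version of the resume-screening idea, assuming a tiny hand-labelled data set: resumes are vectorised with TF-IDF and a logistic regression produces a suitability score. Real systems are far more elaborate, and raise the bias questions noted earlier.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled resumes: 1 = performed well in the role, 0 = did not
resumes = [
    "python sql machine learning analytics dashboards",
    "retail cashier customer service inventory",
    "statistics forecasting python data visualisation",
    "warehouse logistics forklift operations",
]
labels = [1, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(resumes, labels)

# Score a new candidate's resume (probability of the 'performed well' class)
new_resume = ["excel python analytics reporting"]
print("Suitability score:", round(model.predict_proba(new_resume)[0, 1], 2))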
There is, in fact, a wealth of examples of what this application can do. According to a post by
Deloitte Insights, these data-driven tools can predict patterns of fraud, show trust networks, show
real-time correlations between coaching and engagement, and analyze employee patterns for
time management based on email and calendar data. They can also analyze video interviews and
help assess candidate honesty and personality through software.
Cloud computing
Simply put, cloud computing is the delivery of computing services—servers, storage, databases,
networking, software, analytics and more—over the Internet (“the cloud”). Companies offering
these computing services are called cloud providers and typically charge for cloud computing
services based on usage, similar to how you are billed for water or electricity at home.
Cloud computing is an information technology (IT) paradigm that enables ubiquitous access to
shared pools of configurable system resources and higher-level services that can be rapidly
provisioned with minimal management effort, often over the Internet. Cloud computing relies on
sharing of resources to achieve coherence and economies of scale, similar to a public utility.
Third-party clouds enable organizations to focus on their core businesses instead of expending
resources on computer infrastructure and maintenance. Advocates note that cloud computing
allows companies to avoid or minimize up-front IT infrastructure costs. Proponents also claim
that cloud computing allows enterprises to get their applications up and running faster, with
improved manageability and less maintenance, and that it enables IT teams to more rapidly
adjust resources to meet fluctuating and unpredictable demand. Cloud providers typically use a
"pay-as-you-go" model, which can lead to unexpected operating expenses if administrators are
not familiar with cloud-pricing models.
Since the launch of Amazon EC2 in 2006, the availability of high-capacity networks, low-cost
computers and storage devices as well as the widespread adoption of hardware virtualization,
service-oriented architecture, and autonomic and utility computing has led to growth in cloud
computing.
Characteristics
Cloud computing exhibits the following key characteristics:
Agility for organizations may be improved, as cloud computing may increase users'
flexibility with re-provisioning, adding, or expanding technological infrastructure
resources.
Cost reductions are claimed by cloud providers. A public-cloud delivery model converts
capital expenditures (e.g., buying servers) to operational expenditure. This purportedly
lowers barriers to entry, as infrastructure is typically provided by a third party and need
not be purchased for one-time or infrequent intensive computing tasks.
Device and location independence enable users to access systems using a web browser
regardless of their location or what device they use (e.g., PC, mobile phone). As
infrastructure is off-site (typically provided by a third-party) and accessed via the
Internet, users can connect to it from anywhere.
Multitenancy enables sharing of resources and costs across a large pool of users thus
allowing for:
o peak-load capacity increases (users need not engineer and pay for the resources
and equipment to meet their highest possible load-levels)
o utilisation and efficiency improvements for systems that are often only 10–20%
utilised.
Performance is monitored by IT experts from the service provider, and consistent and
loosely coupled architectures are constructed using web services as the system interface.
Resource pooling means the provider's computing resources are pooled to serve multiple
consumers using a multi-tenant model, with different physical and virtual resources
dynamically assigned and reassigned according to user demand. There is a sense of
location independence in that the consumer generally has no control over or knowledge of
the exact location of the provided resources.
Productivity may be increased when multiple users can work on the same data
simultaneously, rather than waiting for it to be saved and emailed. Time may be saved as
information does not need to be re-entered when fields are matched, nor do users need to
install application software upgrades to their computer.
Reliability improves with the use of multiple redundant sites, which makes well-designed
cloud computing suitable for business continuity and disaster recovery.
1. Cost
Cloud computing eliminates the capital expense of buying hardware and software and setting up
and running on-site datacenters—the racks of servers, the round-the-clock electricity for power
and cooling, the IT experts for managing the infrastructure. It adds up fast.
2. Speed
Most cloud computing services are provided self-service and on demand, so even vast amounts
of computing resources can be provisioned in minutes, typically with just a few mouse clicks,
giving businesses a lot of flexibility and taking the pressure off capacity planning.
3. Global scale
The benefits of cloud computing services include the ability to scale elastically. In cloud speak,
that means delivering the right amount of IT resources—for example, more or less computing
power, storage, bandwidth—right when it's needed and from the right geographic location.
4. Productivity
On-site datacenters typically require a lot of “racking and stacking”—hardware set up, software
patching and other time-consuming IT management chores. Cloud computing removes the need
for many of these tasks, so IT teams can spend time on achieving more important business goals.
5. Performance
The biggest cloud computing services run on a worldwide network of secure datacenters, which
are regularly upgraded to the latest generation of fast and efficient computing hardware. This
offers several benefits over a single corporate datacenter, including reduced network latency for
applications and greater economies of scale.
6. Reliability
Cloud computing makes data backup, disaster recovery and business continuity easier and less
expensive, because data can be mirrored at multiple redundant sites on the cloud provider’s
network.
Most cloud computing services fall into three broad categories: infrastructure as a service
(IaaS), platform as a service (PaaS) and software as a service (SaaS). These are sometimes
called the cloud computing stack, because they build on top of one another. Knowing what
they are and how they are different makes it easier to accomplish your business goals.
Infrastructure-as-a-service (IaaS)
The most basic category of cloud computing services. With IaaS, you rent IT infrastructure—
servers and virtual machines (VMs), storage, networks, operating systems—from a cloud
provider on a pay-as-you-go basis.
Public cloud
Public clouds are owned and operated by a third-party cloud service provider, which delivers
its computing resources, like servers and storage, over the Internet. Microsoft Azure is an
example of a public cloud. With a public cloud, all hardware, software and other supporting
infrastructure is owned and managed by the cloud provider. You access these services and
manage your account using a web browser.
Private cloud
A private cloud refers to cloud computing resources used exclusively by a single business or
organisation. A private cloud can be physically located on the company’s on-site datacenter.
Some companies also pay third-party service providers to host their private cloud. A private
cloud is one in which the services and infrastructure are maintained on a private network.
Hybrid cloud
Hybrid clouds combine public and private clouds, bound together by technology that allows
data and applications to be shared between them. By allowing data and applications to move
between private and public clouds, hybrid cloud gives businesses greater flexibility and more
deployment options.
MODULE 3
Measure of Central Tendency
A measure of central tendency is a single value that attempts to describe a set of data by
identifying the central position within that set of data. As such, measures of central tendency
are sometimes called measures of central location. The mean, median and mode are all valid
measures of central tendency.
Mean
The formula to find the sample mean (x̄) for an individual series is: x̄ = (Σ xi) / n.
Mode
The mode is the number that appears most often in a set of numbers. The mode is the most
frequently occurring value. E.g., when investigating the wages and cost of living of a group,
the mode tells you the wage earned by the largest number of people.
Median
The median is a number in statistics that tells you where the middle of a data set is. The
median formula is {(n + 1) ÷ 2}th item for individual and discrete series, where “n” is the
number of items in the set.
For continuous series, find N/2 and median class. Median class corresponds to the
cumulative frequency which includes N/2. Then calculate
Median = l1 + (((N/2) - cf)/f) × c ; where l1 is the lower limit of the median class, 'cf' is the
cumulative frequency of the class just preceding the median class, 'f' is the frequency of
the median class, and 'c' is the class interval of the median class.
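A quick check of these measures in Python using the standard library; the data values are hypothetical, and the grouped-median function follows the continuous-series formula above.

import statistics

data = [12, 15, 11, 15, 18, 20, 15, 13, 17, 15]

print("Mean  :", statistics.mean(data))      # (Σ xi) / n
print("Median:", statistics.median(data))    # middle value of the ordered data
print("Mode  :", statistics.mode(data))      # most frequently occurring value

# Median of a continuous (grouped) series: median = l1 + ((N/2 - cf) / f) * c
def grouped_median(l1, N, cf, f, c):
    return l1 + ((N / 2 - cf) / f) * c

# Hypothetical grouped data where the class 20-30 contains the (N/2)-th observation
print("Grouped median:", grouped_median(l1=20, N=50, cf=18, f=12, c=10))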
Measures of Dispersion
In statistics, dispersion (also called variability, scatter, or spread) is the extent to which a
distribution is stretched or squeezed. Common examples of measures of statistical dispersion
are the variance, standard deviation, and interquartile range.
RANGE
It is the difference between the highest and the lowest values in a series.
Quartile deviation(QD) is defined as half the distance between the third and the first
quartiles.
QD = (Q3-Q1)/2
Coefficient of QD = ((Q3-Q1)/(Q3+Q1))
MEAN DEVIATION
Mean Deviation is defined as the arithmetic mean of deviations of all the values in a
series from their average, counting all such deviations as positive.
STANDARD DEVIATION
It is the square root of the mean of the squares of the deviations of all values of a series
from their Arithmetic mean.
σ = √( ∑(X - X̄)² / n )
Variance = (σ )2
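The dispersion measures can be verified the same way; note that the formula above is the population standard deviation (dividing by n). The data are hypothetical.

import numpy as np

data = np.array([12, 15, 11, 15, 18, 20, 15, 13, 17, 15])

data_range = data.max() - data.min()                   # highest minus lowest value
q1, q3 = np.percentile(data, [25, 75])
quartile_deviation = (q3 - q1) / 2                     # QD = (Q3 - Q1) / 2
mean_deviation = np.mean(np.abs(data - data.mean()))   # mean of absolute deviations from the mean
std_dev = data.std()                                   # population σ = √(Σ(X - X̄)² / n)
variance = data.var()                                  # σ²

print(data_range, quartile_deviation, mean_deviation, round(std_dev, 2), round(variance, 2))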
Karl Pearson coefficient of correlation
The Pearson correlation coefficient, often referred to as the Pearson R test, is a statistical
formula that measures the strength of the linear relationship between two variables. To determine how
strong the relationship is between two variables, you need to find the coefficient value, which
can range between -1.00 and 1.00.
The coefficient of correlation formula can also be expressed in the following form:
r = [ n∑xy − (∑x)(∑y) ] / [ √(n∑x² − (∑x)²) × √(n∑y² − (∑y)²) ]
(Spearman's rank correlation coefficient can be computed using the popular formula
rs = 1 − (6∑d²) / (n(n² − 1)) only if all n ranks are distinct integers.)
In simple linear regression a single independent variable is used to predict the value of a
dependent variable.
In multiple linear regression two or more independent variables are used to predict the
value of a dependent variable. The difference between the two is the number of
independent variables.
The regression equation that expresses the linear relationship between the dependent
variable and one or more independent variables is:
ŷ = b0 + b1x1 + b2x2 + … + bkxk
In this equation, ŷ is the predicted value of the dependent variable. Values of the k
independent variables are denoted by x1, x2, x3, … , xk.
And finally, we have the b's - b0, b1, b2, … , bk. The b's are constants, called regression
coefficients.
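A short sketch that computes Pearson's r with the raw-score formula above and fits a simple linear regression; the x and y values are hypothetical.

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.1])
n = len(x)

# Pearson's r from the raw-score formula given above
num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
den = np.sqrt(n * np.sum(x**2) - np.sum(x)**2) * np.sqrt(n * np.sum(y**2) - np.sum(y)**2)
r = num / den
print("Pearson r:", round(r, 3))            # matches np.corrcoef(x, y)[0, 1]

# Simple linear regression: y-hat = b0 + b1 * x
b1, b0 = np.polyfit(x, y, 1)
print(f"y-hat = {b0:.2f} + {b1:.2f} x")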
Artificial Intelligence
AI is a branch of computer science that deals with the study and creation of computer
systems that exhibit some form of intelligence
1. Error Reduction: Artificial intelligence helps us reduce errors and increases the chance of
reaching accuracy with a greater degree of precision. It is applied in various fields such as the
exploration of space. Intelligent robots are fed with information and are sent to explore space.
Since they are machines with metal bodies, they are more resistant and have a greater ability
to endure space and its hostile atmosphere. They are created and acclimatized in such a way
that they cannot be modified, disfigured or broken down by the hostile environment.
2. Difficult Exploration: Artificial intelligence and the science of robotics can be put to use in
mining and other fuel exploration processes. Not only that, these complex machines can be
used for exploring the ocean floor and hence overcome the human limitations.
Due to the programming of the robots, they can perform more laborious and hard work with
greater responsibility. Moreover, they do not wear out easily.
3. Daily Application: Computed methods for automated reasoning, learning and perception
have become a common phenomenon in our everyday lives. We have our lady Siri or
Cortana to help us out. We are also hitting the road for long drives and trips with the help of
GPS. Smartphones are an apt everyday example of how we use artificial
intelligence. In utilities, we find that they can predict what we are going to type and correct
the human errors in spelling. That is machine intelligence at work. When we take a picture,
the artificial intelligence algorithm identifies and detects the person’s face and tags the
individuals when we are posting our photographs on the social media sites. Artificial
Intelligence is widely employed by financial institutions and banking institutions to organize
and manage data. Detection of fraud uses artificial intelligence in a smart card based system.
4. Digital Assistants: Highly advanced organizations use ‘avatars’ which are replicas or
digital assistants who can actually interact with the users, thus saving the need for human
resources. For artificial thinkers, emotions do not come in the way of rational thinking and are not a
distraction at all. The complete absence of an emotional side makes robots think
logically and take the right decisions. Emotions are associated with moods that can
cloud judgment and affect human efficiency. This is completely ruled out for machine
intelligence.
5. Repetitive Jobs: Repetitive jobs which are monotonous in nature can be carried out with
the help of machine intelligence. Machines think faster than humans and can be put to multi-
tasking. Machine intelligence can be employed to carry out dangerous tasks. Their
parameters, unlike humans, can be adjusted. Their speed and time are calculation based
parameters only. When humans play a computer game or run a computer-controlled robot,
we are actually interacting with artificial intelligence. In the game we are playing, the
computer is our opponent. The machine intelligence plans the game movement in response to
our movements. We can consider gaming to be the most common use of the benefits of
artificial intelligence.
6. Medical Applications: In the medical field also, we will find the wide application of AI.
Doctors assess the patients and their health risks with the help of artificial machine
intelligence. It educates them about the side effects of various medicines. Medical
professionals are often trained with the artificial surgery simulators. It finds a huge
application in detecting and monitoring neurological disorders as it can simulate the brain
functions. Robotics is used often in helping mental health patients to come out of depression
and remain active. A popular application of artificial intelligence is radiosurgery.
Radiosurgery is used in operating on tumours and can actually help carry out the operation without
damaging the surrounding tissues.
7. No Breaks: Machines, unlike humans, do not require frequent breaks and refreshments.
They are programmed for long hours and can continuously perform without getting bored or
distracted or even tired.
1. High Cost: Creation of artificial intelligence requires huge costs, as these are very complex
machines. Their repair and maintenance also involve huge costs. They have software programs
which need frequent upgrades to cater to the needs of the changing environment and the
need for the machines to be smarter by the day. In the case of severe breakdowns, the
procedure to recover lost code and reinstate the system may require huge time and cost.
4. No Original Creativity: While they can help you design and create, they are no match for
the power of thinking that the human brain has or even the originality of a creative mind.
Human beings are highly sensitive and emotional intellectuals. They see, hear, think and feel.
Their thoughts are guided by feelings, which machines completely lack. The inherent
intuitive abilities of the human brain cannot be replicated.
Identifying and studying the risk of artificial intelligence is a very important task at hand.
This can help in resolving the issues at hand. Programming errors or cyber attacks need more
dedicated and careful research. Technology companies and the technology industry as a whole
need to pay more attention to the quality of the software. Everything that has been created in
this world and in our individual societies is the continuous result of intelligence. Artificial
intelligence augments and empowers human intelligence. So as long we are successful in
keeping technology beneficial, we will be able to help this human civilization.
MODULE 4
Decision Theory
Decision making is the process of making choices by identifying a decision, gathering
information, and assessing alternative resolutions.
Decision making is a daily activity for any human being. There is no exception about that. When
it comes to business organizations, decision making is a habit and a process as well.
Effective and successful decisions bring profit to the company, and unsuccessful ones cause
losses. Therefore, the corporate decision-making process is the most critical process in any
organization.
In the decision making process, we choose one course of action from a few possible alternatives.
In the process of decision making, we may use many tools, techniques and perceptions.
In addition, we may make our own private decisions or may prefer a collective decision.
Usually, decision making is hard. The majority of corporate decisions involve some level of
dissatisfaction or conflict with another party.
Step 1: Identification of the purpose of the decision
In this step, the problem is thoroughly analysed. There are a couple of questions one should ask
when it comes to identifying the purpose of the decision.
Step 2: Information gathering
A problem of an organization will have many stakeholders. In addition, there can be dozens of
factors involved in and affected by the problem.
In the process of solving the problem, you will have to gather as much information as possible
related to the factors and stakeholders involved in the problem. For the process of information
gathering, tools such as 'Check Sheets' can be effectively used.
Step 3: Principles for judging the alternatives
In this step, the baseline criteria for judging the alternatives should be set up. When it comes to
defining the criteria, organizational goals as well as the corporate culture should be taken into
consideration.
As an example, profit is one of the main concerns in every decision-making process. Companies
usually do not make decisions that reduce profits, unless it is an exceptional case. Likewise,
baseline principles should be identified related to the problem in hand.
Step 4: Brainstorm and analyse the choices
For this step, brainstorming to list down all the ideas is the best option. Before the idea
generation step, it is vital to understand the causes of the problem and the prioritization of causes.
For this, you can make use of Cause-and-Effect diagrams and the Pareto Chart tool. A Cause-and-
Effect diagram helps you to identify all possible causes of the problem, and a Pareto chart helps
you to prioritize and identify the causes with the highest effect.
Then, you can move on to generating all possible solutions (alternatives) for the problem in hand.
Step 5: Evaluation of alternatives
Use your judgement principles and decision-making criteria to evaluate each alternative. In this
step, experience and the effectiveness of the judgement principles come into play. You need to
compare each alternative for its positives and negatives.
Step 6: Select the best alternative
Once you go through from Step 1 to Step 5, this step is easy. In addition, the selection of the best
alternative is an informed decision since you have already followed a methodology to derive and
select the best alternative.
Step 7: Execute the decision
Convert your decision into a plan or a sequence of activities. Execute your plan by yourself or
with the help of subordinates.
Step 8: Evaluate the results
Evaluate the outcome of your decision. See whether there is anything you should learn and then
correct in future decision making. This is one of the best practices that will improve your
decision-making skills.
In decision making, we may encounter times when we just do not have the required information
to make a decision and keep hesitating. Sometimes it is a decision for which we have a lot of
information and are very certain about the conditions. There are three types of decision-making
environment that we can classify. They are:
Decision Making under Certain Conditions
Decision making under certain conditions means that the person who makes the decision has all
the complete and necessary information needed to make it. With all the information available,
that person is able to predict what result each alternative will lead to. Being able to predict the
outcomes, we can easily make the decision with certainty: the alternative that gives the best
result will usually be chosen and carried out. If we are given two choices and the first choice is
better than the second, we can easily decide which option to choose.
Several examples of decision making under certain conditions are given below:
Example 1 (Personal)
Jason has just moved to KL for work and is looking for a place to rent. He has been offered two
alternative rental places, Rental A and Rental B. Rental A is near Jason's workplace and is
within walking distance. Rental B is a bit far from his workplace and requires him to drive to
work. Both rental fees are RM500 per room. With the complete information above, Jason is able
to make his decision under a very certain condition.
Example 2 (Consumer)
Shop ABC has just opened beside Shop DEF and is selling a wide range of daily consumer
products. Consumers around the area now have another place where they can shop. Shop ABC
has better overall prices compared to Shop DEF, and Shop ABC's employees are friendlier. Both
shops are open from 8am to 10pm. Besides that, Shop ABC has its own points reward card,
whereas Shop DEF does not have any special promotion. In the scenario above, consumers have
enough information to choose between these two places. This is a decision that will be made
under a certain condition.
Example 3 (Business)
ABC Company has been operating for many years and provides data entry as a service. The
printers that it uses are very outdated and faulty from time to time. The board of the company
decided to replace all the existing printers. Since there is a wide selection of brands to choose
from, they narrowed the options down to the two brands most popular in the market: Canon and
HP. With all the information easily collected from the printers' websites, they were able to draw
up a table comparing the pros and cons and make the final decision on which brand to purchase.
With all the information available, predictions are easy to make, and this is a decision made
under a certain condition.
Decision Making under Uncertain Conditions
Making a decision under uncertain conditions means that we lack the information that could
help us decide. Because of insufficient information, the decision maker does not know the future
and is not able to predict the outcome of every alternative. In order to make a decision under
such conditions, the decision maker has to use judgement based on experience. If they do not
have such experience, they have to consult and seek advice from people who have more
experience. There is a little risk involved, since we are not able to predict the result, but
experience from the past helps to close the gap. The following are some examples of making a
decision under uncertain conditions:
Example 1 (Personal)
Jason is currently using a very old Nokia phone. Seeing that everyone around him is using far
more advanced phones, he decides to change to a smartphone. Smartphones in the market run on
Windows OS, Apple OS or Android OS, and Jason has never used such interfaces on a phone
before. He is not certain which phone OS is easier to use or which phone's functions are better.
He therefore seeks advice and opinions from his friends and, after getting enough advice, buys a
smartphone. In this scenario, Jason does not have sufficient information, so he gets advice from
his friends: Jason is making the decision under an uncertain condition.
Example 2 (Consumer)
A new brand of washing detergent, called "Detergent XYZ", has been introduced into the
market. The advertisement and the product packaging claim that it is the most efficient detergent
ever sold: only one small cup of detergent is needed to wash all the dirty clothes clean and white.
Consumers are attracted to the new product, but they are uncertain how good it really is. Will
this new detergent clean well? Is it harmful to their skin or hands? The only information
consumers have is that it can clean (though how well, they do not know) and that the price of the
detergent is cheap. In this case, consumers are making the decision to buy the detergent in an
uncertain condition.
Example 3 (Business)
ABC Restaurant is famous for its chicken rice at Kampung ABC and has been selling chicken
rice for the past few years. Business has been good; however, the owners are trying to expand
the business. Hence they decide to try selling duck rice, since they would be the first shop in
Kampung ABC to serve it. ABC Restaurant is not sure whether the people in the village will
accept this new menu. Would people in Kampung ABC accept duck as well? How much should
the duck rice sell for? Should it be sold at the same price as the chicken rice, or cheaper? In this
case, ABC Restaurant is making the decision in an uncertain condition.
Decision Making under Risky Conditions
Making a decision under risky conditions means making a decision that might make the problem
worse, or turn a bad situation into a very bad one. Only minimal information is available to
predict the outcome. A wrong evaluation when making a decision under risky conditions might
even cause the company to suffer a huge loss of profits or go bankrupt.
Example 1 (Consumer)
A new skin product, a face wash called ABC Face Cleaner, has been released to the market. It is
marketed with a guarantee of skin-whitening results after just a few uses. There is no exact
information explaining how the product improves skin colour, what the side effects are, or how
long exactly it takes to see the effect. However, because of the attractive price and the promised
results, consumers decide to give the product a try. In this case, consumers are making the
decision to buy the product under a risky condition.
Example 2 (Business)
Company ABC has been selling its new product for some time. The board of the company is
discussing whether to stop selling the new product or to invest more in promoting it. The only
information they have is the weekly sales reports. For the first seven weeks after the product
launch, the product did not sell well and caused the company a loss of profits. However, the
reports also show that, week by week, sales are slowly increasing, with a growth rate of about
5% every week. Based on all the information they have, they estimate that the sales rate will
increase to 70% and that the company will start gaining profits within the next two weeks.
Without gathering other information, such as how consumers perceive the product, they go
ahead and decide to invest more in it. ABC Company is actually making this decision under a
risky condition.
Decision Trees
Decision trees (also known as decision tree learning or classification trees) are a collection of
predictive analytics techniques that use tree-like graphs for predicting the value of a response
variable (or target variable) based on the values of explanatory variables (or predictors).
Decision tree learning is one of the supervised learning algorithms used for predicting both
discrete and continuous dependent variables. When the response variable takes discrete values,
the decision trees are called classification trees.
Decision trees are a collection of divide-and-conquer problem-solving strategies that use a
tree-like structure to predict the outcome of a variable. The tree starts with the root node
consisting of the complete data and thereafter uses intelligent strategies to split the node (the
parent node) into multiple branches (thus creating children nodes). The original data is divided
into subsets in order to create more homogeneous groups at the children nodes. It is one of the
most powerful predictive analytics techniques used for generating business rules.
Building a decision tree involves three kinds of criteria:
1. Splitting criteria: used to split a node into subsets.
2. Merging criteria: when the predictor variable is categorical with n categories, it is possible
that not all n categories are statistically significant. Thus, a few categories may be merged to
create a compound or aggregate category.
3. Stopping criteria: used for pruning the tree (stopping the tree from further branching) to
reduce the complexity associated with the business rules generated from the tree. Usually the
number of levels (depth) from the root node (where each level corresponds to adding a predictor
variable) and the minimum number of observations in a node required for splitting are used as
stopping criteria.
The generic steps in building a decision tree are:
1. Start with the root node in which all the data is present.
2. Decide on a splitting criterion and stopping criteria: the root node is then split into two or
more subsets, leading to tree branches (called edges), using the splitting criterion. Nodes thus
created are known as internal nodes. Each internal node has exactly one incoming edge.
3. Further divide each internal node until no further splitting is possible or the stopping criterion
is met. The terminal nodes (or leaf nodes) will not have any outgoing edges.
4. Generate business rules from the paths that lead from the root node to the terminal nodes.
5. Tree pruning (a process for restricting the size of the tree) is used to avoid large trees and
overfitting the data. Tree pruning is achieved through different stopping criteria.
There are many decision tree techniques, and they differ in the strategy they use for splitting the
nodes. Two of the most popular decision tree techniques are Chi-square Automatic Interaction
Detection (CHAID) and Classification and Regression Trees (CART).
Chi-square Automatic Interaction Detection (CHAID)
Depending on the nature of the dependent variable, different statistical tests are used for splitting
the nodes (typically the chi-square test of independence when the dependent variable is
categorical and the F-test when it is continuous). The steps are:
1. Start with the root node containing the complete data.
2. Check the statistical significance of each independent variable depending on the type of
dependent variable.
3. The variable with the least p-value, based on the statistical tests described in step 2, is
used for splitting the data set, thereby creating subsets. The Bonferroni correction is used
for adjusting the significance level α. Non-significant categories in a categorical predictor
variable with more than two categories may be merged.
4. Using the independent variables, repeat step 3 for each of the subsets of the data (internal
nodes) until:
(a) All the independent variables are exhausted or they are not statistically significant at α, or
(b) The stopping criteria are met.
Classification and Regression Trees (CART)
1. Start with the root node containing the complete data.
2. Decide on the measure of impurity (usually the Gini impurity index or entropy). Choose
the predictor variable that minimizes the impurity when the parent node is split into
children nodes. This happens when the original data is divided into two subsets using a
predictor variable such that the split results in the maximum reduction in impurity in the
case of a discrete dependent variable, or the maximum reduction in SSE in the case of a
continuous dependent variable.
3. Repeat step 2 for each subset of the data (internal node) using the independent variables
until:
(a) No further reduction in impurity is possible, or
(b) The stopping criteria are met. Stopping criteria commonly used are the number of levels
of the tree from the root node, the minimum number of observations in the parent/child
node and the minimum reduction in the impurity index.
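As a rough illustration of these ideas, the sketch below fits a CART-style classification tree with
scikit-learn, using the Gini impurity index as the splitting criterion and a maximum depth and
minimum node size as stopping criteria. The Iris data set and the parameter values are
illustrative assumptions, not part of the notes above; scikit-learn does not provide CHAID, so the
sketch shows only impurity-based (CART-style) splitting.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Illustrative data set standing in for any classification problem.
    iris = load_iris()
    X, y = iris.data, iris.target

    # Gini impurity as the splitting criterion; max_depth and min_samples_split
    # act as stopping criteria that keep the tree (and its rules) small.
    tree = DecisionTreeClassifier(criterion="gini", max_depth=3, min_samples_split=20)
    tree.fit(X, y)

    # Each printed path from the root to a leaf corresponds to a business rule
    # generated from the tree.
    print(export_text(tree, feature_names=list(iris.feature_names)))

Swapping criterion="entropy" would use entropy instead of the Gini index as the impurity
measure.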
Design of experiments
The design of experiments (DOE, DOX, or experimental design) is the design of any task that
aims to describe or explain the variation of information under conditions that are hypothesized to
reflect the variation. The term is generally associated with experiments in which the design
introduces conditions that directly affect the variation, but may also refer to the design of quasi-
experiments, in which natural conditions that influence the variation are selected for observation.
In its simplest form, an experiment aims at predicting the outcome by introducing a change of
the preconditions, which is represented by one or more independent variables, also referred to as
"input variables" or "predictor variables." The change in one or more independent variables is
generally hypothesized to result in a change in one or more dependent variables, also referred to
as "output variables" or "response variables." The experimental design may also identify control
variables that must be held constant to prevent external factors from affecting the results.
Experimental design involves not only the selection of suitable independent, dependent, and
control variables, but planning the delivery of the experiment under statistically optimal
conditions given the constraints of available resources. There are multiple approaches for
determining the set of design points (unique combinations of the settings of the independent
variables) to be used in the experiment. Main concerns in experimental design include the
establishment of validity, reliability, and replicability. For example, these concerns can be
partially addressed by carefully choosing the independent variable, reducing the risk of
measurement error, and ensuring that the documentation of the method is sufficiently detailed.
Related concerns include achieving appropriate levels of statistical power and sensitivity.
Correctly designed experiments advance knowledge in the natural and social sciences and
engineering. Other applications include marketing and policy making.
This branch of applied statistics deals with planning, conducting, analyzing and interpreting
controlled tests to evaluate the factors that control the value of a parameter or group of
parameters.
A strategically planned and executed experiment may provide a great deal of information about
the effect on a response variable due to one or more factors. Many experiments involve holding
certain factors constant and altering the levels of another variable. This One–Factor–at–a–Time
(or OFAT) approach to process knowledge is, however, inefficient when compared with
changing factor levels simultaneously.
Many of the current statistical approaches to designed experiments originate from the work of R.
A. Fisher in the early part of the 20th century. Fisher demonstrated how taking the time to
seriously consider the design and execution of an experiment before trying it helped avoid
frequently encountered problems in analysis. Key concepts in creating a designed experiment
include blocking, randomization and replication.
Example: the strength of Portland cement mortar.
Single-Factor Experiments
Experiments in which the level of one and only one independent variable is manipulated. For
example, in an experiment assessing price sensitivity, there may be four treatments: $1, $2, $3
and $4.
The single-factor may be a composite of other variables. For example, in a concept test each of
the concepts may vary in terms of the general description, price, size and any number of other
factors. However, the nature of a single-factor experiment is that when the single-factor is a
composite of other variables it is not possible via statistical analysis to disentangle and isolate the
relative contribution of these factors.
Full Factorial Experiments
In statistics, a full factorial experiment is an experiment whose design consists of two or more
factors, each with discrete possible values or "levels", and whose experimental units take on all
possible combinations of these levels across all such factors. A full factorial design may also be
called a fully crossed design. Such an experiment allows the investigator to study the effect of
each factor on the response variable, as well as the effects of interactions between factors on the
response variable. For the vast majority of factorial experiments, each factor has only two levels.
For example, with two factors each taking two levels, a factorial experiment would have four
treatment combinations in total, and is usually called a 2×2 factorial design.
If the number of combinations in a full factorial design is too high to be logistically feasible, a
fractional factorial design may be done, in which some of the possible combinations (usually at
least half) are omitted.
Many experiments examine the effect of only a single factor or variable. Compared to such one-
factor-at-a-time (OFAT) experiments, factorial experiments offer several advantages:
• Factorial designs are more efficient than OFAT experiments. They provide more
information at similar or lower cost, and they can find optimal conditions faster than OFAT
experiments.
• Factorial designs allow additional factors to be examined at no additional cost.
• When the effect of one factor is different for different levels of another factor, it cannot
be detected by an OFAT experiment design. Factorial designs are required to detect such
interactions. Use of OFAT when interactions are present can lead to serious
misunderstanding of how the response changes with the factors.
• Factorial designs allow the effects of a factor to be estimated at several levels of the other
factors, yielding conclusions that are valid over a range of experimental conditions.
Example
The simplest factorial experiment contains two levels for each of two factors. Suppose an
engineer wishes to study the total power used by each of two different motors, A and B,
running at each of two different speeds, 2000 or 3000 RPM. The factorial experiment
would consist of four experimental units: motor A at 2000 RPM, motor B at 2000 RPM,
motor A at 3000 RPM, and motor B at 3000 RPM. Each combination of a single level
selected from every factor is present once.
This experiment is an example of a 2² (or 2×2) factorial experiment, so named because it
considers two levels (the base) for each of two factors (the power or superscript), i.e.
(number of levels)^(number of factors), producing 2² = 4 factorial points.
Designs can involve many independent variables. As a further example, the effects of
three input variables can be evaluated in eight experimental conditions shown as the
corners of a cube.
This can be conducted with or without replication, depending on its intended purpose and
available resources. It will provide the effects of the three independent variables on the
dependent variable and possible interactions.
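To make the idea concrete, the following sketch enumerates the four design points of the 2×2
motor/speed experiment in plain Python and randomizes the run order; the number of replicates
is an assumption added purely for illustration.

    import itertools
    import random

    motors = ["A", "B"]          # factor 1: motor, two levels
    speeds = [2000, 3000]        # factor 2: speed in RPM, two levels
    replicates = 2               # assumed number of replicates per design point

    # All 2**2 = 4 combinations of factor levels (the factorial points).
    design_points = list(itertools.product(motors, speeds))

    # Replicate and randomize the run order, in line with the emphasis on
    # randomization and replication noted above.
    runs = design_points * replicates
    random.shuffle(runs)

    for i, (motor, speed) in enumerate(runs, start=1):
        print(f"Run {i}: motor {motor} at {speed} RPM")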
Cluster Analysis
Cluster analysis has been used in marketing for various purposes, for example to segment
consumers on the basis of the benefits sought from the purchase of a product. It can be used to
identify homogeneous groups of buyers.
• Marketing: Help marketers discover distinct groups in their customer bases, and then use this
knowledge to develop targeted marketing programs
• Land use: Identification of areas of similar land use in an earth observation database
• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
• City-planning: Identifying groups of houses according to their house type, value, and
geographical location
• Earthquake studies: Observed earthquake epicentres should be clustered along continental faults
There are a number of different methods that can be used to carry out a cluster analysis:
1. Hierarchical methods
a) Agglomerative methods, in which subjects start in their own separate cluster. The two
’closest’ (most similar) clusters are then combined and this is done repeatedly until all subjects
are in one cluster. At the end, the optimum number of clusters is then chosen out of all cluster
solutions.
b) Divisive methods, in which all subjects start in the same cluster and the above strategy is
applied in reverse until every subject is in a separate cluster. Agglomerative methods are used
more often than divisive methods.
2. Non-hierarchical methods (often known as k-means clustering methods): These are generally
used when large data sets are involved. Further, they provide the flexibility of moving a subject
from one cluster to another.
The data used in cluster analysis can be interval, ordinal or categorical. However, having a
mixture of different types of variable will make the analysis more complicated. This is because
in cluster analysis you need to have some way of measuring the distance between observations
and the type of measure used will depend on what type of data you have.
Hierarchical agglomerative methods
Within this approach to cluster analysis there are a number of different methods used to
determine which clusters should be joined at each stage. The main methods are summarised
below.
•Nearest neighbour method (single linkage method)
In this method the distance between two clusters is defined to be the distance between the two
closest members, or neighbours. This method is relatively simple but is often criticised because it
doesn’t take account of cluster structure and can result in a problem called chaining whereby
clusters end up being long and straggly. However, it is better than the other methods when the
natural clusters are not spherical or elliptical in shape.
• Furthest neighbour method (complete linkage method)
In this case the distance between two clusters is defined to be the maximum distance between
members — i.e. the distance between the two subjects that are furthest apart. This method tends
to produce compact clusters of similar size but, as for the nearest neighbour method, does not
take account of cluster structure. It is also quite sensitive to outliers.
• Average linkage method
The distance between two clusters is calculated as the average distance between all pairs of
subjects in the two clusters. This is considered to be a fairly robust method.
• Centroid method
Here the centroid (mean value for each variable) of each cluster is calculated and the distance
between centroids is used. Clusters whose centroids are closest together are merged. This method
is also fairly robust.
• Ward’s method
In this method all possible pairs of clusters are combined and the sum of the squared distances
within each cluster is calculated. This is then summed over all clusters. The combination that
gives the lowest sum of squares is chosen. This method tends to produce clusters of
approximately equal size, which is not always desirable. It is also quite sensitive to outliers.
Despite this, it is one of the most popular methods, along with the average linkage method. It is
generally a good idea to try two or three of the above methods. If the methods agree reasonably
well then the results will be that much more believable.
Example: Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of
clusters. The algorithm starts with all the data points assigned to clusters of their own. The two
nearest clusters are then merged into the same cluster, and the algorithm terminates when there
is only a single cluster left.
The results of hierarchical clustering can be shown using a dendrogram. The dendrogram can be
interpreted as follows:
At the bottom, we start with 25 data points, each assigned to separate clusters. Two closest
clusters are then merged till we have just one cluster at the top. The height in the dendrogram at
which two clusters are merged represents the distance between two clusters in the data space.
The number of clusters that best depicts the different groups can be chosen by observing the
dendrogram. The best choice is the number of vertical lines in the dendrogram cut by a
horizontal line that can traverse the maximum vertical distance without intersecting a cluster.
In the example above, the best choice of the number of clusters would be 4, because a horizontal
line drawn at that level covers the maximum vertical distance (AB) without intersecting a
cluster.
Two important things that you should know about hierarchical clustering are:
• The algorithm has been described above using a bottom-up (agglomerative) approach. It is
also possible to follow a top-down (divisive) approach, starting with all data points assigned to
the same cluster and recursively performing splits until each data point forms a separate cluster.
• The decision to merge two clusters is taken on the basis of the closeness of these clusters.
There are multiple metrics for deciding the closeness of two clusters:
o Euclidean distance: ||a−b||₂ = √(Σᵢ (aᵢ−bᵢ)²)
o Squared Euclidean distance: ||a−b||₂² = Σᵢ (aᵢ−bᵢ)²
o Manhattan distance: ||a−b||₁ = Σᵢ |aᵢ−bᵢ|
o Maximum distance: ||a−b||∞ = maxᵢ |aᵢ−bᵢ|
o Mahalanobis distance: √((a−b)ᵀ S⁻¹ (a−b)), where S is the covariance matrix
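A minimal sketch of agglomerative clustering, assuming SciPy is available and using randomly
generated data in place of the 25 points mentioned above; Ward's linkage and Euclidean distance
are illustrative choices, not the only possible ones.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

    # Illustrative data: 25 observations on 2 variables.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(25, 2))

    # Bottom-up merging of the two closest clusters, using Euclidean distance
    # and Ward's linkage; Z records every merge and its height.
    Z = linkage(X, method="ward", metric="euclidean")

    # Cutting the tree to obtain, say, 4 clusters (as in the dendrogram example).
    labels = fcluster(Z, t=4, criterion="maxclust")
    print(labels)

    # dendrogram(Z, no_plot=True) returns the tree structure; plotting it with
    # matplotlib shows the merge heights discussed above.
    tree_info = dendrogram(Z, no_plot=True)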
Non-hierarchical (k-means) methods
In these methods the desired number of clusters is specified in advance and the 'best' solution is
chosen. The steps in such a method are as follows:
1. Choose initial cluster centres (essentially this is a set of observations that are far apart; each
subject forms a cluster of one and its centre is the value of the variables for that subject).
2. Assign each subject to its 'nearest' cluster, defined in terms of the distance to the centroid.
3. Re-calculate the centroid of each cluster once all subjects have been assigned.
4. Re-calculate the distance from each subject to each centroid and move any observations that
are not in the cluster they are closest to.
Non-hierarchical cluster analysis tends to be used when large data sets are involved. It is
sometimes preferred because it allows subjects to move from one cluster to another (this is not
possible in hierarchical cluster analysis where a subject, once assigned, cannot move to a
different cluster).
However, non-hierarchical methods have two main drawbacks:
(1) it is often difficult to know how many clusters you are likely to have, and therefore the
analysis may have to be repeated several times; and
(2) it can be very sensitive to the choice of initial cluster centres. Again, it may be worth trying
different ones to see what impact this has.
One possible strategy to adopt is to use a hierarchical approach initially to determine how many
clusters there are in the data and then to use the cluster centres obtained from this as initial
cluster centres in the non-hierarchical method.
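The sketch below illustrates this strategy under the assumption that scikit-learn and SciPy are
available: a preliminary hierarchical pass supplies the number of clusters and the initial centres,
and k-means then re-assigns observations between clusters. The data and the choice of four
clusters are purely illustrative.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.cluster import KMeans

    # Illustrative data set.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))

    # Step 1: hierarchical clustering to obtain tentative clusters and centroids.
    hier_labels = fcluster(linkage(X, method="ward"), t=4, criterion="maxclust")
    initial_centres = np.vstack([X[hier_labels == k].mean(axis=0) for k in range(1, 5)])

    # Step 2: k-means seeded with those centroids; observations may now move
    # from one cluster to another until the solution stabilises.
    km = KMeans(n_clusters=4, init=initial_centres, n_init=1).fit(X)
    print(km.labels_[:10])
    print(km.cluster_centers_)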
Multidimensional scaling
Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual
cases of a dataset. It refers to a set of related ordination techniques used in information
visualization, in particular to display the information contained in a distance matrix. In general,
the goal of the analysis is to detect meaningful underlying dimensions that allow the researcher
to explain observed similarities or dissimilarities (distances) between the investigated objects.
With MDS, you can analyze any kind of similarity or dissimilarity matrix, in addition to
correlation matrices.
Multidimensional scaling (MDS) is a very useful technique for market researchers because it
produces an invaluable "perceptual map" revealing like and unlike products, thus making it
useful in brand similarity studies, product positioning, and market segmentation.
Logic of MDS
The following simple example may demonstrate the logic of an MDS analysis. Suppose we take
a matrix of distances between major US cities from a map. We then analyze this matrix,
specifying that we want to reproduce the distances based on two dimensions. As a result of the
MDS analysis, we would most likely obtain a two-dimensional representation of the locations of
the cities, that is, we would basically obtain a two-dimensional map.
In general then, MDS attempts to arrange "objects" (major cities in this example) in a space with
a particular number of dimensions (two-dimensional in this example) so as to reproduce the
observed distances. As a result, we can "explain" the distances in terms of underlying
dimensions; in our example, we could explain the distances in terms of the two geographical
dimensions: north/south and east/west.
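A minimal sketch of this city-map idea, assuming scikit-learn's MDS implementation; the four
city names and the distance matrix are invented for illustration rather than taken from a real map.

    import numpy as np
    from sklearn.manifold import MDS

    cities = ["CityA", "CityB", "CityC", "CityD"]

    # Symmetric matrix of pairwise "distances" (illustrative numbers).
    D = np.array([
        [0.0, 5.0, 7.0, 9.0],
        [5.0, 0.0, 4.0, 6.0],
        [7.0, 4.0, 0.0, 3.0],
        [9.0, 6.0, 3.0, 0.0],
    ])

    # dissimilarity="precomputed" tells MDS that D already contains distances;
    # the result is a 2-D configuration that approximately reproduces them.
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(D)

    for city, (x, y) in zip(cities, coords):
        print(f"{city}: ({x:.2f}, {y:.2f})")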
FACTOR ANALYSIS
Factor analysis is a statistical method used to describe variability among observed, correlated
variables in terms of a potentially lower number of unobserved variables called factors. For
example, it is possible that variations in six observed variables mainly reflect the variations in
two unobserved (underlying) variables. Factor analysis searches for such joint variations in
response to unobserved latent variables. The observed variables are modelled as linear
combinations of the potential factors, plus "error" terms. Factor analysis aims to find
independent latent variables. The theory behind factor analytic methods is that the information
gained about the interdependencies between observed variables can be used later to reduce the
set of variables in a dataset. Factor analysis is commonly used in biology, psychometrics,
personality theories, marketing, product management, operations research, and finance.
Proponents of factor analysis believe that it helps to deal with data sets where there are large
numbers of observed variables that are thought to reflect a smaller number of underlying/latent
variables. It is one of the most commonly used inter-dependency techniques and is used when the
relevant set of variables shows a systematic inter-dependence and the objective is to find out the
latent factors that create a commonality.
Factor analysis is a useful tool for investigating variable relationships for complex concepts such
as socioeconomic status, dietary patterns, or psychological scales.
It allows researchers to investigate concepts that are not easily measured directly by collapsing a
large number of variables into a few interpretable underlying factors.
What is a factor?
The key concept of factor analysis is that multiple observed variables have similar patterns of
responses because they are all associated with a latent (i.e. not directly measured) variable.
For example, people may respond similarly to questions about income, education, and
occupation, which are all associated with the latent variable socioeconomic status.
In every factor analysis, there are the same number of factors as there are variables. Each factor
captures a certain amount of the overall variance in the observed variables, and the factors are
always listed in order of how much variation they explain. The eigenvalue is a measure of how
much of the variance of the observed variables a factor explains. Any factor with an eigenvalue
≥1 explains more variance than a single observed variable.
So if the factor for socioeconomic status had an eigenvalue of 2.3 it would explain as much
variance as 2.3 of the three variables. This factor, which captures most of the variance in those
three variables, could then be used in other analyses.
The factors that explain the least amount of variance are generally discarded.
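As a rough numerical illustration of this eigenvalue rule, the sketch below computes the
eigenvalues of a correlation matrix and counts how many are at least 1; the simulated data matrix
is an assumption used only to make the example runnable.

    import numpy as np

    # Simulated responses: 500 respondents on 6 observed variables.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 6))

    # Correlation matrix of the observed variables.
    corr = np.corrcoef(X, rowvar=False)

    # Eigenvalues sorted from largest to smallest; each one measures how much
    # of the total variance a factor would explain.
    eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

    n_retained = int((eigenvalues >= 1).sum())
    print("Eigenvalues:", np.round(eigenvalues, 2))
    print("Factors retained (eigenvalue >= 1):", n_retained)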
Example
Suppose a psychologist has the hypothesis that there are two kinds of intelligence, "verbal
intelligence" and "mathematical intelligence", neither of which is directly observed. Evidence for
the hypothesis is sought in the examination scores from each of 10 different academic fields of
1000 students. If each student is chosen randomly from a large population, then each student's 10
scores are random variables. The psychologist's hypothesis may say that for each of the 10
academic fields, the score averaged over the group of all students who share some common pair
of values for verbal and mathematical "intelligences" is some constant times their level of verbal
intelligence plus another constant times their level of mathematical intelligence, i.e., it is a
combination of those two "factors". The numbers for a particular subject, by which the two kinds
of intelligence are multiplied to obtain the expected score, are posited by the hypothesis to be the
same for all intelligence level pairs, and are called the "factor loadings" for this subject. For
example, the hypothesis may hold that the average student's aptitude in the field of astronomy is
{10 × the student's verbal intelligence} + {6 × the student's mathematical intelligence}.
The numbers 10 and 6 are the factor loadings associated with astronomy. Other academic
subjects may have different factor loadings.
Two students assumed to have identical degrees of the latent, unmeasured traits of verbal and
mathematical intelligence may have different measured aptitudes in astronomy because
individual aptitudes differ from average aptitudes and because of measurement error itself. Such
differences make up what is collectively called the "error" — a statistical term that means the
amount by which an individual, as measured, differs from what is average for or predicted by his
or her levels of intelligence.
The observable data that go into factor analysis would be 10 scores of each of the 1000 students,
a total of 10,000 numbers. The factor loadings and levels of the two kinds of intelligence of each
student must be inferred from the data.
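The following sketch mimics this two-intelligence setting with scikit-learn's FactorAnalysis. The
simulated loadings, the noise level and the use of 1000 students with 10 subject scores are
assumptions chosen to mirror the description above, not data from an actual study.

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(0)
    n_students, n_subjects = 1000, 10

    # Latent, unobserved traits: verbal and mathematical intelligence.
    verbal = rng.normal(size=n_students)
    maths = rng.normal(size=n_students)

    # Each subject score = (loading on verbal) * verbal + (loading on maths) * maths + error.
    true_loadings = rng.uniform(0.5, 1.5, size=(n_subjects, 2))
    scores = (verbal[:, None] * true_loadings[:, 0]
              + maths[:, None] * true_loadings[:, 1]
              + rng.normal(scale=0.5, size=(n_students, n_subjects)))

    # Factor analysis tries to recover two latent factors and the loadings
    # of the 10 observed variables on them.
    fa = FactorAnalysis(n_components=2).fit(scores)
    estimated_loadings = fa.components_.T          # shape: (10 subjects, 2 factors)
    print(np.round(estimated_loadings, 2))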
Returning to the earlier socioeconomic-status example, suppose the analysis produces a table of
factor loadings. The variable with the strongest association with the underlying latent variable,
Factor 1, is income, with a factor loading of 0.65.
Since factor loadings can be interpreted like standardized regression coefficients, one could also
say that the variable income has a correlation of 0.65 with Factor 1. This would be considered a
strong association for a factor analysis in most research fields. Two other variables, education
and occupation, are also associated with Factor 1. Based on the variables loading highly onto
Factor 1, we could call it “Individual socioeconomic status.”
House value, number of public parks, and number of violent crimes per year, however, have high
factor loadings on the other factor, Factor 2. They seem to indicate the overall wealth within the
neighborhood, so we may want to call Factor 2 “Neighborhood socioeconomic status.”
Notice that the variable house value also is marginally important in Factor 1 (loading = 0.38).
This makes sense, since the value of a person’s house should be associated with his or her
income.
Exploratory factor analysis (EFA) is used to identify complex interrelationships among items and
group items that are part of unified concepts. The researcher makes no a priori assumptions
about relationships among factors.
Confirmatory factor analysis (CFA) is a more complex approach that tests the hypothesis that the
items are associated with specific factors. CFA uses structural equation modeling to test a
measurement model whereby loading on the factors allows for evaluation of relationships
between observed variables and unobserved variables. Structural equation modeling approaches
can accommodate measurement error, and are less restrictive than least-squares estimation.
Hypothesized models are tested against actual data, and the analysis would demonstrate loadings
of observed variables on the latent variables (factors), as well as the correlation between the
latent variables.
Exploratory factor analysis is used to uncover the underlying factors that affect the variables in a
data structure without imposing any predefined structure on the outcome. Confirmatory factor
analysis, on the other hand, is used as a tool in market research and analysis to reconfirm the
effects and correlations of an existing set of predetermined factors and the variables associated
with them.
Factor analysis will only yield accurate and useful results if done by a researcher who has
adequate knowledge to select data and assign attributes. Selecting factors and variables so as to
avoid too much similarity of characteristics is also important. Factor analysis, if done correctly,
can allow for market research and analysis that helps in various areas of decision making like
product features, product development, pricing, market segmentation, penetration and even with
targeting.
Discriminant Analysis
Discriminant Analysis is a statistical tool with an objective to assess the adequacy of a
classification, given the group membership. Discriminant analysis is a technique that is used by
the researcher to analyze the research data when the criterion or the dependent variable is
categorical and the predictor or the independent variable is interval in nature. The term
categorical variable means that the dependent variable is divided into a number of categories. For
example, three brands of computers, Computer A, Computer B and Computer C can be the
categorical dependent variable.
The objective of discriminant analysis is to develop discriminant functions, which are linear
combinations of the independent variables that discriminate between the categories of the
dependent variable as clearly as possible. It enables the researcher to examine whether
significant differences exist among the groups in terms of the predictor variables, and it also
evaluates the accuracy of the classification.
When the dependent variable has two categories, the technique used is two-group discriminant
analysis. If the dependent variable has three or more categories, the technique used is multiple
discriminant analysis. The major distinction between the two types is that, for two groups, it is
possible to derive only one discriminant function, whereas in multiple discriminant analysis
more than one discriminant function can be computed.
There are many examples that can explain when discriminant analysis fits. It can be used to
know whether heavy, medium and light users of soft drinks are different in terms of their
consumption of frozen foods. In the field of psychology, it can be used to differentiate between
the price sensitive and non price sensitive buyers of groceries in terms of their psychological
attributes or characteristics. In the field of business, it can be used to understand the
characteristics or the attributes of a customer possessing store loyalty and a customer who does
not have store loyalty.
The main steps in conducting a discriminant analysis are: (1) formulating the problem by
identifying the categorical dependent variable and the interval-scaled independent variables;
(2) estimating the coefficients of the discriminant functions; (3) determining the statistical
significance of these discriminant functions; (4) interpreting the results; and (5), the last and
most important step, assessing the validity of the analysis.
LDA is closely related to analysis of variance (ANOVA) and regression analysis, which also
attempt to express one dependent variable as a linear combination of other features or
measurements. However, ANOVA uses categorical independent variables and a continuous
dependent variable, whereas discriminant analysis has continuous independent variables and a
categorical dependent variable (i.e. the class label). Logistic regression and probit regression are
more similar to LDA than ANOVA is, as they also explain a categorical variable by the values of
continuous independent variables. These other methods are preferable in applications where it is
not reasonable to assume that the independent variables are normally distributed, which is a
fundamental assumption of the LDA method.
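A minimal linear discriminant analysis sketch with scikit-learn, using the Iris data as a stand-in
for any problem with a categorical dependent variable (such as the three computer brands
mentioned above) and continuous predictors; the choice of data set is purely illustrative.

    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # Continuous predictors (X) and a categorical dependent variable (y) with 3 classes.
    X, y = load_iris(return_X_y=True)

    lda = LinearDiscriminantAnalysis()
    lda.fit(X, y)

    # With three categories, at most two discriminant functions can be derived.
    print("Number of discriminant functions:", lda.explained_variance_ratio_.shape[0])

    # Classification accuracy on the training data gives a rough check of validity.
    print("Training accuracy:", round(lda.score(X, y), 3))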