Data Science
[Music]
I really enjoy regression.
I'd say regression was maybe one of the first concepts that really helped me understand data, so I enjoy regression.
I really like data visualization.
I think it's a key element for people to get their message across to people who don't understand that well what data science is.
Artificial neural networks.
I'm really passionate about neural networks because we have a lot to learn from nature, so when we try to mimic our brain, I think we can build applications that capture this biological behavior in algorithms.
Data visualization with R. I love to do this.
Nearest neighbor.
It's the simplest but it just gets the best results so many more times than some overblown,
overworked algorithm that's just as likely to overfit as it is to make a good fit.
So structured data is more like tabular data, the kind of thing you're familiar with in Microsoft Excel format.
You've got rows and columns and that's called structured data.
Unstructured data is basically data that comes mostly from the web, where it's not tabular.
It's not in rows and columns.
It's text, and sometimes it's video and audio, so you would have to deploy more sophisticated algorithms to extract information.
And in fact, a lot of times we take unstructured data and spend a great deal of time and effort
to get some structure out of it and then analyze it.
So if you have something which fits nicely into tables and columns and rows, go ahead.
That's your structured data.
But if it's a weblog, or if you're trying to get information out of web pages and you've got a gazillion of them, that's unstructured data, and it will require a little bit more effort to get information out of it.
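As a minimal illustration of that distinction, the sketch below contrasts a small structured table with an unstructured log line from which structure must first be extracted; the column names, the log line, and the regular expression are made up for the example.

```python
# A toy contrast between structured and unstructured data.
# The column names, the log line, and the regular expression are illustrative.
import re

import pandas as pd

# Structured data: rows and columns, ready for analysis.
structured = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "purchase_amount": [19.99, 5.50, 42.00],
})
print(structured.describe())

# Unstructured data: free text from which structure must first be extracted.
log_line = '127.0.0.1 - - [10/Oct/2023:13:55:36] "GET /index.html" 200'
match = re.search(r'"GET (?P<path>\S+)" (?P<status>\d{3})', log_line)
if match:
    print(match.group("path"), match.group("status"))  # /index.html 200
```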
There are thousands of books written on regression and millions of lectures delivered on
regression.
And I always feel that they don’t do a good job of explaining regression because they
get into data and models and statistical distributions.
Let's forget about it.
Let me explain regression in the simplest possible terms.
If you have ever taken a cab ride, a taxi ride, you understand regression.
Here is how it works.
The moment you sit in a cab, you see that there's a fixed amount on the meter.
It says $2.50, whether the cab moves or you get off right away.
This is what you owe the driver the moment you step into a cab.
That's a constant.
You have to pay that amount once you have stepped into the cab.
Then, as it starts moving, for every meter or hundred meters the fare increases by a certain amount.
So there's a relationship between the distance and the amount you would pay above and beyond that constant.
And if you're not moving and you're stuck in traffic, then every additional minute you
have to pay more.
So as the minutes increase, your fare increases.
As the distance increases, your fare increases.
And while all this is happening you've already paid a base fare which is the constant.
This is what regression is.
Regression tells you what the base fare is, and what the relationship is between the time and the fare you paid, and between the distance you traveled and the fare you paid.
Even if you didn't know those relationships, and only knew how far people traveled and how much they paid, regression would allow you to compute that constant you didn't know, that it was $2.50, and it would compute the relationship between the fare and the distance, and between the fare and the time.
That is regression.
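To make the cab-fare analogy concrete, here is a minimal sketch in Python that simulates trip records and then recovers the base fare (the constant) and the per-kilometre and per-minute rates with a linear regression. The fare schedule, the noise level, and the use of scikit-learn are illustrative assumptions, not part of the lecture.

```python
# A minimal sketch of the cab-fare analogy: simulate trip records, then recover
# the base fare (the constant) and the per-km and per-minute rates by regression.
# The fare schedule and noise level are illustrative assumptions, not real tariffs.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
distance_km = rng.uniform(1, 20, size=200)
time_min = rng.uniform(5, 60, size=200)

# "True" schedule used only to simulate the data: $2.50 base, $1.75/km, $0.30/min.
fare = 2.50 + 1.75 * distance_km + 0.30 * time_min + rng.normal(0, 0.5, 200)

X = np.column_stack([distance_km, time_min])
model = LinearRegression().fit(X, fare)

print("Estimated base fare:", round(model.intercept_, 2))        # close to 2.50
print("Estimated per-km and per-minute rates:", model.coef_.round(2))
```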
[Music]
Now that you know what is in the book, it is time to put down some definitions. Despite their ubiquitous use, consensus evades the notions of big data and data science. The question "Who is a data scientist?" is very much alive and being contested by individuals, some of whom are merely interested in protecting their disciplinary or academic turfs. In this section, I attempt to address these controversies and explain why a narrowly construed definition of either big data or data science will result in excluding hundreds of thousands of individuals who have recently turned to the emerging field.
“Everybody loves a data scientist,” wrote Simon Rogers (2012) in the Guardian. Mr. Rogers also traced the newfound love for number crunching to a quote by Google’s Hal Varian, who declared that “the sexy job in the next ten years will be statisticians.”
Whereas Hal Varian named statisticians sexy, it is widely believed that what he really meant
were data
scientists. This raises several important questions:
What is data science?
How does it differ from statistics?
What makes someone a data scientist?
In the times of big data, a question as simple as “What is data science?” can result in many
answers. In some cases, the diversity of opinion on these answers borders on hostility.
I define a data scientist as someone who finds solutions to problems by analyzing Big or small
data using appropriate tools and then tells stories to communicate her findings to the relevant
stakeholders. I do not use data size as a restrictive clause. Data below a certain arbitrary threshold do not make one less of a data scientist. Nor is my definition of a data scientist
restricted to particular analytic tools, such as machine learning. As long as one has a curious
mind, fluency in analytics, and the ability to communicate the findings, I consider the person a
data scientist.
I define data science as something that data scientists do. Years ago, as an engineering student at the University of Toronto, I was stuck with the question: What is engineering? I wrote my master’s thesis on forecasting housing prices and my doctoral dissertation on forecasting homebuilders’ choices related to what they build, when they build, and where they build new housing. In the civil engineering department, others were working on designing buildings, bridges, and tunnels, and worrying about the stability of slopes. My work, and that of my supervisor, was not your traditional garden-variety engineering.
Obviously, I was repeatedly asked by others whether my research was indeed engineering.
When I shared these concerns with my doctoral supervisor, Professor Eric Miller, he had a laugh.
Dr Miller spent a lifetime researching urban land use and transportation and had earlier earned a
doctorate from MIT. “Engineering is what engineers do,” he responded. Over the next 17 years,
I realized the wisdom in his statement. You first become an engineer by obtaining a degree and
then registering with the local
professional body that regulates the engineering profession. Now you are an engineer. You can dig tunnels, write software code, or design components of an iPhone or a supersonic jet. You are
an engineer. And when you are leading the global response to a financial crisis in your role as the
chief economist of the International Monetary Fund (IMF), as Dr Raghuram Rajan did, you are
an engineer.
Professor Raghuram Rajan earned his first degree in electrical engineering from the Indian Institute
of Technology. He pursued economics in graduate studies, later became a professor at a
prestigious university, and eventually landed at the IMF. He is currently serving as the 23rd
Governor of the Reserve Bank of India. Could someone argue that his intellectual prowess is
rooted only in his training as an economist and that the
fundamentals he learned as an engineering student played no role in developing his problem-
solving abilities?
Professor Rajan is an engineer. So are Xi Jinping, the President of the People’s Republic of
China, and Alexis Tsipras, the Greek Prime Minister who is forcing the world to rethink the
fundamentals of global economics. They might not be designing new circuitry, distillation
equipment, or bridges, but they are helping build better societies and economies and there can be
no better definition of engineering and
engineers—that is, individuals dedicated to building better economies and societies.
So briefly, I would argue that data science is what data scientists do.
Others have many different definitions. In September 2015, a co-panelist at a meetup organized
by BigDataUniversity.com in Toronto confined data science to machine learning. There you
have it. If you are not using the black boxes that make up machine learning, then, as per some experts in the field, you are not a data scientist. Even if you were to discover the cure for a disease threatening the lives of millions, turf-protecting colleagues would exclude you from the data science club.
Dr Vincent Granville (2014), an author on data science, offers certain thresholds to meet to be a
data scientist. On pages 8 and 9 of Developing Analytic Talent, Dr Granville describes the new
data science professor as a non-tenured instructor at a non-traditional university, who publishes
research results in
online blogs, does not waste time writing grants, works from home, and earns more money than
the traditional tenured professors. Suffice it to say that the thriving academic community of data
scientists might disagree with Dr Granville.
Dr Granville uses restrictions on data size and methods to define what data science is. He defines
a data scientist as one who can easily process a 50-million-row data set in a couple of
hours, and who distrusts (statistical) models. He distinguishes data science from statistics. Yet he
lists algebra, calculus, and training in probability and statistics as necessary background to
understand data science (page 4).
Some believe that big data is merely about crossing a certain threshold on data size or the
number of observations, or is about the use of a particular tool, such as Hadoop. Such arbitrary
thresholds on data size are problematic because, with innovation, even regular computers and
off-the-shelf software have begun to manipulate very large data sets. Stata, a software package commonly used by data scientists and statisticians, announced that one could now process between 2 billion and 24.4 billion rows using its desktop solutions. If Hadoop is the password to the big data
club, Stata’s ability to process 24.4 billion rows, under certain limitations, has just gatecrashed
that big data party.
It is important to realize that one who tries to set arbitrary thresholds to exclude others is likely to
run into inconsistencies. The goal should be to define data science in an inclusive,
discipline- and platform-independent, size-free context where data-centric problem solving and
the ability to weave strong narratives take center stage.
Given the controversy, I would rather consult others to see how they describe a data scientist.
Why don’t we again consult the Chief Data Scientist of the United States? Recall that Dr Patil told the Guardian newspaper in 2012 that a data scientist is “that unique blend of skills that can both unlock the insights of data and tell a fantastic story via the data.” What is admirable about Dr Patil’s definition is that it is inclusive of
individuals of various academic backgrounds and training, and does not restrict the definition of
a data scientist to a particular tool or subject it to a certain arbitrary minimum threshold of data
size.
The other key ingredient for a successful data scientist is a behavioral trait: curiosity. A data
scientist has to be one with a very curious mind, willing to spend significant time and effort to
explore her hunches. In journalism, the editors call it having the nose for news. Not all reporters
know where the news lies. Only those who have the nose for news get the story. Curiosity is as important for data scientists as it is for journalists.
Rachel Schutt is the Chief Data Scientist at News Corp. She teaches a data science course at
Columbia University. She is also the author of an excellent book, Doing Data Science. In an
interview with the New York Times, Dr Schutt defined a data scientist as someone who is part
computer scientist, part software engineer, and part statistician (Miller, 2013). But that’s the
definition of an average data scientist. “The best”, she contended, “tend to be really curious
people, thinkers who ask good questions and are O.K. dealing with unstructured situations and
trying to find structure in them.”
WEEK #02
BIG DATA:
Digital transformation affects business operations, updating existing processes and creating new ones to harness the benefits of new technologies.
This digital change integrates digital technology into all areas of an organization resulting
in fundamental changes to how it operates and delivers value to customers.
It is an organizational and cultural change driven by Data Science, and especially Big
Data.
The availability of vast amounts of data, and the competitive advantage that analyzing
it brings, has triggered digital transformations throughout many industries.
Netflix moved from being a postal DVD lending system to one of the world’s foremost video
streaming providers, the Houston Rockets NBA team used data gathered by overhead cameras
to analyze the most productive plays, and Lufthansa analyzed customer data to improve
its service.
Organizations all around us are changing to their very core.
Let’s take a look at an example, to see how Big Data can trigger a digital transformation,
not just in one organization, but in an entire industry.
In 2018, the Houston Rockets, a National Basketball Association, or NBA, team raised their game using Big Data.
The Rockets were one of four NBA teams to install a video tracking system which mined
raw data from games.
They analyzed video tracking data to investigate which plays provided the best opportunities
for high scores, and discovered something surprising.
Data analysis revealed that the shots that provide the best opportunities for high scores
are two-point dunks from inside the two-point zone, and three-point shots from outside the
three-point line, not long-range two-point shots from inside it.
This discovery entirely changed the way the team approached each game, increasing the
number of three-point shots attempted.
In the 2017-18 season, the Rockets made more three-point shots than any other team in NBA
history, and this was a major reason they won more games than any of their rivals.
In basketball, Big Data changed the way teams try to win, transforming the approach to the
game.
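As a rough illustration of the reasoning behind that shift, the sketch below compares expected points per attempt for a few shot types. The make probabilities are illustrative assumptions, not the actual tracking data the Rockets analyzed.

```python
# A back-of-the-envelope sketch of the expected-points logic described above.
# The make probabilities are illustrative assumptions, not real NBA tracking data.
shot_types = {
    "dunk (2 pts)": (2, 0.70),            # (points if made, assumed make probability)
    "long two (2 pts)": (2, 0.40),
    "three-pointer (3 pts)": (3, 0.36),
}

for name, (points, prob) in shot_types.items():
    print(f"{name}: expected points per attempt = {points * prob:.2f}")

# Under these assumed probabilities, dunks (1.40) and threes (1.08) beat long
# twos (0.80), which mirrors the pattern the video-tracking analysis revealed.
```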
Digital transformation is not simply duplicating existing processes in digital form; the in-depth
analysis of how the business operates helps organizations discover how to improve their
processes and operations, and harness the benefits of integrating data science into
their workflows.
Most organizations realize that digital transformation will require fundamental changes to their
approach towards data, employees, and customers, and it will affect their organizational culture.
Digital transformation impacts every aspect of the organization, so it is handled by decision
makers at the very top levels to ensure success.
The support of the Chief Executive Officer is crucial to the digital transformation process,
as is the support of the Chief Information Officer, and the emerging role of Chief Data
Officer.
But they also require support from the executives who control budgets, personnel decisions,
and day-to-day priorities.
This is a whole organization process.
Everyone must support it for it to succeed.
There is no doubt that dealing with all the issues that arise in this effort requires a new mindset, but digital transformation is the way to succeed now and in the future.
Course Textbook: ‘Getting Started with Data Science’, Publisher: IBM Press; 1st edition (Dec 13, 2015), Print.
Author: Murtaza Haider
Prescribed Reading: Chapter 12, pp. 529-531
The first step in data mining requires you to set up goals for the exercise. Obviously, you must
identify the key questions that need to be answered. However, beyond identifying the key questions lie the concerns about the costs and benefits of the exercise. Furthermore, you must determine, in advance, the expected level of accuracy and usefulness of the results obtained from data mining. If money were no object, you could spend as much as necessary to get the answers required. However, the cost-benefit trade-off is always instrumental in determining the
goals and scope of the data mining exercise. The level of accuracy expected from the results also
influences the costs. High levels of accuracy from data mining would cost more and vice versa.
Furthermore, beyond a certain level of accuracy, you do not gain much from the exercise, given
the diminishing returns. Thus, the cost-benefit trade-offs for the desired level of accuracy are
important considerations for data mining goals.
Selecting Data
The output of a data-mining exercise largely depends upon the quality of data being used. At
times, data are readily available for further processing. For instance, retailers often possess large
databases of customer purchases and demographics. On the other hand, data may not be readily
available for data mining. In such cases, you must identify other sources of data or even plan
new data collection initiatives, including surveys. The type of data, its size, and frequency of
collection have a direct bearing on the cost of the data mining exercise. Therefore, identifying the right kind of data for data mining, data that can answer the questions at a reasonable cost, is critical.
Preprocessing Data
Preprocessing data is an important step in data mining. Often raw data are messy, containing
erroneous or irrelevant data. In addition, even with relevant data, information is sometimes
missing. In the preprocessing stage, you identify the irrelevant attributes of data and expunge
such attributes from further consideration. At the same time, identifying the erroneous aspects of
the data set and flagging them as such is necessary. For instance, human error might lead to
inadvertent merging or incorrect parsing of information between columns. Data should be
subject to checks to ensure integrity. Lastly, you must develop a formal method of dealing with
missing data and determine whether the data are missing randomly or systematically.
If the data were missing randomly, a simple set of solutions would suffice. However, when data
are missing in a systematic way, you must determine the impact of missing data on the results.
For instance, a particular subset of individuals in a large data set may have refused to disclose
their income. Findings relying on an individual’s income as input would exclude details of those
individuals whose income was not reported. This would lead to systematic biases in the analysis.
Therefore, you must consider in advance whether observations or variables containing missing data should be excluded from the entire analysis or parts of it.
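A minimal sketch of these preprocessing checks follows, using a small made-up pandas DataFrame; the column names, values, and imputation choice are illustrative assumptions.

```python
# A minimal preprocessing sketch on made-up data: measure missingness, check
# whether it looks systematic, and impute only if it looks random.
# The column names and values are illustrative assumptions.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 41, 37, 58, 33, 62],
    "income": [42000, np.nan, 55000, np.nan, 48000, np.nan],
    "self_employed": [0, 1, 0, 1, 0, 1],
})

# Share of missing incomes overall, and by group (a hint of systematic missingness).
print(df["income"].isna().mean())
print(df.groupby("self_employed")["income"].apply(lambda s: s.isna().mean()))

# If missingness looks random, a simple imputation may suffice; if it is
# systematic (as here, all self-employed incomes are missing), document the
# bias or model the missingness instead of silently filling it in.
df["income_imputed"] = df["income"].fillna(df["income"].median())
```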
Transforming Data
After the relevant attributes of data have been retained, the next step is to determine the
appropriate format in which data must be stored. An important consideration in data mining is to
reduce the number of attributes needed to explain the phenomena. This may require transforming the data. Data reduction algorithms, such as Principal Component Analysis (demonstrated and explained later in the chapter), can reduce the number of attributes without a significant loss of information. In addition, variables may need to be transformed to help explain the phenomenon being studied. For instance, an individual’s income may be recorded in the data set as wage income; income from other sources, such as rental properties; support payments from the government; and the like. Aggregating income from all sources yields a representative indicator of the individual’s income.
Often you need to transform variables from one type to another. It may be prudent to transform
the continuous variable for income into a categorical variable where each record in the database
is identified as a low-, medium-, or high-income individual. This could help capture the non-
linearities in the underlying behaviors.
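The sketch below illustrates these transformations under assumed column names and bin edges: aggregating income from several sources, converting the continuous total into low/medium/high categories, and using Principal Component Analysis to reduce the number of attributes.

```python
# A minimal transformation sketch on made-up data: aggregate income sources,
# bin the continuous total into categories, and reduce attributes with PCA.
# Column names and bin edges are illustrative assumptions.
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "wage_income":    [30000, 72000, 15000, 120000],
    "rental_income":  [0, 12000, 0, 30000],
    "support_income": [5000, 0, 8000, 0],
})

# Aggregate income from all sources into one representative indicator.
sources = ["wage_income", "rental_income", "support_income"]
df["total_income"] = df[sources].sum(axis=1)

# Transform the continuous variable into a categorical one.
df["income_band"] = pd.cut(df["total_income"],
                           bins=[0, 40000, 90000, float("inf")],
                           labels=["low", "medium", "high"])

# Reduce the three income attributes to a single principal component.
df["income_pc1"] = PCA(n_components=1).fit_transform(df[sources]).ravel()
print(df)
```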
Storing Data
The transformed data must be stored in a format conducive to data mining, one that gives the data scientist unrestricted and immediate read/write privileges. During data mining, new variables are created, which are written back to the
original database, which is why the data storage scheme should facilitate efficiently reading from
and writing to the database. It is also important to store data on servers or storage media that
keep the data secure and also prevent the data mining algorithm from unnecessarily searching
for pieces of data scattered on different servers or storage media. Data safety and privacy should
be a prime concern for storing data.
Mining Data
After the data are appropriately processed, transformed, and stored, they are subjected to data mining. This
step covers data analysis methods, including parametric and non-parametric methods, and
machine-learning algorithms. A good starting point for data mining is data visualization.
Multidimensional views of the data using the advanced graphing capabilities of data mining
software are very helpful in developing a preliminary understanding of the trends hidden in the
data set.
Later sections in this chapter detail data mining algorithms and methods.
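As a minimal sketch of that visualization-first approach, the snippet below simulates a small data set and draws a scatter matrix to get a preliminary, multidimensional view of its trends; the variables and the pandas/matplotlib choice are illustrative assumptions.

```python
# A minimal visualization-first sketch: simulate a small data set and draw a
# scatter matrix to get a preliminary, multidimensional view of its trends.
# The variables and the pandas/matplotlib choice are illustrative assumptions.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"floor_area": rng.uniform(50, 250, 300)})
df["price"] = 50_000 + 2_000 * df["floor_area"] + rng.normal(0, 30_000, 300)

# Pairwise scatter plots and histograms reveal trends before any formal modeling.
pd.plotting.scatter_matrix(df, figsize=(6, 6))
plt.show()
```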
After results have been extracted from data mining, you formally evaluate them.
Formal evaluation could include testing the predictive capabilities of the models on observed
data to see how effective and efficient the algorithms have been in reproducing data. This is
known as an “in-sample forecast”. In addition, the results are shared with the key stakeholders
for feedback, which is then incorporated in the later iterations of data mining to improve the
process.
Data mining and evaluating the results become an iterative process in which the analysts use better and improved algorithms to improve the quality of the results in light of the feedback received from the key stakeholders.
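Here is a minimal sketch of an in-sample forecast evaluation on simulated data: fit a model, predict on the same observed data, and measure how well the model reproduces it. The data, model, and metrics are illustrative choices, not a prescribed method.

```python
# A minimal sketch of an "in-sample forecast": fit a model, predict on the same
# observed data, and measure how well it reproduces that data. Everything here
# (data, model, metrics) is an illustrative choice.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1.0, 200)

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)  # predictions on the data the model was fitted to

print("In-sample R^2:", round(r2_score(y, y_hat), 3))
print("In-sample MAE:", round(mean_absolute_error(y, y_hat), 3))
```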
You might have noticed that taller parents often have tall children who are not necessarily taller
than their parents, and that’s a good thing. This is not to suggest that children born to tall parents are not taller than the rest. That may well be the case, but they are not necessarily taller than their own “tall” parents. Why I think this is a good thing requires a simple mental
simulation. Imagine if every successive generation born to tall parents were taller than their
parents, in a matter of a couple of millennia, human beings would become uncomfortably tall for
their own good, requiring even bigger furniture, cars, and planes.
Sir Francis Galton studied the same question in 1886 and landed upon a statistical technique we
today know as regression models. This chapter explores the workings of regression models,
which have become the workhorse of statistical analysis. In almost all empirical pursuits of
research, either in the academic or professional fields, the use of regression models, or their
variants, is ubiquitous. In medical science, regression models are being used to develop more
effective medicines, improve the methods for operations, and optimize resources for small and
large hospitals. In the business world, regression models are at the forefront of analyzing
consumer behavior, firm productivity, and competitiveness of public and private sector entities.
I would like to introduce regression models by narrating a story about my Master’s thesis. I
believe that this story can help explain the utility of regression models.
In 1999, I finished my Master’s research on developing hedonic price models for residential real
estate properties. It took me three years to complete the project involving 500,000 real estate
transactions. As I was getting ready for the defense, my wife generously offered to drive me to
the university. While we were on our way, she asked, “Tell me, what have you found in your
research?” I was delighted to be finally asked to explain what I had been up to for the past three
years. “Well, I have been studying the determinants of housing prices. I have found that larger
homes sell for more than smaller homes,” I told my wife with a triumphant look on my face as I
held the draft of the thesis in my hands.
We were approaching the on-ramp for a highway. As soon as I finished the sentence, my wife
suddenly pulled the car onto the shoulder and applied the brakes. As the car stopped, she turned to me
and said: “I can’t believe that they are giving you a Master’s degree for finding just that. I could
have told you that larger homes sell for more than smaller homes.”
At that very moment, I felt like a professor who taught at the department of obvious conclusions.
How could I blame her for being shocked that what is commonly known about housing prices would earn me a Master’s degree from a university of high repute?
I requested my wife to resume driving so that I could take the next ten minutes to explain to her
the intricacies of my research. She gave me five minutes instead, thinking this may not require
even that. I settled for five and spent the next minute collecting my thoughts. I explained to her
that my research had not just found the correlation between housing prices and the size of housing units, but had also discovered the magnitude of those relationships. For instance, I
found that all else being equal, a term that I explain later in this chapter, an additional washroom
adds more to the housing price than an additional bedroom. Stated otherwise, the marginal
increase in the price of a house is higher for an additional washroom than for an additional
bedroom. I found later that the real estate brokers in Toronto indeed appreciated this finding.
I also explained to my wife that proximity to transport infrastructure, such as subways, resulted
in higher housing prices. For instance, houses situated closer to subways sold for more than did
those situated farther away. However, houses near freeways or highways sold for less than others
did. Similarly, I also discovered that proximity to large shopping centers had a nonlinear impact
on housing prices. Houses located very close (less than 2.5 km) to the shopping centers sold for
less than the rest. However, houses located closer (less than 5 km, but more than 2.5 km) to the
shopping center sold for more than did those located farther away. I also found that the housing
values in Toronto declined with distance from downtown.
As I explained my contributions to the study of housing markets, I noticed that my wife was
mildly impressed. The likely reason for her lukewarm reception was that my findings confirmed
what we already knew from our everyday experience. However, the real value added by the
research rested in quantifying the magnitude of those relationships.
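To give a flavor of how such hedonic price models are estimated, here is a minimal sketch on simulated data; the attributes, coefficients, and the use of statsmodels are assumptions for illustration, not the thesis results.

```python
# A minimal hedonic-pricing sketch on simulated data: regress price on a few
# home attributes and read off each attribute's marginal effect on price.
# The attributes and coefficients are assumptions, not the thesis results.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
bedrooms = rng.integers(1, 6, n)
washrooms = rng.integers(1, 4, n)
near_subway = rng.integers(0, 2, n)  # 1 if close to a subway station

# Simulated "truth": an extra washroom adds more than an extra bedroom,
# and subway proximity carries a premium.
price = (200_000 + 15_000 * bedrooms + 25_000 * washrooms
         + 30_000 * near_subway + rng.normal(0, 20_000, n))

X = sm.add_constant(np.column_stack([bedrooms, washrooms, near_subway]))
result = sm.OLS(price, X).fit()
print(result.params.round(0))  # constant plus the three marginal effects
```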
Why Regress?
A whole host of questions could be put to regression analysis. Some examples of questions that
regression (hedonic) models could address include:
The ultimate purpose of analytics is to communicate findings to the concerned audience, who might use
these insights to formulate policy or strategy. Analytics summarize findings in tables and plots.
The data scientist should then use the insights to build the narrative to communicate the findings.
In academia, the final deliverable is in the form of essays and reports. Such deliverables are
usually 1,000 to 7,000 words in length.
In consulting and business, the final deliverable takes on several forms. It can be a small
document of fewer than 1,500 words illustrated with tables and plots, or it could be a
comprehensive document comprising several hundred pages. Large consulting firms, such as
McKinsey and Deloitte, routinely generate analytics-driven reports to communicate their
findings and, in the process, establish their expertise in specific knowledge domains.
Let’s review the “United States Economic Forecast”, a publication by the Deloitte University
Press. This document serves as a good example of a deliverable that builds a narrative from data and analytics. The 24-page report focuses on the state of the U.S. economy as observed in December 2014. The report opens with a grabber highlighting the fact that, contrary to popular perception, economic and job growth has been quite robust in the United States. The report is
not merely a statement of facts.
In fact, it is a carefully crafted report that cites Voltaire and follows a distinct theme. The report
focuses on the good news about the U.S. economy. These include the increased investment in
manufacturing equipment in the U.S. and the likelihood of higher consumer consumption
resulting from lower oil prices.
The Deloitte report uses time series plots to illustrate trends in markets. The GDP growth chart
shows how the economy contracted during the Great Recession and has rebounded since then.
The graphic presents four likely scenarios for the future. Another plot shows the changes in
consumer spending. The accompanying narrative focuses on income inequality in the U.S. and
refers to Thomas Piketty’s book on the same subject. The Deloitte report mentions that many consumers did not experience an increase in their real incomes over the years, while they still maintained their level of spending. Other graphics focus on the housing, business, and government sectors; international trade; labor and financial markets; and prices. The appendix carries four tables
documenting data for the four scenarios discussed in the report.
Deloitte’s “United States Economic Forecast” serves the very purpose that its authors intended.
The report uses data and analytics to generate the likely economic scenarios. It builds a powerful
narrative in support of the thesis statement that the U.S. economy is doing much better than most
would like to believe. At the same time, the report shows Deloitte to be a competent firm capable
of analyzing economic data and prescribing strategies to cope with the economic challenges.
Now consider if we were to exclude the narrative from this report and present the findings as a
deck of PowerPoint slides with eight graphics and four tables. The PowerPoint slides would have
failed to communicate the message that the authors carefully crafted in the report citing Piketty
and Voltaire. I consider Deloitte’s report a good example of storytelling with data and encourage
you to read the report to decide for yourself whether the deliverable would have been equally
powerful without the narrative.
Now, let us work backward from the Deloitte report. Before the authors started their analysis,
they must have discussed the scope of the final deliverable. They would have deliberated the key
message of the report and then looked for the data and analytics they needed to make their case.
The initial planning and conceptualizing of the final deliverable is therefore extremely important
for producing a compelling document. Embarking on analytics, without due consideration to the
final deliverable, is likely to result in a poor-quality document where the analytics and narrative
would struggle to blend.