Data Science Lecture 1 Introduction
Lecture 1
Data science
Exponential Increase in Data
• All human-generated information up
to 2003 amounted to about 5 exabytes.
• The same amount of data was generated
every 2 days in 2011,
• and would be generated every 10 minutes now.
“Data is the New Oil”
– World Economic Forum 2011
• “Data is the new oil.” Coined in 2006 by
Clive Humby, a British data
commercialization entrepreneur, this
now-famous phrase was embraced by
the World Economic Forum in a 2011
report.
• Data is just like crude oil. It’s valuable,
but if unrefined it cannot really be
used. It has to be changed into gas,
plastic, chemicals, etc. to create a
valuable entity that drives profitable
activity; so must data be broken down
and analyzed for it to have value.
What is Data Science?
• Fortune magazine
• “Hot New Gig in Tech”
• Hal Varian, Google’s Chief Economist, NYT, 2009:
• Statistics: The next attractive job
• “The ability to take data—to be able to understand it, to process it, to extract value
from it, to visualize it, to communicate it—that’s going to be a hugely important skill.”
• Mike Driscoll, CEO of Metamarkets:
• “Data science, as it's practiced, is a blend of Red-Bull-fueled hacking and espresso-
inspired statistics.”
• “Data science is the civil engineering of data. Its acolytes possess a practical knowledge
of tools & materials, coupled with a theoretical understanding of what's possible.”
Data Science – A Visual Definition
• Drew Conway’s Data Science Venn Diagram
What do data scientists do?
• “They need to find nuggets of truth in data and then explain it to the
business leaders,” Richard Snee, EMC
Smarter Work
More efficient and effective use of staff and resources
Data Scientist
“A data scientist is someone who can obtain, scrub, explore, model
and interpret data, blending hacking, statistics and machine learning.
Data scientists not only are adept at working with data, but appreciate
data itself as a first-class product.”
Hilary Mason, chief scientist at bit.ly
• “data wrangling”
• “data jujitsu”
• “data munging”
Three types of tasks
1) Preparing to run a model
Gathering, cleaning, integrating, restructuring, transforming, loading,
filtering, deleting, combining, merging, verifying, extracting, shaping,
massaging.
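As a rough illustration of a few of the preparation steps listed above (cleaning, transforming, deduplicating, merging, filtering), here is a minimal Python sketch using only the standard library. The records, the field names (user_id, name, age, country), and the helper functions are all hypothetical and exist only for illustration; real pipelines are usually far messier.

```python
# Hypothetical raw records, as they might arrive from two different sources.
raw_users = [
    {"user_id": "1", "name": " Alice ", "age": "34"},
    {"user_id": "2", "name": "bob", "age": ""},        # missing age
    {"user_id": "2", "name": "bob", "age": ""},        # exact duplicate
    {"user_id": "3", "name": "Carol", "age": "abc"},   # unparseable age
]
raw_countries = [
    {"user_id": "1", "country": "UK"},
    {"user_id": "3", "country": "US"},
]


def clean_user(record):
    """Cleaning/transforming: trim whitespace, normalize case, parse numbers."""
    age_text = record["age"].strip()
    return {
        "user_id": int(record["user_id"]),
        "name": record["name"].strip().title(),
        "age": int(age_text) if age_text.isdigit() else None,
    }


def deduplicate(records, key):
    """Keep the first record seen for each value of `key`."""
    seen = {}
    for record in records:
        seen.setdefault(record[key], record)
    return list(seen.values())


def merge(users, countries):
    """Combining/merging: left join of users with countries on user_id."""
    country_by_id = {int(c["user_id"]): c["country"] for c in countries}
    return [{**u, "country": country_by_id.get(u["user_id"])} for u in users]


def prepare(users, countries):
    cleaned = [clean_user(u) for u in users]
    unique = deduplicate(cleaned, "user_id")
    merged = merge(unique, countries)
    # Filtering: keep only rows complete enough to feed a model.
    return [row for row in merged if row["age"] is not None]


prepared = prepare(raw_users, raw_countries)
print(prepared)
```

Dropping rows with a missing age is only one possible choice here; imputing a value or keeping the row with a flag are equally plausible, depending on the model.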
• Data scientists are responsible for building data-centric products and
applications that handle data in ways conventional systems cannot. The practice
of data science focuses heavily on the technical ability to handle any type of
data.
• Unlike data mining and machine learning, data science is also responsible for
assessing the impact of data on a specific product or organization.
• Data science focuses on the science of data, whereas data mining deals with the
process of discovering new patterns in large data sets. Data mining may appear
similar to machine learning because it also makes use of algorithms. However,
unlike in machine learning, algorithms are only one part of data mining.
• In machine learning, algorithms are used to gain knowledge from data sets.
In data mining, algorithms are applied only as one part of a larger process;
unlike machine learning, data mining does not focus entirely on algorithms.
Data Science
• Data Science is a field of study that spans Big Data Analytics,
Data Mining, Predictive Modeling, Data Visualization, Mathematics,
and Statistics.
• Data Science has been referred to as the fourth paradigm of science
(the other three being Theoretical, Empirical, and Computational).
Academia often conducts research devoted exclusively to Data Science.
Key Differences Between Data Science and Data Mining
• Data Mining is an activity that forms part of the broader Knowledge
Discovery in Databases (KDD) process, while Data Science is a field of
study, just like Applied Mathematics or Computer Science.
• Data Science is thought to be broader in scope, while Data Mining is
considered narrower.
• Some activities of Data Mining, such as statistical analysis, writing data
flows, and pattern recognition, intersect with Data Science. Hence,
Data Mining can be seen as a subset of Data Science.
• Machine Learning in Data Mining is used mostly for pattern recognition,
while in Data Science it has a more general use.
Data Science Vs Data Mining Comparison Table
Databases and Data Science
                Databases                  Data Science
Data Value      “Precious”                 “Cheap”
Data Volume     Modest                     Massive
Examples        Bank records,              Online clicks,
                personnel records,         GPS logs,
                census,                    tweets,
                medical records            building sensor readings
Priorities      Consistency,               Speed,
                error recovery,            availability,
                auditability               query richness
Structured      Strongly (schema)          Weakly or none (text)
Properties      Transactions, ACID*        CAP* theorem (2/3),
                                           eventual consistency
Realizations    SQL                        NoSQL:
                                           MongoDB, CouchDB, HBase,
                                           Cassandra, Riak, Memcached,
                                           Apache River, …
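To make the "Transactions, ACID" entry in the comparison above more concrete, the following minimal sketch uses Python's built-in sqlite3 module to show atomicity: a transfer that fails partway through leaves no partial update behind. The accounts table, the account names, and the transfer and balances helpers are hypothetical and only serve the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True on success."""
    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if an exception is raised inside the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This second update violates the CHECK constraint when the
            # source account has insufficient funds.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok_small = transfer(conn, "alice", "bob", 30)    # within alice's balance
after_small = balances(conn)
ok_large = transfer(conn, "alice", "bob", 500)   # exceeds alice's balance
after_large = balances(conn)
print(ok_small, after_small)
print(ok_large, after_large)
```

In the failed transfer, the credit to bob is executed first and then undone by the rollback, so both balances stay as they were. Many systems in the right-hand column trade this guarantee for availability and speed.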
• If you’re a statistician, you need to learn to deal with data that does
not fit in memory
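One common way to work with data that does not fit in memory is to process it as a stream and keep only a small running summary. The sketch below uses Welford's online algorithm to compute a mean and variance in a single pass with constant memory. The RunningStats class and the readings generator, which stands in for a large file or sensor feed, are illustrative assumptions.

```python
import math


class RunningStats:
    """Welford's online algorithm: one pass, constant memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 in the denominator)."""
        return self._m2 / (self.count - 1) if self.count > 1 else float("nan")


def readings(n=100_000):
    """Stand-in for a source too large to load at once,
    such as a file read line by line."""
    for i in range(1, n + 1):
        yield float(i)


stats = RunningStats()
for value in readings():
    stats.update(value)  # only three numbers are kept in memory

print(stats.count, stats.mean, math.sqrt(stats.variance))
```

The same pattern applies to any statistic that can be updated incrementally; statistics that need the whole data set at once (an exact median, for example) call for approximation or external sorting instead.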