
IDS Unit 1 Notes

Data science involves extracting insights from vast amounts of data using scientific methods and algorithms. It is an interdisciplinary field that allows knowledge to be extracted from structured or unstructured data to solve business problems. Data science jobs include data scientist, data engineer, data analyst, statistician, and more. These roles use tools like R, Python, SQL, and Tableau to work with large datasets, build models, and communicate results. Challenges include dealing with diverse data types, a lack of talent and financial support, and ensuring privacy when working with people's information.

Uploaded by

Pavan Kumar

Introduction to Data Science

What is Data Science? Introduction, Basic Concepts & Process

What is Data Science?


Data Science is the area of study which involves extracting insights from vast amounts of data using various scientific methods, algorithms, and processes. It helps you discover hidden patterns in raw data. The term Data Science emerged with the evolution of mathematical statistics, data analysis, and big data.

Data Science is an interdisciplinary field that allows you to extract knowledge from structured
or unstructured data. Data science enables you to translate a business problem into a research
project and then translate it back into a practical solution.

Why Data Science?


Here are significant advantages of using data science technology:

• Data is the oil of today's world. With the right tools, technologies, and algorithms, we can convert data into a distinct business advantage.
• Data science can help you detect fraud using advanced machine learning algorithms.
• It helps you prevent significant monetary losses.
• It allows you to build intelligent capabilities into machines.
• You can perform sentiment analysis to gauge customer brand loyalty.
• It enables you to make better and faster decisions.
• It helps you recommend the right product to the right customer to grow your business.

Data Science Components

Statistics:
Statistics is the most critical unit of data science basics; it is the science of collecting and analyzing numerical data in large quantities to derive useful insights.
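As a small illustration, the summary measures this component relies on can be computed with Python's built-in `statistics` module (the sales figures below are made up for the example):

```python
import statistics

# Hypothetical monthly sales figures (made-up data for illustration).
sales = [120, 135, 150, 110, 160, 145, 155]

mean = statistics.mean(sales)      # average value
median = statistics.median(sales)  # middle value when sorted
stdev = statistics.stdev(sales)    # spread around the mean

print(f"mean={mean:.1f}, median={median}, stdev={stdev:.1f}")
```

Even these three numbers already summarise a dataset far more usefully than the raw list of values.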

Visualization:
Visualization techniques help you present huge amounts of data as easy-to-understand, digestible visuals.

Data Science Process


1. Discovery:
The discovery step involves acquiring data from all identified internal and external sources that help you answer the business question.

The data can be:

• Logs from web servers
• Data gathered from social media
• Census datasets
• Data streamed from online sources using APIs

2. Preparation:
Data can have many inconsistencies, such as missing values, blank columns, and incorrect data formats, which need to be cleaned. You need to process, explore, and condition the data before modelling. The cleaner your data, the better your predictions.
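The kind of cleaning described above can be sketched in a few lines of plain Python (the records and the median-imputation choice are illustrative; in practice a library such as pandas would typically be used):

```python
from statistics import median

# Raw records with the kinds of inconsistencies mentioned above:
# a missing value (None) and an inconsistent format (age stored as a string).
raw = [
    {"name": "Asha",  "age": "34"},
    {"name": "Ravi",  "age": None},
    {"name": "Meena", "age": "29"},
    {"name": "Kiran", "age": "41"},
]

# Step 1: normalise the format by converting present ages to integers.
ages = [int(r["age"]) for r in raw if r["age"] is not None]

# Step 2: impute missing values with the median of the observed ages.
fill = median(ages)
cleaned = [
    {**r, "age": int(r["age"]) if r["age"] is not None else fill}
    for r in raw
]

print(cleaned)
```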

3. Model Planning:
In this stage, you determine the methods and techniques for drawing the relationships between input variables. Planning for a model is performed using different statistical formulas and visualization tools. SQL Analysis Services, R, and SAS/ACCESS are some of the tools used for this purpose.

4. Model Building:
In this step, the actual model building starts. Here, the data scientist splits the dataset into training and testing portions. Techniques like association, classification, and clustering are applied to the training set. The model, once prepared, is tested against the “testing” dataset.
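A minimal sketch of this step using only the standard library: the dataset is shuffled, distributed into training and testing portions, a simple centroid-per-class model is learned from the training data, and it is then tested against the held-out set. The data and the classifier are invented for illustration, standing in for real techniques like those named above:

```python
import random

# Toy labelled dataset: (feature, class) pairs. Small values belong to
# class 0, large values to class 1 (made-up data for illustration).
data = [(x, 0) for x in [1.0, 1.2, 0.8, 1.1, 0.9]] + \
       [(x, 1) for x in [5.0, 5.3, 4.8, 5.1, 4.9]]

# Distribute the dataset into training and testing portions (80/20).
random.seed(42)
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# "Model building": learn one centroid per class from the training data.
def centroid(rows, label):
    vals = [x for x, y in rows if y == label]
    return sum(vals) / len(vals)

centroids = {label: centroid(train, label) for label in (0, 1)}

# Test the prepared model against the held-out "testing" dataset.
def predict(x):
    return min(centroids, key=lambda label: abs(x - centroids[label]))

accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

Because the two classes are well separated, this toy model classifies the held-out points perfectly; real datasets rarely cooperate so neatly.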

5. Operationalize:
In this stage, you deliver the final baselined model with reports, code, and technical documents. The model is deployed into a real-time production environment after thorough testing.

6. Communicate Results
In this stage, the key findings are communicated to all stakeholders. This helps you decide if the
project results are a success or a failure based on the inputs from the model.

Data Science Job Roles


The most prominent data science job titles are:

• Data Scientist
• Data Engineer
• Data Analyst
• Statistician
• Data Architect
• Data Admin
• Business Analyst
• Data/Analytics Manager

Data Scientist:
Role: A data scientist is a professional who manages enormous amounts of data to come up with compelling business insights through the use of various tools, techniques, methodologies, algorithms, etc.

Languages: R, SAS, Python, SQL, Hive, Matlab, Pig, Spark

Data Engineer:
Role: A data engineer works with large amounts of data. He or she develops, constructs, tests, and maintains architectures such as large-scale processing systems and databases.
Languages: SQL, Hive, R, SAS, Matlab, Python, Java, Ruby, C++, and Perl

Data Analyst:
Role: A data analyst is responsible for mining vast amounts of data, looking for relationships, patterns, and trends. He or she then delivers compelling reporting and visualization for analyzing the data to support the most viable business decisions.

Languages: R, Python, HTML, JS, C, C++, SQL

Statistician:
Role: The statistician collects, analyses, and interprets qualitative and quantitative data using statistical theories and methods.
Languages: SQL, R, Matlab, Tableau, Python, Perl, Spark, and Hive

Data Administrator:
Role: A data administrator ensures that the database is accessible to all relevant users, that it performs correctly, and that it is kept safe from hacking.

Languages: Ruby on Rails, SQL, Java, C#, and Python

Business Analyst:
Role: A business analyst works to improve business processes and acts as an intermediary between the business executive team and the IT department.

Languages: SQL, Tableau, Power BI, and Python

Tools for Data Science

Applications of Data Science

Some applications of Data Science are:

Internet Search:
Google Search uses data science technology to return a specific result within a fraction of a second.

Recommendation Systems:
Recommendation systems, such as “suggested friends” on Facebook or “suggested videos” on YouTube, are built with the help of data science.

Image & Speech Recognition:
Speech recognition systems like Siri, Google Assistant, and Alexa run on data science techniques. Moreover, Facebook recognizes your friends when you upload a photo with them, with the help of data science.

Gaming world:
EA Sports, Sony, and Nintendo use data science technology, which enhances your gaming experience. Games are now developed using machine learning techniques, and they can update themselves as you move to higher levels.

Online Price Comparison:
PriceRunner, Junglee, and Shopzilla work on data science mechanisms. Here, data is fetched from the relevant websites using APIs.
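A toy sketch of the comparison step itself, with hard-coded quotes standing in for the JSON responses an API would return (all shop names and prices are invented):

```python
# Hypothetical price quotes, standing in for responses fetched from
# each retailer's API (the shop names and prices are invented).
quotes = {
    "shop_a": {"headphones": 59.99, "keyboard": 34.50},
    "shop_b": {"headphones": 54.25, "keyboard": 36.00},
    "shop_c": {"headphones": 57.10, "keyboard": 33.75},
}

def best_price(product):
    """Return (shop, price) for the cheapest offer of a product."""
    offers = {shop: prices[product]
              for shop, prices in quotes.items() if product in prices}
    shop = min(offers, key=offers.get)
    return shop, offers[shop]

print(best_price("headphones"))
```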

Challenges of Data Science Technology:

1. **High Variety of Data:** This means that data scientists often need many different types of information to
do their job properly. It's like needing different ingredients to cook a meal.

2. **Shortage of Data Science Talent:** There aren't enough people with the right skills to work in data
science. Think of it as not having enough chefs to cook in a restaurant.

3. **Lack of Financial Support:** Sometimes, the people in charge of a company don't want to spend money
on a data science team, which can make it hard to do the work.

4. **Difficult Access to Data:** It can be tough to get the data you need for analysis. It's like trying to find
the right ingredients for your recipe but not being able to locate them easily.

5. **Underutilization of Data by Decision-Makers:** Sometimes, the important people who make decisions
in a company don't use the information that data scientists provide them. It's like having a great recipe, but no
one wants to try the dish you cooked.

6. **Explaining Data Science:** Data scientists often struggle to explain their work to others who may not
understand technical jargon. It's like trying to describe a complicated science experiment to someone who
didn't study science.

7. **Privacy Issues:** Protecting people's private information when working with data is a big concern. It's
like making sure that people's personal secrets are safe when you're looking at their recipe preferences.

8. **Lack of Domain Expertise:** Sometimes, data scientists may not know enough about the specific
industry they're working in, which can make their analysis less accurate. It's like trying to cook a traditional
dish from a foreign country without knowing the recipe.
9. **Small Organizations:** In very small companies, there might not be enough resources or people to have
a dedicated data science team. It's like trying to run a restaurant with just a few staff members.

These are some of the common challenges in the world of data science, simplified to everyday situations to make them easier to understand.

Need for Data Science:

Some years ago, data was scarce and mostly available in a structured form, which could easily be stored in Excel sheets and processed using BI tools.

But in today's world data has become vast: approximately 2.5 quintillion bytes of data are generated every day, which has led to a data explosion. Researchers estimated that by 2020, 1.7 MB of data would be created every second by every person on earth. Every company requires data to work, grow, and improve its business.

Handling such a huge amount of data is a challenging task for every organization. To handle, process, and analyze it, we need complex, powerful, and efficient algorithms and technology, and that technology came into existence as data science. Following are some main reasons for using data science technology:

o With the help of data science technology, we can convert massive amounts of raw and unstructured data into meaningful insights.
o Data science technology is being adopted by various companies, whether big brands or startups. Google, Amazon, Netflix, etc., which handle huge amounts of data, use data science algorithms to improve the customer experience.
o Data science is working toward automating transportation, such as creating self-driving cars, which are the future of transportation.
o Data science can help with different predictions, such as surveys, elections, flight ticket confirmation, etc.

Data science Jobs:

As per various surveys, the data scientist job is becoming the most in-demand job of the 21st century due to the increasing demand for data science. Some people have even called it "the hottest job title of the 21st century". Data scientists are experts who use various statistical tools and machine learning algorithms to understand and analyze data.

The average salary range for a data scientist is approximately $95,000 to $165,000 per annum, and as per different studies, about 11.5 million jobs will be created by the year 2026.

Types of Data Science Job

If you learn data science, you will have the opportunity to pursue various exciting job roles in this domain. The main job roles are given below:

1. Data Scientist
2. Data Analyst
3. Machine learning expert
4. Data engineer
5. Data Architect
6. Data Administrator
7. Business Analyst
8. Business Intelligence Manager
Below is the explanation of some critical job titles of data science.

1. Data Analyst:

A data analyst is an individual who mines huge amounts of data, models the data, and looks for patterns, relationships, trends, and so on. At the end of the day, he or she produces visualizations and reports for analyzing the data, to support decision making and problem solving.

Skills required: To become a data analyst, you need a good background in mathematics, business intelligence, data mining, and basic statistics. You should also be familiar with computer languages and tools such as MATLAB, Python, SQL, Hive, Pig, Excel, SAS, R, JS, Spark, etc.

2. Machine Learning Expert:

A machine learning expert works with the various machine learning algorithms used in data science, such as regression, clustering, classification, decision trees, random forests, etc.
Skills required: computer programming languages such as Python, C++, R, Java, and Hadoop. You should also have an understanding of various algorithms, analytical problem-solving skills, probability, and statistics.

3. Data Engineer:

A data engineer works with massive amounts of data and is responsible for building and maintaining the data architecture of a data science project. Data engineers also create the dataset processes used in modeling, mining, acquisition, and verification.

Skills required: A data engineer must have in-depth knowledge of SQL, MongoDB, Cassandra, HBase, Apache Spark, Hive, and MapReduce, with language knowledge of Python, C/C++, Java, Perl, etc.

4. Data Scientist:

A data scientist is a professional who works with enormous amounts of data to come up with compelling business insights through the deployment of various tools, techniques, methodologies, algorithms, etc.

Skills required: To become a data scientist, one should have technical language skills such as R, SAS, SQL, Python, Hive, Pig, Apache Spark, and MATLAB. Data scientists must also have an understanding of statistics, mathematics, and visualization, along with communication skills.

Prerequisite for Data Science

Non-Technical Prerequisite:

o Curiosity: To learn data science, one must be curious. When you are curious and ask various questions, you can understand the business problem easily.

o Critical Thinking: Critical thinking is also required for a data scientist, so that you can find multiple new ways to solve a problem efficiently.
o Communication skills: Communication skills are most important for a data scientist because, after solving a business problem, you need to communicate it to the team.

Technical Prerequisite:

o Machine learning: To understand data science, one needs to understand the concept of
machine learning. Data science uses machine learning algorithms to solve various
problems.
o Mathematical modeling: Mathematical modeling is required to make fast mathematical
calculations and predictions from the available data.
o Statistics: Basic understanding of statistics is required, such as mean, median, or
standard deviation. It is needed to extract knowledge and obtain better results from the
data.
o Computer programming: For data science, knowledge of at least one programming language is required. R, Python, and Spark are some of the programming languages commonly used in data science.
o Databases: A deep understanding of databases such as SQL is essential in data science for retrieving and working with data.
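As a minimal illustration of working with data through SQL, Python's built-in `sqlite3` module can run queries against an in-memory database (the table and rows are made up for the example):

```python
import sqlite3

# In-memory database as a stand-in for a real data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Asha", 120.0), ("Ravi", 80.0), ("Asha", 45.0)],
)

# Aggregate query: total spend per customer, largest first.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY SUM(amount) DESC"
).fetchall()
print(rows)
conn.close()
```

The same `SELECT ... GROUP BY` pattern carries over directly to production databases such as PostgreSQL or MySQL.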

Difference between BI and Data Science

BI stands for business intelligence, which is also used for data analysis of business information. Below are some differences between BI and data science:

| Criterion | Business Intelligence | Data Science |
| --- | --- | --- |
| Data Source | Deals with structured data, e.g., a data warehouse. | Deals with structured and unstructured data, e.g., weblogs, feedback, etc. |
| Method | Analytical (historical data). | Scientific (goes deeper to find the reason behind the data). |
| Skills | Statistics and visualization are the two required skills. | Statistics, visualization, and machine learning are the required skills. |
| Focus | Focuses on past and present data. | Focuses on past and present data as well as future predictions. |

Data Science Components:


The main components of Data Science are given below:

1. Statistics: Statistics is one of the most important components of data science. It is a way to collect and analyze numerical data in large amounts and find meaningful insights from it.

2. Domain Expertise: Domain expertise binds data science together. It means specialized knowledge or skill in a particular area, and in data science there are various areas for which we need domain experts.

3. Data engineering: Data engineering is the part of data science that involves acquiring, storing, retrieving, and transforming data. Data engineering also involves adding metadata (data about data) to the data.

4. Visualization: Data visualization means representing data in a visual context so that people can easily understand its significance. Visualization makes it easy to take in huge amounts of data at a glance.
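As a toy illustration of the idea, even a plain-text bar chart makes counts easier to digest than a list of numbers (the category counts below are invented; real work would use a tool like matplotlib or Tableau):

```python
# Category counts to visualise (made-up numbers for illustration).
counts = {"North": 12, "South": 7, "East": 9, "West": 3}

def bar_chart(data, width=20):
    """Render a horizontal bar chart as text, scaled to `width` chars."""
    peak = max(data.values())
    lines = []
    for label, value in data.items():
        bar = "#" * round(width * value / peak)
        lines.append(f"{label:<6}{bar} {value}")
    return "\n".join(lines)

print(bar_chart(counts))
```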

5. Advanced computing: Advanced computing does the heavy lifting of data science. It involves designing, writing, debugging, and maintaining the source code of computer programs.
6. Mathematics: Mathematics is a critical part of data science. It involves the study of quantity, structure, space, and change. For a data scientist, a good knowledge of mathematics is essential.

7. Machine learning: Machine learning is the backbone of data science. It is all about training a machine so that it can act like a human brain. In data science, we use various machine learning algorithms to solve problems.
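One of the simplest examples of "training" is fitting a straight line to data by ordinary least squares. The sketch below uses made-up points that lie exactly on y = 2x + 1, so the fitted slope and intercept recover those values:

```python
# Training data lying exactly on the line y = 2x + 1 (made-up points).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares: slope = cov(x, y) / var(x).
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

def predict(x):
    """Use the fitted model to predict y for a new x."""
    return slope * x + intercept

print(slope, intercept, predict(10.0))
```

The "learning" here is just solving for two parameters from data; more sophisticated algorithms follow the same fit-then-predict pattern at a larger scale.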

Big Data and Data Science Hype:

Let’s get this out of the way right off the bat, because many of you are likely skeptical of data
science already for many of the reasons we were. We want to address this up front to let you
know: we’re right there with you. If you’re a skeptic too, it probably means you have something
useful to contribute to making data science into a more legitimate field that has the power to
have a positive impact on society.
So, what is eyebrow-raising about Big Data and data science? Let’s count the ways:

1. There’s a lack of definitions around the most basic terminology. What is “Big Data”
anyway? What does “data science” mean? What is the relationship between Big Data and data
science? Is data science the science of Big Data? Is data science only the stuff going on in
companies like Google and Facebook and tech companies? Why do many people refer to Big
Data as crossing disciplines (astronomy, finance, tech, etc.) and to data science as only taking
place in tech? Just how big is big? Or is it just a relative term? These terms are so ambiguous,
they’re well-nigh meaningless.
2. There’s a distinct lack of respect for the researchers in academia and industry labs who
have been working on this kind of stuff for years, and whose work is based on decades (in some
cases, centuries) of work by statisticians, computer scientists, mathematicians, engineers, and
scientists of all types. From the way the media describes it, machine learning algorithms were
just invented last week and data was never “big” until Google came along. This is simply not
the case. Many of the methods and techniques we’re using—and the challenges we’re facing
now—are part of the evolution of everything that’s come before. This doesn’t mean that there’s
not new and exciting stuff going on, but we think it’s important to show some basic respect for
everything that came before.
3. The hype is crazy—people throw around tired phrases straight out of the height of the
pre-financial crisis era like “Masters of the Universe” to describe data scientists, and that
doesn’t bode well. In general, hype masks reality and increases the noise-to-signal ratio. The
longer the hype goes on, the more many of us will get turned off by it, and the harder it will be
to see what’s good underneath it all, if anything.
4. Statisticians already feel that they are studying and working on the “Science of Data.”
That’s their bread and butter. Maybe you, dear reader, are not a statistician and don’t care, but
imagine that for the statistician, this feels a little bit like how identity theft might feel for you.
Although we will make the case that data science is not just a rebranding of statistics or
machine learning but rather a field unto itself, the media often describes data science in a way
that makes it sound as if it’s simply statistics or machine learning in the context of the tech
industry.
5. People have said to us, “Anything that has to call itself a science isn’t.” Although there
might be truth in there, that doesn’t mean that the term “data science” itself represents nothing,
but of course what it represents may not be science but more of a craft.

Getting Past the Hype


Rachel’s experience going from getting a PhD in statistics to working at Google is a great
example to illustrate why we thought, in spite of the aforementioned reasons to be dubious,
there might be some meat in the data science sandwich. In her words:

It was clear to me pretty quickly that the stuff I was working on at Google was different than
anything I had learned at school when I got my PhD in statistics. This is not to say that my
degree was useless; far from it—what I’d learned in school provided a framework and way of
thinking that I relied on daily, and much of the actual content provided a solid theoretical and
practical foundation necessary to do my work.
But there were also many skills I had to acquire on the job at Google that I hadn’t learned in
school. Of course, my experience is specific to me in the sense that I had a statistics background
and picked up more computation, coding, and visualization skills, as well as domain expertise
while at Google. Another person coming in as a computer scientist or a social scientist or a
physicist would have different gaps and would fill them in accordingly. But what is important here
is that, as individuals, we each had different strengths and gaps, yet we were able to solve
problems by putting ourselves together into a data team well-suited to solve the data problems
that came our way.

Here’s a reasonable response you might have to this story. It’s a general truism that, whenever
you go from school to a real job, you realize there’s a gap between what you learned in school
and what you do on the job. In other words, you were simply facing the difference between
academic statistics and industry statistics.

We have a couple replies to this:

• Sure, there is a difference between industry and academia. But does it really have to be
that way? Why do many courses in school have to be so intrinsically out of touch with reality?
• Even so, the gap doesn’t represent simply a difference between industry statistics and
academic statistics. The general experience of data scientists is that, at their job, they have
access to a larger body of knowledge and methodology, as well as a process, which we now
define as the data science process (details in Chapter 2), that has foundations in both statistics
and computer science.

Rachel gave herself the task of understanding the cultural phenomenon of data science and how
others were experiencing it. She started meeting with people at Google, at startups and tech
companies, and at universities, mostly from within statistics departments.
From those meetings she started to form a clearer picture of the new thing that’s emerging. She
ultimately decided to continue the investigation by giving a course at Columbia called
“Introduction to Data Science,” which Cathy covered on her blog. We figured that by the end of
the semester, we, and hopefully the students, would know what all this actually meant. And
now, with this book, we hope to do the same for many more people.

Why Now?

We have massive amounts of data about many aspects of our lives, and, simultaneously, an
abundance of inexpensive computing power. Shopping, communicating, reading news, listening
to music, searching for information, expressing our opinions—all this is being tracked online, as
most people know.
What people might not know is that the “datafication” of our offline behavior has started as
well, mirroring the online data collection revolution (more on this later). Put the two together,
and there’s a lot to learn about our behavior and, by extension, who we are as a species.

It’s not just Internet data, though—it’s finance, the medical industry, pharmaceuticals,
bioinformatics, social welfare, government, education, retail, and the list goes on. There is a
growing influence of data in most sectors and most industries. In some cases, the amount of
data collected might be enough to be considered “big” (more on this in the next chapter); in
other cases, it’s not.

But it’s not only the massiveness that makes all this new data interesting (or poses challenges).
It’s that the data itself, often in real time, becomes the building blocks of data products. On the
Internet, this means Amazon recommendation systems, friend recommendations on Facebook,
film and music recommendations, and so on. In finance, this means credit ratings, trading
algorithms, and models. In education, this is starting to mean dynamic personalized learning
and assessments coming out of places like Knewton and Khan Academy. In government, this
means policies based on data.

We’re witnessing the beginning of a massive, culturally saturated feedback loop where our
behavior changes the product and the product changes our behavior. Technology makes this
possible: infrastructure for large-scale data processing, increased memory, and bandwidth, as
well as a cultural acceptance of technology in the fabric of our lives. This wasn’t true a decade
ago.
Considering the impact of this feedback loop, we should start thinking seriously about how it’s
being conducted, along with the ethical and technical responsibilities for the people responsible
for the process. One goal of this book is a first stab at that conversation.

Datafication

In the May/June 2013 issue of Foreign Affairs, Kenneth Neil Cukier and Viktor
Mayer-Schoenberger wrote an article called “The Rise of Big Data”. In it they discuss the
concept of datafication, and their example is how we quantify friendships with “likes”: it’s
the way everything we do, online or otherwise, ends up recorded for later examination in
someone’s data storage units. Or maybe multiple storage units, and maybe also for sale.

They define datafication as a process of “taking all aspects of life and turning them into data.”
As examples, they mention that “Google’s augmented-reality glasses datafy the gaze. Twitter
datafies stray thoughts. LinkedIn datafies professional networks.”
Datafication is an interesting concept and led us to consider its importance with respect to
people’s intentions about sharing their own data. We are being datafied, or rather our actions
are, and when we “like” someone or something online, we are intending to be datafied, or at
least we should expect to be. But when we merely browse the Web, we are unintentionally, or at
least passively, being datafied through cookies that we might or might not be aware of. And
when we walk around in a store, or even on the street, we are being datafied in a completely
unintentional way, via sensors, cameras, or Google glasses.

This spectrum of intentionality ranges from us gleefully taking part in a social media
experiment we are proud of, to all-out surveillance and stalking. But it’s all datafication. Our
intentions may run the gamut, but the results don’t.

They follow up their definition in the article with a line that speaks volumes about their
perspective:

Once we datafy things, we can transform their purpose and turn the information into new forms
of value.

Here’s an important question that we will come back to throughout the book: who is “we” in
that case? What kinds of value do they refer to? Mostly, given their examples, the “we” is the
modelers and entrepreneurs making money from getting people to buy stuff, and the “value”
translates into something like increased efficiency through automation.
If we want to think bigger, if we want our “we” to refer to people in general, we’ll be swimming
against the tide.

What Is Datafication?
The word “datafication” does not yet have a definition, or rather it is not yet a word that has found a place in a dictionary. And yet it is a word we are hearing a lot these days. Simply put, it means this: from our actions to our thoughts, everything is being transformed into a numerically quantified format, or “data”.
From sports to finance and from entertainment to healthcare, everything around us is being converted into data. For example, we create data every time we talk on the phone, send an SMS, tweet, email, use Facebook, watch a video, withdraw money from an ATM, use a credit card, or even walk past a security camera. The notion is different from digitization; in fact, datafication is far broader than digitization. This astronomical amount of data holds information about our identity and our behaviour.

Datafication is helping us to understand the world in a way that was never possible before. New technologies are now available to ingest, store, process, and visualise that data, and organizations are using them to gain benefits. For example, marketers are analysing Facebook and Twitter data to determine and predict sales. Companies of all sectors and sizes have started to realize the big benefits of data and its analytics, and are beginning to improve their capabilities to collect and analyse data. Bernard Marr gives us one example to better understand how businesses use data:
Summarizing: datafication is a technological trend that turns many aspects of our lives into computerized data, transforming organizations into data-driven enterprises by converting this information into new forms of value.
Datafication refers to the fact that the daily interactions of living things can be rendered into a data format and put to social use.

Examples:

Let's take social platforms, Facebook or Instagram, for example: they collect and monitor data about our friendships in order to market products and services to us and to provide surveillance services to agencies, which in turn changes our behaviour; the promotions we see daily on social media are also the result of this monitored data. In this model, data is used to redefine how content is created, with datafication being used to inform content rather than just recommendation systems.

However, there are other industries where the datafication process is actively used:
▪ Insurance: Data used to update risk profiles and develop business models.
▪ Banking: Data used to establish trustworthiness and the likelihood of a person paying back a loan.
▪ Human resources: Data used to identify, e.g., employees' risk-taking profiles.
▪ Hiring and recruitment: Data used to replace personality tests.
▪ Social science research: Datafication replaces sampling techniques and restructures the manner in which social science research is performed.

Netflix Case:

Netflix, an internet streaming media provider, is a clear example of the datafication process.
It provides services in more than 40 countries and has 33 million streaming members. Originally,
its operations were more physical in nature, with its core business in mail-order disc rental
(DVD and Blu-ray). Simply put, the operating model was that the subscriber creates and
maintains a queue (an ordered list) of the media content they want to rent.
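The queue-based operating model described above can be sketched as a simple ordered list. The class and method names below are purely illustrative (they are not Netflix's actual system or API):

```python
from collections import deque

class RentalQueue:
    """Illustrative model of a subscriber's ordered rental queue.
    All names here are hypothetical, not Netflix's real system."""

    def __init__(self):
        self._queue = deque()

    def add(self, title):
        # The subscriber appends a title to the end of their queue
        self._queue.append(title)

    def move_to_top(self, title):
        # The subscriber reorders the queue by promoting a title
        self._queue.remove(title)
        self._queue.appendleft(title)

    def ship_next(self):
        # The service mails out the disc at the head of the queue
        return self._queue.popleft() if self._queue else None

q = RentalQueue()
q.add("The Matrix")
q.add("Amelie")
q.move_to_top("Amelie")
print(q.ship_next())  # Amelie
```

The point of the sketch is simply that, before streaming, the subscriber-maintained ordered list was the central data structure of the business.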
The Current Landscape (with a Little History)
So, what is data science? Is it new, or is it just statistics or analytics rebranded? Is it real, or is it
pure hype? And if it’s new and if it’s real, what does that mean?

This is an ongoing discussion, but one way to understand what’s going on in this industry is to
look online and see what current discussions are taking place. This doesn’t necessarily tell us
what data science is, but it at least tells us what other people think it is, or how they’re
perceiving it. For example, on Quora there’s a discussion from 2010 about “What is Data
Science?” and here’s Metamarkets CEO Mike Driscoll’s answer:

Data science, as it’s practiced, is a blend of Red-Bull-fueled hacking and espresso-inspired
statistics.

But data science is not merely hacking—because when hackers finish debugging their Bash
one-liners and Pig scripts, few of them care about non-Euclidean distance metrics.

And data science is not merely statistics, because when statisticians finish theorizing the perfect
model, few could read a tab-delimited file into R if their job depended on it.

Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools
and materials, coupled with a theoretical understanding of what’s possible.

Driscoll then refers to Drew Conway’s Venn diagram of data science from 2010, shown in
Figure 1-1.
Figure 1-1. Drew Conway’s Venn diagram of data science

He also mentions the sexy skills of data geeks from Nathan Yau’s 2009 post, “Rise of the Data
Scientist”, which include:

• Statistics (traditional analysis you’re used to thinking about)
• Data munging (parsing, scraping, and formatting data)
• Visualization (graphs, tools, etc.)

But wait, is data science just a bag of tricks? Or is it the logical extension of other fields like
statistics and machine learning?

For one argument, see Cosma Shalizi’s posts here and here, and Cathy’s posts here and here,
which constitute an ongoing discussion of the difference between a statistician and a data
scientist. Cosma basically argues that any statistics department worth its salt does all the stuff in
the descriptions of data science that he sees, and therefore data science is just a rebranding and
unwelcome takeover of statistics.
Differences between Big Data and Data Science:

1. Data Science is an area of study.
   Big Data is a technique to collect, maintain, and process huge amounts of information.

2. Data Science is about the collection, processing, analysis, and use of data in various
   operations; it is more conceptual.
   Big Data is about extracting vital and valuable information from a huge amount of data.

3. Data Science is a field of study, just like Computer Science, Applied Statistics, or
   Applied Mathematics.
   Big Data is a technique for tracking and discovering trends in complex data sets.

4. In Data Science, the goal is to build data-dominant products for a venture.
   In Big Data, the goal is to make data more vital and usable, i.e. to extract only the
   important information from huge data within existing traditional systems.

5. Tools mainly used in Data Science include SAS, R, Python, etc.
   Tools mostly used in Big Data include Hadoop, Spark, Flink, etc.

6. Data Science is a superset of Big Data, as it consists of data scraping, cleaning,
   visualization, statistics, and much more.
   Big Data is a subset of Data Science, as its mining activities form one stage of the
   data science pipeline.

7. Data Science is mainly used for scientific purposes.
   Big Data is mainly used for business purposes and customer satisfaction.

8. Data Science broadly focuses on the science of the data.
   Big Data is more involved with the processes of handling voluminous data.
Probability Distribution:
A probability distribution of a random variable is a list of all possible outcomes with
their corresponding probability values.

Note: The value of a probability always lies between 0 and 1.


What is an example of a Probability Distribution?

Let’s understand the probability distribution with an example:

When two six-sided dice are rolled, let a possible outcome be denoted by (a, b), where a is
the number on top of the first die and b is the number on top of the second die.

Then the possible sums a + b are:

Sum (a + b)    Outcomes (a, b)
2              (1,1)
3              (1,2), (2,1)
4              (1,3), (2,2), (3,1)
5              (1,4), (2,3), (3,2), (4,1)
6              (1,5), (2,4), (3,3), (4,2), (5,1)
7              (1,6), (2,5), (3,4), (4,3), (5,2), (6,1)
8              (2,6), (3,5), (4,4), (5,3), (6,2)
9              (3,6), (4,5), (5,4), (6,3)
10             (4,6), (5,5), (6,4)
11             (5,6), (6,5)
12             (6,6)
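The two-dice distribution can also be computed programmatically. The following sketch (plain Python, standard library only) enumerates all 36 equally likely outcomes and counts how many produce each sum:

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# Enumerate all 36 equally likely outcomes (a, b) of rolling two dice
outcomes = list(product(range(1, 7), repeat=2))

# Count how many outcomes give each sum a + b
counts = Counter(a + b for a, b in outcomes)

# The probability of each sum is its count divided by 36
distribution = {s: Fraction(c, len(outcomes)) for s, c in sorted(counts.items())}

for s, p in distribution.items():
    print(f"P(sum = {s}) = {p}")   # e.g. P(sum = 7) = 1/6

# As required of any probability distribution, the values sum to 1
assert sum(distribution.values()) == 1
```

Using `Fraction` keeps the probabilities exact (1/36, 2/36 = 1/18, ...) instead of approximating them as floats.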


• If a random variable is a discrete variable, its probability distribution is called a
discrete probability distribution.
• Example: Flipping two coins
• A function that represents a discrete probability distribution is known as a Probability
Mass Function (PMF).
• If a random variable is a continuous variable, its probability distribution is called a
continuous probability distribution.
• Example: Measuring temperature over a period of time
• A function that represents a continuous probability distribution is known as a Probability
Density Function (PDF).
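The PMF/PDF distinction above can be illustrated with a small sketch. The coin-flip PMF and the uniform density below are standard textbook examples, assuming fair coins and a uniform variable on [0, 1]:

```python
from fractions import Fraction

# Discrete case: number of heads in two fair coin flips.
# The PMF assigns a probability to each possible value directly.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
assert sum(pmf.values()) == 1  # probabilities must sum to 1

# Continuous case: a uniform random variable on [0, 1].
# The PDF gives a density, not a probability; probabilities come
# from the area under the curve (here f(x) = 1 on [0, 1]).
def uniform_pdf(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def uniform_prob(a, b):
    # P(a <= X <= b) for the uniform(0, 1) distribution
    a, b = max(a, 0.0), min(b, 1.0)
    return max(b - a, 0.0)

print(uniform_prob(0.25, 0.75))  # 0.5
```

Note the contrast: for the discrete variable we can ask "what is P(X = 1)?", while for the continuous variable P(X = x) is 0 for any single point, and only intervals have positive probability.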
