Datascience Notes

Data science deals with extracting knowledge and insights from large amounts of data using techniques like data preparation, statistics, predictive modeling, and machine learning. Its findings can be applied across sectors like healthcare, education, and travel. Machine learning is a subset of artificial intelligence that uses algorithms to analyze patterns in data and make data-driven decisions without human interaction. It allows companies to unlock value from corporate and customer data to gain insights and stay ahead of competitors. Common mistakes in data preparation include not having complete or unbiased historical data, building models with insufficient data, and not properly cleaning data by removing outliers and duplicates.

Uploaded by

PGNSeetha
Copyright
© All Rights Reserved

Why Data Science?

Data science is an interdisciplinary field concerned with the processes and systems used to extract knowledge or insights from large amounts of data.
By combining data preparation, statistics, predictive modeling and machine learning, data science tries to resolve many issues within individual sectors and the economy at large. It also helps organizations understand their customers in a personalized manner.
Its findings and results can be applied to almost any sector, including travel, healthcare and education.
Why Machine Learning?
A subset of artificial intelligence (AI), machine learning (ML) is the area of computational science that focuses on analyzing and interpreting patterns and structures in data to enable learning, reasoning, and decision making without human interaction. Simply put, machine learning allows the user to feed a computer algorithm an immense amount of data and have the computer analyze it and make data-driven recommendations and decisions based only on the input data. If any corrections are identified, the algorithm can incorporate that information to improve its future decision making.
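As a rough illustration of "incorporating corrections", the sketch below uses a perceptron, one of the simplest learning algorithms. The notes do not name a specific algorithm, so this choice, the toy data, and the function names `predict` and `train` are assumptions made for illustration only.

```python
# Minimal perceptron sketch: it learns a yes/no decision rule from labelled
# examples and adjusts its weights each time a prediction turns out wrong.

def predict(weights, bias, features):
    """Return 1 if the weighted sum of the features is positive, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0

def train(examples, epochs=20, learning_rate=0.1):
    """examples: list of (features, label) pairs with label in {0, 1}."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, bias, features)  # 0 when correct
            if error != 0:
                # A correction was identified: fold it back into the model.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
    return weights, bias

# Hypothetical, already-centred toy data: two features per example.
examples = [
    ([-2.0, -1.0], 0),
    ([-1.0, -2.0], 0),
    ([1.0, 2.0], 1),
    ([2.0, 1.0], 1),
]
weights, bias = train(examples)
predictions = [predict(weights, bias, features) for features, _ in examples]
```

After training, `predictions` can be compared with the labels to see whether the rule has absorbed the corrections. Real systems use far more data and more capable models, but the feedback loop is the same idea.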
Data is the lifeblood of all business. Data-driven decisions increasingly make the difference between keeping up with the competition and falling further behind. Machine learning can be the key to unlocking the value of corporate and customer data and enacting decisions that keep a company ahead of the competition.

• The heavily hyped, self-driving Google car? The essence of machine learning.
• Online recommendation offers such as those from Amazon and Netflix? Machine learning applications for everyday life.
• Knowing what customers are saying about you on Twitter? Machine learning combined with linguistic rule creation.
• Fraud detection? One of the more obvious, important uses in our world today.

Data Preparation Process

Step 1: Selection

• What is the extent of the data you have available?
• What data is not available that you wish you had available?
• What data don't you need to address the problem?
Step 2: Preprocess Data

• Formatting
• Cleaning
• Sampling
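The three preprocessing activities can be sketched in plain Python as follows. The records, field names, accepted date layouts, and helper names are all hypothetical. Dropping incomplete rows is only one possible cleaning policy, and in some cases repairing the values is the better choice.

```python
import random
from datetime import datetime

# Hypothetical raw records; the field names are illustrative only.
raw_rows = [
    {"customer": " Alice ", "signup": "2021-03-01", "spend": "120.50"},
    {"customer": "bob", "signup": "01/04/2021", "spend": "80"},
    {"customer": "Carol", "signup": "2021-05-12", "spend": ""},
    {"customer": "dave", "signup": "2021-06-30", "spend": "45.25"},
]

def parse_date(text):
    """Formatting: accept two assumed date layouts and return a date object."""
    for layout in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, layout).date()
        except ValueError:
            continue
    return None

def format_row(row):
    """Formatting: normalise names, dates and numeric types."""
    spend = row["spend"].strip()
    return {
        "customer": row["customer"].strip().title(),
        "signup": parse_date(row["signup"].strip()),
        "spend": float(spend) if spend else None,
    }

def clean(rows):
    """Cleaning: keep only rows in which every field has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def sample(rows, k, seed=0):
    """Sampling: draw a reproducible random subset of at most k rows."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))

formatted = [format_row(r) for r in raw_rows]
cleaned = clean(formatted)
subset = sample(cleaned, k=2)
```

Fixing the random seed keeps the sample reproducible, which helps when a later modeling result needs to be traced back to the exact rows it was built on.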
Step 3: Transform Data

• Scaling
• Decomposition
• Aggregation
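A minimal sketch of the three transformations, again in plain Python. The notes do not say which scaling or aggregation method they have in mind, so min-max scaling and a per-group sum are assumed here, and the records and helper names are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical cleaned records; field names are illustrative only.
rows = [
    {"region": "north", "day": date(2021, 3, 1), "spend": 100.0},
    {"region": "north", "day": date(2021, 3, 15), "spend": 300.0},
    {"region": "south", "day": date(2021, 4, 2), "spend": 200.0},
]

def min_max_scale(values):
    """Scaling: map values linearly onto the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # all values equal, nothing to spread
    return [(v - low) / (high - low) for v in values]

def decompose_date(d):
    """Decomposition: split a date into simpler component features."""
    return {"year": d.year, "month": d.month, "weekday": d.weekday()}

def total_by(rows, key, value):
    """Aggregation: sum `value` over each distinct `key`."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[key]] += r[value]
    return dict(totals)

scaled = min_max_scale([r["spend"] for r in rows])
parts = [decompose_date(r["day"]) for r in rows]
spend_by_region = total_by(rows, "region", "spend")
```

Which transformation helps depends on the model: scaling matters most for methods sensitive to feature magnitude, while decomposition and aggregation mainly expose structure that a raw field hides.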
Common Mistakes in the Data Cleaning Process

• Historical data not available accurately: This is a common system constraint in organizations where there is no data warehousing in place, or where base systems overwrite data and thereby erase historical information.
• Data collected only for positive outcomes
• Absence of a non-biased data set
• Including data from a period that is no longer valid
• Using variables that can change because of changes in customer behavior
• Building a model on thin data
• Not removing outliers
• Not removing duplicates
• Not treating zero, null and special values carefully
• Adding an ID as a variable
• Not being hypothesis-driven when creating calculated or transformed variables
• Not spending enough time thinking about transformations
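Several of the mistakes above (duplicates, null and special values, outliers, and an ID used as a variable) can be guarded against with a few small checks. The sketch below is one possible approach, not a prescribed method. The records, the `-999` placeholder code, the helper names, and the 1.5 x IQR outlier rule are assumptions. Whether a zero is a real measurement or a missing value still has to be decided field by field.

```python
import statistics

# Hypothetical records; -999 is an assumed placeholder for "unknown".
rows = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": 41, "income": 61000},
    {"id": 2, "age": 41, "income": 61000},    # exact duplicate
    {"id": 3, "age": None, "income": 58000},  # null value
    {"id": 4, "age": 29, "income": -999},     # special value
    {"id": 5, "age": 38, "income": 55000},
    {"id": 6, "age": 45, "income": 900000},   # likely outlier
    {"id": 7, "age": 31, "income": 49000},
    {"id": 8, "age": 52, "income": 63000},
    {"id": 9, "age": 27, "income": 58000},
]
SPECIAL_VALUES = {-999}

def drop_duplicates(rows):
    """Keep the first occurrence of each fully identical record."""
    seen, unique = set(), []
    for r in rows:
        key = tuple(r[k] for k in sorted(r))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def drop_incomplete(rows):
    """Drop rows holding a null or an assumed placeholder code."""
    return [r for r in rows
            if all(v is not None and v not in SPECIAL_VALUES
                   for v in r.values())]

def iqr_bounds(values, k=1.5):
    """Outlier fences from the medians of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[-half:])
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread

unique = drop_duplicates(rows)
complete = drop_incomplete(unique)
low, high = iqr_bounds([r["income"] for r in complete])
kept = [r for r in complete if low <= r["income"] <= high]

# The ID identifies a record; it is not a predictive variable.
features = [{k: v for k, v in r.items() if k != "id"} for r in kept]
```

Dropping rows is the bluntest option. With thin data, imputing missing values or capping outliers may preserve more information, at the cost of extra assumptions.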