Data Science Notes
An interdisciplinary field, data science deals with the processes and systems used to
extract knowledge or insights from large amounts of data.
By using data preparation, statistics, predictive modeling and machine learning, data science
tries to resolve many issues within individual sectors and the economy at large.
For example, it can help businesses understand customers in a personalized manner.
The findings can be applied to almost any sector, including travel, healthcare, and
education.
Why machine learning?
A subset of artificial intelligence (AI), machine learning (ML) is the area of computational
science that focuses on analyzing and interpreting patterns and structures in data to enable
learning, reasoning, and decision making without human intervention. Simply put, machine
learning allows the user to feed a computer algorithm a large amount of data and have
the computer analyze it and make data-driven recommendations and decisions based only on the
input data. If any corrections are identified, the algorithm can incorporate that information to
improve its future decision making.
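As a rough illustration of "incorporating corrections", the sketch below uses a perceptron, one of the simplest learning algorithms (chosen here for illustration; the notes do not name a specific algorithm). The weights change only when a prediction turns out to be wrong. The toy fraud data and the names predict and train are made up for this example.

```python
def predict(weights, bias, features):
    """Return 1 if the weighted sum of the features is positive, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


def train(examples, epochs=20, learning_rate=0.1):
    """Fit a perceptron: weights are adjusted only after a wrong prediction."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, bias, features)  # 0 when correct
            if error != 0:
                # The correction nudges the weights toward the right answer.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
    return weights, bias


# Hypothetical toy data: (amount in thousands, foreign-transaction flag) -> is_fraud
examples = [
    ((0.1, 0), 0), ((0.3, 0), 0), ((0.2, 1), 0),
    ((2.5, 1), 1), ((3.0, 1), 1), ((2.8, 0), 1),
]
weights, bias = train(examples)
predictions = [predict(weights, bias, features) for features, _ in examples]
```

After training, predictions matches the labels for this toy set. Real data is rarely this cleanly separable, so a practical model would need more data and a held-out test set.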
Data is the lifeblood of all business. Data-driven decisions increasingly make the difference
between keeping up with the competition and falling further behind. Machine learning can be the
key to unlocking the value of corporate and customer data and enacting decisions that keep a
company ahead of the competition.
The heavily hyped, self-driving Google car? The essence of machine learning.
Online recommendation offers such as those from Amazon and Netflix? Machine learning
applications for everyday life.
Knowing what customers are saying about you on Twitter? Machine learning combined with
linguistic rule creation.
Fraud detection? One of the more obvious, important uses in our world today.
Step-1-Selection
Step-2-Preprocess Data
Formatting
Cleaning
Sampling
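A minimal sketch of formatting, cleaning, and sampling, using only the Python standard library. The CSV content, the column names, and the helper names (format_rows, clean_rows, sample_rows) are invented for illustration. Dropping invalid rows is only one possible cleaning policy; imputing values may suit other data better.

```python
import csv
import io
import random

# Hypothetical raw export; columns and values are made up for this example.
RAW = """customer_id,age,city
1,34,Pune
2,,Delhi
3,forty,Mumbai
4,29,Pune
5,51,Chennai
6,45,Delhi
"""


def format_rows(text):
    """Formatting: parse CSV text into a list of dicts, one per record."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_rows(rows):
    """Cleaning: keep rows whose age is a valid integer and convert it."""
    cleaned = []
    for row in rows:
        age = row["age"].strip()
        if age.isdigit():
            cleaned.append({**row, "age": int(age)})
    return cleaned


def sample_rows(rows, k, seed=0):
    """Sampling: draw up to k rows without replacement, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


rows = format_rows(RAW)              # 6 records, every field still a string
cleaned = clean_rows(rows)           # rows with a blank or non-numeric age removed
sampled = sample_rows(cleaned, k=3)  # a smaller working set for fast experiments
```

Fixing the seed keeps the sample reproducible, which makes later results easier to compare.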
Step-3-Transform Data
Scaling
Decomposition
Aggregation
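The three transforms can be sketched as follows, under some assumptions. Scaling is shown as min-max scaling, decomposition as splitting a timestamp into simpler parts, and aggregation as summing purchases per customer. The notes do not specify which variants are meant, and the records and function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical purchase records: (customer, ISO timestamp, amount).
records = [
    ("alice", "2023-01-05T09:30:00", 20.0),
    ("alice", "2023-01-07T18:45:00", 60.0),
    ("bob", "2023-01-06T13:00:00", 40.0),
]


def min_max_scale(values):
    """Scaling: map values linearly onto the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column, nothing to spread out
    return [(v - low) / (high - low) for v in values]


def decompose_timestamp(stamp):
    """Decomposition: split a timestamp into simpler component features."""
    moment = datetime.fromisoformat(stamp)
    return {"weekday": moment.weekday(), "hour": moment.hour}


def total_per_customer(rows):
    """Aggregation: collapse many purchase rows into one total per customer."""
    totals = defaultdict(float)
    for customer, _, amount in rows:
        totals[customer] += amount
    return dict(totals)


scaled = min_max_scale([amount for _, _, amount in records])
parts = [decompose_timestamp(stamp) for _, stamp, _ in records]
totals = total_per_customer(records)
```

Which transform helps depends on the algorithm: distance-based methods usually benefit from scaling, while aggregation mainly matters when the raw data has many rows per entity.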
Common mistakes in the data cleaning process
Historical data not available accurately: This is a common system constraint in organizations
where there is no warehousing in place, or where base systems overwrite data, thereby
erasing historical information.
Data collected only for positive outcomes
Absence of an unbiased data set
Including data from a period which is no longer valid
Variables which can change because of change in customer behavior
Building model on thin data
Not removing outliers
Not removing duplicates
Not treating zero, null and special values carefully
Adding ID as a variable
Not being hypothesis driven in creating calculated / transformed variables
Not spending enough time thinking about transformations
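Several of the mistakes above (duplicates, null and special values, outliers, ID as a variable) can be guarded against with a few small checks. The sketch below is illustrative only. The records, the -999 sentinel, and the cutoff of 5 median absolute deviations are assumptions rather than rules from these notes, and dropping rows is just one option (imputation is another).

```python
import statistics

SENTINEL = -999.0  # assumed "missing" marker in this hypothetical data

raw = [
    {"id": 1, "income": 52000.0, "visits": 4},
    {"id": 2, "income": 48000.0, "visits": 0},
    {"id": 2, "income": 48000.0, "visits": 0},    # exact duplicate
    {"id": 3, "income": SENTINEL, "visits": 2},   # special value, not a real income
    {"id": 4, "income": None, "visits": 3},       # null
    {"id": 5, "income": 51000.0, "visits": 5},
    {"id": 6, "income": 9000000.0, "visits": 1},  # likely outlier
    {"id": 7, "income": 50000.0, "visits": 2},
]


def drop_duplicates(rows):
    """Keep the first occurrence of each identical record."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def drop_missing_income(rows):
    """Treat None and the sentinel as missing; genuine zeros elsewhere are kept."""
    return [r for r in rows
            if r["income"] is not None and r["income"] != SENTINEL]


def drop_outliers(rows, field, cutoff=5.0):
    """Remove rows farther than cutoff * MAD from the median of the field."""
    values = [r[field] for r in rows]
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return list(rows)  # no spread to measure against
    return [r for r in rows if abs(r[field] - median) <= cutoff * mad]


def to_features(rows):
    """Drop the ID: it identifies a record but should not be a model input."""
    return [{k: v for k, v in r.items() if k != "id"} for r in rows]


deduped = drop_duplicates(raw)
complete = drop_missing_income(deduped)
typical = drop_outliers(complete, "income")
features = to_features(typical)
```

Note that the record with zero visits survives: zero is a real value here, while -999 and None are not, which is the distinction the "zero, null and special values" item asks for.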