Unit 1 Part 1
Data Science
Data Science is: Popular
● Lots of Data => Lots of Analysis => Lots of Jobs
● Universities: Starting new multidisciplinary programs
● Industry: A cottage industry of online and in-person training
courses is evolving
Data is: Big!
● 2.5 quintillion (2.5 × 10^18) bytes of data are generated every day!
● Everything around you collects/generates data
– Social media sites
– Business transactions
– Location-based data
– Sensors
– Digital photos, videos
– Consumer behaviour (online and store transactions)
● More data is publicly available
● Database technology is advancing
● Cloud based & mobile applications are widespread
If I have data, I will know:
● Everyone wants better predictability, forecasting, customer
satisfaction, market differentiation, prevention, great user
experience, ...
● How can I price a particular product?
● What can I recommend that online customers buy after they have bought
X, Y or Z?
● How can we discover market segments and group customers into
them?
● What will customers buy in the upcoming holiday season? (What to
stock?)
● What is the price point for customer retention for subscriptions?
Data Science is: making sense of Data
● Data science is an interdisciplinary field that uses scientific methods,
processes, algorithms and systems to extract knowledge and insights
from data in various forms, both structured and unstructured, similar to
data mining. ~Wikipedia
● Data science is the field of study that combines domain expertise,
programming skills, and knowledge of math and statistics to extract
meaningful insights from data.
● Data science practitioners apply machine learning algorithms to
numbers, text, images, video, audio, and more to produce artificial
intelligence (AI) systems that perform tasks which ordinarily require
human intelligence.
● In turn, these systems generate insights that analysts and business
users translate into tangible business value.
AI, ML and DS
Data Science is: multidisciplinary
● Statisticians
● Mathematicians
● Computer Scientists in
– Data mining
– Artificial Intelligence & Machine Learning
– Systems Development and Integration
– Database development
– Analytics
● Domain Experts
– Medical experts
– Geneticists
– Finance, business and economics experts, etc.
Data science fields
● Statistics and Probability
● Python
● Machine Learning
● Data Processing
● Data Visualization
● Data Mining
● Predictive Analytics
● Big Data
● Modelling
● Data Consultancy, etc.
Data science impact
● Empowering management and officers to make better decisions
● Directing actions based on trends, which in turn helps to define goals
● Challenging the staff to adopt best practices and focus on issues that
matter
● Identifying opportunities
● Decision making with quantifiable, data-driven evidence
● Testing these decisions
● Identification and refining of target audiences
● Recruiting the right talent for the organization
● And much more
Data Team
Data Acquisition Stage
● At this stage, one must assess:
– What type of data is available
– What might be required but is not currently collected
– Is it available from other units of the company?
– Is it necessary to crawl or buy data from third parties?
– How much data is needed? (Data volume)
– How to access the data?
– Is the data private?
– Is it legally OK to use the data?
● Data may not exist
● Sources of data may be public or private
● Not all sources of data may be suitable for processing
● Data are often incomplete and dirty
● Data consolidation and cleanup are essential
Data Cleaning
● Data are often incomplete or incorrect.
– Typos: e.g., text data in numeric fields
– Missing values: some fields may not be collected for some of the
examples
– Impossible data combinations: e.g., gender = MALE, pregnant =
TRUE
– Out-of-range values: e.g., age = 1000
● Garbage in, garbage out
● Scripting, Visualization
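The four kinds of dirty data listed above can be checked with a short script. The following is only a minimal sketch in plain Python: the function name `find_problems`, the record layout, and the 0 to 120 age bounds are assumptions made for illustration, not part of the original slides.

```python
def find_problems(record):
    """Return a list of data-quality problems found in one record (a dict)."""
    problems = []

    # Missing values: the field was never collected for this example.
    for field in ("age", "gender", "pregnant"):
        if record.get(field) is None:
            problems.append(f"missing {field}")

    # Typos: text data in a numeric field.
    age = record.get("age")
    if age is not None:
        try:
            age = float(age)
        except (TypeError, ValueError):
            problems.append("age is not numeric")
            age = None

    # Out-of-range values; the 0-120 bounds are an assumption.
    if age is not None and not (0 <= age <= 120):
        problems.append("age out of range")

    # Impossible combinations, e.g. gender = MALE and pregnant = TRUE.
    if record.get("gender") == "MALE" and record.get("pregnant") is True:
        problems.append("impossible combination: male and pregnant")

    return problems


# Hypothetical sample records, one per kind of problem.
records = [
    {"age": 34, "gender": "FEMALE", "pregnant": False},
    {"age": "thirty", "gender": "MALE", "pregnant": False},
    {"age": 1000, "gender": "FEMALE", "pregnant": None},
    {"age": 41, "gender": "MALE", "pregnant": True},
]

for i, rec in enumerate(records):
    print(i, find_problems(rec))
```

Reporting problems rather than silently dropping rows keeps the decision about how to treat each one (fix, impute, or discard) with the analyst.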
Exploratory Data Analysis
● Univariate Analysis: Analyze/explore variables one by one
● Bivariate Analysis: Explore relationship between variables
● Coverage, missing values: treating unknown values
● Outliers: detect and treat values that are distant from other observations
● Feature Engineering: Variable transformations and creation of new better
variables from raw features
● Commonly used tools:
– SQL
– R: plyr, reshape, ggplot2, data.table
– Python: NumPy, Pandas, SciPy, matplotlib
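As a rough illustration of the univariate, bivariate, and outlier steps above, here is a sketch that uses only the Python standard library. In practice Pandas or NumPy would usually do this work; the function names, the sample values, and the 1.5 × IQR outlier rule are assumptions chosen for the example.

```python
import statistics


def summarize(values):
    """Univariate summary of a numeric variable, ignoring missing (None) values."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
    }


def pearson(xs, ys):
    """Bivariate analysis: Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


# Hypothetical sample data: one missing age, one extreme spend value.
ages = [23, 25, 31, 35, None, 41, 44, 52]
spend = [10, 12, 15, 18, 20, 22, 25, 400]

print(summarize(ages))
print(pearson([1, 2, 3], [2, 4, 6]))
print(iqr_outliers(spend))
```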
Feature Engineering
● Create new features from existing raw features: discretize, bin
● Transform Variables
● Create new categorical variables when a variable has too many levels,
levels that rarely occur, or one level that almost always occurs
● Handle extremely skewed data and outliers
● Imputation: Filling in missing data
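The transformations listed above can be sketched with a few small helper functions. This is one possible illustration in plain Python, not a prescribed method: the bin cut points, the `min_count` threshold, the `OTHER` label, and the choice of median imputation are all assumptions.

```python
import math
import statistics
from collections import Counter


def bin_age(age):
    """Discretize a numeric age into coarse bins (cut points are assumptions)."""
    if age < 18:
        return "child"
    if age < 65:
        return "adult"
    return "senior"


def log_transform(values):
    """Compress a right-skewed, non-negative variable with log(1 + x)."""
    return [math.log1p(v) for v in values]


def group_rare_levels(levels, min_count=2, other="OTHER"):
    """Merge categorical levels seen fewer than min_count times into one level."""
    counts = Counter(levels)
    return [lvl if counts[lvl] >= min_count else other for lvl in levels]


def impute_median(values):
    """Fill missing (None) entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


print([bin_age(a) for a in (10, 30, 70)])
print(group_rare_levels(["a", "a", "b", "c", "a"]))
print(impute_median([1, None, 3, 5]))
```

The median is used for imputation here because it is less sensitive than the mean to the outliers and skew mentioned above.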