Fundamentals of Data Science
Dr. C.Deepa
AI&DS
KIT-Kalaignarkarunanidhi Institute of Technology
Course Objectives
1. To introduce the basic concepts of Data Science.
2. To understand the mathematical skills used in statistics.
3. To acquire skills in data pre-processing steps.
4. To learn the concepts of feature selection
algorithms in machine learning.
5. To learn the concept of clustering approaches and
to visualize the processed data using visualization
techniques.
Unit I
INTRODUCTION
• Need for Data Science – Benefits and uses –
Facets of data – Types of data – Organization of
data – Data Science process – Data Science life
cycle – Role of Data Science – Big Data – Sources
and characteristics of Big Data
Introduction
• Data science is the field of study that combines domain expertise,
programming skills, and knowledge of mathematics and statistics to
extract meaningful insights from data.
• Data Science is a blend of various tools, algorithms, and machine
learning principles with the goal of discovering hidden patterns in
raw data.
• Data science is the application of computational and statistical
techniques to address or gain insight into some problem in the real
world.
• Data Analyst
• Data engineer
• Data Architect
• Data Administrator
• Business Analyst
1. Discovery:
• The first phase is discovery, which involves asking the right
questions.
• When you start any data science project, you need to
determine the basic requirements, priorities, and
project budget.
• In this phase, we determine all the requirements of the
project, such as the number of people, technology, time, data,
and the end goal. We can then frame the business problem at a
first hypothesis level.
2. Data preparation:
• Data cleaning
• Data Reduction
• Data integration
• Data transformation
• After performing all the above tasks, we can readily use this data in
later phases.
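The four preparation tasks listed above can be sketched in plain Python. This is a minimal illustration, not a prescribed method: the function names (`clean`, `integrate`, `reduce_fields`, `min_max_scale`) and the sample records are hypothetical, and a real project would more likely use a data-frame library.

```python
def clean(rows):
    """Data cleaning: drop records that contain a missing (None) value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def integrate(left, right, key):
    """Data integration: join two record lists on a shared key."""
    lookup = {r[key]: r for r in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


def reduce_fields(rows, keep):
    """Data reduction: keep only the fields needed for modelling."""
    return [{k: r[k] for k in keep} for r in rows]


def min_max_scale(rows, field):
    """Data transformation: rescale one numeric field to the range [0, 1]."""
    values = [r[field] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    return [{**r, field: (r[field] - lo) / span} for r in rows]


# Hypothetical sample data (assumed for illustration only)
customers = [
    {"id": 1, "age": 20, "city": "Chennai"},
    {"id": 2, "age": None, "city": "Coimbatore"},
    {"id": 3, "age": 40, "city": "Madurai"},
]
orders = [{"id": 1, "spend": 100.0}, {"id": 3, "spend": 300.0}]

cleaned = clean(customers)                     # record 2 is dropped
merged = integrate(cleaned, orders, "id")      # spend is attached by id
reduced = reduce_fields(merged, ["id", "age", "spend"])
prepared = min_max_scale(reduced, "spend")
print(prepared)
```

The order of the steps here is one reasonable choice; in practice the tasks are often repeated and interleaved.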
3. Model Planning:
In this phase, the process of model building starts. We create datasets for training
and testing purposes, and apply techniques such as association,
classification, and clustering to build the model.
• WEKA
• SPSS Modeler
• MATLAB
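Creating the training and testing datasets mentioned above can be sketched as a simple random split. The helper name `train_test_split`, the 25% test fraction, and the seed are assumptions chosen for illustration; the tools listed above provide their own equivalents.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(20))  # stand-in for 20 labelled examples
train, test = train_test_split(data, test_fraction=0.25, seed=42)
print(len(train), len(test))  # 15 5
```

The model is fitted on `train` only, and `test` is held back to estimate how well it generalises to unseen data.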
5. Operationalize:
In this phase, we deliver the final reports of the project, along with briefings,
code, and technical documents. This phase gives you a clear overview of the complete
project's performance and other components on a small scale before full
deployment.
6. Communicate results:
In this phase, we check whether we have reached the goal that we set in the initial
phase. We communicate the findings and final results to the business team.
Applications of Data Science:
• Image recognition and speech recognition
– Ok Google, Siri, Cortana
• Gaming world
– EA Sports, Sony, Nintendo
• Internet search
– Google, Yahoo, Bing
• Transport
– self-driving cars.
• Healthcare
– tumor detection, drug discovery, medical image analysis,
virtual medical bots
• Recommendation systems
– suggestions for similar products
• Risk detection
– detecting fraud and the risk of losses
Benefits and uses of Data Science
• Improves Business Predictions
• Business Intelligence
• Helps in Sales & Marketing
• Complex Data Interpretation
• Helps in Making Decisions
• Automating Recruitment Processes
Facets of Data
A project charter requires teamwork, and your input covers at least the
following:
❖ A clear research goal
❖ A timeline
2. Retrieving data
❖ Start with data stored within the company
❖ Do data quality checks now to prevent problems later
3. Data Preparation
Step 3: Cleansing, integrating, and transforming data
1. Cleansing data
❖ Data cleansing is a subprocess of the data science process that focuses
on removing errors from your data.
❖ REDUNDANT WHITESPACE
❖ OUTLIERS
❖ APPENDING TABLES
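The three operations named above can be sketched with the Python standard library. The function names and sample values are hypothetical, and the 1.5 × IQR rule used to flag outliers is one common rule of thumb, not the only option.

```python
import statistics


def strip_whitespace(rows):
    """Remove leading and trailing whitespace from every string value."""
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        for row in rows
    ]


def flag_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


def append_tables(*tables):
    """Append tables that share the same columns into one list of rows."""
    combined = []
    for table in tables:
        combined.extend(table)
    return combined


# Hypothetical sample data (assumed for illustration only)
stripped = strip_whitespace([{"city": "  Chennai ", "sales": 10}])
outliers = flag_outliers([10, 12, 11, 13, 12, 11, 95])
january = [{"city": "Chennai", "sales": 10}]
february = [{"city": "Madurai", "sales": 12}]
both_months = append_tables(january, february)
print(stripped, outliers, len(both_months))
```

A flagged outlier is only a candidate for review: it may be a data-entry error or a genuine extreme value.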
4. Transforming data
❖ REDUCING THE NUMBER OF VARIABLES
➢ Having too many variables in your model makes the
model difficult to handle, and certain techniques don’t
perform well when you overload them with too many
input variables.
❖ TURNING VARIABLES INTO DUMMIES
➢ Variables can be turned into dummy variables, which take only two
values: true (1) or false (0).
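Turning a categorical variable into dummies can be sketched as follows. The helper `to_dummies` and the sample `day` column are hypothetical; the idea is simply one 0/1 indicator column per category.

```python
def to_dummies(rows, field):
    """Replace a categorical field with one 0/1 indicator column per category."""
    categories = sorted({row[field] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != field}
        for cat in categories:
            new_row[f"{field}_{cat}"] = 1 if row[field] == cat else 0
        encoded.append(new_row)
    return encoded


# Hypothetical sample data (assumed for illustration only)
rows = [
    {"id": 1, "day": "Mon"},
    {"id": 2, "day": "Tue"},
    {"id": 3, "day": "Mon"},
]
dummies = to_dummies(rows, "day")
print(dummies[0])  # {'id': 1, 'day_Mon': 1, 'day_Tue': 0}
```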
This data is either found within the company or retrieved from a third party.
❖ Data preparation: checking and remediating data errors, enriching the data
with data from other data sources, and transforming it into a suitable format
for your models.
❖ Data exploration: diving deeper into your data using descriptive statistics.
❖ Presentation and automation: presenting your results
and industrializing your analysis process for repetitive reuse and integration
with other tools.
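The descriptive statistics used during data exploration can be computed, for instance, with Python's standard `statistics` module. The `describe` helper and the sample ages below are assumptions made for illustration.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


ages = [22, 25, 25, 30, 38]  # hypothetical sample
summary = describe(ages)
print(summary)
```

Here the mean (28) is higher than the median (25), which hints that the larger values pull the distribution to the right.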
Big data
❖ Big data refers to larger, more complex data sets, especially from new data
sources.
❖ Data that is very large in size is called big data.
❖ Big data contains greater variety, arriving in increasing
volumes and with more velocity. These are known as the three Vs:
➢ Volume: processing high volumes of low-density, unstructured data
➢ Velocity: the fast rate at which data is received and (perhaps) acted on
➢ Variety: the various types of data
➢ Big data makes it possible for you to gain more complete answers
huge amount of logs from which users' buying trends can be traced.
❖ Weather Station: All weather stations and satellites give very
user trends and publish their plans accordingly, and for this they
store the data of their millions of users.
❖ Share Market: Stock exchanges across the world generate huge
4. Value
❖ Value is an essential characteristic of big data.
❖ It does not refer to just any data that we process or store.
❖ It refers to valuable and reliable data that we store, process, and
also analyze.
Characteristics of Big Data
5. Velocity
❖ Velocity plays an important role compared to the other characteristics.
❖ Velocity refers to the speed at which data is created in
real time.
❖ It covers the speed of incoming data sets, their rate of
change, and activity bursts.
❖ A primary aspect of big data is to provide the demanded data
rapidly.
❖ Big data velocity deals with the speed at which data flows from
sources such as application logs, business processes, networks,
social media sites, sensors, and mobile devices.
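One simple way to make velocity concrete is to count how many events arrive per second and to flag the seconds where the count spikes. The sketch below assumes arrival times given in seconds; the function names, the sample timestamps, and the burst threshold are all hypothetical.

```python
from collections import Counter


def events_per_second(timestamps):
    """Count how many events arrived in each one-second window."""
    return Counter(int(t) for t in timestamps)


def find_bursts(timestamps, threshold):
    """Return the seconds in which the arrival count exceeded the threshold."""
    counts = events_per_second(timestamps)
    return sorted(sec for sec, n in counts.items() if n > threshold)


# Hypothetical arrival times, in seconds since the start of a log
arrivals = [0.1, 0.7, 1.2, 2.0, 2.1, 2.3, 2.4, 2.9, 3.5]
rates = events_per_second(arrivals)
bursts = find_bursts(arrivals, threshold=3)
print(dict(rates), bursts)
```

In this sample, second 2 receives five events and is reported as a burst, while the other seconds stay at or below two events.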
Applications of big data