
School of Engineering and Technology

Semester - I
Course Title: Fundamentals of Data Science
Course Code: 24BTELY107

Module 1_Introduction To Data Science

Faculty Name: Dr. K.Vijayan


Date: 11/07/2024
1.1 Introduction To Data Science

Definition—Big Data and Data Science Hype—Datafication—Data


Science Profile—Meta Data—Definition—Data Scientist—
Statistical Inference—Populations and Samples—Populations and
Samples of Big Data—Modelling-Data Warehouse—Philosophy of
Exploratory Data Analysis—The Data Science Process—A Data
Scientist’s Role in this Process Case Study: Real Direct—Housing
Market Analysis
1.1 Introduction: Definition
 Data science is the deep study of large amounts of data.
 It involves extracting meaningful insights from raw, structured, and unstructured data, processed using the scientific method, different technologies, and algorithms.
 Data science uses powerful hardware, programming systems, and efficient algorithms to solve data-related problems.
 It is central to the future of artificial intelligence.
 By using Data Science, companies are able to make:
 Better decisions (should we choose A or B?)
 Predictive analysis (what will happen next?)
 Pattern discoveries (find patterns, or maybe hidden information, in the data)
 Data science is the combination of: statistics, mathematics,
programming, and problem-solving; capturing data in ingenious ways;
the ability to look at things differently.
 Data science also involves the activity of cleansing, preparing, and aligning data, including the various techniques used to extract insights and information from data.
Need for Data Science
 Data Science is used in many industries in the world today, e.g.
banking, consultancy, healthcare, and manufacturing.
 For route planning: to discover the best routes to ship goods
 To foresee delays for flights, ships, trains, etc. (through predictive analysis)
 To create promotional offers
 To find the best-suited time to deliver goods
 To forecast the next year's revenue for a company
 To analyze the health benefits of training
Data Science can be applied in nearly every part of a business where
data is available. Examples are:
 Consumer goods
 Stock markets
 Industry

 Politics

 Logistic companies
 E-commerce
Data science is all about:

 Asking the correct questions and analyzing the raw data.

 Modeling the data using various complex and efficient algorithms.

 Visualizing the data to get a better perspective.

 Understanding the data to make better decisions and finding the final

result.
Data Science
Example:
 Suppose we want to travel from station A to station B by car.
 We need to make decisions, such as which route will get us to the destination fastest, which route will have the least traffic, and which will be the most cost-effective.
 All these decision factors act as input data, and we derive an appropriate answer from these decisions; this analysis of data is called data analysis, which is a part of data science.
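To make the idea concrete, here is a minimal sketch (not from the original slides) that treats each candidate route as a row of input data and scores it; the routes, times, costs, and weights are made-up illustrative values.

```python
# Toy sketch of the route example: each candidate route is a row of input
# data (decision factors), and we pick the best one by a simple score.
import pandas as pd

routes = pd.DataFrame({
    "route": ["R1", "R2", "R3"],
    "travel_time_min": [55, 40, 48],   # estimated travel time
    "traffic_delay_min": [15, 5, 20],  # expected delay from traffic jams
    "toll_cost": [0, 120, 60],         # cost in local currency
})

# Combine the factors into one score (lower is better); the weights are
# arbitrary and would come from the traveller's actual preferences.
routes["score"] = (routes["travel_time_min"]
                   + routes["traffic_delay_min"]
                   + 0.1 * routes["toll_cost"])

best = routes.loc[routes["score"].idxmin()]
print("Best route:", best["route"])
```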
1.2 Big Data
 Big data refers to significant volumes of data that cannot be processed effectively
with the traditional applications that are currently used.
 The processing of big data begins with raw data that isn’t aggregated and is most
often impossible to store in the memory of a single computer.
 Big data is used to analyze insights, which can lead to better decisions and
strategic business moves.
 Big data is a combination of structured, semi-structured and unstructured data that
organizations collect, analyze and mine for information and insights.
 Big data consists of high-volume, high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
 Companies use big data in their systems to improve operational efficiency,
provide better customer service, create personalized marketing campaigns
and take other actions that can increase revenue and profits.
 Businesses that use big data effectively hold a potential competitive
advantage over those that don't because they're able to make faster and
more informed business decisions.
 Medical researchers use big data to identify disease signs and risk factors.
 Doctors use it to help diagnose illnesses and medical conditions in
patients.
 In addition, a combination of data from electronic health records, social
media sites, the web and other sources gives healthcare organizations and
government agencies up-to-date information on infectious disease threats
and outbreaks.
 Big data helps oil and gas companies identify potential drilling locations
and monitor pipeline operations.
 Likewise, utilities use it to track electrical grids.
 Financial services firms use big data systems for risk management
and real-time analysis of market data.
 Manufacturers and transportation companies rely on big data to
manage their supply chains and optimize delivery routes.
 Government agencies use big data for emergency response, crime
prevention and smart city initiatives.
 Data collection is the process of acquiring, collecting, extracting, and
storing the voluminous amount of data which may be in the structured
or unstructured form like text, video, audio, XML files, records, or
other image files used in later stages of data analysis.
 In the process of big data analysis, “Data collection” is the initial step
before starting to analyse the patterns or useful information in data.
 The data which is to be analysed must be collected from different
valid sources.
 The actual data is then further divided mainly into two types known
as:

1. Primary data

2. Secondary data
Applications of Big Data
Big Data for Financial Services
 Credit card companies, retail banks, private wealth management advisories,
insurance firms, venture funds, and institutional investment banks all use big data for
their financial services.
 The common problem among them all is the massive amounts of multi-structured
data living in multiple disparate systems, which big data can solve.
Big data is used in several ways, including:
 Customer analytics
 Compliance analytics
 Fraud analytics
 Operational analytics
Types of Big Data
Structured Data
 This type of data is highly organized and easily searchable by basic algorithms. It is often stored in relational databases and can be represented in tables with rows and columns.
 Examples: SQL databases, spreadsheets, and data from customer relationship management (CRM) systems.


Characteristics of Structured Data
 Data conforms to a data model and has easily identifiable
structure
 Data is stored in the form of rows and columns
Example: Database
 Data is well organised, so the definition, format, and meaning of the data are explicitly known
 Data resides in fixed fields within a record or file
 Data elements are addressable, so they are efficient to analyse and process
Sources of Structured Data
 SQL Databases (SQL - Structured Query Language)
 Spreadsheets such as Excel
 Sensors such as GPS or RFID tags
 Network and Web server logs
 Medical devices
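As a small illustration, structured data can be held in a table whose fixed, addressable fields make querying efficient; the following pandas sketch uses invented column names and values.

```python
# Minimal sketch of structured data: rows and columns with fixed,
# addressable fields, as in a relational table.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "name": ["Asha", "Ravi", "Meena"],
    "city": ["Chennai", "Mumbai", "Delhi"],
})

# Because every element sits in a fixed field, it is efficient to query:
print(customers.loc[customers["city"] == "Mumbai", "name"])
```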
Unstructured Data:
 This type of data does not have a pre-defined data model or is not
organized in a pre-defined manner.
 It is more challenging to collect, process, and analyze.
Examples:
 Text documents
 Emails
 Social media posts
 Videos
 Images, and
 Sensor data
Characteristics of Unstructured Data
 Lack of Format: Unstructured data does not fit neatly into tables or databases. It can be textual or non-textual, making it difficult to categorize and organize.
 Variety: This type of data can include a wide range of formats, such as:
• Text documents (e.g., emails, reports, articles)
• Multimedia files (e.g., images, audio, video)
• Social media content (e.g., posts, comments)
• Web pages and blogs
 Volume: Unstructured data represents a significant portion of the data generated today. It is often larger in volume compared to structured data.
 Diverse Sources: It can originate from various sources, including user-generated content, sensor data, customer interactions, and more.
Sources of Unstructured Data:
 Web pages
 Images (JPEG, GIF, PNG, etc.)
 Videos
 Reports
 Word documents and PowerPoint presentations
 Surveys
Semi-structured Data:
 This type of data does not conform to the formal structure of data models but contains
tags or other markers to separate semantic elements.
 It is more flexible than structured data but easier to organize than unstructured data.
Examples:
 XML files - XML stands for eXtensible Markup Language. XML is a markup language
much like HTML. XML was designed to store and transport data.
 JSON documents - JavaScript Object Notation. JSON documents consist of fields, which
are name-value pair objects. JSON is often used when data is sent from a server to a
web page.
 NoSQL databases - NoSQL is a type of database management system (DBMS) that is
designed to handle and store large volumes of unstructured and semi-structured data.
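To make this concrete, the sketch below parses a small JSON document in Python; the field names and values are invented, but they show how tags (name-value pairs) carry the structure without a rigid table schema.

```python
# Small sketch of semi-structured data: a JSON document whose tags
# (field names) describe the data.
import json

doc = """
{
  "order_id": 501,
  "customer": {"name": "Asha", "email": "asha@example.com"},
  "items": [
    {"sku": "A1", "qty": 2},
    {"sku": "B7", "qty": 1}
  ]
}
"""

order = json.loads(doc)  # parse the name-value pairs into Python objects
print(order["customer"]["name"])              # nested element, reached via tags
print(sum(i["qty"] for i in order["items"]))  # total quantity across items
```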
Characteristics of semi-structured Data:
 Data does not conform to a data model but has some structure.
 Data cannot be stored in the form of rows and columns as in databases.
 Semi-structured data contains tags and elements (metadata) that are used to group data and describe how the data is stored.
 Similar entities are grouped together and organized in a hierarchy.
Sources of Semi-structured Data:
 E-mails
 XML and other markup languages
 Binary executables
 TCP/IP packets
 Zipped files
 Integration of data from different sources
 Web pages
Data Warehousing
A data warehouse is a centralized storage system that allows for
the storing, analyzing, and interpreting of data in order to
facilitate better decision-making.
A data warehouse is a type of data management system that
facilitates and supports business intelligence (BI) activities, specifically
analysis.
 Data warehousing involves data cleaning, data integration, and data
consolidations.
A data warehouse can be defined as a collection of organizational
data and information extracted from operational sources and external
data sources.
 The data is periodically pulled from various internal applications like
sales, marketing, and finance, customer-interface applications as well
as external partner systems.
 They come in different types like Enterprise Data Warehouses and
Data Marts, and can be implemented using cloud-based solutions like
Amazon Redshift, on-premises systems like Oracle Exadata, or open-
source options like Apache Hive.
Need for Data Warehousing
 In today’s rapidly changing corporate environment, organizations are
turning to cloud-based technologies for convenient data collection,
reporting, and analysis.
 This is where Data Warehousing comes in as a core component of
business intelligence that enables businesses to enhance their
performance.
 It is important to understand what a data warehouse is and why it is evolving in the global marketplace.
 In today’s business environment, an organization must have reliable
reporting and analysis of large amounts of data.
 Businesses need their data collected and integrated for different levels of
aggregation, from customer service to partner integration to top-level
executive business decisions.
 This is where data warehousing comes in to make reporting and analysis
easier.
 This rise in data, in turn, increases the use of data warehouses to
manage business data.
Data Warehouse Architecture
Metadata
 Metadata is simply defined as data about data.
 The data that are used to represent other data are known as metadata.
 For example, the index of a book serves as metadata for the contents of the book.
 In other words, metadata is the summarized data that leads us to the detailed data.

In terms of a data warehouse, we can define metadata as follows:
 Metadata is a road map to the data warehouse.
 Metadata in a data warehouse defines the warehouse objects.
 Metadata acts as a directory. This directory helps the decision support system locate the contents of the data warehouse.


Data Mart
 Data marts contain a subset of organization-wide data that is
valuable to specific groups of people in an organization.
 In other words, a data mart contains only the data that are specific to a particular group.
 For example, the marketing data mart may contain only data
related to items, customers, and sales.
 Data marts are confined to subjects.
The information gathered in a warehouse can be used in any of the
following domains −
 Tuning Production Strategies − The product strategies can be well
tuned by repositioning the products and managing the product
portfolios by comparing the sales quarterly or yearly.
 Customer Analysis − Customer analysis is done by analyzing the
customer's buying preferences, buying time, budget cycles, etc.
 Operations Analysis − Data warehousing also helps in customer
relationship management, and making environmental corrections. The
information also allows us to analyze business operations.
A data warehouse is an enterprise system used for the analysis
and reporting of structured and semi-structured data from multiple
sources, such as point-of-sale transactions, marketing
automation, customer relationship management, and more.
A data warehouse is suited for ad hoc analysis as well as custom reporting.
Goals of Data Warehousing
 To support reporting as well as analysis
 Maintain the organization's historical information
 Be the foundation for decision making
Examples of Data Warehousing in Various Industries
Investment and Insurance sector
 Firms primarily use a data warehouse to analyze customer and
market trends and other data patterns in these sectors.
 Forex and stock markets are two major sub-sectors.
 Here data warehouses play a crucial role because a single point
difference can lead to massive losses across the board.
 Data Warehouses are usually shared in these sectors and focus on
real-time data streaming.
Retail chains
 Retail chains use Data Warehouses for distribution and marketing.
 Common uses are tracking items, examining pricing policies,
tracking promotional deals, and analyzing customer buying trends.
 Retail chains usually incorporate EDW (Enterprise Data
Warehouse) systems for business intelligence and forecasting
needs.
Healthcare
 Healthcare businesses use a Data Warehouse to forecast
patient outcomes.
 They also use it to generate treatment reports and share data
with insurance providers, research labs, and other medical units.
 Enterprise Data Warehouses are the backbone of healthcare
systems because the latest, up-to-date treatment information is
crucial for saving lives.
Data Preparation
Definition of Data Preparation:
 Data preparation follows a series of steps that starts with
collecting the right data, followed by cleaning, labeling, validation, and
visualization.
 Data preparation is an important step in data analytics as well as in
business intelligence. It's also a core function of business analysts.
 Raw data is usually collected from multiple sources.
 For example, if you want to analyze customer behavior, your raw data
might come from your company’s sales database and customer
relationship management (CRM) system.
 In this case, sales records would be stored in a sales table while
customer information would be stored in a customer table.
 These two tables would probably share some fields (or columns), such as a customer ID, but they would contain different values.
 It’s your job as a data scientist to combine these two tables into one
big table so that you can determine which customers bought what
products and how much they paid for those products.
 Data preparation is the process of collecting, transforming and
enriching raw data to make it suitable for analysis.
 This can include cleaning, normalizing and consolidating raw data
sets, as well as adding new dimensions or attributes to the data.
 Data preparation is sometimes referred to as "data wrangling."
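A hedged sketch of the sales/CRM example above, in pandas: two tables sharing a customer key are combined into one view. The table and column names are assumptions for illustration.

```python
# Combine a sales table and a customer table into one view, as described
# above; all names and values are illustrative.
import pandas as pd

sales = pd.DataFrame({
    "customer_id": [1, 2, 1],
    "product": ["P100", "P200", "P300"],
    "amount": [250.0, 125.5, 99.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "name": ["Asha", "Ravi"],
    "segment": ["retail", "wholesale"],
})

# Join on the shared key so each sale carries its customer's attributes.
combined = sales.merge(customers, on="customer_id", how="left")
print(combined[["name", "product", "amount"]])
```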
1.3 Data Science Hype
 Data science enables companies not only to understand data from
multiple sources but also to enhance decision making.
 As a result, data science is widely used in almost every industry,
including health care, finance, marketing, banking, city planning, and
more.
 Data Science has been recognized as one of the most exciting IT domains in the world.
 Recent reports suggest that leading countries like the USA and China are investing billions of dollars to integrate their industries with Artificial Intelligence (AI).
 AI is emerging and being applied across various areas, including finance, healthcare, and manufacturing.
 The hype is crazy—people throw around tired phrases straight out of the
height of the pre-financial crisis era like “Masters of the Universe” to describe
data scientists.
 Statisticians already feel that they are studying and working on the “Science
of Data.”
 The media often describes data science in a way that makes it sound as if it's simply statistics or machine learning in the context of the tech industry.
Finance
 Each day, many financial institutions handle multi-billion
monetary transactions including ATM withdrawals, debit/credit card
payments, deposits, online payments, and so on.
 It is also evident that a considerable amount of fraud and corruption takes place, hindering the financial growth of the country.
 Modern ML (Machine Learning) techniques and analytics in
combination with human-based skills are being adopted to curb
mishaps.
 Such systems recognize potential threats and flag them as fraudulent in order to minimize losses beforehand.
 AI techniques not only create early warning signs but also reduce human
errors, thereby increasing efficiency.

Healthcare
 Presence of Artificial Intelligence in the healthcare domain helps health
managers to analyze simple to most complicated medical conditions.
 It allows them to examine symptoms, diagnose diseases, and even suggest
medical treatments.
 The medical industry is applying AI concepts to enhance accuracy and bring improvements.
 Apart from diagnosing diseases and suggesting treatments, both AI and ML algorithms are being utilized to improve healthcare quality as well as cut back on high-end medical costs.
Manufacturing
 Manufacturing is one of the most vital industries in our country.
 However, manufacturers constantly face challenges in maintaining logistics, product forecasting, and supply chain management.
 Manufacturing companies can enhance their efficiency through
automation with AI and machine learning.
Agriculture
A noticeable presence of AI has also been seen in the field of
agriculture.
 Agriculture takes the help of artificial intelligence to improve production and minimize wastage.
 Farmers are integrating traditional farming practices with AI to
automate the processes.
Data Science Applications
Healthcare: Data science can identify and predict disease, and personalize healthcare recommendations.
Transportation: Data science can optimize shipping routes in real time.
Sports: Data science can accurately evaluate athletes' performance.
Government: Data science can prevent tax evasion and predict incarceration rates.
E-commerce: Data science can automate digital ad placement.
Gaming: Data science can improve online gaming experiences.
Social media: Data science can create algorithms to pinpoint compatible partners.
Fintech: Data science can help create credit reports and financial profiles, run accelerated underwriting, and create predictive models based on historical payroll data.


1.4 Datafication
 Mayer-Schönberger and Cukier describe datafication as a process of "taking all aspects of life and turning them into data." As examples, they mention that "Google's augmented-reality glasses datafy the gaze. Twitter datafies stray thoughts. LinkedIn datafies professional networks."
 Datafication is an interesting concept and led us to consider its
importance with respect to people’s intentions about sharing their own
data.
 Datafication refers to the collective tools, technologies, and processes
used to transform an organization into a data-driven enterprise.
 An organizational trend of defining the key to core business operations
through a global reliance on data and its related infrastructure.
 Datafication refers to the fact that daily interactions of living things can be
rendered into a data format and put to social use.
Benefits of Datafication
 Datafication is financially advantageous to pursue, since it provides great opportunities for streamlining corporate procedures.
 Datafication is a cutting-edge process for creating a futuristic framework
that is both secure and inventive.
1.5 Data Science Profiles/Data scientist
 Data Science is such a broad field that includes several subdivisions
like data preparation and exploration; data representation and
transformation; data visualization and presentation; predictive
analytics; machine learning, etc.
 The principal purpose of Data Science is to find patterns within data.
 It uses various statistical techniques to analyze and draw insights from data after extraction, wrangling, and pre-processing.
Data Science Job Profiles:
1.Data Analyst
 Data Analysts are the individuals responsible for reviewing data so that they can identify key information in the businesses of customers.
 Their work involves collecting, processing, and analyzing data to extract meaningful insights; data analysts also support decision-making processes.

Key responsibilities of a Data Analyst
 Maintain the collected data in a simple form and prepare it for business communication.
 Use statistical approaches to visualize data and produce reports.
 Assess and understand trends and patterns, and evaluate big datasets.
2. Data Scientist
 Data Scientists are the individuals who work with data to understand it in depth; they are responsible for collecting, analyzing, and interpreting data to help drive decision-making.

Key Responsibilities
 Data scientists discover data sources and analyze information based on patterns and trends.
 They automate data-collection procedures and work on pre-processing structured and unstructured data.
 Data scientists generate predictive models and build machine learning algorithms.
3.Data Engineer
 Data Engineers are experts responsible for designing, maintaining, and optimizing the data infrastructure for data management and transformation.
Key Responsibilities
 Data Engineers are responsible for creating and optimizing data sets for business users and data scientists.
 They suggest improvements to enhance the reliability and quality of models and datasets.
 Data engineers develop algorithms and prototypes to convert data into useful insights.
4.Business Analyst
 Business Analysts help the organization fulfil its goals; they assess the organization, analyze data, and improve systems and processes for the future.
Key Responsibilities
 Business analysts conduct research to evaluate business models.
 Business analysts develop innovative solutions for difficult business problems.
 They are experts in forecasting, budgeting, and allocating resources in businesses.
5.Machine Learning Engineer

 Machine Learning Engineers are critical members of the data science team.
 Their tasks include researching, designing, and building AI systems for machine learning, and improving and maintaining existing artificial intelligence systems.
Key Responsibilities
 Machine learning engineers help create and design machine learning systems.
 They develop data pipelines and effective ML models and datasets.
1.6 Definitions: Meta-data, Statistical inference, Populations and
Samples:
 Metadata means "data about data". Metadata is defined as the data
providing information about one or more aspects of the data; it is used to
summarize basic information about data that can make tracking and working
with specific data easier.
 There are three main types of metadata: Descriptive, Administrative, and
Structural.
 1.Descriptive metadata enables discovery, identification, and selection of
resources. It can include elements such as title, author, and subjects.
Examples of Descriptive Metadata
 Library Catalogs: Metadata about books, including title, author, publication
date, subject headings, and ISBN (International Standard Book Number).
 Digital Repositories: Descriptive information about digital objects, such as
datasets, images, and documents, including titles, creators, descriptions,
and formats.
 Archives: Metadata for archival materials, such as personal papers,
photographs, and historical documents, including descriptions, dates, and
contributors.
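As a minimal illustration (not from the original slides), a descriptive metadata record can be represented as a simple set of name-value pairs; the catalog fields and values below are illustrative.

```python
# A descriptive metadata record for a library catalog entry; values invented.
book_metadata = {
    "title": "Fundamentals of Data Science",
    "author": "Jane Doe",
    "publication_date": "2023",
    "subjects": ["data science", "statistics"],
    "isbn": "978-0-00-000000-0",
}

# Descriptive metadata supports discovery: e.g., find records by subject.
if "data science" in book_metadata["subjects"]:
    print(book_metadata["title"])
```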
2. Administrative metadata facilitates the management of resources
 Administrative metadata is a type of metadata that helps manage and support the use of a resource, typically digital objects, throughout its lifecycle.
 It encompasses information needed for managing, preserving, and providing access to the resource. Here are the main components and purposes of administrative metadata:
(i) Technical Metadata: It describes the technical characteristics of the
digital resource, such as file format, creation date, software and
hardware used, and technical requirements for accessing and rendering
the file.
(ii) Preservation Metadata: It contains information necessary for the long-
term preservation of the resource, including details about the provenance
(origin and history), any actions taken to preserve it (e.g., migrations, format
conversions), and conditions or requirements for maintaining its usability over
time.

(iii) Rights Management Metadata: It includes details about the legal and
access rights associated with the resource, such as copyright status,
intellectual property rights, access permissions, restrictions, and any licensing
information.
Examples of Administrative Metadata
 Digital Libraries: Metadata about digitized books, manuscripts, and
other resources, including technical details, preservation actions, and
rights information.
 Archives: Metadata for archival collections, detailing provenance,
custodial history, and access permissions.
 Repositories: Metadata for datasets, software, and other digital
objects, including technical specifications, usage statistics, and
licensing information.
3.Structural Metadata: Structural metadata is metadata that describes the structure,
type, and relationships of data. For example, in a SQL database, the data is
described by metadata stored in the Information Schema and the Definition Schema.
Examples of Structural Metadata
 Books and Documents: Information about chapters, sections, and sub-sections, as well as pagination and links between different parts of the document.
 Multimedia Objects: Metadata describing scenes, segments, tracks, or frames in videos and audio files.
 Websites: The organization of web pages, including navigation structures, links, and the relationship between different sections of the site.
 Digital Collections: How items in a collection are grouped, ordered, and related to each other, such as collections of photographs, datasets, or archival materials.


Statistical Inference
 Statistical inference is a method of making decisions about the
parameters of a population, based on random sampling.
 It helps to assess the relationship between the dependent and
independent variables.
 The purpose of statistical inference is to estimate the uncertainty, or sample-to-sample variation.
2.Population and Sample in big data:

Data: Both population and sample involve data. Population refers to the entire group
or set of individuals, objects, or events being studied, while a sample is a subset of
the population that is used for analysis.

Descriptive Statistics: Descriptive statistics can be used to analyse both


populations and samples.

Example: All the students in the class are the population, whereas the top 10 students in the class are a sample.

All the members of parliament are the population, and the female candidates present there are a sample.
Population vs. sample
 First, you need to understand the difference between a population
and a sample, and identify the target population of your research.
 The population is the entire group that you want to draw conclusions
about.
 The sample is the specific group of individuals that you will collect
data from.
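The sketch below illustrates the idea in Python: draw a random sample from a simulated population and use it to estimate the population mean, with a rough 95% confidence interval. The population itself is synthetic, purely for illustration.

```python
# Estimate a population mean from a random sample (statistical inference).
import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=70, scale=10, size=100_000)  # e.g., exam scores

sample = rng.choice(population, size=200, replace=False)  # random sampling
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(len(sample))  # standard error of the mean

# ~95% confidence interval using the normal approximation (z = 1.96)
low, high = mean - 1.96 * sem, mean + 1.96 * sem
print(f"sample mean = {mean:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
print(f"true population mean = {population.mean():.2f}")
```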
Data Modelling
 Data modelling: Data modelling is the process of creating a visual
representation of either a whole information system or parts of it to
communicate connections between data points and structures.
 Data modelling is a process of creating a conceptual representation of
data objects and their relationships to one another.
 The process of data modelling typically involves several steps,
including requirements gathering, conceptual design, logical design,
physical design, and implementation.
 Data modelling in software engineering is the process of simplifying the diagram or data model of a software system by applying certain formal techniques.
 It involves expressing data and information through text and symbols.
 The data model provides the blueprint for building a new database or reengineering legacy applications.


Philosophy of Exploratory Data Analysis
 Exploratory Data Analysis (EDA) is an analysis approach that identifies general
patterns in the data. These patterns include outliers and features of the data that
might be unexpected. EDA is an important first step in any data analysis.
Important steps:
 Import Libraries
 Configure Settings
 Prepare Data
 Import Data set files
 Null Value Check
 Exploratory data analysis (EDA) is used by data scientists to analyze and
investigate data sets and summarize their main characteristics, often employing
data visualization methods.
 EDA helps determine how best to manipulate data sources to get the answers you
need, making it easier for data scientists to discover patterns, spot anomalies, test
a hypothesis, or check assumptions.
 EDA is primarily used to see what data can reveal beyond the formal modelling or
hypothesis testing task.
 It provides a better understanding of data set variables and the relationships
between them.
 The main purpose of EDA is to help look at data before making any
assumptions.
 It can help identify obvious errors, as well as better understand
patterns within the data, detect outliers or anomalous events, find
interesting relations among the variables.
 Data scientists can use exploratory analysis to ensure the results they
produce are valid and applicable to any desired business outcomes
and goals.
 EDA can help answer questions about standard deviations, categorical variables, and confidence intervals.
 Once EDA is complete and insights are drawn, its features can then be used for more sophisticated data analysis or modelling, including machine learning.
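A minimal EDA sketch in pandas, covering the steps named above (importing a data set file, a null-value check, summary statistics, and a simple outlier screen); the file name and column names are assumptions for illustration.

```python
# Minimal EDA: summary statistics, null-value check, simple outlier screen.
import pandas as pd

df = pd.read_csv("housing.csv")        # hypothetical data set file

print(df.describe())                   # mean, std, quartiles per column
print(df.isnull().sum())               # null value check per column

# Flag outliers in a numeric column using the 1.5 * IQR rule.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential outliers in 'price'")
```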
Data Science Process
 Step 1: Defining the problem. The first step in the data science
lifecycle is to define the problem that needs to be solved
 Step 2: Data collection and preparation
 Step 3: Data exploration and analysis
 Step 4: Model building and evaluation
 Step 5: Deployment and maintenance
Data Science Process
The various operations involved in Data Science process are:
 Define the Problem and Set Objectives
 Data Collection and Understanding
 Data Preprocessing and Cleaning
 Exploratory Data Analysis (EDA)
 Model Building and Machine Learning
 Interpretation and Insights
 Deployment and Monitoring
Problem Definition
• The project lead or product manager manages this phase. The
problem definition involves the following steps:
 State clearly the problem to be solved and why
 Motivate everyone involved to push toward this why
 Define the potential value of the forthcoming project
 Identify the project risks including ethical considerations
 Identify the key stakeholders
 Align the stakeholders with the data science team

 Research related high-level information
 Assess the resources (people and infrastructure) you'll likely need
 Develop and communicate a high-level, flexible project plan
 Identify the type of problem being solved
 Get buy-in for the project


Data Investigation and Cleaning

• The investigation team needs to identify what data is needed to solve the underlying problem, then determine how to get the data:
 Is the data internally available? -> Get access to it
 Is the data readily collectable? -> Start capturing it
 Is the data available for purchase? -> Buy it


• Once the data is collected, start exploring it. Data scientists or
business/data analysts will lead several activities such as:
 Document the data quality
 Clean the data
 Combine various data sets to create new views
 Load the data into the target location (often to a cloud platform)
 Visualize the data
 Present initial findings to stakeholders and solicit feedback
DATA CLEANING AND PREPARATION

 Before analyzing the data, it is important to clean and prepare it. The methods used to clean and prepare the data are listed below; a short sketch in pandas follows the list:
 Changing Data Types of Columns from object to Floats
 Filling in Missing Information
 Checking for Duplicate Rows
 Splitting Long Strings
 Creating Various New Columns
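Below is a hedged pandas sketch of the steps just listed; the file name and column names (price, full_name, area_sqft) are illustrative assumptions, not from the original slides.

```python
# Sketch of the listed cleaning/preparation steps; names are illustrative.
import pandas as pd

df = pd.read_csv("raw_data.csv")                     # hypothetical file

# Changing data types of columns from object to floats
df["price"] = df["price"].str.replace(",", "").astype(float)

# Filling in missing information
df["price"] = df["price"].fillna(df["price"].median())

# Checking for duplicate rows
print("duplicates:", df.duplicated().sum())
df = df.drop_duplicates()

# Splitting long strings into components
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)

# Creating various new columns
df["price_per_sqft"] = df["price"] / df["area_sqft"]
```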
Minimal Viable Model/Product
 All data science life cycle frameworks have some sort of modelling phase.
 However, I want to emphasize the importance of getting something useful out as quickly as possible.
 This concept borrows from the idea of a Minimal Viable Product.
 “The minimum viable product is that version of a new product which
allows a team to collect the maximum amount of validated learning about
customers with the least effort.”
Deployment and Enhancements:
Deployment:
 Typically the more “engineering-focused” team members such as data engineers,
cloud engineers, machine learning engineers, application developers, and quality
assurance engineers execute this phase.

 “No machine learning model is valuable, unless it’s deployed to production.”

Enhancements:
 Extend the model to similar use cases (i.e., a new "Problem Definition" phase)
 Add and clean data sets (i.e., a new "Data Investigation and Cleaning" phase)
 Try new modelling techniques (i.e., developing the next "Viable Model")
Data Science Ops

 As data science matures into mainstream operations, companies need to take a stronger product focus that includes plans to maintain the deployed systems long-term.
 There are three major overlapping facets of management to this.


Exploratory Data Analysis (EDA)

 Descriptive Statistics: Calculate summary statistics like mean, median, mode, and standard deviation.
 Data Visualization: Use plots (e.g., histograms, scatter plots, box plots) to visualize the data and uncover patterns and relationships.

Feature Engineering
 Feature Selection: Identify the most relevant features for your model.
 Feature Creation: Create new features that may enhance model performance.
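A small sketch of both steps under invented data: one new feature is created, and candidate features are ranked by their correlation with the target as a simple first pass at feature selection.

```python
# Feature engineering sketch; data and column names are invented.
import pandas as pd

df = pd.DataFrame({
    "area_sqft": [800, 1200, 1500, 950],
    "bedrooms":  [2, 3, 4, 2],
    "age_years": [10, 3, 1, 25],
    "price":     [60.0, 110.0, 150.0, 55.0],  # target (illustrative units)
})

# Feature creation: derive a new feature that may carry extra signal.
df["is_new"] = (df["age_years"] < 5).astype(int)

# Feature selection: rank candidate features by absolute correlation
# with the target as a simple relevance measure.
corr = df.corr(numeric_only=True)["price"].drop("price").abs()
print(corr.sort_values(ascending=False))
```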
Modeling

 Model Selection: Choose the appropriate algorithms and models based on the problem type (e.g., regression, classification, clustering).
 Model Training: Train the model using the training dataset.
 Hyperparameter Tuning: Optimize model parameters to improve performance.
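A hedged modeling sketch with scikit-learn covering the three steps above: a regression model is selected, trained on a training split, and tuned with a small grid search. The data are synthetic stand-ins, not a real dataset.

```python
# Model selection, training, and hyperparameter tuning with scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))                                # stand-in features
y = 3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)   # stand-in target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hyperparameter tuning: search over n_estimators with cross-validation.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100]},
    cv=3,
)
search.fit(X_train, y_train)                   # model training
print("best params:", search.best_params_)
print("test R^2:", search.best_estimator_.score(X_test, y_test))
```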
Model Evaluation
 Validation: Evaluate the model using a validation set to test its
performance.
 Metrics: Use appropriate metrics (e.g., accuracy, precision, recall, F1-
score, RMSE) to assess model performance.
 Cross-Validation: Perform cross-validation to ensure the model’s
robustness and generalizability.
 The F1 score in Machine Learning is an important evaluation
metric that is commonly used in classification tasks to evaluate
the performance of a model. It combines precision and recall into
a single value.
 Precision represents the accuracy of positive predictions. It
calculates how often the model predicts correctly the positive
values.
 Recall represents how well a model can identify actual positive
cases. It is the number of true positive predictions divided by the
total number of actual positive instances.
 Root mean square error or root mean square deviation is one of the
most commonly used measures for evaluating the quality of
predictions.
 In machine learning, it is extremely helpful to have a single number
to judge a model’s performance.
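The following sketch computes the metrics named above with scikit-learn on toy label arrays; the values are purely illustrative.

```python
# Precision, recall, F1, accuracy, and RMSE on toy data.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_squared_error)

# Classification: precision, recall, and their combination F1.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1       :", f1_score(y_true, y_pred))         # combines P and R

# Regression: RMSE, a single number to judge prediction quality.
y_obs = np.array([2.0, 3.5, 4.0])
y_hat = np.array([2.1, 3.0, 4.4])
print("rmse     :", np.sqrt(mean_squared_error(y_obs, y_hat)))
```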
Model Deployment

 Implementation: Deploy the model into a production environment where it can be used to make predictions on new data.
 Integration: Integrate the model with existing systems and workflows.
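One common implementation pattern (a sketch, not the only option) is to persist the trained model and reload it inside the serving environment; the file path and tiny training data below are illustrative.

```python
# Persist a trained model, then reload it in a serving process.
import joblib
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit([[1], [2], [3]], [2, 4, 6])
joblib.dump(model, "model.joblib")          # save during training

# In the production environment:
loaded = joblib.load("model.joblib")        # load once at startup
print(loaded.predict([[4]]))                # predict on incoming data
```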
Monitoring and Maintenance

 Monitoring: Continuously monitor the model's performance to ensure it remains accurate and relevant.
 Updating: Regularly update the model as new data becomes available or as the underlying problem changes.

Communication and Reporting
 Insights: Communicate findings and insights to stakeholders through reports, dashboards, and presentations.
 Decision-Making: Use the insights to inform decision-making and drive business strategies.

Iterative Improvement
 Feedback Loop: Use feedback from stakeholders and model performance to iteratively improve the process and models.


Role of Data Scientist
 A Data Scientist's role involves leveraging data to derive actionable insights and support data-driven decision-making. Their responsibilities can be summarized into several key areas:

1.Problem Definition
 Collaborate with stakeholders to understand business objectives and translate them
into data science problems.

2.Data Collection
 Identify and gather relevant data from various sources using techniques like web
scraping, APIs, and database querying.
3.Data Cleaning and Preprocessing
 Clean and preprocess data by handling missing values, removing
duplicates, and transforming data into a suitable format for analysis.
4.Exploratory Data Analysis (EDA)
 Perform descriptive statistics and create visualizations to uncover
patterns and insights within the data.
5.Feature Engineering
 Create and select important features to enhance model performance
and reduce dimensionality.
6. Model Building
 Choose appropriate algorithms, train models, and fine-tune parameters to
optimize performance.

7.Model Evaluation
 Evaluate models using relevant metrics and validate them to ensure they
generalize well to new data.

8.Model Deployment
 Develop and implement a strategy for deploying models into production
environments, ensuring seamless integration with existing systems.
9.Monitoring and Maintenance
 Continuously monitor model performance, update and retrain models as
necessary, and perform error analysis to refine models.

10.Communication and Reporting


 Present findings and insights through clear and compelling storytelling,
creating reports and dashboards for stakeholders.

11.Ethical Considerations
 Ensure data privacy, compliance with regulations, and mitigate biases to
promote fairness and ethical use of data.
12. Continuous Learning
 Stay updated with the latest advancements in data science and
continuously experiment with new techniques and tools.
A Data Scientist combines technical skills with business understanding to transform data into valuable insights that drive strategic decisions and operational improvements.
Case Study
Case Study in Data Science - Urban Planning and Smart Cities

1. Singapore
 Singapore is pioneering the smart city concept, using data science to
optimize urban planning and public services.
 They gather data from various sources, including sensors and citizen
feedback, to manage traffic flow, reduce energy consumption, and
improve the overall quality of life in the city-state.
Singapore - Efficient Urban Planning using Data Science:
 Singapore's real-time traffic management system, powered by data
analytics, has led to a 25% reduction in peak-hour traffic congestion,
resulting in shorter commute times and lower fuel consumption.
 Singapore has achieved a 15% reduction in energy consumption
across public buildings and street lighting, contributing to significant
environmental sustainability gains.
 Citizen feedback platforms have seen 90% of reported issues
resolved within 48 hours, reflecting the city's responsiveness in
addressing urban challenges through data-driven decision-making.
 The implementation of predictive maintenance using data science
has resulted in a 30% decrease in the downtime of critical public
infrastructure, ensuring smoother operations and minimizing
disruptions for residents.
2. Barcelona
 Barcelona has embraced data science to transform into a smart city as well.
 They use data analytics to monitor and control waste management, parking,
and public transportation services.
 Barcelona improves the daily lives of its citizens and makes the city more
attractive for tourists and businesses.
 Data science has significantly influenced Barcelona's urban planning and
the development of smart cities, reshaping the urban landscape of this
vibrant Spanish metropolis.
 Barcelona's data-driven waste management system has led to a 20% reduction in the frequency of waste collection in certain areas, resulting in cost savings and reduced environmental impact.
 The implementation of smart parking solutions using data science has reduced the average time it takes to find a parking spot by 30%, easing congestion and frustration for both residents and visitors.
 Public transportation optimization through data analytics has improved service reliability, resulting in a 10% increase in daily ridership and reduced waiting times for commuters.
 Barcelona's efforts to become a smart city have attracted 30% more tech startups and foreign investments over the past five years, stimulating economic growth and job creation in the region.


Case Study in Data Science - Housing Market Analysis

 Predicting or estimating the selling price of a property can be of great help when
making important decisions such as the purchase of a home or real estate as an
investment vehicle.
 It can also be an important tool for a real estate sales agency, since it allows the agency to estimate the sale value of the real estate, which in this case are its assets.

1. Analysing Data to Predict Market Trends


 Data science in real estate helps to forecast property market trends and any risks
that might exist in the investment.
 By using data that consists of a combination of different variables, with predictive analysis applied to it, data scientists understand and analyse:
 how consumer groups have been behaving over time
 what types of properties have been in demand
 the kinds of leisure activities consumers are involving themselves in
 the facilities that can be integrated with residential spaces to enhance consumer experience
 the evolution in the rents being charged
2. Formulating the Property Price Indices
 One of the most significant applications of data science in real estate is
to collect and leverage information relating to the adjoining local areas.
 These include, supermarkets in the vicinity, educational institutes,
business and commerce hubs, traffic in the neighbourhood, crime rates,
cafes and restaurants, and physical infrastructure.
 These qualitative and quantitative variables come into play to influence the pricing of individual properties.
 Internal variables, such as the floor number, the size of rooms, and the view from the window, work as additions that are charged for separately.
 Therefore, the internal variables of the property, alongside the hyperlocal variables, work to formulate the property price indices and help real estate agents cater better to the needs of their clients.
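As a hedged sketch of this idea, the regression below combines internal variables (floor, room size) with hyperlocal variables (nearby supermarkets, crime rate) to model price; all feature names and values are invented for illustration and do not constitute a real price index.

```python
# Toy price model from internal and hyperlocal variables.
import pandas as pd
from sklearn.linear_model import LinearRegression

listings = pd.DataFrame({
    "floor":            [2, 10, 5, 1, 8],
    "room_size_sqft":   [450, 700, 550, 400, 650],
    "supermarkets_1km": [1, 4, 2, 0, 3],          # hyperlocal variable
    "crime_rate_index": [0.8, 0.3, 0.5, 0.9, 0.4],
    "price_lakhs":      [40, 95, 60, 32, 85],     # illustrative prices
})

X = listings.drop(columns="price_lakhs")
y = listings["price_lakhs"]

model = LinearRegression().fit(X, y)
# The fitted coefficients indicate how each internal or hyperlocal
# variable pushes the estimated price up or down.
for name, coef in zip(X.columns, model.coef_):
    print(f"{name:18s} {coef:+.2f}")
```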
3. Understanding Investment Performance
 In the field of real estate, no two properties can ever be identical.
 Variables differ even with properties in the same building, not to

mention the changing value of properties with time.


 Understanding individual sub-market performance is therefore a
difficult problem to deal with.
 As a solution to this issue, the changing price of an asset can be
tracked over time by using data science in real estate.
 In the world of real estate, each property is unique, with factors
varying even among those situated within the same building.
 Adding to the complexity, property values change constantly due to
fluctuations in the market and evolving infrastructure.
4. Estimating Profitability of Investment and Construction
 Whether one invests in a commercial real estate space or a residential
one, location intelligence acts as a very important aspect to gauge
whether the investment would be able to yield the expected profits in the
future.
 With the proper information about the geography of a particular property,
accessibility of services around it, land ownership, zoning, regional laws,
etc. an investor or a real estate consultant can make a more informed
decision by visualizing and analyzing prospects.
5. Managing Finances of Properties
 Let’s assume that you manage a diverse pool of properties across
various localities in Mumbai.
 Although the work is the same, you need to evaluate the reasons why one property is draining more resources in comparison to another.
 This could be in terms of losses incurred due to higher vacancy rates
or systems malfunctions.
 Fortunately, data science in real estate management helps you in
identifying the root cause.
 This is done by gathering data such as receivables and budgets, profitability and cost analyses, and planning for tenant build-outs across different properties.
 The data can then be evaluated against various metrics, so you can get to the bottom of the problem and formulate solutions for it.
6. Trimming Down Energy Consumption
 With the incorporation of data science in real estate, identifying the root

cause of energy wastage has now become possible.


 Nowadays, a plethora of apps and software is available that gathers and assesses energy data from smart meters and sensors, and can also detect faults in heating, ventilation, and air conditioning (HVAC) systems.
 Based on weather changes and usage patterns, these apps offer a holistic understanding of energy spending.
7. Simplifying Home Searching or Buying Process
 Data science usage in real estate not only benefits the investor and broker class,
but it also streamlines the home searching, buying and renting process.
 It is very much possible that real estate property prices vary drastically across
different cities.
 By examining user behavior, their lifestyle preference, budget range, amenities
preference and other such factors, you can offer property suggestions that
match the requirements of the users.
 This will therefore save customers' time combing through multiple property listings.
 Data science in real estate also simplifies the process of finding, purchasing, or
renting homes for individuals and families.
 Property prices can fluctuate significantly between cities, influenced by
factors such as connectivity to nearby areas, proximity to commercial
centers, and availability of transportation options.
 In addition, examining user behavior, preferences in terms of lifestyle,
budget constraints, desired amenities, and other relevant aspects can
lead to personalized property recommendations tailored to meet
customer needs.
8. Revamping the Marketing Strategy
 Data science in real estate aids in collecting and examining information
through multiple sources.
 This can help agencies in understanding the behavior and preferences
of the consumers, assessing the competition, and marketing their
services in a more creative way.
 Once user preference is understood, virtual staging, 3D rendering and
visualization, Google or Facebook ads, and listings can be optimized in
order to attract the target audience.
9. Identifying and Segregating Leads
A very interesting way to harness the power of data science in real
estate is in the field of lead nurturing and segregation.
 With the help of data science-backed applications and software, it has now become possible to give a "seller or buyer score" to leads that are most likely to sell/buy properties.
 This assessment is made by evaluating factors like demographics,
income changes and purchasing behavior.
Prospects for Growth
 Modern technologies have revolutionized the real estate market.
 Many companies have already shifted to big data-machine learning
powered software for analyzing data, calculating the profitability of an
apartment purchase, portfolio management, and estimating property
rentals.
 The study of how customers, groups, or organizations select, buy, use,
and dispose of ideas, goods, and services can impact, inform and
govern the decision-making process of the producing firms and
organizations to a large extent.
