Module 1 - Data Science Introduction (Detailed)
Semester - I
Course Title: Fundamentals of Data Science
Course Code: 24BTELY107
Data science is needed across many sectors, for example:
Politics
Logistics companies
E-commerce
Data science is all about:
Understanding the data to make better decisions and reach the final result.
Data Science
Example:
Suppose we want to travel from station A to station B by car.
We need to make decisions such as: which route will be the best to reach the location faster, which route will have no traffic jam, and which will be cost-effective.
All these decision factors act as input data, and we get an appropriate answer from these decisions; this analysis of data is called data analysis, which is a part of data science.
1.2 Big Data
Big data refers to significant volumes of data that cannot be processed effectively
with the traditional applications that are currently used.
The processing of big data begins with raw data that isn’t aggregated and is most
often impossible to store in the memory of a single computer.
Big data is used to analyze insights, which can lead to better decisions and
strategic business moves.
Big data is a combination of structured, semi-structured and unstructured data that
organizations collect, analyze and mine for information and insights.
Big data is high-volume, and high-velocity or high-variety information
assets that demand cost-effective, innovative forms of information
processing that enable enhanced insight, decision making, and process
automation.
Companies use big data in their systems to improve operational efficiency,
provide better customer service, create personalized marketing campaigns
and take other actions that can increase revenue and profits.
Businesses that use big data effectively hold a potential competitive
advantage over those that don't because they're able to make faster and
more informed business decisions.
Medical researchers use big data to identify disease signs and risk factors.
Doctors use it to help diagnose illnesses and medical conditions in
patients.
In addition, a combination of data from electronic health records, social
media sites, the web and other sources gives healthcare organizations and
government agencies up-to-date information on infectious disease threats
and outbreaks.
Big data helps oil and gas companies identify potential drilling locations
and monitor pipeline operations.
Likewise, utilities use it to track electrical grids.
Financial services firms use big data systems for risk management
and real-time analysis of market data.
Manufacturers and transportation companies rely on big data to
manage their supply chains and optimize delivery routes.
Government agencies use big data for emergency response, crime
prevention and smart city initiatives.
Data collection is the process of acquiring, extracting, and storing voluminous amounts of data, which may be in structured or unstructured form (text, video, audio, XML files, records, or image files), for use in later stages of data analysis.
In the process of big data analysis, data collection is the initial step, carried out before starting to analyze the patterns or useful information in the data.
The data to be analyzed must be collected from different valid sources.
The actual data is then further divided mainly into two types known
as:
1. Primary data
2. Secondary data
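The two types can be illustrated with a minimal Python sketch; the survey responses and the CSV content below are made up for the example:

```python
import csv
import io

# Primary data: collected first-hand, e.g. responses recorded from our own survey.
primary_data = [
    {"respondent": 1, "rating": 4},
    {"respondent": 2, "rating": 5},
]

# Secondary data: already collected by someone else, e.g. a published CSV file.
# An in-memory string stands in for a downloaded file here.
secondary_csv = "city,population\nMysuru,920550\nBengaluru,8443675\n"
secondary_data = list(csv.DictReader(io.StringIO(secondary_csv)))

print(len(primary_data), len(secondary_data))
print(secondary_data[0]["city"])
```

Both kinds of data end up in the same downstream analysis; only their origin differs.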
Applications of Big Data
Big Data for Financial Services
Credit card companies, retail banks, private wealth management advisories,
insurance firms, venture funds, and institutional investment banks all use big data for
their financial services.
The common problem among them all is the massive amounts of multi-structured
data living in multiple disparate systems, which big data can solve.
Big data is used in several ways, including:
Customer analytics
Compliance analytics
Fraud analytics
Operational analytics
Types of Big Data
Structured Data: data organized in a fixed, predefined format, such as tables in a relational database.
Semi-structured Data: data with some organizational properties but no rigid schema, such as e-mails and XML files.
Unstructured Data: data with no predefined format, such as text documents, images, audio, and video.
Metadata: data about data; for example, the index of a book serves as metadata for the contents in the book. In other words, we can say that metadata is the summarized data that leads us to the detailed data.
Healthcare
Presence of Artificial Intelligence in the healthcare domain helps health
managers to analyze simple to most complicated medical conditions.
It allows them to examine symptoms, diagnose diseases, and even suggest
medical treatments.
The medical industry is applying AI concepts to enhance accuracy and bring improvements.
Apart from diagnosing diseases and suggesting treatments, both AI and ML algorithms are being utilized to improve healthcare quality as well as cut back on high-end medical costs.
Manufacturing
Manufacturing is one of the most vital industries in our country.
However, manufacturers constantly face challenges in logistics, product forecasting, and supply chain management.
Manufacturing companies can enhance their efficiency through
automation with AI and machine learning.
Agriculture
A noticeable presence of AI has also been seen in the field of
agriculture.
Agriculture takes the help of artificial intelligence to improve production and minimize wastage.
Farmers are integrating traditional farming practices with AI to
automate the processes.
Data Science Applications
Healthcare: Data science can identify and predict disease.
E-commerce: Data science can automate digital ad placement.
Fintech: Data science can help create credit reports and financial profiles.
Key Responsibilities
Data scientists discover data sources and analyze the information based on patterns and trends.
They automate the procedure of data collection and work on pre-processing of structured and unstructured data.
Data scientists generate predictive models and build machine learning algorithms.
3. Data Engineer
Data Engineers are experts responsible for designing, maintaining, and optimizing the data infrastructure for data management and transformation.
Key Responsibilities
Data Engineers are responsible for creating and optimizing data sets for data scientists and business users.
They suggest improvements to enhance the reliability and quality of the models and datasets.
Data engineers develop algorithms and prototypes to convert data into useful insights.
4. Business Analyst
Business Analysts are professionals who help the organization fulfil its goals; they assess the organization, analyze the data, and improve systems and processes for the future.
Key Responsibilities
Business analysts conduct research to evaluate business models.
They develop innovative solutions for difficult business problems.
They are experts in forecasting, budgeting, and resource allocation in businesses.
5. Machine Learning Engineer
Machine Learning Engineers are critical members of the data science team.
Their tasks include researching, designing, and building AI systems for machine learning, and improving and maintaining existing artificial intelligence systems.
Key Responsibilities
(iii) Rights Management Metadata: It includes details about the legal and
access rights associated with the resource, such as copyright status,
intellectual property rights, access permissions, restrictions, and any licensing
information.
Examples of Administrative Metadata
Digital Libraries: Metadata about digitized books, manuscripts, and
other resources, including technical details, preservation actions, and
rights information.
Archives: Metadata for archival collections, detailing provenance,
custodial history, and access permissions.
Repositories: Metadata for datasets, software, and other digital
objects, including technical specifications, usage statistics, and
licensing information.
3. Structural Metadata: Structural metadata is metadata that describes the structure,
type, and relationships of data. For example, in a SQL database, the data is
described by metadata stored in the Information Schema and the Definition Schema.
Examples of Structural Metadata
Books and Documents: Information about chapters, sections, and sub-sections.
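The idea that a database's structure is itself described by metadata can be sketched with Python's built-in sqlite3 module; the students table and its columns are hypothetical:

```python
import sqlite3

# A throwaway in-memory database; the table name and columns are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, marks REAL)")

# PRAGMA table_info returns the structural metadata SQLite keeps about a table:
# column position, name, declared type, NOT NULL flag, default value, primary-key flag.
columns = conn.execute("PRAGMA table_info(students)").fetchall()
for cid, name, col_type, notnull, default, pk in columns:
    print(name, col_type, "PK" if pk else "")
conn.close()
```

The rows returned here describe the table, not its contents, which is exactly the structural-metadata distinction made above.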
Data: Both population and sample involve data. Population refers to the entire group
or set of individuals, objects, or events being studied, while a sample is a subset of
the population that is used for analysis.
Example: All the students in the class are population whereas the top 10 students in
the class are the sample.
All the members of parliament are the population, and the female candidates present there are the sample.
Population vs. sample
First, you need to understand the difference between a population
and a sample, and identify the target population of your research.
The population is the entire group that you want to draw conclusions
about.
The sample is the specific group of individuals that you will collect
data from.
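The distinction can be sketched with Python's standard library; the class of 50 students is hypothetical:

```python
import random

# The population: every student in a hypothetical class of 50.
population = [f"student_{i}" for i in range(1, 51)]

# A sample: 10 students drawn at random, the group we actually collect data from.
random.seed(42)  # fixed seed so the draw is reproducible
sample = random.sample(population, k=10)

print(len(population), len(sample))
# Every sampled unit belongs to the population, by construction.
assert set(sample) <= set(population)
```

Any statistic computed on `sample` is then an estimate of the corresponding population value.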
Data Modelling
Data modelling: Data modelling is the process of creating a visual
representation of either a whole information system or parts of it to
communicate connections between data points and structures.
Data modelling is a process of creating a conceptual representation of
data objects and their relationships to one another.
The process of data modelling typically involves several steps,
including requirements gathering, conceptual design, logical design,
physical design, and implementation.
Data modelling in software engineering is the process of simplifying the design of a software system by applying formal techniques.
The data model provides the blueprint for building a new database or re-engineering an existing application.
Once EDA is complete and insights are drawn, its features can then be used in machine learning.
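A conceptual data model can be sketched in a few lines of Python; the library-domain entities and attribute names below are made up purely for illustration:

```python
from dataclasses import dataclass

# A toy conceptual model: two entities and one relationship between them.
@dataclass
class Author:
    author_id: int
    name: str

@dataclass
class Book:
    book_id: int
    title: str
    author_id: int  # foreign-key-style reference linking Book to Author

# Instances show the connection between data points that the model describes.
austen = Author(author_id=1, name="Jane Austen")
emma = Book(book_id=10, title="Emma", author_id=austen.author_id)

print(emma.title, "->", austen.name)
```

In a physical design step, the same entities would typically become database tables with `author_id` as an actual foreign key.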
Data Science Process
Step 1: Defining the problem. The first step in the data science
lifecycle is to define the problem that needs to be solved
Step 2: Data collection and preparation
Step 3: Data exploration and analysis
Step 4: Model building and evaluation
Step 5: Deployment and maintenance
Data Science Process
The various operations involved in Data Science process are:
Define the Problem and Set Objectives
Data Collection and Understanding
Data Preprocessing and Cleaning
Exploratory Data Analysis (EDA)
Model Building and Machine Learning
Interpretation and Insights
Deployment and Monitoring
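The steps above can be walked through end-to-end on a toy problem; all numbers are made up, and the "model" is a one-parameter least-squares fit chosen only for brevity:

```python
# Step 1: define the problem -- predict a y value from x.
# Step 2: data collection (one record arrives with a missing value).
raw = [(1, 2.0), (2, 4.1), (3, None), (4, 7.9)]

# Preprocessing and cleaning: drop incomplete records.
data = [(x, y) for x, y in raw if y is not None]

# Step 3: exploratory analysis -- a quick summary statistic.
mean_y = sum(y for _, y in data) / len(data)

# Step 4: model building -- least-squares slope for y ~ a*x (no intercept).
a = sum(x * y for x, y in data) / sum(x * x for x, _ in data)

# Step 4: evaluation -- mean absolute error on the training data.
mae = sum(abs(y - a * x) for x, y in data) / len(data)

# Step 5 would wrap `a` in a service and monitor the error over time.
print(round(a, 2), round(mae, 2))
```

Real projects swap each toy step for a proper tool (a database, plots, a trained model, a held-out test set), but the ordering of the phases stays the same.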
Deployment
Problem Definition
• The project lead or product manager manages this phase. The
problem definition involves the following steps:
State clearly the problem to be solved and why
Motivate everyone involved to push toward this why
Define the potential value of the forthcoming project
Identify the project risks including ethical considerations
Identify the key stakeholders
Align the stakeholders with the data science team
Enhancements:
Extend the model to similar use cases (i.e. a new “Problem Definition” phase)
Add and clean data sets (i.e. a new “Data Investigation and Cleaning” phase)
Try new modelling techniques (i.e. developing the next “Viable Model”)
Data Science Ops
Modeling
Model Evaluation
Validation: Evaluate the model using a validation set to test its
performance.
Metrics: Use appropriate metrics (e.g., accuracy, precision, recall, F1-
score, RMSE) to assess model performance.
Cross-Validation: Perform cross-validation to ensure the model’s
robustness and generalizability.
The F1 score in Machine Learning is an important evaluation
metric that is commonly used in classification tasks to evaluate
the performance of a model. It combines precision and recall into
a single value.
Precision represents the accuracy of positive predictions: it measures how often the model correctly predicts positive values.
Recall represents how well a model can identify actual positive
cases. It is the number of true positive predictions divided by the
total number of actual positive instances.
Root mean square error or root mean square deviation is one of the
most commonly used measures for evaluating the quality of
predictions.
In machine learning, it is extremely helpful to have a single number
to judge a model’s performance.
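The metrics defined above can be computed by hand on a tiny made-up example; the labels and values are hypothetical:

```python
import math

# Toy classification labels: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)          # accuracy of positive predictions
recall = tp / (tp + fn)             # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# RMSE for a toy regression: root of the mean squared prediction error.
actual = [3.0, 5.0, 2.5]
pred = [2.5, 5.0, 3.0]
rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

print(precision, recall, f1, round(rmse, 3))
```

Libraries such as scikit-learn provide the same metrics ready-made, but writing them out once makes the precision/recall trade-off concrete.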
Model Deployment
1. Problem Definition
Collaborate with stakeholders to understand business objectives and translate them
into data science problems.
2. Data Collection
Identify and gather relevant data from various sources using techniques like web
scraping, APIs, and database querying.
3. Data Cleaning and Preprocessing
Clean and preprocess data by handling missing values, removing
duplicates, and transforming data into a suitable format for analysis.
4. Exploratory Data Analysis (EDA)
Perform descriptive statistics and create visualizations to uncover
patterns and insights within the data.
5. Feature Engineering
Create and select important features to enhance model performance
and reduce dimensionality.
6. Model Building
Choose appropriate algorithms, train models, and fine-tune parameters to
optimize performance.
7. Model Evaluation
Evaluate models using relevant metrics and validate them to ensure they
generalize well to new data.
8. Model Deployment
Develop and implement a strategy for deploying models into production
environments, ensuring seamless integration with existing systems.
9. Monitoring and Maintenance
Continuously monitor model performance, update and retrain models as
necessary, and perform error analysis to refine models.
10. Ethical Considerations
Ensure data privacy, compliance with regulations, and mitigate biases to
promote fairness and ethical use of data.
11. Continuous Learning
Stay updated with the latest advancements in data science and
continuously experiment with new techniques and tools.
A Data Scientist combines technical skills with business understanding to transform data into valuable insights that drive strategic decisions and operational improvements.
Case Study
Case Study in Data Science - Urban Planning and Smart Cities
1. Singapore
Singapore is pioneering the smart city concept, using data science to
optimize urban planning and public services.
They gather data from various sources, including sensors and citizen
feedback, to manage traffic flow, reduce energy consumption, and
improve the overall quality of life in the city-state.
Singapore - Efficient Urban Planning using Data Science:
Singapore's real-time traffic management system, powered by data
analytics, has led to a 25% reduction in peak-hour traffic congestion,
resulting in shorter commute times and lower fuel consumption.
Singapore has achieved a 15% reduction in energy consumption
across public buildings and street lighting, contributing to significant
environmental sustainability gains.
Citizen feedback platforms have seen 90% of reported issues
resolved within 48 hours, reflecting the city's responsiveness in
addressing urban challenges through data-driven decision-making.
The implementation of predictive maintenance using data science
has resulted in a 30% decrease in the downtime of critical public
infrastructure, ensuring smoother operations and minimizing
disruptions for residents.
2. Barcelona
Barcelona has embraced data science to transform into a smart city as well.
They use data analytics to monitor and control waste management, parking,
and public transportation services.
Barcelona improves the daily lives of its citizens and makes the city more
attractive for tourists and businesses.
Data science has significantly influenced Barcelona's urban planning and
the development of smart cities, reshaping the urban landscape of this
vibrant Spanish metropolis.
Barcelona's data-driven waste management system has led to a 20% reduction in waste collection costs, and the city has seen growth in tech startups and foreign investments over the past five years.
Predicting or estimating the selling price of a property can be of great help when
making important decisions such as the purchase of a home or real estate as an
investment vehicle.
It can also be an important tool for a real estate sales agency, since it will allow them to estimate the sale value of the properties that, in this case, are their assets.
Data science helps real estate professionals get down to the bottom of the problem and formulate solutions for the same.
6. Trimming Down Energy Consumption
With the incorporation of data science in real estate, identifying the root cause of excessive energy consumption has become easier. Data-driven applications gather and assess energy data from smart meters and sensors, and can also detect faults in the heating, ventilation, and air conditioning (HVAC) systems.
Based on weather changes and usage patterns, these apps offer a holistic understanding of energy spending.
7. Simplifying Home Searching or Buying Process
Data science usage in real estate not only benefits the investor and broker class,
but it also streamlines the home searching, buying and renting process.
It is very much possible that real estate property prices vary drastically across
different cities.
By examining user behavior, their lifestyle preference, budget range, amenities
preference and other such factors, you can offer property suggestions that
match the requirements of the users.
This will therefore save customers' time in scouring through multiple property listings.
Data science in real estate also simplifies the process of finding, purchasing, or
renting homes for individuals and families.
Property prices can fluctuate significantly between cities, influenced by
factors such as connectivity to nearby areas, proximity to commercial
centers, and availability of transportation options.
In addition, examining user behavior, preferences in terms of lifestyle,
budget constraints, desired amenities, and other relevant aspects can
lead to personalized property recommendations tailored to meet
customer needs.
8. Revamping the Marketing Strategy
Data science in real estate aids in collecting and examining information
through multiple sources.
This can help agencies in understanding the behavior and preferences
of the consumers, assessing the competition, and marketing their
services in a more creative way.
Once user preference is understood, virtual staging, 3D rendering and
visualization, Google or Facebook ads, and listings can be optimized in
order to attract the target audience.
9. Identifying and Segregating Leads
A very interesting way to harness the power of data science in real
estate is in the field of lead nurturing and segregation.
With the help of data science-backed applications and software, it is now possible to give a "seller or buyer score" to leads that are most likely to sell or buy properties.
This assessment is made by evaluating factors like demographics,
income changes and purchasing behavior.
Prospects for Growth
Modern technologies have revolutionized the real estate market.
Many companies have already shifted to big data-machine learning
powered software for analyzing data, calculating the profitability of an
apartment purchase, portfolio management, and estimating property
rentals.
The study of how customers, groups, or organizations select, buy, use,
and dispose of ideas, goods, and services can impact, inform and
govern the decision-making process of the producing firms and
organizations to a large extent.