Lec 1 - Data Science
Data science is the field of study that combines domain expertise, programming skills, and
knowledge of mathematics and statistics to extract meaningful insights from data. It is a
multidisciplinary practice that draws on principles from mathematics, statistics, artificial
intelligence, and computer engineering to analyse large amounts of data and extract insights
that inform business decisions. Data analysis in data science is commonly grouped into four
types:
1. Descriptive analysis
Descriptive analysis examines data to gain insights into what happened or what is
happening in the data environment. It is characterized by data visualizations such as pie
charts, bar charts, line graphs, tables, or generated narratives. For example, a flight booking
service may record data like the number of tickets booked each day. Descriptive analysis will
reveal booking spikes, booking slumps, and high-performing months for this service.
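The following is a minimal sketch, in Python with pandas, of how such a descriptive summary
might be computed. The booking data is synthetic and the column names ("date", "tickets")
are assumptions made for illustration:

import numpy as np
import pandas as pd

# Synthetic stand-in for a year of daily ticket bookings.
rng = np.random.default_rng(42)
dates = pd.date_range("2023-01-01", "2023-12-31", freq="D")
bookings = pd.DataFrame({
    "date": dates,
    "tickets": rng.poisson(lam=120, size=len(dates)),
})

# Aggregate daily counts into monthly totals to spot spikes, slumps,
# and high-performing months.
monthly = bookings.groupby(bookings["date"].dt.to_period("M"))["tickets"].sum()
print(monthly.sort_values(ascending=False).head(3))  # top three months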
2. Diagnostic analysis
Diagnostic analysis is a deep-dive or detailed data examination to understand why
something happened. It is characterized by techniques such as drill-down, data discovery,
data mining, and correlations. Multiple data operations and transformations may be
performed on a given data set to discover unique patterns in each of these techniques. For
example, the flight service might drill down on a particularly high-performing month to better
understand the booking spike. This may lead to the discovery that many customers visit a
particular city to attend a monthly sporting event.
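A possible drill-down for this scenario, sketched with pandas on invented data; the
destinations and the injected May spike are purely illustrative:

import numpy as np
import pandas as pd

# Synthetic bookings spread evenly across months and destinations.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "month": rng.integers(1, 13, size=5000),
    "destination": rng.choice(["Austin", "Boston", "Chicago"], size=5000),
})
# Inject a spike: extra May bookings to one city, standing in for a monthly event.
spike = pd.DataFrame({"month": [5] * 800, "destination": ["Austin"] * 800})
df = pd.concat([df, spike], ignore_index=True)

# Drill down: within the highest-performing month, which destinations dominate?
top_month = df["month"].value_counts().idxmax()
print(df.loc[df["month"] == top_month, "destination"].value_counts())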
3. Predictive analysis
Predictive analysis uses historical data to make accurate forecasts about data patterns that
may occur in the future. It is characterized by techniques such as machine learning,
forecasting, pattern matching, and predictive modeling. In each of these techniques,
computers are trained to reverse-engineer causal connections in the data. For example,
the flight service team might use data science to predict flight booking patterns for the
coming year at the start of each year. The computer program or algorithm may look at past
data and predict booking spikes for certain destinations in May. Having anticipated their
customers’ future travel requirements, the company could start targeted advertising for those
cities from February.
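As a toy illustration, the sketch below fits a linear regression on a synthetic two-year booking
history and extrapolates the next twelve months. A real forecast would use richer features
and proper time-series methods:

import numpy as np
from sklearn.linear_model import LinearRegression

# Two years of synthetic monthly bookings with an upward trend plus noise.
rng = np.random.default_rng(1)
months = np.arange(1, 25).reshape(-1, 1)
tickets = 3000 + 50 * months.ravel() + rng.normal(0, 100, size=24)

# Fit on history, then predict the coming twelve months.
model = LinearRegression().fit(months, tickets)
future = np.arange(25, 37).reshape(-1, 1)
print(model.predict(future).round())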
4. Prescriptive analysis
Prescriptive analytics takes predictive data to the next level. It not only predicts what is likely
to happen but also suggests an optimum response to that outcome. It can analyze the
potential implications of different choices and recommend the best course of action. It uses
graph analysis, simulation, complex event processing, neural networks, and
recommendation engines from machine learning.
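A simplified illustration of the prescriptive idea, using simulation: the candidate actions and
their payoff distributions below are entirely invented. Each action is simulated many times
and the one with the best average outcome is recommended:

import numpy as np

rng = np.random.default_rng(7)
# Hypothetical revenue distributions for each candidate marketing action.
actions = {
    "no_campaign": lambda: rng.normal(100_000, 5_000),
    "ads_from_february": lambda: rng.normal(115_000, 8_000),
    "ads_from_april": lambda: rng.normal(108_000, 6_000),
}

# Monte Carlo simulation of each choice, then recommend the best on average.
expected = {
    name: np.mean([draw() for _ in range(10_000)])
    for name, draw in actions.items()
}
best = max(expected, key=expected.get)
print(f"recommended action: {best} (expected revenue {expected[best]:,.0f})")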
The data science life cycle is a systematic approach to solving complex problems and
extracting insights from data. It involves a series of steps and processes that help data
scientists and analysts transform raw data into actionable information. The main steps of
the data science life cycle are explained below:
1. Problem Definition:
The first step in the data science life cycle is to clearly define the problem or question you
want to answer. This involves understanding the business context and the goals of the data
analysis.
2. Data Collection:
Once the problem is defined, you need to gather relevant data. Data can come from various
sources, such as databases, APIs, spreadsheets, or sensors. It's crucial to collect clean and
high-quality data for meaningful analysis.
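A minimal sketch of loading collected data with pandas; the inline sample below stands in for
a real database, API, or file source:

import io
import pandas as pd

# Inline CSV used as a stand-in for an external data source.
raw = io.StringIO("""date,destination,tickets
2023-05-01,Austin,140
2023-05-02,Austin,155
2023-05-03,Boston,90
""")
bookings = pd.read_csv(raw, parse_dates=["date"])
print(bookings.dtypes)  # confirm types were parsed as expected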
3. Data Preprocessing:
Raw data is often messy and incomplete. Data preprocessing involves cleaning, handling
missing values, and transforming the data into a format suitable for analysis. This step also
includes data normalization and feature engineering.
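A short pandas sketch of these preprocessing steps on an invented table: imputing missing
values, min-max normalization, and one engineered feature:

import numpy as np
import pandas as pd

# Small synthetic table with deliberate gaps.
df = pd.DataFrame({
    "date": pd.to_datetime(["2023-05-01", "2023-05-02", "2023-05-03", "2023-05-04"]),
    "tickets": [140, np.nan, 90, 210],
    "price": [199.0, 249.0, np.nan, 179.0],
})

# Handle missing values with simple imputation.
df["tickets"] = df["tickets"].fillna(df["tickets"].median())
df["price"] = df["price"].fillna(df["price"].mean())
# Min-max normalization to the [0, 1] range.
df["price_norm"] = (df["price"] - df["price"].min()) / (df["price"].max() - df["price"].min())
# Feature engineering: derive the day of the week from the date.
df["weekday"] = df["date"].dt.dayofweek
print(df)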
4. Exploratory Data Analysis (EDA):
EDA is the process of visually and statistically exploring the data to gain insights and
identify patterns, outliers, and relationships within the dataset. It helps in formulating
hypotheses and refining the analysis approach.
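A minimal EDA sketch on synthetic data: summary statistics, a correlation check, and a
simple z-score flag for candidate outliers:

import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "tickets": rng.poisson(120, size=365),
    "price": rng.normal(200, 25, size=365),
})

print(df.describe())   # distribution summaries per column
print(df.corr())       # pairwise correlations
# Flag rows more than three standard deviations from the mean.
z = (df["tickets"] - df["tickets"].mean()) / df["tickets"].std()
print(df[z.abs() > 3])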
5. Data Modeling:
In this step, you select the appropriate machine learning or statistical models to address the
problem. You split the data into training and testing sets and train the models on the training
data.
6. Model Evaluation:
After training, you evaluate the models' performance using appropriate metrics (e.g.,
accuracy, F1-score, ROC curve) on the testing data. This step helps you choose the
best-performing model.
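Steps 5 and 6 together in one scikit-learn sketch on synthetic data (the features and the
"spike" label are invented): split the data, train a classifier, and score it on the held-out
test set:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic features and a label derived from two of them.
rng = np.random.default_rng(5)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out 20% of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Evaluate on unseen data with the metrics mentioned above.
pred = model.predict(X_test)
print("accuracy:", round(accuracy_score(y_test, pred), 3))
print("f1:", round(f1_score(y_test, pred), 3))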
7. Model Tuning:
If the model's performance is not satisfactory, you can fine-tune hyperparameters, try
different algorithms, or adjust the feature selection process to improve the model's accuracy.
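A sketch of hyperparameter tuning with scikit-learn's GridSearchCV on the same kind of
synthetic data; the grid values are illustrative, not recommendations:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Try each hyperparameter combination with 5-fold cross-validation.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="f1",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))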
8. Deployment:
Once the best-performing model is chosen, it is deployed to a production environment where
it can serve predictions on new data. After deployment, it's essential to continuously monitor
the model's performance and make
necessary updates. Models can drift over time, and data distributions can change.
Feedback from the deployed model and end-users is essential for continuous improvement.
It may lead to refining the problem definition or reiterating through the data science life cycle.
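A toy sketch of deployment and monitoring: persist the model as an artifact with joblib,
reload it for serving, and flag drift by comparing live inputs against training inputs with a
two-sample Kolmogorov-Smirnov test. The drifted traffic here is simulated, and joblib and
scipy are common but optional tooling choices:

import numpy as np
from joblib import dump, load
from scipy.stats import ks_2samp
from sklearn.linear_model import LogisticRegression

# Train a simple model on one-dimensional synthetic data.
rng = np.random.default_rng(9)
X_train = rng.normal(0, 1, size=(500, 1))
y_train = (X_train.ravel() > 0).astype(int)
model = LogisticRegression().fit(X_train, y_train)

# "Deploy": persist the trained model and reload it for serving.
dump(model, "model.joblib")
served = load("model.joblib")
print(served.predict(X_train[:3]))

# "Monitor": has the input distribution shifted since training?
X_live = rng.normal(0.5, 1, size=(500, 1))  # simulated drifted traffic
stat, p_value = ks_2samp(X_train.ravel(), X_live.ravel())
print("drift suspected" if p_value < 0.01 else "input distribution stable")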
A data product is an application or tool that uses data to help businesses improve their
decisions and processes. Through a friendly user interface, a data product can bring
predictive analytics, descriptive data modelling, data mining, machine learning, risk
management, and a variety of other analysis methods to non-data scientists.
A data product is a reusable data asset, built to deliver a trusted dataset, for a specific
purpose. It collects data from relevant data sources — including raw data — processes it,
ensures data quality, and makes it accessible and understandable to anyone who needs it to
meet specific needs.