
Data Science: Real-World Applications

• Data Science has become an essential tool for businesses and organizations to make informed decisions and drive growth.

• As the amount of data being generated continues to grow, the importance of data science in solving complex problems and making data-driven decisions will only increase.

• Data science combines elements of mathematics, statistics, computer science, and domain expertise to solve real-world problems and make data-driven decisions.

• Data science involves the analysis of large and complex datasets using statistical methods, machine learning techniques, and data visualization.

• A dataset is a collection of related data.

• Machine learning is a subset of artificial intelligence that involves training computer systems to learn and improve from experience without being explicitly programmed.

• In other words, it is a way of teaching computers to learn from data, identify patterns, and make predictions or decisions.
• Data science has a wide range of applications across many industries.

• One of the most important applications of data science is in the healthcare industry. Machine learning algorithms can be used to analyze medical images and diagnose diseases, e.g. using chest X-ray images to detect whether a person has TB or COVID-19.

• Data science is also widely used in the finance industry. Financial institutions use data science to detect fraudulent transactions and prevent financial losses.
• In the education sector, data science can be used to improve student outcomes. Student performance data can be analyzed to identify areas of weakness and provide personalized learning experiences.

• Data science is also being used to improve transportation systems. Real-time traffic data can be used to optimize traffic flow and reduce congestion.

• Below is an example of the entire process of solving a real-world problem using Data Science.
Problem

You are the Senior Data Scientist at a major private bank. Over the last 6 months, the number of customers who are not able to repay their loans has increased. With this in mind, you have to look at your customer data and analyze which customers should be approved for a loan and which customers should be denied.
Tasks to be performed
• Domain: Banking
• Programming language: Python
• Note that Python is the most widely used programming language in data science.

1. Data collection

• The first step in applying data science to loan default prediction is to collect relevant data.

• The relevant data is primarily information about the borrower, such as gender, income, employment history, and other financial details, as shown below. The data was obtained from the bank in line with our problem statement.
• The structure of our dataset is a DataFrame. A DataFrame is a two-dimensional data structure composed of rows and columns. It is a fundamental data structure for data manipulation and analysis in Python.

• The dataset has 12 independent variables and 1 target variable (i.e. Loan_Status).
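As a minimal sketch of what such a DataFrame looks like, here is a small illustrative example built with pandas. The column names and values are assumptions for illustration only; the real bank dataset has 12 independent variables plus Loan_Status.

```python
import pandas as pd

# Illustrative rows with a few of the kinds of columns described above.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "ApplicantIncome": [5849, 4583, 3000],
    "CoapplicantIncome": [0.0, 1508.0, 0.0],
    "LoanAmount": [130.0, 128.0, 66.0],
    "Loan_Status": ["Y", "N", "Y"],       # the target variable
})

print(df.shape)   # (rows, columns)
print(df.dtypes)  # one dtype per column
```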
2. Data Cleaning
• Once the data has been collected, it is important to clean and preprocess it. Data preprocessing is the process of preparing the raw data and making it suitable for a machine learning model.

• This is often the lengthiest task. Without it, you'll likely fall victim to garbage-in, garbage-out.

• This task involves removing any duplicates, correcting errors, filling in missing values, and transforming the data into a format that can be easily analyzed. For example, you might convert string values that store numbers to numeric values so that you can perform mathematical operations.
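The cleaning steps above can be sketched in pandas. This is a toy example on made-up data, not the bank's actual preprocessing; it shows deduplication, string-to-numeric conversion, and median imputation of missing values.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "ApplicantIncome": ["5849", "4583", "4583", "3000"],  # numbers stored as strings
    "LoanAmount": [130.0, np.nan, np.nan, 66.0],          # contains missing values
})

# Remove exact duplicate rows.
clean = raw.drop_duplicates().copy()

# Convert string values that store numbers to a numeric dtype.
clean["ApplicantIncome"] = pd.to_numeric(clean["ApplicantIncome"])

# Fill missing loan amounts with the column median.
clean["LoanAmount"] = clean["LoanAmount"].fillna(clean["LoanAmount"].median())

print(clean)
```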
3. Feature Engineering
• Feature engineering involves creating new variables or features that can be used to improve the accuracy of the loan default prediction model.
• Based on domain knowledge, we can come up with new features that might affect the Loan_Status variable. We will create the following three new features:

a) Total Income – the sum of the Applicant Income and Coapplicant Income. If the total income is high, the chances of loan approval might also be high.
b) Equated Monthly Installment (EMI) – the monthly amount to be paid by the applicant to repay the loan. The idea behind this variable is that applicants with a high EMI might find it difficult to pay back the loan. We calculate the EMI as the ratio of the loan amount to the loan amount term.
c) Balance Income – the income left after the EMI has been paid. The idea behind this variable is that if this value is high, the chances are high that the person will repay the loan, which increases the chances of loan approval.

Let us now drop the variables we used to create these new features. The reason is that the correlation between those old features and the new features will be very high, which may result in a noisy dataset, so removing correlated features helps reduce the noise.
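The three features and the subsequent drop can be sketched as below. The column names (ApplicantIncome, Loan_Amount_Term, etc.) and the assumption that LoanAmount is recorded in thousands are illustrative guesses about the dataset, not confirmed details.

```python
import pandas as pd

df = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000],
    "CoapplicantIncome": [0.0, 1508.0, 0.0],
    "LoanAmount": [130.0, 128.0, 66.0],         # assumed to be in thousands
    "Loan_Amount_Term": [360.0, 360.0, 360.0],  # term in months
})

# a) Total Income: applicant plus co-applicant income.
df["Total_Income"] = df["ApplicantIncome"] + df["CoapplicantIncome"]

# b) EMI: ratio of loan amount to loan amount term.
df["EMI"] = df["LoanAmount"] / df["Loan_Amount_Term"]

# c) Balance Income: income left after the EMI is paid
#    (multiplied by 1000 under the assumption LoanAmount is in thousands).
df["Balance_Income"] = df["Total_Income"] - df["EMI"] * 1000

# Drop the highly correlated source variables to reduce noise.
df = df.drop(columns=["ApplicantIncome", "CoapplicantIncome",
                      "LoanAmount", "Loan_Amount_Term"])

print(df.columns.tolist())
```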
Checking the dataset after feature engineering
4. Model selection
• After feature engineering, the next step is to select a predictive model. Different classification models such as LightGBM, Decision Trees, Random Forest, Support Vector Machines, Logistic Regression, Neural Networks, or other machine learning algorithms can be used for this purpose.
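One common way to choose among candidate models is to compare their cross-validated accuracy. The sketch below uses scikit-learn with a synthetic stand-in dataset (the real loan features are not reproduced here) and two of the model families listed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the cleaned, feature-engineered loan data.
X, y = make_classification(n_samples=300, n_features=6, random_state=42)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, model in candidates.items():
    # Mean accuracy over 5 cross-validation folds for each candidate.
    scores[name] = cross_val_score(model, X, y, cv=5).mean()

best = max(scores, key=scores.get)
print(scores, "->", best)
```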

5. Model training
• The selected model is trained on the cleaned and preprocessed data.
• The model is iteratively adjusted and fine-tuned until it can accurately
predict loan defaults.
• At this stage, the dataset is split into a training set and test set.
• A common split ratio is 70-30, which means that 70% of the data is
used for training and 30% is used for testing.
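The 70-30 split and the training step can be sketched with scikit-learn. Again, a synthetic dataset stands in for the real loan data, and Logistic Regression stands in for whichever model was selected.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# 70-30 split: 70% of the rows for training, 30% held back for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print(len(X_train), len(X_test))  # 140 60
```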
6. Model evaluation
• Once your machine learning model is built (with your training data), you need unseen data to test it. This data is called testing data, and you can use it to evaluate the performance and progress of your model's training and adjust or optimize it for improved results.
• This can be done by computing various evaluation metrics such as accuracy, precision, recall, F1 score, and so on.
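The metrics named above can be computed with scikit-learn. The labels and predictions below are made up for illustration (1 = loan repaid, 0 = default), not real test-set output.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Hypothetical true labels and model predictions on a test set.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))   # correct / total
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1       :", f1_score(y_true, y_pred))         # harmonic mean of the two
```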

7. Model deployment
• Once the model has been trained and evaluated, it can be deployed in
a real-world scenario to predict loan defaults.
• This may involve integrating the model into an existing loan processing
system or developing a new system specifically for loan default
prediction.
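One simple deployment pattern is to serialize the trained model so that a loan-processing system can load it and score new applicants. The sketch below uses joblib (commonly used with scikit-learn); the filename and the trained model are illustrative assumptions.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a stand-in model on synthetic data.
X, y = make_classification(n_samples=100, n_features=6, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize the trained model to disk.
joblib.dump(model, "loan_default_model.joblib")

# In the serving system: load the model and score a new applicant.
deployed = joblib.load("loan_default_model.joblib")
prediction = deployed.predict(X[:1])
print(prediction)
```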
8. Continuous Improvement
• The final stage of the process is continuous improvement. It involves monitoring the model's performance, updating the model as required, improving the data quality, and integrating new data sources. This stage ensures that the model continues to provide accurate predictions over time.
Questions
