
PROJECTS

1. LOGISTIC REGRESSION MODEL TO PREDICT RISK OF HEART DISEASE


The main objective of this project is to predict whether a patient is at risk of heart disease based on the given dataset.
The dataset consists of around 8000 records. I found it on the Kaggle website; it is commonly known as the Framingham dataset.
It has columns/fields such as gender, age, blood pressure, glucose, BMI, cholesterol, etc.
STEP 1: Importing the dataset into the notebook.
• The first step is to import the libraries needed to load the dataset into the Jupyter notebook. Here I have imported numpy, used for basic mathematical calculations; pandas (short for panel data), used to store data in the form of a table, i.e. rows and columns; seaborn, used for statistical visualization; matplotlib, used for 2D graphs and charts; and sklearn, which mostly comprises machine learning tools.
• We store the dataset in the form of a data frame, which is a 2D table, and name it df.
• To view the first 15 records, I have used df.head(15).
• To gain knowledge about the dataset, I have used df.info(), which gives general information about each column: the number of entries, the datatypes, and so on.
• df.describe() is used to obtain information such as the count, mean, standard deviation and quantiles. Only the numerical columns are considered here, since statistical measures such as the mean and median can only be computed on numerical data.
• To describe the categorical columns, we can use df.describe(include=object), which gives the count, the number of unique values, the most frequently occurring value and the number of times it occurs. A minimal sketch of this step follows this list.
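A minimal sketch of the loading and inspection steps described above; the filename framingham.csv is an assumption and the actual file downloaded from Kaggle may be named differently:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset into a data frame named df
df = pd.read_csv("framingham.csv")

# First 15 records
print(df.head(15))

# General information: columns, non-null counts, datatypes
df.info()

# Count, mean, standard deviation and quantiles of the numerical columns
print(df.describe())

# Count, unique values, most frequent value and its frequency for object columns
print(df.describe(include=object))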
STEP 2: DATA CLEANING
• After gaining knowledge of the dataset, we can proceed to clean the data. Here, cleaning the data means removing garbage values, handling null values and outliers, and encoding columns.
• df.isna().sum() is used to find the number of null values present in each column of the dataset, and df.dropna() is used to drop them. The null values are removed because they can affect the accuracy of the model. If the percentage of null values were greater than 5%, we would ideally impute them with the mean or median, depending on the outliers present in that particular field.
• I have used df.info() to check for columns that are of datatype object even though they hold numerical values. For example, totChol is of datatype object but it should be float. This tells us that there might be some garbage values present in the column, and to confirm this I have used sort_values to sort the column in ascending order. As we can see, there are garbage values such as #, ! and ? that must be removed.
• To automate this process, I have used a for loop to iterate over the columns and remove the garbage values.
• Once the garbage values are removed, I have converted the columns to float datatype, since a logistic model only accepts numerical datatypes.
• I have used a label encoder, which is an encoding technique that converts a categorical column into a numerical one by assigning values ranging from 0 to n-1, where n is the number of unique values.
• Now all the columns in the dataset have been transformed into numerical datatypes and the data is ready for analysis.
• I have used df.skew() to measure the skewness present in each column of the dataset. Skewness is the asymmetric distribution of data and is caused by outliers. There are two types: positive skewness, caused by outliers in the maxima region of the data, and negative skewness, caused by outliers in the minima region.
• To visualize the outliers I have used a boxplot, also called a box-and-whisker plot, which shows the quartiles Q1, Q2 and Q3 along with the minimum and maximum; values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are treated as outliers.
• Since the outliers are relevant in this dataset (for example, a patient can have a blood pressure over 200, which does not mean the data is wrong), we are not handling them here. Otherwise, we could use a log transformation or the interquartile range (IQR) method to remove or suppress them. A sketch of these cleaning steps follows this list.
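The cleaning steps described above can be sketched as follows; the garbage symbols and the column names (totChol, sysBP, glucose, BMI, age) are assumptions based on the description and may need to be adjusted for the actual file:

from sklearn.preprocessing import LabelEncoder

# Number of null values per column, then drop the rows containing them
print(df.isna().sum())
df = df.dropna()

# Columns that should be numeric but were read as object because of
# garbage symbols such as #, ! and ? (column list is an assumption)
dirty_numeric = [c for c in ["totChol", "sysBP", "glucose"] if c in df.columns]
for col in dirty_numeric:
    df[col] = pd.to_numeric(df[col].replace(["#", "!", "?"], np.nan), errors="coerce")
df = df.dropna()

# Label-encode any genuinely categorical (object) columns, values 0 .. n-1
le = LabelEncoder()
for col in df.select_dtypes(include="object").columns:
    df[col] = le.fit_transform(df[col].astype(str))

# Skewness of each column and a boxplot of a few fields to inspect outliers
print(df.skew())
sns.boxplot(data=df[["age", "sysBP", "totChol", "BMI"]])
plt.show()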

STEP 3: DATA MINING AND VISUALIZATION

• Here I have visualized the spread of the data for the fields age, blood pressure, cholesterol and BMI. A histogram shows the frequency distribution of the data, i.e. how many times a value such as age 40 occurs.
• The second graph represents the correlation between two columns. Correlation describes how strongly two fields are related to each other, and its value ranges from -1 to +1: a value close to +1 indicates a strong positive correlation, close to -1 a strong negative correlation, and close to 0 a weak relationship.
• The third graph represents the count of active smokers, and we can conclude that active smokers are slightly fewer than non-smokers.
• The fourth graph is a pie chart that represents the percentage distribution of patients with heart disease.
• The next graph is a scatter plot, which is used between two quantitative fields to assess the relationship between them. A slight trend can be observed: as age increases, cholesterol also increases.
• This graph represents the number of diabetic patients among the patients diagnosed with heart disease, and it shows that most of them are diabetic.
• Similar to the previous graph, I have visualized the age factor for patients that have heart disease.
• This graph relates education levels and heart disease.
• Finally, I have carried out a statistical test and examined the p-value, a probability value that tells us whether a field supports the alternate hypothesis, i.e. that there is a relationship between the field and the target. A p-value is considered significant only if it is less than 0.05; since all the fields have p-values below 0.05, I am considering all of them for the model. A sketch of these steps follows this list.
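A sketch of this exploration step; the column names and the target name TenYearCHD follow the usual Kaggle Framingham layout and are assumptions, and since the exact statistical test used in the project is not stated, a chi-square test is shown as one option:

from scipy import stats

# Correlation between all numeric fields, shown as a heatmap
sns.heatmap(df.corr(), cmap="coolwarm")
plt.show()

# Scatter plot between age and cholesterol
sns.scatterplot(x="age", y="totChol", data=df)
plt.show()

# p-value of each field against the target (chi-square test on a
# contingency table; for continuous fields other tests may be preferable)
target = "TenYearCHD"
for col in df.columns.drop(target):
    table = pd.crosstab(df[col], df[target])
    _, p, _, _ = stats.chi2_contingency(table)
    print(col, "p-value =", round(p, 4))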

STEP 4: Fitting the Machine Learning Model

• The first step in model fitting is to divide the columns into independent and dependent variables. The independent variables, also called predictor variables, are used to predict the dependent variable and are denoted by X; the dependent variable, also known as the target variable, is denoted by y.
• The next step is to split the dataset into a train set and a test set, because we do not want the model to memorize the values it is later evaluated on. A common convention is 70% for training and 30% for testing.
• We then apply the standard scaling technique to bring all the columns onto a single, unitless scale. Mathematically, z = (x - mean) / standard deviation.
• Now we fit the model. Logistic regression is a machine learning technique that models the relationship between the independent variables and the dependent variable by passing the fitted linear combination of the predictors through a sigmoid function. The main objective of logistic regression is to predict whether an event occurs or not based on a probability; in this case, whether a patient has heart disease or not. A sketch of these steps follows this list.
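A sketch of the split, scaling and fitting steps (the target column name TenYearCHD is an assumption):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Independent variables X and dependent (target) variable y
X = df.drop(columns=["TenYearCHD"])
y = df["TenYearCHD"]

# 70% train / 30% test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Standard scaling: z = (x - mean) / standard deviation
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Fit logistic regression and predict on the unseen test set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)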

STEP 5: Model Evaluation

• Once the model is fitted and the values are predicted, we need to evaluate the model using metrics such as accuracy, precision, recall and F1 score.
• Accuracy is the number of correct predictions divided by the total number of predictions. The problem is that it weighs both classes (0 and 1) equally, which is why we also use precision and recall.
• The confusion matrix shows the distribution of the predicted values: 1-1 is a true positive, 0-0 a true negative, 0-1 a false positive and 1-0 a false negative.
• The evaluation metric most relevant to this model is precision, since it tells us how many of the predicted positive values are actually correct, and it is the metric used in life/death situations. Example: patient…..
• Recall is a metric that describes how many of the actual positive values were correctly identified, and it is typically used in marketing models. Example: Store.…
• The F1 score is the harmonic mean of precision and recall.
• To conclude, since the precision score is 0.61, our model isn't too bad, but it can be improved using other advanced techniques such as smoothing, neural networks and gradient boosting machines. A sketch of the evaluation step follows this list.
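A sketch of the evaluation step, continuing from the fitted model above:

from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
)

# Confusion matrix: rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))

# Accuracy, precision, recall and F1 score for the positive class
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1 score :", f1_score(y_test, y_pred))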
2. Insurance Firm Data Visualization using Tableau
Objective: Analyse an insurance firm dataset and provide insights to minimize
the loss to the firm, including a recommendation statement to increase the
premium amount for selected customers based on the analysis.
Scope: This dataset contains around 7600 records and includes fields such as kidsdrive, homekids, singleparent, homevalue, education, occupation, caruse, cartype and urbancity.
• I have considered one numerical field at a time and visualized it against the other fields. For example, here I have taken age and built a dashboard around it; a dashboard is a combination of multiple single views/visualizations. Age vs claim frequency is a line chart that represents the frequency of claims at each age; age vs claim amount is a bar graph that shows the sum of the claim amount for each customer age; and the third view is a Pareto chart, a combination of a bar and a line chart used to analyse the cumulative effect of claim amount across ages. I have set the Pareto parameter to 70%, which tells us that 70% of the claim amount is claimed by customers born between 1945 and 1960 (a small sketch of this cumulative logic, outside Tableau, follows this list).
• Next, considering the education factor, we can conclude from this dashboard that customers whose education level is masters or PhD have a lower claim frequency as well as a lower claim amount.
• The next factor is caruse, which describes whether the car is used for commercial or private purposes. I have used an interactive filter that interacts with all the charts present in the dashboard. For females, private use accounts for the larger claim amount, whereas for males the claim amount is somewhat higher for commercial use.
• The next factor is cartype, and it is evident that sports cars and panel trucks are an issue, since both the claim amount and the claim frequency are higher for them. This chart is called a treemap, a visual method based on hierarchy: it comprises nested rectangles called branches, and the bigger the branch, the bigger the proportion.
• Finally, I have considered the occupation factor, which covers professions such as doctor, student, homemaker, etc. According to the visualization, it is clear that students and homemakers have the highest claim frequency as well as the lowest income, whereas doctors and managers have much higher incomes. The chart used here is a highlighted table.
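The Pareto view itself was built in Tableau; purely to illustrate the cumulative logic behind it, a pandas sketch might look like this (the filename and the column names birth_year and claim_amount are assumptions):

import pandas as pd

ins = pd.read_csv("insurance.csv")

# Total claim amount per birth year, largest first, then the cumulative
# share of the overall claim amount
by_year = ins.groupby("birth_year")["claim_amount"].sum().sort_values(ascending=False)
cumulative_pct = by_year.cumsum() / by_year.sum() * 100

# Birth years that together account for roughly 70% of the claim amount
print(cumulative_pct[cumulative_pct <= 70])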
3. SUPPLY CHAIN FIRM ANALYSIS USING MYSQL
Objective: Analyse a supply chain firm and provide insights regarding orders, suppliers, customers, products in demand, revenue generated by each product month-wise and year-wise, competitors for a particular product, ranking suppliers based on the revenue they generate, discovering inactive clients who need attention, and so on.
Scope: The data has around 64000 records and tables such as customers, department, orders, product and shipping, where each table consists of the columns suggested by its name.
I have performed the analysis at different levels of granularity and hence divided it into four parts. The first part is an analysis of orders, where we can investigate the causes of late deliveries.
The second level is based on customers, where we can identify the customers who have ordered the most and prioritize them.
The third level is a product-level as well as city-level analysis, where I have analysed the departments that are in demand, the products that need attention, and so on. A small sketch of one of these analyses (ranking suppliers by revenue) follows below.
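The analysis itself was done in MySQL; as an illustration of one of the queries described above (ranking suppliers by the revenue they generate), an equivalent pandas sketch could look like this (the filename and the column names supplier_id and revenue are assumptions):

import pandas as pd

orders = pd.read_csv("orders.csv")

# Total revenue per supplier, then a dense rank with the highest revenue first
revenue = orders.groupby("supplier_id")["revenue"].sum()
ranking = revenue.rank(method="dense", ascending=False).astype(int)
print(pd.DataFrame({"revenue": revenue, "rank": ranking}).sort_values("rank"))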

4. Analysis of a Real Estate Firm using Excel


Business Objective: Analyse the magnitude of each variable and how it can affect the price of a house in a particular locality, and build a linear regression model to predict the price of the house.
Scope: The dataset is based on the city of Boston in the USA. It contains records of around 500 houses and has fields such as age, tax, distance from the nearest highway, indus (proportion of non-retail business acres), avg_room (average number of rooms), lstat (lower status of the population), ptratio (pupil-teacher ratio) and nox (nitric oxide concentration).
• Generate inferences on the kurtosis and skewness of each variable.
• Compute the covariance matrix.
• Compute the correlation matrix.
• Build a model using all variables.
• Build a model using only the variables with significant p-values.
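The analysis and the model were built in Excel; as a rough Python equivalent of the listed steps (the filename boston.csv and the column name price are assumptions), the same computations could be sketched with pandas and statsmodels:

import pandas as pd
import statsmodels.api as sm

housing = pd.read_csv("boston.csv")

# Skewness and kurtosis of each variable
print(housing.skew())
print(housing.kurtosis())

# Covariance and correlation matrices
print(housing.cov())
print(housing.corr())

# Linear regression of price on all other variables
X = sm.add_constant(housing.drop(columns=["price"]))
y = housing["price"]
model = sm.OLS(y, X).fit()
print(model.summary())

# Refit using only the variables whose p-value is below 0.05
pvals = model.pvalues.drop("const")
significant = pvals[pvals < 0.05].index
X_sig = sm.add_constant(housing[list(significant)])
print(sm.OLS(y, X_sig).fit().summary())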
