
Logistic Regression

The document outlines a data mining project to predict election winners in India using state-level polling data from 2004, 2008, and 2012. It describes cleaning and normalizing the data, building logistic regression models on the training data from 2004 and 2008, and evaluating the models' accuracy on the 2012 test data, finding an accuracy of 96.77%. The conclusion is that the model performs well for predicting state winners.

Data Mining Project

(Predict Election Winners)


- By Harshal Kolhatkar
Problem Statement

• An election is to be held next month. ABC Corporation, a data analytics company, wants to predict the prospects of the two largest parties in the country.

• Two major parties are BJP & Congress.

• Goal : Use polling data to predict the winner in each state.


Given Dataset

Each instance represents a state in a given election.

• State : Name of the state

• Year : Election year (2004,2008,2012)

Dependent Variable

• BJP : 1 if BJP won the state, 0 if Congress won.

Independent Variable

• Times Now, India Today : Polled BJP % - Polled Congress %

• DiffCount : Polls with BJP winner - Polls with Congress winner

• PropBJP : Polls with BJP winner / Total number of polls
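The two count-based variables can be derived from a state's individual poll margins. The sketch below is only an illustration of the definitions above: the function name `poll_features` and the sample margins are hypothetical, and it assumes a poll with a positive margin counts as a BJP win, a negative margin as a Congress win, and a tie as neither.

```python
def poll_features(margins):
    """Compute (DiffCount, PropBJP) for one state.

    margins: list of per-poll margins, polled BJP % minus polled Congress %.
    Assumption: margin > 0 means the poll picked BJP, margin < 0 Congress.
    """
    bjp_polls = sum(1 for m in margins if m > 0)
    congress_polls = sum(1 for m in margins if m < 0)

    diff_count = bjp_polls - congress_polls
    # PropBJP is undefined when a state has no polls at all.
    prop_bjp = bjp_polls / len(margins) if margins else None
    return diff_count, prop_bjp


# Hypothetical state with four polls: three favour BJP, one favours Congress.
print(poll_features([5.0, -2.0, 3.5, 1.0]))  # (2, 0.75)
```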


Data Cleaning
• Summary of Polling Data
Data Cleaning – Packages to handle Missing Values

List of R Packages
1. MICE (Multivariate Imputation by Chained Equations)
2. Amelia
3. missForest
4. Hmisc
5. mi
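To show what filling a missing value means in practice, here is a minimal Python sketch of single mean imputation. It is a deliberately simpler technique than the multiple imputation performed by packages such as MICE, and it is not the procedure used in this project. The function name `impute_mean` and the sample column are hypothetical.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical poll-margin column with one missing value.
times_now = [4.0, None, -6.0, 8.0]
print(impute_mean(times_now))  # [4.0, 2.0, -6.0, 8.0]
```

Mean imputation shrinks the variance of the column, which is one reason chained-equation methods are usually preferred for modeling.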
Data Cleaning
• Graphical Representation of Missing Values

[Figure: missing-value map, before cleansing vs. after cleansing]


Data Visualization
Before normalizing: Mean = 0.2525253, Standard Deviation = 14.27238
After normalizing: Mean = 0.02858385, Standard Deviation = 1.026924


Data Visualization
Before normalizing: Mean = 0.3838384, Standard Deviation = 15.45745
After normalizing: Mean = 0.02858385, Standard Deviation = 1.026924
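The slides do not say which normalization was applied. A common choice is the z-score, sketched below under that assumption. The function name `z_scores` and the sample values are hypothetical. Exact z-scoring of a column yields a mean of 0 and a standard deviation of 1, so the slightly different values reported above suggest the statistics were taken on a somewhat different set of rows or with a different procedure.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize values to mean 0 and sample standard deviation 1."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]


# Hypothetical poll margins in percentage points.
print(z_scores([2.0, 4.0, 6.0]))  # [-1.0, 0.0, 1.0]
```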


Data Visualization – Checking Normality
[Figure: Times Now, before normalizing vs. after normalizing]

Data Visualization – Checking Normality
[Figure: India Today, before normalizing vs. after normalizing]
Data Modeling

• Collinearity is a linear association between two explanatory variables.

• Two variables are perfectly collinear if there is an exact linear relationship between them.
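One simple way to screen a pair of explanatory variables for collinearity is the Pearson correlation coefficient, which equals +1 or -1 exactly when the two variables are perfectly collinear. The sketch below is a generic illustration: the function name `pearson_r` and the sample data are hypothetical, and the slides do not state which diagnostic was actually used.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# y = 2x + 1 is an exact linear function of x, so the pair is perfectly collinear.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
print(round(pearson_r(x, y), 6))  # 1.0
```

When two predictors are strongly correlated, keeping both in a logistic regression tends to make the individual coefficients unstable, which is a common reason to drop one of them.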
Data Modeling (Using Train & Test)

Years : 2004, 2008, 2012

Train : 2004, 2008

Test : 2012
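The year-based split can be sketched as follows. The row layout (dictionaries with `State`, `Year` and `BJP` keys) and the sample rows are hypothetical stand-ins for the actual data set.

```python
TRAIN_YEARS = {2004, 2008}
TEST_YEARS = {2012}


def split_by_year(rows):
    """Split rows into (train, test) using the election year of each row."""
    train = [r for r in rows if r["Year"] in TRAIN_YEARS]
    test = [r for r in rows if r["Year"] in TEST_YEARS]
    return train, test


# Hypothetical rows; the outcome values are placeholders, not real results.
rows = [
    {"State": "A", "Year": 2004, "BJP": 1},
    {"State": "A", "Year": 2008, "BJP": 0},
    {"State": "A", "Year": 2012, "BJP": 1},
]
train, test = split_by_year(rows)
print(len(train), len(test))  # 2 1
```

Splitting by year rather than at random keeps the evaluation honest: the model is tested only on an election that comes after everything it was trained on.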
Data Model (Logistic Regression)
[Figure: model output side by side: with India Today + PropBJP (left), with PropBJP only (right)]
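For readers unfamiliar with how a logistic regression is fitted, the following is a minimal single-predictor sketch using plain gradient descent on the log-loss. It is not the author's model: the function names, the learning rate, the number of epochs and the toy data are all assumptions made for illustration, and statistical software typically fits the model with a different optimizer.

```python
from math import exp


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit P(y = 1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of log-loss w.r.t. z
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x, threshold=0.5):
    """Return 1 (BJP win) if the predicted probability reaches the threshold."""
    return 1 if sigmoid(w * x + b) >= threshold else 0


# Toy data: a normalized poll predictor and the outcome (1 = BJP won).
xs = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
print([predict(w, b, x) for x in xs])  # [0, 0, 0, 1, 1, 1]
```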
Data Model (Logistic Regression)
• Train model (Years 2004, 2008)
• Accuracy on training data = 94.11%
• Test model (Year 2012)
• Accuracy on test data = 96.77%
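Accuracy here is the share of states whose winner was predicted correctly. The sketch below shows the calculation. The function name and the label vectors are hypothetical: the slides do not give the number of test states, and 30 correct out of 31 is used only because it is one count that reproduces the reported 96.77%.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the actual outcomes."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical outcomes for 31 states, with exactly one wrong prediction.
actual = [1] * 16 + [0] * 15
predicted = list(actual)
predicted[0] = 0  # the single misclassified state

print(f"{accuracy(predicted, actual):.2%}")  # 96.77%
```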
Conclusion

• I conclude that the model I built performs well at predicting the 2012 data.
• The model can therefore be used to predict state winners.
