
DMDW Lab Progress Report: 'C:/Users/KIIT/Desktop/DMDW Lab/players - CSV'

The document summarizes applying statistical analysis methods in R on a dataset of football players: 1. The data was imported into RStudio and the top 100 players were extracted for analysis. Visualizations including scatter plots and bar graphs were created to examine relationships between variables like price, overall rating, age, etc. 2. Additional data on countries and continents was merged in. Variables were cleaned and filtered to select relevant attributes for modeling. 3. Multiple linear regression identified that age, overall rating, and potential best predicted a player's price, providing a formula to estimate predicted price. New variables for predicted and actual price difference were added.

Uploaded by

Ahmad Alsharef

DMDW Lab Progress Report

This report applies several R statistical operations and analysis methods to a
dataset containing information about football players from all over the
world, including their names, nationalities, overall rating, club, etc.
1. First, I collected the data and imported it into RStudio.

> players <- read.csv('C:/Users/KIIT/Desktop/DMDW Lab/players.csv')

2. Then I extracted the top 100 players in the world (the first 100 rows) into
a new dataset called top100players to simplify the analysis.

> top100players <- head(players, 100)
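
head() returns the first rows in file order, so it yields the top players only if the file is already sorted by rating. A minimal sketch of ranking explicitly before taking the first rows, assuming the rating column is named Overall as in the later steps; the data frame players_demo and its values are invented for illustration:

```r
# Hypothetical miniature stand-in for the players data (values invented).
players_demo <- data.frame(
  Name    = c("A", "B", "C", "D", "E"),
  Overall = c(81, 94, 77, 90, 88),
  stringsAsFactors = FALSE
)

# Sort by Overall, highest first, then keep the first n rows.
n <- 3
ranked <- players_demo[order(players_demo$Overall, decreasing = TRUE), ]
top_n_demo <- head(ranked, n)
print(top_n_demo)
```

If the CSV is already ordered by rating, the explicit sort changes nothing and head() alone is enough.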

3. I used the ggplot2 library to draw plots of top100players:

• The following command draws a qplot of the top 100 players in the world,
colored by nationality:
> library(ggplot2)
> qplot(top100players$ID, top100players$Club, color = top100players$Nationality)

• Using ggplot to visualize how many of the top100players each club includes:
> ggplot(top100players, aes(x = Club)) +
    geom_bar(color = "black", fill = "lightblue", linetype = "dashed", alpha = 0.5) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

The x axis represents the club.
The bar outline color is black.
The fill color is light blue.
The outline type is dashed.

• Visualizing how many professional players from each country are in each
club:
> qplot(top100players$ID, top100players$Club, color = top100players$Nationality)
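
The bar chart of players per club can be cross-checked numerically. A small sketch using base R's table(), with a made-up vector of club names standing in for top100players$Club:

```r
# Hypothetical club labels (not taken from the report's data).
clubs_demo <- c("Club X", "Club Y", "Club X", "Club Z", "Club X", "Club Y")

# Count players per club and list the largest clubs first.
club_counts <- sort(table(clubs_demo), decreasing = TRUE)
print(club_counts)
```

The heights of the bars in the geom_bar plot correspond to these counts.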
4. Adding the continent of each player's national team to the dataset by
merging two datasets:
I imported a dataset called continent, which lists each country in the world
and the continent that contains it.
> continent <- read.csv('C:/Users/KIIT/Desktop/2nd Semester/DMDW Lab/UNSD.csv')
I merged the continent dataset with the top100players dataset using an
inner join.
> m <- merge(top100players, continent, by = "Nationality")
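
An inner join silently drops players whose Nationality has no match in the continent table (for example, because of a spelling difference). A hedged sketch of how to spot such rows with a left join; both data frames and their values are invented, and the Continent column name is an assumption:

```r
# Hypothetical stand-ins for top100players and continent.
top_demo <- data.frame(
  Name        = c("P1", "P2", "P3"),
  Nationality = c("Brazil", "France", "Atlantis"),
  stringsAsFactors = FALSE
)
continent_demo <- data.frame(
  Nationality = c("Brazil", "France", "Japan"),
  Continent   = c("South America", "Europe", "Asia"),
  stringsAsFactors = FALSE
)

# Inner join (the default): only nationalities present in both tables survive.
inner <- merge(top_demo, continent_demo, by = "Nationality")

# Left join: keep every player; unmatched ones get NA in Continent.
left <- merge(top_demo, continent_demo, by = "Nationality", all.x = TRUE)
unmatched <- left$Name[is.na(left$Continent)]

print(nrow(inner))
print(unmatched)
```

Comparing nrow(inner) with nrow(top_demo) is a quick check that no players were lost in the merge.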
5. Cleaning and filtering the dataset to make it useful for
prediction:
• First, I installed and loaded the package that allows selecting a subset
of the dataset's columns:

> install.packages("tidyverse")
> library(tidyverse)

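• Selecting the useful attributes for the top 1000 players:

> # Age, Potential, Height, Weight, International.Reputation, Weak.Foot and
> # Skill.Moves are kept as well because the later steps use them.
> top1000players <- head(players, 1000) %>%
    select(Name, Age, Club, Overall, Potential, Position, Height, Weight,
           International.Reputation, Weak.Foot, Skill.Moves,
           LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM, RM,
           LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB, RB,
           Value, Wage, Joined, Contract.Valid.Until, Release.Clause)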
• Removing the K suffix from Wage and the M suffix from Value.
• Removing lbs from Weight.
• Eliminating rows with missing values from the entire data frame:
> top1000players <- na.omit(top1000players)
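• Converting feet and inches to cm using the following function:
convert_to_cm <- function(Height) {
  # Height is a string such as 5'11: feet in character 1, inches in characters 3-4
  feet <- as.integer(substr(Height, 1, 1))
  inches <- as.integer(substr(Height, 3, 4))
  # 1 foot = 30.48 cm, 1 inch = 2.54 cm
  cm <- round(30.48 * feet + 2.54 * inches)
  return(cm)
}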
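
The report does not show the code used to strip the suffixes, so the following is only a sketch of one way the cleaning steps above could be done. It assumes raw strings of the form "€110.5M", "€565K", "159lbs" and "5'11"; the sample values and the helper names strip_to_number and height_to_cm are hypothetical:

```r
# Hypothetical raw strings in the assumed format (not taken from the report's data).
value_raw  <- c("\u20ac110.5M", "\u20ac77M", "\u20ac600K")
wage_raw   <- c("\u20ac565K", "\u20ac405K", "\u20ac12K")
weight_raw <- c("159lbs", "183lbs", "150lbs")
height_raw <- c("5'7", "6'2", "5'11")

# Keep only digits and the decimal point, then convert to numeric.
strip_to_number <- function(x) as.numeric(gsub("[^0-9.]", "", x))

# Value in millions: amounts quoted in K are divided by 1000 so units stay consistent.
value_m <- ifelse(grepl("K$", value_raw),
                  strip_to_number(value_raw) / 1000,
                  strip_to_number(value_raw))
wage_k     <- strip_to_number(wage_raw)    # wages stay in thousands
weight_lbs <- strip_to_number(weight_raw)  # weights stay in pounds

# Split on the apostrophe instead of relying on fixed character positions.
height_to_cm <- function(height) {
  parts  <- strsplit(as.character(height), "'", fixed = TRUE)
  feet   <- as.integer(vapply(parts, `[`, character(1), 1))
  inches <- as.integer(vapply(parts, `[`, character(1), 2))
  round(30.48 * feet + 2.54 * inches)
}
height_cm <- height_to_cm(height_raw)

print(data.frame(value_m, wage_k, weight_lbs, height_cm))
```

Rescaling K-valued amounts matters because simply deleting the letters would make a 600K player look more expensive than a 110.5M one.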

6. Using scatter plots to see whether there is a relationship
between the player price and each variable:
> top1000players$Height <- convert_to_cm(top1000players$Height)
> plot(x = top1000players$Overall, y = top1000players$Value, ylim = c(2.5, 100))
• The relationship between the price (Value) and Overall.
> plot(x = top1000players$Potential, y = top1000players$Value, ylim = c(2.5, 100))
• The relationship between Value and Potential.

> plot(x = top1000players$International.Reputation, y = top1000players$Value, ylim = c(2.5, 100))
No relationship between the price and the reputation.

> plot(x = top1000players$Weak.Foot, y = top1000players$Value, ylim = c(2.5, 100))
No relationship.
> plot(x = top1000players$Skill.Moves, y = top1000players$Value, ylim = c(2.5, 100))
No relationship.
> plot(x = top1000players$Age, y = top1000players$Value, ylim = c(2.5, 100))
• The relationship between Value and Age.

> plot(x = top1000players$Contract.Valid.Until, y = top1000players$Value, ylim = c(2.5, 100))

• The attributes that affect the price are:
Age, Overall, and Potential.
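
Reading relationships off scatter plots is subjective, so a correlation coefficient can back up the visual impression. A minimal sketch with invented numbers, using column names from the steps above:

```r
# Hypothetical values chosen only to illustrate a strong and a weak relationship.
demo <- data.frame(
  Overall   = c(80, 82, 85, 88, 90, 92),
  Weak.Foot = c(3, 4, 2, 4, 3, 3),
  Value     = c(12, 18, 30, 45, 60, 85)
)

# Pearson correlation of each candidate predictor with Value.
cors <- sapply(demo[c("Overall", "Weak.Foot")],
               function(col) cor(col, demo$Value))
print(round(cors, 2))
```

A coefficient close to 1 or -1 suggests a strong linear relationship, while one near 0 matches the "No relationship" plots. Pearson correlation only measures linear association, so the scatter plots remain useful alongside it.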
7. Multiple Regression:
> model <- lm(Value ~ Age + Overall + Potential, data = top1000players)
> # Show the model.
> print(model)
Model:
Call:
lm(formula = Value ~ Age + Overall + Potential, data = top1000players)

Coefficients:
(Intercept)          Age      Overall    Potential
  -282.7408      -0.5814       3.5326       0.4033
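
print(model) shows only the coefficients. To judge how well the model fits, summary() also reports standard errors, p-values and R-squared. A sketch on synthetic data, where the generating coefficients are invented and only loosely echo the report's:

```r
# Synthetic data: an assumption for illustration, not the report's dataset.
set.seed(1)
n <- 200
demo <- data.frame(
  Age     = sample(18:35, n, replace = TRUE),
  Overall = sample(75:94, n, replace = TRUE)
)
demo$Potential <- pmin(demo$Overall + sample(0:5, n, replace = TRUE), 95)
demo$Value <- -280 - 0.6 * demo$Age + 3.5 * demo$Overall +
  0.4 * demo$Potential + rnorm(n, sd = 1)

fit <- lm(Value ~ Age + Overall + Potential, data = demo)
s <- summary(fit)

print(coef(s))       # estimates, standard errors, t values, p-values
print(s$r.squared)   # share of the variance in Value explained by the model
```

On the real data, a low R-squared or large p-values would indicate that the three attributes explain the price less well than the formula alone suggests.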

• The formula that represents the relationship between the price (Value)
and the attributes Age, Overall, and Potential is:
PredictedPrice = -282.7408 - 0.5814*Age + 3.5326*Overall + 0.4033*Potential
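
As a worked example of the formula, consider a hypothetical player aged 25 with Overall 90 and Potential 92. The function name predicted_price is introduced here only for illustration:

```r
# Coefficients copied from the model output above.
predicted_price <- function(age, overall, potential) {
  -282.7408 - 0.5814 * age + 3.5326 * overall + 0.4033 * potential
}

# -282.7408 - 14.5350 + 317.9340 + 37.1036 = 57.7618
example_price <- predicted_price(age = 25, overall = 90, potential = 92)
print(example_price)
```

The result is in the same units as the cleaned Value column.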

8. We add two new attributes to the dataset:

# PredictedPrice contains the predicted price of the player:
> top1000players$PredictedPrice <- with(top1000players, -282.7408 - 0.5814*Age + 3.5326*Overall + 0.4033*Potential)
# Difference contains the difference between the real price and the
# predicted price:
# If it is <= 0, the player is worth his price: the real price is
# no more than the predicted price.
# If it is > 0, the player is not worth his price: the real price is
# more than the predicted price.
> top1000players$Difference <- with(top1000players, Value - PredictedPrice)
We can inspect the Difference column to see who is worth his price
and who is not.
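
Typing the rounded coefficients by hand works, but predict() computes the same quantity directly from the fitted model and avoids transcription and rounding errors. A sketch on a small invented data frame (the names and values are hypothetical):

```r
# Hypothetical players; values invented for illustration only.
demo <- data.frame(
  Name      = c("P1", "P2", "P3", "P4", "P5", "P6"),
  Age       = c(21, 24, 27, 30, 33, 26),
  Overall   = c(84, 88, 91, 89, 86, 93),
  Potential = c(91, 90, 92, 89, 86, 94),
  Value     = c(35, 52, 80, 50, 20, 105)
)

fit <- lm(Value ~ Age + Overall + Potential, data = demo)

# Predicted price from the model, and real minus predicted price.
demo$PredictedPrice <- predict(fit, newdata = demo)
demo$Difference <- demo$Value - demo$PredictedPrice

# Most negative Difference first: players priced furthest below the prediction.
ranked <- demo[order(demo$Difference),
               c("Name", "Value", "PredictedPrice", "Difference")]
print(ranked)
```

When the predictions are made on the same rows used for fitting, Difference is simply the model's residuals, so residuals(fit) gives the same column.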
