DATA I Revision Data Analysis

The document covers sampling techniques, including probability and non-probability sampling methods, and their applications in statistical analysis. It also discusses bivariate analysis methods such as Pearson's correlation and ANOVA, as well as principal components analysis (PCA) for data dimensionality reduction. Additionally, it provides practical examples of data manipulation and analysis using R programming.


Chapter 1: Sampling:

Sampling vs Census:
-Census: all the population members are enumerated
-Sampling: a data set (subgroup) is selected randomly from the reference population.
[NB: Sampling Frame: the list of all contacts or identifiers of the population subjects/units from which those to be contacted for inclusion in the sample are obtained.]

Probability vs non probability sampling:


-Probability sampling: all the subjects of the population get an opportunity (a known, non-zero probability) to be in the sample.
-Non-probability sampling: it is not known whether a given individual will be in the sample or not.
=> We use non-probability sampling methods when we do not have the sampling frame.

When is a sample representative? When it is selected at random and is big enough.


The bigger the sample, the smaller the margin of error.

Probability Sampling Techniques:


1) Simple Random Sampling (SRS):
● All the subjects of the reference population are equally likely to be selected.
● We assign numbers to the population subjects and then randomly choose from
those numbers through an automated process.
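A minimal R sketch of the idea (the values of N and n are illustrative, not from the course):
N = 500
n = 50
srs = sample(1:N, size = n, replace = FALSE) #the numbers of the subjects selected for the sample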

2) Systematic Sampling:
● You choose every "nth" individual to be part of the sample (there is a gap, or a constant interval, between each two successive selected units).
● Steps to follow:
1. Number all units of the population from 1 to N.
2. Determine the sampling interval K by dividing the population size N by the sample size n: K = N/n.
3. Randomly select a number between 1 and K. This number d is the origin and is the first number included in the sample.
4. Select every Kth unit after this first number. The sample obtained is formed
by the units of order: d, d + K, d + 2K, d + 3K, …, d + (n - 1)K
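Worked example (illustrative numbers): with N = 1000 and n = 100, the interval is K = 1000/100 = 10; if the random start is d = 7, the sample is formed by the units 7, 17, 27, …, 997.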

3) Stratified random sampling:

● The population is subdivided into strata (relatively homogeneous groups) which are mutually exclusive.
● Could be proportionate or disproportionate; the only difference between proportionate and disproportionate stratified random sampling is their sampling fractions:
+Proportionate: the same proportion of individuals is drawn from each stratum; the sampling rate is the same in all strata.
+Disproportionate: the different strata have different sampling fractions.
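A minimal R sketch of the proportionate case, assuming a data frame pop with a column stratum (names are illustrative):
library(dplyr)
#Proportionate: same 10% sampling rate in every stratum
strat_sample = pop %>% group_by(stratum) %>% slice_sample(prop = 0.10) %>% ungroup()
#Disproportionate: a different proportion (or a fixed size) would be drawn in each stratum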

4) Cluster Sampling:
● Separate the population into subgroups called clusters. A random sample of clusters is selected, and all of their elements are included in the sample.

5) Multistage Sampling:
● Similar to cluster sampling, except that in this case a further sample is drawn within each selected cluster.
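A minimal R sketch of both ideas, assuming a data frame pop with a column cluster (names are illustrative):
clusters = unique(pop$cluster)
chosen = sample(clusters, size = 5) #randomly select 5 clusters
#Cluster sampling: keep every unit of the selected clusters
cluster_sample = pop[pop$cluster %in% chosen, ]
#Multistage sampling: additionally draw a sub-sample of 10 units inside each selected cluster
idx = unlist(lapply(chosen, function(cl) sample(which(pop$cluster == cl), size = 10)))
multistage_sample = pop[idx, ]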

Non-Probability Sampling Techniques:

● Judgemental sampling: sample selection is based on the researcher's judgement about the entire population.
● Quota method: ensuring the representativeness of a sample by giving it a structure similar to that of the base population.
The main difference from stratified sampling is that in stratified sampling you draw a random sample from each subgroup (probability sampling), whereas in quota sampling you select a predetermined number or proportion of units in a non-random manner (non-probability sampling).
● Route sampling: field staff do not enumerate all households within a selected area; instead, they are given a starting point and a set of instructions for selecting households while in the field.

Sample adjustment: results often depend on how representative the sample is of the population. The results of statistical surveys can be improved by integrating additional information into the calculation of the estimators. This process is generally called sample adjustment.
● Adjusting by weighting: each observation is given a weight so that the weighted sample has the same structure as the reference population.
● Adjusting by deleting: in order to have the same characteristics as the reference population, we can randomly select some observations from the sample to delete.
● Adjusting by bootstrapping: in order to have the same characteristics as the reference population, we can create a new bootstrap sample from the available sample.
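A minimal R sketch of two of these adjustments, assuming samp is the observed sample data frame (all names are illustrative):
#Bootstrapping: draw, with replacement, a new sample of the same size as samp
boot_sample = samp[sample(nrow(samp), size = nrow(samp), replace = TRUE), ]
#Weighting: weight of a category = population share / sample share (pop_share and samp_share are assumed to be known named vectors of proportions)
w = pop_share[samp$category] / samp_share[samp$category]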

Chapter 2: Bivariate Analysis


Types of bivariate relationships: Dependence and interdependence:
-Dependency relationship: we distinguish between the independent variable and the dependent variable.
-Interdependency relationship: the two variables influence each other.
1) Case of two quantitative variables: Pearson's correlation analysis:
- Calculate r = cov(X, Y) / (S_X * S_Y)
- Interpret r:
If r is close to 1 => strong positive linear correlation.
If r is close to 0 => lack of linear correlation.
If r is close to -1 => strong negative linear correlation.

2) Case of two categorical variables: Cross table analysis and chi square test
3) Comparing the averages of two independent samples:
4) Fisher’s 1 way Anova test

Chapter 3: Principal Components Analysis


Individual Scatter Plot:
1) Write the initial data matrix X.
2) Construct the standardised and centred matrix Z, with entries Z_j^i = (x_j^i - x̄_j) / σ_j
[Why do we standardise? In order to give the dispersion of all the variables the same scale.]
3) Construct the correlation matrix R = Z^T D Z, where D is the weight matrix D = (1/n) I_n
4) Determine the eigenvalues and eigenvectors of the correlation matrix R.
5) Sort the eigenvalues in decreasing order; the matrix U is the matrix of the corresponding eigenvectors u_j organised in columns. These eigenvectors correspond to the principal axes.
6) Determine the coordinates of the individuals on the new axes formed by the eigenvectors, through the matrix C = ZU.
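A minimal R sketch of these steps, assuming X is a numeric data matrix (variable names are illustrative):
n = nrow(X)
Z = scale(X) * sqrt(n/(n-1)) #centred and standardised matrix (population standard deviation)
D = diag(1/n, n) #uniform weight matrix D = (1/n) I_n
R = t(Z) %*% D %*% Z #correlation matrix
e = eigen(R) #eigenvalues (e$values) and eigenvectors (e$vectors), already sorted in decreasing order
U = e$vectors #principal axes organised in columns
C = Z %*% U #coordinates of the individuals on the new axes
cor(Z, C) #coordinates of the variables = their correlations with the components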

● Absolute contribution of a given point (ACTR): the share of the inertia of axis α due to individual i, ACTR_α(i) = p_i * c_α(i)² / λ_α, with weight p_i = 1/n.

● Relative contribution of a given point (RCTR): the quality of representation of individual i on axis α, RCTR_α(i) = c_α(i)² / d²(i, G), i.e. the cos² of the angle between the individual and the axis.

Eigenvalue/variances:
Eigenvalues can be used to determine the number of principal components to retain:

● An eigenvalue > 1 indicates that the PC accounts for more variance than one of the original variables in standardised data (Kaiser criterion). This is commonly used as a cutoff point for deciding which PCs are retained, and it holds true only when the data are standardised. (We keep only the eigenvalues greater than 1.)
● You can also limit the number of components to the number that accounts for a certain fraction of the total variance (inertia). For example, if you are satisfied with 70% of the total variance explained, then use the number of components needed to achieve that.
In this case, PC1 alone would be enough to reach the cutoff, yet we must work with at least 2 PCs.
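A small R sketch of both rules, using illustrative eigenvalues (close to the voiture output shown further below):
eig = c(4.66, 0.92, 0.24, 0.10, 0.06, 0.02)
sum(eig > 1) #Kaiser rule: number of eigenvalues greater than 1
cumsum(eig)/sum(eig) #cumulative fraction of the total inertia
which(cumsum(eig)/sum(eig) >= 0.70)[1] #smallest number of components explaining at least 70%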

A method to determine the principal axes to keep is to look at a scree plot, which is the plot of the eigenvalues (in %) ordered from largest to smallest. The number of components is determined at the point beyond which the remaining eigenvalues are all relatively small and of comparable size. In this case, we compare to 1. The number of axes to retain can be found by locating the "elbow" of the curve (the component at the elbow is also counted).

Remarks on individual Factors Map:


● C1 is the variable which gives the best description of the data dispersion.
● The best plane data visualisation is given by the factor map formed by the two axes C1 and C2.
● The variables Cα (α = 1, …, p) are orthogonal (not correlated).
● The variables Cα (α = 1, …, p) are linear combinations of the Zj variables, so each Cα is also centred.
● For all α ≤ p : Var(Cα) = λα

Variable Scatterplot:
The representation of the variables differs from the plot of the observations: the observations are represented by their projections, whereas the variables are represented by their correlations with the components.
Interpretation of the variables' factor map:
● Variables to keep: we keep only the variables that are close to the correlation circle (close to the correlation circle = well represented on the factor map).
● Variable-axis: variables strongly correlated with a factor contribute to the definition of that axis (the closer to the axis, the stronger the correlation).
● Variable-variable: the proximity of the projections of two variables indicates a strong positive correlation between them; diametrically opposite projections indicate a negative correlation between them.
● Nearly orthogonal directions indicate a weak linear correlation.

Quality of representation of the variables on the factor map: Cos2


● The cos2 values are used to estimate the quality of the representation
● The closer a variable is to the circle of correlations, the better its representation on the
factor map (and the more important it is to interpret these components)
● Variables that are close to the centre of the plot are less important for the first components.
DATA ANALYSIS/ R STUDIO

Using packages:
install.packages("tidyverse") #Lets you install new packages (e.g., tidyverse package)
library(tidyverse) #Lets you load and use packages (e.g., tidyverse package)

Setting Working Directories:


getwd() #Returns your current working directory
setwd("C://file/path") #Changes your current working directory to a desired filepath

Creating Dataframes:
#This creates the data frame df with columns x, y and z
df <- data.frame(x = 1:3, y = c("h", "i", "j"), z = 12:14)
#This selects all rows of the third column
df[ ,3]
#This selects all columns of the second row
df[2 , ]
#This selects the element in the second row and third column
df[2,3]
#This selects the column z
df$z

Manipulating Dataframes:
#Takes a sequence of vector, matrix or data frame arguments and combines them by columns
bind_cols(df1, df2)
#Takes a sequence of vector, matrix or data frame arguments and combines them by rows
bind_rows(df1, df2)
#Extracts rows that meet logical criteria
filter(df, x == 2)
#Removes rows with duplicate values
distinct(df, z)
#Selects rows by position
slice(df, 10:15)
#Selects rows with the highest values
slice_max(df, z, prop = 0.25)
#Extracts column values as a vector, by name or index
pull(df, y)
#Extracts columns as a table
select(df, x, y)
#Moves columns to a new position
relocate(df, x, .after = last_col())
#Renames columns
rename(df, “age” = z)
#Orders rows by values of a column from high to low
arrange(df, desc(x))
#Computes table of summaries
summarise(df, total = sum(x))
#Use group_by() to create a "grouped" copy of a table, grouped by columns (similarly to a
#pivot table in spreadsheets). dplyr functions will then manipulate each "group" separately
#and combine the results
df %>%
group_by(z) %>%
summarise(total = sum(x))

Chapter 1: Sampling with R:

#Randomly select a sample of size 100 out of a list l without replacement


sample(x=l,size=100,replace=FALSE)
#R program for systematic sampling
ech_sys = function(N, n)
{
  K = N/n
  e = as.integer(K) # integer part of K
  point_dep = sample(1:e, 1) # random choice of the starting point
  ech = seq(point_dep, N, by = e) # choice of the sample elements: seq(starting_point, ending_point, by = interval)
  ech
}

Chapter 2: Bivariate Analysis:


#R program for Pearson correlation analysis
Y=c(516,512,528,560,584,616,644,676,712,720) #Data for variable Y
X=c(75,83,86,83,99,101,106,112,131,130) #Data for variable X
#Plot: xlab is the x-axis label, ylab the y-axis label, main the title and pch the point type
plot(X,Y,xlab="Revenus",ylab="Depenses",main="nuage de points",pch=16)
cor(X,Y) #Gives the correlation value
cor.test(X,Y) #Gives the hypothesis test values and results, where H0 is the hypothesis of the absence of a relationship

#R program for Pearson's chi-square (X²) test
x=c(50, 70, 110, 135, 60, 75, 100, 50)
#creates a matrix with 2 rows; byrow=TRUE means the matrix is filled row by row
TC=matrix(x,nr=2,byrow=TRUE)
rownames(TC)=c("Homme","Femme") #names the rows
colnames(TC)=c("1-2","2-3","3-4","4-5") #names the columns
chisq.test(TC) #performs the chi-square test, where H0 is the hypothesis that there is no relationship

#Comparing the means of two independent samples
x1=c(9,3,4,5,6,2,10,5,6,10,9,1,8,9,8)
x2=c(7,7,6,5,4,9,6,8,5,9,4,8,11,4,11)
t.test(x1,x2) #performs the test, where H0 is the hypothesis of equal means

#Fisher's one-way ANOVA test
G1=c(54,50,50,58,57,55,51,58,50,53)
G2=c(55,54,61,55,55,58,59,54,56,57)
G3=c(62,58,58,66,65,63,59,66,58,61)
G4=c(56,52,52,60,59,57,53,60,52,55)
X=c(G1,G2,G3,G4)
#Creates a list classe where 1 is repeated 10 times, 2 repeated 10 times, etc. The point is to associate the data with 4 classes (notice that each Gi has 10 elements)
classe=c(rep(1,10),rep(2,10),rep(3,10),rep(4,10))
l=aov(X ~ as.factor(classe)) #Performs the ANOVA test, where H0 is that all means are equal (classe is converted to a factor so it is treated as groups)
summary(l) #summarises the results of the test

Chapter 3: Principal Components Analysis:


Example1:

library(FactoMineR) #Library we use when dealing with PCA


# Tuto: voiture
X=read.table("voiture.txt",header = T,row.names = 8)
#header=T in this case tells R that there is indeed a header in the data file
#row.names=8 tells it that column 8 contains the names of the rows of data
head(X)
X=X[,-1]
#This has deleted the first column of the data table called MODELE
head(X)
#head(X) shows you a part of the data
r=PCA(X)
#this is used to show the PCA Plot of the data
r$eig
#This displays a table describing the eigenvalues of the dataset
r$var$cos2
#this displays cos2 for the variables
r$ind$cos2
#this displays cos2 for the individuals
r$eig:
eigenvalue percentage of variance cumulative percentage of variance
comp 1 4.65602121 77.6003536 77.60035
comp 2 0.91522148 15.2536914 92.85404
comp 3 0.24043062 4.0071770 96.86122
comp 4 0.10270953 1.7118255 98.57305
comp 5 0.06465625 1.0776042 99.65065
comp 6 0.02096090 0.3493484 100.00000

● An eigenvalue > 1 indicates that the PC accounts for more variance than one of the original variables in standardised data (Kaiser criterion). This is commonly used as a cutoff point for deciding which PCs are retained, and it holds true only when the data are standardised. (We keep only the eigenvalues greater than 1.) In this case the rule gives us only one axis, while we need at least two, so we look at the scree plot or the inertia method instead.

plot(r$eig[,1],type="b",xlab="Dimensions",ylab="Eigenvalues", main="Scree plot")
#Draws the scree plot

Basing our choice on the scree plot and the elbow method, we choose the first and second axes.

r$ind$cos2:
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
HC 0.8766111128 2.192808e-02 0.0603139677 3.555069e-02 2.995767e-03
R19 0.6623990534 1.930481e-02 0.2130804944 4.966035e-02 5.555186e-02
FiT 0.7639797857 8.658382e-02 0.1010144358 1.657507e-02 1.362437e-02
P405 0.1069827320 3.081190e-01 0.0495708133 5.330757e-01 1.104165e-03
R21 0.0335769220 4.183204e-01 0.0042947373 4.152715e-01 1.108675e-01
CBX 0.4826435993 8.168291e-02 0.0432456291 3.270202e-01 7.193900e-03
Quality of representation on the factor map: Cos2
● The cos2 values are used to estimate the quality of the representation: the closer the cos2 is to 1, the better the point is represented.
● The closer a variable is to the circle of correlations, the better its representation on the factor map (and the more important it is for interpreting these components).

Example2:
# Tuto Toothpaste
X=read.csv("toothpaste.csv",header = T,sep = ",") #to read data from a csv file
Y=X[,-8] #Deletes column number 8
head(Y) #displays the first 6 rows of the data
PCA(Y )#shows results (graph) for the principal components Analysis

Example3:
data(decathlon)
res.pca = PCA(decathlon[,1:10], scale.unit=TRUE, ncp=5, graph=T)
#decathlon: the used data
#scale.unit: whether to standardise (scale to unit variance) the data or not
#ncp: number of dimensions to keep
#graph: to decide whether to show graph or not
res.pca = PCA(decathlon[,1:12], scale.unit=TRUE, ncp=5, quanti.sup=c(11:
12), graph=T)
#quanti.sup:a vector indicating the indexes of the quantitative supplementary
variables
res.pca = PCA(decathlon, scale.unit=TRUE, ncp=5, quanti.sup=c(11: 12),
quali.sup=13, graph=T)
plot.PCA(res.pca, axes=c(1, 2), choix="ind", habillage=13)
DATA ANALYSIS/ Python

Libraries:
import math #(built-in) common math functions and constants in Python
import statistics #(built-in) for descriptive statistics
import numpy as np #used for working with arrays, matrices, mathematical operations…
import scipy.stats as stats #for probability distributions, summary and frequency statistics, correlation functions and statistical tests
import pandas as pd #provides many functions and methods to expedite the data analysis process
import matplotlib.pyplot as plt #for data visualisation in Python

Importing data / Creating lists:


A = [43,64,45,55,43,38,64,55,47,64,29,46,29,64,43,39,19,55,29,19] #Creates
list of data

Numpy:
A = [43,64,45,55,43,38,64,55,47,64,29,46,29,64,43,39,19,55,29,19] #Creates list of data

Pandas:
A = [43,64,45,55,43,38,64,55,47,64,29,46,29,64,43,39,19,55,29,19]
df = pd.DataFrame(A) #creates a dataframe from list A

Basic statistical Measures:

Numpy:
mean = statistics.mean(A) #calculates the mean of the data
print(mean)
mean_h = statistics.harmonic_mean(A) #calculates the harmonic mean
print(mean_h)
median_ = statistics.median(A) #calculates the median
print(median_)
print(stats.mode(A,axis=0)) #gives the mode and its number of repetitions (scipy)
mode_ = statistics.mode(A) #gives the mode
print(mode_)
A_min = min(A) #gives the minimum
print(A_min)
A_max = max(A) #gives the maximum
print(A_max)
variance_ = statistics.variance(A) #gives the variance
print(variance_)
standards_deviation_ = statistics.stdev(A) #gives the standard deviation
print(standards_deviation_)

Pandas:
df = pd.DataFrame(A)
mean_df = df.mean() #calculates the mean
print(mean_df)
print(df.describe()) #general description of the statistical measures

Example 1 in Pandas: Values, effective, Frequency, Cumulative Frequency

N=len(A)
L=[]
Z=[] #effective
X=[] #Frequency
print('Value','effective','Frequency')
for i in range(0,N):
    if A[i] not in L:
        L.append(A[i])
        Z.append(A.count(A[i])) #effective
        X.append(A.count(A[i])/N) #frequency
        print(A[i],'\t',A.count(A[i]),'\t',A.count(A[i])/N)

L1=[] # Cumulative effective

def effectivs_cumul_croissant(effective_list): #creates function
    partial_sum = 0
    for x in effective_list:
        partial_sum += x
        L1.append(partial_sum)
    return L1
print(effectivs_cumul_croissant(Z))

L2=[] # Cumulative Frequency

def frequence_cumul_croissant(frequency_list):
    partial_sum = 0
    for x in frequency_list:
        partial_sum += x
        L2.append(partial_sum)
    return L2
print(frequence_cumul_croissant(X))
#Drawing plots using Matplotlib:

#Bar chart
plt.xlabel('Age')
plt.ylabel('effective')
plt.title("Distribution of library visitors by age")
plt.bar(L,Z)
plt.show()

#Pie chart
fig, ax = plt.subplots()
plt.title("MPREF _STUDENTS")
#labels gives the names of the pie slices, while autopct sets how the percentages are written inside the slices
ax.pie((C,F,M), labels=('C', 'F', 'M'), autopct='%1.1f%%')
plt.show()

#Histogram
fig, ax = plt.subplots()
ax.hist(x, bins=sorted(frq), cumulative=False) #x and frq are assumed to be defined earlier; sorted(frq) gives the bin edges in increasing order
plt.title("MPREF _STUDENTS")
ax.set_xlabel('Subject')
ax.set_ylabel('Frequency')
plt.show()

Example 2 in Pandas:

df = pd.DataFrame({'effectifs': [152,82,125,981,686,92,25,22]})
df.index = ['Bank account or service', 'consumer loan', 'Credit card', 'Credit
reports', 'Debt recovery', 'Mortgage', 'Student loan', 'Others']
df = df.sort_values(by='effectifs',ascending=False) #sorts values
df["frequences_cum"] = df["effectifs"].cumsum()/df["effectifs"].sum()*100
print(df)

#Pareto chart
fig, ax = plt.subplots()
ax.bar(df.index, df["effectifs"], color="C0")
ax2 = ax.twinx()
ax2.plot(df.index, df["frequences_cum"], color="C1", marker="D", ms=7)

ax.tick_params(axis="y", colors="C0")
ax2.tick_params(axis="y", colors="C1")
plt.title("complaint categories")
plt.show()
Section 1: Numpy DataTypes

import numpy as np
X1 = np.array([1,2,3,4],dtype="str") #Creates an array with elements stored as strings (dtype could be float, int, …)
print(X1)

X6=np.zeros((5,6),dtype="float") #array of 5 rows and 6 columns of 0s stored as floats
print(X6)
X6=np.ones((5,6),dtype="float") #array of 5 rows and 6 columns of 1s stored as floats
print(X6)

X7 = np.full((3,5),3.14)# array of 3 rows and 5 columns repeating 3.14


print(X7)

X8 = np.arange(0,20,2)#starts at 0, ends at 20, with step 2 (20 not included)


print(X8)

X = np.linspace(1,2,5) #5 evenly spaced values starting at 1 and ending at 2


print(X)

X = np.random.random((3,1)) #3 rows and 1 column of random numbers between 0 and 1
print(X)

X = np.random.normal(0,1,(3,3)) #0 is the mean, 1 is the standard deviation; 3x3 array of random numbers
print(X)

Y1 = np.random.randint(0,10,(3,3)) #random 3x3 array of random integers between 0 and 9 (10 excluded)
print(Y1)

X = np.eye(2)#2by2 identity matrix


print(X)

X1 = np.empty((2,3),dtype="int") #creates an uninitialised array of the desired shape, faster than np.zeros and np.ones
print(X1)
Section 2: Numpy Basics:
Attributes of arrays:

import numpy as np
np.random.seed(0) # seed for reproducibility
X1 = np.random.randint(10, size=6) # One-dimensional array of size 6, with numbers between 0 and 9
X2 = np.random.randint(10, size=(3,4)) # Two-dimensional array
X3 = np.random.randint(10, size=(3,4,5)) # Three-dimensional array
print(X1)
print('__')
print(X2)
print(X3)

print("x1 ndim: ",X1.ndim) #returns number of dimensions of array


print("x1 shape: ",X1.shape) #returns shape of array
print("x1 size: ",X1.size) #gives the size as the total number of elements
print("dtype: ",x1.dtype) #the data type of the array
print("itemsize:",x1.itemsize,"bytes")#lists the size (in bytes) of each array
element
print("nbytes:",x1.nbytes,"bytes")#lists the total size (in bytes) of all array
elements

print(X1[1]) #prints the element of index 1 in array X1 (Python indexing starts from 0)

print(X1[-1]) #prints the last element of array X1

Array Slicing:
x = np.arange(10)#creates array of range 0 to 10 (reaches 9)
print(x[:5])# prints 5 elements (until index 5-1)
print(x[5:])# prints elements starting from index 5
print(x[4:7])# prints elements starting from index 4 to index 7-1
print(x[::2])# prints elements with step = 2
print(x[1::2])# prints elements with step = 2 starting from index 1
print(x[-7:-2:2])#prints starting from position -7 to position -2 with step = 2
print(x[::-1])# prints all elements, reversed
print(x[5::-2]) #start from index 5 with step= (-2)

x2 = np.random.randint(10, size=(3,4))
print(x2[-1,-3])# prints the element in the last row, 3rd column from the end
print(x2[:2, :3])#prints rows up to index 2-1 and columns up to index 3-1
print(x2[:3,::2])#prints rows up to index 3-1, with step 2 for the columns
print(x2[::-1,::-1])#reverses whole array
print(x2[:, 0]) # first column of x2
print(x2[0,:]) # first row of x2
x2_sub_copy = x2[:2,:2].copy()#copies specific sub array into a new array
print(x2_sub_copy)

x2_sub_copy[0,0] = 42# assigns new value 42 to the specific element 0,0


print(x2_sub_copy)

x = np.array([1,2,3])
y = np.array([3,2,1])
print(np.concatenate((x, y)))#concatenates 2 arrays

x = np.array([1,2,3])
grid = np.array([[9,8,7],
[6,5,4]])
print(np.vstack([x,grid]))#vertically stacks the arrays on each other

y = np.array([[99],
[99]])
print(np.hstack([grid,y]))#horizontally stacks the arrays next to each other

x = [1,2,3,99,99,3,2,1]
x1, x2, x3 = np.split(x, [3,5])#splits the array into subarrays such that the first has the elements up to index 3-1, the second the rest up to index 5-1, and the third the remainder
print(x1, x2, x3)

grid = np.arange(36, dtype=float).reshape((6,6))
print(grid)
upper, lower = np.vsplit(grid, [2])#Splits into 2 arrays: the first has the rows up to index 2-1, the second the rest
print(upper)
print(lower)

upper, middle, lower = np.vsplit(grid, [2,3])
#The first two rows go into the first array; the second array is supposed to take the first three rows, but since two are already taken it only gets the single remaining row, and the rest goes to the last array
print("upper: ", upper)
print("middle: ", middle)
print("lower: ", lower)
