Deneesha Tharunika Sooriyaarachchi CL-HDCSE-CMU-102-40 CSE5014 1668472 412159309


Task 1

The Ministry of Commerce and Industry in Sri Lanka can benefit greatly from data science and
business intelligence, particularly when making decisions about human resource planning for
economic development. These benefits include the following.
Choosing high-priority sectors and careers: Data science can help Ministry of Commerce and
Industry officials identify the high-value industries and professions that should be given
preference in national development planning. For instance, by studying the dataset of
occupations in Canada, analysts can identify the most prestigious occupations and the factors
that contribute to prestige; this information can then be used to decide which industries and
occupations to prioritize.
Predictive analysis: Predictive analytics is a powerful tool that helps decision-makers
anticipate future trends and take preventive action. By examining historical data and recent
trends, officials can forecast future staffing requirements and plan accordingly.
Using evidence to guide decisions: Data science helps decision-makers base their decisions on
evidence. Data analysis and machine learning can provide both qualitative and quantitative
findings that inform a decision, instead of relying solely on intuition or subjective judgment.
Savings on costs: Using data science, officials can find areas in which costs can be cut. For
instance, by examining the dataset of jobs in Canada, officials can find jobs that are less
highly regarded but pay well. By concentrating on these professions, officials can lower costs
without sacrificing the quality of their human resources.
Competitive advantage: By using data science, officials can help Sri Lanka outperform other
nations in the region. By making informed choices about human resource planning, Sri Lanka can
attract foreign investment and compete more successfully in the global economy.
Efficiency gains: Data science can help government agencies simplify processes and work more
efficiently. For example, officials can automate tasks such as data cleaning and data analysis
using machine learning algorithms, which reduces the time and effort needed to reach decisions.
In summary, the Ministry of Commerce and Industry in Sri Lanka can benefit greatly from data
science and business intelligence. By making informed choices about the management of its human
resources, Sri Lanka can accomplish its goals for national growth and enhance its ability to
compete on the international stage.
Task 2
RStudio is a free Integrated Development Environment (IDE) for writing, running, and debugging
R code. It has become one of the most popular data analysis tools among data scientists because
it offers a user-friendly interface with numerous capabilities for analyzing and displaying
data, such as data visualization.
Here are a few typical steps in the data analysis process using RStudio.

Data import: You can bring data into RStudio from a number of different sources, including CSV
files, Excel files, and databases. For importing data in various formats, R offers functions
such as read.csv() and read.csv2() in base R and read_excel() from the readxl package.
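
As a minimal sketch of the import step, the example below writes a small CSV file to a temporary location and reads it back with read.csv(). The file contents, column names, and the name toyData are made up for illustration and are not the assignment's dataset.

```r
# Write a tiny illustrative CSV to a temporary file so the example is self-contained.
# The column names and values are hypothetical.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("occupation,education,income",
             "clerk,10.5,5000",
             "engineer,14.2,9000",
             "farmer,7.1,3500"), csv_path)

# read.csv() expects comma-separated fields and "." as the decimal mark;
# read.csv2() expects ";" as the separator and "," as the decimal mark.
toyData <- read.csv(csv_path, stringsAsFactors = FALSE)

str(toyData)    # column names and types
nrow(toyData)   # number of rows read
```

An Excel file would be read in a similar way with read_excel() once the readxl package is installed and loaded.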

Data cleaning and manipulation: After importing the data, you might need to clean and modify
it. For this purpose, R offers a variety of functions and packages, including the tidyverse
packages (such as dplyr and tidyr). Among other things, these packages help combine datasets,
remove incomplete data, and filter and select particular columns and rows.
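
The cleaning operations described above can be sketched with base R functions, which need no extra packages; dplyr offers equivalents, but base R is used here as an assumption to keep the example self-contained. The two small tables are made up for illustration.

```r
# Two small hypothetical tables that share the key column 'occupation'.
jobs <- data.frame(
  occupation = c("clerk", "engineer", "farmer", "nurse"),
  education  = c(10.5, 14.2, NA, 12.3),
  income     = c(5000, 9000, 3500, 6200)
)
types <- data.frame(
  occupation = c("clerk", "engineer", "farmer", "nurse"),
  type       = c("wc", "prof", "bc", "prof")
)

# Combine the two tables on their shared key column.
merged <- merge(jobs, types, by = "occupation")

# Remove rows that contain any missing value.
complete <- na.omit(merged)

# Filter rows by a condition and select particular columns.
high_income <- subset(complete, income > 5500,
                      select = c(occupation, income, type))
print(high_income)
```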

Visualizing the data: R offers strong data visualization packages, such as ggplot2, that make
it simple to create custom graphics and plots. To explore and summarize the data, users can
create a number of visualizations, including scatterplots, histograms, and boxplots.

Data analysis: To help users analyze the data, R offers a number of statistical and machine
learning packages, including stats and caret. Among other analyses, users can conduct
hypothesis tests, regression analysis, clustering, and classification.

Presenting and reporting the findings: The tools available in RStudio enable users to produce
reports and visual presentations of the analysis results. Tools such as R Markdown and Shiny
can generate reports and interactive applications that highlight the analysis.

RStudio is a robust data analysis environment that offers a variety of features for cleaning,
manipulating, visualizing, and analyzing data, as well as for creating reports and visual
presentations of the findings.
Techniques
Data visualization is the process of presenting information in a visual format, such as maps,
charts, and graphs. It supports the understanding, analysis, and communication of information,
and it helps extract insights from data quickly because visual representations make trends and
patterns easy to spot.
RStudio, the Integrated Development Environment (IDE) for the widely used R statistical
programming language, offers a variety of tools and libraries for producing powerful data
visualizations.

Methodologies
Hypothesis testing is a common statistical technique that tests a hypothesis about a
population using sample data. It allows us to determine whether there is sufficient evidence to
reject a hypothesis. Simple linear regression models the relationship between two variables,
in which one is the dependent (response) variable and the other is the independent (predictor)
variable. It is useful for identifying how the dependent variable changes with the independent
variable.
Statistical analysis also involves summarizing and describing a dataset's attributes, such as
central tendency, variability, and distribution. Measures such as the mean, median, mode, and
standard deviation are numerical summaries that highlight the key features of a dataset.
• Hypothesis Testing

Hypothesis testing is a statistical technique used to evaluate a claim or assumption made
about a population. A null hypothesis and an alternative hypothesis are established, data are
gathered, and statistical methods are used to determine whether the data provide enough
evidence to reject the null hypothesis. The choice of hypothesis test depends on the type of
data being evaluated and the research question being asked.
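
The procedure above can be sketched with a two-sample t-test, chosen here only as one common example of a hypothesis test. The two samples and the 5% threshold are assumptions made for illustration.

```r
# Hypothetical samples: incomes (in thousands) for two groups of occupations.
group_a <- c(5.1, 6.3, 5.8, 6.0, 5.5, 6.1)
group_b <- c(7.2, 8.1, 7.7, 8.4, 7.9, 8.0)

# H0: the two population means are equal; H1: they differ.
result <- t.test(group_a, group_b)   # Welch two-sample t-test by default

alpha <- 0.05                        # significance level
print(result$p.value)
if (result$p.value < alpha) {
  message("Reject H0 at the 5% level")
} else {
  message("Fail to reject H0 at the 5% level")
}
```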

• Simple Linear Regression

In simple linear regression, a straight line is fitted to the data to model the relationship
between two continuous variables: one predictor variable (x) and one outcome variable (y). The
aim is to find the line that best fits the data and gives a realistic estimate of the
relationship between the two variables.
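
A minimal sketch of fitting such a line with lm() follows. The x and y values are made up so that they lie close to the line y = 1 + 2x.

```r
# Hypothetical data: x is the predictor, y is the outcome (roughly y = 1 + 2x).
x <- c(1, 2, 3, 4, 5, 6)
y <- c(3.1, 4.9, 7.2, 8.8, 11.1, 12.9)

fit <- lm(y ~ x)                 # least-squares fit of y on x
print(coef(fit))                 # estimated intercept and slope
print(summary(fit)$r.squared)    # proportion of variance in y explained by x

# Predict the outcome for a new predictor value.
predict(fit, newdata = data.frame(x = 7))
```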

• Descriptive Data Analysis

Descriptive data analysis is a technique used to summarize and describe the features of a
dataset. Its two main objectives are to provide a summary of the data and to identify patterns,
trends, and relationships among variables. It includes measures of central tendency, measures
of variability, and data visualization.

• Summary Statistics

Summary statistics are numerical values used to summarize and describe a dataset's
characteristic features. They include measures of central tendency (mean, median, and mode),
measures of variability, and measures of relative position. Summary statistics describe a
dataset's key characteristics succinctly, and they can also be used to compare datasets and to
spot outliers or unusual results.
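
As an illustration, the base R functions below compute measures of each kind on a made-up vector. The values and the 1.5 × IQR outlier rule are assumptions chosen for the sketch, not results from the assignment's data.

```r
# Hypothetical income values.
income <- c(1600, 4500, 4500, 5100, 6900, 7800, 9200, 26900)

summary(income)        # minimum, quartiles, median, mean, maximum
sd(income)             # standard deviation (variability)
IQR(income)            # interquartile range (variability)
quantile(income, 0.9)  # 90th percentile (relative position)

# A simple outlier check: values more than 1.5 * IQR above the third quartile.
upper_fence <- quantile(income, 0.75) + 1.5 * IQR(income)
outliers <- income[income > upper_fence]
print(outliers)
```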

Task 3

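# mean value (stored under a name that does not mask the mean() function)
income_mean <- mean(myData$income)
print(income_mean)

# median value
income_median <- median(myData$income)
print(income_median)

# mode value: the most frequent income (the smallest one if several tie)
get_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}
print(get_mode(myData$income))

# minimum and maximum values
min(myData$income)
max(myData$income)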

The dataset's minimum and maximum values are two relatively straightforward measures.
Minimum = 1611
Maximum = 26879

The mean is the usual "average", which is calculated by adding all the values and dividing the
sum by the number of values.
The median is the middle value of the ordered list. The values must be sorted from least to
greatest before the median can be found, and with an even number of values the median is the
average of the two middle values.
The mode is the value that occurs most frequently. If no value in the list is repeated, the
list has no mode.
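
These definitions can be checked by hand on a small made-up vector. The helper find_modes() is a hypothetical name introduced for this sketch; it returns NA when no value repeats.

```r
values <- c(3, 1, 4, 1, 5)      # hypothetical values

sum(values) / length(values)    # mean: 14 / 5 = 2.8
sort(values)                    # 1 1 3 4 5
median(values)                  # middle value of the sorted list: 3

# Return every most-frequent value, or NA when no value is repeated.
find_modes <- function(x) {
  counts <- table(x)
  if (max(counts) == 1) return(NA)
  as.numeric(names(counts)[as.vector(counts) == max(counts)])
}

find_modes(values)              # 1 (it appears twice)
find_modes(c(1, 2, 3))          # NA (no value repeats)
```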

Median = 6930.5
Mode = 4485
Mean = 7797.902

Task 4

Table 1: Summary statistics of prestige, education, and income

Variable    Minimum  1st Quartile  Median   Mean     3rd Quartile  Maximum
prestige    24.80    45.23         53.60    56.83    69.28         97.20
education   6.380    8.445         10.540   10.738   12.648        15.970
income      1611     5106          6930     7798     9187          26879

Task 5

# Histogram of prestige on the density scale, with a kernel density curve overlaid
hist(myData$prestige, main = "Distribution of Prestige", col = "green", probability = TRUE)
lines(density(myData$prestige))

# Histogram and density curve of education
hist(myData$education, main = "Distribution of Education", col = "green", probability = TRUE)
lines(density(myData$education))

# Histogram and density curve of income
hist(myData$income, main = "Distribution of Income", col = "green", probability = TRUE)
lines(density(myData$income))

Task 6
Hypothesis to be tested: H0: ρ = 0 vs H1: ρ ≠ 0

p-value < 2.2e-16, which is less than 0.05. Hence, H0 is rejected at the 5% level of
significance. Therefore, we can conclude that there is a significant relationship between
prestige and type of occupation. The adjusted R-squared value of 0.69 further indicates that
type of occupation explains about 69% of the variation in prestige. The following scatter plot
supports this conclusion.
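
As a sketch of how an adjusted R-squared value and an overall p-value of this kind can be obtained, the example below fits a linear model with a categorical predictor. It assumes the analysis used lm(), which the text does not state, and the data are made up for illustration.

```r
# Hypothetical data; 'type' is a categorical predictor with three levels.
toyData <- data.frame(
  prestige = c(30, 35, 32, 50, 55, 52, 75, 80, 78),
  type     = factor(rep(c("bc", "wc", "prof"), each = 3))
)

fit <- lm(prestige ~ type, data = toyData)   # prestige modeled by type
s <- summary(fit)

s$adj.r.squared                              # adjusted share of variance explained

# p-value of the overall F-test, computed from the stored F statistic.
f <- s$fstatistic
p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
unname(p_value)
```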

Task 7

Hypothesis to be tested: H0: ρ = 0 vs H1: ρ ≠ 0

p-value < 2.2e-16, which is less than 0.05. Hence, H0 is rejected at the 5% level of
significance. Therefore, we can conclude that prestige and education have a linear
relationship. The estimated correlation of 0.85 further indicates a strong positive correlation
between the two variables. The following scatter plot supports this conclusion: it shows a
linear relationship.

Figure 1: Scatter Plot of Education vs Prestige
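
A correlation test of this form can be sketched with cor.test(), which tests H0: ρ = 0 for the Pearson correlation by default. The vectors below are made up for illustration and do not reproduce the reported values.

```r
# Hypothetical data for two numeric variables.
education <- c(7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
prestige  <- c(28, 35, 33, 45, 48, 55, 60, 58, 72, 80)

test <- cor.test(education, prestige)   # Pearson correlation test, H0: rho = 0

test$estimate    # sample correlation coefficient r
test$p.value     # p-value of the test
test$conf.int    # 95% confidence interval for the correlation
```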

Task 8
Hypothesis to be tested: H0: ρ = 0 vs H1: ρ ≠ 0
p-value < 2.2e-16, which is less than 0.05. Hence, H0 is rejected at the 5% level of
significance. Therefore, we can conclude that prestige and income have a linear relationship.
The estimated correlation of 0.71 further indicates a positive correlation between the two
variables. The following scatter plot supports this conclusion: it shows a linear relationship.

Figure 2: Scatter Plot of Income vs Prestige

Task 9
Hypothesis to be tested: H0: ρ = 0 vs H1: ρ ≠ 0
p-value = 2.079e-10, which is less than 0.05. Hence, H0 is rejected at the 5% level of
significance. Therefore, we can conclude that education and income have a linear relationship.
The estimated correlation of 0.57 further indicates a positive correlation between the two
variables. The following scatter plot supports this conclusion: it shows a linear relationship.

Figure 3: Scatter Plot of Income vs Education

Task 10

Multiple linear regression model: Y = β0 + β1X1 + β2X2 + β3X3 + e

• Y is the dependent variable (prestige), also called the response; it is the variable we are
trying to predict.

• β0 is the intercept of the regression line, the expected value of Y when every X is zero.

• β1X1 is the regression coefficient (β1) multiplied by the first independent variable
(X1 = education). The remaining coefficients and variables are interpreted in the same way.

• β2X2 is the regression coefficient (β2) multiplied by the second independent variable
(X2 = income).

• β3X3 is the regression coefficient (β3) multiplied by the third independent variable
(X3 = type of occupation).

• e is the model error (residual), the variation in Y that the model does not explain.

Hypothesis to be tested: H0: β1 = β2 = β3 = 0 vs H1: at least one βj ≠ 0
p-value < 2.2e-16, which is less than 0.05. Hence, H0 is rejected at the 5% level of
significance. Therefore, we can conclude that education, income, and type of occupation
together have a significant linear relationship with prestige. The adjusted R-squared value of
0.82 further indicates that the model explains about 82% of the variation in prestige and is a
good fit.

Multiple linear regression model

Multiple linear regression is a statistical method for modeling the relationship between one
dependent variable and two or more independent variables. It extends simple linear regression,
in which the variation in the dependent variable is explained by a single independent variable.
The objective of multiple linear regression is to find the best-fit line or hyperplane that
describes the linear relationship between the dependent variable and the independent
variables. Estimating the coefficient of every independent variable in the model, together with
the intercept term, yields that line or hyperplane. The multiple linear regression equation can
be written as follows:

Y = b0 + b1X1 + b2X2 + ... + bnXn + ε

Here Y is the dependent variable, X1, X2, ..., Xn are the independent variables, b0 is the
intercept term, b1, ..., bn are the coefficients of the predictor variables, and ε is the
residual error term. Each coefficient represents the change in the dependent variable for a
one-unit increase in the corresponding predictor variable while all the other independent
variables are held constant. Different statistical measures, including the R-squared value,
which gives the proportion of the dependent variable's variability that is accounted for by the
model's independent variables, are used to assess the model's goodness of fit. A multiple
linear regression model is frequently used in a variety of fields, including finance,
economics, and sociology, to model the relationship between several variables and to predict
the dependent variable from the values of the independent variables. It is also applied in
machine learning to prediction tasks.
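
A model of this form can be sketched with lm(). The data below are made up, with column names that mirror the variables in the text, so the output does not reproduce the reported results. Note that R expands a categorical predictor with three levels into two indicator variables, so the single term for type in the equation stands for two coefficients in the fitted model.

```r
# Hypothetical data; column names mirror the variables discussed in the text.
toyData <- data.frame(
  prestige  = c(30, 35, 32, 50, 55, 52, 75, 80, 78, 41, 63, 70),
  education = c(7, 8, 7.5, 10, 11, 10.5, 14, 15, 14.5, 9, 12, 13),
  income    = c(3000, 3500, 3300, 5000, 5600, 5200,
                9000, 12000, 9500, 4200, 7000, 8000),
  type      = factor(c("bc", "bc", "bc", "wc", "wc", "wc",
                       "prof", "prof", "prof", "bc", "wc", "prof"))
)

fit <- lm(prestige ~ education + income + type, data = toyData)
s <- summary(fit)

coef(fit)          # intercept plus one coefficient per predictor term
s$adj.r.squared    # adjusted R-squared of the fitted model
```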

This study examined four variables: prestige, type of occupation, education, and income. The
results indicate relationships between prestige and type of occupation, prestige and
education, prestige and income, and education and income. The analysis gave an adjusted
R-squared value of 0.69 for prestige and type of occupation, and estimated correlations of 0.85
for prestige and education, 0.71 for prestige and income, and 0.57 for education and income. As
a rule of thumb used here, a value above 0.5 indicates a strong linear relationship, while a
value below 0.5 indicates a weaker one, although a relationship may still exist. According to
the analysis, education, income, and type of occupation all affect prestige, and each has a
positive effect. Of these, type of occupation and education have the strongest positive
influence on prestige.
