DADS301 MBA Sem 3 Programming in DS
SESSION 2023
PROGRAM MASTER OF BUSINESS ADMINISTRATION (MBA)
SEMESTER III
COURSE CODE & NAME DADS301 – PROGRAMMING IN DATA SCIENCE
CREDITS 4
NUMBER OF ASSIGNMENTS & MARKS 02
30 Marks each
Note: Answer all questions. Answers to 10-mark questions should be approximately 400 - 450
words. Each question is followed by an evaluation scheme.
A1) The input code below includes explanations entered as comments using ‘#’.
Output
Q2) What do you mean by descriptive statistics? Write 5 important functions used for calculating
the descriptive stats.
A2) Descriptive statistics aim at summarizing, describing and presenting the data values in R. They
help in understanding the data by giving a clear overview. They are a good first step before a deep
dive into any analysis, especially a root cause analysis (RCA). Many measures exist that summarize
a dataset. They are divided into two types:
1. Location Measure – it gives a good understanding of the central tendency of the data.
Varun Asthana
Roll No. 2114501153
Program – Online MBA
Course – Programming in Data Science Directorate of Online Education
2. Dispersion Measure – it gives a good understanding of the spread of the data.
1. Minimum & Maximum - these can be found by using the min() and max() functions.
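The original code for this step is not preserved. A minimal sketch, assuming the built-in iris dataset and its Sepal.Length column (both are assumptions chosen for illustration), might look like this:

```r
# Assumption: the built-in iris dataset stands in for the data used in the
# original answer, which is not preserved.
data(iris)
x <- iris$Sepal.Length

min_value   <- min(x)    # smallest observation
max_value   <- max(x)    # largest observation
value_range <- range(x)  # minimum and maximum returned together

print(min_value)
print(max_value)
print(value_range)
```

range() is convenient when both ends are needed at once, since it returns the minimum and maximum as a vector of length two.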
2. by() function - it applies a function to each group of a data frame. In the example, the
summary is requested on the basis of the species.
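The example code is not preserved. Since the text groups by species, the sketch below assumes the built-in iris dataset and applies summary() to its four numeric columns; the choice of dataset and columns is an assumption.

```r
# Assumption: iris is used because the text mentions a summary by species.
data(iris)

# by() splits the data frame by the grouping factor and applies the given
# function to each piece; here summary() runs on the four numeric columns.
species_summary <- by(iris[, 1:4], iris$Species, summary)

print(species_summary)
```

The result holds one summary table per species, so the groups can be compared side by side.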
Q3) Create three vectors of the same length, two of them with discrete random values and one
with continuous random values. Next, create a data frame with these vectors.
A3) set.seed() in R is used for creating reproducible results when we write code that
generates random variables. When we set a seed, it is ensured that whenever we run the
code the same random values are produced every time.
The 1st discrete variable is created with a length of 10 from the range of numbers 1 to 5.
The 2nd discrete variable is created with a length of 10 from the range of letters a to e.
Then a data frame is defined with the three variables that are created.
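The steps above can be sketched as follows. The seed value, the variable names and the use of a uniform distribution for the continuous variable are assumptions, since the original code is not preserved.

```r
# Hypothetical seed value; any fixed integer makes the run reproducible.
set.seed(123)

n <- 10

# Discrete variable 1: integers drawn from 1 to 5, with replacement.
discrete_num <- sample(1:5, size = n, replace = TRUE)

# Discrete variable 2: letters drawn from a to e, with replacement.
discrete_chr <- sample(letters[1:5], size = n, replace = TRUE)

# Continuous variable. Assumption: uniform on [0, 1]; rnorm() would also work.
continuous <- runif(n, min = 0, max = 1)

# Combine the three equal-length vectors into a data frame.
df <- data.frame(discrete_num, discrete_chr, continuous)
print(df)
```

Running the script twice with the same seed should print the same data frame, which is the point of calling set.seed() first.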
Q4) Create a for loop that goes through a numeric sequence, computes e to the power of each
value and if the result is greater than 1000 it stores this result in another vector.
# now print the stored results that met the criterion (greater than 1000)
print(result_vector)
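Only the last lines of the original loop are preserved above. One way to write the complete loop is sketched below; the sequence 1 to 10 and the names numeric_sequence and result are assumptions.

```r
# Assumption: the sequence 1 to 10 is illustrative; the original is not preserved.
numeric_sequence <- 1:10

# Start with an empty numeric vector to collect qualifying results.
result_vector <- numeric(0)

for (value in numeric_sequence) {
  result <- exp(value)  # e raised to the power of the current value
  if (result > 1000) {
    # Append the result only when it exceeds 1000.
    result_vector <- c(result_vector, result)
  }
}

print(result_vector)
```

Because exp(6) is about 403 and exp(7) is about 1097, only the values for 7 through 10 are stored with this sequence.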
Output:
Q5) Create a box and whisker plot for mpg variable present in mtcars data sets.
A5) To create a box and whisker plot in R, we first load the mtcars dataset.
The boxplot() function is then used to plot the ‘mpg’ variable. This
creates a single box and whisker plot in which the ‘mpg’ values are shown along the
y-axis.
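A minimal sketch of these two steps is given below. The title and axis label are assumptions added for readability; the original code is not preserved.

```r
# Load the built-in mtcars dataset.
data(mtcars)

# Draw the box and whisker plot for mpg.
# Assumption: the main title and y-axis label are illustrative choices.
bp <- boxplot(mtcars$mpg,
              main = "Box and Whisker Plot of mpg",
              ylab = "Miles per gallon (mpg)")
```

boxplot() also returns the statistics it used for drawing, so storing the result in bp allows them to be inspected with bp$stats.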
The above code produces the following plot:
It should be noted that the box in the middle of the plot represents the interquartile range, also
known as the IQR, which extends from the first quartile Q1 (25th percentile) to the third quartile
Q3 (75th percentile) of the mpg values.
The line inside the box represents the median, the value at the 50th percentile. The whiskers
extending from the box cover the values that fall within 1.5 times the IQR of the box edges, whereas
any points falling outside this range are considered outliers and are plotted as individual points.
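The quantities described above can also be computed directly. The sketch below uses quantile() for Q1 and Q3; note that boxplot() itself derives the box edges from hinges (see boxplot.stats()), so its values can differ slightly from these. The variable names are assumptions.

```r
data(mtcars)
mpg <- mtcars$mpg

q1  <- quantile(mpg, 0.25, names = FALSE)  # 25th percentile
med <- median(mpg)                         # 50th percentile
q3  <- quantile(mpg, 0.75, names = FALSE)  # 75th percentile
iqr <- q3 - q1                             # interquartile range

# Limits at 1.5 times the IQR beyond the quartiles.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values beyond the fences are outlier candidates under this rule.
outside <- mpg[mpg < lower_fence | mpg > upper_fence]

print(c(Q1 = q1, Median = med, Q3 = q3, IQR = iqr))
print(outside)
```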
Q6) What do you mean by an outlier? Describe some methods to treat an outlier.
A6) An outlier is any value that lies outside most of the other values in a set of data. Such
values can be exceptions that stand apart from the rest of the sample or population. To count
as an outlier, a value needs to differ significantly from the others, and this should be confirmed
via calculation. One graphical representation of an outlier can be shown as below:
Outliers can seriously distort measured parameters such as the mean and the standard
deviation and can cause errors in the analysis. Here are some common methods to treat outliers
in data analysis:
Identify and remove outliers: This is the simplest approach, where outliers are identified based on a
certain threshold and then removed from the dataset. The threshold can be set based on domain
knowledge, or using statistical methods such as the Z-score or the interquartile range (IQR). The
identification can be done by applying filters to the dataset. In most cases the outliers are not
deleted from the source data; they are only excluded from the final dataset used for analysis.
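A minimal sketch of the IQR-based filter is shown below. The sample vector x is hypothetical illustration data, and the 1.5 multiplier is the conventional threshold rather than one stated in the text.

```r
# Hypothetical sample data with one extreme value.
x <- c(10, 12, 11, 13, 12, 11, 95)

q <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Flag values beyond 1.5 times the IQR from the quartiles.
is_outlier <- x < (q[1] - 1.5 * iqr) | x > (q[2] + 1.5 * iqr)

# Keep the original vector intact; filter only the copy used for analysis.
x_clean <- x[!is_outlier]

print(x[is_outlier])
print(x_clean)
```

Keeping the logical flag separate from the data makes it easy to report which values were excluded.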
Winsorization: In this method, the extreme values are replaced with less extreme values, usually the
closest non-outlying values. This approach keeps every observation in the dataset while reducing
the impact of outliers. It has become a common approach, and many analyses rely on it.
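The replacement step can be sketched as follows, capping each flagged value at the closest non-outlying value. The sample vector and the 1.5 x IQR rule for flagging are assumptions; winsorization is also often done with fixed percentile caps instead.

```r
# Hypothetical sample data with one extreme value.
x <- c(10, 12, 11, 13, 12, 11, 95)

q <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Values within the 1.5 x IQR fences are treated as non-outlying.
inside <- x[x >= q[1] - 1.5 * iqr & x <= q[2] + 1.5 * iqr]

# Pull every value back to the nearest non-outlying value.
x_winsorized <- pmin(pmax(x, min(inside)), max(inside))

print(x_winsorized)
```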
Transformation: Data transformation can be used to reduce the effect of outliers. Common
transformations include logarithmic, square root, or inverse transformation. This approach is useful
when the data is highly skewed, and the transformation can make the distribution more
symmetrical.
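As an illustration, the sketch below applies logarithmic and square root transformations to a hypothetical right-skewed vector. Both transformations assume the values are positive (log) or non-negative (sqrt).

```r
# Hypothetical right-skewed data with one very large value.
x <- c(1, 2, 3, 5, 8, 13, 200)

log_x  <- log(x)   # natural logarithm; requires x > 0
sqrt_x <- sqrt(x)  # square root; requires x >= 0

# Compare how far the largest value sits from the median on each scale.
print(max(x) / median(x))
print(max(log_x) / median(log_x))
print(max(sqrt_x) / median(sqrt_x))
```

On the transformed scales the largest value sits much closer to the median, which is why the extreme value has less influence.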
Imputation: Outliers can also be treated by imputing their values based on statistical methods such
as mean, median, or mode imputation. This method can preserve the sample size and the
distribution of the data.
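A minimal sketch of median imputation is given below. The sample vector and the 1.5 x IQR rule used to flag the outlier are assumptions made for illustration.

```r
# Hypothetical sample data with one extreme value.
x <- c(10, 12, 11, 13, 12, 11, 95)

q <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]
is_outlier <- x < (q[1] - 1.5 * iqr) | x > (q[2] + 1.5 * iqr)

# Replace each flagged value with the median of the non-outlying values.
x_imputed <- x
x_imputed[is_outlier] <- median(x[!is_outlier])

print(x_imputed)
```

The vector keeps its original length, so the sample size is unchanged.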
Robust statistical methods: Robust statistical methods are designed to be less sensitive to outliers.
These methods include the median, trimmed mean, or M-estimators. They can provide more reliable
estimates and reduce the impact of outliers on the results.
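The difference between ordinary and robust estimates can be seen in a small sketch. The sample vector and the 20% trimming fraction are assumptions chosen for illustration.

```r
# Hypothetical sample data with one extreme value.
x <- c(10, 12, 11, 13, 12, 11, 95)

plain_mean   <- mean(x)              # pulled upward by the extreme value
robust_med   <- median(x)            # middle value, barely affected
trimmed_mean <- mean(x, trim = 0.2)  # drops 20% of values from each end first
robust_mad   <- mad(x)               # median absolute deviation, a robust spread

print(c(mean = plain_mean, median = robust_med,
        trimmed = trimmed_mean, mad = robust_mad))
print(sd(x))
```

Here the plain mean is far above the median and the trimmed mean, which stay close to the bulk of the data.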
It is important to note that the choice of method should be based on the nature and extent of
outliers, the type of data, and the research question. Additionally, it is always a good practice to
report the methods used to treat outliers, and to perform sensitivity analyses to assess the
robustness of the results to the different methods.