DADS301 MBA Sem 3 Programming in DS

MBA answer Key.

Uploaded by

Varun Asthana

Varun Asthana

Roll No. 2114501153


Program – Online MBA
Course – Programming in Data Science
Directorate of Online Education
ASSIGNMENT

SESSION 2023
PROGRAM MASTER OF BUSINESS ADMINISTRATION (MBA)
SEMESTER III
COURSE CODE & NAME DADS301 – PROGRAMMING IN DATA SCIENCE
CREDITS 4
NUMBER OF ASSIGNMENTS & MARKS 02, 30 Marks each

Note: Answer all questions. Kindly note that answers to 10-mark questions should be approximately 400 - 450
words. Each question is followed by an evaluation scheme.

Assignment Set – 1

Q.No  Questions  [Marks, Total Marks]

1. Create a vector that contains a sequence of integers between 0 and 9, plus a sequence of 50 numbers
between 10 and 45. [10, 10]
2. What do you mean by descriptive statistics? Write 5 important functions used for calculating the
descriptive stats. [10, 10]
3. Create three vectors of the same length, two of them with discrete random values and one with
continuous random values. Next, create a data frame with these vectors. [10, 10]

Assignment Set – 2

Q.No  Questions  [Marks, Total Marks]

1. Create a for loop that goes through a numeric sequence, computes e to the power of each value and,
if the result is greater than 1000, stores this result in another vector. [10, 10]
2. Create a box and whisker plot for the mpg variable present in the mtcars data set. [10, 10]
3. What do you mean by outlier? Describe some methodology to treat an outlier. [10, 10]
Q1) Create a vector that contains a sequence of integers between 0 and 9, plus a sequence of 50
numbers between 10 and 45.

A1) The input code below includes explanations entered as comments using ‘#’.

Output
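
For illustration, here is a minimal Python sketch of one way to build such a vector. It assumes that "50 numbers between 10 and 45" means 50 evenly spaced values with both endpoints included (similar in spirit to seq(10, 45, length.out = 50) in R). The names int_part, spaced_part and vector are hypothetical and are not taken from the original answer.

```python
# Integers from 0 to 9 (range excludes the stop value, so use 10)
int_part = list(range(0, 10))

# 50 evenly spaced numbers from 10 to 45, endpoints included (an assumption)
count = 50
start, stop = 10, 45
step = (stop - start) / (count - 1)
spaced_part = [start + i * step for i in range(count)]

# Concatenate both parts into a single vector (a Python list here)
vector = int_part + spaced_part

print(len(vector))   # 60 values in total
print(vector[:12])
```

If the intended reading is 50 random numbers in that range instead, the second part would be drawn with a random generator rather than computed from a fixed step.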

Q2) What do you mean by descriptive statistics? Write 5 important functions used for calculating
the descriptive stats.

A2) Descriptive statistics aims at summarizing, describing and presenting the values in a dataset; here
the calculations are done in R. It helps in understanding the data by giving a clear overview, and it is a
good first step before a deep dive into any analysis, especially RCAs. Many measures exist that
summarize a dataset. They are divided into two types:
1. Location measures – these give a good understanding of the central tendency of the data.
2. Dispersion measures – these give a good understanding of the spread of the data.

The important functions for calculating descriptive stats are as follows.

We will be working with the iris data set, so it is loaded first.

The common calculations are:

1# The minimum and maximum can be found with the min() and max() functions.

2# The median can be calculated with the median() function.

3# The mean can be calculated with the mean() function.

4# Quartiles can be calculated with the quantile() function.

5# The standard deviation can be calculated with the sd() function.
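
The five calculations above can be sketched in Python with the standard statistics module. This is an illustrative sketch only: the values list is a hypothetical sample standing in for one numeric column, not the iris data, and the "inclusive" quartile method is an assumption about which quartile convention is wanted.

```python
import statistics

# Hypothetical sample standing in for one numeric column (not the iris data)
values = [4.9, 5.1, 5.4, 5.8, 6.0, 6.3, 6.7, 7.1]

# Location measures
minimum = min(values)
maximum = max(values)
median = statistics.median(values)
mean = statistics.mean(values)

# Quartiles with linear interpolation between data points
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

# Dispersion measure: sample standard deviation (divides by n - 1)
std_dev = statistics.stdev(values)

print(minimum, maximum, median, mean, (q1, q2, q3), std_dev)
```

The minimum, maximum, median, mean and quartiles describe location, while the standard deviation describes dispersion, matching the two types of measure listed earlier.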

Common functions for obtaining descriptive stats are as follows:

1. The summary() function gives several measures tabulated in a single summary:



2. The by() function applies a function to each group. In the example, the summary is requested by
species.

3. With the Hmisc package, we can use the describe() function.



4. With the psych package, descriptive stats can be obtained using its describe() function.

5. The summaryBy() function from the doBy package can be helpful in obtaining descriptive stats by group.
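
To illustrate the idea behind a tabulated summary and a per-group summary, here is a small Python sketch. It does not reproduce the R functions listed above. The rows are hypothetical (species, measurement) pairs rather than actual iris values, and summarize is a hypothetical helper name.

```python
import statistics
from collections import defaultdict

# Hypothetical (species, measurement) pairs; not the actual iris values
rows = [
    ("setosa", 5.0), ("setosa", 5.2), ("setosa", 4.8),
    ("versicolor", 6.0), ("versicolor", 6.4), ("versicolor", 5.6),
]

def summarize(values):
    """Return min, quartiles, mean and max for a list of numbers."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values), "q1": q1, "median": med,
        "mean": statistics.mean(values), "q3": q3, "max": max(values),
    }

# Collect the measurements for each species
groups = defaultdict(list)
for species, value in rows:
    groups[species].append(value)

# One summary per group
by_species = {name: summarize(vals) for name, vals in groups.items()}
for name, stats in by_species.items():
    print(name, stats)
```

Grouping first and summarizing second keeps the summary function independent of how the groups are formed, so the same helper can be reused for the whole column or for any subset.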

Q3) Create three vectors of the same length, two of them with discrete random values and one
with continuous random values. Next, create a data frame with these vectors.
A3) set.seed() in R is used to create reproducible results when writing code that generates random
variables. Setting the seed ensures that the same random values are produced every time the code is
run.

Then we create the random variables:

The first discrete variable has length 10 and is drawn from the integers 1 to 5.

The second discrete variable has length 10 and is drawn from the letters a to e.

The continuous variable is created using rnorm().

The data frame is then defined from the three variables that were created.

Lastly, the data frame is printed.
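
The same steps can be sketched in Python using only the standard library. This is an illustrative sketch under several assumptions: the seed value 123 is arbitrary, the discrete values are drawn with replacement, the continuous values come from a standard normal distribution, and a plain dictionary of equal-length lists stands in for a data frame. All variable names are hypothetical.

```python
import random

# Fixing the seed makes the pseudo-random draws repeatable across runs
random.seed(123)

n = 10
# Discrete variable 1: integers drawn from 1 to 5, with replacement
discrete_num = random.choices(range(1, 6), k=n)
# Discrete variable 2: letters drawn from a to e, with replacement
discrete_chr = random.choices("abcde", k=n)
# Continuous variable: draws from a standard normal distribution
continuous = [random.gauss(0, 1) for _ in range(n)]

# A column-oriented table: each key is a column, all columns have equal length
data_frame = {
    "discrete_num": discrete_num,
    "discrete_chr": discrete_chr,
    "continuous": continuous,
}

# Print the table row by row
for row in zip(*data_frame.values()):
    print(row)
```

Re-running the script with the same seed prints the same rows, which is the reproducibility property described above.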



Q4) Create a for loop that goes through a numeric sequence, computes e to the power of each
value and if the result is greater than 1000 it stores this result in another vector.

A4) The code is as follows:

# The math module provides the exp() function for computing e to the power of a value
import math

# Define the numeric sequence to loop over
sequence = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Define an empty list; results that meet the criterion are stored in this result vector
result_vector = []

# Loop through the sequence
for value in sequence:
    e_power = math.exp(value)
    # Keep only results greater than 1000
    if e_power > 1000:
        result_vector.append(e_power)

# Print the results for the values that meet the criterion
print(result_vector)

Output:
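
Since e to the power x exceeds 1000 only when x is greater than ln(1000), which is about 6.91, the printed list should hold four values: e^7 through e^10 (roughly 1096.6, 2981.0, 8103.1 and 22026.5). A short sketch to cross-check which inputs pass the filter, using the hypothetical names threshold and passing:

```python
import math

# e**x exceeds 1000 exactly when x > ln(1000)
threshold = math.log(1000)
print(round(threshold, 2))  # 6.91

# Values from the sequence 1..10 whose exponential passes the filter
passing = [x for x in range(1, 11) if math.exp(x) > 1000]
print(passing)  # [7, 8, 9, 10]
```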
Q5) Create a box and whisker plot for mpg variable present in mtcars data sets.

A5) To create a box and whisker plot in R, we first load the mtcars dataset.

The boxplot() function is then used to plot the box and whisker plot for the ‘mpg’ variable. This
draws a single box for ‘mpg’, with its values on the y-axis.

The plot produced by the above code is as follows:

Note that the box in the middle of the plot represents the interquartile range (IQR), which runs
from the first quartile Q1 (25th percentile) to the third quartile Q3 (75th percentile) of the mpg
values.

The line inside the box represents the median (50th percentile). The whiskers extend from the box
to the most extreme values that lie within 1.5 times the IQR of the box edges, whereas any points
falling outside this range are considered outliers and are plotted as individual points.
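
The quantities that make up a box and whisker plot can be computed directly, as in the following Python sketch. The data list is a hypothetical sample with one extreme point, not the mtcars mpg values, and the "inclusive" quartile method is an assumption; plotting tools may use slightly different quartile conventions.

```python
import statistics

# Hypothetical sample (not the mtcars mpg values), with one extreme point
data = [10, 12, 14, 15, 16, 18, 20, 21, 22, 24, 40]

# Quartiles via linear interpolation between data points
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box edges
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Points outside the fences are drawn individually as outliers
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

In this sample the value 40 lies above the upper fence, so it would appear as a separate point while the upper whisker stops at 24.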

Another way to draw a box and whisker plot is with the ggplot2 package.



Q6) What do you mean by outlier, describe some methodology to treat an outlier.

A6) An outlier is any value that lies far outside most of the other values in a set of data. Such
values can also be exceptional observations that stand apart from the rest of the sample or
population. To be treated as an outlier, a value needs to differ significantly from the others, and
this should be confirmed via calculations. One graphical representation of an outlier is shown below:

Outliers can seriously distort the parameters being measured, such as the mean and standard
deviation, and can cause errors in the analysis. Here are some common methods to treat outliers
in data analysis:
Identify and remove outliers: This is the simplest approach, where outliers are identified based on a
certain threshold and then removed from the dataset. The threshold can be set based on domain
knowledge, or using statistical methods such as the Z-score or the interquartile range (IQR). The
identification can be done by applying filters to the dataset. In most cases the outliers are not
deleted from the source data; they are only excluded from the final dataset used for analysis.

Winsorization: In this method, the extreme values are replaced with less extreme values, usually the
closest non-outlying values. This approach keeps every observation in the dataset while reducing the
impact of outliers, and it is commonly used in practice.

Transformation: Data transformation can be used to reduce the effect of outliers. Common
transformations include logarithmic, square root, or inverse transformation. This approach is useful
when the data is highly skewed, and the transformation can make the distribution more
symmetrical.

Imputation: Outliers can also be treated by imputing their values based on statistical methods such
as mean, median, or mode imputation. This method preserves the sample size, although replacing
several values with the same number can reduce the variability of the data.

Robust statistical methods: Robust statistical methods are designed to be less sensitive to outliers.
These methods include the median, trimmed mean, or M-estimators. They can provide more reliable
estimates and reduce the impact of outliers on the results.
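
Several of the methods above can be sketched in a few lines of Python. This is an illustrative sketch, not a definitive implementation: the data list is hypothetical, the 1.5 * IQR rule is only one possible threshold, and iqr_bounds is a hypothetical helper name.

```python
import math
import statistics

# Hypothetical sample with one extreme value
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences using the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread

low, high = iqr_bounds(data)

# 1. Removal: keep only the values inside the fences
removed = [x for x in data if low <= x <= high]

# 2. Winsorization: replace each outlier with the nearest non-outlying value
nearest_low, nearest_high = min(removed), max(removed)
winsorized = [min(max(x, nearest_low), nearest_high) for x in data]

# 3. Transformation: a log transform compresses large values
logged = [math.log(x) for x in data]

# 4. Imputation: replace each outlier with the median of the remaining values
fill = statistics.median(removed)
imputed = [x if low <= x <= high else fill for x in data]

print(removed)
print(winsorized)
print(imputed)
```

Removal shortens the dataset, whereas winsorization and imputation keep its length and only change the flagged value, which is one practical reason to prefer them when the sample is small.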

It is important to note that the choice of method should be based on the nature and extent of
outliers, the type of data, and the research question. Additionally, it is always a good practice to
report the methods used to treat outliers, and to perform sensitivity analyses to assess the
robustness of the results to the different methods.
