FDS Apr - May 2024
(Common to: Computer Science and Engineering / Artificial Intelligence and Machine
Learning / Computer Science and Engineering (Cyber Security)/Computer and
Communication Engineering/Electronics and Instrumentation Engineering/
Instrumentation and Control Engineering / Information Technology)
(Regulation 2021)
Time: Three hours    Answer ALL questions    Maximum: 100 marks
PART A — (10 × 2 = 20 marks)
1. How are missing values present in a dataset treated during the data analysis phase?
2. Identify and write down various data analytic challenges faced in the conventional
system.
3. Will treating categorical variables as continuous variables result in a better predictive
model? Justify your answer.
4. Issue: Feeding data which has variables correlated to one another is not a good statistical
practice, since we are giving multiple weightage to the same type of data.
Solution: Correlation Analysis. Show how such issues are prevented by the correlation
analysis technique. Justify with a small instance dataset.
5. State the purpose of adding additional quantitative and/or categorical explanatory
variables to any developed linear regression model. Justify with an example.
6. Give an example of a data set with a non-Gaussian distribution.
7. Under what circumstances is pivot_table() in pandas used?
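A possible answer sketch: pivot_table() is typically used when a long-format table must be summarised along two categorical axes with an aggregate function. The regions, quarters, and sales figures below are illustrative, not from the question.

```python
import pandas as pd

# Hypothetical long-format sales records.
df = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales":   [100, 150, 80, 120],
})

# Rows = region, columns = quarter, each cell = sum of sales
# for that (region, quarter) combination.
table = pd.pivot_table(df, values="sales", index="region",
                       columns="quarter", aggfunc="sum")
```

Here pivot_table() turns repeated (region, quarter) observations into a compact 2-D summary, which is exactly the circumstance it is designed for.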
8. Using appropriate data visualization modules, develop a Python code snippet that
generates a simple sinusoidal wave on empty gridded axes.
9. Write a Python code snippet that generates a time series graph representing COVID-19
incidence cases for a particular week.
Day 1 Day 2 Day 3 Day 4 Day 5 Day 6 Day 7
7 18 9 44 2 5 89
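A possible answer sketch: the daily case counts are taken from the table above; the figure title and filename are illustrative.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend for scripts
import matplotlib.pyplot as plt

days  = ["Day 1", "Day 2", "Day 3", "Day 4", "Day 5", "Day 6", "Day 7"]
cases = [7, 18, 9, 44, 2, 5, 89]

# Line plot with markers gives a simple time-series view of the week.
fig, ax = plt.subplots()
ax.plot(days, cases, marker="o")
ax.set_title("COVID-19 incidence for one week")
ax.set_xlabel("Day")
ax.set_ylabel("New cases")
fig.savefig("covid_week.png")
```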
69898
10. Write a Python code snippet that draws a histogram for the following list of positive
numbers.
7 18 9 44 2 5 89 91 11 6 77 85 91 6 55
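A possible answer sketch using Matplotlib's hist(); the bin count and filename are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend for scripts
import matplotlib.pyplot as plt

values = [7, 18, 9, 44, 2, 5, 89, 91, 11, 6, 77, 85, 91, 6, 55]

# hist() bins the values and draws one bar per bin.
fig, ax = plt.subplots()
counts, bins, patches = ax.hist(values, bins=5, edgecolor="black")
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
fig.savefig("hist.png")
```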
PART B — (5 × 13 = 65 marks)
11. (a). i. Suppose there is a dataset having variables with missing values of more than (6)
30%. How will you deal with such a dataset?
ii. List the various feature selection methods for selecting the right variables (7)
for building efficient predictive models. Explain any two selection
methods.
OR
(b). i. Explain the Data Analytics life cycle. Write briefly about Time-Series Analysis. (6)
ii. Outline the purpose of data cleaning. How are missing and nullified data attributes (7)
handled and modified during the pre-processing stage?
12. (a). i. Indicate whether each of the following distributions is positively or negatively
skewed. The distribution of
(1) incomes of taxpayers has a mean of $48,000 and a median of $43,600. (3)
(2) GPAs for all students at some college has a mean of 3.01 and a median (3)
of 3.20.
ii. During their first swim through a water maze, 15 laboratory rats made the
following number of errors (blind alleyway entrances):
2, 17, 5, 3, 28, 7, 5, 8, 5, 6, 2, 12, 10, 4, 3.
(1) Find the mode, median, and mean for these data. (3)
(2) Without constructing a frequency distribution or graph, would it be (4)
possible to characterize the shape of this distribution as balanced,
positively skewed, or negatively skewed?
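A possible worked sketch for part ii, using the Python standard library on the error counts given above.

```python
from statistics import mean, median, mode

# Errors made by the 15 rats, as listed in the question.
errors = [2, 17, 5, 3, 28, 7, 5, 8, 5, 6, 2, 12, 10, 4, 3]

m_mode   = mode(errors)      # most frequent value: 5 (occurs three times)
m_median = median(errors)    # middle of the sorted list: 5
m_mean   = mean(errors)      # 117 / 15 = 7.8

# mean (7.8) > median (5), so without any graph we can say the
# distribution is positively skewed (a few large values pull the mean up).
```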
OR
(b). i. Assume that SAT math scores approximate a normal curve with a mean of 500
and a standard deviation of 100. Sketch a normal curve and shade in the target
areas described by each of the following statements:
• More than 570 (2)
• Less than 515 (2)
• Between 520 and 540 (2)
• Convert to z scores and find the target areas specific to the above values. (1)
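A possible numerical sketch using the standard library's NormalDist; the upper bound 540 is an assumption (the printed value is garbled), and the probabilities are the shaded target areas.

```python
from statistics import NormalDist

# SAT math scores: normal with mean 500, standard deviation 100.
sat = NormalDist(mu=500, sigma=100)

z_570 = (570 - sat.mean) / sat.stdev      # z = 0.70

p_more_570 = 1 - sat.cdf(570)             # P(X > 570)  ≈ 0.2420
p_less_515 = sat.cdf(515)                 # P(X < 515)  ≈ 0.5596
p_520_540 = sat.cdf(540) - sat.cdf(520)   # P(520 < X < 540) ≈ 0.0761
```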
ii. Assume that the burning times of electric light bulbs approximate a normal
curve with a mean of 1200 hours and a standard deviation of 120 hours. If a
large number of new lights are installed at the same time (possibly along a
newly opened freeway), at what time will
• 1 percent fail? (2)
• 50 percent fail? (2)
• 95 percent fail? (2)
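A possible numerical sketch for the bulb question: each failure percentage is an inverse-normal (percentile) lookup, done here with the standard library's NormalDist.

```python
from statistics import NormalDist

# Bulb lifetimes: normal with mean 1200 hours, standard deviation 120 hours.
life = NormalDist(mu=1200, sigma=120)

t_01 = life.inv_cdf(0.01)   # ≈ 921 hours: by then 1 percent have failed
t_50 = life.inv_cdf(0.50)   # 1200 hours: the mean/median of a normal curve
t_95 = life.inv_cdf(0.95)   # ≈ 1397 hours: by then 95 percent have failed
```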
13. (a). i. In statistics, what is the impact when the goodness-of-fit test score is low? (6)
ii. Given the following dataset of employees, using regression analysis, find the (7)
expected salary of an employee whose age is 45.
Age    : 54     42     49     57     35
Salary : 67000  43000  55000  71000  25000
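A possible answer sketch, assuming the flattened table above reads as the (age, salary) pairs coded below; it fits an ordinary least-squares line with NumPy and predicts the salary at age 45.

```python
import numpy as np

# (age, salary) pairs as assumed from the question's table.
age    = np.array([54, 42, 49, 57, 35], dtype=float)
salary = np.array([67000, 43000, 55000, 71000, 25000], dtype=float)

# Least-squares fit of salary = b0 + b1 * age (degree-1 polynomial).
b1, b0 = np.polyfit(age, salary, 1)

# Expected salary at age 45 (comes out near 47,200 for this data).
pred_45 = b0 + b1 * 45
```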
OR
(b). i. Define auto-correlation. How is it calculated? What does a negative (6)
correlation convey?
ii. What is the philosophy of logistic regression? What kind of model is it? What (7)
does Logistic Regression predict? Tabulate the cardinal differences between Linear
and Logistic Regression.
14. (a). i. Define Dictionary in Python. Do the following operations on dictionaries: (3)
Initialize two dictionaries (D1 and D2) with key-value pairs.
ii. Compare the two dictionaries against a master key list 'M' and print the result. (3)
iii. Find keys that are in D1 but NOT in D2. (3)
iv. Merge D1 and D2 to create D3 using expressions. (4)
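A possible answer sketch covering all four parts; the keys and values in D1, D2, and M are illustrative.

```python
# i. Hypothetical dictionaries with key-value pairs.
D1 = {"a": 1, "b": 2, "c": 3}
D2 = {"b": 20, "d": 40}
M  = ["a", "b", "c", "d", "e"]          # master key list

# ii. Which master keys appear in each dictionary.
in_d1 = [k for k in M if k in D1]       # ['a', 'b', 'c']
in_d2 = [k for k in M if k in D2]       # ['b', 'd']
print(in_d1, in_d2)

# iii. Keys in D1 but NOT in D2, via set difference.
only_d1 = set(D1) - set(D2)             # {'a', 'c'}

# iv. Merge with a dict expression; D2's value wins on duplicate keys.
D3 = {**D1, **D2}
```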
OR
(b). i. How do you create hierarchical data from an existing data frame? (6)
ii. How do you use groupby with two columns in a data set? Give a Python code snippet. (7)
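A possible answer sketch for part ii: grouping on a list of two columns produces one group per unique pair of values. The department/year/marks data is illustrative.

```python
import pandas as pd

# Hypothetical dataset: marks recorded per (department, year) pair.
df = pd.DataFrame({
    "dept":  ["CSE", "CSE", "IT", "IT", "CSE"],
    "year":  [1, 2, 1, 1, 1],
    "marks": [80, 90, 70, 60, 88],
})

# Passing a list of two columns to groupby() keys each group on the
# unique (dept, year) combination; the result has a two-level index.
g = df.groupby(["dept", "year"])["marks"].sum()
```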
15. (a) Write a code snippet that projects our globe as a 2-D flat surface (using a (13)
cylindrical projection) and conveys information about the location of any three
major Indian cities on the map (using a scatter plot).
OR
(b). i. Write a working code that performs a simple Gaussian process regression (6)
(GPR), using the Scikit-Learn API.
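A possible answer sketch for part i: a minimal GPR with an RBF kernel fitted to noise-free samples of a sine curve. The training range, seed, and kernel length scale are illustrative choices.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Noise-free training data drawn from a sine curve.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 5, size=(25, 1))
y_train = np.sin(X_train).ravel()

# Fit the GP; the RBF kernel encodes smoothness of the target function.
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
gpr.fit(X_train, y_train)

# Predict on a grid, also requesting the per-point predictive std. dev.
X_test = np.linspace(0, 5, 50).reshape(-1, 1)
y_pred, y_std = gpr.predict(X_test, return_std=True)
```

With noise-free data the GP interpolates the training points exactly, and y_std shrinks to near zero close to them.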
ii. Briefly explain about visualization with Seaborn. Give an example working (7)
code segment that represents a 2D kernel density plot for any data.
PART C — (1 × 15 = 15 marks)
16. (a). Given an unsorted multi-index that represents the distance between two cities, (15)
write a Python code snippet using appropriate libraries to find the
distance between any two given cities. The following matrix representation can
be used to create the data frame that serves as the input for the prescribed
program.
A B C D E
A 0 30 24 6 13
B 16 0 19 5 10
C 7 16 0 15 12
D 9 17 22 0 18
E 21 8 9 11 0
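A possible answer sketch: build the DataFrame from the matrix above, stack it into a MultiIndex Series, sort the (unsorted) index, and look pairs up with .loc.

```python
import pandas as pd

# Distance matrix from the question (rows = origin, columns = destination).
cities = list("ABCDE")
dist = pd.DataFrame(
    [[0, 30, 24, 6, 13],
     [16, 0, 19, 5, 10],
     [7, 16, 0, 15, 12],
     [9, 17, 22, 0, 18],
     [21, 8, 9, 11, 0]],
    index=cities, columns=cities)

# stack() turns the matrix into a Series keyed by the (origin, destination)
# MultiIndex; sorting the index makes tuple lookups efficient and safe.
pairs = dist.stack().sort_index()

d_ad = pairs.loc[("A", "D")]    # distance from A to D: 6
d_ec = pairs.loc[("E", "C")]    # distance from E to C: 9
```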
OR
(b). A URL server wants to consolidate a history of websites visited by a user 'U'. (15)
Every website visit's information is stored in a 2-tuple format, viz. (website_id,
Duration_of_visit), in the URL cache. Using split, apply, and combine operations,
devise a code snippet that consolidates the website history and finds the
website whose duration of visit is maximum.
Example :
Input: [(4,2), (5,1), (4,3), (1,4), (7,3), (5,2), (1,1), (7,1)]
Output: [(4,5), (5,3), (1,5), (7,4)].
The website with key_id '1' has the max. duration of visit = 5.
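A possible answer sketch using pandas groupby, which is the split-apply-combine idiom: split the visits by website_id, apply a sum over each group's durations, and combine the results back into one Series.

```python
import pandas as pd

# Visit history from the question: (website_id, duration_of_visit) tuples.
visits = [(4, 2), (5, 1), (4, 3), (1, 4), (7, 3), (5, 2), (1, 1), (7, 1)]

df = pd.DataFrame(visits, columns=["website_id", "duration"])

# Split by website_id, apply sum, combine; sort=False keeps first-seen order.
totals = df.groupby("website_id", sort=False)["duration"].sum()

consolidated = list(totals.items())   # [(4, 5), (5, 3), (1, 5), (7, 4)]
max_duration = totals.max()           # 5 (note: ids 4 and 1 tie at 5)
busiest = totals.idxmax()             # first id reaching that maximum
```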