MCS-226 Data Science and Big Data
This assignment has 10 questions of 8 marks each; answer all questions. The remaining 20 marks
are for the viva voce. You may use illustrations and diagrams to enhance your explanations. Please
go through the guidelines regarding assignments given in the Programme Guide for the format of
presentation.
Q1: What is Exploratory Data Analysis (EDA) and why is it important in the data science workflow? What
are the key components of the data science process?
Q2: Discuss the implications of hypothesis testing results in decision-making. Provide examples of
real-world situations where statistical hypothesis testing is commonly used.
Q3: What is data preprocessing, and why is it a crucial step in the data science workflow? Why is it
important to identify and handle outliers in a dataset during data preprocessing?
Q4: Discuss the significance of the three Vs (Volume, Velocity, Variety) in the context of big data. Provide
examples of each of the three Vs in real-world scenarios. How does MapReduce facilitate parallel
processing of large datasets? Explain the functionality of the Map function in the MapReduce
paradigm with the help of an example.
Q5: Explain the purpose of Apache Hive in the Hadoop ecosystem. How does Spark address limitations of
the traditional MapReduce model?
Q6: Define NoSQL databases and explain the primary motivations behind their development. Provide
examples of scenarios where each type of NoSQL database is suitable.
Q7: How does collaborative filtering contribute to enhancing user experience and engagement in
recommendation systems? Provide examples of industries or platforms where collaborative filtering is
widely used.
Q8: What is a Data Stream Bloom Filter? Explain its primary purpose in data stream processing. Also,
introduce the Flajolet-Martin Algorithm and its role in estimating the cardinality of a data stream.
Q9: Describe the role of link analysis in the PageRank algorithm. How are links between web pages
interpreted in the context of PageRank?
Q10: Explain the concept of decision trees in classification. Provide an example of building and visualizing
a decision tree using R. How can K-means clustering be applied to a dataset in R?