S24 - Bigdata Lab Final 005


Department of Information Engineering Technology (IET)

National Skills University Islamabad

Final Exam Spring-2024


Department: IET Program: BS CS Session: Fall-20 Semester: 4th Date: 6/6/2024
Subject: Big Data Analytics L (SC-213L) Instructor Name: Muhammad Kamran Javed Time Allowed: 90 Min
Student's Name: Asim Ali Registration: F22-bscs-005 Signature:
CLO-1 Understand the fundamental concepts of Big Data and its programming paradigm.
CLO-2 Apply Hadoop/MapReduce Programming, Framework, and Ecosystem.
CLO-3 Express the experimental data in the appropriate format in the form of a LAB report
PLO-1 Engineering Knowledge: An ability to apply knowledge of mathematics, natural science, engineering
technology fundamentals, and engineering technology specialization to defined and applied
engineering technology procedures, processes, systems, and methodologies.
PLO-4 Design/Development of Solution: An ability to design solutions for broadly defined engineering
technology problems and contribute to designing systems, components, or processes to meet
specified needs with appropriate consideration for public health and safety, cultural, societal, and
environmental considerations.
PLO-10 Individual and Teamwork

Question #        1            2            3       4       Total Marks
CLOs addressed    CLO-1        CLO-1        CLO-2   CLO-2
Domain/level      Cognitive/2  Cognitive/2  Phy/4   Phy/4
PLOs addressed    PLO-1        PLO-1        PLO-4   PLO-4
Marks             5            5            10      5       25
Marks Obtained

Instructor Signature:

Note:
• Question understanding is also a part of the exam.
• Write to-the-point answers. There is no need to add extra details to lengthen your answer.
• Return the question paper to the invigilator.

Q1. You are tasked with implementing a Scala program to process a list of student exam scores. Your program should
include the following functionalities:

1. Processing Functions: Implement the following functions to process the list of exam scores:
o calculateAverage(scores: List[Int]): Double: Calculates the average score from the
given list of exam scores.
o countPassFail(scores: List[Int]): (Int, Int): Counts the number of passing and failing
scores in the list.
o findTopScore(scores: List[Int]): Int: Finds the highest score in the list.
2. Pattern Matching: Utilize pattern matching to handle different cases efficiently.
3. Recursive Function: Implement recursive functions where appropriate to iterate through the list.
4. Main Program: Write a main program that demonstrates the usage of the implemented functions. Provide a
sample list of exam scores, and then calculate and display the average score, the number of passing and failing
scores, and the highest score.
********** Best of Luck **********

Evaluation Criteria:

• Correct implementation of processing functions using pattern matching and recursion - 10 marks
• Efficient use of pattern matching to handle different cases - 5 marks
• Successful demonstration of the main program with appropriate output - 10 marks


===================================================
===================================================
Q2. You are given a dataset containing information about students' performance in exams. Your task is to use
PySpark to perform the following operations:

1. Data Loading: Load the dataset into a PySpark DataFrame.


2. Data Exploration: Conduct basic data exploration to understand the structure and content of the dataset.
Display the schema and show the first few rows of the DataFrame.
3. Data Preprocessing: Perform necessary data preprocessing steps, such as handling missing values and data
cleaning.
4. Data Analysis: Utilize PySpark to answer the following questions:
o What is the average score of each subject?
o How many students passed and failed in each subject? (Passing score: >= 50)
o What is the overall pass rate of students across all subjects?
5. Data Visualization: Visualize the results of the analysis using appropriate PySpark visualization tools (e.g.,
matplotlib integration with PySpark).
6. Conclusion: Provide a brief conclusion summarizing the insights gained from the analysis.

Evaluation Criteria:

• Correct loading and exploration of the dataset - 5 marks
• Proper preprocessing of the data including handling missing values and data cleaning - 5 marks
• Accurate analysis of average scores, pass/fail counts, and overall pass rate - 8 marks
• Effective visualization of analysis results - 5 marks
• Clear and concise conclusion summarizing insights gained - 2 marks
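
Before the PySpark answer, it may help to see what steps 3 and 4 compute on a tiny example. The following is a plain-Python sketch, not PySpark; the records, the subject names MATH and PHYSICS, and the choice of mean imputation for missing scores are assumptions made for illustration only.

```python
from statistics import mean

PASS_MARK = 50  # passing score stated in the question (>= 50)

# Hypothetical records; None marks a missing score.
rows = [
    {"Student_ID": 1, "MATH": 80, "PHYSICS": None},
    {"Student_ID": 2, "MATH": 40, "PHYSICS": 60},
    {"Student_ID": 3, "MATH": None, "PHYSICS": 30},
]
subjects = ["MATH", "PHYSICS"]

# Step 3: replace each missing score with the mean of that subject's known scores.
subject_means = {
    s: mean(r[s] for r in rows if r[s] is not None) for s in subjects
}
cleaned = [
    {**r, **{s: subject_means[s] if r[s] is None else r[s] for s in subjects}}
    for r in rows
]

# Step 4: per-subject average, pass/fail counts, and overall pass rate.
averages = {s: mean(r[s] for r in cleaned) for s in subjects}
passed = {s: sum(r[s] >= PASS_MARK for r in cleaned) for s in subjects}
failed = {s: len(cleaned) - passed[s] for s in subjects}

# The overall rate counts every (student, subject) mark, not just one subject.
overall_pass_rate = 100 * sum(passed.values()) / (len(cleaned) * len(subjects))

print(averages, passed, failed, f"{overall_pass_rate:.2f}%")
```

The overall pass rate divides the passes summed over all subjects by the number of students times the number of subjects, and multiplies by 100 so that the value can be printed with a percent sign.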

!pip install pyspark

# Upload Book1.csv into the Colab session.
from google.colab import files

uploaded = files.upload()

from pyspark.sql import SparkSession

# Note: this `sum` is Spark's column aggregate and shadows Python's built-in sum().
from pyspark.sql.functions import mean, col, avg, sum

import matplotlib.pyplot as plt

spark = SparkSession.builder.appName("Student Performance Analysis").getOrCreate()

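# 1. Data loading: read the CSV with a header row and inferred column types.
df = spark.read.csv("Book1.csv", header=True, inferSchema=True)

# 2. Data exploration: print the schema and the first five rows.
print("Schema:")

df.printSchema()

print("First few rows:")

df.show(5)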

# 3. Data preprocessing: fill missing scores with the mean of their subject column.
# df.columns[1:] skips the first column (Student_ID).
mean_scores = df.select([mean(c).alias(c) for c in df.columns[1:]]).collect()[0]

mean_scores_dict = {c: v for c, v in zip(df.columns[1:], mean_scores)}

df = df.fillna(mean_scores_dict)

# 4a. Average score of each subject.
avg_scores = df.agg(*[avg(c) for c in df.columns[1:]])

avg_scores.show()


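# 4b. Number of students passing each subject (score >= 50).
# Despite its name, passed_failed_counts holds only the pass counts.
passed_failed_counts = df.select([sum((col(c) >= 50).cast("int")).alias(c) for c in df.columns[1:]])

# Number of students failing each subject (score < 50).
failed_counts = df.select([sum((col(c) < 50).cast("int")).alias(c) for c in df.columns[1:]])

passed_failed_counts.show()

failed_counts.show()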

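# 4c. Overall pass rate: passes summed over all subject columns (not only the first one),
# divided by the total number of marks (students x subjects), as a percentage.
pass_counts_row = df.select([sum((col(c) >= 50).cast("int")) for c in df.columns[1:]]).collect()[0]

total_passed = 0
for subject_passes in pass_counts_row:
    total_passed += subject_passes

overall_pass_rate = 100 * total_passed / (df.count() * len(df.columns[1:]))

print(f"Overall pass rate: {overall_pass_rate:.2f}%")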

avg_scores_df = avg_scores.toPandas()

plt.bar(avg_scores_df.columns, avg_scores_df.iloc[0])

plt.xlabel("Subjects")

plt.ylabel("Average Score")

plt.title("Average Scores by Subject")

plt.show()

passed_failed_counts_df = passed_failed_counts.toPandas()

failed_counts_df = failed_counts.toPandas()

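# Stack the fail counts on top of the pass counts so that neither series hides the other.
plt.bar(passed_failed_counts_df.columns, passed_failed_counts_df.iloc[0], label="Passed")

plt.bar(failed_counts_df.columns, failed_counts_df.iloc[0], bottom=passed_failed_counts_df.iloc[0], label="Failed")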

plt.xlabel("Subjects")

plt.ylabel("Count")

plt.title("Passed and Failed Counts by Subject")

plt.legend()

plt.show()

print("The analysis provides insights into the average scores, pass rates, and overall performance of students across different subjects.")

Schema:
root
 |-- Student_ID: integer (nullable = true)
 |-- BIGDATA: integer (nullable = true)
 |-- OOPS: integer (nullable = true)
 |-- DSA: integer (nullable = true)
 |-- COAL: integer (nullable = true)
 |-- DBMS: integer (nullable = true)
 |-- MARKETING: integer (nullable = true)

First few rows:
+----------+-------+----+---+----+----+---------+
|Student_ID|BIGDATA|OOPS|DSA|COAL|DBMS|MARKETING|
+----------+-------+----+---+----+----+---------+
|       100|     78|  34| 56|  76|   9|       12|
|       101|     98|  23| 98|  23|  98|       98|
|       102|     45|  12| 45|  45|  45|       45|
|       103|     67|  67| 67|  67|  67|       67|
|       104|     87|  87| 87|  87|  87|       87|
+----------+-------+----+---+----+----+---------+
only showing top 5 rows

+------------+---------+--------+---------+---------+--------------+
|avg(BIGDATA)|avg(OOPS)|avg(DSA)|avg(COAL)|avg(DBMS)|avg(MARKETING)|
+------------+---------+--------+---------+---------+--------------+
|       50.36|    42.04|    49.2|    45.44|    44.16|         43.56|
+------------+---------+--------+---------+---------+--------------+

+-------+----+---+----+----+---------+
|BIGDATA|OOPS|DSA|COAL|DBMS|MARKETING|
+-------+----+---+----+----+---------+
|     16|   7|  9|   7|   7|        7|
+-------+----+---+----+----+---------+

+-------+----+---+----+----+---------+
|BIGDATA|OOPS|DSA|COAL|DBMS|MARKETING|
+-------+----+---+----+----+---------+
|      9|  18| 16|  18|  18|       18|
+-------+----+---+----+----+---------+

Overall pass rate: 0.11%
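
The printed figure of 0.11% does not appear to agree with the two count tables above it. A quick plain-Python check, assuming the first count table holds passes (>= 50) and the second holds fails (< 50), suggests that 0.11 is the BIGDATA pass count alone divided by all marks, shown as a fraction rather than a percentage:

```python
# Per-subject counts copied from the two count tables in the output above.
passed = {"BIGDATA": 16, "OOPS": 7, "DSA": 9, "COAL": 7, "DBMS": 7, "MARKETING": 7}
failed = {"BIGDATA": 9, "OOPS": 18, "DSA": 16, "COAL": 18, "DBMS": 18, "MARKETING": 18}

total_passed = sum(passed.values())                # 53
total_marks = total_passed + sum(failed.values())  # 150 = 25 students x 6 subjects

# Pass rate over every (student, subject) mark, as a percentage.
overall_pass_rate = 100 * total_passed / total_marks

# What the printed 0.11 seems to be: first-subject passes over all marks, as a fraction.
first_column_fraction = passed["BIGDATA"] / total_marks

print(f"Overall pass rate: {overall_pass_rate:.2f}%")              # 35.33%
print(f"BIGDATA passes / all marks: {first_column_fraction:.2f}")  # 0.11
```

On these counts the overall pass rate across all subjects works out to roughly 35.33%.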


The analysis provides insights into the average scores, pass rates, and overall
performance of students across different subjects.

=========================================================
=========================================================
=========================================================
