Answer:
Data Visualization is the process of converting raw data into a visual format
such as charts, graphs, or maps, to make the information easier to understand
and analyze.
Explanation and Key Points:
Purpose: To communicate data clearly and efficiently to users using
visual elements.
Usefulness: Helps in identifying patterns, trends, outliers, and
correlations in data.
Applications: Widely used in business intelligence, scientific research,
journalism, and more.
Tools: Common tools include Tableau, Power BI, matplotlib, and plotly.
18. Explain different techniques to deal with unstructured data. (15 Marks)
Answer:
Unstructured data lacks a predefined format, making it difficult to analyze using
traditional methods. However, with the growth of big data and AI, several
techniques have been developed to manage and analyze unstructured data
efficiently.
🔹 1. Natural Language Processing (NLP)
Analyzes and understands human language.
Used for sentiment analysis, keyword extraction, topic modeling.
Example: Analyzing customer reviews or tweets.
🔹 2. Text Mining
Extracts meaningful patterns from textual data.
Includes operations like tokenization, stemming, and clustering.
Tools: NLTK, RapidMiner.
🔹 3. Image Processing
Involves transforming and analyzing image data.
Techniques: Pattern recognition, feature extraction, object detection.
Tools: OpenCV, TensorFlow.
🔹 4. Speech and Audio Analysis
Converts speech to text using ASR (Automatic Speech Recognition).
Analyzes sound features for insights.
Tools: Google Speech API, DeepSpeech.
🔹 5. Video Analytics
Analyzing video frames for detecting motion, faces, actions, etc.
Applications: Surveillance, traffic management.
🔹 6. Data Wrangling and Transformation
Tools like Apache NiFi, Talend, and Informatica help preprocess
unstructured data.
Converts raw data into usable format.
🔹 7. Metadata Extraction
Metadata (data about data) helps in categorizing unstructured files.
Example: File type, size, creation date from a video file.
🔹 8. Use of NoSQL Databases
Databases like MongoDB and Cassandra are designed to handle
unstructured/semi-structured data.
🔹 9. Machine Learning Techniques
Clustering, classification, and anomaly detection algorithms are applied.
Example: Identifying spam emails using ML models.
🔹 10. Data Lakes
Centralized storage to hold all types of data in raw form.
Supports schema-on-read, making it ideal for unstructured formats.
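The text-mining steps named in technique 2 (tokenization followed by counting terms) can be sketched with the standard library alone. This is a minimal illustration, not a full pipeline: the sample reviews, the stop-word list, and the tokenize helper are all invented for the example.

```python
import re
from collections import Counter

# Hypothetical sample reviews, invented for illustration.
reviews = [
    "The delivery was fast and the packaging was great",
    "Great product, fast delivery",
    "Packaging was damaged and delivery was late",
]

# A small hand-picked stop-word list (an assumption; real lists are longer).
STOP_WORDS = {"the", "and", "was", "a", "an"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Tokenize every review, drop stop words, and count the remaining terms.
tokens = [t for r in reviews for t in tokenize(r) if t not in STOP_WORDS]
term_counts = Counter(tokens)

print(term_counts.most_common(3))
```

Libraries such as NLTK add proper stemming and larger stop-word lists on top of this basic idea.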
20. What do you think are the challenges with unstructured data?
Answer:
Unstructured data poses several challenges due to its complexity and lack of
format.
Key Challenges:
1. Difficult to Store and Organize – No predefined schema makes storage
non-standardized.
2. Complex to Analyze – Requires advanced tools like NLP, ML, and deep
learning.
3. High Storage Requirements – Includes large media files (images, videos).
4. Scalability Issues – Traditional databases aren’t suitable.
5. Security and Privacy – Harder to monitor and protect due to scattered
nature.
25. Data visualization tools provide an accessible way to see and understand
which of the following in data?
Answer: d. All of the above
Explanation:
Visualization tools help users easily detect:
Outliers – Unusual data points
Trends – Directional movements
Patterns – Repetitions and regularities
26. Which method shows hierarchical data in a nested format?
Answer: a. Treemaps
Explanation:
Treemaps display hierarchical (tree-structured) data as nested
rectangles.
Each branch is represented by a rectangle, and sub-branches as smaller
rectangles.
28. What does deleting grid lines in a table and the horizontal lines in a chart
do?
Answer: b. Increase data-ink ratio
Explanation:
Data-ink ratio = data ink / total ink, i.e. the share of a graphic's ink that displays actual data.
Removing non-essential elements like grid lines improves focus on data.
31. What is the data visualization tool that updates in real time and gives
multiple outputs?
Answer: a. Dashboard
Explanation:
A dashboard is a dynamic data visualization tool that aggregates real-
time data from multiple sources and displays it interactively.
It can include charts, tables, maps, KPIs, etc.
Dashboards are widely used in business analytics, operations, and
monitoring systems.
Example tools: Tableau, Power BI, Google Data Studio.
37. Which of the following does not come under structured data?
Answer: d. Email
Explanation:
Emails contain free-form text, attachments, and variable structure →
hence unstructured.
Structured data includes organized sources such as Oracle database tables and Excel spreadsheets.
38. Which of the following does not come under unstructured data?
Answer: a. XML
Explanation:
XML is semi-structured – it has tags and a hierarchical format.
WhatsApp chats and Facebook posts are unstructured.
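A short sketch of why XML counts as semi-structured: the tags name each field, yet records need not share the same fields. The XML snippet below is invented for illustration; only the standard-library parser is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML snippet, invented for illustration.
xml_text = """
<orders>
  <order id="1"><customer>Asha</customer><amount>250</amount></order>
  <order id="2"><customer>Ravi</customer><amount>100</amount><note>Gift wrap</note></order>
</orders>
"""

root = ET.fromstring(xml_text)

# Each <order> becomes a dict; tags give the field names, but the set of
# fields can differ from record to record (the second order has a <note>).
records = [
    {"id": order.get("id"), **{child.tag: child.text for child in order}}
    for order in root.findall("order")
]
print(records)
```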
47. In Python, which of the following is used to remove the rows containing
empty values in cells in a DataFrame (df)?
Answer: b. df.dropna()
Explanation:
df.dropna() removes any row with NaN (missing) values.
Essential for data cleaning and preprocessing before analysis or model
training.
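A minimal sketch of dropna() on invented data. The column names and values are assumptions made for the example; axis and subset are standard dropna() parameters.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Score": [90.0, np.nan, 75.0, 60.0],
    "City": ["Pune", "Delhi", None, "Kolkata"],
})

rows_clean = df.dropna()                   # drop rows with at least one missing value
cols_clean = df.dropna(axis=1)             # drop columns with at least one missing value
score_clean = df.dropna(subset=["Score"])  # only look at the "Score" column

print(rows_clean)
```

dropna() returns a new DataFrame, so df itself still has four rows afterwards.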
48. Often, some of the entries in a DataFrame contain “NaN”. NaN stands for
____.
Answer: a. Not a Number
Explanation:
NaN is used to represent missing or undefined values in pandas.
These are typically encountered in real-world datasets where not all
information is available.
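The following sketch shows two properties of NaN that matter when cleaning data. It uses plain Python floats; pandas offers isna() for the same kind of check on a DataFrame or Series.

```python
import math

nan = float("nan")

# NaN is a floating-point value that compares unequal to everything,
# including itself, so equality checks cannot be used to find it.
print(nan == nan)         # False
print(math.isnan(nan))    # True

# A NaN propagates through arithmetic.
total = 1.0 + nan
print(math.isnan(total))  # True
```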
50. Which of the following functions is used to read the data from a “.txt” file
into DataFrame?
Answer: d. read_table()
Explanation:
read_table() is used to load data from a .txt file into a pandas
DataFrame.
Assumes tab-separated values by default, though the delimiter can be
customized.
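A small sketch of read_table() with the default and a custom delimiter. To keep the example self-contained, io.StringIO objects stand in for real .txt files; the file contents are invented.

```python
import io
import pandas as pd

# In-memory stand-ins for .txt files (hypothetical contents).
tab_text = "name\tmarks\nAsha\t82\nRavi\t91\n"
pipe_text = "name|marks\nAsha|82\nRavi|91\n"

# The default separator is a tab character.
df_tab = pd.read_table(io.StringIO(tab_text))

# A different delimiter can be supplied through sep.
df_pipe = pd.read_table(io.StringIO(pipe_text), sep="|")

print(df_tab)
print(df_tab.equals(df_pipe))
```

With a real file, the first argument would simply be the path, for example pd.read_table("data.txt"), where data.txt is a placeholder name.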
55. In Python, iloc[] method is used to display the ____ from a Series.
Answer: c. Values
Explanation:
iloc[] selects by integer position (starting at 0) and is used to access the values at
specific positions in a Series or DataFrame.
Example: df.iloc[0] gives the first row.
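A brief sketch of iloc[] on a Series, with loc[] shown for contrast. The Series contents are invented for the example.

```python
import pandas as pd

# Hypothetical Series with string labels.
s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

first = s.iloc[0]       # value at position 0
last = s.iloc[-1]       # value at the last position
middle = s.iloc[1:3]    # positions 1 and 2 (the end position is excluded)
by_label = s.loc["b"]   # loc selects by label instead of position

print(first, last, by_label)
print(middle.tolist())
```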
56. In Python, which of the following methods is to remove a column from a
DataFrame?
Answer: d. drop()
Explanation:
df.drop('column_name', axis=1) removes the specified column.
Commonly used in feature selection and data cleaning.
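A minimal sketch of removing columns with drop(). The DataFrame is invented; both call forms shown are standard pandas.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

without_b = df.drop("B", axis=1)          # axis=1 means columns
without_ac = df.drop(columns=["A", "C"])  # equivalent keyword form

# drop() returns a new DataFrame; the original is unchanged.
print(list(without_b.columns))
print(list(df.columns))
```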
58. After grouping data on ‘Region’ from a DataFrame, the first() method will
show _______.
Answer: b. The first entry from each group
Explanation:
groupby('Region').first() returns the first non-null value of each column within each
group, which is the group's first row when no values are missing.
59. If ‘df’ is a DataFrame having data columns ‘Segment’ and ‘Region’ among
other columns, the code line “RegSeg = df.groupby([“Region”,”Segment”])” is
an example of _____.
Answer: a. Nested group
Explanation:
Grouping by multiple columns is called multi-level or nested grouping.
It allows more granular aggregation and analysis.
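A short sketch of nested grouping followed by an aggregation. The sales figures are invented; the result is indexed by both grouping columns.

```python
import pandas as pd

# Hypothetical sales data, invented for illustration.
df = pd.DataFrame({
    "Region": ["East", "East", "West", "West", "East"],
    "Segment": ["Consumer", "Corporate", "Consumer", "Consumer", "Consumer"],
    "Sales": [100, 200, 150, 50, 25],
})

RegSeg = df.groupby(["Region", "Segment"])

# Total sales for every (Region, Segment) combination that occurs in the data.
totals = RegSeg["Sales"].sum()
print(totals)
```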
60. Which of the following is not true for the pandas dataframe.corr()
method?
Answer: d. None of the above
Explanation:
All the listed statements are true:
o Computes pairwise correlation.
o Ignores non-numeric columns (the default before pandas 2.0; from 2.0 onward, pass
numeric_only=True to get this behaviour).
o Skips NaN values by default.
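A small sketch of corr() on invented data. It assumes pandas 1.5 or newer, where corr() accepts numeric_only; passing it explicitly keeps the text column out of the calculation regardless of the version's default.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y = 2x, z = 5 - x with one missing value, plus a text column.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "z": [4.0, 3.0, np.nan, 1.0],
    "label": ["a", "b", "c", "d"],
})

# Pairwise Pearson correlation over the numeric columns only.
# The row where z is NaN is skipped for every pair that involves z.
corr = df.corr(numeric_only=True)
print(corr)
```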
61. Pandas DataFrame all() method returns ____ if all values in each
row/column are True except one.
Answer: c. False
Explanation:
The all() method checks whether all values in a row or column are True.
If even one value is False, it returns False.
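A minimal sketch of all() along each axis, using an invented boolean DataFrame in which one value is False.

```python
import pandas as pd

df = pd.DataFrame({
    "p": [True, True, True],
    "q": [True, False, True],
})

by_column = df.all()      # default axis=0: one result per column
by_row = df.all(axis=1)   # one result per row

print(by_column)
print(by_row.tolist())
```

Column q and the middle row each contain the single False value, so both report False.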
69. Group the data by “qualified” and display the first record/row of each
group
import pandas as pd
data = {
"Age": [23, 26, 21, 22, 20, 19, 27],
"Qualified": [True, False, False, False, False, True, True]
}
df = pd.DataFrame(data)
grouped = df.groupby('Qualified').first()
print(grouped)
Output:
Age
Qualified
False 26
True 23
73. Which of the following is not a visualization tool in the Python library?
Answer: a. Numpy
Explanation:
Numpy is for numerical computation, not visualization.
matplotlib, seaborn, plotly are for visualization.
74. In the statement given below, what does the value 10 indicate?
fig = plt.figure(figsize=(10,8))
Answer: c. Width
Explanation:
figsize=(width, height) → 10 is the width in inches.
79. Which of the following is not part of the “Plotly” main module?
Answer: b. Seaborn
Explanation:
Seaborn is a separate visualization library, not a module in Plotly.
81. Which of the following gives the statistical summary of the data?
Answer: d. Boxplot
Explanation:
Boxplots show min, Q1, median, Q3, max, and outliers, giving a
summary.
89. Which function can be used to read the ‘.txt’ file using pandas?
Answer: c. read_table
94. What is the correct way to set the boolean variable myVar to false?
Answer: Correct syntax:
myVar = False
None of the options were fully correct.
104. Write a python program to draw a boxplot using the data set
1,1,2,2,3,3,4,4,6,7,8,10, 11,14, 15,20,23
import matplotlib.pyplot as plt
data = [1,1,2,2,3,3,4,4,6,7,8,10,11,14,15,20,23]
plt.boxplot(data)
plt.title("Boxplot")
plt.show()
105. The data given below represents the number of visualisation books sold
at different shops in Durgapur. Draw a boxplot and mention the related
statistics. 42, 41, 43, 43,43, 45, 47, 48, 50, 50.
import matplotlib.pyplot as plt
data = [42, 41, 43, 43, 43, 45, 47, 48, 50, 50]
plt.boxplot(data)
plt.title("Books Sold - Boxplot")
plt.show()
Stats:
Min = 41
Q1 ≈ 43
Median = 44
Q3 ≈ 48
Max = 50
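The statistics above can be checked with the standard library. This is a sketch under one assumption: quartiles are computed by linear interpolation between data points (method="inclusive"), which is taken here to correspond to what the boxplot draws. Other quartile conventions can give slightly different values.

```python
import statistics

data = [42, 41, 43, 43, 43, 45, 47, 48, 50, 50]

# method="inclusive" treats the data as the whole population and
# interpolates linearly between neighbouring values.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are conventionally drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(min(data), q1, median, q3, max(data))  # 41 43.0 44.0 47.75 50
print(outliers)                              # []
```

Q3 comes out as 47.75 under this convention, which rounds to the 48 quoted above, and no point lies outside the fences.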
106. Suppose the “iris.csv” data set is available on your computer. Group
the data by “species” and plot a pie chart showing the count of each
species.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("iris.csv")
species_counts = df['species'].value_counts()
plt.pie(species_counts, labels=species_counts.index, autopct='%1.1f%%')
plt.title("Species Distribution")
plt.show()
107. Suppose the “iris.csv” data set is available on your computer. Plot a
scatter plot of the data showing “species”, with “sepal_length” on the x-axis and
“sepal_width” on the y-axis.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("iris.csv")
# hue colours each point by its species
sns.scatterplot(data=df, x='sepal_length', y='sepal_width', hue='species')
plt.show()
108. Write a python program to draw a Treemap using the data of your
choice.
import matplotlib.pyplot as plt
import squarify
data = [500, 250, 100, 75, 50]
labels = ["A", "B", "C", "D", "E"]
squarify.plot(sizes=data, label=labels)
plt.axis('off')
plt.title("Treemap Example")
plt.show()
109. Write a python program to draw a Histogram using the data of your
choice and 10 bins with interval 5.
import matplotlib.pyplot as plt
data = [12, 15, 21, 25, 27, 30, 35, 39, 40, 45, 50, 52, 55, 59]
# Explicit bin edges 10, 15, ..., 60 give exactly 10 bins of width 5
plt.hist(data, bins=range(10, 61, 5))
plt.title("Histogram")
plt.show()
110. Suppose you are a system analyst in an airline company. You are asked
to provide a Line chart showing the year-wise variations of passengers in the
airline since 2015. Write a python code for that. (Assume suitable data).
import matplotlib.pyplot as plt
years = [2015, 2016, 2017, 2018, 2019, 2020]
passengers = [120, 150, 170, 160, 200, 180]
plt.plot(years, passengers, marker='o')
plt.title("Airline Passengers by Year")
plt.xlabel("Year")
plt.ylabel("Passengers (in thousands)")
plt.grid()
plt.show()
111. Consider the following code and draw the output graph.
import math
import numpy as np  # required for np.arange, np.sin, np.cos; missing from the question's listing
import plotly.graph_objs as go
xpoints = np.arange(0, math.pi*2, 0.05)
ypoints = np.sin(xpoints)
ypoints1 = np.cos(xpoints)
trace0 = go.Scatter(x=xpoints, y=ypoints, name='Sine')
trace1 = go.Scatter(x=xpoints, y=ypoints1, name='Cos')
data = [trace0, trace1]
layout = go.Layout(title="Sine Wave", xaxis={'title':'Angle'}, yaxis={'title':'Sine'})
fig = go.Figure(data, layout)
fig.show()
Output: a figure titled "Sine Wave" with two curves over 0 to 2π, one sine curve
("Sine") and one cosine curve ("Cos"); the x-axis is labelled "Angle" and the y-axis "Sine".
113. Consider the following code and draw the output graph.
import plotly.express as px
# Sup_store is the Superstore DataFrame (read from "Superstore.xlsx" elsewhere in these notes)
Seg_class = Sup_store.groupby(['Segment','Category'], as_index=False).agg({'Sales':'sum','Profit':'sum'})
fig = px.scatter(Seg_class, x='Sales', y='Profit', color='Segment')
fig.show()
Output: a scatter plot with one point per (Segment, Category) pair, total Sales on the
x-axis and total Profit on the y-axis, with points coloured by Segment.
114. Consider the following code and draw the output graph.
import plotly.graph_objs as go
branches = ['CS','CSE','DS','AIML','ECE']
fy = [63,127,58,29,124]
sy = [68,148,63,33,123]
ty = [65,144,61,30,121]
trace1 = go.Bar(x=branches, y=fy, name='First Year')
trace2 = go.Bar(x=branches, y=sy, name='Second Year')
trace3 = go.Bar(x=branches, y=ty, name='Third Year')
layout = go.Layout(barmode='group')
fig = go.Figure(data=[trace1, trace2, trace3], layout=layout)
fig.show()
Output: a grouped bar chart with the five branches on the x-axis and, for each branch,
three side-by-side bars for First Year, Second Year, and Third Year.
# Histogram
import pandas as pd
import matplotlib.pyplot as plt
Sup_store = pd.read_excel("Superstore.xlsx")
furniture = Sup_store['Category'] == 'Furniture'   # boolean mask for Furniture rows
region_sales = Sup_store[furniture]['Sales']
plt.hist(region_sales, bins=range(10, 2001, 100))
plt.title("Region-wise Sales")
plt.legend(['Bar'])
plt.show()
# Line chart (reuses the imports above)
df = pd.read_csv("flights.csv")
plt.plot(df['year'], df['passengers'])
plt.title("Passengers by Year")
plt.legend(['Line'])
plt.show()
117. Read OlympicAthletes.xlsx and plot bar chart of total medals (≥200):
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel("OlympicAthletes.xlsx")
grouped = df.groupby('Country')['Total Medals'].sum().reset_index()
filtered = grouped[grouped['Total Medals'] >= 200].sort_values(
    by='Total Medals', ascending=False)