Data Data Data

The document provides an overview of data visualization, its principles, and classifications of data types including structured, semi-structured, and unstructured data. It discusses the merits and demerits of data visualization, key Gestalt principles, and various techniques for handling unstructured data. Additionally, it covers exploratory data analysis (EDA) and the role of data visualization tools in simplifying data comprehension.

Uploaded by

chandasonai650
Copyright
© All Rights Reserved

1. What do you understand about Data Visualization?

Answer:
Data Visualization is the process of converting raw data into a visual format
such as charts, graphs, or maps, to make the information easier to understand
and analyze.
Explanation and Key Points:
 Purpose: To communicate data clearly and efficiently to users using
visual elements.
 Usefulness: Helps in identifying patterns, trends, outliers, and
correlations in data.
 Applications: Widely used in business intelligence, scientific research,
journalism, and more.
 Tools: Common tools include Tableau, Power BI, matplotlib, and plotly.

2. What is the difference between Data Visualization and Infographics?


Answer:
While both serve to communicate information visually, they differ in purpose,
content, and style.
Feature | Data Visualization | Infographics
Purpose | To analyze and interpret data | To educate or inform a general audience
Based on | Real-time or historical datasets | Curated facts and figures
Flexibility | Often interactive and real-time | Usually static and image-based
Detail Level | In-depth, often with complex data | Simplified, visually rich

3. Merits and Demerits of Data Visualization


Merits:
1. Simplifies complex data – Makes huge volumes of data easily
understandable.
2. Faster decision-making – Visual cues allow quicker insights.
3. Highlights patterns and outliers – Helps in discovering correlations.
4. Enhances storytelling – Useful in reports and presentations.
Demerits:
1. Risk of misinterpretation – Poorly designed visuals can mislead.
2. Data distortion – Incorrect scales or chart types can skew the message.
3. Over-dependence on tools – May hide errors if the tool is used without
understanding.
4. Limited context – Visuals often lack detailed background or rationale.

4. Principles of Gestalt’s Visual Perception


Answer:
Gestalt principles describe how people perceive and organize visual elements
into unified wholes.
Key Principles:
1. Figure-Ground: Differentiating the object (figure) from the background
(ground).
2. Proximity: Objects close to each other are perceived as a group.
3. Similarity: Similar objects are perceived to be part of the same group.
4. Closure: Minds fill in missing information to complete a visual shape.
5. Continuity: Eye is drawn along lines and curves, perceiving a flow.

5. Why is Gestalt's principle important?


Answer:
Gestalt’s principles are crucial in design and data visualization because they
align with how humans naturally interpret visual information.
Importance:
 Improves Clarity: Ensures visuals are interpreted correctly.
 Enhances Grouping: Helps viewers to associate data elements
meaningfully.
 Increases Engagement: Aids in intuitive design, keeping user attention.
 Prevents Miscommunication: Reduces the chance of misunderstanding
visual cues.

6. Explain the figure-ground principle in visual design.


Answer:
The figure-ground principle is a core concept of Gestalt psychology that
explains how people visually distinguish an object (figure) from its surrounding
background (ground).
Key Points:
 Visual Separation: Helps users focus on the key information (figure)
while ignoring the irrelevant background.
 Contrast is Crucial: High contrast between the figure and ground
improves readability.
 Data Visualization Application: Charts often use strong colors for data
points (figures) on white or neutral backgrounds (ground).
 Avoiding Confusion: Proper figure-ground design prevents clutter and
enhances visual clarity.

7. Explain the principle of proximity in visual design.


Answer:
The principle of proximity suggests that objects placed close together are
perceived as a group.
Key Points:
 Spatial Grouping: Items near each other are seen as related, even if they
differ in shape or color.
 Chart Example: In a bar chart, clustered bars can signify a grouped
category.
 Hierarchy Formation: Proximity helps users understand grouping and
hierarchy of data.
 Enhances Readability: Proper spacing makes information more intuitive.

8. Explain the principle of similarity in visual design.


Answer:
The similarity principle explains that elements that look alike are perceived to
be part of the same group.
Key Points:
 Attributes: Similarity can be based on color, shape, size, or orientation.
 Helps Classification: Viewers intuitively classify similar data elements.
 Data Visualization: Line colors in a graph representing different
categories use this principle.
 Aids Pattern Recognition: It simplifies complex visuals by highlighting
similar trends.

9. Explain the principle of closure in visual design.


Answer:
Closure refers to the human tendency to perceive a complete image even
when parts are missing.
Key Points:
 Visual Completion: The mind fills gaps in shapes or patterns to create a
whole image.
 Used in Minimalist Design: Simple visuals still convey complete
information.
 Example: A broken circle in a diagram is still seen as a circle.
 User Efficiency: Closure improves information processing speed.

10. Explain the principle of continuity in visual design.


Answer:
Continuity refers to the eye’s tendency to follow lines or curves in a predictable
path.
Key Points:
 Smooth Flow: The eye prefers to see continuous movement or direction.
 Data Charts: Line graphs use continuity to show trends over time.
 Reduces Confusion: Prevents distraction by guiding user focus naturally.
 Improves Visual Storytelling: Allows users to see logical progression in
data.

11. Classify digital data.


Answer:
Digital data can be classified into three major categories based on format and
structure.
Types:
1. Structured Data:
o Organized into rows and columns.
o Stored in databases, spreadsheets.
o Examples: SQL databases, Excel files.
2. Semi-Structured Data:
o Has some organizational properties but not rigid structure.
o Examples: JSON, XML, email headers.
3. Unstructured Data:
o No predefined format or organization.
o Examples: Images, videos, audio, social media posts.

12. Illustrate structured data.


Answer:
Structured data is well-organized and easily searchable using SQL or similar
query languages.
Example Illustration:
Employee_ID | Name | Department | Salary
101 | Arjun | HR | ₹30,000
102 | Priya | IT | ₹45,000
Characteristics:
 Stored in relational databases.
 Uses schemas and defined data types.
 Easy to retrieve, sort, and filter.

13. Define unstructured data with examples.


Answer:
Unstructured data refers to data that lacks a predefined format or
organization, making it difficult to analyze with traditional tools.
Examples:
 Text documents (Word files, PDFs)
 Social media content (Tweets, Facebook posts)
 Multimedia (images, audio, video)
 Webpage content (the free-form text and media; the HTML markup itself is semi-structured)
Key Characteristics:
 Difficult to store in relational databases.
 Often large in volume (big data).
 Requires AI, ML, or NLP for analysis.

14. What do you mean by semi-structured data?


Answer:
Semi-structured data lies between structured and unstructured data. It doesn’t
follow the strict data model of structured data but still includes tags or markers
to separate elements.
Examples:
 JSON files
 XML documents
 Email metadata
Characteristics:
 Not stored in tables but has identifiable structure.
 Easier to analyze than unstructured data.
 Frequently used in web applications and APIs.

15. Write the merits and demerits of structured data.


Merits:
1. Easy Storage and Retrieval: Can be stored in relational databases with
indexing.
2. High Accuracy: Structured data is often validated and cleansed.
3. Ease of Analysis: Traditional analytics tools like SQL work well.
4. Supports Automation: Enables process automation in reporting.
Demerits:
1. Limited Flexibility: Rigid schema may not support complex or varied
data.
2. Difficult to Scale: As the size grows, managing schema becomes
complex.
3. Not Suited for Rich Media: Cannot handle video, images, or free-form
text.
4. Manual Data Entry Risk: Error-prone if not automated.

16. Classify the attributes of structured data. (15 Marks)


Answer:
Structured data is highly organized and typically resides in relational databases
or spreadsheets. The attributes (or features) of structured data can be
classified into several types based on their role, behavior, and utility in data
analysis.
🔹 1. Descriptive Attributes
 Describe the basic properties or features of an entity.
 Example: Employee Name, Product Category, City.
🔹 2. Identifier Attributes (Primary Keys)
 Uniquely identify each record.
 These attributes must be unique and non-null.
 Example: Roll Number, Employee ID.
🔹 3. Categorical Attributes
 Represent discrete categories or groups.
 Can be nominal or ordinal.
 Examples:
o Nominal: Gender, Blood Type.
o Ordinal: Customer Ratings (Low, Medium, High).
🔹 4. Numerical Attributes
 Take numerical values and are often used for mathematical operations.
 Subtypes:
o Discrete: Countable (e.g., Number of Orders)
o Continuous: Measurable (e.g., Temperature, Salary)
🔹 5. Temporal Attributes
 Related to time or date.
 Useful in time-series analysis.
 Examples: Date of Birth, Transaction Time.
🔹 6. Derived Attributes
 Not stored directly, but derived from other attributes.
 Example: Age (derived from Date of Birth), Profit (Revenue - Cost)
🔹 7. Binary Attributes
 Have only two states: True/False or Yes/No.
 Example: Qualified (Yes/No), Active Account (True/False)
🔹 8. Textual Attributes
 Contain free-form text but are stored in structured columns.
 Example: Remarks, Comments.
🔹 9. Foreign Keys
 Serve as a link between tables using primary keys from another table.
 Help maintain referential integrity in databases.
✅ Summary
Attribute Type | Example | Nature
Descriptive | Name, City | Textual
Identifier | Roll No, Employee ID | Unique
Categorical (Nominal) | Gender | Discrete
Categorical (Ordinal) | Rank (High, Medium, Low) | Ordered
Numerical | Salary, Age | Continuous
Temporal | Date of Joining | Time-based
Derived | Age (from DOB) | Calculated
Binary | Yes/No, True/False | Boolean
These attributes help in data modeling, data preprocessing, and designing
efficient queries.
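
As an illustration of the attribute types above, here is a minimal pandas sketch. The table, its column names, and all values are hypothetical; they only show how each attribute type might be represented, not a prescribed schema.

```python
import pandas as pd

# Hypothetical employee table; every name and value is illustrative.
employees = pd.DataFrame({
    "employee_id": [101, 102, 103],                      # identifier
    "name": ["Arjun", "Priya", "Meera"],                 # descriptive / textual
    "rating": pd.Categorical(
        ["Low", "High", "Medium"],
        categories=["Low", "Medium", "High"],
        ordered=True,
    ),                                                   # ordinal categorical
    "salary": [30000.0, 45000.0, 52000.0],               # continuous numerical
    "joined": pd.to_datetime(
        ["2020-01-15", "2019-06-01", "2021-03-10"]
    ),                                                   # temporal
    "active": [True, False, True],                       # binary
})

# Identifier attributes must be unique and non-null.
ids_are_valid = (
    employees["employee_id"].is_unique
    and employees["employee_id"].notna().all()
)

# Derived attribute: computed from another column instead of being stored.
employees["joining_year"] = employees["joined"].dt.year

# Ordered categories support comparisons such as "greater than Low".
above_low = employees[employees["rating"] > "Low"]

print(employees.dtypes)
print(above_low[["name", "rating"]])
```

Declaring the rating column as an ordered categorical is what makes the comparison meaningful; a plain string column would compare alphabetically instead.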

17. Write short notes on the characteristics of the semi-structured data.


Answer:
Semi-structured data has elements of both structured and unstructured
formats, making it flexible and widely used in web and application data.
Characteristics:
 Flexible Schema: Structure is not fixed but consistent enough for
parsing.
 Tagged Elements: Uses tags (like <name>, { "age": 25 }) to organize data.
 Self-Describing: Each data object carries information about its structure.
 Supports Nested Data: Can hold complex and hierarchical data.
 Examples: JSON, XML, Email messages with metadata.
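
The characteristics above can be seen in a small JSON record. This is only a sketch using Python's standard json module; the record and its field names are made up for illustration.

```python
import json

# Hypothetical record: the keys act as tags, so the data describes itself.
raw = """
{
  "name": "Priya",
  "age": 25,
  "skills": ["SQL", "Python"],
  "address": {"city": "Durgapur"}
}
"""

record = json.loads(raw)

# Nested (hierarchical) values are reached by following the keys.
city = record["address"]["city"]

# The schema is flexible: a field may be absent, so read it with a default.
phone = record.get("phone", "not provided")

print(city, record["skills"], phone)
```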

18. Explain different techniques to deal with unstructured data. (15 Marks)
Answer:
Unstructured data lacks a predefined format, making it difficult to analyze using
traditional methods. However, with the growth of big data and AI, several
techniques have been developed to manage and analyze unstructured data
efficiently.
🔹 1. Natural Language Processing (NLP)
 Analyzes and understands human language.
 Used for sentiment analysis, keyword extraction, topic modeling.
 Example: Analyzing customer reviews or tweets.
🔹 2. Text Mining
 Extracts meaningful patterns from textual data.
 Includes operations like tokenization, stemming, and clustering.
 Tools: NLTK, RapidMiner.
🔹 3. Image Processing
 Involves transforming and analyzing image data.
 Techniques: Pattern recognition, feature extraction, object detection.
 Tools: OpenCV, TensorFlow.
🔹 4. Speech and Audio Analysis
 Converts speech to text using ASR (Automatic Speech Recognition).
 Analyzes sound features for insights.
 Tools: Google Speech API, DeepSpeech.
🔹 5. Video Analytics
 Analyzing video frames for detecting motion, faces, actions, etc.
 Applications: Surveillance, traffic management.
🔹 6. Data Wrangling and Transformation
 Tools like Apache NiFi, Talend, and Informatica help preprocess
unstructured data.
 Converts raw data into usable format.
🔹 7. Metadata Extraction
 Metadata (data about data) helps in categorizing unstructured files.
 Example: File type, size, creation date from a video file.
🔹 8. Use of NoSQL Databases
 Databases like MongoDB and Cassandra are designed to handle
unstructured/semi-structured data.
🔹 9. Machine Learning Techniques
 Clustering, classification, and anomaly detection algorithms are applied.
 Example: Identifying spam emails using ML models.
🔹 10. Data Lakes
 Centralized storage to hold all types of data in raw form.
 Supports schema-on-read, making it ideal for unstructured formats.
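
To make the text mining step concrete, the sketch below tokenizes a few review sentences and counts word frequencies using only the standard library. The reviews and the tiny stop-word list are assumptions for illustration; real projects would typically use a library such as NLTK.

```python
import re
from collections import Counter

# Hypothetical customer reviews standing in for unstructured text.
reviews = [
    "Great battery life, great screen!",
    "Battery drains fast. Screen is great though.",
]

# A tiny illustrative stop-word list; real projects use a fuller one.
STOP_WORDS = {"is", "the", "a", "though"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


tokens = [
    token
    for review in reviews
    for token in tokenize(review)
    if token not in STOP_WORDS
]
frequencies = Counter(tokens)

print(frequencies.most_common(3))
```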

19. In which category (structured, semi-structured or unstructured) will you place a webpage?
Answer:
A webpage is best classified as semi-structured data.
Explanation and Points:
 Webpages contain structured parts like metadata (title, author, date) and
semi-structured or unstructured content like text, images, and
embedded videos.
 Often coded in HTML/XML, which provides a partial structure.
 Elements are tagged and can be parsed, but content varies widely.
 Tools like web scrapers and parsers can extract meaningful data due to
the underlying markup structure.

20. What do you think are the challenges with unstructured data?
Answer:
Unstructured data poses several challenges due to its complexity and lack of
format.
Key Challenges:
1. Difficult to Store and Organize – No predefined schema makes storage
non-standardized.
2. Complex to Analyze – Requires advanced tools like NLP, ML, and deep
learning.
3. High Storage Requirements – Includes large media files (images, videos).
4. Scalability Issues – Traditional databases aren’t suitable.
5. Security and Privacy – Harder to monitor and protect due to scattered
nature.

21. Is CSV structured data? Justify your answer.


Answer:
Yes, CSV (Comma Separated Values) is considered structured data.
Justification:
 It follows a clear tabular format with rows and columns.
 Headers represent attribute names and rows represent records.
 Easily readable by data processing tools like Excel, pandas, SQL engines.
 Supports direct querying and sorting, just like data in relational
databases.
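
A short sketch of why CSV counts as structured: the header row names the attributes, and each following line is one record. The CSV text below is hypothetical, and io.StringIO stands in for a file on disk.

```python
import csv
import io

# Hypothetical CSV text; io.StringIO stands in for a file on disk.
csv_text = """Employee_ID,Name,Department,Salary
101,Arjun,HR,30000
102,Priya,IT,45000
"""

# DictReader uses the header row as attribute names for every record.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# CSV stores every value as text, so numeric columns need explicit conversion.
it_staff = [row["Name"] for row in rows if row["Department"] == "IT"]
by_salary = sorted(rows, key=lambda row: int(row["Salary"]), reverse=True)

print(it_staff)
print([row["Name"] for row in by_salary])
```

Note that CSV itself carries no data types, which is one reason some authors treat it as only loosely structured.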

22. State a few examples of human-generated and machine-generated data.


Answer:
Human-Generated Data:
 Social media posts
 Emails and text messages
 Documents and presentations
Machine-Generated Data:
 Sensor data from IoT devices
 Server logs
 Surveillance camera footage
 Transaction records from automated systems
23. What is exploratory data analysis (EDA)? Explain.
Answer:
Exploratory Data Analysis (EDA) is the process of examining and summarizing
data sets to discover their main characteristics, patterns, and anomalies before
formal modeling.
Key Components:
 Descriptive Statistics: Mean, median, mode, standard deviation.
 Visualization: Charts like histograms, box plots, scatter plots to find
patterns.
 Missing Values and Outliers: Identification and handling.
 Correlation Analysis: Finding relationships between variables.
EDA is critical to understand the data, clean it, and form hypotheses.
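
The components listed above might look like this in pandas. The dataset is invented for illustration, and the 1.5 × IQR rule is just one common convention for flagging outliers, not the only option.

```python
import pandas as pd

# Hypothetical dataset with one missing value and one unusually large value.
df = pd.DataFrame({
    "hours_studied": [2, 4, 6, 8, 10, 12],
    "score": [35.0, 48.0, None, 70.0, 82.0, 150.0],
})

# Descriptive statistics: count, mean, std, quartiles, min, max.
summary = df.describe()

# Missing values per column.
missing = df.isna().sum()

# Outliers flagged with the 1.5 * IQR rule.
q1, q3 = df["score"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["score"] < q1 - 1.5 * iqr) | (df["score"] > q3 + 1.5 * iqr)]

# Pairwise correlation (rows with missing values are skipped).
correlation = df["hours_studied"].corr(df["score"])

print(summary, missing, outliers, correlation, sep="\n\n")
```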

24. Which of the following is true about data visualization?


Answer: d. All of the above
Explanation:
 It simplifies large volumes of data.
 Enhances comprehension of complex datasets.
 Provides a graphical representation of data.

25. Data visualization tools provide an accessible way to see and understand
which of the following in data?
Answer: d. All of the above
Explanation:
Visualization tools help users easily detect:
 Outliers – Unusual data points
 Trends – Directional movements
 Patterns – Repetitions and regularities
26. Which method shows hierarchical data in a nested format?
Answer: a. Treemaps
Explanation:
 Treemaps display hierarchical (tree-structured) data as nested
rectangles.
 Each branch is represented by a rectangle, and sub-branches as smaller
rectangles.

27. Which of the following is the importance of data visualization?


Answer: c. Both a & b
Explanation:
 Helps decision makers by revealing key insights.
 Directs user attention to important data points.

28. What does deleting grid lines in a table and the horizontal lines in a chart
do?
Answer: b. Increase data-ink ratio
Explanation:
 Data-Ink Ratio = Proportion of a graphic's total ink that is used to display actual data.
 Removing non-essential elements like grid lines improves focus on data.
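
Written as a formula (following the usual definition of the term; the wording of the numerator and denominator is a paraphrase):

```latex
\text{data-ink ratio} = \frac{\text{ink used to display the data}}{\text{total ink used to print the graphic}}
```

Deleting grid lines reduces the denominator while leaving the numerator unchanged, so the ratio increases.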

29. Which charts help in making comparisons between different data?


Answer: d. Both a & b (Bar charts and Column charts)
Explanation:
 Bar charts: Horizontal bars for comparison.
 Column charts: Vertical bars for comparison.
 Both are ideal for comparing quantities across categories.
30. ______ provides an approximation of the relationship between variables.
Answer: c. Trendline
Explanation:
 A trendline is used in scatter plots and line charts to show the general
direction of data.
 Useful in forecasting and understanding correlations.
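
A linear trendline is usually fitted by least squares. The sketch below computes the slope and intercept by hand on made-up data; the variable names and values are assumptions chosen for illustration.

```python
# Hypothetical (x, y) observations, e.g. month number vs. units sold.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope and intercept of the line y = slope * x + intercept.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

# Extend the fitted line one step beyond the data as a simple forecast.
forecast_at_6 = slope * 6 + intercept

print(round(slope, 2), round(intercept, 2), round(forecast_at_6, 2))
```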

31. What is the data visualization tool that updates in real time and gives
multiple outputs?
Answer: a. Dashboard
Explanation:
 A dashboard is a dynamic data visualization tool that aggregates real-
time data from multiple sources and displays it interactively.
 It can include charts, tables, maps, KPIs, etc.
 Dashboards are widely used in business analytics, operations, and
monitoring systems.
 Example tools: Tableau, Power BI, Google Data Studio.

32. What are the common types of data visualisation?


Answer: d. All of the above (Charts, Tables, Infographics)
Explanation:
 Charts: Bar, line, pie, scatter – ideal for trends, comparisons.
 Tables: Display structured data with rows/columns.
 Infographics: Visually rich summaries for storytelling and information
delivery.
33. What are the specific examples of methods to visualise data?
Answer: d. All of the above
Explanation:
 Area Chart: Shows cumulative total over time.
 Bubble Chart: Plots three dimensions of data using x, y, and bubble size.
 Histogram: Displays the distribution of a dataset using bins.

34. Gender of a person can be classified as ______ data.


Answer: d. Qualitative Nominal
Explanation:
 Gender is a categorical and nominal variable.
 It has no intrinsic order (Male, Female, Other).
 Cannot be measured numerically, only categorized.

35. The best match for Data Mining is ______.


Answer: a. Methods of Statistics, AI, ML, DBs
Explanation:
 Data Mining involves extracting patterns from large datasets using:
o Statistical techniques
o Machine Learning algorithms
o Artificial Intelligence
o Database systems

36. UIMA is an open-source platform from _____.


Answer: c. IBM
Explanation:
 UIMA (Unstructured Information Management Architecture) was
developed by IBM.
 It helps in analyzing large volumes of unstructured content (e.g., emails,
documents).

37. Which of the following does not come under structured data?
Answer: d. Email
Explanation:
 Emails contain free-form text, attachments, and variable structure, so
they are not structured data: the body is unstructured, while the headers and
metadata are semi-structured.
 Structured data includes organized databases like Oracle, Excel, etc.

38. Which of the following does not come under unstructured data?
Answer: a. XML
Explanation:
 XML is semi-structured – it has tags and a hierarchical format.
 WhatsApp chats and Facebook posts are unstructured.

39. Which of the following comes under semi-structured data?


Answer: d. XML
Explanation:
 XML uses tags to define data structures but doesn’t follow fixed schemas
like databases.
 It’s a classic example of semi-structured data.

40. The best match for JSON is ______.


Answer: b. REST
Explanation:
 JSON (JavaScript Object Notation) is commonly used in RESTful APIs for
data exchange.
 Lightweight and easy to parse compared to XML.

41. The best match for MongoDB is ______.


Answer: d. All of the above
Explanation:
 MongoDB stores data in JSON-like format (BSON), supports flexible
schema, and is based on document-based structure.

42. CSV stands for ______.


Answer: c. Comma Separated Values
Explanation:
 A plain text format where data is separated by commas.
 Widely used for storing structured tabular data.

43. In Python, ‘pandas’ is a ______.


Answer: c. Python library
Explanation:
 Pandas is a powerful open-source library used for data manipulation and
analysis.
 Provides structures like Series and DataFrame.

44. Which of the following will print all records on ‘df’?


Answer: d. print(df.to_string())
Explanation:
 df.to_string() displays the entire DataFrame, not just a preview (like
head()).
45. In Python, which of the following is used to set the column as the index in
DataFrame (df)?
Answer: c. index_col
Explanation:
 The index_col parameter is used in functions like read_csv() to set a
specific column as the index of the DataFrame.

46. In Python, which of the following is used to return a subset of the columns from the DataFrame (df)?
Answer: b. usecols
Explanation:
 The usecols parameter is used when reading a file (e.g., CSV) using
functions like read_csv() to load only specific columns.
 This is useful for optimizing memory and focusing only on relevant data.
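
A minimal sketch of index_col and usecols used together in read_csv(). The CSV text is hypothetical, and io.StringIO stands in for a file path.

```python
import io

import pandas as pd

# Hypothetical CSV text; io.StringIO stands in for a file path.
csv_text = """Employee_ID,Name,Department,Salary
101,Arjun,HR,30000
102,Priya,IT,45000
"""

# usecols keeps only the listed columns; index_col makes one of them the index.
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Employee_ID", "Name", "Salary"],
    index_col="Employee_ID",
)

print(df)
```

The column named in index_col has to be among the columns kept by usecols, otherwise pandas cannot find it.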

47. In Python, which of the following is used to remove the rows containing
empty values in cells in a DataFrame (df)?
Answer: b. df.dropna()
Explanation:
 df.dropna() removes any row with NaN (missing) values.
 Essential for data cleaning and preprocessing before analysis or model
training.

48. Often, some of the entries in a DataFrame contain “NaN”. NaN stands for
____.
Answer: a. Not a Number
Explanation:
 NaN is used to represent missing or undefined values in pandas.
 These are typically encountered in real-world datasets where not all
information is available.
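
The two answers above can be tried together in a short sketch. The values are illustrative only; None entries in a numeric column are stored as NaN.

```python
import pandas as pd

# Illustrative data: None in a numeric column is stored as NaN.
df = pd.DataFrame({
    "Duration": [50, 40, None, None, 90, 20],
    "Pulse": [109, 140, 110, 125, 138, 170],
})

missing_per_column = df.isna().sum()  # count NaN cells in each column
cleaned = df.dropna()                 # new DataFrame without rows holding NaN

print(missing_per_column)
print(len(df), len(cleaned))
```

dropna() returns a new DataFrame and leaves df unchanged unless inplace=True is passed.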

49. The “info()” function is used to print _____ of a DataFrame.


Answer: c. Concise Summary
Explanation:
 df.info() shows:
o Number of entries,
o Data types of each column,
o Non-null values,
o Memory usage.
It’s helpful for quickly understanding the dataset structure.

50. Which of the following functions is used to read the data from a “.txt” file
into DataFrame?
Answer: d. read_table()
Explanation:
 read_table() is used to load data from a .txt file into a pandas
DataFrame.
 Assumes tab-separated values by default, though the delimiter can be
customized.

51. A DataFrame is a ________ data structure.


Answer: b. Two dimensional
Explanation:
 A DataFrame in pandas is a 2D structure with rows and columns.
 It’s similar to a table in SQL or Excel.
52. Pie charts are not suitable for _____.
Answer: b. Data are of more than 6 categories
Explanation:
 Pie charts become cluttered and unreadable when there are too many
slices.
 Better alternatives for such cases: bar charts or treemaps.

53. To show correlation between numerical values, which of the following is better?
Answer: d. Scatter Plot
Explanation:
 Scatter plots show the relationship between two quantitative variables.
 Patterns like linear, non-linear, or no correlation can be visually
interpreted.

54. To show distribution of a single variable, which of the following is better?


Answer: b. Histogram
Explanation:
 Histograms display the frequency distribution of continuous data.
 Help in detecting skewness, modality, outliers, etc.

55. In Python, iloc[] method is used to display the ____ from a Series.
Answer: c. Values
Explanation:
 iloc[] is integer position-based and is used to access values at specific
positions in a Series or DataFrame, regardless of the index labels.
 Example: df.iloc[0] gives the first row.
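
A small sketch contrasting position-based iloc[] with label-based loc[]. The Series and its labels are hypothetical; string labels are used so that labels and positions visibly differ.

```python
import pandas as pd

# Hypothetical Series with string labels so labels and positions differ.
marks = pd.Series([72, 85, 91], index=["arjun", "priya", "meera"])

first_by_position = marks.iloc[0]   # integer position 0
last_by_position = marks.iloc[-1]   # negative positions count from the end
by_label = marks.loc["priya"]       # label-based lookup, for contrast
first_two = marks.iloc[0:2]         # slice end is exclusive, as in Python lists

print(first_by_position, last_by_position, by_label)
print(first_two)
```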
56. In Python, which of the following methods is to remove a column from a
DataFrame?
Answer: d. drop()
Explanation:
 df.drop('column_name', axis=1) removes the specified column.
 Commonly used in feature selection and data cleaning.

57. ‘Groupby’ method is not suitable for _____.


Answer: b. Filtering of data
Explanation:
 groupby() is primarily used for aggregation and group-based operations.
 Row-level filtering is better done with boolean indexing; groupby().filter() only keeps or drops whole groups.

58. After grouping data on ‘Region’ from a DataFrame, the first() method will
show _______.
Answer: b. The first entry from each group
Explanation:
 groupby('Region').first() returns the first non-null entry of each column
within each group defined by the grouping key.

59. If ‘df’ is a DataFrame having data columns ‘Segment’ and ‘Region’ among
other columns, the code line RegSeg = df.groupby(["Region", "Segment"]) is
an example of _____.
Answer: a. Nested group
Explanation:
 Grouping by multiple columns is called multi-level or nested grouping.
 It allows more granular aggregation and analysis.
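
A minimal sketch of nested grouping. The sales records are invented; only the column names Region and Segment and the variable name RegSeg come from the question.

```python
import pandas as pd

# Hypothetical sales records; Region and Segment follow the question above.
df = pd.DataFrame({
    "Region": ["East", "East", "West", "West", "East"],
    "Segment": ["Consumer", "Corporate", "Consumer", "Consumer", "Consumer"],
    "Sales": [100, 250, 80, 120, 60],
})

RegSeg = df.groupby(["Region", "Segment"])

# Aggregating a multi-column group yields a result indexed by (Region, Segment).
totals = RegSeg["Sales"].sum()

print(totals)
print(totals.loc[("East", "Consumer")])
```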
60. Which of the following is not true for the pandas dataframe.corr()
method?
Answer: d. None of the above
Explanation:
 All the listed statements are true:
o Computes pairwise correlation.
o Ignores non-numeric columns (by default before pandas 2.0; from 2.0 onward, pass numeric_only=True).
o Skips NaN values by default.

Continuing with Q61 to Q75 of Data Visualization (PEC-CSD601B):

61. Pandas DataFrame all() method returns ____ if all values in each
row/column are True except one.
Answer: c. False
Explanation:
 The all() method checks whether all values in a row or column are True.
 If even one value is False, it returns False.

62. Code Output:


import pandas as pd
data = [[True, False, True], [True, True, False]]
df = pd.DataFrame(data)
print(df.all(axis='index'))
Answer:
0 True
1 False
2 False
dtype: bool
Explanation:
 Axis = ‘index’ means checking column-wise.
 Column 0: All True → True
 Column 1: One False → False
 Column 2: One False → False

63. Code Output:


import pandas as pd
data = [[True, False, True], [True, True, False]]
df = pd.DataFrame(data)
print(df.all(axis='columns'))
Answer:
0 False
1 False
dtype: bool
Explanation:
 Checks row-wise for all True.
 Both rows have at least one False.

64. Code Output:


import pandas as pd
data = [[True, False, True], [True, True, False]]
df = pd.DataFrame(data)
print(df.any(axis='index'))
Answer:
0 True
1 True
2 True
dtype: bool
Explanation:
 Checks column-wise for any True.
 Every column has at least one True.

65. Code Output:


import pandas as pd
data = [[True, False, True], [True, True, False]]
df = pd.DataFrame(data)
print(df.any(axis='columns'))
Answer:
0 True
1 True
dtype: bool
Explanation:
 Each row has at least one True, so returns True for both.

66. Code Output:


import pandas as pd
data = {
"Age": [23, 26, 21, 22, 20, 19, 27],
"Qualified": [True, False, False, False, False, True, True]
}
df = pd.DataFrame(data)
newdf = df.sort_values(by='Age')
print(newdf)
Answer:
 Outputs the DataFrame sorted in ascending order of age:
Age Qualified
5 19 True
4 20 False
2 21 False
3 22 False
0 23 True
1 26 False
6 27 True

67. Code Output:


import pandas as pd
data = {
"Age": [23, 26, 21, 22, 20, 19, 27],
"Qualified": [True, False, False, False, False, True, True]
}
df = pd.DataFrame(data)
newdf = df.sort_values(by='Qualified')
print(newdf)
Answer:
 False < True, so rows with False appear first (pass kind='stable' to guarantee that tied rows keep their original order):
Age Qualified
1 26 False
2 21 False
3 22 False
4 20 False
0 23 True
5 19 True
6 27 True

68. Code Output:


import pandas as pd
data = {
"Duration": [50, 40, None, None, 90, 20],
"Pulse": [109, 140, 110, 125, 138, 170]
}
df = pd.DataFrame(data)
print(df.count())
Answer:
Duration 4
Pulse 6
dtype: int64
Explanation:
 count() skips None/NaN values.
 2 missing values in Duration → Count = 4

69. Group the data by “qualified” and display the first record/row of each
group
import pandas as pd
data = {
"Age": [23, 26, 21, 22, 20, 19, 27],
"Qualified": [True, False, False, False, False, True, True]
}
df = pd.DataFrame(data)
grouped = df.groupby('Qualified').first()
print(grouped)
Output:
Age
Qualified
False 26
True 23

70. Group the data by “qualified”. Count the number of records/entries in each group
grouped = df.groupby('Qualified').size()
print(grouped)
Output:
Qualified
False 4
True 3
dtype: int64

71. Display summary information for the “age” column


print(df['Age'].describe())
Output:
count 7.000000
mean 22.571429
std 2.992053
min 19.000000
25% 20.500000
50% 22.000000
75% 24.500000
max 27.000000
Name: Age, dtype: float64
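
The describe() statistics for the Age column can be checked by hand with the standard library. This is a sketch under one assumption worth stating: method="inclusive" is chosen because it interpolates linearly between sorted values, the same convention pandas uses for its default quantiles.

```python
import statistics

ages = [23, 26, 21, 22, 20, 19, 27]

mean = statistics.mean(ages)
std = statistics.stdev(ages)  # sample standard deviation (divides by n - 1)

# method="inclusive" interpolates linearly between the sorted values.
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")

print(round(mean, 6), round(std, 6), q1, median, q3)
```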

72. Correlation between Age and Weight of children


import pandas as pd
data = {
'Age': [7, 6, 8, 5, 6, 9],
'Weight': [12, 8, 12, 10, 11, 13]
}
df = pd.DataFrame(data)
correlation = df['Age'].corr(df['Weight'])
print("Correlation:", correlation)
Output:
Correlation: 0.7596 (rounded to four decimal places)
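
The Pearson coefficient that corr() returns by default can be reproduced step by step. This is a hand-rolled sketch for checking the arithmetic, not how pandas implements it internally.

```python
import math

ages = [7, 6, 8, 5, 6, 9]
weights = [12, 8, 12, 10, 11, 13]

n = len(ages)
mean_a = sum(ages) / n
mean_w = sum(weights) / n

# Sum of the products of deviations from each mean.
cross = sum((a - mean_a) * (w - mean_w) for a, w in zip(ages, weights))

# Square roots of the summed squared deviations of each variable.
spread_a = math.sqrt(sum((a - mean_a) ** 2 for a in ages))
spread_w = math.sqrt(sum((w - mean_w) ** 2 for w in weights))

# Pearson r: cross-deviation sum scaled by both spreads.
r = cross / (spread_a * spread_w)

print(round(r, 4))
```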

73. Which of the following is not a visualization tool in the Python library?
Answer: a. Numpy
Explanation:
 Numpy is for numerical computation, not visualization.
 matplotlib, seaborn, plotly are for visualization.
74. In the statement given below, what does the value 10 indicate?
fig = plt.figure(figsize=(10,8))
Answer: c. Width
Explanation:
 figsize=(width, height) → 10 is the width in inches.

75. In a Pie chart, the sum of the slices amounts to _____.


Answer: b. 100%
Explanation:
 Pie charts represent parts of a whole.
 All segments together always total 100%.

Continuing from Q76 to Q117 of Data Visualization (PEC-CSD601B):

76. Identify the following image.


(Since the image is not visible, we rely on the options.)
Answer: c. Scatter plot
Assuming the image is a plot showing points scattered along x and y axes.

77. Identify the following image.


Answer: b. Histogram
Assuming the image shows data distribution in bins along the x-axis.

78. Identify the following image.


Answer: d. Treemap
Treemaps show hierarchical data in nested rectangles.

79. Which of the following is not part of the “Plotly” main module?
Answer: b. Seaborn
Explanation:
 Seaborn is a separate visualization library, not a module in Plotly.

80. Which of the following does not visualise data?


Answer: c. Shapes
Explanation:
 Shapes are static drawing elements, not data visualizers like charts,
graphs, or maps.

81. Which of the following gives the statistical summary of the data?
Answer: d. Boxplot
Explanation:
 Boxplots show min, Q1, median, Q3, max, and outliers, giving a
summary.

82. Which library is the most used visualisation library in python?


Answer: c. matplotlib

83. Recommended way to load matplotlib library is:


Answer: a. import matplotlib.pyplot as plt

84. Which graph should be used if we want to show distribution of elements?


Answer: b. Histogram

85. Which graph should be used if we want to find patterns in data?


Answer: b. Scatter plot
86. In the box plot, data will be divided in how many parts?
Answer: c. 5
Explanation:
 A box plot is drawn from the five-number summary: Minimum, Q1, Median (Q2), Q3, Maximum.

87. General steps in the Data Science Pipeline:


Answer: a. Preparing data → Performing data analysis → Learning from data
→ Visualization → Obtaining insights

88. Command to install numpy:


Answer: b. pip install numpy

89. Which function can be used to read the ‘.txt’ file using pandas?
Answer: c. read_table

90. DataFrame in pandas is generally:


Answer: c. 2 Dimensional

91. Which feature does not match with python?


Answer: d. Compiled

92. What is the extension for the python file?


Answer: a. .py

93. Which one of these is not a Built-in Data Structure of python?


Answer: b. Structure

94. What is the correct way to set the boolean variable myVar to false?
Answer: Correct syntax:
myVar = False
None of the options were fully correct.

95. List in python is represented with:


Answer: c. [ ]

96. In DataFrame, by default new column is added as the ___ column


Answer: c. Last

97. DF.loc[ ] method is used to:


Answer: c. Both a & b
 Add or modify rows in a DataFrame.
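A short sketch of the DataFrame behaviours covered in these questions, assuming pandas is installed. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical two-row DataFrame
df = pd.DataFrame({"name": ["A", "B"], "score": [10, 20]})

df["grade"] = ["x", "y"]     # a new column is appended as the last column
df.loc[2] = ["C", 30, "z"]   # label 2 does not exist yet, so a row is added
df.loc[0, "score"] = 15      # label 0 exists, so the value is modified

print(list(df.columns))  # ['name', 'score', 'grade']
print(df.shape)          # (3, 3)
print(df.head(1))        # first row only
```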

98. Which of the following functions is used to create DataFrame?


Answer: a. DataFrame()

99. Which library is to be imported for creating DataFrame?


Answer: c. Pandas

100. A ____ is a two-dimensional labelled data structure.


Answer: a. DataFrame

101. Which of the following is used to give a user-defined column index in
DataFrame?


Answer: c. columns

102. In DataFrame, axis 0 is for:


Answer: a. Rows

103. To display the first row of dataframe ‘DF’:


Answer: d. All of the above

104. Write a python program to draw a boxplot using the data set
1,1,2,2,3,3,4,4,6,7,8,10, 11,14, 15,20,23
import matplotlib.pyplot as plt
data = [1,1,2,2,3,3,4,4,6,7,8,10,11,14,15,20,23]
plt.boxplot(data)
plt.title("Boxplot")
plt.show()

105. The data given below represents the number of visualisation books sold
at different shops in Durgapur. Draw a boxplot and mention the related
statistics. 42, 41, 43, 43,43, 45, 47, 48, 50, 50.
import matplotlib.pyplot as plt
data = [42, 41, 43, 43, 43, 45, 47, 48, 50, 50]
plt.boxplot(data)
plt.title("Books Sold - Boxplot")
plt.show()
Stats:
 Min = 41
 Q1 ≈ 43
 Median = 44
 Q3 ≈ 48
 Max = 50
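The statistics above can be checked with the standard library. This is a sketch under two common quartile conventions. The "inclusive" method interpolates linearly between data points, which should match the box Matplotlib draws by default. The second takes the medians of the lower and upper halves, as many textbooks do.

```python
import statistics

books = [42, 41, 43, 43, 43, 45, 47, 48, 50, 50]
ordered = sorted(books)

# Convention 1: linear interpolation between data points
q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")

# Convention 2: quartiles as the medians of the lower and upper halves
half = len(ordered) // 2
q1_halves = statistics.median(ordered[:half])
q3_halves = statistics.median(ordered[-half:])

print(min(ordered), q1, median, q3, max(ordered))  # 41 43.0 44.0 47.75 50
print(q1_halves, q3_halves)                        # 43 48
```

Both conventions give Q1 = 43 and median = 44. Q3 is 47.75 by interpolation and 48 by the half-median rule, which is why it is quoted as approximately 48.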
106. Suppose the “iris.csv” data set is available on your computer. Group
the data by “species” and plot a pie chart showing the count of each
species.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("iris.csv")
species_counts = df['species'].value_counts()
plt.pie(species_counts, labels=species_counts.index, autopct='%1.1f%%')
plt.title("Species Distribution")
plt.show()

107. Suppose the “iris.csv” data set is available on your computer. Plot a
scatter plot of the data, coloured by “species”, with “sepal_length” on the
x-axis and “sepal_width” on the y-axis.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("iris.csv")
sns.scatterplot(data=df, x='sepal_length', y='sepal_width', hue='species')
plt.show()

108. Write a python program to draw a Treemap using the data of your
choice.
import matplotlib.pyplot as plt
import squarify
data = [500, 250, 100, 75, 50]
labels = ["A", "B", "C", "D", "E"]
squarify.plot(sizes=data, label=labels)
plt.axis('off')
plt.title("Treemap Example")
plt.show()

109. Write a python program to draw a Histogram using the data of your
choice and 10 bins with interval 5.
import matplotlib.pyplot as plt
data = [12, 15, 21, 25, 27, 30, 35, 39, 40, 45, 50, 52, 55, 59]
plt.hist(data, bins=range(10, 61, 5))  # 11 edges: 10 bins, each 5 wide
plt.title("Histogram")
plt.show()
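To see what "10 bins with interval 5" means for this data, the sketch below counts the values per bin in plain Python. It assumes bin edges 10, 15, ..., 60 and follows the usual histogram convention that every bin is half-open except the last, which also includes its right edge. The helper name bin_counts is hypothetical.

```python
def bin_counts(values, edges):
    """Count how many values fall in each [edges[i], edges[i+1]) bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            # The last bin is closed on the right so the top edge is not lost
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts


data = [12, 15, 21, 25, 27, 30, 35, 39, 40, 45, 50, 52, 55, 59]
edges = list(range(10, 61, 5))  # 11 edges give 10 bins of width 5
counts = bin_counts(data, edges)

print(counts)       # [1, 1, 1, 2, 1, 2, 1, 1, 2, 2]
print(sum(counts))  # 14
```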

110. Suppose you are a system analyst in an airline company. You are asked
to provide a Line chart showing the year-wise variations of passengers in the
airline since 2015. Write a python code for that. (Assume suitable data).
import matplotlib.pyplot as plt
years = [2015, 2016, 2017, 2018, 2019, 2020]
passengers = [120, 150, 170, 160, 200, 180]
plt.plot(years, passengers, marker='o')
plt.title("Airline Passengers by Year")
plt.xlabel("Year")
plt.ylabel("Passengers (in thousands)")
plt.grid()
plt.show()
111. Consider the following code and draw the output graph.
import plotly.graph_objs as go
import math
xpoints = np.arange(0,math.pi*2, 0.05)
ypoints = np.sin(xpoints)
ypoints1 = np.cos(xpoints)
trace0 = go.Scatter(x=xpoints,y=ypoints, name='Sine')
trace1 = go.Scatter(x=xpoints,y=ypoints1, name='Cos')
data = [trace0, trace1]
layout = go.Layout(title="Sine Wave",xaxis = {'title':'Angle'}, yaxis={'title':'Sine'})
fig=go.Figure(data, layout)
fig.show()
Answer: The code as given needs import numpy as np before it will run. A
runnable version:
import numpy as np
import math
import plotly.graph_objs as go

xpoints = np.arange(0, math.pi*2, 0.05)
ypoints = np.sin(xpoints)
ypoints1 = np.cos(xpoints)

trace0 = go.Scatter(x=xpoints, y=ypoints, name='Sine')
trace1 = go.Scatter(x=xpoints, y=ypoints1, name='Cos')
data = [trace0, trace1]
# The question's code titles the y-axis 'Sine'; 'Value' fits both traces
layout = go.Layout(title="Sine Wave", xaxis={'title':'Angle'},
yaxis={'title':'Value'})
fig = go.Figure(data, layout)
fig.show()
Output: two curves over the angle range 0 to 2π, one sine trace named
'Sine' and one cosine trace named 'Cos', under the title "Sine Wave" with
the x-axis labelled 'Angle'.

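112. Plot Scatter for Sales vs Profit:
import plotly.graph_objs as go
# Seg_sales and Seg_profit are assumed to be DataFrames prepared earlier,
# holding the Sales and Profit columns respectively
trace = go.Scatter(x=Seg_sales.Sales, y=Seg_profit.Profit, mode='markers')
layout = go.Layout(title="Scatter Plot", xaxis={'title':'Sales'},
yaxis={'title':'Profit'})
fig = go.Figure(data=[trace], layout=layout)
fig.show()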

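113. Consider the following code and draw the output graph.
import plotly.express as px
Seg_class=Sup_store.groupby(['Segment','Category'],as_index=False).agg({'Sales':'sum','Profit':'sum'} )
fig = px.scatter(Seg_class, x='Sales',y='Profit',color='Segment')
fig.show()
Answer:
import plotly.express as px
# Sup_store is assumed to be the Superstore DataFrame loaded as in Q115
Seg_class = Sup_store.groupby(['Segment', 'Category'], as_index=False).agg(
    {'Sales': 'sum', 'Profit': 'sum'})
fig = px.scatter(Seg_class, x='Sales', y='Profit', color='Segment')
fig.show()
Output: a scatter plot with one marker per Segment and Category pair,
placed at its total Sales (x) and total Profit (y) and coloured by Segment.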

114. Consider the following code and draw the output graph.
import plotly.graph_objs as go
branches = ['CS','CSE','DS','AIML','ECE']
fy=[63,127,58,29,124]
sy=[68,148,63,33,123]
ty=[65,144,61,30,121]
trace1=go.Bar(x=branches,y=fy,name='First Year')
trace2=go.Bar(x=branches,y=sy,name='Second Year')
trace3=go.Bar(x=branches,y=ty,name='Third Year')
data=[trace1,trace2,trace3]
layout=go.Layout(barmode='group')
fig=go.Figure(data,layout)
fig.show()
Answer:
import plotly.graph_objs as go
branches = ['CS', 'CSE', 'DS', 'AIML', 'ECE']
fy = [63, 127, 58, 29, 124]
sy = [68, 148, 63, 33, 123]
ty = [65, 144, 61, 30, 121]
trace1 = go.Bar(x=branches, y=fy, name='First Year')
trace2 = go.Bar(x=branches, y=sy, name='Second Year')
trace3 = go.Bar(x=branches, y=ty, name='Third Year')
layout = go.Layout(barmode='group')
fig = go.Figure(data=[trace1, trace2, trace3], layout=layout)
fig.show()
Output: a grouped bar chart with the five branches on the x-axis and three
side-by-side bars per branch (First Year, Second Year, Third Year).

115. Histogram + Line chart from Superstore and Flights:


import pandas as pd
import matplotlib.pyplot as plt

# Histogram
Sup_store = pd.read_excel("Superstore.xlsx")
furniture = Sup_store['Category'] == 'Furniture'
region_sales = Sup_store[furniture]['Sales']
plt.hist(region_sales, bins=range(10, 2001, 100))
plt.title("Region-wise Sales")
plt.legend(['Bar'])
plt.show()

# Line chart
df = pd.read_csv("flights.csv")
plt.plot(df['year'], df['passengers'])
plt.title("Passengers by Year")
plt.legend(['Line'])
plt.show()

116. Grouped bar chart using Plotly and Matplotlib:
import plotly.graph_objs as go
# branches, fy, sy and ty are the lists given in Q114
trace1 = go.Bar(x=branches, y=fy, name='First Year')
trace2 = go.Bar(x=branches, y=sy, name='Second Year')
trace3 = go.Bar(x=branches, y=ty, name='Third Year')
layout = go.Layout(barmode='group')
fig = go.Figure(data=[trace1, trace2, trace3], layout=layout)
fig.show()
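The code above covers only Plotly. For the Matplotlib half of the question, one possible sketch is shown below. Matplotlib has no barmode option, so each series is shifted sideways by a fraction of the bar width. The helper names bar_positions and plot_grouped_bars and the width of 0.25 are assumptions, not part of the original answer.

```python
branches = ['CS', 'CSE', 'DS', 'AIML', 'ECE']
series = {
    'First Year': [63, 127, 58, 29, 124],
    'Second Year': [68, 148, 63, 33, 123],
    'Third Year': [65, 144, 61, 30, 121],
}


def bar_positions(n_groups, n_series, width):
    """x positions for every bar: one list per series, centred on 0..n_groups-1."""
    start = -(n_series - 1) * width / 2
    return [[g + start + s * width for g in range(n_groups)]
            for s in range(n_series)]


def plot_grouped_bars(width=0.25):
    """Draw the grouped bar chart; Matplotlib is imported only when called."""
    import matplotlib.pyplot as plt

    positions = bar_positions(len(branches), len(series), width)
    for xs, (name, values) in zip(positions, series.items()):
        plt.bar(xs, values, width=width, label=name)
    plt.xticks(range(len(branches)), branches)  # one tick per branch
    plt.legend()
    plt.show()
```

Calling plot_grouped_bars() draws three adjacent bars for each branch, with the middle bar centred on the branch's tick.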

117. Read OlympicAthletes.xlsx and plot bar chart of total medals (≥200):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel("OlympicAthletes.xlsx")
grouped = df.groupby('Country')['Total Medals'].sum().reset_index()
filtered = grouped[grouped['Total Medals'] >= 200].sort_values(
    by='Total Medals', ascending=False)

plt.bar(filtered['Country'], filtered['Total Medals'])


plt.xticks(rotation=45)
plt.title("Total Medals by Country (>=200)")
plt.ylabel("Medals")
plt.show()
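The group, sum, filter and sort steps in the answer above can be traced in plain Python. The records below are hypothetical stand-ins for the spreadsheet rows, and the helper name totals_at_least is an assumption.

```python
from collections import defaultdict

# Hypothetical (country, medals) records, several rows per country
rows = [
    ("Alpha", 150), ("Beta", 90), ("Alpha", 80), ("Gamma", 210), ("Beta", 60),
]


def totals_at_least(records, threshold):
    """Sum medals per country, keep totals >= threshold, sort descending."""
    totals = defaultdict(int)
    for country, medals in records:
        totals[country] += medals
    kept = [(c, t) for c, t in totals.items() if t >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


result = totals_at_least(rows, 200)
print(result)  # [('Alpha', 230), ('Gamma', 210)]
```

Beta totals 150 and is dropped, which mirrors the >= 200 filter applied before plotting.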
