
SRMTRP Engineering College

Irungalur, Trichy 621 105


Department of Computer Science and Engineering
CS3352-Foundations of Data Science
Model Examination 2024
13 Marks
UNIT -I
1. Explain the data science process.
The Data Science Process: Six Key Stages
The data science process involves six interconnected stages, each playing a
crucial role in transforming raw data into actionable insights. Detailed
descriptions of each stage are given below:
1. Problem Definition
Purpose: Clearly define the business problem or objective.
Description:
Understand the specific goals of the project and the questions to be answered.
Identify the stakeholders and their needs.
Determine the type of data and potential outcomes required.
Example: Predict which customers are likely to churn based on transaction
history.
Output: A well-defined problem statement.
2. Data Collection
Purpose: Gather the data needed to solve the problem.
Description:
Collect data from various sources such as databases, APIs, sensors, or web
scraping.
Ensure the data is relevant, accurate, and sufficient for analysis.
Example: Extract customer demographics, transaction logs, and subscription
details.
Output: Raw datasets.
3. Data Preparation (Cleaning and Exploration)
Purpose: Prepare and explore the data for analysis.
Description:
Handle missing values, outliers, and inconsistencies.
Transform and normalize data into a structured format.
Conduct exploratory data analysis (EDA) to uncover patterns and relationships.
Example: Remove duplicates, fill missing values with the mean, and analyze
trends in purchase frequency.
Output: A clean and ready-to-use dataset.
4. Data Modeling
Purpose: Develop models to solve the problem using data.
Description:
Select suitable machine learning or statistical algorithms.
Train models on historical data, validate performance, and fine-tune
hyperparameters.
Split data into training, validation, and test sets.
Example: Train a logistic regression model to classify customers as "churn" or
"retain."
Output: A well-trained model.
5. Evaluation
Purpose: Assess the model's performance.
Description:
Use metrics such as accuracy, precision, recall, or RMSE to evaluate the model.
Ensure the model generalizes well on unseen data and aligns with business
objectives.
Example: Evaluate the churn prediction model with ROC-AUC and F1-score.
Output: A validated and reliable model.
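To make stages 4 and 5 concrete, a minimal scikit-learn sketch for the churn example is shown below. It is only an illustration: the feature matrix and churn labels are synthetic placeholders standing in for a real prepared dataset.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score

# Synthetic "prepared" data: two features (e.g., tenure, monthly spend) and a churn label
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Stage 4: split the data and train a logistic regression classifier ("churn" vs "retain")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)

# Stage 5: evaluate on unseen data with ROC-AUC and F1-score
proba = model.predict_proba(X_test)[:, 1]
pred = model.predict(X_test)
print("ROC-AUC :", roc_auc_score(y_test, proba))
print("F1-score:", f1_score(y_test, pred))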
6. Deployment and Monitoring
Purpose: Make the model available for use and ensure its continuous
performance.
Description:
Deploy the model in production systems using APIs, dashboards, or applications.
Monitor its performance regularly and retrain as necessary.
Example: Integrate the churn prediction model into a CRM system for real-time
use.
Output: A deployed model with ongoing monitoring and maintenance.
2. Elaborate on Data Mining and Data Warehousing
DATA MINING
Data mining is the process of discovering actionable information from large sets
of data. Data mining uses mathematical analysis to derive patterns and trends
that exist in data. Typically, these patterns cannot be discovered by traditional
data exploration because the relationships are too complex or because there is too
much data.
These patterns and trends can be collected and defined as a data mining model.
Mining models can be applied to specific scenarios, such as:
• Forecasting: Estimating sales, predicting server loads or server downtime
• Risk and probability: Choosing the best customers for targeted mailings,
determining the probable break-even point for risk scenarios, assigning
probabilities to diagnoses or other outcomes
• Recommendations: Determining which products are likely to be sold
together, generating recommendations
• Finding sequences: Analyzing customer selections in a shopping cart,
predicting next likely events
• Grouping: Separating customers or events into clusters of related items,
analyzing and predicting affinities
Building a mining model is part of a larger process that includes everything from
asking questions about the data and creating a model to answer those questions,
to deploying the model into a working environment. This process can be defined
by using the following six basic steps:
1. Defining the Problem
2. Preparing Data
3. Exploring Data
4. Building Models
5. Exploring and Validating Models
6. Deploying and Updating Models

The relationships between each step in the process, and the technologies in
Microsoft SQL Server that you can use to complete each step, are described
below.
Defining the Problem
The first step in the data mining process is to clearly define the problem, and
consider ways that data can be utilized to provide an answer to the problem.
This step includes analyzing business requirements, defining the scope of the
problem, defining the metrics by which the model will be evaluated, and defining
specific objectives for the data mining project. These tasks translate into
questions such as the following:
• What are you looking for? What types of relationships are you trying to
find?
• Does the problem you are trying to solve reflect the policies or processes of
the business?
• Do you want to make predictions from the data mining model, or just look
for interesting patterns and associations?
• Which outcome or attribute do you want to try to predict?
• What kind of data do you have and what kind of information is in each
column? If there are multiple tables, how are the tables related? Do you need to
perform any cleansing, aggregation, or processing to make the data usable?
• How is the data distributed? Is the data seasonal? Does the data
accurately represent the processes of the business?
Preparing Data
• The second step in the data mining process is to consolidate and clean the
data that was identified in the Defining the Problem step.
• Data can be scattered across a company and stored in different formats, or
may contain inconsistencies such as incorrect or missing entries.
• Data cleaning is not just about removing bad data or interpolating
missing values, but about finding hidden correlations in the data, identifying
the sources of data that are the most accurate, and determining which columns
are the most appropriate for use in analysis.
Exploring Data
Exploration techniques include calculating the minimum and maximum values,
calculating mean and standard deviations, and looking at the distribution of the
data. For example, you might determine by reviewing the maximum, minimum,
and mean values that the data is not representative of your customers or
business processes, and that you therefore must obtain more balanced data or
review the assumptions that are the basis for your expectations. Standard
deviations and other distribution values can provide useful information about the
stability and accuracy of the results.
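For illustration, a minimal pandas sketch of this kind of exploration is shown below; the file name customers.csv and the column monthly_spend are hypothetical.

import pandas as pd

# Load the consolidated data (hypothetical file and column names)
df = pd.read_csv("customers.csv")

# Minimum, maximum, mean, standard deviation and quartiles for each numeric column
print(df.describe())

# Distribution of a single column, e.g. monthly spend, split into 10 bins
print(df["monthly_spend"].value_counts(bins=10))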
Building Models
The mining structure is linked to the source of data, but does not actually contain
any data until you process it. When you process the mining structure, SQL
Server Analysis Services generates aggregates and other statistical information
that can be used for analysis. This information can be used by any mining model
that is based on the structure.
Exploring and Validating Models
Before you deploy a model into a production environment, you will want to test
how well the model performs. Also, when you build a model, you typically create
multiple models with different configurations and test all models to see which
yields the best results for your problem and your data.
Deploying and Updating Models
After the mining models exist in a production environment, you can perform
many tasks, depending on your needs. The following are some of the tasks you
can perform:
• Use the models to create predictions, which you can then use to make
business decisions.
• Create content queries to retrieve statistics, rules, or formulas from the
model.
• Embed data mining functionality directly into an application. You can
include Analysis Management Objects (AMO), which contains a set of objects
that your application can use to create, alter, process, and delete mining
structures and mining models.
• Use Integration Services to create a package in which a mining model is
used to intelligently separate incoming data into multiple tables.
• Create a report that lets users directly query against an existing mining
model
• Update the models after review and analysis. Any update requires that
you reprocess the models.
• Update the models dynamically as more data comes into the organization;
making constant changes to improve the effectiveness of the solution should
be part of the deployment strategy.
DATA WAREHOUSING
Data warehousing is the process of constructing and using a data warehouse. A
data warehouse is constructed by integrating data from multiple heterogeneous
sources that support analytical reporting, structured and/or ad hoc queries, and
decision making. Data warehousing involves data cleaning, data integration, and
data consolidations.
Characteristics of data warehouse
The main characteristics of a data warehouse are as follows:
• Subject-Oriented
A data warehouse is subject-oriented since it provides topic-wise information
rather than covering the overall processes of a business. Such subjects may be
sales, promotion, inventory, etc.
• Integrated
A data warehouse is developed by integrating data from varied sources into a
consistent format. The data must be stored in the warehouse in a consistent and
universally acceptable manner in terms of naming, format, and coding. This
facilitates effective data analysis.
• Non-Volatile
Data once entered into a data warehouse must remain unchanged. All data is
read-only. Previous data is not erased when current data is entered. This helps
you to analyze what has happened and when.
• Time-Variant
The data stored in a data warehouse is documented with an element of time,
either explicitly or implicitly. An example of time variance in Data Warehouse is
exhibited in the Primary Key, which must have an element of time like the day,
week, or month.

Database vs. Data Warehouse

Although a data warehouse and a traditional database share some similarities,
they serve different purposes. The main difference is that in a database, data
is collected for multiple transactional purposes, whereas in a data warehouse,
data is collected on an extensive scale to perform analytics. Databases provide
real-time data, while warehouses store data to be accessed for big analytical
queries.

Data Warehouse Architecture


Usually, data warehouse architecture comprises a three-tier structure.
Bottom Tier
The bottom tier or data warehouse server usually represents a relational
database system. Back-end tools are used to cleanse, transform and feed data
into this layer.
Middle Tier
The middle tier represents an OLAP server that can be implemented in two
ways.
The ROLAP or Relational OLAP model is an extended relational database
management system that maps multidimensional data process to standard
relational process.
The MOLAP or multidimensional OLAP directly acts on multidimensional data
and operations.
Top Tier
This is the front-end client interface that gets data out from the data warehouse.
It holds various tools like query tools, analysis tools, reporting tools, and data
mining tools.

How Data Warehouse Works

Data warehousing integrates data and information collected from various
sources into one comprehensive database. For example, a data warehouse might
combine customer information from an organization's point-of-sale systems, its
mailing lists, website, and comment cards. It might also incorporate
confidential information about employees, salary information, etc. Businesses
use such components of a data warehouse to analyze customers.

Data mining is one of the features of a data warehouse that involves looking for
meaningful data patterns in vast volumes of data and devising innovative
strategies for increased sales and profits.

Types of Data Warehouse


There are three main types of data warehouse.

Enterprise Data Warehouse (EDW)


This type of warehouse serves as a key or central database that facilitates
decision-support services throughout the enterprise. The advantage to this type
of warehouse is that it provides access to cross-organizational information, offers
a unified approach to data representation, and allows running complex queries.

Operational Data Store (ODS)


This type of data warehouse refreshes in real-time. It is often preferred for
routine activities like storing employee records. It is required when data
warehouse systems do not support reporting needs of the business.
Data Mart
A data mart is a subset of a data warehouse built to maintain a particular
department, region, or business unit. Every department of a business has a
central repository or data mart to store data. The data from the data mart is
stored in the ODS periodically. The ODS then sends the data to the EDW, where
it is stored and used.
3. Facets of data
In data science and big data you’ll come across many different types of
data, and each of them tends to require different tools and techniques. The main
categories of data are these:
• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming
Let’s explore all these interesting data types.
Structured data
• Structured data is data that depends on a data model and resides in a
fixed field within a record. As such, it’s often easy to store structured data in
tables within databases or Excel files
• SQL, or Structured Query Language, is the preferred way to manage and
query data that resides in databases.

Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the
content is context-specific or varying. One example of unstructured data is your
regular email
Natural language
• Natural language is a special type of unstructured data; it’s challenging to
process because it requires knowledge of specific data science techniques and
linguistics.
• The natural language processing community has had success in entity
recognition, topic recognition, summarization, text completion, and sentiment
analysis, but models trained in one domain don’t generalize well to other
domains.
• Even state-of-the-art techniques aren’t able to decipher the meaning of
every piece of text.
Machine-generated data
• Machine-generated data is information that’s automatically created by a
computer, process, application, or other machine without human intervention.
• Machine-generated data is becoming a major data resource and will
continue to do so.
• The analysis of machine data relies on highly scalable tools, due to its
high volume and speed. Examples of machine data are web server logs, call detail
records, network event logs, and telemetry.
Graph-based or network data
• “Graph data” can be a confusing term because any data can be shown in a
graph.
• Graph or network data is, in short, data that focuses on the relationship
or adjacency of objects.
• The graph structures use nodes, edges, and properties to represent and
store graph data.
• Graph-based data is a natural way to represent social networks, and its
structure allows you to calculate specific metrics such as the influence of a
person and the shortest path between two people.

Audio, image, and video


• Audio, image, and video are data types that pose specific challenges to a
data scientist.
• Tasks that are trivial for humans, such as recognizing objects in pictures,
turn out to be challenging for computers.
• MLBAM (Major League Baseball Advanced Media) announced in 2014
that they’ll increase video capture to approximately 7 TB per game for the
purpose of live, in-game analytics.
• Recently a company called DeepMind succeeded at creating an algorithm
that’s capable of learning how to play video games.
• This algorithm takes the video screen as input and learns to interpret
everything via a complex process of deep learning.
Streaming data
• The data flows into the system when an event happens instead of being
loaded into a data store in a batch.
• Examples are the “What’s trending” on Twitter, live sporting or music
events, and the stock market.

UNIT –II
1. Explain and evaluate the describing Data with Tables and Graphs.
Construct the Frequency distribution for grouped data, Relative
frequency, Cumulative frequency, Histogram and Frequency polygon.
These are the numbers of newspapers sold at a local shop over the last 10
days: 22, 20, 18, 23, 20, 25, 22, 20, 18, 20.

Explanation and Evaluation: Describing Data with Tables and Graphs


Using tables and graphs to describe data allows for both numerical
and visual summaries. Tables, such as frequency distributions, provide
exact counts or proportions. Graphs, like histograms and frequency
polygons, give a visual representation that highlights trends and
distributions.

Constructing Frequency Distribution for Grouped Data

We organize the data into intervals (groups) to summarize the distribution.

Data:
Numbers of newspapers sold:
22, 20, 18, 23, 20, 25, 22, 20, 18, 20
Steps:
Identify the range:
Range=Max−Min=25−18=7
Choose class intervals: 3 intervals of size 3, e.g., 18-20, 21-23, 24-26.
Count occurrences in each interval, then compute the relative frequency
(f/n, with n = 10) and the cumulative frequency.

Class Interval   Frequency (f)   Relative Frequency   Cumulative Frequency
18-20                 6               0.6                     6
21-23                 3               0.3                     9
24-26                 1               0.1                    10
Histogram
The histogram represents the frequencies using bars. Each bar height
corresponds to the frequency of a class interval.
Histogram: Each bar shows the frequency of the class intervals. It helps
visualize the distribution of the data.
Frequency Polygon: Points connected by lines represent the frequency
of each class's midpoint, showing the data distribution's shape.
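A minimal Matplotlib sketch that draws both the histogram and the frequency polygon from the frequency table above:

import matplotlib.pyplot as plt

# Frequency distribution from the table above
midpoints = [19, 22, 25]      # midpoints of 18-20, 21-23, 24-26
frequencies = [6, 3, 1]

# Histogram: one bar per class interval
plt.bar(midpoints, frequencies, width=3, edgecolor='black', alpha=0.6, label='Histogram')

# Frequency polygon: frequencies plotted at the class midpoints and joined by lines
plt.plot(midpoints, frequencies, marker='o', color='red', label='Frequency polygon')

plt.xlabel('Newspapers sold (class midpoint)')
plt.ylabel('Frequency')
plt.title('Newspapers Sold over 10 Days')
plt.legend()
plt.show()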

2. Explain and Evaluate the Normal distribution and Z Score with formulae and
Solve: For some computers, the time period between charges of the battery is
normally distributed with a mean of 50 hours and a standard deviation of 15
hours. Rohan has one of these computers and needs to know the probability
a) Time period higher than 80 hours
b) Time Period below 35 Hours
c) Time period will be between 50 and 70 hours.
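Solution sketch: using the standardization formula z = (x − μ) / σ with μ = 50 and σ = 15, the probabilities follow from the standard normal table. A minimal check with scipy.stats is shown below (approximate values in the comments).

from scipy.stats import norm

mu, sigma = 50, 15   # mean and standard deviation of battery life (hours)

# a) P(X > 80): z = (80 - 50) / 15 = 2.0  -> about 0.0228
print("P(X > 80)    =", 1 - norm.cdf(80, mu, sigma))

# b) P(X < 35): z = (35 - 50) / 15 = -1.0 -> about 0.1587
print("P(X < 35)    =", norm.cdf(35, mu, sigma))

# c) P(50 < X < 70): z between 0 and 1.33 -> about 0.4088
print("P(50 < X < 70) =", norm.cdf(70, mu, sigma) - norm.cdf(50, mu, sigma))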
3. Explain and evaluate the Describing Data with Averages (measures of
central tendency with formulae) and various measures of variability. And
also find the mean, mode, median, variance and standard deviation for the
following Data Science CIA2 marks 76,87,34,12,89,95,67.
Evaluation:
The mean provides a central value but is sensitive to outliers (e.g.,12 in this case).
The median is robust to outliers and better represents the center for skewed data.
The standard deviation shows a high variability in marks, reflected in the wide range of
scores.
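A minimal NumPy sketch that produces the values behind this evaluation (NumPy's default population formulas are used; pass ddof=1 for the sample versions):

import numpy as np
from collections import Counter

marks = np.array([76, 87, 34, 12, 89, 95, 67])

mean = np.mean(marks)        # 460 / 7 ~ 65.71
median = np.median(marks)    # middle of sorted 12,34,67,76,87,89,95 -> 76
counts = Counter(marks.tolist())
# Every mark occurs exactly once, so this data set has no unique mode
variance = np.var(marks)     # population variance ~ 835.9
std_dev = np.std(marks)      # population standard deviation ~ 28.9

print("Mean:", mean)
print("Median:", median)
print("Counts:", counts)
print("Variance:", variance)
print("Standard deviation:", std_dev)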

UNIT –III
1. Discuss the significance of regression, regression line and least squares
regression equation and solve: The details about technicians’ experience in a
company (in several years) and their performance rating are in the table below.
Using these values, estimate the performance rating for a technician with 21
years of experience.
Experience 16 18 4 10 12
Rating 87 89 68 80 83
Solution
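The full worked solution is not shown above; a minimal NumPy sketch of the least-squares estimate for this data follows (approximate results in the comments):

import numpy as np

experience = np.array([16, 18, 4, 10, 12])
rating = np.array([87, 89, 68, 80, 83])

# Least-squares slope and intercept: b = Sxy / Sxx, a = y_bar - b * x_bar
x_bar, y_bar = experience.mean(), rating.mean()   # 12.0 and 81.4
b = np.sum((experience - x_bar) * (rating - y_bar)) / np.sum((experience - x_bar) ** 2)
a = y_bar - b * x_bar

print("Slope b      :", b)            # ~ 1.48
print("Intercept a  :", a)            # ~ 63.6
print("Rating at 21 :", a + b * 21)   # ~ 94.8, i.e. roughly 95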
2. Elaborate in detail the significance of Correlation and the various types of
correlation. Calculate and analyse the Correlation Coefficient (r) between the
study on feelings of stress and life satisfaction. Participants completed a measure
on how stressed they were feeling and Draw Scatter plot.
Stress Score 11 25 10 7 6
Life Satisfaction 7 1 6 9 8
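A minimal sketch of the Pearson correlation calculation and the scatter plot for this data (approximate result in the comment):

import numpy as np
import matplotlib.pyplot as plt

stress = np.array([11, 25, 10, 7, 6])
satisfaction = np.array([7, 1, 6, 9, 8])

# Pearson correlation coefficient r = Sxy / sqrt(Sxx * Syy)
r = np.corrcoef(stress, satisfaction)[0, 1]
print("r =", r)   # ~ -0.97: a strong negative correlation

# Scatter plot of the two variables
plt.scatter(stress, satisfaction, color='purple')
plt.xlabel('Stress score')
plt.ylabel('Life satisfaction')
plt.title('Stress vs Life Satisfaction')
plt.show()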
3. What is Regression? Build regression line and least squares regression
line with suitable examples.
Regression is a statistical technique used to identify the relationship
between a dependent variable and one or more independent variables. It helps in
predicting the value of the dependent variable based on the known values of the
independent variables. One of the most common types of regression is linear
regression, where the relationship between the variables is represented by a
straight line.

Building a Regression Line


Example: Simple Linear Regression

Let's consider a simple example where we want to predict the height of a person
based on their age.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Sample data
ages = np.array([5, 10, 15, 20, 25, 30, 35]).reshape(-1, 1)
heights = np.array([110, 120, 130, 140, 150, 160, 170])

# Create a linear regression model
model = LinearRegression()
model.fit(ages, heights)

# Predict heights
predicted_heights = model.predict(ages)

# Plotting the data and regression line
plt.scatter(ages, heights, color='blue', label='Actual Heights')
plt.plot(ages, predicted_heights, color='red', label='Regression Line')
plt.xlabel('Age')
plt.ylabel('Height')
plt.title('Height vs Age')
plt.legend()
plt.grid(True)
plt.show()

Least Squares Regression Line


The Least Squares Regression Line is the line that minimizes the sum of the
squared differences between the observed values and the values predicted by the
line. The formula for the least squares line in simple linear regression is:

y = mx + b
where:

m is the slope of the line, m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²

b is the y-intercept, b = ȳ − m·x̄

Example: Least Squares Regression Calculation


Using the same example data:

import numpy as np
import matplotlib.pyplot as plt

# Sample data
ages = np.array([5, 10, 15, 20, 25, 30, 35])
heights = np.array([110, 120, 130, 140, 150, 160, 170])

# Mean of ages and heights
mean_age = np.mean(ages)
mean_height = np.mean(heights)

# Calculating the slope (m) and intercept (b)
m = np.sum((ages - mean_age) * (heights - mean_height)) / np.sum((ages - mean_age) ** 2)
b = mean_height - m * mean_age

# Regression line formula
regression_line = m * ages + b

# Output the slope and intercept
print(f"Slope (m): {m}")
print(f"Intercept (b): {b}")

# Plotting the data and regression line
plt.scatter(ages, heights, color='blue', label='Actual Heights')
plt.plot(ages, regression_line, color='red', label='Least Squares Regression Line')
plt.xlabel('Age')
plt.ylabel('Height')
plt.title('Height vs Age')
plt.legend()
plt.grid(True)
plt.show()

Output: Slope (m): 2.0, Intercept (b): 100.0


UNIT –IV
1. Discuss the Combining Datasets operations (Concat, Append, Merge and Join) in
Pandas with suitable examples.
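Since the answer for this question is not written out above, a minimal sketch on two small, made-up DataFrames is given below; note that DataFrame.append is deprecated in recent pandas releases, so pd.concat is shown in its place.

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'name': ['Asha', 'Ravi']})
df2 = pd.DataFrame({'id': [3, 4], 'name': ['Mani', 'Kavi']})
marks = pd.DataFrame({'id': [1, 2, 3], 'mark': [85, 72, 90]})

# Concat: stack the rows of df1 and df2 (DataFrame.append did the same, but is deprecated)
combined = pd.concat([df1, df2], ignore_index=True)

# Merge: SQL-style join of combined with marks on the 'id' column
merged = pd.merge(combined, marks, on='id', how='left')

# Join: index-based join; here marks (indexed by id) is joined onto combined
joined = combined.set_index('id').join(marks.set_index('id'))

print(combined)
print(merged)
print(joined)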
2. Consider that, an E-Commerce organization like Amazon, have different
regions sales as NorthSales, SouthSales, WestSales, EastSales.csv files.
They want to combine North and West region sales and South and East
sales to find the aggregate sales of these collaborating regions Help them
to do so using Python code.
Soln:
To achieve this, we can use Python's Pandas library to read, combine, and
aggregate the sales data from the CSV files. Here’s the step-by-step
approach:
1. Read the CSV files using pandas.read_csv.
2. Combine the sales data for:
o NorthSales and WestSales.
o SouthSales and EastSales.
3. Calculate the aggregate sales for each combination.
4. Save the results into new CSV files or display them.
Python Code:
import pandas as pd

# Read CSV files for different regions
north_sales = pd.read_csv("NorthSales.csv")
west_sales = pd.read_csv("WestSales.csv")
south_sales = pd.read_csv("SouthSales.csv")
east_sales = pd.read_csv("EastSales.csv")

# Combine North and West region sales
north_west_sales = pd.concat([north_sales, west_sales])

# Combine South and East region sales
south_east_sales = pd.concat([south_sales, east_sales])

# Aggregate sales for North and West region
north_west_aggregate = north_west_sales.groupby('Product').agg({'Sales': 'sum'}).reset_index()

# Aggregate sales for South and East region
south_east_aggregate = south_east_sales.groupby('Product').agg({'Sales': 'sum'}).reset_index()

# Save the results into new CSV files
north_west_aggregate.to_csv("NorthWest_AggregateSales.csv", index=False)
south_east_aggregate.to_csv("SouthEast_AggregateSales.csv", index=False)

# Print the results
print("North and West Aggregate Sales:")
print(north_west_aggregate)

print("\nSouth and East Aggregate Sales:")
print(south_east_aggregate)

Explanation:
1. Reading CSV Files:
o We assume that the CSV files have at least two
columns: Product and Sales.
2. Combining Regions:
o pd.concat is used to combine rows from the North
and West sales data, and similarly for South and
East sales data.
3. Aggregating Sales:
o groupby('Product').agg({'Sales': 'sum'}) groups the
sales by Product and computes the total sales (sum)
for each product.
4. Saving Results:
o The aggregated sales are saved as new CSV files for
the combined regions.
Data Set Samples:
3. Write a NumPy program to capitalize the first letter, lowercase,
uppercase, swapcase, title-case of all the elements of a given array.
Original Array: ['python' 'PHP' 'java' 'C++']
Expected Output:
Capitalized: ['Python' 'Php' 'Java' 'C++']
Lowered: ['python' 'php' 'java' 'c++']
Uppered: ['PYTHON' 'PHP' 'JAVA' 'C++']
Swapcased: ['PYTHON' 'php' 'JAVA' 'c++']
Titlecased: ['Python' 'Php' 'Java' 'C++']

i) Capitalize, Lowercase, Uppercase, Swapcase, and Title-case all elements of a
given array

Program for String Operations:
We can use NumPy's vectorized string operations (np.char) to manipulate the
strings in the array efficiently. Here's how we can solve the problem:

import numpy as np

# Given array
arr = np.array(['python', 'PHP', 'java', 'C++'])

# Capitalized: Capitalize the first letter of each element
capitalized = np.char.capitalize(arr)

# Lowered: Convert all characters to lowercase
lowered = np.char.lower(arr)

# Uppered: Convert all characters to uppercase
uppered = np.char.upper(arr)

# Swapcased: Swap the case of each character
swapcased = np.char.swapcase(arr)

# Titlecased: Capitalize the first letter of each word
titlecased = np.char.title(arr)

# Print results
print("Original Array:", arr)
print("Capitalized:", capitalized)
print("Lowered:", lowered)
print("Uppered:", uppered)
print("Swapcased:", swapcased)
print("Titlecased:", titlecased)
ii) Write a Numpy program to compute the mean, standard deviation, and
variance of a given array along the second axis.

Original array: [0 1 2 3 4 5]

import numpy as np

# Given array [0 1 2 3 4 5], reshaped to 2x3 so that it has a second axis
arr = np.array([0, 1, 2, 3, 4, 5]).reshape(2, 3)

# Mean along the second axis (axis=1, i.e. across the columns of each row)
mean = np.mean(arr, axis=1)

# Standard deviation along the second axis
std_dev = np.std(arr, axis=1)

# Variance along the second axis
variance = np.var(arr, axis=1)

# Print results
print("Original Array:\n", arr)
print("Mean along second axis:", mean)
print("Standard Deviation along second axis:", std_dev)
print("Variance along second axis:", variance)
UNIT –V
1. Explain the Visualization with seaborn with suitable example programs
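The full answer for this question is not reproduced here. As a starting point, Seaborn is a statistical visualization library built on top of Matplotlib that offers high-level plotting functions and attractive default styles. A minimal sketch, assuming a recent Seaborn version and using its built-in 'tips' sample dataset (downloaded on first use):

import seaborn as sns
import matplotlib.pyplot as plt

# Load a built-in sample dataset
tips = sns.load_dataset("tips")

# Scatter plot with a linear regression fit: total bill vs tip, split by smoker
sns.lmplot(data=tips, x="total_bill", y="tip", hue="smoker")
plt.show()

# Distribution of total bill amounts with a kernel density estimate
sns.histplot(data=tips, x="total_bill", kde=True)
plt.show()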
2. Elaborate the concept of subplots and its applications with suitable
examples
Concept of Subplots
Subplots are a feature used in data visualization that allows multiple
plots to be displayed within a single figure or canvas. Instead of visualizing data
in separate figures, subplots organize them into a grid layout (rows and columns)
within one figure. This is especially useful when comparing different datasets,
visualizing different aspects of the same dataset, or presenting summary
information.

Subplots are a feature available in most visualization libraries, such as
Matplotlib (Python), Seaborn (Python), and MATLAB.

Key Applications of Subplots

1. Comparative Analysis
Subplots allow comparisons between different datasets or variables side-by-side
in a compact layout.
2. Multi-Dimensional Data Visualization
For datasets with multiple variables, subplots provide an effective way to
visualize relationships across all variables.
3. Model Diagnostics
Subplots can be used to display model performance metrics such as confusion
matrices, learning curves, or residual plots simultaneously.
4. Time Series Analysis
Multiple time series can be plotted as subplots for an overview of trends or
comparisons.
5. Presentation and Reporting
Combining plots in one figure improves clarity and reduces clutter in reports or
presentations.
Examples of Subplots
Example 1: Comparing Data Trends
Suppose you want to compare the sales performance of three regions (Region A,
Region B, Region C) over time.

import matplotlib.pyplot as plt
import numpy as np

# Sample data
months = np.arange(1, 13)
sales_a = np.random.randint(200, 500, 12)
sales_b = np.random.randint(150, 450, 12)
sales_c = np.random.randint(100, 400, 12)

# Create subplots
fig, axes = plt.subplots(3, 1, figsize=(8, 10))

axes[0].plot(months, sales_a, label="Region A", color='blue')
axes[0].set_title("Sales in Region A")
axes[0].set_xlabel("Months")
axes[0].set_ylabel("Sales")

axes[1].plot(months, sales_b, label="Region B", color='green')
axes[1].set_title("Sales in Region B")
axes[1].set_xlabel("Months")
axes[1].set_ylabel("Sales")

axes[2].plot(months, sales_c, label="Region C", color='red')
axes[2].set_title("Sales in Region C")
axes[2].set_xlabel("Months")
axes[2].set_ylabel("Sales")

plt.tight_layout()
plt.show()
Example 2: Visualizing Relationships Between Variables
In data science, it's common to analyze the relationship between features.
Subplots can visualize scatter plots for pairs of variables.

import matplotlib.pyplot as plt
import numpy as np

# Generate random data
x = np.random.rand(50)
y1 = x + np.random.rand(50) * 0.1 # Linear relationship
y2 = x**2 + np.random.rand(50) * 0.1 # Quadratic relationship

# Subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Linear relationship
axes[0].scatter(x, y1, color='blue')
axes[0].set_title("Linear Relationship")
axes[0].set_xlabel("X")
axes[0].set_ylabel("Y1")

# Quadratic relationship
axes[1].scatter(x, y2, color='red')
axes[1].set_title("Quadratic Relationship")
axes[1].set_xlabel("X")
axes[1].set_ylabel("Y2")

plt.tight_layout()
plt.show()

Advantages of Subplots
• Efficient comparison and storytelling.
• Organizes complex data for better interpretation.
• Enhances reporting clarity by combining multiple visualizations in a single
figure.
Limitations
• If too many subplots are used, they may become hard to read or interpret.
• Proper scaling and formatting are necessary to avoid cluttered visualizations.
Subplots are a powerful tool for presenting insights in data science and other
analytical fields, offering a structured approach to multivariable analysis.

3. How text and image annotations are done using Python? Give an example
of your own with appropriate Python code.

Text and image annotations in Python can be done using libraries like
Matplotlib, which allows you to add descriptive text, arrows, and markers to
plots for better clarity and communication of insights.
Annotating Text and Images
• Text Annotation: Adds text at specific locations in the plot.
o plt.text(x, y, text, ...): Places text at the specified (x, y) coordinate.
o plt.annotate(...): Offers more control, such as pointing to a location with
an arrow.
• Image Annotation: Can include overlaying images, highlighting regions, or
combining images with plots using imshow().

Example: Annotating Text and an Overlayed Image

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.offsetbox import OffsetImage, AnnotationBbox

# Sample data
x = np.linspace(0, 10, 100)
y = np.sin(x)

# Create the main plot
plt.figure(figsize=(10, 6))
plt.plot(x, y, label="Sine Wave", color="blue")
plt.title("Annotated Plot with Image", fontsize=14)
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.grid(alpha=0.5)

# Add text annotation
plt.text(6, 0.5, "Peak of the wave", fontsize=12, color="darkgreen")

# Add an arrow pointing to the peak
plt.annotate(
    "Wave Peak",
    xy=(np.pi / 2, 1),   # Pointed location
    xytext=(3, 1.5),     # Text location
    arrowprops=dict(facecolor="red", shrink=0.05, width=2),
    fontsize=12,
    color="black",
)

# Add an image annotation
# (We will use a small red dot image to simulate annotation here)
dot_image = np.zeros((10, 10, 3))
dot_image[:, :, 0] = 1  # Red color for the dot
dot_box = AnnotationBbox(OffsetImage(dot_image, zoom=1.5), (np.pi, 0), frameon=False)
plt.gca().add_artist(dot_box)

# Add legend
plt.legend()
plt.tight_layout()
plt.show()

Output:
Explanation of Code
1. Plotting Data:
o A sine wave is plotted using plt.plot().
2. Adding Text:
o plt.text() is used to add descriptive text at a specific point.
3. Annotating with an Arrow:
o plt.annotate() adds a label pointing to the wave's peak using an arrow.
4. Adding an Image:
o AnnotationBbox and OffsetImage are used to place an image (red dot) at a
specific location.

15-Marks
1. Explain in details about the function of mpl_tool kit for Geographic data
Visualisation with suitable example programs.

mpl_toolkits.basemap for Geographic Data Visualization


The mpl_toolkits.basemap is a Python library built on Matplotlib for
geographic data visualization. It provides tools to plot data on maps, visualize
geographic distributions, and create a wide variety of map projections. While
Basemap is older and somewhat replaced by Cartopy, it remains popular due to
its simplicity and integration with Matplotlib.
Functions of mpl_toolkits.basemap
1. Map Projections:
o Supports various map projections like Mercator, Lambert Conformal,
Orthographic, and more.
2. Map Features:
o Add coastlines, countries, states, rivers, and lakes using built-in functions.
o Overlay grid lines for latitude and longitude.
3. Data Visualization:
o Plot geographic data like scatter points, line paths, or polygons.
o Use filled contours for heatmaps or choropleth maps.
4. Custom Annotations:
o Add text, markers, and legends for geographic datasets.
5. Integration:
o Easily integrates with Matplotlib for seamless customization.
Key Methods in Basemap
• Initialization: Basemap() creates a map object with a specific projection and
region.
• Adding Features:
o drawcoastlines(): Adds coastlines.
o drawcountries(): Adds country borders.
o drawrivers(): Draws major rivers.
o drawmapboundary(): Draws the map boundary.
• Plotting Data:
o scatter(): Plots data points using latitude and longitude.
o contourf(): Creates filled contour maps for data distribution.

Example 1: Plotting Markers on a World Map

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

# Create a Basemap object with Mercator projection
m = Basemap(projection='merc',
            llcrnrlat=-60, urcrnrlat=85,     # Latitude range
            llcrnrlon=-180, urcrnrlon=180,   # Longitude range
            resolution='c')                  # 'c' for crude resolution

# Draw map features
m.drawcoastlines()
m.drawcountries()
m.drawmapboundary(fill_color='lightblue')
m.fillcontinents(color='lightgreen', lake_color='lightblue')

# Plot markers for major cities
cities = {'New York': (-74.006, 40.7128),
          'London': (-0.1276, 51.5074),
          'Tokyo': (139.6917, 35.6895)}

for city, (lon, lat) in cities.items():
    x, y = m(lon, lat)
    m.scatter(x, y, marker='o', color='red', s=100)
    plt.text(x, y, city, fontsize=12, ha='left', va='center', color='black')

plt.title("World Map with Major Cities")
plt.show()
Output:

Example 2: Visualizing Data on a Choropleth Map

from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
import numpy as np

# Create a Basemap object for the US
m = Basemap(projection='merc',
            llcrnrlat=24, urcrnrlat=50,
            llcrnrlon=-125, urcrnrlon=-66,
            resolution='i')

# Draw map features
m.drawcoastlines()
m.drawcountries()
m.drawstates()
m.drawmapboundary(fill_color='lightblue')
m.fillcontinents(color='lightgreen', lake_color='lightblue')

# Randomly generate some data for visualization
lats = np.random.uniform(24, 50, 100)
lons = np.random.uniform(-125, -66, 100)
data = np.random.uniform(0, 100, 100)  # Example data: population density

# Convert lat/lon to map projection coordinates
x, y = m(lons, lats)

# Scatter plot with colour and size based on data
m.scatter(x, y, c=data, cmap='viridis', s=data*10, alpha=0.7)

plt.title("Population Density Visualization in the US")
plt.colorbar(label='Population Density')
plt.show()

Output:

Key Points
1. mpl_toolkits.basemap is ideal for simple geographic visualizations but is
less powerful than Cartopy for advanced tasks.
2. It supports both static maps and dynamic overlays with data-driven
visualizations.
3. Combining map features like coastlines, states, and rivers with scatter
plots or heatmaps enhances visualization quality.
2. Find the following for the given data set:
Mean, Median, Mode, Variance, Standard Deviation and skewness.

Marks            0-10  10-20  20-30  30-40  40-50  50-60  60-70  70-80
No. of students   10    40     20      0     10     40     16     14
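The worked solution is not shown above; a minimal NumPy sketch using class midpoints is given below. Approximate values appear in the comments; Karl Pearson's median-based coefficient is used for skewness, and note that the highest frequency (40) occurs for two classes (10-20 and 50-60), so the distribution is bimodal.

import numpy as np

midpoints = np.array([5, 15, 25, 35, 45, 55, 65, 75])   # class midpoints
freq = np.array([10, 40, 20, 0, 10, 40, 16, 14])        # number of students
n = freq.sum()                                          # 150

# Mean of grouped data: sum(f * m) / N
mean = np.sum(freq * midpoints) / n                     # ~ 39.27

# Median by interpolation: median class is 40-50 (cumulative frequency passes N/2 = 75)
L, cf_before, f_med, h = 40, 70, 10, 10
median = L + (n / 2 - cf_before) / f_med * h            # 45.0

# Variance and standard deviation of grouped data
variance = np.sum(freq * (midpoints - mean) ** 2) / n   # ~ 520.5
std_dev = np.sqrt(variance)                             # ~ 22.8

# Skewness (Karl Pearson's median-based coefficient): 3 * (mean - median) / std
skewness = 3 * (mean - median) / std_dev                # ~ -0.75 (negatively skewed)

print(mean, median, variance, std_dev, skewness)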
