ModelQB - Part B&C-1
The data mining process consists of the following steps, and for each step
Microsoft SQL Server provides technologies that you can use to complete it.
Defining the Problem
The first step in the data mining process is to clearly define the problem, and
consider ways that data can be utilized to provide an answer to the problem.
This step includes analyzing business requirements, defining the scope of the
problem, defining the metrics by which the model will be evaluated, and defining
specific objectives for the data mining project. These tasks translate into
questions such as the following:
• What are you looking for? What types of relationships are you trying to
find?
• Does the problem you are trying to solve reflect the policies or processes of
the business?
• Do you want to make predictions from the data mining model, or just look
for interesting patterns and associations?
• Which outcome or attribute do you want to try to predict?
• What kind of data do you have and what kind of information is in each
column? If there are multiple tables, how are the tables related? Do you need to
perform any cleansing, aggregation, or processing to make the data usable?
• How is the data distributed? Is the data seasonal? Does the data
accurately represent the processes of the business?
Preparing Data
• The second step in the data mining process is to consolidate and clean the
data that was identified in the Defining the Problem step.
• Data can be scattered across a company and stored in different formats, or
may contain inconsistencies such as incorrect or missing entries.
• Data cleaning is not just about removing bad data or interpolating
missing values, but about finding hidden correlations in the data, identifying
the sources of data that are the most accurate, and determining which columns
are the most appropriate for use in analysis.
Exploring Data
Exploration techniques include calculating the minimum and maximum values,
calculating mean and standard deviations, and looking at the distribution of the
data. For example, you might determine by reviewing the maximum, minimum,
and mean values that the data is not representative of your customers or
business processes, and that you therefore must obtain more balanced data or
review the assumptions that are the basis for your expectations. Standard
deviations and other distribution values can provide useful information about the
stability and accuracy of the results.
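As a brief illustration of these exploration techniques, a minimal pandas sketch (the purchase amounts below are made up purely for illustration):
import pandas as pd
# Hypothetical sample of customer purchase amounts
purchases = pd.Series([120, 95, 400, 80, 150, 60, 3000, 110])
# Minimum, maximum, mean, and standard deviation
print("Min:", purchases.min())
print("Max:", purchases.max())
print("Mean:", purchases.mean())
print("Std deviation:", purchases.std())
# Full distribution summary (the quartiles reveal the skew caused by the 3000 outlier)
print(purchases.describe())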
Building Models
In this step you define a mining structure that identifies the columns of source
data to be used, and then build one or more mining models on that structure.
The mining structure is linked to the source of data, but does not actually contain
any data until you process it. When you process the mining structure, SQL
Server Analysis Services generates aggregates and other statistical information
that can be used for analysis. This information can be used by any mining model
that is based on the structure.
Exploring and Validating Models
Before you deploy a model into a production environment, you will want to test
how well the model performs. Also, when you build a model, you typically create
multiple models with different configurations and test all models to see which
yields the best results for your problem and your data.
Deploying and Updating Models
After the mining models exist in a production environment, you can perform
many tasks, depending on your needs. The following are some of the tasks you
can perform:
• Use the models to create predictions, which you can then use to make
business decisions.
• Create content queries to retrieve statistics, rules, or formulas from the
model.
• Embed data mining functionality directly into an application. You can
include Analysis Management Objects (AMO), which contains a set of objects
that your application can use to create, alter, process, and delete mining
structures and mining models.
• Use Integration Services to create a package in which a mining model is
used to intelligently separate incoming data into multiple tables.
• Create a report that lets users directly query against an existing mining
model.
• Update the models after review and analysis. Any update requires that
you reprocess the models.
• Update the models dynamically as more data comes into the organization;
making constant changes to improve the effectiveness of the solution should
be part of the deployment strategy.
DATA WAREHOUSING
Data warehousing is the process of constructing and using a data warehouse. A
data warehouse is constructed by integrating data from multiple heterogeneous
sources, and it supports analytical reporting, structured and/or ad hoc queries, and
decision making. Data warehousing involves data cleaning, data integration, and
data consolidation.
Characteristics of a data warehouse
The main characteristics of a data warehouse are as follows:
• Subject-Oriented
A data warehouse is subject-oriented because it provides information organized
around specific subjects rather than around the overall processes of the business.
Such subjects may be sales, promotions, inventory, etc.
• Integrated
A data warehouse is developed by integrating data from varied sources into a
consistent format. The data must be stored in the warehouse in a consistent and
universally acceptable manner in terms of naming, format, and coding. This
facilitates effective data analysis.
• Non-Volatile
Data once entered into a data warehouse must remain unchanged. All data is
read-only. Previous data is not erased when current data is entered. This helps
you to analyze what has happened and when.
• Time-Variant
The data stored in a data warehouse is documented with an element of time,
either explicitly or implicitly. For example, the primary key of a warehouse
record typically includes an element of time, such as the day, week, or month.
Data mining is one of the features of a data warehouse that involves looking for
meaningful data patterns in vast volumes of data and devising innovative
strategies for increased sales and profits.
Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because the
content is context-specific or varying. One example of unstructured data is your
regular email.
Natural language
• Natural language is a special type of unstructured data; it’s challenging to
process because it requires knowledge of specific data science techniques and
linguistics.
• The natural language processing community has had success in entity
recognition, topic recognition, summarization, text completion, and sentiment
analysis, but models trained in one domain don’t generalize well to other
domains.
• Even state-of-the-art techniques aren’t able to decipher the meaning of
every piece of text.
Machine-generated data
• Machine-generated data is information that’s automatically created by a
computer, process, application, or other machine without human intervention.
• Machine-generated data is becoming a major data resource and will
continue to do so.
• The analysis of machine data relies on highly scalable tools, due to its
high volume and speed. Examples of machine data are web server logs, call detail
records, network event logs, and telemetry.
Graph-based or network data
• “Graph data” can be a confusing term because any data can be shown in a
graph.
• Graph or network data is, in short, data that focuses on the relationship
or adjacency of objects.
• The graph structures use nodes, edges, and properties to represent and
store graph data.
• Graph-based data is a natural way to represent social networks, and its
structure allows you to calculate specific metrics such as the influence of a
person and the shortest path between two people.
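As a small illustration of such metrics, here is a hedged sketch using the networkx library on a made-up friendship network (the names and edges are illustrative assumptions):
import networkx as nx
# Build a small, made-up social network
G = nx.Graph()
G.add_edges_from([("Amy", "Ben"), ("Ben", "Cara"), ("Cara", "Dev"),
                  ("Amy", "Dev"), ("Dev", "Esha")])
# Influence of a person, approximated here by degree centrality
print(nx.degree_centrality(G))
# Shortest path between two people
print(nx.shortest_path(G, "Amy", "Esha"))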
UNIT –II
1. Explain and evaluate the describing Data with Tables and Graphs.
Construct the Frequency distribution for grouped data, Relative
frequency, Cumulative frequency, Histogram and Frequency polygon.
These are the numbers of newspapers sold at a local shop over the last 10
days: 22, 20, 18, 23, 20, 25, 22, 20, 18, 20.
Data:
Numbers of newspapers sold:
22, 20, 18, 23, 20, 25, 22, 20, 18, 20
Steps:
Identify the range:
Range=Max−Min=25−18=7
Choose class intervals: 3 intervals of size 3, e.g., 18-20, 21-23, 24-26.
Count occurrences in each interval.
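Counting the observations in each interval gives the following distribution (worked from the data above):
Class Interval   Frequency   Relative Frequency   Cumulative Frequency
18-20            6           0.6                  6
21-23            3           0.3                  9
24-26            1           0.1                  10
The histogram plots these frequencies against the class intervals, and the frequency polygon is drawn by joining the frequencies at the class midpoints (19, 22, and 25).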
2. Explain and Evaluate the Normal distribution and Z Score with formulae and
Solve: For some computers, the time period between charges of the battery is
normally distributed with a mean of 50 hours and a standard deviation of 15
hours. Rohan has one of these computers and needs to know the probability
a) Time period higher than 80 hours
b) Time Period below 35 Hours
c) Time period will be between 50 and 70 hours.
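A hedged sketch of the calculation using SciPy; the z-values follow from z = (X - mean) / standard deviation, with mean 50 and standard deviation 15:
from scipy.stats import norm
mu, sigma = 50, 15   # mean and standard deviation of the battery time (hours)
# a) P(X > 80): z = (80 - 50) / 15 = 2.0, so P is about 0.0228
p_a = norm.sf(80, loc=mu, scale=sigma)
# b) P(X < 35): z = (35 - 50) / 15 = -1.0, so P is about 0.1587
p_b = norm.cdf(35, loc=mu, scale=sigma)
# c) P(50 < X < 70): z runs from 0 to 1.33, so P is about 0.4088 (about 0.4082 using the z-table value for 1.33)
p_c = norm.cdf(70, loc=mu, scale=sigma) - norm.cdf(50, loc=mu, scale=sigma)
print(p_a, p_b, p_c)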
3. Explain and evaluate the Describing Data with Averages (measures of
central tendency with formulae) and various measures of variability. And
also find the mean, mode, median, variance and standard deviation for the
following Data Science CIA2 marks 76,87,34,12,89,95,67.
Evaluation:
The mean provides a central value but is sensitive to outliers (e.g.,12 in this case).
The median is robust to outliers and better represents the center for skewed data.
The standard deviation shows a high variability in marks, reflected in the wide range of
scores.
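A short NumPy sketch to verify the figures behind this evaluation (values rounded):
import numpy as np
marks = np.array([76, 87, 34, 12, 89, 95, 67])
mean = np.mean(marks)       # about 65.71
median = np.median(marks)   # 76 (middle value of the sorted marks)
# Mode: every mark occurs exactly once, so this data set has no mode.
variance = np.var(marks)    # population variance about 835.92 (ddof=1 gives the sample variance, about 975.24)
std_dev = np.std(marks)     # population standard deviation about 28.91
print(mean, median, variance, std_dev)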
UNIT –III
1. Discuss the significance of regression, regression line and least squares
regression equation and solve: The details about technicians’ experience in a
company (in several years) and their performance rating are in the table below.
Using these values, estimate the performance rating for a technician with 21
years of experience.
Experience 16 18 4 10 12
Rating 87 89 68 80 83
Solution
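A hedged sketch of the least-squares calculation; the regression line is y = a + bX, with b = sum((X - mean X)(Y - mean Y)) / sum((X - mean X)^2):
import numpy as np
experience = np.array([16, 18, 4, 10, 12])
rating = np.array([87, 89, 68, 80, 83])
# Hand calculation: mean X = 12, mean Y = 81.4, numerator = 178, denominator = 120,
# so b = 178 / 120 = 1.4833 (approx.) and a = 81.4 - 1.4833 * 12 = 63.6
b, a = np.polyfit(experience, rating, 1)
# Estimated rating for 21 years of experience: 63.6 + 1.4833 * 21, about 94.75
print(a + b * 21)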
2. Elaborate in detail the significance of correlation and the various types of
correlation. Calculate and analyse the correlation coefficient (r) for a study on
feelings of stress and life satisfaction, in which participants completed a measure
of how stressed they were feeling, and draw the scatter plot.
Stress Score 11 25 10 7 6
Life Satisfaction 7 1 6 9 8
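A hedged sketch of the calculation and the scatter plot; for this data the Pearson coefficient works out to about r = -0.97, a strong negative correlation:
import numpy as np
import matplotlib.pyplot as plt
stress = np.array([11, 25, 10, 7, 6])
satisfaction = np.array([7, 1, 6, 9, 8])
# Pearson correlation coefficient: about -0.97
r = np.corrcoef(stress, satisfaction)[0, 1]
print("r =", round(r, 2))
# Scatter plot of stress score against life satisfaction
plt.scatter(stress, satisfaction)
plt.xlabel("Stress Score")
plt.ylabel("Life Satisfaction")
plt.title("Stress vs Life Satisfaction")
plt.show()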
3. What is Regression? Build regression line and least squares regression
line with suitable examples.
Regression is a statistical technique used to identify the relationship
between a dependent variable and one or more independent variables. It helps in
predicting the value of the dependent variable based on the known values of the
independent variables. One of the most common types of regression is linear
regression, where the relationship between the variables is represented by a
straight line.
Let's consider a simple example where we want to predict the height of a person
based on their age.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Sample data
ages = np.array([5, 10, 15, 20, 25, 30, 35]).reshape(-1, 1)
heights = np.array([110, 120, 130, 140, 150, 160, 170])
# Fit a least squares regression line to the data
model = LinearRegression().fit(ages, heights)
# Predict heights
predicted_heights = model.predict(ages)
# Plot the observed points and the fitted line
plt.scatter(ages, heights, label="Observed")
plt.plot(ages, predicted_heights, color="red", label="Fitted line")
plt.legend()
plt.show()
The fitted regression line has the form
y = a + bX
where:
• a is the y-intercept.
• b is the slope of the line.
import numpy as np
# Sample data
ages = np.array([5, 10, 15, 20, 25, 30, 35])
heights = np.array([110, 120, 130, 140, 150, 160, 170])
# Least squares estimates of the slope b and the intercept a
b = np.sum((ages - ages.mean()) * (heights - heights.mean())) / np.sum((ages - ages.mean()) ** 2)
a = heights.mean() - b * ages.mean()
print("Intercept a =", a, "Slope b =", b)   # a = 100.0 and b = 2.0 for this data
Explanation:
1. Reading CSV Files:
o We assume that the CSV files have at least two
columns: Product and Sales.
2. Combining Regions:
o pd.concat is used to combine rows from the North
and West sales data, and similarly for South and
East sales data.
3. Aggregating Sales:
o groupby('Product').agg({'Sales': 'sum'}) groups the
sales by Product and computes the total sales (sum)
for each product.
4. Saving Results:
o The aggregated sales are saved as new CSV files for
the combined regions.
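A minimal pandas sketch of the workflow described above (the file names, such as north.csv, are assumptions for illustration):
import pandas as pd
# Read the regional sales files (assumed to contain Product and Sales columns)
north = pd.read_csv("north.csv")
west = pd.read_csv("west.csv")
south = pd.read_csv("south.csv")
east = pd.read_csv("east.csv")
# Combine regions: North with West, and South with East
north_west = pd.concat([north, west])
south_east = pd.concat([south, east])
# Aggregate total sales per product for each combined region
north_west_totals = north_west.groupby('Product').agg({'Sales': 'sum'}).reset_index()
south_east_totals = south_east.groupby('Product').agg({'Sales': 'sum'}).reset_index()
# Save the aggregated results as new CSV files
north_west_totals.to_csv("north_west_sales.csv", index=False)
south_east_totals.to_csv("south_east_sales.csv", index=False)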
Data Set Samples:
3. Write a NumPy program to capitalize the first letter, lowercase,
uppercase, swapcase, title-case of all the elements of a given array.
Original Array: ['python' 'PHP' 'java' 'C++']
Expected Output:
Capitalized: ['Python' 'Php' 'Java' 'C++']
Lowered: ['python' 'php' 'java' 'c++']
Uppered: ['PYTHON' 'PHP' 'JAVA' 'C++']
Swapcased: ['PYTHON' 'php' 'JAVA' 'c++']
Titlecased: ['Python' 'Php' 'Java' 'C++']
import numpy as np
# Given array
arr = np.array(['python', 'PHP', 'java', 'C++'])
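A short completion of the snippet above, using NumPy's vectorized string routines in np.char:
import numpy as np
# Given array
arr = np.array(['python', 'PHP', 'java', 'C++'])
# Element-wise case conversions
print("Original Array:", arr)
print("Capitalized:", np.char.capitalize(arr))
print("Lowered:", np.char.lower(arr))
print("Uppered:", np.char.upper(arr))
print("Swapcased:", np.char.swapcase(arr))
print("Titlecased:", np.char.title(arr))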
Original array: [0 1 2 3 4 5]
import numpy as np
# Given 2D array
arr = np.array([[0, 1, 2], [3, 4, 5]])
# Compute statistics along the second axis (axis=1)
mean = np.mean(arr, axis=1)
std_dev = np.std(arr, axis=1)
variance = np.var(arr, axis=1)
# Print results
print("Original Array:\n", arr)
print("Mean along second axis:", mean)
print("Standard Deviation along second axis:", std_dev)
print("Variance along second axis:", variance)
UNIT –V
1. Explain the Visualization with seaborn with suitable example programs
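A minimal Seaborn sketch for this question (the data here is randomly generated purely for illustration):
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Synthetic data: exam scores for two groups
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "score": np.concatenate([rng.normal(60, 10, 100), rng.normal(75, 8, 100)]),
    "group": ["A"] * 100 + ["B"] * 100,
})
# Seaborn histogram with a kernel density estimate, split by group
sns.histplot(data=data, x="score", hue="group", kde=True)
plt.title("Score distribution by group")
plt.show()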
2. Elaborate the concept of subplots and its applications with suitable
examples
Concept of Subplots
Subplots are a feature used in data visualization that allows multiple
plots to be displayed within a single figure or canvas. Instead of visualizing data
in separate figures, subplots organize them into a grid layout (rows and columns)
within one figure. This is especially useful when comparing different datasets,
visualizing different aspects of the same dataset, or presenting summary
information.
1. Comparative Analysis
Subplots allow comparisons between different datasets or variables side-by-side
in a compact layout.
2. Multi-Dimensional Data Visualization
For datasets with multiple variables, subplots provide an effective way to
visualize relationships across all variables.
3. Model Diagnostics
Subplots can be used to display model performance metrics such as confusion
matrices, learning curves, or residual plots simultaneously.
4. Time Series Analysis
Multiple time series can be plotted as subplots for an overview of trends or
comparisons.
5. Presentation and Reporting
Combining plots in one figure improves clarity and reduces clutter in reports or
presentations.
Examples of Subplots
Example 1: Comparing Data Trends
Suppose you want to compare the sales performance of three regions (Region A,
Region B, Region C) over time.
import numpy as np
import matplotlib.pyplot as plt
# Sample data
months = np.arange(1, 13)
sales_a = np.random.randint(200, 500, 12)
sales_b = np.random.randint(150, 450, 12)
sales_c = np.random.randint(100, 400, 12)
# Create subplots: one row per region
fig, axes = plt.subplots(3, 1, figsize=(8, 10))
for ax, sales, name in zip(axes, [sales_a, sales_b, sales_c], ["Region A", "Region B", "Region C"]):
    ax.plot(months, sales)
    ax.set_title(name)
plt.tight_layout()
plt.show()
Example 2: Visualizing Relationships Between Variables
In data science, it's common to analyze the relationship between features.
Subplots can visualize scatter plots for pairs of variables.
import numpy as np
import matplotlib.pyplot as plt
# Sample data: one linear and one quadratic relationship
x = np.linspace(0, 10, 50)
y1 = 2 * x + 1
y2 = x ** 2
# Subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Linear relationship
axes[0].scatter(x, y1, color='blue')
axes[0].set_title("Linear Relationship")
axes[0].set_xlabel("X")
axes[0].set_ylabel("Y1")
# Quadratic relationship
axes[1].scatter(x, y2, color='red')
axes[1].set_title("Quadratic Relationship")
axes[1].set_xlabel("X")
axes[1].set_ylabel("Y2")
plt.tight_layout()
plt.show()
Advantages of Subplots
3. How text and image annotations are done using Python? Give an example
of your own with appropriate Python code.
Text and image annotations in Python can be done using libraries like
Matplotlib, which allows you to add descriptive text, arrows, and markers to
plots for better clarity and communication of insights.
Annotating Text and Images
Text Annotation: Adds text at specific locations in the plot.
o plt.text(x, y, text, ...): Places text at the specified (x, y) coordinate.
o plt.annotate(...): Offers more control, such as pointing to a location with
an arrow.
Image Annotation: Can include overlaying images, highlighting regions, or
combining images with plots using imshow().
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.offsetbox import OffsetImage, AnnotationBbox
# Sample data
x = np.linspace(0, 10, 100)
y = np.sin(x)
# Create the main plot
plt.figure(figsize=(10, 6))
plt.plot(x, y, label="Sine Wave", color="blue")
plt.title("Annotated Plot with Image", fontsize=14)
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.grid(alpha=0.5)
# Text annotation: a label placed at a fixed (x, y) position
plt.text(1.5, -0.75, "y = sin(x)", fontsize=10)
# Arrow annotation pointing to the first peak of the sine wave
plt.annotate("Peak", xy=(np.pi / 2, 1), xytext=(4, 0.75),
             arrowprops=dict(arrowstyle="->", color="black"))
# Image annotation: place a small red-dot image on the second peak
red_dot = np.zeros((10, 10, 3))
red_dot[..., 0] = 1.0
ab = AnnotationBbox(OffsetImage(red_dot, zoom=2), (2.5 * np.pi, 1), frameon=False)
plt.gca().add_artist(ab)
# Add legend
plt.legend()
plt.tight_layout()
plt.show()
Output:
Explanation of Code
1. Plotting Data:
o A sine wave is plotted using plt.plot().
2. Adding Text:
o plt.text() is used to add descriptive text at a specific point.
3. Annotating with an Arrow:
o plt.annotate() adds a label pointing to the wave's peak using an arrow.
4. Adding an Image:
o AnnotationBbox and OffsetImage are used to place an image (red dot) at a
specific location.
15-Marks
1. Explain in detail the functions of mpl_toolkits for geographic data
visualisation with suitable example programs.
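A hedged Basemap sketch of such a program (the map extent over India and the city coordinates are assumptions chosen only for illustration; the basemap package must be installed separately):
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
# Base map covering India, drawn with a Mercator projection
m = Basemap(projection='merc', llcrnrlat=5, urcrnrlat=38,
            llcrnrlon=67, urcrnrlon=99, resolution='l')
m.drawcoastlines()
m.drawcountries()
m.drawstates()
m.drawrivers(color='blue')
# Illustrative city coordinates (Delhi, Mumbai, Chennai)
lons = [77.2090, 72.8777, 80.2707]
lats = [28.6139, 19.0760, 13.0827]
x, y = m(lons, lats)              # project longitude/latitude to map coordinates
m.scatter(x, y, s=80, color='red', zorder=5)
plt.title("Cities plotted with mpl_toolkits.basemap")
plt.show()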
Output:
Key Points
1. mpl_toolkits.basemap is ideal for simple geographic visualizations but is
less powerful than Cartopy for advanced tasks.
2. It supports both static maps and dynamic overlays with data-driven
visualizations.
3. Combining map features like coastlines, states, and rivers with scatter
plots or heatmaps enhances visualization quality.
2. Find the following for the given data set:
Mean, Median, Mode, Variance, Standard Deviation and skewness.