Presentation1- plotting graph

The document provides an overview of various data visualization techniques and programming examples in Matlab, Python, and R for exploratory data analysis. It also discusses the calculation of activation energy and frequency factor for a chemical reaction, combustion parameters for hydrocarbons, and the sphericity of a cylinder. Additionally, it highlights key skills and tools essential for data scientists, including domain knowledge, mathematics, statistics, and programming.


Plotting graphs and solving problems
Exploratory data analysis
Matlab program
% Sample data (replace with your actual data)
time = 0:0.1:10;                 % Time vector
temperature = 20 + 10*sin(time); % Example temperature data

% Plot different types of graphs
figure;

% Line plot
subplot(2,2,1);
plot(time, temperature);
title('Line Plot');
xlabel('Time');
ylabel('Temperature');

% Scatter plot
subplot(2,2,2);
scatter(time, temperature);
title('Scatter Plot');
xlabel('Time');
ylabel('Temperature');

% Bar plot (for discrete time points)
time_discrete = 0:1:10;                       % Example discrete time points
temperature_discrete = temperature(1:10:end); % Corresponding temperature values (every 10th sample)
subplot(2,2,3);
bar(time_discrete, temperature_discrete);
title('Bar Plot');
xlabel('Time');
ylabel('Temperature');

% Histogram (for data distribution)
subplot(2,2,4);
histogram(temperature);
title('Histogram');
xlabel('Temperature');
ylabel('Frequency');

% Add more plot types as needed:
% - Area plot: area(time, temperature)
% - Stem plot: stem(time, temperature)
% - Step plot: stairs(time, temperature)
% - Pie chart (if applicable): pie(temperature_discrete)

% Customize plots further:
% - Change colors, line styles, markers
% - Add legends, grid lines, annotations
% - Adjust axis limits
Python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Sample process data (replace with your actual data)
data = {'Time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'Temperature': [25, 27, 28, 26, 29, 30, 28, 27, 26, 25],
        'Pressure': [101, 102, 100, 101, 103, 104, 102, 101, 100, 99],
        'Flow Rate': [50, 52, 48, 51, 53, 55, 52, 50, 49, 48]}
df = pd.DataFrame(data)

# Line plot
plt.figure(figsize=(10, 6))
plt.plot(df['Time'], df['Temperature'], label='Temperature')
plt.plot(df['Time'], df['Pressure'], label='Pressure')
plt.plot(df['Time'], df['Flow Rate'], label='Flow Rate')
plt.xlabel('Time')
plt.ylabel('Value')
plt.title('Process Variables Over Time')
plt.legend()
plt.show()

# Bar plot
plt.figure(figsize=(8, 6))
plt.bar(df['Time'], df['Temperature'], label='Temperature')
plt.xlabel('Time')
plt.ylabel('Temperature')
plt.title('Temperature at Different Times')
plt.legend()
plt.show()

# Histogram
plt.figure(figsize=(8, 6))
plt.hist(df['Temperature'], bins=10, color='blue', alpha=0.7)
plt.xlabel('Temperature')
plt.ylabel('Frequency')
plt.title('Temperature Distribution')
plt.show()

# Scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(df['Temperature'], df['Pressure'], c=df['Flow Rate'], cmap='viridis')
plt.xlabel('Temperature')
plt.ylabel('Pressure')
plt.title('Temperature vs. Pressure')
plt.colorbar(label='Flow Rate')
plt.show()

# Box plot
plt.figure(figsize=(8, 6))
sns.boxplot(x='variable', y='value',
            data=pd.melt(df, id_vars=['Time'], var_name='variable', value_name='value'))
plt.xlabel('Variable')
plt.ylabel('Value')
plt.title('Box Plot of Process Variables')
plt.show()
R program
# Sample data (replace with your actual data)
# Assuming 'time' is a time series and 'value' is the corresponding measurement
time <- seq(as.POSIXct("2024-01-01"), as.POSIXct("2024-01-10"), by = "hour")
value <- rnorm(length(time), mean = 100, sd = 5)

# Create a data frame
data <- data.frame(time, value)

# 1. Time Series Plot
plot(data$time, data$value, type = "l", xlab = "Time", ylab = "Value",
     main = "Time Series Plot")

# 2. Histogram
hist(data$value, main = "Histogram of Values", xlab = "Value")

# 3. Boxplot
boxplot(data$value, main = "Boxplot of Values")

# 4. Scatterplot (if you have another variable)
# Assuming 'another_variable' is available
# scatterplot(data$another_variable, data$value, main = "Scatterplot")

# 5. Autocorrelation Plot (for time series analysis)
acf(data$value, main = "Autocorrelation Plot")

# 6. Density Plot
plot(density(data$value), main = "Density Plot", xlab = "Value")

# Customize plots further:
# - Add colors, lines, points, and annotations using functions like
#   `lines()`, `points()`, `text()`, `abline()`.
# - Adjust plot parameters using the `par()` function (e.g.,
#   `par(mfrow = c(2, 2))` for multiple plots in one window).
# - Use the `ggplot2` package for more advanced and customizable visualizations.

# Example using ggplot2 (install ggplot2 first: install.packages("ggplot2"))
library(ggplot2)

# Time Series Plot with ggplot2
ggplot(data, aes(x = time, y = value)) +
  geom_line() +
  labs(x = "Time", y = "Value", title = "Time Series Plot (ggplot2)")

# Histogram with ggplot2
ggplot(data, aes(x = value)) +
  geom_histogram(binwidth = 1, fill = "lightblue", color = "black") +
  labs(x = "Value", y = "Frequency", title = "Histogram of Values (ggplot2)")
For a certain chemical reaction A → R, the following data were obtained:

Temperature, °C       100         110         120         130         140         150
Rate constant, s⁻¹    1.055e-16   1.070e-15   9.25e-15    6.94e-14    4.58e-13    3.19e-12

Find the activation energy and frequency factor for this reaction.
% Data
T = [100 110 120 130 140 150] + 273.15; % Temperature in Kelvin
k = [1.055e-16 1.070e-15 9.25e-15 6.94e-14 4.58e-13 3.19e-12]; % Rate constants
% Linearize the Arrhenius equation: ln(k) = ln(A) - Ea/(R*T)
ln_k = log(k);
% Perform linear regression
p = polyfit(1./T, ln_k, 1);
% Extract slope and intercept
slope = p(1);
intercept = p(2);
% Calculate activation energy (Ea)
Ea = -slope * 8.314; % 8.314 J/mol*K is the ideal gas constant
% Calculate frequency factor (A)
A = exp(intercept);

% Display results
fprintf('Activation Energy (Ea): %.2f kJ/mol\n', Ea/1000);
fprintf('Frequency Factor (A): %.2e s^-1\n', A);
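For comparison, here is a minimal Python sketch of the same Arrhenius fit (assuming NumPy is available; the variable names mirror the MATLAB script):

import numpy as np

# Data: temperatures in Kelvin and rate constants in s^-1
T = np.array([100, 110, 120, 130, 140, 150]) + 273.15
k = np.array([1.055e-16, 1.070e-15, 9.25e-15, 6.94e-14, 4.58e-13, 3.19e-12])

# Linearize the Arrhenius equation: ln(k) = ln(A) - Ea/(R*T)
slope, intercept = np.polyfit(1 / T, np.log(k), 1)

R = 8.314               # J/(mol*K), ideal gas constant
Ea = -slope * R         # Activation energy in J/mol
A = np.exp(intercept)   # Frequency factor in s^-1

print(f"Activation Energy (Ea): {Ea / 1000:.2f} kJ/mol")
print(f"Frequency Factor (A): {A:.2e} s^-1")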
A hydrocarbon is burnt with excess air. The Orsat analysis of the flue gas shows 10.81% CO2, 3.78% O2 and 85.40% N2. Calculate the atomic ratio of C:H in the hydrocarbon and the % excess air.
% Given Orsat analysis of flue gas (basis: 100 mol dry flue gas)
CO2_vol = 10.81;
O2_vol = 3.78;
N2_vol = 85.40;

% Assuming dry air composition (21% O2 and 79% N2)
O2_air = 21;
N2_air = 79;

% All N2 in the flue gas comes from the air, so it fixes the O2 supplied
O2_supplied_air = N2_vol * (O2_air / N2_air);

% Moles of O2 required for complete combustion of carbon (C + O2 -> CO2)
O2_required_C = CO2_vol;

% Free O2 left in the flue gas is the excess O2
O2_excess = O2_vol;

% Theoretical O2 = O2 supplied - excess O2
O2_theoretical = O2_supplied_air - O2_excess;

% O2 consumed burning hydrogen to water:
% C_x H_y + (x + y/4) O2 -> x CO2 + (y/2) H2O
O2_for_H = O2_theoretical - O2_required_C;
H2O_moles = 2 * O2_for_H;

% Moles of H in the hydrocarbon
H_moles = 2 * H2O_moles;

% Atomic ratio of C:H
C_H_ratio = CO2_vol / H_moles;

% Calculate % excess air
excess_air_percent = (O2_excess / O2_theoretical) * 100;

% Display results
fprintf('Atomic ratio of C:H in the hydrocarbon: %.2f\n', C_H_ratio);
fprintf('%% Excess air: %.2f%%\n', excess_air_percent);
Python
def calculate_combustion_parameters(co2_mol_frac, o2_mol_frac, n2_mol_frac):
    """
    Calculates the atomic ratio of C:H in a hydrocarbon and the %
    excess air given the Orsat analysis of the flue gas.

    Args:
        co2_mol_frac: Molar fraction of CO2 in the flue gas
        o2_mol_frac: Molar fraction of O2 in the flue gas
        n2_mol_frac: Molar fraction of N2 in the flue gas

    Returns:
        c_h_ratio: Atomic ratio of C:H in the hydrocarbon
        excess_air_percent: Percent excess air

    Raises:
        ValueError: If the provided molar fractions do not sum to 1
    """
    # Orsat analyses often sum to slightly less than 1, so use a loose tolerance
    if abs(co2_mol_frac + o2_mol_frac + n2_mol_frac - 1) > 1e-3:
        raise ValueError("Molar fractions of flue gas components must sum to 1.")

    # Assuming 100 moles of dry flue gas
    co2_moles = 100 * co2_mol_frac
    o2_moles = 100 * o2_mol_frac
    n2_moles = 100 * n2_mol_frac

    # All N2 comes from the air, which fixes the O2 supplied (air: 21% O2, 79% N2)
    o2_supplied_in_air = n2_moles * (21 / 79)

    # Free O2 in the flue gas is the excess; the rest is the theoretical O2
    o2_theoretical = o2_supplied_in_air - o2_moles
    excess_air_percent = (o2_moles / o2_theoretical) * 100

    # Carbon balance: every mole of CO2 carries one mole of C
    c_moles = co2_moles

    # O2 left for burning hydrogen, after carbon (C + O2 -> CO2)
    o2_for_h = o2_theoretical - c_moles

    # H2 + 1/2 O2 -> H2O, so each mole of O2 accounts for 4 moles of H
    h_moles = 4 * o2_for_h

    c_h_ratio = c_moles / h_moles
    return c_h_ratio, excess_air_percent


if __name__ == "__main__":
    co2_mol_frac = 0.1081
    o2_mol_frac = 0.0378
    n2_mol_frac = 0.8540
    c_h_ratio, excess_air_percent = calculate_combustion_parameters(
        co2_mol_frac, o2_mol_frac, n2_mol_frac
    )
    print(f"Atomic ratio of C:H in the hydrocarbon: {c_h_ratio:.2f}")
    print(f"% Excess air: {excess_air_percent:.2f}%")
Calculate the sphericity of a cylinder of diameter 1 cm and height 3 cm.

Matlab
% Define the dimensions of the cylinder
diameter = 1; % cm
radius = diameter / 2;
height = 3; % cm

% Calculate the volume of the cylinder
cylinder_volume = pi * radius^2 * height;

% Calculate the radius of a sphere with the same volume
sphere_radius = (3 * cylinder_volume / (4 * pi))^(1/3);

% Calculate the surface area of the cylinder
cylinder_surface_area = 2 * pi * radius * (radius + height);

% Calculate the surface area of the sphere
sphere_surface_area = 4 * pi * sphere_radius^2;

% Sphericity = surface area of the equal-volume sphere
% divided by the surface area of the particle
sphericity = sphere_surface_area / cylinder_surface_area;

% Display the result
fprintf('Sphericity of the cylinder: %.4f\n', sphericity);

Python
import math

# Define the dimensions of the cylinder
diameter = 1  # cm
radius = diameter / 2
height = 3  # cm

# Calculate the volume of the cylinder
cylinder_volume = math.pi * radius**2 * height

# Calculate the radius of a sphere with the same volume
sphere_radius = (3 * cylinder_volume / (4 * math.pi))**(1/3)

# Calculate the surface area of the cylinder
cylinder_surface_area = 2 * math.pi * radius * (radius + height)

# Calculate the surface area of the sphere
sphere_surface_area = 4 * math.pi * sphere_radius**2

# Sphericity = surface area of the equal-volume sphere
# divided by the surface area of the particle
sphericity = sphere_surface_area / cylinder_surface_area

# Display the result
print(f"Sphericity of the cylinder: {sphericity:.4f}")
• Key Pillars of Data Science
• Domain Knowledge: Many people think that domain knowledge is not important in data science, but it is essential.
• The foremost objective of data science is to extract useful insights from data so that they can be profitable to the company’s business.
• If you are not aware of the business side of the company and of how its business model works, the insights you extract will be of limited use.

• Math Skills: Linear algebra, multivariable calculus, and optimization techniques are very important, as they help us understand the various machine learning algorithms that play an important role in data science.
• Statistics & Probability: Understanding of Statistics is very significant as this is
a part of Data analysis
• Computer Science: Programming Knowledge, Relational Databases, Non-
Relational Databases, Machine Learning, Distributed Computing.

• Who is a Data Scientist?

• A data scientist is someone “who integrates the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data”.

• A Data Scientist can also be divided into different roles based on their skill
sets.

• Data Researcher
• Data Developers
• What’s in a data scientist’s toolbox?

• Data visualization: the presentation of data in a pictorial or graphical format so it can be easily analyzed.
• Machine learning: a branch of artificial intelligence based on mathematical
algorithms and automation.
• Deep learning: an area of machine learning research that uses data to model
complex abstractions.
• Pattern recognition: technology that recognizes patterns in data (often used
interchangeably with machine learning).
• Data preparation: the process of converting raw data into another format so it can
be more easily consumed.
• Text analytics: the process of examining unstructured data to glean key business
insights.
• Technical Skills for Data Scientists

• Math (e.g. linear algebra, calculus and probability)
• Statistics (e.g. hypothesis testing and summary statistics)
• Machine learning tools and techniques (e.g. k-nearest neighbors, random forests, ensemble methods, etc.)
• Software engineering skills (e.g. distributed computing, algorithms and data structures)
• Data mining
• Data cleaning and munging
• Data visualization (e.g. ggplot and d3.js) and reporting techniques
• Unstructured data techniques
• R and/or SAS (Statistical Analysis System) languages
• SQL databases and database querying languages
• Python (most common), C/C++, Java, Perl
• Big data platforms like Hadoop, Hive & Pig
• Cloud tools like Amazon S3 (Simple Storage Service)
• Types of Data:
• Structured data —
• It is typically categorized as quantitative data and is highly organized and easily decipherable by machine learning algorithms.
• The simplest example of structured data would be a .xls or .csv file where every column stands for an attribute of the data.
• Structured query language (SQL) is the programming language used to manage
structured data.

• Unstructured data:
• It is typically categorized as qualitative data and cannot be processed and analyzed via conventional data tools and methods.
• Unstructured data could be represented by a set of text files, photos, or video files.
• It is best managed in non-relational (NoSQL) databases. Another way to manage unstructured data is to use data lakes to preserve it in raw form.
• Uses: Structured data is used in machine learning (ML) and drives its algorithms,
whereas unstructured data is used in natural language processing (NLP) and text
mining.
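As a small illustration of structured data (the column names and values below are invented for the example), every column of a DataFrame is an attribute and every row is an observation, exactly the shape a .csv or .xls file would have:

import pandas as pd

# Structured data: every column is an attribute, every row an observation
df = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "weight_kg": [70.5, 82.0, 65.3],
    "gender": ["F", "M", "F"],
})

# Numeric and categorical attributes are immediately machine-readable
print(df.dtypes)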
• Data Preprocessing:
• Data preprocessing is a process of preparing the raw data and making it
suitable for a machine learning model.
• Why is Data preprocessing important?
• Preprocessing of data is mainly done to check the data quality. The quality can be checked by the following (a short sketch of two of these checks follows the list):
• 1. Accuracy: To check whether the data entered is correct or not.
• 2. Completeness: To check whether the data is available or has unrecorded values.
• 3. Consistency: To check whether the same data is kept consistently in all the places where it appears.
• 4. Timeliness: The data should be updated correctly.
• 5. Believability: The data should be trustworthy.
• 6. Interpretability: The data should be understandable.
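A minimal pandas sketch of the completeness and consistency checks (the column names and values are made up):

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "value": [10.0, None, 20.0, 30.0],
})

# Completeness: how many values are missing per column?
print(df.isnull().sum())

# Consistency: does the same key appear in more than one record?
print(df[df.duplicated(subset="id", keep=False)])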
• Data Cleaning
• It is done as part of data preprocessing to clean the data by filling in missing values, smoothing noisy data, resolving inconsistencies, and removing outliers.
• Handling Missing Values: The null values in the dataset are imputed using the mean/median or the mode, based on the type of data that is missing (a short sketch follows):

• Numerical Data: If a numerical value is missing, replace the NaN value with the mean or median. It is preferred to impute using the median, because the mean is influenced by outliers and skewness in the data and is pulled in their direction.
• Categorical Data: When categorical data is missing, replace it with the most frequently occurring value, i.e. the mode.
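A minimal pandas sketch of this imputation rule (the column names and values are invented):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "temperature": [25.0, 27.0, np.nan, 26.0, 250.0],  # numerical, with an outlier
    "category": ["A", "B", "B", None, "B"],            # categorical
})

# Numerical: impute with the median, which is robust to outliers and skew
df["temperature"] = df["temperature"].fillna(df["temperature"].median())

# Categorical: impute with the mode (the most frequently occurring value)
df["category"] = df["category"].fillna(df["category"].mode()[0])

print(df)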
The significance of EDA
• Different fields of science, economics, engineering, and marketing accumulate and store data primarily in electronic databases.
• Appropriate and well-established decisions should be made using the data collected.
• It is practically impossible to make sense of datasets containing more than a handful of data points without the help of computer programs.
• To be certain of the insights that the collected data provides and to make further decisions, data mining is performed, where we go through distinctive analysis processes.
• Exploratory data analysis is key, and usually the first exercise in data mining.
• It allows us to visualize data to understand it as well as to create hypotheses
for further analysis.
• The exploratory analysis centers around creating a synopsis of data or insights
for the next steps in a data mining project.

• Steps in EDA
• Problem definition: Before trying to extract useful insight from the data, it is
essential to define the business problem to be solved.
• The problem definition works as the driving force for a data analysis plan
execution.
• The main tasks involved in problem definition are defining the main objective
of the analysis, defining the main deliverables, outlining the main roles and
responsibilities, obtaining the current status of the data, defining the
timetable, and performing cost/benefit analysis.
• Based on such a problem definition, an execution plan can be created.

• Data preparation: This step involves methods for preparing the dataset before
actual analysis.
• In this step, we define the sources of data, define data schemas and tables,
understand the main characteristics of the data, clean the dataset, delete non-
relevant datasets, transform the data, and divide the data into required
chunks for analysis.
• Data analysis: The main tasks involve summarizing the data, finding the
hidden correlation and relationships among the data, developing predictive
models, evaluating the models, and calculating the accuracies.
• Some of the techniques used for data summarization are summary tables,
graphs, descriptive statistics, inferential statistics, correlation statistics,
searching, grouping, and mathematical models.

• Development and representation of the results: This step involves presenting the dataset to the target audience in the form of graphs, summary tables, maps, and diagrams.
Comparing EDA with classical and Bayesian analysis
• There are several approaches to data analysis.

• Classical data analysis: For the classical data analysis approach, the problem
definition and data collection step are followed by model development,
which is followed by analysis and result communication.

• Exploratory data analysis approach: The EDA approach follows the same sequence as classical data analysis, except that the model imposition and the data analysis steps are swapped.
• The main focus is on the data, its structure, outliers, models, and
visualizations. Generally, in EDA, we do not impose any deterministic or
probabilistic models on the data.
Software tools available for EDA
• There are several software tools that are available to facilitate EDA.

• Python: This is an open source programming language widely used in data analysis, data mining, and data science (https://www.python.org/). Python is the language used here.

• R programming language: R is an open source programming language that is widely utilized in statistical computation and graphical data analysis (https://www.r-project.org).

• Weka: This is an open source data mining package that involves several EDA tools and algorithms (https://www.cs.waikato.ac.nz/ml/weka/).

• KNIME: This is an open source tool for data analysis and is based on Eclipse (https://www.knime.com/).
Making sense of data
• It is crucial to identify the type of data under analysis.
• Different disciplines store different kinds of data for different purposes.
• For example, medical researchers store patients' data, universities store students' and teachers' data, and real estate industries store house and building datasets.
• A dataset contains many observations about a particular object.
• For instance, a dataset about patients in a hospital can contain many
observations.
• A patient can be described by a patient identifier (ID), name, address, weight, date of birth, email, and gender. Each of these features that describes a patient is a variable.
• Each observation can have a specific value for each of these variables.
Visual Aids for Exploratory Data Analysis

• As data scientists, two important goals in our work are to extract knowledge from the data and to present the data to stakeholders.
• Presenting results to stakeholders is complex in the sense that our audience may not have enough technical knowledge to understand programming and other technicalities. Hence, visual aids are very useful tools.
• Different types of visual aids that can be used:

• Line chart
• Bar chart
• Scatter plot
• Histogram
• Pie chart
• Box plot
What is Data Visualization?
• Data Visualization is the process of taking raw data and transforming it into graphical or pictorial representations such as charts, graphs, diagrams, pictures, and videos, which explain the data and allow you to gain insights from it.
• This lets users quickly analyze the data and prepare reports to make business decisions effectively.

• Importance of Data Visualization

• We live in a visual world, where pictures and images speak louder than words, so it is easier to take in a large amount of data through graphs and charts than through reports or spreadsheets.
• Data visualization is a quick and easy way to convey concepts to end users, and you can experiment with different scenarios by making slight changes.
• It can also:
• Clarify which elements influence customer behaviour.
• Identify the areas to which you need to pay attention.
• Guide you to understand which product should be placed in which location.
• Predict the sales volume.
• The better you visualize your points, the better you can convey the information to end users.
• Line Charts
• Line charts are among the most widely used charts for representing data and are characterized by a series of data points connected by straight line segments.
• Each point in the line corresponds to a data value in the given category. It shows the exact value of the plotted data.
• Line charts should only be used to show trends over a period of time, e.g. across dates, months, and years.
• Bar Charts
• A bar chart or bar graph is a chart that represents categorical data with
rectangular bars with heights proportional to the values that they represent.
Here one axis of the chart plots categories and the other axis represents the
value scale. The bars are of equal width which allows for instant comparison
of data.
• Scatter plot:-
• Scatter plots are commonly used in statistical analysis in order to visualize numerical relationships.
• They are used to determine whether two measures are correlated by plotting them on the x- and y-axes. They are suitable for recognizing trends.
• For instance, consider a house's area plotted against its price, with a trend line.
• The data points are concentrated in the lower price and lower area range, with a few outliers indicating larger houses available at lower prices.
• Histogram:-
• A histogram is a value distribution plot of numerical columns.
• It groups the values into bins over various ranges and plots the counts, letting us visualize how the values are distributed.
• Pie chart:-
• A pie chart is a graphical representation technique that displays data in a circular-shaped graph.
• It is a composite static chart that works best with few variables. Pie charts are often used to represent sample data with data points belonging to a combination of different categories.
• Each of these categories is represented as a “slice of the pie.” The size of each slice is directly proportional to the number of data points that belong to a particular category.
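A minimal matplotlib sketch of a pie chart (the category names and counts are invented for illustration):

import matplotlib.pyplot as plt

# Hypothetical category counts; each slice's size is proportional to its count
categories = ["Product A", "Product B", "Product C", "Other"]
counts = [45, 30, 15, 10]

plt.figure(figsize=(6, 6))
plt.pie(counts, labels=categories, autopct="%1.1f%%")
plt.title("Share of Data Points per Category")
plt.show()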
• Box Plot:-
• A box plot is a visual representation of groups of numerical data through their quartiles.
• Box plots are also used to detect outliers in a data set.
• It captures the summary of the data efficiently with a simple box and whiskers and allows us to compare easily across groups.
• A box plot summarizes sample data using the 25th, 50th, and 75th percentiles.
• A box plot consists of five components (a short sketch computing them follows the list):
• Minimum
• First Quartile or 25%
• Median (Second Quartile) or 50%
• Third Quartile or 75%
• Maximum
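A minimal NumPy sketch of this five-number summary, plus the common 1.5×IQR rule for flagging outliers (the data values are made up):

import numpy as np

# Sample data with one obvious outlier
values = np.array([12, 15, 14, 10, 18, 20, 16, 13, 45])

q1, median, q3 = np.percentile(values, [25, 50, 75])
print(f"Minimum: {values.min()}, Q1: {q1}, Median: {median}, "
      f"Q3: {q3}, Maximum: {values.max()}")

# Points beyond 1.5 * IQR from the quartiles are commonly flagged as outliers
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print("Outliers:", outliers)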
