Presentation 1: Plotting graphs and problems
Exploratory data analysis
MATLAB program
% Sample data (replace with your actual data)
time = 0:0.1:10;                  % Time vector
temperature = 20 + 10*sin(time);  % Example temperature data

% Plot different types of graphs
figure;

% Line plot
subplot(2,2,1);
plot(time, temperature);
title('Line Plot');
xlabel('Time');
ylabel('Temperature');

% Scatter plot
subplot(2,2,2);
scatter(time, temperature);
title('Scatter Plot');
xlabel('Time');
ylabel('Temperature');

% Bar plot (for discrete time points)
time_discrete = 0:1:10;                       % Example discrete time points
temperature_discrete = temperature(1:10:end); % Corresponding temperature values
subplot(2,2,3);
bar(time_discrete, temperature_discrete);
title('Bar Plot');
xlabel('Time');
ylabel('Temperature');

% Histogram (for data distribution)
subplot(2,2,4);
histogram(temperature);
title('Histogram');
xlabel('Temperature');
ylabel('Frequency');

% Add more plot types as needed:
% - Area plot: area(time, temperature)
% - Stem plot: stem(time, temperature)
% - Step plot: stairs(time, temperature)
% - Pie chart (if applicable): pie(temperature_discrete)

% Customize plots further:
% - Change colors, line styles, markers
% - Add legends, grid lines, annotations
% - Adjust axis limits
Python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Sample process data (replace with your actual data)
data = {'Time': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'Temperature': [25, 27, 28, 26, 29, 30, 28, 27, 26, 25],
        'Pressure': [101, 102, 100, 101, 103, 104, 102, 101, 100, 99],
        'Flow Rate': [50, 52, 48, 51, 53, 55, 52, 50, 49, 48]}
df = pd.DataFrame(data)

# Line plot
plt.figure(figsize=(10, 6))
plt.plot(df['Time'], df['Temperature'], label='Temperature')
plt.plot(df['Time'], df['Pressure'], label='Pressure')
plt.plot(df['Time'], df['Flow Rate'], label='Flow Rate')
plt.xlabel('Time')
plt.ylabel('Value')
plt.title('Process Variables Over Time')
plt.legend()
plt.show()

# Bar plot
plt.figure(figsize=(8, 6))
plt.bar(df['Time'], df['Temperature'], label='Temperature')
plt.xlabel('Time')
plt.ylabel('Temperature')
plt.title('Temperature at Different Times')
plt.legend()
plt.show()

# Histogram
plt.figure(figsize=(8, 6))
plt.hist(df['Temperature'], bins=10, color='blue', alpha=0.7)
plt.xlabel('Temperature')
plt.ylabel('Frequency')
plt.title('Temperature Distribution')
plt.show()

# Scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(df['Temperature'], df['Pressure'], c=df['Flow Rate'], cmap='viridis')
plt.xlabel('Temperature')
plt.ylabel('Pressure')
plt.title('Temperature vs. Pressure')
plt.colorbar(label='Flow Rate')
plt.show()

# Box plot
plt.figure(figsize=(8, 6))
sns.boxplot(x='variable', y='value',
            data=pd.melt(df, id_vars=['Time'], var_name='variable', value_name='value'))
plt.xlabel('Variable')
plt.ylabel('Value')
plt.title('Box Plot of Process Variables')
plt.show()
R program

# Sample data (replace with your actual data)
# Assuming 'time' is a time series and 'value' is the corresponding measurement
time <- seq(as.POSIXct("2024-01-01"), as.POSIXct("2024-01-10"), by = "hour")
value <- rnorm(length(time), mean = 100, sd = 5)

# Create a data frame
data <- data.frame(time, value)

# 1. Time Series Plot
plot(data$time, data$value, type = "l", xlab = "Time", ylab = "Value",
     main = "Time Series Plot")

# 2. Histogram
hist(data$value, main = "Histogram of Values", xlab = "Value")

# 3. Boxplot
boxplot(data$value, main = "Boxplot of Values")

# 4. Scatterplot (if you have another variable)
# Assuming 'another_variable' is available
# plot(data$another_variable, data$value, main = "Scatterplot")

# 5. Autocorrelation Plot (for time series analysis)
acf(data$value, main = "Autocorrelation Plot")

# 6. Density Plot
plot(density(data$value), main = "Density Plot", xlab = "Value")

# Customize plots further:
# - Add colors, lines, points, and annotations using functions like
#   `lines()`, `points()`, `text()`, `abline()`.
# - Adjust plot parameters using the `par()` function (e.g.,
#   `par(mfrow = c(2, 2))` for multiple plots in one window).
# - Use the `ggplot2` package for more advanced and customizable visualizations.

# Example using ggplot2 (install ggplot2 first: install.packages("ggplot2"))
library(ggplot2)

# Time Series Plot with ggplot2
ggplot(data, aes(x = time, y = value)) +
  geom_line() +
  labs(x = "Time", y = "Value", title = "Time Series Plot (ggplot2)")

# Histogram with ggplot2
ggplot(data, aes(x = value)) +
  geom_histogram(binwidth = 1, fill = "lightblue", color = "black") +
  labs(x = "Value", y = "Frequency", title = "Histogram of Values (ggplot2)")
For a certain chemical reaction A → R, the following data were obtained:

Temperature, °C       100        110        120        130        140        150
Rate constant, s^-1   1.055e-16  1.070e-15  9.25e-15   6.94e-14   4.58e-13   3.19e-12

Find the activation energy and frequency factor for this reaction.
% Data
T = [100 110 120 130 140 150] + 273.15; % Temperature in Kelvin
k = [1.055e-16 1.070e-15 9.25e-15 6.94e-14 4.58e-13 3.19e-12]; % Rate constants
% Linearize the Arrhenius equation: ln(k) = ln(A) - Ea/(R*T)
ln_k = log(k);
% Perform linear regression
p = polyfit(1./T, ln_k, 1);
% Extract slope and intercept
slope = p(1);
intercept = p(2);
% Calculate activation energy (Ea)
Ea = -slope * 8.314; % universal gas constant R = 8.314 J/(mol*K)
% Calculate frequency factor (A)
A = exp(intercept);
% Display results
fprintf('Activation Energy (Ea): %.2f kJ/mol\n', Ea/1000);
fprintf('Frequency Factor (A): %.2e s^-1\n', A);
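As a cross-check, the same linearized Arrhenius fit can be sketched in Python with `numpy.polyfit`. This is an illustrative sketch, not part of the original program; the data are those given in the problem above.

```python
import numpy as np

# Data from the problem above
T = np.array([100, 110, 120, 130, 140, 150]) + 273.15  # temperature, K
k = np.array([1.055e-16, 1.070e-15, 9.25e-15,
              6.94e-14, 4.58e-13, 3.19e-12])           # rate constants, s^-1

# Linearized Arrhenius equation: ln(k) = ln(A) - Ea/(R*T)
# Fit ln(k) against 1/T; slope = -Ea/R, intercept = ln(A)
slope, intercept = np.polyfit(1.0 / T, np.log(k), 1)

R = 8.314              # universal gas constant, J/(mol*K)
Ea = -slope * R        # activation energy, J/mol
A = np.exp(intercept)  # frequency factor, s^-1

print(f"Activation Energy (Ea): {Ea / 1000:.1f} kJ/mol")
print(f"Frequency Factor (A): {A:.2e} s^-1")
```

The fitted slope is the same quantity extracted by `polyfit` in the MATLAB version, so both programs should report essentially identical values.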
A hydrocarbon is burnt with excess air. The Orsat analysis of the flue gas shows
10.81% CO2, 3.78% O2, and 85.40% N2. Calculate the atomic ratio of C:H in the
hydrocarbon and the % excess air.
% Given Orsat analysis of flue gas (basis: 100 mol dry flue gas)
CO2_moles = 10.81;
O2_moles = 3.78;
N2_moles = 85.40;

% Assuming dry air composition (21% O2 and 79% N2),
% all N2 in the flue gas comes from the air
O2_supplied_air = N2_moles * (21 / 79);

% Moles of O2 required for complete combustion of carbon
% (1 mol O2 per mol CO2 formed)
O2_required_C = CO2_moles;

% O2 consumed burning hydrogen = supplied - excess (in flue gas) - used for carbon
O2_consumed_H = O2_supplied_air - O2_moles - O2_required_C;

% Calculate moles of H2O formed (each mol of O2 gives 2 mol of H2O)
H2O_moles = 2 * O2_consumed_H;

% Moles of H in the hydrocarbon (2 H atoms per H2O)
H_moles = 2 * H2O_moles;

% Determine the atomic ratio of C:H in the hydrocarbon
C_H_ratio = CO2_moles / H_moles;

% Calculate % excess air = excess O2 / theoretical O2 requirement
O2_theoretical = O2_supplied_air - O2_moles;
excess_air_percent = (O2_moles / O2_theoretical) * 100;

fprintf('Atomic ratio of C:H in the hydrocarbon: %.3f (approximately 1:3)\n', C_H_ratio);
fprintf('Excess air: %.2f%%\n', excess_air_percent);
Python
def calculate_combustion_parameters(co2_mol_frac, o2_mol_frac, n2_mol_frac):
    """
    Calculates the atomic ratio of C:H in a hydrocarbon and the %
    excess air, given the Orsat analysis of the flue gas.

    Args:
        co2_mol_frac: Molar fraction of CO2 in the flue gas
        o2_mol_frac: Molar fraction of O2 in the flue gas
        n2_mol_frac: Molar fraction of N2 in the flue gas

    Returns:
        c_h_ratio: Atomic ratio of C:H in the hydrocarbon
        excess_air_percent: Percent excess air

    Raises:
        ValueError: If the provided molar fractions do not sum to 1
    """
    # Tolerance of 1e-3 allows rounding in the reported analysis
    # (0.1081 + 0.0378 + 0.8540 = 0.9999)
    if abs(co2_mol_frac + o2_mol_frac + n2_mol_frac - 1) > 1e-3:
        raise ValueError("Molar fractions of flue gas components must sum to 1.")

    # Assuming 100 moles of dry flue gas
    co2_moles = 100 * co2_mol_frac
    o2_moles = 100 * o2_mol_frac
    n2_moles = 100 * n2_mol_frac

    # All N2 comes from the air (21% O2, 79% N2)
    o2_supplied_in_air = n2_moles * (21 / 79)

    # O2 used to burn carbon (1 mol O2 per mol CO2 formed)
    o2_for_carbon = co2_moles

    # O2 used to burn hydrogen = supplied - excess (in flue gas) - used for carbon
    o2_for_hydrogen = o2_supplied_in_air - o2_moles - o2_for_carbon

    # Each mole of O2 burns 4 moles of atomic H into 2 moles of H2O
    h_moles = 4 * o2_for_hydrogen

    # Atomic ratio of C:H in the hydrocarbon
    c_moles = co2_moles
    c_h_ratio = c_moles / h_moles

    # % excess air = excess O2 / theoretical O2 requirement
    o2_theoretical = o2_supplied_in_air - o2_moles
    excess_air_percent = (o2_moles / o2_theoretical) * 100

    return c_h_ratio, excess_air_percent


if __name__ == "__main__":
    co2_mol_frac = 0.1081
    o2_mol_frac = 0.0378
    n2_mol_frac = 0.8540
    c_h_ratio, excess_air_percent = calculate_combustion_parameters(
        co2_mol_frac, o2_mol_frac, n2_mol_frac
    )
    print(f"Atomic ratio of C:H in the hydrocarbon: {c_h_ratio:.3f}")
    print(f"Excess air: {excess_air_percent:.2f}%")
Calculate the sphericity of a cylinder of dia 1 cm and height 3 cm.
MATLAB

% Define the dimensions of the cylinder
diameter = 1;          % cm
radius = diameter / 2;
height = 3;            % cm

% Calculate the volume of the cylinder
cylinder_volume = pi * radius^2 * height;

% Calculate the radius of a sphere with the same volume
sphere_radius = (3 * cylinder_volume / (4 * pi))^(1/3);

% Calculate the surface area of the cylinder
cylinder_surface_area = 2 * pi * radius * (radius + height);

% Calculate the surface area of the sphere
sphere_surface_area = 4 * pi * sphere_radius^2;

% Sphericity = surface area of the equal-volume sphere
% divided by the surface area of the particle
sphericity = sphere_surface_area / cylinder_surface_area;

% Display the result
fprintf('Sphericity of the cylinder: %.4f\n', sphericity);

Python

import math

# Define the dimensions of the cylinder
diameter = 1  # cm
radius = diameter / 2
height = 3  # cm

# Calculate the volume of the cylinder
cylinder_volume = math.pi * radius**2 * height

# Calculate the radius of a sphere with the same volume
sphere_radius = (3 * cylinder_volume / (4 * math.pi))**(1/3)

# Calculate the surface area of the cylinder
cylinder_surface_area = 2 * math.pi * radius * (radius + height)

# Calculate the surface area of the sphere
sphere_surface_area = 4 * math.pi * sphere_radius**2

# Sphericity = surface area of the equal-volume sphere
# divided by the surface area of the particle
sphericity = sphere_surface_area / cylinder_surface_area

# Display the result
print(f"Sphericity of the cylinder: {sphericity:.4f}")
• Key Pillars of Data Science
• Domain knowledge: Many people think that domain knowledge is not
important in data science, but it is essential.
• The foremost objective of data science is to extract useful insights from
data so that they can be profitable to the company's business.
• If you are not aware of the business side of the company and how its
business model works, it is difficult to frame the right questions or judge
whether an insight is useful.
• Data scientists can also be divided into different roles based on their skill
sets:
• Data Researcher
• Data Developers
• What’s in a data scientist’s toolbox?
• Unstructured data:
• It is typically categorized as qualitative data and cannot be processed and
analyzed with conventional data tools and methods.
• Unstructured data could be represented by a set of text files, photos, or video files.
• It is best managed in non-relational (NoSQL) databases. Another way to manage
unstructured data is to use data lakes to preserve it in raw form.
• Uses: Structured data is used in machine learning (ML) and drives its algorithms,
whereas unstructured data is used in natural language processing (NLP) and text
mining.
• Data Preprocessing:
• Data preprocessing is a process of preparing the raw data and making it
suitable for a machine learning model.
• Why is Data preprocessing important?
• Preprocessing of data is mainly to check the data quality. The quality can
be checked by the following:
• 1. Accuracy: To check whether the data entered is correct or not.
• 2. Completeness: To check whether the data is available or not recorded.
• 3. Consistency: To check whether the same data is kept consistently in all
the places where it appears.
• 4. Timeliness: The data should be updated correctly.
• 5. Believability: The data should be trustworthy.
• 6. Interpretability: The understandability of the data.
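As a small illustration of the completeness check above, missing entries can be counted with pandas. This is a sketch with made-up column names and values, not data from the original.

```python
import pandas as pd
import numpy as np

# Made-up process data with some gaps
df = pd.DataFrame({
    'Temperature': [25.0, np.nan, 28.0, 26.0],
    'Pressure':    [101.0, 102.0, np.nan, np.nan],
})

# Completeness: count missing entries per column
missing = df.isna().sum()

# Flag columns with more than 25% of their entries missing
incomplete = missing[missing / len(df) > 0.25]
print(list(incomplete.index))
```

Similar one-liners (range checks for accuracy, duplicate checks for consistency) can cover the other quality dimensions.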
• Data Cleaning
• It is particularly done as part of data preprocessing to clean the data by filling
missing values, smoothing the noisy data, resolving the inconsistency, and
removing outliers.
• Handling Missing Values: The null values in the dataset are imputed using mean/median
or mode based on the type of data that is missing:
• Numerical data: If a numerical value is missing, replace the NaN value with the
mean or median. The median is preferred because the mean is influenced by
outliers and skewness present in the data and is pulled in their direction.
• Categorical Data: When categorical data is missing, replace that with the value which is
most occurring i.e. by mode.
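The imputation rules above can be sketched with pandas; the column names and values here are illustrative, not from the original.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Temperature': [25.0, 27.0, np.nan, 26.0, 29.0],  # numerical column
    'Shift': ['day', 'night', None, 'day', 'day'],    # categorical column
})

# Numerical data: impute with the median (robust to outliers and skew)
df['Temperature'] = df['Temperature'].fillna(df['Temperature'].median())

# Categorical data: impute with the mode (most frequent value)
df['Shift'] = df['Shift'].fillna(df['Shift'].mode()[0])

print(df)
```

Here the missing temperature becomes the median of the observed values (26.5) and the missing shift becomes the most frequent category ('day').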
The significance of EDA
• Different fields of science, economics, engineering, and marketing accumulate
and store data primarily in electronic databases.
• To be certain of the insights that the collected data provides and to make
further decisions, data mining is performed where we go through distinctive
analysis processes.
• Exploratory data analysis is key, and usually the first exercise in data mining.
• It allows us to visualize data to understand it as well as to create hypotheses
for further analysis.
• The exploratory analysis centers around creating a synopsis of data or insights
for the next steps in a data mining project.
• Steps in EDA
• Problem definition: Before trying to extract useful insight from the data, it is
essential to define the business problem to be solved.
• The problem definition works as the driving force for a data analysis plan
execution.
• The main tasks involved in problem definition are defining the main objective
of the analysis, defining the main deliverables, outlining the main roles and
responsibilities, obtaining the current status of the data, defining the
timetable, and performing cost/benefit analysis.
• Based on such a problem definition, an execution plan can be created.
• Data preparation: This step involves methods for preparing the dataset before
actual analysis.
• In this step, we define the sources of data, define data schemas and tables,
understand the main characteristics of the data, clean the dataset, delete non-
relevant datasets, transform the data, and divide the data into required
chunks for analysis.
• Data analysis: The main tasks involve summarizing the data, finding the
hidden correlation and relationships among the data, developing predictive
models, evaluating the models, and calculating the accuracies.
• Some of the techniques used for data summarization are summary tables,
graphs, descriptive statistics, inferential statistics, correlation statistics,
searching, grouping, and mathematical models.
• Classical data analysis: For the classical data analysis approach, the problem
definition and data collection step are followed by model development,
which is followed by analysis and result communication.
• Exploratory data analysis approach: For the EDA approach, it follows the
same approach as classical data analysis except the model imposition and
the data analysis steps are swapped.
• The main focus is on the data, its structure, outliers, models, and
visualizations. Generally, in EDA, we do not impose any deterministic or
probabilistic models on the data.
Software tools available for EDA
• There are several software tools that are available to facilitate EDA.
• Python: This is an open source programming language widely used in data analysis, data
mining, and data science (https://www.python.org/). For this book, we will be using
Python.
• Weka: This is an open source data mining package that involves several EDA tools and
algorithms (https://www.cs.waikato.ac.nz/ml/weka/).
• KNIME: This is an open source tool for data analysis and is based on Eclipse
(https://www.knime.com/).
Making sense of data
• It is crucial to identify the type of data under analysis.
• Different disciplines store different kinds of data for different purposes.
• For example, medical researchers store patients' data, universities store
students' and teachers' data, and the real estate industry stores house and
building datasets.
• A dataset contains many observations about a particular object.
• For instance, a dataset about patients in a hospital can contain many
observations.
• A patient can be described by a patient identifier (ID), name, address, weight,
date of birth, email, and gender. Each of these features that
describes a patient is a variable.
• Each observation can have a specific value for each of these variables. For
example, a patient can have the following:
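A small sketch of such a dataset in pandas, with invented values: each row is one observation (a patient), and each column is one variable.

```python
import pandas as pd

# Hypothetical patient records (all values invented for illustration)
patients = pd.DataFrame({
    'patient_id':    [101, 102, 103],
    'name':          ['A. Kumar', 'B. Singh', 'C. Rao'],
    'weight_kg':     [68.5, 72.0, 59.3],
    'date_of_birth': ['1990-05-01', '1985-11-23', '1978-02-14'],
    'gender':        ['M', 'F', 'F'],
})

# 3 observations (rows), 5 variables (columns)
print(patients.shape)
```

Each observation then has a specific value for each variable, e.g. patient 101 has weight 68.5 kg and gender 'M'.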
Visual Aids for Exploratory Data Analysis
• Line chart
• Bar chart
• Scatter plot
• Histogram
• Pie chart
• Box plot
What is Data Visualization?
• Data Visualization is a process of taking raw data and transforming it into
graphical or pictorial representations such as charts, graphs, diagrams,
pictures, and videos which explain the data and allow you to gain insights
from it.
• So, users can quickly analyze the data and prepare reports to make business
decisions effectively.