Data Science Solutions with Python: Fast and Scalable Models Using Keras, PySpark MLlib, H2O, XGBoost, and Scikit-Learn
Tshepo Chris Nokeri
Apress Standard
The publisher, the authors and the editors are safe to assume that the
advice and information in this book are believed to be true and accurate
at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the
material contained herein or for any errors or omissions that may have
been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
Linear regression method: Applied when there is one dependent feature (a continuous feature) and an independent feature (a continuous or categorical feature). The main linear regression methods include GLM, Ridge, Lasso, and Elastic Net.
Survival regression method: Applied to time-to-event censored data, where the dependent feature is categorical and the independent feature is continuous.
Time series analysis method: Applied to uncover patterns in sequential data and forecast future instances. Principal time series models include the ARIMA, SARIMA, and additive models.
Centroid clustering: Applied to determine the center of the data and draw data points toward that center. The main centroid clustering method is the k-means method.
Density clustering: Applied to determine where the data is concentrated. The main density clustering method is the DBSCAN method.
Distribution clustering: Identifies the probability of data points belonging to a cluster based on some distribution. The main distribution clustering method is the Gaussian mixture method.
Factor analysis: Applied to determine the extent to which underlying factors elucidate related changes of features in the data.
Principal component analysis: Applied to determine the extent to which principal components elucidate related changes of features in the data.
Conclusion
This chapter covered two ways in which machines learn—via
supervised and unsupervised learning. It began by explaining
supervised machine learning and discussing the three types of
supervised learning methods and their applications. It then covered
unsupervised learning techniques, dimension reduction, and cluster
analysis.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
T. C. Nokeri, Data Science Solutions with Python
https://doi.org/10.1007/978-1-4842-7762-1_2
This chapter carefully presents the big data framework used for parallel data processing
called Apache Spark. It also covers several machine learning (ML) and deep learning (DL)
frameworks useful for building scalable applications. After reading this chapter, you will
understand how big data is collected, manipulated, and examined using resilient and fault-
tolerant technologies. It discusses the Scikit-Learn, Spark MLlib, and XGBoost frameworks. It
also covers a deep learning framework called Keras. It concludes by discussing effective
ways of setting up and managing these frameworks.
Big data frameworks support parallel data processing. They enable you to distribute big
data across many clusters. The most popular big data framework is Apache Spark, which is
built on the Hadoop framework.
Big Data
Big data means different things to different people. In this book, we define big data as large
amounts of data that we cannot adequately handle and manipulate using classic methods.
We must undoubtedly use scalable frameworks and modern technologies to process and
draw insight from this data. We typically consider data “big” when it cannot fit within the
current in-memory storage space. For instance, if you have a personal computer and the
data at your disposal exceeds your computer’s storage capacity, it’s big data. This equally
applies to large corporations with large clusters of storage space. We often speak about big
data when we use a stack with Hadoop/Spark.
Velocity: Modern technologies and improved connectivity enable you to generate data at an unprecedented speed. Characteristics of velocity include batch data, near- or real-time data, and streams.
Volume: The scale at which data increases. The nature of data sources and infrastructure influences the volume of data. Characteristics of volume include exabytes, zettabytes, etc.
Variety: Data can come from unique sources. Modern technological devices leave digital footprints here and there, which increase the number of sources from which businesses and people can get data. Characteristics of variety include the structure and complexity of the data.
Veracity: Data must come from reliable sources. It must also be of high quality, consistent, and complete.
Improved Decision-Making
When a business has big data, it can use it to uncover complex patterns of a phenomenon to
influence strategy. This approach helps management make well-informed decisions based
on evidence, rather than on subjective reasoning. Data-driven organizations foster a culture
of evidence-based management.
We also use big data in fields like life sciences, physics, economics, and medicine. There
are many ways in which big data affects the world. This chapter does not consider all
factors. The next sections explain big data warehousing and ETL activities.
Extract: Involves getting data from a database.
Transform: Involves converting data from a database into a suitable format for analysis and reporting.
Load: Involves warehousing data in a database management system.
To perform ETL activities, you must use a query language. The most popular query
language is SQL (Structured Query Language). Other query languages emerged with the
open source movement, such as HiveQL and BigQuery. The Python programming language
supports SQL. Python can connect to databases through libraries such as SQLAlchemy,
pyodbc, sqlite3, SparkSQL, and pandas, among others.
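For example, a minimal sketch of querying a database into a pandas dataframe with SQLAlchemy might look like the following; the connection string, table name, and column names are placeholders.
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; replace with your own database credentials.
engine = create_engine("sqlite:///warehouse.db")
# Hypothetical table and columns; the query extracts data for analysis.
df = pd.read_sql("SELECT customer_id, purchase_amount FROM transactions", engine)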
Apache Spark
Apache Spark executes in-memory cluster computing. It enables developers to build
scalable applications using Java, Scala, Python, R, and SQL. It includes cluster components
like the driver, the cluster manager, and executors. You can run it with its standalone cluster manager
or on top of Mesos, Hadoop YARN, or Kubernetes. You can use it to access data in the Hadoop
Distributed File System (HDFS), Cassandra, HBase, and Hive, among other data sources. Spark's core data
structure is the resilient distributed data set (RDD). This book introduces a framework
that integrates both Python and Apache Spark (PySpark). The book uses it to operate Spark
MLlib. To understand this framework, you first need to grasp the idea behind resilient
distributed data sets.
Spark Configuration
Areas of Spark configuration include Spark properties, environment variables, and logging.
The default configuration directory is SPARK_HOME/conf.
You can install the findspark library in your environment using pip install
findspark and install the pyspark library using pip install pyspark.
Listing 2-1 prepares the PySpark framework using the findspark framework.
import findspark as initiate_pyspark
initiate_pyspark.init(r"filepath\spark-3.0.0-bin-hadoop2.7")
Listing 2-1 Prepare the PySpark Framework
Listing 2-2 stipulates the PySpark app using the SparkConf() method.
Listing 2-3 prepares the PySpark session with the SparkSession() method.
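A minimal sketch of these two steps, assuming a local master and an arbitrary application name, might look like this:
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Configure the app name and master; "local[*]" uses all local cores (assumption).
pyspark_configuration = SparkConf().setAppName("pyspark_app").setMaster("local[*]")
# Create (or reuse) a session from the configuration.
pyspark_session = SparkSession.builder.config(conf=pyspark_configuration).getOrCreate()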
Spark Frameworks
Spark frameworks extend the core of the Spark API. There are four main Spark frameworks
—SparkSQL, Spark Streaming, Spark MLlib, and GraphX.
SparkSQL
SparkSQL enables you to use relational query languages like SQL, HiveQL, and Scala. It
includes a schemaRDD that has row objects and schema. You create it using an existing
RDD, parquet file, or JSON data set. You execute the Spark Context to create a SQL context.
Spark Streaming
Spark Streaming is a scalable streaming framework that supports Apache Kafka, Apache
Flume, HDFS, and Amazon Kinesis, among others. It processes input data in small batches using
DStreams and pushes the results to HDFS, databases, and dashboards. Recent versions of Python do not
support Spark Streaming, so this book does not cover the framework. You can use a
Spark Streaming application to read input from any data source and store a copy of the data
in HDFS. This allows you to build and launch a Spark Streaming application that processes
incoming data and runs an algorithm on it.
Spark MLlib
MLlib is an ML framework that allows you to develop and test ML and DL models. In Python,
the framework works hand in hand with the NumPy framework. Spark MLlib can be used
with several Hadoop data sources and incorporated alongside Hadoop workflows. Common
algorithms include regression, classification, clustering, collaborative filtering, and
dimension reduction. Key workflow utilities include feature transformation, standardization
and normalization, pipeline development, model evaluation, and hyperparameter
optimization.
GraphX
GraphX is a scalable and fault-tolerant framework for iterative and fast graph parallel
computing, social networks, and language modeling. It includes graph algorithms such as
PageRank for estimating the importance of each vertex in a graph, Connected Components
for labeling connected components of the graph with the ID of its lowest-numbered vertex,
and Triangle Counting for finding the number of triangles that pass through each vertex.
ML Frameworks
To solve ML problems, you need a framework that supports building and scaling ML
models. There is no shortage of such frameworks. Subsequent chapters cover frameworks
like Scikit-Learn, Spark MLlib, H2O, and XGBoost.
Scikit-Learn
The Scikit-Learn framework includes ML algorithms like regression, classification, and
clustering, among others. You can use it with other frameworks such as NumPy and SciPy. It
can perform most of the tasks required for ML projects like data processing, transformation,
data splitting, normalization, hyperparameter optimization, model development, and
evaluation. Scikit-Learn is included in most Python distributions. Use
pip install scikit-learn to install it in your Python environment.
H2O
H2O is an ML framework that uses a driverless technology. It enables you to accelerate the
adoption of AI solutions. It is very easy to use, and it does not require any technical
expertise. Not only that, but it supports numerical and categorical data, including text.
Before you train the ML model, you must first load the data into the H2O cluster. It supports
CSV, Excel, and Parquet files. Default data sources include local file systems, remote files,
Amazon S3, HDFS, etc. It has ML algorithms like regression, classification, cluster analysis,
and dimension reduction. It can also perform most tasks required for ML projects like data
processing, transformation, data splitting, normalization, hyperparameter optimization,
model development, checkpointing, evaluation, and productionizing. Use pip install
h2o to install the package in your environment.
Listing 2-4 prepares the H2O framework.
import h2o
h2o.init()
Listing 2-4 Initializing the H2O Framework
XGBoost
XGBoost is an ML framework that supports several programming languages, including Python. It
implements scalable gradient-boosted models and trains quickly through parallel and distributed
computing without sacrificing memory efficiency. It is also an ensemble
learner; as mentioned in the previous chapter, ensemble learners can solve both regression and
classification problems. XGBoost uses boosting to learn from the errors committed by the
preceding trees, which is useful when tree-based models overfit. Use pip install
xgboost to install the library in your Python environment.
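As a brief illustration, a hedged sketch of fitting a gradient-boosted regressor with XGBoost might look like the following; the training arrays and hyperparameter values are placeholders.
from xgboost import XGBRegressor

# Hypothetical hyperparameters; tune them for your own data.
xgboost_model = XGBRegressor(n_estimators=100, max_depth=3, learning_rate=0.1)
xgboost_model.fit(x_train, y_train)      # x_train and y_train are placeholder training arrays
xgboost_yhat = xgboost_model.predict(x_test)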
DL Frameworks
DL frameworks provide a structure that supports scaling artificial neural networks. You can
use them standalone or alongside other models. They typically include programs and code
libraries. Primary DL frameworks include TensorFlow, PyTorch, Deeplearning4j,
Microsoft Cognitive Toolkit (CNTK), and Keras.
Keras
Keras is a high-level DL framework written using Python; it runs on top of an ML platform
known as TensorFlow. It is effective for rapid prototyping of DL models. You can run Keras
on Tensor Processing Units (TPUs) or on massive Graphics Processing Units (GPUs). The main Keras APIs
include models, layers, and callbacks. Chapter 7 covers this framework. Execute pip
install keras and pip install tensorflow to use the Keras framework.
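As a brief illustration, a minimal sketch of the models and layers APIs might look like the following; the layer sizes and input shape are placeholders.
from tensorflow import keras
from tensorflow.keras import layers

# A small sequential network; the input shape and layer widths are assumptions.
keras_model = keras.Sequential([
    layers.Dense(16, activation="relu", input_shape=(20,)),
    layers.Dense(1, activation="sigmoid")
])
keras_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])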
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
T. C. Nokeri, Data Science Solutions with Python
https://doi.org/10.1007/978-1-4842-7762-1_3
This introductory chapter explains the ordinary least-squares method and executes it with the main Python
frameworks (i.e., Scikit-Learn, Spark MLlib, and H2O). It begins by explaining the underlying concept behind
the method.
import pandas as pd
df = pd.read_csv(r"filepath\WA_Fn-UseC_-
Marketing_Customer_Value_Analysis.csv")
Listing 3-1 Attain the Data
Listing 3-2 stipulates the names of columns to drop and then executes the drop() method. It then
stipulates axes as columns in order to drop the unnecessary columns in the data.
Listing 3-3 attains the dummy values for the categorical features in this data.
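A minimal sketch of these two steps might look like the following; the column names in the drop list are placeholders.
# Hypothetical column names; substitute the identifiers you do not need for modeling.
columns_to_drop = ["Customer", "Effective To Date"]
initial_data = df.drop(columns_to_drop, axis="columns")
# Encode the categorical features as dummy (indicator) variables.
initial_data = pd.get_dummies(initial_data)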
import numpy as np
int_x = initial_data.iloc[::, 0:19]    # first 19 columns
fin_x = initial_data.iloc[::, 19:21]   # remaining columns
x_combined = pd.concat([int_x, fin_x], axis=1)   # combined independent features
x = np.array(x_combined)
y = np.array(initial_data.iloc[::, 19])          # dependent feature (column 19)
Listing 3-4 Outline the Features
Scikit-Learn in Action
Listing 3-5 randomly divides the dataframe.
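A minimal sketch of the split, together with the standardization step that produces the scaled arrays used in the listings that follow, might look like this; the split ratio and random state are assumptions.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hold out 20% of the data for testing (assumption).
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# Standardize the features; fit on the training set only.
sk_standard_scaler = StandardScaler()
sk_standard_scaled_x_train = sk_standard_scaler.fit_transform(x_train)
sk_standard_scaled_x_test = sk_standard_scaler.transform(x_test)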
Listing 3-8 determines the best hyperparameters for the Scikit-Learn ordinary least-squares regression
method.
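Ordinary least-squares regression exposes few hyperparameters, so a hedged sketch of such a search might tune only the intercept setting with GridSearchCV; the parameter grid and scoring choice are assumptions.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical parameter grid for ordinary least squares.
sk_parameter_grid = {"fit_intercept": [True, False]}
sk_grid_search = GridSearchCV(LinearRegression(), sk_parameter_grid, cv=5, scoring="r2")
sk_grid_search.fit(sk_standard_scaled_x_train, y_train)
# Refit the model with the best parameters found.
sk_linear_model = LinearRegression(**sk_grid_search.best_params_)
sk_linear_model.fit(sk_standard_scaled_x_train, y_train)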
print(sk_linear_model.intercept_)
433.0646521131769
Listing 3-10 Compute the Scikit-Learn Ordinary Least-Squares Regression Method’s Intercept
Listing 3-11 computes the Scikit-Learn ordinary least-squares regression method’s coefficients.
print(sk_linear_model.coef_)
[-6.15076155e-15 2.49798076e-13 -1.95573220e-14 -1.90089677e-14
-5.87187344e-14 2.50923806e-14 -1.05879478e-13 1.53591400e-14
-1.82507711e-13 -7.86327034e-14 4.17629484e-13 1.28923537e-14
6.52911311e-14 -5.28069778e-14 -1.57900159e-14 -6.74040176e-14
-9.28427833e-14 5.03132848e-14 -8.75978166e-15 2.90235705e+02
-9.55950515e-14]
Listing 3-11 Compute the Scikit-Learn Ordinary Least-Squares Regression Method’s Coefficients
sk_yhat = sk_linear_model.predict(sk_standard_scaled_x_test)
Listing 3-12 Compute the Scikit-Learn Ordinary Least-Squares Regression Method’s Predictions
Listing 3-13 assesses the Scikit-Learn ordinary least-squares method (see Table 3-1).
Table 3-1 shows that the Scikit-Learn ordinary least-squares method explains the entire variability.
PySpark in Action
This section executes and assesses the ordinary least-squares method with the PySpark framework. Listing
3-14 prepares the PySpark framework with the findspark framework.
Listing 3-16 prepares the PySpark session with the SparkSession() method.
Listing 3-17 changes the pandas dataframe created earlier in this chapter to a PySpark dataframe using
the createDataFrame() method.
pyspark_initial_data = pyspark_session.createDataFrame(initial_data)
Listing 3-17 Change Pandas Dataframe to a PySpark Dataframe
Listing 3-18 creates a list for independent features and a string for the dependent feature. It converts
data using the VectorAssembler() method for modeling with the PySpark framework.
x_list = list(x_combined.columns)
y_list = initial_data.columns[19]
from pyspark.ml.feature import VectorAssembler
pyspark_data_columns = x_list
pyspark_vector_assembler = VectorAssembler(inputCols=pyspark_data_columns, outputCol="variables")
pyspark_data = pyspark_vector_assembler.transform(pyspark_initial_data)
Listing 3-18 Transform the Data
(pyspark_training_data, pyspark_test_data) = pyspark_data.randomSplit([.8, .2])
Listing 3-19 Divide the Dataframe
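Before predictions can be computed, the linear model must be fitted to the training data. A minimal sketch of that step, assuming the "variables" features column and the label column defined earlier, might look like this:
from pyspark.ml.regression import LinearRegression

# Fit an ordinary least-squares model on the assembled training data.
pyspark_linear_model = LinearRegression(featuresCol="variables", labelCol=y_list)
pyspark_fitted_linear_model = pyspark_linear_model.fit(pyspark_training_data)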
Listing 3-21 computes the PySpark ordinary least-squares regression method’s predictions.
pyspark_yhat = pyspark_fitted_linear_model.transform(pyspark_test_data)
Listing 3-21 Compute the PySpark Ordinary Least-Squares Regression Method’s Predictions
pyspark_linear_model_assessment = pyspark_fitted_linear_model.summary
print("PySpark root mean squared error",
pyspark_linear_model_assessment.rootMeanSquaredError)
print("PySpark determinant coefficient", pyspark_linear_model_assessment.r2)
H2O in Action
This section executes and assesses the ordinary least-squares method with the H2O framework.
Listing 3-23 prepares the H2O framework.
h2o_data = initialize_h2o.H2OFrame(initial_data)
Listing 3-24 Change the Pandas Dataframe to an H2O Dataframe
y = y_list
x = h2o_data.col_names
x.remove(y_list)
Listing 3-25 Outline Features
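Before training, the H2O dataframe must be split and a generalized linear estimator trained on it. A minimal sketch of those steps, assuming a Gaussian family and an 80/20 split, might look like this:
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

# Split the H2O frame into training and test sets (80/20 split is an assumption).
h2o_training_data, h2o_test_data = h2o_data.split_frame(ratios=[0.8])
# Train an ordinary least-squares (Gaussian) GLM.
h2o_linear_model = H2OGeneralizedLinearEstimator(family="gaussian")
h2o_linear_model.train(x=x, y=y, training_frame=h2o_training_data, validation_frame=h2o_test_data)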
h2o_yhat = h2o_linear_model.predict(h2o_test_data)
Listing 3-28 H2O Ordinary Least-Squares Method Executed Predictions
Listing 3-29 computes the H2O ordinary least-squares method’s standardized coefficients (see Figure 3-
1).
h2o_linear_model_std_coefficients = h2o_linear_model.std_coef_plot()
h2o_linear_model_std_coefficients
Listing 3-29 H2O Ordinary Least-Squares Method’s Standardized Coefficients
Figure 3-1 H2O ordinary least-squares method’s standardized coefficients
Listing 3-30 computes the H2O ordinary least-squares method’s partial dependency (see Figure 3-2).
h2o_linear_model_dependency_plot = h2o_linear_model.partial_plot(data=h2o_data, cols=list(initial_data.columns[[0, 19]]), server=False, plot=True)
h2o_linear_model_dependency_plot
Listing 3-30 H2O Ordinary Least-Squares Method’s Partial Dependency
Figure 3-2 H2O ordinary least-squares method’s partial dependency
Listing 3-31 arranges the features that are the most important to the H2O ordinary least-squares method
in ascending order (see Figure 3-3).
h2o_linear_model_feature_importance = h2o_linear_model.varimp_plot()
h2o_linear_model_feature_importance
Listing 3-31 H2O Ordinary Least-Squares Method’s Feature Importance
Figure 3-3 H2O ordinary least-squares method’s feature importance
Listing 3-32 assesses the H2O ordinary least-squares method.
h2o_linear_model_assessment = h2o_linear_model.model_performance()
print(h2o_linear_model_assessment)
ModelMetricsRegressionGLM: glm
** Reported on train data. **
MSE: 24844.712331260016
RMSE: 157.6220553452467
MAE: 101.79904883889066
RMSLE: NaN
R^2: 0.7004468136072375
Mean Residual Deviance: 24844.712331260016
Null degrees of freedom: 7325
Residual degrees of freedom: 7304
Null deviance: 607612840.7465751
Residual deviance: 182012362.53881088
AIC: 94978.33944003603
Listing 3-32 Assess H2O Ordinary Least-Squares Method
Listing 3-33 improves the performance of the H2O ordinary least-squares method by specifying
remove_collinear_columns as True.
h2o_linear_model_collinear_removed = H2OGeneralizedLinearEstimator(family="gaussian", lambda_=0, remove_collinear_columns=True)
h2o_linear_model_collinear_removed.train(x=x, y=y, training_frame=h2o_training_data, validation_frame=h2o_test_data)
Listing 3-33 Improve the Performance of the Ordinary Least-Squares Method
MSE: 23380.71864337616
RMSE: 152.9075493341521
MAE: 102.53007935777588
RMSLE: NaN
R^2: 0.7180982143647627
Mean Residual Deviance: 23380.71864337616
Null degrees of freedom: 7325
Residual degrees of freedom: 7304
Null deviance: 607612840.7465751
Residual deviance: 171287144.78137374
AIC: 94533.40762597627
ModelMetricsRegressionGLM: glm
** Reported on validation data. **
MSE: 25795.936313899092
RMSE: 160.6111338416459
MAE: 103.18677222520363
RMSLE: NaN
R^2: 0.7310558588001701
Mean Residual Deviance: 25795.936313899092
Null degrees of freedom: 875
Residual degrees of freedom: 854
Null deviance: 84181020.04623385
Residual deviance: 22597240.210975606
AIC: 11430.364002305443
Listing 3-34 Assess the H2O Ordinary Least-Squares Method
Conclusion
This chapter executed three key machine learning frameworks (Scikit-Learn, PySpark, and H2O) to model
data and predict a continuous output feature using a linear method. It also explored ways of assessing that
method.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
T. C. Nokeri, Data Science Solutions with Python
https://doi.org/10.1007/978-1-4842-7762-1_4
This chapter describes and executes several survival analysis methods using the main Python frameworks
(i.e., Lifelines and PySpark). It begins by explaining the underlying concept behind the Cox Proportional
Hazards model. It then introduces the accelerated failure time method.
h(t \mid x) = h_0(t) \exp(\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p)    (Equation 4-1)
Listing 4-1 attains the necessary data from a Microsoft Excel file.
import pandas as pd
initial_data = pd.read_excel(r"filepath\survival_data.xlsx", index_col=[0])
Listing 4-1 Attain the Data
int(initial_data.shape[0]) * 0.8
345.6
Listing 4-2 Find the Ratio for Dividing the Data
lifeline_training_data = initial_data.loc[:346]
lifeline_test_data = initial_data.loc[346:]
Listing 4-3 Divide the Data
Lifeline in Action
This section executes and assesses the Cox Proportional Hazards method with the Lifeline framework.
Listing 4-4 executes the Lifeline Cox Proportional Hazards method.
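A minimal sketch of fitting the method with the CoxPHFitter() class might look like the following; the duration and event column names (week and arrest) are assumptions based on the data used later in the chapter.
from lifelines import CoxPHFitter

# Fit the Cox Proportional Hazards method on the training data.
# "week" (duration) and "arrest" (event) are assumed column names.
lifeline_cox_method = CoxPHFitter()
lifeline_cox_method.fit(lifeline_training_data, duration_col="week", event_col="arrest")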
Listing 4-5 computes the test statistics (see Table 4-1) and assesses the Lifeline Cox Proportional
Hazards method with a scaled Schoenfeld, which helps disclose any abnormalities in the residuals (see
Figure 4-1).
lifeline_cox_method_test_statistics_schoenfeld =
lifeline_cox_method.check_assumptions(lifeline_training_data,
show_plots=True)
lifeline_cox_method_test_statistics_schoenfeld
Listing 4-5 Compute the Lifeline Cox Proportional Hazards Method’s Test Statistics and Residuals
Table 4-1 Test Statistics for the Lifeline Cox Proportional Hazards Method
Variable  Test   Test Statistic  p
Age       km     10.53           <0.005
Age       rank   10.78           <0.005
Fin       km     0.12            0.73
Fin       rank   0.14            0.71
Mar       km     0.18            0.67
Mar       rank   0.20            0.66
Paro      km     0.13            0.72
Paro      rank   0.11            0.74
Prio      km     0.49            0.48
Prio      rank   0.47            0.49
Race      km     0.34            0.56
Race      rank   0.37            0.54
Wexp      km     11.91           <0.005
Wexp      rank   11.61           <0.005
Figure 4-1 Scaled Schoenfeld residuals of age
Listing 4-6 determines the Lifeline Cox Proportional Hazards method’s assessment summary (see Table
4-2).
lifeline_cox_method_assessment_summary = lifeline_cox_method.print_summary()
lifeline_cox_method_assessment_summary
Listing 4-6 Compute the Assessment Summary
Feature   Coef    Exp(coef)   Se(coef)   Coef Lower 95%   Coef Upper 95%   Exp(coef) Lower 95%   Exp(coef) Upper 95%   Z       P        -log2(p)
Fin       -0.71   0.49        0.23       -1.16            -0.27            0.31                  0.77                  -3.13   <0.005   9.14
Age       -0.03   0.97        0.02       -0.08             0.01            0.93                  1.01                  -1.38   0.17     2.57
Race       0.39   1.48        0.37       -0.34             1.13            0.71                  3.09                   1.05   0.30     1.76
Wexp      -0.11   0.90        0.24       -0.59             0.37            0.56                  1.44                  -0.45   0.65     0.62
Mar       -1.15   0.32        0.61       -2.34             0.04            0.10                  1.04                  -1.90   0.06     4.11
Paro       0.07   1.07        0.23       -0.37             0.51            0.69                  1.67                   0.31   0.76     0.40
Prio       0.10   1.11        0.03        0.04             0.16            1.04                  1.17                   3.24   <0.005   9.73
Listing 4-7 determines the log test confidence interval for each feature in the data (see Figure 4-2).
lifeline_cox_log_test_ci = lifeline_cox_method.plot()
lifeline_cox_log_test_ci
Listing 4-7 Execute the Lifeline Cox Proportional Hazards Method
Figure 4-2 Log test confidence interval
PySpark in Action
This section executes the accelerated failure time method with the PySpark framework.
Listing 4-8 runs the PySpark framework with the findspark framework.
Listing 4-9 stipulates the PySpark app using the SparkConf() method.
Listing 4-10 prepares the PySpark session using the SparkSession() method.
Listing 4-11 changes the pandas dataframe created earlier in this chapter to a PySpark dataframe using
the createDataFrame() method.
pyspark_initial_data = pyspark_session.createDataFrame(initial_data)
Listing 4-11 Change the Pandas Dataframe to a PySpark Dataframe
Listing 4-12 creates a list for independent features and a string for the dependent feature. It then converts
the data using the VectorAssembler() method for modeling with the PySpark framework.
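A minimal sketch of the assembly and fitting steps might look like the following; the label and censor column names (week and arrest) are assumptions based on the data used in this chapter.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import AFTSurvivalRegression

# Treat every column except the assumed duration ("week") and event ("arrest") columns as features.
x_list = [c for c in pyspark_initial_data.columns if c not in ("week", "arrest")]
pyspark_vector_assembler = VectorAssembler(inputCols=x_list, outputCol="variables")
pyspark_data = pyspark_vector_assembler.transform(pyspark_initial_data)
# Fit the accelerated failure time method; column names are assumptions.
pyspark_accelerated_failure_method = AFTSurvivalRegression(featuresCol="variables", labelCol="week", censorCol="arrest")
pyspark_accelerated_failure_method_fitted = pyspark_accelerated_failure_method.fit(pyspark_data)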
Listing 4-14 computes the PySpark accelerated failure time method’s predictions.
pyspark_yhat = pyspark_accelerated_failure_method_fitted.transform(pyspark_data).select("arrest", "prediction")
pyspark_yhat.show()
+------+------------------+
|arrest| prediction|
+------+------------------+
| 1|18.883982665910125|
| 1| 16.88228128814963|
| 1|22.631360777172517|
| 0|373.13041474613107|
| 0| 377.2238319806288|
| 0| 375.8326538406928|
| 1| 20.9780526816987|
| 0| 374.6420738270714|
| 0| 379.7483494080467|
| 0| 376.1601473382181|
| 0| 377.1412349521787|
| 0| 373.7536844216336|
| 1| 36.36443059383637|
| 0|374.14261327949384|
| 1| 22.98494042401171|
| 1| 50.61463874375869|
| 1| 25.56399364288275|
| 0|379.61997114629696|
| 0| 384.3322960430372|
| 0|376.37634062210844|
+------+------------------+
Listing 4-14 Compute the PySpark Accelerated Failure Time Method’s Predictions
Listing 4-15 computes the PySpark accelerated failure time method’s coefficients.
pyspark_accelerated_failure_method_fitted.coefficients
Conclusion
This chapter executed two key machine learning frameworks (Lifeline and PySpark) to model censored data
with the Cox Proportional Hazards and accelerated failure time methods.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
T. C. Nokeri, Data Science Solutions with Python
https://doi.org/10.1007/978-1-4842-7762-1_5
This chapter executes and appraises a nonlinear method for binary classification (called logistic regression )
using a diverse set of comprehensive Python frameworks (i.e., Scikit-Learn, Spark MLlib, and H2O). To begin,
it clarifies the underlying concept behind the sigmoid function.
S(x) = \frac{1}{1 + e^{-x}}    (Equation 5-1)
Both Equation 5-1 and Figure 5-1 show that the function produces output values bounded between 0 and 1, which can be thresholded to yield binary classes.
Listing 5-1 attains the necessary data from a CSV file using the pandas framework.
import pandas as pd
df = pd.read_csv(r"C:\Users\i5 lenov\Downloads\banking.csv")
Listing 5-1 Attain the Data
Listing 5-2 stipulates the names of columns to drop and then executes the drop() method. It stipulates
axes as columns in order to drop the unnecessary columns in the data.
Listing 5-3 attains dummy values for categorical features in the data.
initial_data = initial_data.dropna()
Listing 5-4 Drop Null Values
Scikit-Learn in Action
This section executes and assesses the logistic regression method with the Scikit-Learn framework. Listing
5-5 outlines the independent and dependent features.
import numpy as np
x = np.array(initial_data.iloc[::, 0:17])
y = np.array(initial_data.iloc[::,-1])
Listing 5-5 Outline the Features
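A minimal sketch of the data splitting and standardization steps that produce the scaled arrays used in the listings that follow might look like this; the split ratio and random state are assumptions.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression   # used in the listing that follows

# Hold out 20% of the data for testing (assumption).
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# Standardize the features; fit on the training set only.
sk_standard_scaler = StandardScaler()
sk_standard_scaled_x_train = sk_standard_scaler.fit_transform(x_train)
sk_standard_scaled_x_test = sk_standard_scaler.transform(x_test)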
Listing 5-10 executes the logistic regression method with the Scikit-Learn framework.
sk_logistic_regression_method = LogisticRegression(penalty="l2")
sk_logistic_regression_method.fit(sk_standard_scaled_x_train, y_train)
Listing 5-10 Execute the Scikit-Learn Logistic Regression Method
print(sk_logistic_regression_method.intercept_)
[-2.4596243]
Listing 5-11 Compute the Logistic Regression Method’s Intercept
print(sk_logistic_regression_method.coef_)
[[ 0.03374725 0.04330667 -0.01305369 -0.02709009 0.13508899 0.01735913
0.00816758 0.42948983 -0.12670658 -0.25784955 -0.04025993 -0.14622466
-1.14143485 0.70803518 0.23256046 -0.02295578 -0.02857435]]
Listing 5-12 Compute the Logistic Regression Method’s Coefficients
Listing 5-13 computes the Scikit-Learn logistic regression method's confusion matrix, which summarizes
the two types of errors (false positives and false negatives) and the two types of correct predictions (true
positives and true negatives). See Table 5-1.
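A minimal sketch of computing the predictions and the confusion matrix might look like the following; the variable name and row/column labels are assumptions.
from sklearn import metrics

# Predict the test set classes and tabulate the confusion matrix.
sk_yhat = sk_logistic_regression_method.predict(sk_standard_scaled_x_test)
sk_logistic_regression_method_assessment_1 = pd.DataFrame(
    metrics.confusion_matrix(y_test, sk_yhat),
    index=["Actual: negative", "Actual: positive"],
    columns=["Predicted: negative", "Predicted: positive"])
print(sk_logistic_regression_method_assessment_1)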
sk_logistic_regression_method_assessment_2 =
pd.DataFrame(metrics.classification_report(y_test, sk_yhat,
output_dict=True)).transpose()
print(sk_logistic_regression_method_assessment_2)
Listing 5-14 Compute the Scikit-Learn Logistic Regression Method’s Classification Report
Listing 5-15 arranges the Scikit-Learn logistic regression method's receiver operating characteristic
curve. The goal is to summarize the relationship between the true positive rate (the proclivity of the method to
correctly identify positive classes) and the false positive rate (the proclivity of the method to incorrectly
flag negative classes as positive). See Figure 5-2.
sk_yhat_proba =
sk_logistic_regression_method.predict_proba(sk_standard_scaled_x_test)[::,1]
fpr_sk_logistic_regression_method, tprr_sk_logistic_regression_method, _ =
metrics.roc_curve(y_test, sk_yhat_proba)
area_under_curve_sk_logistic_regression_method =
metrics.roc_auc_score(y_test, sk_yhat_proba)
plt.plot(fpr_sk_logistic_regression_method,
tprr_sk_logistic_regression_method, label="AUC= "+
str(area_under_curve_sk_logistic_regression_method))
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.legend(loc="best")
plt.show()
Listing 5-15 Receiver Operating Characteristics Curve for the Scikit-Learn Logistic Regression Method
Figure 5-2 Receiver operating characteristics curve for the Scikit-Learn logistic regression method
Listing 5-16 arranges the Scikit-Learn logistic regression method’s precision-recall curve to condense
the arrangement of the precision and recall (see Figure 5-3).
p_sk_logistic_regression_method, r__sk_logistic_regression_method, _ =
metrics.precision_recall_curve(y_test, sk_yhat)
weighted_ps_sk_logistic_regression_method = metrics.roc_auc_score(y_test,
sk_yhat)
plt.plot(r__sk_logistic_regression_method, p_sk_logistic_regression_method,
label="WPR= " + str(weighted_ps_sk_logistic_regression_method))
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend(loc="best")
plt.show()
Listing 5-16 Precision-Recall Curve for the Scikit-Learn Logistic Regression Method