Tshepo Chris Nokeri

Data Science Solutions with Python


Fast and Scalable Models Using Keras, PySpark
MLlib, H2O, XGBoost, and Scikit-Learn
1st ed.
Tshepo Chris Nokeri
Pretoria, South Africa

ISBN 978-1-4842-7761-4 e-ISBN 978-1-4842-7762-1


https://doi.org/10.1007/978-1-4842-7762-1

© Tshepo Chris Nokeri 2022

Apress Standard

The use of general descriptive names, registered names, trademarks,


service marks, etc. in this publication does not imply, even in the
absence of a specific statement, that such names are exempt from the
relevant protective laws and regulations and therefore free for general
use.

The publisher, the authors and the editors are safe to assume that the
advice and information in this book are believed to be true and accurate
at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the
material contained herein or for any errors or omissions that may have
been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Apress imprint is published by the registered company APress


Media, LLC part of Springer Nature.
The registered company address is: 1 New York Plaza, New York, NY
10004, U.S.A.
I dedicate this book to my family and everyone who has merrily played
influential roles in my life.
Introduction
This book covers the in-memory, distributed cluster computing
framework called PySpark, the machine learning framework platforms
called Scikit-Learn, PySpark MLlib, H2O, and XGBoost, and the deep
learning framework known as Keras. After reading this book, you will
be able to apply supervised and unsupervised learning to solve
practical and real-world data problems. In this book, you will learn how
to engineer features, optimize hyperparameters, train and test models,
develop pipelines, and automate the machine learning process.
To begin, the book carefully presents supervised and unsupervised
ML and DL models and examines big data frameworks and machine
learning and deep learning frameworks. It also discusses the
parametric model called the Generalized Linear Model and two survival
regression models, the Cox Proportional Hazards model and the
Accelerated Failure Time (AFT) model. It presents a binary classification
model called Logistic Regression and an ensemble model called
Gradient-Boosted Trees. It also introduces DL and an artificial neural
network, the Multilayer Perceptron (MLP) classifier. It describes a way
of performing cluster analysis using the k-means model. It explores
dimension reduction techniques like Principal Components Analysis
and Linear Discriminant Analysis and concludes by unpacking
automated machine learning.
The book targets intermediate data scientists and machine learning
engineers who want to learn how to apply key big data frameworks, as
well as ML and DL frameworks. Before exploring the contents of this
book, be sure that you understand basic statistics, Python
programming, probability theories, and predictive analytics.
The book uses Anaconda (an open source distribution of Python)
for the examples. The following list highlights some of
the Python libraries that this book covers.
Pandas for data structures and tools.
PySpark for in-memory, cluster computing.
XGBoost for gradient boosting and survival regression analysis.
Auto-Sklearn, Tree-based Pipeline Optimization Tool (TPOT),
Hyperopt-Sklearn, and H2O for AutoML.
Scikit-Learn for building and validating key machine learning
algorithms.
Keras for high-level frameworks for deep learning.
H2O for driverless machine learning.
Lifelines for survival analysis.
NumPy for arrays and matrices.
SciPy for integrals, solving differential equations, and optimization.
Matplotlib and Seaborn for recognized plots and graphs.
Any source code or other supplementary material referenced by the
author in this book is available to readers on GitHub via the book’s
product page, located at www.apress.com/9781484277614. For more
detailed information, please visit http://www.apress.com/source-code.
Acknowledgments
Writing a single-authored book is demanding, but I received firm
support and active encouragement from my family and dear friends.
Many heartfelt thanks to the Apress Publishing team for all their
support throughout the writing and editing processes. Lastly, my
humble thanks to all of you for reading this; I earnestly hope you find it
helpful.
Table of Contents
Chapter 1: Exploring Machine Learning
Exploring Supervised Methods
Exploring Nonlinear Models
Exploring Ensemble Methods
Exploring Unsupervised Methods
Exploring Cluster Methods
Exploring Dimension Reduction
Exploring Deep Learning
Conclusion
Chapter 2: Big Data, Machine Learning, and Deep Learning
Frameworks
Big Data
Big Data Features
Impact of Big Data on Business and People
Better Customer Relationships
Refined Product Development
Improved Decision-Making
Big Data Warehousing
Big Data ETL
Big Data Frameworks
Apache Spark
ML Frameworks
Scikit-Learn
H2O
XGBoost
DL Frameworks
Keras
Chapter 3: Linear Modeling with Scikit-Learn, PySpark, and H2O
Exploring the Ordinary Least-Squares Method
Scikit-Learn in Action
PySpark in Action
H2O in Action
Conclusion
Chapter 4: Survival Analysis with PySpark and Lifelines
Exploring Survival Analysis
Exploring Cox Proportional Hazards Method
Lifeline in Action
Exploring the Accelerated Failure Time Method
PySpark in Action
Conclusion
Chapter 5: Nonlinear Modeling With Scikit-Learn, PySpark, and
H2O
Exploring the Logistic Regression Method
Scikit-Learn in Action
PySpark in Action
H2O in Action
Conclusion
Chapter 6: Tree Modeling and Gradient Boosting with Scikit-Learn,
XGBoost, PySpark, and H2O
Decision Trees
Preprocessing Features
Scikit-Learn in Action
Gradient Boosting
XGBoost in Action
PySpark in Action
H2O in Action
Conclusion
Chapter 7: Neural Networks with Scikit-Learn, Keras, and H2O
Exploring Deep Learning
Multilayer Perceptron Neural Network
Preprocessing Features
Scikit-Learn in Action
Keras in Action
Deep Belief Networks
H2O in Action
Conclusion
Chapter 8: Cluster Analysis with Scikit-Learn, PySpark, and H2O
Exploring the K-Means Method
Scikit-Learn in Action
PySpark in Action
H2O in Action
Conclusion
Chapter 9: Principal Component Analysis with Scikit-Learn,
PySpark, and H2O
Exploring the Principal Component Method
Scikit-Learn in Action
PySpark in Action
H2O in Action
Conclusion
Chapter 10: Automating the Machine Learning Process with H2O
Exploring Automated Machine Learning
Preprocessing Features
H2O AutoML in Action
Conclusion
Index
About the Author
Tshepo Chris Nokeri
harnesses advanced analytics and
artificial intelligence to foster innovation
and optimize business performance. In
his work, he delivered complex solutions
to companies in the mining, petroleum,
and manufacturing industries. He earned
a Bachelor’s degree in Information
Management and then graduated with an
honours degree in Business Science
from the University of the
Witwatersrand, on a TATA Prestigious
Scholarship and a Wits Postgraduate
Merit Award. He was also unanimously
awarded the Oxford University Press Prize. He is the author of Data
Science Revealed, Implementing Machine Learning in Finance, and
Econometrics and Data Science, all published by Apress.
About the Technical Reviewer
Joos Korstanje
is a data scientist with over five years of industry experience in
developing machine learning tools, a large part of which has been
forecasting models. He currently works at Disneyland Paris, where he
develops machine learning models for a variety of tools. His experience in
writing and teaching motivated him to contribute to this book.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer
Nature 2022
T. C. Nokeri, Data Science Solutions with Python
https://doi.org/10.1007/978-1-4842-7762-1_1

1. Exploring Machine Learning


Tshepo Chris Nokeri1
(1) Pretoria, South Africa

This chapter introduces the best machine learning methods and


specifies the main differences between supervised and unsupervised
machine learning. It also discusses various applications of both.
Machine learning has been around for a long time; however, it has
recently gained widespread recognition. This is because of the
increased computational power of modern computer systems and the
ease of access to open source platforms and frameworks. Machine
learning involves endowing computer systems with intelligence by
implementing various programming and statistical techniques. It draws
from fields such as statistics, computational linguistics, and
neuroscience, among others. It also applies modern statistics and basic
programming. It enables developers to develop and deploy intelligent
computer systems and create practical and reliable applications.

Exploring Supervised Methods


Supervised learning involves feeding a machine learning algorithm with
a large data set, through which it learns the desired output. The main
machine learning methods are linear, nonlinear, and ensemble. The
following section familiarizes you with the parametric method, also
known as the linear regression method. See Table 1-1.
Linear regression methods expect data to follow a Gaussian
distribution. The ordinary least-squares method is the standard linear
method; it minimizes the sum of squared error terms and determines the slope and
intercept to fit the data to a straight line (where the dependent feature
is continuous).
Furthermore, the ordinary least-squares method determines the
extent of the association between the features. It also assumes linearity,
no autocorrelation in residuals, and no multicollinearity. In the real
world, data hardly ever comes from a normal distribution; the method
struggles to explain the variability, resulting in an under-fitted or over-
fitted model. To address this problem, we introduce a penalty term to
the equation to alter the model’s performance.

Table 1-1 Types of Parametric Methods

Linear regression method: Applied when there is one dependent feature (continuous) and an independent feature (continuous or categorical). The main linear regression methods are GLM, Ridge, Lasso, Elastic Net, etc.
Survival regression method: Applied to time-to-event censored data, where the dependent feature is categorical and the independent feature is continuous.
Time series analysis method: Applied to uncovering patterns in sequential data and forecasting future instances. Principal time series models include the ARIMA model, SARIMA, the Additive model, etc.
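
To make the penalty-term idea above concrete, here is a minimal sketch (not one of the book's examples) that compares plain ordinary least squares with Ridge regression, which adds an L2 penalty; the synthetic data and the alpha value are assumptions chosen purely for illustration.

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# synthetic data, purely for illustration
x, y = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

ols_model = LinearRegression().fit(x_train, y_train)
ridge_model = Ridge(alpha=1.0).fit(x_train, y_train)  # alpha sets the penalty strength

print("OLS R^2:", ols_model.score(x_test, y_test))
print("Ridge R^2:", ridge_model.score(x_test, y_test))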

Exploring Nonlinear Models


Nonlinear (classification) methods differentiate classes (also called
categories or labels)—see Table 1-2. When there are two classes in the
dependent feature, you’ll implement binary classification methods.
When there are more than two classes, you’ll implement multiclass
classification methods. There are innumerable functions for
classification, including the sigmoid, hyperbolic tangent, and kernel functions, among
others. Their application depends on the context. Subsequent chapters
will cover some of these functions.
Table 1-2 Varying Nonlinear Models
Binary classification method: Applied when the categorical feature has only two possible outcomes. The popular binary classifier is the logistic regression model.
Multiclass classification method: Applied when the categorical feature has more than two possible outcomes. The main multiclass classifier is the Linear Discriminant Analysis model (which can also be used for dimension reduction).
Survival classification method: Applied when you’re computing the probabilities of an event occurring using a categorical feature.

Exploring Ensemble Methods


Ensemble methods enable you to uncover linearity and nonlinearity.
The main ensemble method is the random forest, which is often
computationally demanding. In addition, its performance depends on a
variety of factors like the depth, iterations, data splitting, and boosting.
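
As a minimal sketch of these ideas (assuming a synthetic dataset, not one from the book), the following compares a random forest with a gradient-boosting ensemble and exposes the depth and iteration settings mentioned above.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# synthetic data, purely for illustration
x, y = make_classification(n_samples=500, n_features=8, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

forest_model = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0).fit(x_train, y_train)
boosted_model = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=0).fit(x_train, y_train)

print("Random forest accuracy:", forest_model.score(x_test, y_test))
print("Gradient boosting accuracy:", boosted_model.score(x_test, y_test))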

Exploring Unsupervised Methods


In contrast to supervised machine learning, unsupervised learning
contains no ground truth. You expose the model to all the sample data,
allowing it to guess about the pattern of the data. The most common
unsupervised methods are cluster-related methods.

Exploring Cluster Methods


Cluster methods are convenient for assembling common values in the
data; they enable you to uncover patterns in both structured and
unstructured data. Table 1-3 highlights a few cluster methods that you
can use to guess the pattern of the data.

Table 1-3 Varying Cluster Models

Centroid clustering: Applied to determine the center of the data and draw data points toward the center. The main centroid clustering method is the k-means method.
Density clustering: Applied to determine where the data is concentrated. The main density clustering model is the DBSCAN method.
Distribution clustering: Identifies the probability of data points belonging to a cluster based on some distribution. The main distribution clustering method is the Gaussian Mixture method.
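
The following minimal sketch runs one method from each family in Table 1-3 on synthetic blob data; the data and the parameter values are assumptions for illustration only.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture

x, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # synthetic data

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(x)  # centroid clustering
dbscan_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(x)                   # density clustering
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(x)     # distribution clustering
print(kmeans_labels[:10], dbscan_labels[:10], gmm_labels[:10])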

Exploring Dimension Reduction


Dimension reducers help determine the extent to which factors or
components elucidate related changes in data. Table 1-4 provides an
overview of the chief dimension reducers.

Table 1-4 Main Dimension Reducers

Factor analysis: Applied to determine the extent to which factors elucidate related changes of features in the data.
Principal component analysis: Applied to determine the extent to which principal components elucidate related changes of features in the data.
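
A minimal sketch of both reducers in Table 1-4, assuming synthetic data; it simply projects ten features down to two.

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, FactorAnalysis

x, _ = make_classification(n_samples=300, n_features=10, random_state=0)  # synthetic data

pca_scores = PCA(n_components=2).fit_transform(x)                                # principal component analysis
factor_scores = FactorAnalysis(n_components=2, random_state=0).fit_transform(x)  # factor analysis
print(pca_scores.shape, factor_scores.shape)  # (300, 2) (300, 2)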

Exploring Deep Learning


Deep learning extends machine learning in that it uses artificial neural
networks to solve complex problems and to discover patterns in big,
complex data. The networks use an
approach similar to that of animals, whereby neurons (nodes) that
receive input data from the environment scan the input and pass it to
neurons in successive layers to arrive at some output (which is
understanding the complexity in the data). There are several artificial
neural networks that you can use to determine patterns of behavior of a
phenomenon, depending on the context. Table 1-5 highlights the chief
neural networks.
Table 1-5 Varying Neural Networks
Restricted Boltzmann Machine (RBM): The most common neural network; it contains only hidden and visible layers.
Multilayer Perceptron (MLP): A neural network that extends a restricted Boltzmann machine with input, hidden, and output layers.
Recurrent Neural Network (RNN): Serves as a sequential modeler.
Convolutional Neural Network (CNN): Serves as a dimension reducer and classifier.
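
As a small illustration of the MLP described in Table 1-5, here is a minimal Scikit-Learn sketch; the synthetic data and the layer sizes are assumptions for illustration (Chapter 7 builds such networks properly with Scikit-Learn, Keras, and H2O).

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

x, y = make_classification(n_samples=500, n_features=10, random_state=0)  # synthetic data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

mlp_model = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=500, random_state=0)  # two hidden layers
mlp_model.fit(x_train, y_train)
print("MLP accuracy:", mlp_model.score(x_test, y_test))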

Conclusion
This chapter covered two ways in which machines learn—via
supervised and unsupervised learning. It began by explaining
supervised machine learning and discussing the three types of
supervised learning methods and their applications. It then covered
unsupervised learning techniques, dimension reduction, and cluster
analysis.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
T. C. Nokeri, Data Science Solutions with Python
https://doi.org/10.1007/978-1-4842-7762-1_2

2. Big Data, Machine Learning, and Deep Learning


Frameworks
Tshepo Chris Nokeri1
(1) Pretoria, South Africa

This chapter carefully presents the big data framework used for parallel data processing
called Apache Spark. It also covers several machine learning (ML) and deep learning (DL)
frameworks useful for building scalable applications. After reading this chapter, you will
understand how big data is collected, manipulated, and examined using resilient and fault-
tolerant technologies. It discusses the Scikit-Learn, Spark MLlib, H2O, and XGBoost frameworks. It
also covers a deep learning framework called Keras. It concludes by discussing effective
ways of setting up and managing these frameworks.
Big data frameworks support parallel data processing. They enable you to contain big
data across many clusters. The most popular big data framework is Apache Spark, which is
built on the Hadoop framework.

Big Data
Big data means different things to different people. In this book, we define big data as large
amounts of data that we cannot adequately handle and manipulate using classic methods.
We must undoubtedly use scalable frameworks and modern technologies to process and
draw insight from this data. We typically consider data “big” when it cannot fit within the
current in-memory storage space. For instance, if you have a personal computer and the
data at your disposal exceeds your computer’s storage capacity, it’s big data. This equally
applies to large corporations with large clusters of storage space. We often speak about big
data when we use a stack with Hadoop/Spark.

Big Data Features


The features of big data are described as the four Vs—velocity, volume, variety, and veracity.
Table 2-1 highlights these features of big data.
Table 2-1 Big Data Features

Velocity: Modern technologies and improved connectivity enable you to generate data at an unprecedented speed. Characteristics of velocity include batch data, near or real-time data, and streams.
Volume: The scale at which data increases. The nature of data sources and infrastructure influences the volume of data. Characteristics of volume include exabytes, zettabytes, etc.
Variety: Data can come from unique sources. Modern technological devices leave digital footprints here and there, which increases the number of sources from which businesses and people can get data. Characteristics of variety include the structure and complexity of the data.
Veracity: Data must come from reliable sources. Also, it must be of high quality, consistent, and complete.

Impact of Big Data on Business and People


Without a doubt, big data affects the way we think and do business. Data-driven
organizations typically establish the basis for evidence-based management. Big data
involves measuring the key aspects of the business using quantitative methods. It helps
support decision-making. The next sections discuss ways in which big data affects
businesses and people.

Better Customer Relationships


Insights from big data help manage customer relationships. Organizations with big data
about their customers can study customers’ behavioral patterns and use descriptive
analytics to drive customer-management strategies.

Refined Product Development


Data-driven organizations use big data analytics and predictive analytics to drive product
development and management strategies. This approach is useful for incremental and
iterative delivery of applications.

Improved Decision-Making
When a business has big data, it can use it to uncover complex patterns of a phenomenon to
influence strategy. This approach helps management make well-informed decisions based
on evidence, rather than on subjective reasoning. Data-driven organizations foster a culture
of evidence-based management.
We also use big data in fields like life sciences, physics, economics, and medicine. There
are many ways in which big data affects the world. This chapter does not consider all
factors. The next sections explain big data warehousing and ETL activities.

Big Data Warehousing


Over the past few decades, organizations have invested in on-premise databases, including
Microsoft Access, Microsoft SQL Server, SAP Hana, Oracle Database, and many more. There
has recently been widespread adoption of cloud databases like Microsoft Azure SQL and
Oracle XE. There are also standard big data (distributed) databases like Cassandra and
HBase, among others. Businesses are shifting toward scalable cloud-based databases to
harness benefits associated with increasing computational power, fault-tolerant
technologies, and scalable solutions.

Big Data ETL


Although there have been significant advances in database management, the way that
people manipulate data from databases remains the same. Extracting, transforming, and
loading (ETL) still play an integral part in analysis and reporting. Table 2-2 discusses ETL
activities.
Table 2-2 ETL Activities

Extracting: Involves getting data from a database.
Transforming: Involves converting data from a database into a suitable format for analysis and reporting.
Loading: Involves warehousing data in a database management system.
To perform ETL activities, you must use a query language. The most popular query
language is SQL (Structured Query Language). There are other query languages that emerged
with the open source movement, such as HiveQL and BigQuery. The Python programming
language supports SQL. Python frameworks can connect to databases by implementing
libraries, such as SQLAlchemy, pyodbc, SQLite, SparkSQL, and pandas, among others.
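
As a minimal sketch of these ETL steps from Python, the following uses pandas with an in-memory SQLite database; the table and column names are made up for illustration.

import sqlite3
import pandas as pd

connection = sqlite3.connect(":memory:")  # throwaway database, purely for illustration

# Load: warehouse a small dataframe as a table
pd.DataFrame({"customer": ["A", "B", "C"], "spend": [120.0, 80.5, 200.0]}).to_sql(
    "sales", connection, index=False)

# Extract and transform: query the table back into pandas and derive a new column
extracted = pd.read_sql("SELECT customer, spend FROM sales WHERE spend > 100", connection)
extracted["spend_in_thousands"] = extracted["spend"] / 1000
print(extracted)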

Big Data Frameworks


Big data frameworks enable developers to collect, manage, and manipulate distributed data.
Most open source big data frameworks use in-memory cluster computing. The most popular
frameworks include Hadoop, Spark, Flink, Storm, and Samza. This book uses PySpark to
perform ETL activities, explore data, and build machine learning pipelines.

Apache Spark
Apache Spark executes in-memory cluster computing. It enables developers to build
scalable applications using Java, Scala, Python, R, and SQL. It includes cluster components
like the driver, cluster manager, and executor. You can use it as a standalone cluster manager
or on top of Mesos, Hadoop YARN, or Kubernetes. You can use it to access data in the Hadoop
File System (HDFS), Cassandra, HBase, and Hive, among other data sources. The Spark data
structure is considered a resilient distributed data set. This book introduces a framework
that integrates both Python and Apache Spark (PySpark). The book uses it to operate Spark
MLlib. To understand this framework, you first need to grasp the idea behind resilient
distributed data sets.

Resilient Distributed Data Sets


Resilient Distributed Data Sets (RDDs) are immutable elements for parallelizing data or for
transforming existing data. Chief RDD operations include transformations and actions. We
store them in any storage supported by Hadoop, for instance, in the Hadoop Distributed File
System (HDFS), Cassandra, HBase, Amazon S3, etc.
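
A minimal RDD sketch, assuming a SparkContext named pyspark_context such as the one created in Listing 2-3 below; it chains a transformation (map) with an action (collect).

numbers_rdd = pyspark_context.parallelize([1, 2, 3, 4, 5])  # distribute a small list as an RDD
squared_rdd = numbers_rdd.map(lambda value: value ** 2)     # transformation (lazy)
print(squared_rdd.collect())                                # action: [1, 4, 9, 16, 25]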

Spark Configuration
Areas of Spark configuration include Spark properties, environment variables, and logging.
The default configuration directory is SPARK_HOME/conf.
You can install the findspark library in your environment using pip install
findspark and install the pyspark library using pip install pyspark.
Listing 2-1 prepares the PySpark framework using the findspark framework.
import findspark as initiate_pyspark
initiate_pyspark.init(r"filepath\spark-3.0.0-bin-hadoop2.7")
Listing 2-1 Prepare the PySpark Framework
Listing 2-2 stipulates the PySpark app using the SparkConf() method.

from pyspark import SparkConf


pyspark_configuration = SparkConf().setAppName("pyspark_linear_method").setMaster("local")
Listing 2-2 Stipulate the PySpark App

Listing 2-3 prepares the PySpark session with the SparkSession() method.

from pyspark import SparkContext
from pyspark.sql import SparkSession

pyspark_context = SparkContext(conf=pyspark_configuration)  # uses the configuration from Listing 2-2
pyspark_session = SparkSession(pyspark_context)
Listing 2-3 Prepare the Spark Session

Spark Frameworks
Spark frameworks extend the core of the Spark API. There are four main Spark frameworks
—SparkSQL, Spark Streaming, Spark MLlib, and GraphX.
SparkSQL
SparkSQL enables you to use relational query languages like SQL, HiveQL, and Scala. It
includes a schemaRDD that has row objects and schema. You create it using an existing
RDD, parquet file, or JSON data set. You execute the Spark Context to create a SQL context.
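
A minimal SparkSQL sketch, assuming the pyspark_session created in Listing 2-3; the table and column names are made up for illustration.

people_df = pyspark_session.createDataFrame(
    [("Ada", 36), ("Linus", 52)], ["name", "age"])      # toy dataframe
people_df.createOrReplaceTempView("people")             # register it as a SQL view
pyspark_session.sql("SELECT name FROM people WHERE age > 40").show()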
Spark Streaming
Spark streaming is a scalable streaming framework that supports Apache Kafka, Apache
Flume, HDFS, and Apache Kinesis, etc. It processes input data using DStreams in small batches
and pushes the results to HDFS, databases, and dashboards. Recent versions of Python do not support
Spark Streaming. Consequently, we do not cover the framework in this book. You can use a
Spark Streaming application to read input from any data source and store a copy of the data
in HDFS. This allows you to build and launch a Spark Streaming application that processes
incoming data and runs an algorithm on it.
Spark MLlib
MLlib is an ML framework that allows you to develop and test ML and DL models. In Python,
the framework works hand-in-hand with the NumPy framework. Spark MLlib can be used
with several Hadoop data sources and incorporated alongside Hadoop workflows. Common
algorithms include regression, classification, clustering, collaborative filtering, and
dimension reduction. Key workflow utilities include feature transformation, standardization
and normalization, pipeline development, model evaluation, and hyperparameter
optimization.
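
A minimal sketch of an MLlib pipeline that chains the feature transformation and model-training utilities mentioned above, again assuming the pyspark_session from Listing 2-3 and a made-up toy dataframe.

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

toy_df = pyspark_session.createDataFrame(
    [(1.0, 2.0, 5.0), (2.0, 1.0, 7.0), (3.0, 4.0, 12.0)], ["x1", "x2", "label"])  # toy data
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")          # feature transformation
regression = LinearRegression(featuresCol="features", labelCol="label")
pipeline_model = Pipeline(stages=[assembler, regression]).fit(toy_df)              # pipeline development
pipeline_model.transform(toy_df).select("label", "prediction").show()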
GraphX
GraphX is a scalable and fault-tolerant framework for iterative and fast graph parallel
computing, social networks, and language modeling. It includes graph algorithms such as
PageRank for estimating the importance of each vertex in a graph, Connected Components
for labeling connected components of the graph with the ID of its lowest-numbered vertex,
and Triangle Counting for finding the number of triangles that pass through each vertex.

ML Frameworks
To solve ML problems, you need to have a framework that supports building and scaling ML
models. There is no shortage of options; there are innumerable frameworks for ML.
Subsequent chapters cover frameworks
like Scikit-Learn, Spark MLlib, H2O, and XGBoost.

Scikit-Learn
The Scikit-Learn framework includes ML algorithms like regression, classification, and
clustering, among others. You can use it with other frameworks such as NumPy and SciPy. It
can perform most of the tasks required for ML projects like data processing, transformation,
data splitting, normalization, hyperparameter optimization, model development, and
evaluation. Scikit-Learn comes with most distribution packages that support Python. Use
pip install sklearn to install it in your Python environment .

H2O
H2O is an ML framework that uses a driverless technology. It enables you to accelerate the
adoption of AI solutions. It is very easy to use, and it does not require any technical
expertise. Not only that, but it supports numerical and categorical data, including text.
Before you train the ML model, you must first load the data into the H2O cluster. It supports
CSV, Excel, and Parquet files. Default data sources include local file systems, remote files,
Amazon S3, HDFS, etc. It has ML algorithms like regression, classification, cluster analysis,
and dimension reduction. It can also perform most tasks required for ML projects like data
processing, transformation, data splitting, normalization, hyperparameter optimization,
model development, checking pointing, evaluation, and productionizing. Use pip install
h2o to install the package in your environment.
Listing 2-4 prepares the H2O framework.

import h2o
h2o.init()
Listing 2-4 Initializing the H2O Framework

XGBoost
XGBoost is an ML framework that supports programming languages, including Python. It
executes scalable gradient-boosted models and trains quickly through parallel and distributed
computing without sacrificing memory efficiency. Not only that, but it is an ensemble
learner. As mentioned in Chapter 1, ensemble learners can solve both regression and
classification problems. XGBoost uses boosting to learn from the errors committed in the
preceding trees. It is useful when tree-based models are overfitted. Use pip install
xgboost to install the model in your Python environment.
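
A minimal XGBoost sketch, assuming a synthetic regression dataset; the hyperparameter values are illustrative defaults rather than tuned settings.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

x, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)  # synthetic data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

xgb_model = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.1)  # boosted trees
xgb_model.fit(x_train, y_train)
print("XGBoost R^2:", xgb_model.score(x_test, y_test))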

DL Frameworks
DL frameworks provide a structure that supports scaling artificial neural networks. You can
use them stand-alone or with other models. They typically include programs and code
libraries. Primary DL frameworks include TensorFlow, PyTorch, Deeplearning4j,
Microsoft Cognitive Toolkit (CNTK), and Keras.

Keras
Keras is a high-level DL framework written using Python; it runs on top of an ML platform
known as TensorFlow. It is effective for rapid prototyping of DL models. You can run Keras
on Tensor Processing Units (TPUs) or on massive Graphics Processing Units (GPUs). The main Keras APIs
include models, layers, and callbacks. Chapter 7 covers this framework. Execute pip
install keras and pip install tensorflow to use the Keras framework.
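
A minimal Keras sketch using the models and layers APIs mentioned above, assuming randomly generated data; Chapter 7 covers Keras in detail.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

x = np.random.rand(200, 8)                 # random inputs, purely for illustration
y = (x.sum(axis=1) > 4).astype(int)        # made-up binary target

keras_model = keras.Sequential([
    layers.Dense(16, activation="relu", input_shape=(8,)),  # hidden layer
    layers.Dense(1, activation="sigmoid"),                  # output layer
])
keras_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
keras_model.fit(x, y, epochs=5, batch_size=32, verbose=0)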
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
T. C. Nokeri, Data Science Solutions with Python
https://doi.org/10.1007/978-1-4842-7762-1_3

3. Linear Modeling with Scikit-Learn, PySpark, and H2O


Tshepo Chris Nokeri1
(1) Pretoria, South Africa

This introductory chapter explains the ordinary least-squares method and executes it with the main Python
frameworks (i.e., Scikit-Learn, Spark MLlib, and H2O). It begins by explaining the underlying concept behind
the method.

Exploring the Ordinary Least-Squares Method


The ordinary least-squares method is used with data that has an output feature that is not confined (it’s
continuous). This method expects normality and linearity, and there must be an absence of autocorrelation
in the error terms (also called residuals) and multicollinearity. It is also highly prone to abnormalities in
the data, so you might want to use alternative methods like Ridge, Lasso, and Elastic Net if this method does
not serve you well.
Listing 3-1 attains the necessary data from a CSV file.

import pandas as pd
df = pd.read_csv(r"filepath\WA_Fn-UseC_-
Marketing_Customer_Value_Analysis.csv")
Listing 3-1 Attain the Data

Listing 3-2 stipulates the names of columns to drop and then executes the drop() method. It then
stipulates axes as columns in order to drop the unnecessary columns in the data.

drop_column_names = df.columns[[0, 6]]


initial_data = df.drop(drop_column_names, axis="columns")
Listing 3-2 Drop Unnecessary Features in the Data

Listing 3-3 attains the dummy values for the categorical features in this data.

initial_data.iloc[::, 0] = pd.get_dummies(initial_data.iloc[::, 0])


initial_data.iloc[::, 2] = pd.get_dummies(initial_data.iloc[::, 2])
initial_data.iloc[::, 3] = pd.get_dummies(initial_data.iloc[::, 3])
initial_data.iloc[::, 4] = pd.get_dummies(initial_data.iloc[::, 4])
initial_data.iloc[::, 5] = pd.get_dummies(initial_data.iloc[::, 5])
initial_data.iloc[::, 6] = pd.get_dummies(initial_data.iloc[::, 6])
initial_data.iloc[::, 7] = pd.get_dummies(initial_data.iloc[::, 7])
initial_data.iloc[::, 8] = pd.get_dummies(initial_data.iloc[::, 8])
initial_data.iloc[::, 9] = pd.get_dummies(initial_data.iloc[::, 9])
initial_data.iloc[::, 15] = pd.get_dummies(initial_data.iloc[::, 15])
initial_data.iloc[::, 16] = pd.get_dummies(initial_data.iloc[::, 16])
initial_data.iloc[::, 17] = pd.get_dummies(initial_data.iloc[::, 17])
initial_data.iloc[::, 18] = pd.get_dummies(initial_data.iloc[::, 18])
initial_data.iloc[::, 20] = pd.get_dummies(initial_data.iloc[::, 20])
initial_data.iloc[::, 21] = pd.get_dummies(initial_data.iloc[::, 21])
Listing 3-3 Attain Dummy Features
Listing 3-4 outlines the independent and dependent features.

import numpy as np
int_x = initial_data.iloc[::,0:19]
fin_x = initial_data.iloc[::,19:21]
x_combined = pd.concat([int_x, fin_x], axis=1)
x = np.array(x_combined)
y = np.array(initial_data.iloc[::,19])
Listing 3-4 Outline the Features

Scikit-Learn in Action
Listing 3-5 randomly divides the dataframe.

from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,
random_state=0)
Listing 3-5 Randomly Divide the Dataframe

Listing 3-6 scales the independent features.

from sklearn.preprocessing import StandardScaler


sk_standard_scaler = StandardScaler()
sk_standard_scaled_x_train = sk_standard_scaler.fit_transform(x_train)
sk_standard_scaled_x_test = sk_standard_scaler.transform(x_test)
Listing 3-6 Scale the Independent Features

Listing 3-7 executes the Scikit-Learn ordinary least-squares regression method.

from sklearn.linear_model import LinearRegression


sk_linear_model = LinearRegression()
sk_linear_model.fit(sk_standard_scaled_x_train, y_train)
Listing 3-7 Execute the Scikit-Learn Ordinary Least-Squares Regression Method

Listing 3-8 determines the best hyperparameters for the Scikit-Learn ordinary least-squares regression
method.

from sklearn.model_selection import GridSearchCV


sk_linear_model_param = {'fit_intercept':[True,False]}
sk_linear_model_param_mod = GridSearchCV(estimator=sk_linear_model,
param_grid=sk_linear_model_param, n_jobs=-1)
sk_linear_model_param_mod.fit(sk_standard_scaled_x_train, y_train)
print("Best OLS score: ", sk_linear_model_param_mod.best_score_)
print("Best OLS parameter: ", sk_linear_model_param_mod.best_params_)

Best OLS score: 1.0


Best OLS parameter: {'fit_intercept': True}
Listing 3-8 Determine the Best Hyperparameters for the Scikit-Learn Ordinary Least-Squares Regression Method

Listing 3-9 executes the Scikit-Learn ordinary least-squares regression method.

sk_linear_model = LinearRegression(fit_intercept=True)


sk_linear_model.fit(sk_standard_scaled_x_train, y_train)
Listing 3-9 Execute the Scikit-Learn Ordinary Least-Squares Regression Method
Listing 3-10 computes the Scikit-Learn ordinary least-squares regression method’s intercept.

print(sk_linear_model.intercept_)
433.0646521131769
Listing 3-10 Compute the Scikit-Learn Ordinary Least-Squares Regression Method’s Intercept

Listing 3-11 computes the Scikit-Learn ordinary least-squares regression method’s coefficients.

print(sk_linear_model.coef_)
[-6.15076155e-15 2.49798076e-13 -1.95573220e-14 -1.90089677e-14
-5.87187344e-14 2.50923806e-14 -1.05879478e-13 1.53591400e-14
-1.82507711e-13 -7.86327034e-14 4.17629484e-13 1.28923537e-14
6.52911311e-14 -5.28069778e-14 -1.57900159e-14 -6.74040176e-14
-9.28427833e-14 5.03132848e-14 -8.75978166e-15 2.90235705e+02
-9.55950515e-14]
Listing 3-11 Compute the Scikit-Learn Ordinary Least-Squares Regression Method’s Coefficients

Listing 3-12 computes the ordinary least-squares regression method’s predictions.

sk_yhat = sk_linear_model.predict(sk_standard_scaled_x_test)
Listing 3-12 Compute the Scikit-Learn Ordinary Least-Squares Regression Method’s Predictions

Listing 3-13 assesses the Scikit-Learn ordinary least-squares method (see Table 3-1).

from sklearn import metrics


sk_mean_ab_error = metrics.mean_absolute_error(y_test, sk_yhat)
sk_mean_sq_error = metrics.mean_squared_error(y_test, sk_yhat)
sk_root_sq_error = np.sqrt(sk_mean_sq_error)
sk_determinant_coef = metrics.r2_score(y_test, sk_yhat)
sk_exp_variance = metrics.explained_variance_score(y_test, sk_yhat)
sk_linear_model_ev = [[sk_mean_ab_error, sk_mean_sq_error, sk_root_sq_error,
sk_determinant_coef, sk_exp_variance]]
sk_linear_model_assessment = pd.DataFrame(sk_linear_model_ev, index =
["Estimates"], columns = ["Sk mean absolute error",
"Sk mean squared error",
"Sk root mean squared error",
"Sk determinant coefficient",
"Sk variance score"])
sk_linear_model_assessment
Listing 3-13 Assess the Scikit-Learn Ordinary Least-Squares Method

Table 3-1 Assessment of the Scikit-Learn Ordinary Least-Squares Method

Sk mean absolute error  Sk mean squared error  Sk root mean squared error  Sk determinant coefficient  Sk variance score
Estimates  9.091189e-13  1.512570e-24  1.229866e-12  1.0  1.0

Table 3-1 shows that the Scikit-Learn ordinary least-squares method explains the entire variability.

PySpark in Action
This section executes and assesses the ordinary least-squares method with the PySpark framework. Listing
3-14 prepares the PySpark framework with the findspark framework.

import findspark as initiate_pyspark


initiate_pyspark.init(r"filepath\spark-3.0.0-bin-hadoop2.7")
Listing 3-14 Prepare the PySpark Framework
Listing 3-15 stipulates the PySpark app with the SparkConf() method.

from pyspark import SparkConf


pyspark_configuration = SparkConf().setAppName("pyspark_linear_method").setMaster("local")
Listing 3-15 Stipulate the PySpark App

Listing 3-16 prepares the PySpark session with the SparkSession() method.

from pyspark import SparkContext
from pyspark.sql import SparkSession

pyspark_context = SparkContext(conf=pyspark_configuration)  # uses the configuration from Listing 3-15
pyspark_session = SparkSession(pyspark_context)
Listing 3-16 Prepare the Spark Session

Listing 3-17 changes the pandas dataframe created earlier in this chapter to a PySpark dataframe using
the createDataFrame() method.

pyspark_initial_data = pyspark_session.createDataFrame(initial_data)
Listing 3-17 Change Pandas Dataframe to a PySpark Dataframe

Listing 3-18 creates a list for independent features and a string for the dependent feature. It converts
data using the VectorAssembler() method for modeling with the PySpark framework.

x_list = list(x_combined.columns)
y_list = initial_data.columns[19]
from pyspark.ml.feature import VectorAssembler
pyspark_data_columns = x_list
pyspark_vector_assembler = VectorAssembler(inputCols=pyspark_data_columns,
outputCol="variables")
pyspark_data = pyspark_vector_assembler.transform(pyspark_initial_data)
Listing 3-18 Transform the Data

Listing 3-19 divides the data using the randomSplit() method.

(pyspark_training_data, pyspark_test_data) = pyspark_data.randomSplit([.8, .2])
Listing 3-19 Divide the Dataframe

Listing 3-20 executes the PySpark ordinary least-squares regression method.

from pyspark.ml.regression import LinearRegression


pyspark_linear_model = LinearRegression(labelCol=y_list,
featuresCol=pyspark_data.columns[-1])
pyspark_fitted_linear_model = pyspark_linear_model.fit(pyspark_training_data)
Listing 3-20 Execute the PySpark Ordinary Least-Squares Regression Method

Listing 3-21 computes the PySpark ordinary least-squares regression method’s predictions.

pyspark_yhat = pyspark_fitted_linear_model.transform(pyspark_test_data)
Listing 3-21 Compute the PySpark Ordinary Least-Squares Regression Method’s Predictions

Listing 3-22 assesses the PySpark ordinary least-squares method.

pyspark_linear_model_assessment = pyspark_fitted_linear_model.summary
print("PySpark root mean squared error",
pyspark_linear_model_assessment.rootMeanSquaredError)
print("PySpark determinant coefficient", pyspark_linear_model_assessment.r2)

PySpark root mean squared error 2.0762306526480097e-13


PySpark determinant coefficient 1.0
Listing 3-22 Assess the PySpark Ordinary Least-Squares Method

H2O in Action
This section executes and assesses the ordinary least-squares method with the H2O framework.
Listing 3-23 prepares the H2O framework.

import h2o as initialize_h2o


initialize_h2o.init()
Listing 3-23 Prepare the H2O Framework

Listing 3-24 changes the pandas dataframe to an H2O dataframe.

h2o_data = initialize_h2o.H2OFrame(initial_data)
Listing 3-24 Change the Pandas Dataframe to an H2O Dataframe

Listing 3-25 outlines the independent and dependent features.

y = y_list
x = h2o_data.col_names
x.remove(y_list)
Listing 3-25 Outline Features

Listing 3-26 randomly divides the data.

h2o_training_data, h2o_validation_data, h2o_test_data = h2o_data.split_frame(ratios=[.8, .1])
Listing 3-26 Randomly Divide the Dataframe

Listing 3-27 executes the H2O ordinary least-squares regression method.

from h2o.estimators import H2OGeneralizedLinearEstimator


h2o_linear_model = H2OGeneralizedLinearEstimator(family="gaussian")
h2o_linear_model.train(x=x, y=y, training_frame=h2o_training_data, validation_frame=h2o_validation_data)
Listing 3-27 Execute the H2O Ordinary Least-Squares Regression Method

Listing 3-28 computes the H2O ordinary least-squares method’s predictions.

h2o_yhat = h2o_linear_model.predict(h2o_test_data)
Listing 3-28 Compute the H2O Ordinary Least-Squares Method’s Predictions

Listing 3-29 computes the H2O ordinary least-squares method’s standardized coefficients (see Figure 3-
1).

h2o_linear_model_std_coefficients = h2o_linear_model.std_coef_plot()
h2o_linear_model_std_coefficients
Listing 3-29 H2O Ordinary Least-Squares Method’s Standardized Coefficients
Figure 3-1 H2O ordinary least-squares method’s standardized coefficients
Listing 3-30 computes the H2O ordinary least-squares method’s partial dependency (see Figure 3-2).

h2o_linear_model_dependency_plot = h2o_linear_model.partial_plot(data =
h2o_data, cols = list(initial_data.columns[[0,19]]), server=False, plot =
True)
h2o_linear_model_dependency_plot
Listing 3-30 H2O Ordinary Least-Squares Method’s Partial Dependency
Figure 3-2 H2O ordinary least-squares method’s partial dependency
Listing 3-31 arranges the features that are the most important to the H2O ordinary least-squares method
in ascending order (see Figure 3-3).

h2o_linear_model_feature_importance = h2o_linear_model.varimp_plot()
h2o_linear_model_feature_importance
Listing 3-31 H2O Ordinary Least-Squares Method’s Feature Importance
Figure 3-3 H2O ordinary least-squares method’s feature importance
Listing 3-32 assesses the H2O ordinary least-squares method.

h2o_linear_model_assessment = h2o_linear_model.model_performance()
print(h2o_linear_model_assessment)

ModelMetricsRegressionGLM: glm
** Reported on train data. **

MSE: 24844.712331260016
RMSE: 157.6220553452467
MAE: 101.79904883889066
RMSLE: NaN
R^2: 0.7004468136072375
Mean Residual Deviance: 24844.712331260016
Null degrees of freedom: 7325
Residual degrees of freedom: 7304
Null deviance: 607612840.7465751
Residual deviance: 182012362.53881088
AIC: 94978.33944003603
Listing 3-32 Assess H2O Ordinary Least-Squares Method

Listing 3-33 improves the performance of the H2O ordinary least-squares method by specifying
remove_collinear_columns as True.

h2o_linear_model_collinear_removed = H2OGeneralizedLinearEstimator(family="gau
0,remove_collinear_columns = True)
h2o_linear_model_collinear_removed.train(x=x,y=y,training_frame=h2o_training_d
Listing 3-33 Improve the Performance of the Ordinary Least-Squares Method

Listing 3-34 assesses the H2O ordinary least-squares method.


h2o_linear_model_collinear_removed_assessment = h2o_linear_model_collinear_removed.model_performance()
print(h2o_linear_model_collinear_removed)

MSE: 23380.71864337616
RMSE: 152.9075493341521
MAE: 102.53007935777588
RMSLE: NaN
R^2: 0.7180982143647627
Mean Residual Deviance: 23380.71864337616
Null degrees of freedom: 7325
Residual degrees of freedom: 7304
Null deviance: 607612840.7465751
Residual deviance: 171287144.78137374
AIC: 94533.40762597627

ModelMetricsRegressionGLM: glm
** Reported on validation data. **

MSE: 25795.936313899092
RMSE: 160.6111338416459
MAE: 103.18677222520363
RMSLE: NaN
R^2: 0.7310558588001701
Mean Residual Deviance: 25795.936313899092
Null degrees of freedom: 875
Residual degrees of freedom: 854
Null deviance: 84181020.04623385
Residual deviance: 22597240.210975606
AIC: 11430.364002305443
Listing 3-34 Assess the H2O Ordinary Least-Squares Method

Conclusion
This chapter executed three key machine learning frameworks (Scikit-Learn, PySpark, and H2O) to model
data and spawn a continuous output feature using a linear method. It also explored ways of assessing that
method.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
T. C. Nokeri, Data Science Solutions with Python
https://doi.org/10.1007/978-1-4842-7762-1_4

4. Survival Analysis with PySpark and Lifelines


Tshepo Chris Nokeri1
(1) Pretoria, South Africa

This chapter describes and executes several survival analysis methods using the main Python frameworks
(i.e., Lifelines and PySpark). It begins by explaining the underlying concept behind the Cox Proportional
Hazards model. It then introduces the accelerated failure time method.

Exploring Survival Analysis


Survival methods are common in the manufacturing, insurance, and medical science fields. They are
convenient for properly assessing risk when an independent investigation is carried out over long periods
and subjects enter and leave at given times.
These methods are employed in this chapter to determine the probabilities of a machine failing and
patients surviving an illness, among other applications. They are equally suitable for censored data
(subjects whose event times are only partially observed).
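
As a minimal sketch of how censored observations are represented, the following fits a Kaplan-Meier estimator with the Lifelines framework; the durations and event indicators are made up for illustration (1 means the event was observed, 0 means the subject was censored).

from lifelines import KaplanMeierFitter

durations = [5, 8, 12, 12, 15, 20, 22, 30]       # made-up follow-up times
event_observed = [1, 1, 0, 1, 0, 1, 0, 1]        # 1 = event observed, 0 = censored

km_fitter = KaplanMeierFitter()
km_fitter.fit(durations, event_observed)
print(km_fitter.survival_function_)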

Exploring Cox Proportional Hazards Method


The Cox Proportional Hazards method is the best survival method for handling censored data with subjects
that have related changes. It is similar to the Mantel-Haenszel method and expects the hazard rate to be a
function of time. It states that the rate is a function of covariates. Equation 4-1 defines the Cox Proportional
Hazards method.

h(t | x) = h₀(t) exp(β₁x₁ + β₂x₂ + ⋯ + βₚxₚ)   (Equation 4-1)

where h₀(t) is the baseline hazard and x₁, …, xₚ are the covariates.

Listing 4-1 attains the necessary data from a Microsoft Excel file.

import pandas as pd
initial_data = pd.read_excel(r"filepath\survival_data.xlsx", index_col=[0])
Listing 4-1 Attain the Data

Listing 4-2 finds the ratio for dividing the data.

int(initial_data.shape[0]) * 0.8

345.6
Listing 4-2 Find the Ratio for Dividing the Data

Listing 4-3 divides the data.

lifeline_training_data = initial_data.loc[:346]
lifeline_test_data = initial_data.loc[346:]
Listing 4-3 Divide the Data

Lifeline in Action
This section executes and assesses the Cox Proportional Hazards method with the Lifeline framework.
Listing 4-4 executes the Lifeline Cox Proportional Hazards method.

from lifelines import CoxPHFitter


lifeline_cox_method = CoxPHFitter()
lifeline_cox_method.fit(lifeline_training_data, initial_data.columns[0],
initial_data.columns[1])
Listing 4-4 Execute the Lifeline Cox Proportional Hazards Method

Listing 4-5 computes the test statistics (see Table 4-1) and assesses the Lifeline Cox Proportional
Hazards method with scaled Schoenfeld residuals, which help disclose any abnormalities (see
Figure 4-1).

lifeline_cox_method_test_statistics_schoenfeld = lifeline_cox_method.check_assumptions(
    lifeline_training_data, show_plots=True)
lifeline_cox_method_test_statistics_schoenfeld
Listing 4-5 Compute the Lifeline Cox Proportional Hazards Method’s Test Statistics and Residuals

Table 4-1 Test Statistics for the Lifeline Cox Proportional Hazards Method

Feature  Test  Statistic  p
Age      km    10.53      <0.005
Age      rank  10.78      <0.005
Fin      km    0.12       0.73
Fin      rank  0.14       0.71
Mar      km    0.18       0.67
Mar      rank  0.20       0.66
Paro     km    0.13       0.72
Paro     rank  0.11       0.74
Prio     km    0.49       0.48
Prio     rank  0.47       0.49
Race     km    0.34       0.56
Race     rank  0.37       0.54
Wexp     km    11.91      <0.005
Wexp     rank  11.61      <0.005
Figure 4-1 Scaled Schoenfeld residuals of age
Listing 4-6 determines the Lifeline Cox Proportional Hazards method’s assessment summary (see Table
4-2).

lifeline_cox_method_assessment_summary = lifeline_cox_method.print_summary()
lifeline_cox_method_assessment_summary
Listing 4-6 Compute the Assessment Summary

Table 4-2 Summary of the Cox Proportional Hazards

Feature  Coef  Exp(coef)  Se(coef)  Coef lower 95%  Coef upper 95%  Exp(coef) lower 95%  Exp(coef) upper 95%  Z  P  -log2(p)
Fin -0.71 0.49 0.23 -1.16 -0.27 0.31 0.77 -3.13 <0.005 9.14
Age -0.03 0.97 0.02 -0.08 0.01 0.93 1.01 -1.38 0.17 2.57
Race 0.39 1.48 0.37 -0.34 1.13 0.71 3.09 1.05 0.30 1.76
Wexp -0.11 0.90 0.24 -0.59 0.37 0.56 1.44 -0.45 0.65 0.62
Mar -1.15 0.32 0.61 -2.34 0.04 0.10 1.04 -1.90 0.06 4.11
Paro 0.07 1.07 0.23 -0.37 0.51 0.69 1.67 0.31 0.76 0.40
Prio 0.10 1.11 0.03 0.04 0.16 1.04 1.17 3.24 <0.005 9.73

Listing 4-7 determines the log test confidence interval for each feature in the data (see Figure 4-2).

lifeline_cox_log_test_ci = lifeline_cox_method.plot()
lifeline_cox_log_test_ci
Listing 4-7 Execute the Lifeline Cox Proportional Hazards Method
Figure 4-2 Log test confidence interval

Exploring the Accelerated Failure Time Method


The accelerated failure time method models the censored data with a log-linear function to describe the log
of the survival time. Likewise, it also assumes each instance is independent.

PySpark in Action
This section executes the accelerated failure time method with the PySpark framework.
Listing 4-8 runs the PySpark framework with the findspark framework.

import findspark as initiate_pyspark


initiate_pyspark.init(r"filepath\spark-3.0.0-bin-hadoop2.7")
Listing 4-8 Prepare the PySpark Framework

Listing 4-9 stipulates the PySpark app using the SparkConf() method.

from pyspark import SparkConf


pyspark_configuration = SparkConf().setAppName("pyspark_aft_method").setMaster("local")
Listing 4-9 Stipulate the PySpark App

Listing 4-10 prepares the PySpark session using the SparkSession() method.

from pyspark import SparkContext
from pyspark.sql import SparkSession

pyspark_context = SparkContext(conf=pyspark_configuration)  # uses the configuration from Listing 4-9
pyspark_session = SparkSession(pyspark_context)
Listing 4-10 Prepare the Spark Session

Listing 4-11 changes the pandas dataframe created earlier in this chapter to a PySpark dataframe using
the createDataFrame() method.
pyspark_initial_data = pyspark_session.createDataFrame(initial_data)
Listing 4-11 Change the Pandas Dataframe to a PySpark Dataframe
Listing 4-12 creates a list of independent features. It then converts
the data using the VectorAssembler() method for modeling with the PySpark framework.

x_list = list(initial_data.iloc[::, 0:9].columns)


from pyspark.ml.feature import VectorAssembler
pyspark_data_columns = x_list
pyspark_vector_assembler = VectorAssembler(inputCols=pyspark_data_columns,
outputCol="variables")
pyspark_data = pyspark_vector_assembler.transform(pyspark_initial_data)
Listing 4-12 Transform the Data

Listing 4-13 executes the PySpark accelerated failure time method.

from pyspark.ml.regression import AFTSurvivalRegression


pyspark_accelerated_failure_method = AFTSurvivalRegression(censorCol=pyspark_data.columns[1],
    labelCol=pyspark_data.columns[0], featuresCol="variables")
pyspark_accelerated_failure_method_fitted = pyspark_accelerated_failure_method.fit(pyspark_data)
Listing 4-13 Execute the PySpark Accelerated Failure Time Method

Listing 4-14 computes the PySpark accelerated failure time method’s predictions.

pyspark_yhat = pyspark_accelerated_failure_method_fitted.transform(pyspark_data).select(
    pyspark_data.columns[1], "prediction")
pyspark_yhat.show()

+------+------------------+
|arrest| prediction|
+------+------------------+
| 1|18.883982665910125|
| 1| 16.88228128814963|
| 1|22.631360777172517|
| 0|373.13041474613107|
| 0| 377.2238319806288|
| 0| 375.8326538406928|
| 1| 20.9780526816987|
| 0| 374.6420738270714|
| 0| 379.7483494080467|
| 0| 376.1601473382181|
| 0| 377.1412349521787|
| 0| 373.7536844216336|
| 1| 36.36443059383637|
| 0|374.14261327949384|
| 1| 22.98494042401171|
| 1| 50.61463874375869|
| 1| 25.56399364288275|
| 0|379.61997114629696|
| 0| 384.3322960430372|
| 0|376.37634062210844|
+------+------------------+
Listing 4-14 Compute the PySpark Accelerated Failure Time Method’s Predictions

Listing 4-15 computes the PySpark accelerated failure time method’s coefficients.
pyspark_accelerated_failure_method_fitted.coefficients

DenseVector([0.0388, -1.7679, -0.0162, -0.0003, 0.0098, -0.0086, -0.0026,


0.0115, 0.0003])
Listing 4-15 Compute the PySpark Accelerated Failure Time Method’s Coefficients

Conclusion
This chapter executed two key machine learning frameworks (Lifelines and PySpark) to model censored data
with the Cox proportional hazards and accelerated failure time methods.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
T. C. Nokeri, Data Science Solutions with Python
https://doi.org/10.1007/978-1-4842-7762-1_5

5. Nonlinear Modeling With Scikit-Learn, PySpark, and H2O


Tshepo Chris Nokeri1
(1) Pretoria, South Africa

This chapter executes and appraises a nonlinear method for binary classification (logistic regression)
using a diverse set of comprehensive Python frameworks (i.e., Scikit-Learn, PySpark MLlib, and H2O). To begin,
it clarifies the underlying concept behind the sigmoid function.

Exploring the Logistic Regression Method


The logistic regression method accepts input values and models them by executing a sigmoid function
to anticipate the values of a categorical output feature. Equation 5-1 defines the sigmoid function,
which applies to logistic regression (also see Figure 5-1).

S(x) = 1 / (1 + e^(-x)) (Equation 5-1)

Both Equation 5-1 and Figure 5-1 show that the function maps any input to an output between 0 and 1, which is then thresholded to anticipate binary output values.

Figure 5-1 Sigmoid function
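
As an illustration, the short sketch below (not part of the original listings) evaluates the sigmoid function at a few points to show that its outputs fall strictly between 0 and 1.

import numpy as np

def sigmoid(x):
    # Map any real-valued input into the (0, 1) interval.
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-5.0, -1.0, 0.0, 1.0, 5.0])))
# Values above a chosen threshold (commonly 0.5) are assigned to the positive class.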

Listing 5-1 attains the necessary data from a CSV file using the pandas framework.
import pandas as pd
df = pd.read_csv(r"C:\Users\i5 lenov\Downloads\banking.csv")
Listing 5-1 Attain the Data
Listing 5-2 stipulates the names of the columns to drop and then executes the drop() method, setting the
axis argument to "columns" so that the unnecessary columns are dropped from the data.

drop_column_names = df.columns[[8, 9, 10]]


initial_data = df.drop(drop_column_names, axis="columns")
Listing 5-2 Drop Unnecessary Features in the Data

Listing 5-3 attains dummy values for categorical features in the data.

initial_data.iloc[::, 1] = pd.get_dummies(initial_data.iloc[::, 1])


initial_data.iloc[::, 2] = pd.get_dummies(initial_data.iloc[::, 2])
initial_data.iloc[::, 3] = pd.get_dummies(initial_data.iloc[::, 3])
initial_data.iloc[::, 4] = pd.get_dummies(initial_data.iloc[::, 4])
initial_data.iloc[::, 5] = pd.get_dummies(initial_data.iloc[::, 5])
initial_data.iloc[::, 6] = pd.get_dummies(initial_data.iloc[::, 6])
initial_data.iloc[::, 7] = pd.get_dummies(initial_data.iloc[::, 7])
initial_data.iloc[::, 11] = pd.get_dummies(initial_data.iloc[::, 11])
Listing 5-3 Attain Dummy Features
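
Note that pd.get_dummies() on a single column returns one indicator column per category. If a full one-hot encoding is preferred, the sketch below is an alternative to Listing 5-3 (not the author's approach), applied to the dataframe produced by Listing 5-2; it assumes the categorical features are the object-typed columns, which the original listing does not confirm.

# Alternative sketch (assumption: the categorical features are the object-typed columns).
categorical_column_names = list(initial_data.select_dtypes(include="object").columns)
# Expand every categorical column into its full set of dummy indicator columns.
initial_data_encoded = pd.get_dummies(initial_data, columns=categorical_column_names, drop_first=True)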

Listing 5-4 drops the null values.

initial_data = initial_data.dropna()
Listing 5-4 Drop Null Values

Scikit-Learn in Action
This section executes and assesses the logistic regression method with the Scikit-Learn framework. Listing
5-5 outlines the independent and dependent features.

import numpy as np
x = np.array(initial_data.iloc[::, 0:17])
y = np.array(initial_data.iloc[::,-1])
Listing 5-5 Outline the Features

Listing 5-6 randomly divides the dataframe.

from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,
random_state=0)
Listing 5-6 Randomly Divide the Dataframe

Listing 5-7 scales the independent features.

from sklearn.preprocessing import StandardScaler


sk_standard_scaler = StandardScaler()
sk_standard_scaled_x_train = sk_standard_scaler.fit_transform(x_train)
sk_standard_scaled_x_test = sk_standard_scaler.transform(x_test)
Listing 5-7 Scale Independent Features

Listing 5-8 executes the Scikit-Learn logistic regression method.

from sklearn.linear_model import LogisticRegression


sk_logistic_regression_method = LogisticRegression()
sk_logistic_regression_method.fit(sk_standard_scaled_x_train, y_train)
Listing 5-8 Execute the Scikit-Learn Logistic Regression Method
Listing 5-9 determines the best hyperparameters for the Scikit-Learn logistic regression method.

from sklearn.model_selection import GridSearchCV


# Note: the l1 penalty requires a solver such as liblinear or saga; the default
# lbfgs solver only supports l2.
sk_logistic_regression_method_param = {"penalty": ("l1", "l2")}
sk_logistic_regression_method_param_mod = GridSearchCV(
    estimator=sk_logistic_regression_method,
    param_grid=sk_logistic_regression_method_param,
    n_jobs=-1)
sk_logistic_regression_method_param_mod.fit(sk_standard_scaled_x_train, y_train)
print("Best logistic regression score: ",
      sk_logistic_regression_method_param_mod.best_score_)
print("Best logistic regression parameter: ",
      sk_logistic_regression_method_param_mod.best_params_)
Best logistic regression score: 0.8986039453717755
Best logistic regression parameter: {'penalty': 'l2'}
Listing 5-9 Determine the Best Hyperparameters for the Scikit-Learn Logistic Regression Method

Listing 5-10 executes the logistic regression method with the Scikit-Learn framework.

sk_logistic_regression_method = LogisticRegression(penalty="l2")
sk_logistic_regression_method.fit(sk_standard_scaled_x_train, y_train)
Listing 5-10 Execute the Scikit-Learn Logistic Regression Method

Listing 5-11 computes the logistic regression method’s intercept.

print(sk_logistic_regression_method.intercept_)
[-2.4596243]
Listing 5-11 Compute the Logistic Regression Method’s Intercept

Listing 5-12 computes the coefficients.

print(sk_logistic_regression_method.coef_)
[[ 0.03374725 0.04330667 -0.01305369 -0.02709009 0.13508899 0.01735913
0.00816758 0.42948983 -0.12670658 -0.25784955 -0.04025993 -0.14622466
-1.14143485 0.70803518 0.23256046 -0.02295578 -0.02857435]]
Listing 5-12 Compute the Logistic Regression Method’s Coefficients

Listing 5-13 computes the Scikit-Learn logistic regression method’s confusion matrix, which counts the
two forms of error (false positives and false negatives) alongside the correct predictions (true positives
and true negatives). See Table 5-1.

from sklearn import metrics


# Compute the test-set class predictions used by the assessments below.
sk_yhat = sk_logistic_regression_method.predict(sk_standard_scaled_x_test)
sk_logistic_regression_method_assessment_1 = pd.DataFrame(
    metrics.confusion_matrix(y_test, sk_yhat),
    index=["Actual: Deposit", "Actual: No deposit"],
    columns=["Predicted: Deposit", "Predicted: No deposit"])
print(sk_logistic_regression_method_assessment_1)
Listing 5-13 Compute the Scikit-Learn Logistic Regression Method’s Confusion Matrix

Table 5-1 Scikit-Learn Logistic Regression Method’s Confusion Matrix

                     Predicted: Deposit   Predicted: No Deposit
Actual: Deposit      7230                 95
Actual: No Deposit   711                  202
Listing 5-14 computes the appropriate classification report (see Table 5-2).

sk_logistic_regression_method_assessment_2 = pd.DataFrame(
    metrics.classification_report(y_test, sk_yhat, output_dict=True)).transpose()
print(sk_logistic_regression_method_assessment_2)
Listing 5-14 Compute the Scikit-Learn Logistic Regression Method’s Classification Report

Table 5-2 Scikit-Learn Logistic Regression Method’s Classification Report

Precision Recall F1-score Support


0 0.910465 0.987031 0.947203 7325.000000
1 0.680135 0.221249 0.333884 913.000000
Accuracy 0.902161 0.902161 0.902161 0.902161
Macro Avg 0.795300 0.604140 0.640544 8238.000000
Weighted Avg 0.884938 0.902161 0.879230 8238.000000
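
To connect Table 5-1 and Table 5-2, the short sketch below (illustrative only) recomputes the class-1 precision, recall, and F1-score directly from the confusion matrix counts; the figures match the classification report up to rounding.

# Counts taken from the confusion matrix in Table 5-1, with class 1 as the positive class.
true_positives = 202
false_positives = 95
false_negatives = 711

precision_class_1 = true_positives / (true_positives + false_positives)   # ~0.680
recall_class_1 = true_positives / (true_positives + false_negatives)      # ~0.221
f1_class_1 = 2 * precision_class_1 * recall_class_1 / (precision_class_1 + recall_class_1)  # ~0.334
print(precision_class_1, recall_class_1, f1_class_1)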

Listing 5-15 arranges the Scikit-Learn logistic regression method’s receiver operating characteristic
curve. The curve condenses the trade-off between the true positive rate (the proportion of positive classes the
method correctly identifies) and the false positive rate (the proportion of negative classes the method
incorrectly flags as positive). See Figure 5-2.

import matplotlib.pyplot as plt


sk_yhat_proba = sk_logistic_regression_method.predict_proba(sk_standard_scaled_x_test)[::, 1]
fpr_sk_logistic_regression_method, tpr_sk_logistic_regression_method, _ = metrics.roc_curve(y_test, sk_yhat_proba)
area_under_curve_sk_logistic_regression_method = metrics.roc_auc_score(y_test, sk_yhat_proba)
plt.plot(fpr_sk_logistic_regression_method, tpr_sk_logistic_regression_method,
         label="AUC= " + str(area_under_curve_sk_logistic_regression_method))
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.legend(loc="best")
plt.show()
Listing 5-15 Receiver Operating Characteristics Curve for the Scikit-Learn Logistic Regression Method
Figure 5-2 Receiver operating characteristics curve for the Scikit-Learn logistic regression method
Listing 5-16 arranges the Scikit-Learn logistic regression method’s precision-recall curve to condense
the trade-off between precision and recall (see Figure 5-3).

p_sk_logistic_regression_method, r_sk_logistic_regression_method, _ = metrics.precision_recall_curve(y_test, sk_yhat)
weighted_ps_sk_logistic_regression_method = metrics.roc_auc_score(y_test, sk_yhat)
# Plot recall on the x-axis and precision on the y-axis to match the labels below.
plt.plot(r_sk_logistic_regression_method, p_sk_logistic_regression_method,
         label="WPR= " + str(weighted_ps_sk_logistic_regression_method))
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend(loc="best")
plt.show()
Listing 5-16 Precision-Recall Curve for the Scikit-Learn Logistic Regression Method