Learning Apache Spark with Python

Wenqiang Feng

December 05, 2021


CONTENTS

1 Preface
  1.1 About
  1.2 Motivation for this tutorial
  1.3 Copyright notice and license info
  1.4 Acknowledgement
  1.5 Feedback and suggestions

2 Why Spark with Python?
  2.1 Why Spark?
  2.2 Why Spark with Python (PySpark)?

3 Configure Running Platform
  3.1 Run on Databricks Community Cloud
  3.2 Configure Spark on Mac and Ubuntu
  3.3 Configure Spark on Windows
  3.4 PySpark With Text Editor or IDE
  3.5 PySparkling Water: Spark + H2O
  3.6 Set up Spark on Cloud
  3.7 PySpark on Colaboratory
  3.8 Demo Code in this Section

4 An Introduction to Apache Spark
  4.1 Core Concepts
  4.2 Spark Components
  4.3 Architecture
  4.4 How Spark Works?

5 Programming with RDDs
  5.1 Create RDD
  5.2 Spark Operations
  5.3 rdd.DataFrame vs pd.DataFrame

6 Statistics and Linear Algebra Preliminaries
  6.1 Notations
  6.2 Linear Algebra Preliminaries
  6.3 Measurement Formula
  6.4 Confusion Matrix
  6.5 Statistical Tests

7 Data Exploration
  7.1 Univariate Analysis
  7.2 Multivariate Analysis

8 Data Manipulation: Features
  8.1 Feature Extraction
  8.2 Feature Transform
  8.3 Feature Selection
  8.4 Unbalanced data: Undersampling

9 Regression
  9.1 Linear Regression
  9.2 Generalized linear regression
  9.3 Decision tree Regression
  9.4 Random Forest Regression
  9.5 Gradient-boosted tree regression

10 Regularization
  10.1 Ordinary least squares regression
  10.2 Ridge regression
  10.3 Least Absolute Shrinkage and Selection Operator (LASSO)
  10.4 Elastic net

11 Classification
  11.1 Binomial logistic regression
  11.2 Multinomial logistic regression
  11.3 Decision tree Classification
  11.4 Random forest Classification
  11.5 Gradient-boosted tree Classification
  11.6 XGBoost: Gradient-boosted tree Classification
  11.7 Naive Bayes Classification

12 Clustering
  12.1 K-Means Model

13 RFM Analysis
  13.1 RFM Analysis Methodology
  13.2 Demo
  13.3 Extension

14 Text Mining
  14.1 Text Collection
  14.2 Text Preprocessing
  14.3 Text Classification
  14.4 Sentiment analysis
  14.5 N-grams and Correlations
  14.6 Topic Model: Latent Dirichlet Allocation

15 Social Network Analysis
  15.1 Introduction
  15.2 Co-occurrence Network
  15.3 Appendix: matrix multiplication in PySpark
  15.4 Correlation Network

16 ALS: Stock Portfolio Recommendations
  16.1 Recommender systems
  16.2 Alternating Least Squares
  16.3 Demo

17 Monte Carlo Simulation
  17.1 Simulating Casino Win
  17.2 Simulating a Random Walk

18 Markov Chain Monte Carlo
  18.1 Metropolis algorithm
  18.2 A Toy Example of Metropolis
  18.3 Demos

19 Neural Network
  19.1 Feedforward Neural Network

20 Automation for Cloudera Distribution Hadoop
  20.1 Automation Pipeline
  20.2 Data Clean and Manipulation Automation
  20.3 ML Pipeline Automation
  20.4 Save and Load PipelineModel
  20.5 Ingest Results Back into Hadoop

21 Wrap PySpark Package
  21.1 Package Wrapper
  21.2 Package Publishing on PyPI

22 PySpark Data Audit Library
  22.1 Install with pip
  22.2 Install from Repo
  22.3 Uninstall
  22.4 Test
  22.5 Auditing on Big Dataset

23 Zeppelin to jupyter notebook
  23.1 How to Install
  23.2 Converting Demos

24 My Cheat Sheet

25 JDBC Connection
  25.1 JDBC Driver
  25.2 JDBC read
  25.3 JDBC write
  25.4 JDBC temp_view

26 Databricks Tips
  26.1 Display samples
  26.2 Auto files download
  26.3 Working with AWS S3
  26.4 delta format
  26.5 mlflow

27 PySpark API
  27.1 Stat API
  27.2 Regression API
  27.3 Classification API
  27.4 Clustering API
  27.5 Recommendation API
  27.6 Pipeline API
  27.7 Tuning API
  27.8 Evaluation API

28 Main Reference

Bibliography

Python Module Index

Index


Welcome to my Learning Apache Spark with Python note! In this note, you will learn a wide array of concepts about PySpark in Data Mining, Text Mining, Machine Learning and Deep Learning. The PDF version can be downloaded from HERE.


CHAPTER ONE

PREFACE

1.1 About

1.1.1 About this note

This is a shared repository for Learning Apache Spark Notes. The PDF version can be downloaded from HERE. The first version was posted on GitHub in ChenFeng [Feng2017]. This shared repository mainly contains the self-learning and self-teaching notes from Wenqiang during his IMA Data Science Fellowship. The reader is referred to the repository https://github.com/runawayhorse001/LearningApacheSpark for more details about the dataset and the .ipynb files.

In this repository, I try to use detailed demo code and examples to show how to use each main function. If you find that your work wasn't cited in this note, please feel free to let me know.

Although I am by no means a data mining programming and Big Data expert, I decided that it would be useful for me to share what I learned about PySpark programming in the form of easy tutorials with detailed examples. I hope those tutorials will be a valuable tool for your studies.

The tutorials assume that the reader has preliminary knowledge of programming and Linux. This document is generated automatically by using Sphinx.

1.1.2 About the author

• Wenqiang Feng
  - Director of Data Science and PhD in Mathematics
  - University of Tennessee at Knoxville
  - Email: von198@gmail.com

• Biography

Wenqiang Feng is the Director of Data Science at American Express (AMEX). Prior to his time at AMEX, Dr. Feng was a Sr. Data Scientist in the Machine Learning Lab at H&R Block. Before joining Block, Dr. Feng was a Data Scientist at the Applied Analytics Group, DST (now SS&C). Dr. Feng's responsibilities include providing clients with access to cutting-edge skills and technologies, including Big Data analytic solutions, advanced analytic and data enhancement techniques, and modeling.

Dr. Feng has deep analytic expertise in data mining, analytic systems, machine learning algorithms, business intelligence, and applying Big Data tools to strategically solve industry problems in a cross-functional business. Before joining DST, Dr. Feng was an IMA Data Science Fellow at The Institute for Mathematics and its Applications (IMA) at the University of Minnesota. While there, he helped startup companies make marketing decisions based on deep predictive analytics. Dr. Feng graduated from the University of Tennessee, Knoxville, with a Ph.D.
in Computational Mathematics and a Master's degree in Statistics. He also holds a Master's degree in Computational Mathematics from Missouri University of Science and Technology (MST) and a Master's degree in Applied Mathematics from the University of Science and Technology of China (USTC).

• Declaration

The work of Wenqiang Feng was supported by the IMA while he was working at the IMA. However, any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the IMA, UTK, DST, H&R Block and AMEX.

1.2 Motivation for this tutorial

I was motivated by the IMA Data Science Fellowship project to learn PySpark. After that I was impressed and attracted by PySpark, and I found that:

1. It is no exaggeration to say that Spark is one of the most powerful Big Data tools.
2. However, I still found that learning Spark was a difficult process. I had to Google and sort out which answers were correct, and it was hard to find detailed examples from which I could easily learn the full process in one file.
3. Good sources are expensive for a graduate student.

1.3 Copyright notice and license info

This Learning Apache Spark with Python PDF file is supposed to be a free and living document, which is why its source is available online at https://runawayhorse001.github.io/LearningApacheSpark/pyspark.pdf. But this document is licensed according to both the MIT License and the Creative Commons Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0) License.

When you plan to use, copy, modify, merge, publish, distribute or sublicense it, please see the terms of those licenses for more details and give the corresponding credits to the author.

1.4 Acknowledgement

I would like to thank Ming Chen, Jian Sun and Zhongbo Li at the University of Tennessee at Knoxville for the valuable discussions, and to thank the generous anonymous authors for providing detailed solutions and source code on the internet. Without their help, this repository would not have been possible. Wenqiang also would like to thank the Institute for Mathematics and Its Applications (IMA) at the University of Minnesota, Twin Cities for support during his IMA Data Scientist Fellow visit, and to thank TAN THIAM HUAT and Mark Rabins for finding the typos.

A special thank you goes to Dr. Haiping Lu, Lecturer in Machine Learning at the Department of Computer Science, University of Sheffield, for recommending and heavily using my tutorial in his teaching classes and for the valuable suggestions.

1.5 Feedback and suggestions

Your comments and suggestions are highly appreciated. I am more than happy to receive corrections, suggestions or feedback through email (von198@gmail.com) for improvements.


CHAPTER TWO

WHY SPARK WITH PYTHON?

Chinese proverb

Sharpening the knife longer can make it easier to hack the firewood -- old Chinese proverb

I want to answer this question from the following two parts:

2.1 Why Spark?

I think the following four main reasons from the Apache Spark™ official website are good enough to convince you to use Spark.

1. Speed

Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.

Fig. 1: Logistic regression in Hadoop and Spark
2. Ease of Use

Write applications quickly in Java, Scala, Python, R. Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python and R shells.

3. Generality

Combine SQL, streaming, and complex analytics. Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.

Fig. 2: The Spark stack

4. Runs Everywhere

Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

Fig. 3: The Spark platform

2.2 Why Spark with Python (PySpark)?

Whether you like it or not, Python has become one of the most popular programming languages.

Fig. 4: KDnuggets Analytics/Data Science 2017 Software Poll, top tools share 2015-2017, from kdnuggets
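The combination is attractive because Spark's high-level operators map naturally onto ordinary Python functions and lambdas. As a small illustration (a minimal sketch of my own, not one of this book's demos; the input path below is just a placeholder), a distributed word count takes only a few lines of PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WhyPySparkExample").getOrCreate()
sc = spark.sparkContext

counts = (sc.textFile("data/notes.txt")            # load a text file as an RDD of lines
            .flatMap(lambda line: line.split())    # split each line into words
            .map(lambda word: (word, 1))           # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b))      # sum the counts per word

print(counts.take(5))                              # action: bring a few results back to the driver

The same pipeline can also be written interactively, one step at a time, in the PySpark shell or a notebook, which is how most of the demos in this note are run.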
CHAPTER THREE

CONFIGURE RUNNING PLATFORM

Chinese proverb

Good tools are prerequisite to the successful execution of a job. -- old Chinese proverb

A good programming platform can save you lots of trouble and time. Here I will only present how to install my favorite programming platform, and I will only show the easiest way I know to set it up on a Linux system. If you want to install it on another operating system, you can Google it. In this section, you may learn how to set up PySpark on the corresponding programming platform and package.

3.1 Run on Databricks Community Cloud

If you don't have any experience with Linux or Unix operating systems, I would recommend using Spark on Databricks Community Cloud, since you do not need to set up Spark yourself and it is totally free for the Community Edition. Please follow the steps listed below.

1. Sign up for an account at: https://community.cloud.databricks.com/login.html

2. Sign in with your account; then you can create your cluster (machine), table (dataset) and notebook (code).

3. Create your cluster, where your code will run.

4. Import your dataset.

Note: You need to save the path which appears at Uploaded to DBFS: /FileStore/tables/05rmhuqv1489687378010/, since we will use this path to load the dataset.

5. Create your notebook.

After finishing the above 5 steps, you are ready to run your Spark code on Databricks Community Cloud. I will run all the following demos on Databricks Community Cloud. Hopefully, when you run the demo code, you will get the following results:

+---+-----+-----+---------+-----+
|_c0|   TV|Radio|Newspaper|Sales|
+---+-----+-----+---------+-----+
|  1|230.1| 37.8|     69.2| 22.1|
|  2| 44.5| 39.3|     45.1| 10.4|
|  3| 17.2| 45.9|     69.3|  9.3|
|  4|151.5| 41.3|     58.5| 18.5|
|  5|180.8| 10.8|     58.4| 12.9|
+---+-----+-----+---------+-----+
only showing top 5 rows

root
 |-- _c0: integer (nullable = true)
 |-- TV: double (nullable = true)
 |-- Radio: double (nullable = true)
 |-- Newspaper: double (nullable = true)
 |-- Sales: double (nullable = true)

3.2 Configure Spark on Mac and Ubuntu

3.2.1 Installing Prerequisites

I strongly recommend installing Anaconda, since it contains most of the prerequisites and supports multiple operating systems.

1. Install Python

Go to Ubuntu Software Center and follow these steps:

a. Open Ubuntu Software Center
b. Search for python
c. Click Install

Or open your terminal and use the following commands:

sudo apt-get install build-essential checkinstall
sudo apt-get install python
sudo easy_install pip
sudo pip install ipython

3.2.2 Install Java

Java is used by many other software packages, so it is quite possible that you have already installed it. You can check with the following command in the terminal:

java -version

Otherwise, you can follow the steps in "How do I install Java for my Mac?" to install Java on Mac, and use the following commands to install it on Ubuntu:

sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer

3.2.3 Install Java SE Runtime Environment

I installed the ORACLE Java JDK.

Warning: The Java and Java SE Runtime Environment installation steps are very important, since Spark runs on the Java Virtual Machine.

You can check whether Java is available and find its version by using the following command in the terminal:

java -version

If your Java is installed successfully, you will get results similar to:

java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
3.2.4 Install Apache Spark

Actually, the pre-built version doesn't need installation; you can use it as soon as you unpack it.

a. Download: You can get the pre-built Apache Spark™ from Download Apache Spark™.

b. Unpack: Unpack the Apache Spark™ to the path where you want to install Spark.

c. Test: Test the prerequisites: change directory to spark-#.#.#-bin-hadoop#.#/bin and run ./pyspark. You should see something like:

Python 2.7.13 |Anaconda 4.4.0 (x86_64)| (default, Dec 20 2016, 23:05:08)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Using Python version 2.7.13 (default, Dec 20 2016 23:05:08)
SparkSession available as 'spark'.

3.2.5 Configure the Spark

a. Mac Operating System: open your bash_profile in the Terminal

vim ~/.bash_profile

and add the following lines to your bash_profile (remember to change the path):

export SPARK_HOME=your_spark_installation_path
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

At last, remember to source your bash_profile:

source ~/.bash_profile

b. Ubuntu Operating System: open your bashrc in the Terminal

vim ~/.bashrc

and add the following lines to your bashrc (remember to change the path):

export SPARK_HOME=your_spark_installation_path
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

At last, remember to source your bashrc:

source ~/.bashrc

3.3 Configure Spark on Windows

Installing open source software on Windows is always a nightmare for me. Thanks to Deelesh Mandloi, you can follow the detailed procedures in the blog post "Getting Started with PySpark on Windows" to install Apache Spark™ on your Windows operating system.

3.4 PySpark With Text Editor or IDE

3.4.1 PySpark With Jupyter Notebook

After you finish the setup steps in Configure Spark on Mac and Ubuntu, you should be able to write and run your PySpark code in a Jupyter notebook.

3.4.2 PySpark With PyCharm

After you finish the setup steps in Configure Spark on Mac and Ubuntu, you should be able to add PySpark to your PyCharm project.

1. Create a new PyCharm project.

2. Go to Project Structure:

   Option 1: File -> Settings -> Project: -> Project Structure
   Option 2: PyCharm -> Preferences -> Project: -> Project Structure

3. Add Content Root: all ZIP files from $SPARK_HOME/python/lib

4. Run your script.

3.4.3 PySpark With Apache Zeppelin

After you finish the setup steps in Configure Spark on Mac and Ubuntu, you should be able to write and run your PySpark code in Apache Zeppelin.

3.4.4 PySpark With Sublime Text

After you finish the setup steps in Configure Spark on Mac and Ubuntu, you should be able to use Sublime Text to write your PySpark code and run it as a normal Python script in the Terminal; a minimal example script is sketched below.
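The original test_pyspark.py file is not reproduced in this note, so here is a minimal stand-in script along the same lines (an assumed example of my own, using the Advertising.csv demo dataset that the rest of this book relies on):

# test_pyspark.py -- minimal stand-in script (assumed content, not the original file)
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("test_pyspark") \
    .getOrCreate()

# read the demo dataset and print a quick summary
df = spark.read.csv('Advertising.csv', header=True, inferSchema=True)
df.show(5)
df.printSchema()

spark.stop()

Save it in your working directory and run it from the Terminal: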
python test_pyspark.py

Then you should get the output results in your terminal.

3.4.5 PySpark With Eclipse

If you want to run PySpark code on Eclipse, you need to add the paths for the External Libraries for your current project as follows:

1. Open the properties of your project.

2. Add the paths for the External Libraries (PyDev - PYTHONPATH, external source folders/zips/jars/eggs outside of the workspace), for example /opt/spark/python.

Then you should be able to run your code on Eclipse with PyDev.

3.5 PySparkling Water: Spark + H2O

1. Download Sparkling Water from: https://s3.amazonaws.com/h2o-release/sparkling-water/rel-2.4/5/index.html

2. Test PySparkling:

unzip sparkling-water-2.4.5.zip
cd ~/sparkling-water-2.4.5/bin
./pysparkling

If you have a correct setup for PySpark, then you will get results similar to:

Using Spark defined in the SPARK_HOME=/Users/dt216661/spark environment variable
Python 3.7.1 (default, Dec 14 2018, 13:28:58)
[GCC 4.2.1 Compatible Apple LLVM 6.0] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Using Python version 3.7.1 (default, Dec 14 2018 13:28:58)
SparkSession available as 'spark'.

3. Set up pysparkling with Jupyter notebook

Add the following alias to your bashrc (Linux systems) or bash_profile (Mac system):

alias pysparkling='PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" ~/sparkling-water-2.4.5/bin/pysparkling'

4. Open pysparkling in the terminal:

pysparkling

3.6 Set up Spark on Cloud

Following the setup steps in Configure Spark on Mac and Ubuntu, you can set up your own cluster in the cloud, for example on AWS or Google Cloud. Actually, those clouds have their own Big Data tools, which you can run directly without any setup, just like Databricks Community Cloud. If you want more details, please feel free to contact me.

3.7 PySpark on Colaboratory

Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud.

3.7.1 Installation

!pip install pyspark

3.7.2 Testing

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
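Once the session exists, a quick sanity check (my own addition, not part of the original Colab demo) is to print the Spark version and display a tiny generated DataFrame:

print(spark.version)     # version string of the installed pyspark package
spark.range(5).show()    # a one-column DataFrame with ids 0..4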
3.8 Demo Code in this Section

The Jupyter notebook can be downloaded from "installation on colab".

• Python Source code

## set up SparkSession
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.format('com.databricks.spark.csv') \
               .options(header='true', inferschema='true') \
               .load("/home/feng/Spark/Code/data/Advertising.csv", header=True)

df.show(5)
df.printSchema()


CHAPTER FOUR

AN INTRODUCTION TO APACHE SPARK

Chinese proverb

Know yourself and know your enemy, and you will never be defeated. -- idiom, from Sunzi's Art of War

4.1 Core Concepts

Most of the following content comes from [Kirillov2016], so the copyright belongs to Anton Kirillov. I refer you to "Apache Spark core concepts, architecture and internals" for more details.

Before diving deep into how Apache Spark works, let's understand the jargon of Apache Spark (a small sketch at the end of this chapter shows how these pieces fit together in code):

• Job: A piece of code which reads some input from HDFS or local storage, performs some computation on the data and writes some output data.

• Stages: Jobs are divided into stages. Stages are classified as Map or Reduce stages (it is easier to understand if you have worked on Hadoop and want to correlate). Stages are divided based on computational boundaries; all computations (operators) cannot run in a single stage, so the work happens over many stages.

• Tasks: Each stage has some tasks, one task per partition. One task is executed on one partition of data on one executor (machine).

• DAG: DAG stands for Directed Acyclic Graph; in the present context it is a DAG of operators.

• Executor: The process responsible for executing a task.

• Master: The machine on which the Driver program runs.

• Slave: The machine on which the Executor program runs.

4.2 Spark Components

1. Spark Driver

   • a separate process to execute user applications
   • creates SparkContext to schedule job execution and negotiate with the cluster manager

2. Executors

   • run tasks scheduled by the driver
   • store computation results in memory, on disk or off-heap
   • interact with storage systems

3. Cluster Manager

   • Mesos
   • YARN
   • Spark Standalone

The Spark Driver contains more components responsible for translating user code into actual jobs executed on the cluster:

[Figure: Spark Driver internals -- the user program builds an RDD operator DAG; the DAGScheduler splits the graph into stages of tasks and submits each stage as it becomes ready; the TaskScheduler launches tasks via the cluster manager and retries failed or straggling tasks; executor threads execute the tasks, and the block manager stores and serves blocks.]
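To connect this jargon with something concrete, here is a small sketch of my own (it is not taken from [Kirillov2016]): the transformations only build up the operator DAG, the shuffle introduced by reduceByKey marks a stage boundary, and the action at the end submits the job, which runs as tasks on the executors (one task per partition).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CoreConceptsSketch").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100), 4)              # 4 partitions -> 4 tasks per stage
pairs = rdd.map(lambda x: (x % 10, 1))           # narrow transformation, stays in the same stage
counts = pairs.reduceByKey(lambda a, b: a + b)   # shuffle -> a new stage boundary

print(counts.toDebugString())                    # lineage / DAG of this RDD (may print as a bytes literal)
print(counts.collect())                          # action: this is what actually submits the job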
CHAPTER FIVE

PROGRAMMING WITH RDDS

Chinese proverb

If you only know yourself, but not your opponent, you may win or may lose. If you know neither yourself nor your enemy, you will always endanger yourself. -- idiom, from Sunzi's Art of War

RDD stands for Resilient Distributed Dataset. An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions (a similar pattern with smaller sets), which may be computed on different nodes of the cluster.

5.1 Create RDD

Usually, there are two popular ways to create RDDs: loading an external dataset, or distributing a collection of objects. The following examples show some of the simplest ways to create RDDs by using the parallelize() function, which takes an already existing collection in your program and passes it to the SparkContext.

1. By using the parallelize() function

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark create RDD example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.sparkContext.parallelize([(1, 2, 3, 'a b c'),
                                     (4, 5, 6, 'd e f'),
                                     (7, 8, 9, 'g h i')]) \
                       .toDF(['col1', 'col2', 'col3', 'col4'])

Then you will get the RDD data:

df.show()

+----+----+----+-----+
|col1|col2|col3| col4|
+----+----+----+-----+
|   1|   2|   3|a b c|
|   4|   5|   6|d e f|
|   7|   8|   9|g h i|
+----+----+----+-----+

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark create RDD example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

myData = spark.sparkContext.parallelize([(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)])

Then you will get the RDD data:

myData.collect()

[(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]

2. By using the createDataFrame() function

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark create RDD example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

Employee = spark.createDataFrame([('1', 'Joe',   '70000', '1'),
                                  ('2', 'Henry', '80000', '2'),
                                  ('3', 'Sam',   '60000', '2'),
                                  ('4', 'Max',   '90000', '1')],
                                 ['Id', 'Name', 'Salary', 'DepartmentId'])

Then you will get the RDD data:

Employee.show()

+---+-----+------+------------+
| Id| Name|Salary|DepartmentId|
+---+-----+------+------------+
|  1|  Joe| 70000|           1|
|  2|Henry| 80000|           2|
|  3|  Sam| 60000|           2|
|  4|  Max| 90000|           1|
+---+-----+------+------------+

3. By using read and load functions

a. Read dataset from a .csv file

## set up SparkSession
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark create RDD example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.format('com.databricks.spark.csv') \
               .options(header='true', inferschema='true') \
               .load("/home/feng/Spark/Code/data/Advertising.csv", header=True)

df.show(5)
df.printSchema()

+---+-----+-----+---------+-----+
|_c0|   TV|Radio|Newspaper|Sales|
+---+-----+-----+---------+-----+
|  1|230.1| 37.8|     69.2| 22.1|
|  2| 44.5| 39.3|     45.1| 10.4|
|  3| 17.2| 45.9|     69.3|  9.3|
|  4|151.5| 41.3|     58.5| 18.5|
|  5|180.8| 10.8|     58.4| 12.9|
+---+-----+-----+---------+-----+
only showing top 5 rows

root
 |-- _c0: integer (nullable = true)
 |-- TV: double (nullable = true)
 |-- Radio: double (nullable = true)
 |-- Newspaper: double (nullable = true)
 |-- Sales: double (nullable = true)

Once created, RDDs offer two types of operations: transformations and actions.

b. Read dataset from a database

## set up SparkSession
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark create RDD example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

## User information
user = 'your_username'
pw = 'your_password'

## Database information
table_name = 'table_name'
url = 'jdbc:postgresql://##.###.###.##:5432/dataset?user=' + user + '&password=' + pw
properties = {'driver': 'org.postgresql.Driver', 'password': pw, 'user': user}

df = spark.read.jdbc(url=url, table=table_name, properties=properties)

df.show(5)
df.printSchema()

Then you will get the RDD data in the same layout and with the same schema as in the .csv example above.

Note: Reading tables from a database needs the proper driver for the corresponding database. For example, the above demo needs org.postgresql.Driver, and you need to download it and put it in the jars folder of your Spark installation path. I downloaded postgresql-42.1.1.jar from the official website and put it in the jars folder.

c. Read dataset from HDFS

from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext('local', 'example')
hc = HiveContext(sc)

tf1 = sc.textFile("hdfs://cdhstltest/user/data/demo.CSV")
print(tf1.first())

spf = hc.sql("SELECT * FROM spf LIMIT 100")
print(spf.show(5))

5.2 Spark Operations

Warning: All the figures below are from Jeffrey Thompson. The interested reader is referred to "pyspark pictures".

There are two main types of Spark operations: Transformations and Actions [Karau2015].

[Figure: Spark operations, grouped into TRANSFORMATIONS and ACTIONS.]

Note: Some people define three types of operations: Transformations, Actions and Shuffles.
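Before walking through the two catalogues below, a minimal contrast (my own sketch, not part of the original text) shows the practical difference between the two types of operations:

nums = spark.sparkContext.parallelize(range(10))

evens = nums.filter(lambda x: x % 2 == 0)   # transformation: returns a new RDD, nothing runs yet
print(evens.count())                        # action: the computation runs now and returns 5
print(evens.collect())                      # action: returns [0, 2, 4, 6, 8] to the driver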
5.2.1 Spark Transformations

Transformations construct a new RDD from a previous one. For example, one common transformation is filtering data that matches a predicate.

[Figure: Essential core and intermediate Spark operations (transformations) -- general, math/statistical, set theory/relational and data-structure operations, plus the corresponding PairRDD transformations.]

5.2.2 Spark Actions

Actions, on the other hand, compute a result based on an RDD, and either return it to the driver program or save it to an external storage system (e.g., HDFS).

[Figure: Essential core and intermediate Spark operations (actions), plus the corresponding PairRDD actions.]

5.3 rdd.DataFrame vs pd.DataFrame

5.3.1 Create DataFrame

1. From a List

my_list = [['a', 1, 2], ['b', 2, 3], ['c', 3, 4]]
col_name = ['A', 'B', 'C']

# Python
pd.DataFrame(my_list, columns=col_name)

# PySpark
spark.createDataFrame(my_list, col_name).show()

Comparison:

   A  B  C              +---+---+---+
0  a  1  2              |  A|  B|  C|
1  b  2  3              +---+---+---+
2  c  3  4              |  a|  1|  2|
                        |  b|  2|  3|
                        |  c|  3|  4|
                        +---+---+---+

Attention: Pay attention to the parameter columns= in pd.DataFrame. With the default, the second positional argument is used as the index (row labels), not the column names:

# Python: col_name becomes the index
pd.DataFrame(my_list, col_name)

# Python: col_name becomes the column names
pd.DataFrame(my_list, columns=col_name)

2. From a Dict

d = {'A': [0, 1, 0],
     'B': [1, 0, 1],
     'C': [1, 0, 0]}

# Python
pd.DataFrame(d)

# PySpark
spark.createDataFrame(np.array(list(d.values())).T.tolist(), list(d.keys())).show()

Comparison:

   A  B  C              +---+---+---+
0  0  1  1              |  A|  B|  C|
1  1  0  0              +---+---+---+
2  0  1  0              |  0|  1|  1|
                        |  1|  0|  0|
                        |  0|  1|  0|
                        +---+---+---+

5.3.2 Load DataFrame

1. From a Database

Most of the time, you need to share your code with your colleagues or release your code for Code Review or Quality Assurance (QA). You will definitely not want to have your user information in the code, so you can save it in login.txt:

runawayhorse001
PythonTips

and use the following code to import your user information:

try:
    login = pd.read_csv(r'login.txt', header=None)
    user = login[0][0]
    pw = login[0][1]
    print('User information is ready!')
except:
    print('Login information is not available!!!')

# Database information
host = '##.###.###.##'
db_name = 'db_name'
table_name = 'table_name'

Comparison:

# Python
conn = psycopg2.connect(host=host, database=db_name, user=user, password=pw)
cur = conn.cursor()

sql = """
      select *
      from {table_name}
      """.format(table_name=table_name)
dp = pd.read_sql(sql, conn)

# PySpark
url = 'jdbc:postgresql://' + host + ':5432/' + db_name + '?user=' + user + '&password=' + pw
properties = {'driver': 'org.postgresql.Driver', 'password': pw, 'user': user}
ds = spark.read.jdbc(url=url, table=table_name, properties=properties)

Attention: Reading tables from a database with PySpark needs the proper driver for the corresponding database. For example, the above demo needs org.postgresql.Driver, and you need to download it and put it in the jars folder of your Spark installation path. I downloaded postgresql-42.1.1.jar from the official website and put it in the jars folder.

2. From .csv

# Python
dp = pd.read_csv('Advertising.csv')

# PySpark
ds = spark.read.csv(path='Advertising.csv',
                    header=True,
                    inferSchema=True)

3. From .json

Data from: http://api.luftdaten.info/static/v1/data.json

# Python
dp = pd.read_json("data/data.json")

# PySpark
ds = spark.read.json('data/data.json')

# Python
dp[['id', 'timestamp']].head(4)

# PySpark
ds[['id', 'timestamp']].show(4)

          id            timestamp
0 2994551481  2019-02-28 17:23:52
1 2994551482  2019-02-28 17:23:52
2 2994551483  2019-02-28 17:23:52
3 2994551484  2019-02-28 17:23:52
only showing top 4 rows
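Since this whole section compares pandas and PySpark side by side, it is worth adding (my own note, not part of the original text) that you can also convert directly between the two. Keep in mind that toPandas() collects the entire Spark DataFrame to the driver, so it is only safe for small datasets:

# Spark DataFrame -> pandas DataFrame (collects everything to the driver)
dp_from_ds = ds.toPandas()

# pandas DataFrame -> Spark DataFrame
ds_from_dp = spark.createDataFrame(dp)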
5.3.3 First n Rows

Python Code:

# Python
dp.head(4)

# PySpark
ds.show(4)

Comparison:

      TV  Radio  Newspaper  Sales          +-----+-----+---------+-----+
0  230.1   37.8       69.2   22.1          |   TV|Radio|Newspaper|Sales|
1   44.5   39.3       45.1   10.4          +-----+-----+---------+-----+
2   17.2   45.9       69.3    9.3          |230.1| 37.8|     69.2| 22.1|
3  151.5   41.3       58.5   18.5          | 44.5| 39.3|     45.1| 10.4|
                                           | 17.2| 45.9|     69.3|  9.3|
                                           |151.5| 41.3|     58.5| 18.5|
                                           +-----+-----+---------+-----+
                                           only showing top 4 rows

5.3.4 Column Names

# Python
dp.columns

# PySpark
ds.columns

['TV', 'Radio', 'Newspaper', 'Sales']

5.3.5 Data Types

# Python
dp.dtypes

# PySpark
ds.dtypes

TV           float64          [('TV', 'double'),
Radio        float64           ('Radio', 'double'),
Newspaper    float64           ('Newspaper', 'double'),
Sales        float64           ('Sales', 'double')]
dtype: object

5.3.6 Fill Null

my_list = [['male', 1, None], ['female', 2, 3], ['male', 3, 4]]
dp = pd.DataFrame(my_list, columns=['A', 'B', 'C'])
ds = spark.createDataFrame(my_list, ['A', 'B', 'C'])

# Python
dp.head()

# PySpark
ds.show()

Comparison:

        A  B    C          +------+---+----+
0    male  1  NaN          |     A|  B|   C|
1  female  2  3.0          +------+---+----+
2    male  3  4.0          |  male|  1|null|
                           |female|  2|   3|
                           |  male|  3|   4|
                           +------+---+----+

Python Code:

# Python
dp.fillna(-99)

# PySpark
ds.fillna(-99).show()

Comparison:

        A  B     C         +------+---+---+
0    male  1 -99.0         |     A|  B|  C|
1  female  2   3.0         +------+---+---+
2    male  3   4.0         |  male|  1|-99|
                           |female|  2|  3|
                           |  male|  3|  4|
                           +------+---+---+

5.3.7 Replace Values

Python Code:

# Python
dp.A.replace(['male', 'female'], [1, 0], inplace=True)
dp

# PySpark
ds.na.replace(['male', 'female'], ['1', '0']).show()

Comparison:

   A  B    C               +---+---+----+
0  1  1  NaN               |  A|  B|   C|
1  0  2  3.0               +---+---+----+
2  1  3  4.0               |  1|  1|null|
                           |  0|  2|   3|
                           |  1|  3|   4|
                           +---+---+----+

5.3.8 Rename Columns

1. Rename all columns

Python Code:

# Python
dp.columns = ['a', 'b', 'c', 'd']
dp.head(4)

# PySpark
ds.toDF('a', 'b', 'c', 'd').show(4)

2. Rename one or more columns

mapping = {'Newspaper': 'C', 'Sales': 'D'}

# Python
dp.rename(columns=mapping).head(4)

# PySpark
new_names = [mapping.get(col, col) for col in ds.columns]
ds.toDF(*new_names).show(4)

Comparison:

      TV  Radio     C     D          +-----+-----+----+----+
0  230.1   37.8  69.2  22.1          |   TV|Radio|   C|   D|
1   44.5   39.3  45.1  10.4          +-----+-----+----+----+
2   17.2   45.9  69.3   9.3          |230.1| 37.8|69.2|22.1|
3  151.5   41.3  58.5  18.5          | 44.5| 39.3|45.1|10.4|
                                     | 17.2| 45.9|69.3| 9.3|
                                     |151.5| 41.3|58.5|18.5|
                                     +-----+-----+----+----+
                                     only showing top 4 rows

Note: You can also use withColumnRenamed to rename one column in PySpark.

Python Code:

# PySpark
ds.withColumnRenamed('Newspaper', 'Paper').show(4)

+-----+-----+-----+-----+
|   TV|Radio|Paper|Sales|
+-----+-----+-----+-----+
|230.1| 37.8| 69.2| 22.1|
| 44.5| 39.3| 45.1| 10.4|
| 17.2| 45.9| 69.3|  9.3|
|151.5| 41.3| 58.5| 18.5|
+-----+-----+-----+-----+
only showing top 4 rows

5.3.9 Drop Columns

drop_name = ['Newspaper', 'Sales']

Python Code:

# Python
dp.drop(drop_name, axis=1).head(4)

# PySpark
ds.drop(*drop_name).show(4)

Comparison:

      TV  Radio            +-----+-----+
0  230.1   37.8            |   TV|Radio|
1   44.5   39.3            +-----+-----+
2   17.2   45.9            |230.1| 37.8|
3  151.5   41.3            | 44.5| 39.3|
                           | 17.2| 45.9|
                           |151.5| 41.3|
                           +-----+-----+
                           only showing top 4 rows

5.3.10 Filter

dp = pd.read_csv('Advertising.csv')
ds = spark.read.csv(path='Advertising.csv',
                    header=True,
                    inferSchema=True)

Python Code:

# Python
dp[dp.Newspaper < 20].head(4)

# PySpark
ds[ds.Newspaper < 20].show(4)

Comparison:

       TV  Radio  Newspaper  Sales          +-----+-----+---------+-----+
7   120.2   19.6       11.6   13.2          |   TV|Radio|Newspaper|Sales|
8     8.6    2.1        1.0    4.8          +-----+-----+---------+-----+
11  214.7   24.0        4.0   17.4          |120.2| 19.6|     11.6| 13.2|
13   97.5    7.6        7.2    9.7          |  8.6|  2.1|      1.0|  4.8|
                                            |214.7| 24.0|      4.0| 17.4|
                                            | 97.5|  7.6|      7.2|  9.7|
                                            +-----+-----+---------+-----+
                                            only showing top 4 rows

Python Code:
import pyspark.sql.functions as F ds -withColumn('log_tv',®.log(ds-TV)} show (4) 'V | Radio | Newspaper | Sales|., o og_ty TV Radio Newspaper Sales logty o 230 27.8 69.2 22.1 5.438514 220.1) 37.8 69.21 22.11, + §.43851399700132 1 44,5 39.3 4 0.4 3.795489 44.5) 29.3 45.1 44) 3. 7954891891721947 2 45.9 2 9 17.2) 43.9 68.3) 9 38381940 5 18.5 5 151.5) 41.3 58.5] 18.5 CORTES OF ERT FET 52 Chapter 5. Programming with RDDs
