Introduction
Big data refers to collections of data that are vast in size and growing exponentially with time. The size and complexity are so great that no traditional data management tool can store or process it efficiently. Two technologies built to process Big Data at this scale are Hadoop and Spark, and both can be used from Python.
Hadoop
Hadoop is a widely used solution for storing and processing Big Data because it stores huge files in the Hadoop Distributed File System (HDFS) without requiring any schema to be specified up front.
Spark
Apache Spark is a fast, general-purpose engine for large-scale data processing that can run on top of Hadoop or standalone.
Need of Python in Big Data
1. Open Source:
Python is open source and free to use, with a large community maintaining libraries for almost every data processing task.
2. Easy to learn:
Python is very easy to learn, with syntax that reads almost like English. Its code is simple and readable even for beginners, and it has a wide range of applications, including web development, data science, and machine learning.
Spark provides a Python API called PySpark, released by the Apache Spark community to support Python with Spark. Using PySpark, one can easily integrate and work with RDDs from the Python programming language as well.
Spark comes with an interactive Python shell called the PySpark shell. This shell links the Python API to the Spark core and initializes the Spark context. PySpark can also be launched directly from the command line for interactive use.
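As a minimal illustration of the transformation style RDDs expose (map, filter, reduce), here is a sketch in plain Python so it runs without a Spark installation; the data and the threshold are hypothetical, and the comments show the corresponding RDD calls:

```python
from functools import reduce

# Hypothetical sample data standing in for lines of a distributed text file
lines = ["big data with python", "hadoop and spark", "pyspark shell"]

# map: compute the length of each line (like rdd.map(len))
lengths = list(map(len, lines))

# filter: keep only lines longer than 15 characters (like rdd.filter(...))
long_lines = [n for n in lengths if n > 15]

# reduce: sum the remaining lengths (like rdd.reduce(lambda a, b: a + b))
total = reduce(lambda a, b: a + b, long_lines, 0)
print(total)
```

In real PySpark the same chain would be written against an RDD, and Spark would execute each stage in parallel across the cluster.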
Python is also the most popular language for ML/AI because of its simplicity and its rich ecosystem of libraries.
Components of Hadoop:
There are mainly two components of Hadoop: HDFS for distributed storage and MapReduce for distributed processing.
# Install java (Hadoop requires a JDK)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Create the JAVA_HOME variable
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
Step 1: Install Hadoop
# Download hadoop
!wget https://downloads.apache.org/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
# Extract the archive so the commands below can find /usr/local/hadoop-3.3.0
!tar -xzf hadoop-3.3.0.tar.gz -C /usr/local
# Running Hadoop
!/usr/local/hadoop-3.3.0/bin/hadoop
# Copy the sample configuration files into an input directory
!mkdir ~/input
!cp /usr/local/hadoop-3.3.0/etc/hadoop/*.xml ~/input
!ls ~/input
# Run the bundled example MapReduce grep job over the input files
!/usr/local/hadoop-3.3.0/bin/hadoop jar /usr/local/hadoop-3.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar grep ~/input ~/grep_example 'allowed[.]*'
MapReduce
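MapReduce processes data in two phases: a map phase that emits key-value pairs, and a reduce phase that groups pairs by key and aggregates them. A minimal pure-Python word-count sketch (with hypothetical documents) illustrates the idea; a real Hadoop job distributes these same phases across many machines:

```python
from collections import defaultdict

def map_phase(documents):
    # Like a Hadoop mapper: emit a (word, 1) pair for every word
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    # Like a Hadoop reducer: group pairs by key and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Hypothetical input documents
docs = ["big data big tools", "data tools"]
result = reduce_phase(map_phase(docs))
print(result)
```

The grep example run in the Hadoop step above follows the same pattern, except the mapper emits matching lines instead of words.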
Apache Spark
# Install java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Install spark (change the version number if needed)
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
# Unzip the spark file to the current folder
!tar xf spark-3.0.0-bin-hadoop3.2.tgz
# Set your spark folder in your system path environment
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"
# Install findspark using pip
!pip install -q findspark
# Spark for Python (pyspark)
!pip install pyspark
# Make the Spark installation visible to Python
import findspark
findspark.init()
# Importing pyspark
import pyspark
# Importing SparkSession
from pyspark.sql import SparkSession
# Creating a SparkSession: "local[*]" is the master URL (run Spark locally
# on all cores); appName is an arbitrary name for the job
spark = SparkSession.builder.master("local[*]").appName("BigDataWithPython").getOrCreate()
# Printing the version of spark
print("Apache Spark version: ", spark.version)
Conclusion:
In this blog, we studied how Python can be a good and efficient tool for Big Data processing as well. We can integrate the major Big Data tools with Python, which makes data processing easier and faster. Python has become a suitable choice not only for Data Science but also for Big Data processing.