BNI Python Training
EVERYBODY
Data Engineer / Scientist – PT XL AXIATA
• Manages data pipelines with Cloudera technology
• Supports the business with big data
• Helps data scientists build models at large scale
• Experience: 4 years in Python, 3 years in big data
PYTHON INTRODUCTION
Versions:
• 2.*
• 3.*
WHY PYTHON ?
• Simple
• Extensive Support Libraries
• Integration Feature
TOP COMPANIES THAT USE PYTHON
PYTHON TOP LIBRARIES
WHAT PYTHON CAN DO
INTERMEZZO
DATA ANALYTICS CYCLE
[Diagram: the data engineer side (traditional DW, Hadoop tech) feeds data through Python to presentation/reporting on the data scientist side]
OVERVIEW
• Variable
• Operator
• String and functions
• Conditionals
• Iterations
• List
• Tuple
• Dictionary
• File
• Error handling
BASIC PYTHON
JUPYTER NOTEBOOK
• https://colab.research.google.com/
PRIMITIVE VARIABLE
• String: message = 'And now for something completely different'
• Integer: n = 17
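The two primitive variables above can be tried directly in a notebook cell; a minimal sketch:

```python
# Primitive variables: Python infers the type from the assigned value
message = 'And now for something completely different'  # String
n = 17                                                  # Integer

print(type(message).__name__)  # str
print(type(n).__name__)        # int
```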
CONDITIONAL
• One conditional
CHAINED CONDITIONAL
• Two or more conditionals
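A minimal sketch of both forms, using a made-up value for n:

```python
n = 17

# One conditional
if n > 0:
    print('positive')

# Chained conditional: the branches are checked in order, top to bottom
if n % 2 == 0:
    parity = 'even'
elif n % 2 == 1:
    parity = 'odd'
else:
    parity = 'unknown'

print(parity)  # odd
```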
LIST
• Sequence
• Mutable
• Initialize
• Traversing a list
LIST OPERATIONS
• Append
• Extend
LIST AND FUNCTIONS
DELETING ELEMENT
• Remove
• Pop
• del
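The list operations and the three ways of deleting elements can be sketched in one snippet:

```python
t = [1, 2, 3]          # initialize a mutable sequence

t.append(4)            # add one element        -> [1, 2, 3, 4]
t.extend([5, 6])       # add another list's items -> [1, 2, 3, 4, 5, 6]

t.remove(1)            # delete by value        -> [2, 3, 4, 5, 6]
last = t.pop()         # delete by position and return it (default: last)
del t[0]               # delete by position without returning

print(t)               # [3, 4, 5]
print(last)            # 6
```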
TUPLES
• Sequence
• immutable (no append)
• Initialize
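A small sketch showing initialization and why "no append" holds:

```python
t = (1, 2, 3)          # initialize: parentheses (or just commas)

print(t[0])            # 1 -- indexing works like a list

# Tuples are immutable: no append, and items cannot be reassigned
try:
    t[0] = 99
except TypeError:
    print('tuples are immutable')
```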
LIST AND STRINGS
DICTIONARY
• Create dictionary
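A minimal sketch of creating and using a dictionary (the English-to-Spanish example is illustrative):

```python
eng2sp = {}                 # create an empty dictionary
eng2sp['one'] = 'uno'       # add key/value pairs one by one
eng2sp['two'] = 'dos'

# or create it in one step
eng2sp = {'one': 'uno', 'two': 'dos'}

print(eng2sp['one'])        # uno
print('three' in eng2sp)    # False
```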
FILES
• Reading files
• Write to file
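Both directions in one sketch ('demo.txt' is a throwaway filename):

```python
# Write to a file ('w' creates or overwrites), then read it back
with open('demo.txt', 'w') as fout:
    fout.write('hello\n')
    fout.write('world\n')

with open('demo.txt') as fin:
    lines = [line.rstrip() for line in fin]

print(lines)  # ['hello', 'world']
```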
ERROR HANDLING
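A minimal try/except sketch, catching the ValueError that int() raises on bad input:

```python
def safe_int(text):
    # try/except keeps the program running when conversion fails
    try:
        return int(text)
    except ValueError:
        return None

print(safe_int('17'))    # 17
print(safe_int('oops'))  # None
```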
SOURCE
• http://do1.dr-chuck.com/pythonlearn/EN_us/pythonlearn.pdf
PANDAS INTRODUCTION
CREATING DATAFRAME FROM SCRATCH
• Add index
• Search by index
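The steps above can be sketched as follows (the purchases data is made up):

```python
import pandas as pd

# Build a DataFrame from a dict of columns, with an explicit index
purchases = pd.DataFrame(
    {'apples': [3, 2, 0], 'oranges': [0, 3, 7]},
    index=['June', 'Robert', 'Lily'])

# Search by index with .loc
print(purchases.loc['June'])   # the row labelled 'June'
```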
COMMON FILES IN PANDAS
READ DATA FROM CSV
• Statistics columns
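Reading a CSV and getting per-column statistics; a small inline CSV stands in for the real movies file:

```python
import io
import pandas as pd

csv_text = """title,year,revenue_millions
Guardians,2014,333.13
Sing,2016,270.32
"""
df = pd.read_csv(io.StringIO(csv_text), index_col='title')

print(df.describe())                  # count/mean/std/min/max per column
print(df['revenue_millions'].mean())
```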
DATAFRAME SLICING, SELECTING, EXTRACTING
Column-wise
Row-wise
Conditional selections
Equivalent SQL: SELECT * FROM movies_df WHERE director = 'Ridley Scott' OR director = 'Christopher Nolan'
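The same selection in pandas uses boolean masks; note the |, since one film cannot have both directors at once (the toy data is made up):

```python
import pandas as pd

movies = pd.DataFrame({
    'title': ['Alien', 'Gladiator', 'Inception'],
    'director': ['Ridley Scott', 'Ridley Scott', 'Christopher Nolan']})

# Boolean masks combined with | (or)
mask = ((movies['director'] == 'Ridley Scott')
        | (movies['director'] == 'Christopher Nolan'))
print(movies[mask])

# isin() is a shorter equivalent
same = movies[movies['director'].isin(['Ridley Scott', 'Christopher Nolan'])]
```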
Apply aggregation (pivot table)
• dataframe = movies_df_gb
• index = 'year'
• columns = 'new_genre' (row values become new columns)
• values = 'count_genre' (fills the cells)
• aggfunc = aggregation function (max, min, sum, etc.)
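The parameters above map directly onto pd.pivot_table; a toy stand-in for movies_df_gb:

```python
import pandas as pd

# One row per (year, genre) with a count, as produced by the earlier groupby
movies_df_gb = pd.DataFrame({
    'year': [2015, 2015, 2016],
    'new_genre': ['Action', 'Drama', 'Action'],
    'count_genre': [4, 2, 5]})

pvt = pd.pivot_table(movies_df_gb,
                     index='year',         # one row per year
                     columns='new_genre',  # genres become columns
                     values='count_genre', # fills the cells
                     aggfunc='sum').fillna(0)
print(pvt)
```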
PANDAS JOIN
• Create dataframe
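A minimal join sketch with two made-up DataFrames; merge() is pandas' SQL-style join:

```python
import pandas as pd

ratings = pd.DataFrame({'title': ['Alien', 'Sing'], 'rating': [8.5, 7.1]})
years   = pd.DataFrame({'title': ['Alien', 'Sing'], 'year': [1979, 2016]})

# 'how' can be inner, left, right, or outer, as in SQL
joined = ratings.merge(years, on='title', how='inner')
print(joined)
```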
DATE AND TIME
• Create a date range
• Create timestamp
• Timestamp attributes: year, month, day, hour, minutes, seconds, ms, day/month name, day in week/month/year
• Exploration
• Date range (format: mm/dd/yyyy)
• Different formats
• Slicing data
• Daily aggregation
• Monthly aggregation
• Get year; year and month; year, month and day; hour; day name
• Time delta
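The date-and-time operations above can be sketched with a made-up hourly series:

```python
import pandas as pd

# Create a date range (mm/dd/yyyy format) and an hourly series over it
idx = pd.date_range('01/01/2020', periods=48, freq='h')
ts = pd.Series(range(48), index=idx)

# Create a timestamp and read its attributes
stamp = pd.Timestamp('2020-01-01 13:45:00')
print(stamp.year, stamp.month_name(), stamp.day_name())

# Slice by date string, then aggregate per day
daily = ts.resample('D').sum()
print(ts['2020-01-01'].sum(), daily.iloc[0])
```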
UNIX TIME
• https://www.unixtimestamp.com/index.php
• Source
• https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
• https://towardsdatascience.com/basic-time-series-manipulation-with-pandas-4432afee64ea
PYTHON DATA VISUALIZATION
• Matplotlib
• Seaborn
TYPES OF CHARTS
• Bar chart
• Line chart
• Scatter plot
• Heatmap
SETUP JUPYTER
DATA PREPARATION
BAR CHART
Its purpose is to compare a few items.
BAR CHART DATA
LINE CHART
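Both chart types can be sketched with matplotlib; the genre counts and revenue figures below are made up:

```python
import matplotlib
matplotlib.use('Agg')          # non-interactive backend, safe outside notebooks
import matplotlib.pyplot as plt

genres = ['Action', 'Drama', 'Comedy']
counts = [12, 7, 9]
years = [2014, 2015, 2016]
revenue = [120, 150, 90]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(genres, counts)                 # bar chart: compare a few items
ax1.set_title('Movies per genre')
ax2.plot(years, revenue, marker='o')    # line chart: trend over time
ax2.set_title('Revenue by year')
fig.savefig('charts.png')
```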
movies_df_gb = movies_df[['year','new_genre','revenue_millions']] \
    .groupby(['year','new_genre']) \
    .agg({'new_genre':'count', 'revenue_millions':'sum'})
movies_df_gb.columns = ['count_genre','sum_revenue_mio']
movies_df_gb = movies_df_gb.reset_index()

movies_df_gb_pvt = pd.pivot_table(movies_df_gb,
                                  values='count_genre',
                                  index=['year'], columns=['new_genre'],
                                  aggfunc=np.sum).fillna(0)
HEAT MAP
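seaborn.heatmap is the usual one-liner for a pivot table like movies_df_gb_pvt; a sketch with plain matplotlib (to keep dependencies minimal) and a made-up years-by-genres matrix:

```python
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import numpy as np

# A movies_df_gb_pvt-style matrix: rows = years, columns = genres
data = np.array([[4., 2., 0.],
                 [5., 1., 3.]])

fig, ax = plt.subplots()
im = ax.imshow(data, cmap='viridis')    # each cell coloured by its value
ax.set_xticks(range(3))
ax.set_xticklabels(['Action', 'Drama', 'Comedy'])
ax.set_yticks(range(2))
ax.set_yticklabels([2015, 2016])
fig.colorbar(im)
fig.savefig('heatmap.png')
```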
SPARK INTRODUCTION
• RDD
a = sc.parallelize([1,2,3,4])
• DataFrame
df = a.map(lambda x: (x,)).toDF(['a'])
START SPARK ENGINE
• Basic configuration
LOAD DATA
• From CSV
• Monitor running jobs in the Spark UI: http://localhost:4040
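The session setup and CSV load can be sketched as below. This is a sketch only: it assumes pyspark is installed and a JVM is available, and 'movies.csv' is a placeholder for the training file.

```python
def run_spark_job():
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master('local[*]')           # basic configuration: run on local cores
             .appName('bni-training')
             .getOrCreate())

    df = (spark.read
          .option('header', 'true')        # first CSV row holds column names
          .option('inferSchema', 'true')   # guess column types from the data
          .csv('movies.csv'))

    df.printSchema()                       # watch the job at http://localhost:4040
    spark.stop()

# run_spark_job()  # uncomment once pyspark and the data file are available
```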
SCHEMA AND COLUMNS
• Rename columns
DATAFRAME EXPLORATION
• Filter
• aggregation
• join
• pivot
USER DEFINED FUNCTION (UDF)
TEMPORARY TABLE
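The exploration operations, a UDF, and a temporary table can be sketched in one function. Sketch only: pyspark is assumed, and the column names (director, year, revenue_millions, genre, title) are assumptions about the training data.

```python
def explore(spark, df):
    from pyspark.sql import functions as F
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # filter
    scott = df.filter(df.director == 'Ridley Scott')

    # aggregation
    per_year = df.groupBy('year').agg(F.sum('revenue_millions').alias('rev'))

    # join
    joined = df.join(per_year, on='year', how='inner')

    # pivot
    pvt = df.groupBy('year').pivot('genre').count()

    # user defined function (UDF)
    shout = udf(lambda s: s.upper(), StringType())
    upper = df.withColumn('title_upper', shout(df.title))

    # temporary table: query the DataFrame with SQL
    df.createOrReplaceTempView('movies')
    return spark.sql('SELECT title FROM movies LIMIT 5')
```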
EXPORT RESULT
Save to csv
Save to parquet
Result folder
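Saving to CSV and Parquet can be sketched as below (pyspark assumed; the result folder names are placeholders):

```python
def export_results(df):
    # coalesce(1) writes a single file into the result folder
    df.coalesce(1).write.mode('overwrite').csv('result_csv', header=True)
    df.write.mode('overwrite').parquet('result_parquet')
```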
• spark.stop()
PYTHON HBASE
• Download and extract HBase:
  https://downloads.apache.org/hbase/1.4.13/hbase-1.4.13-bin.tar.gz
• Configure conf/hbase-env.sh (add JAVA_HOME)
• Configure conf/hbase-site.xml
• Export JAVA_HOME and HBASE_HOME in ~/.bashrc:
  export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
  export PATH=$PATH:$JAVA_HOME/bin
  export HBASE_HOME=/home/adam/hduser/hbase-1.4.12
  export PATH=$PATH:$HBASE_HOME/bin
• Start HBase, the HBase shell, and the Thrift server
RETRIEVE DATA
READ CSV
1. library
2. Import csv
CREATE CONNECTION AND INSERT
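The read-CSV-then-insert flow can be sketched with the happybase Thrift client. Sketch only: the table name 'movies', the column family 'cf', and the 'id' key column are assumptions; the connection part needs `pip install happybase` and a running HBase Thrift server (`hbase thrift start`).

```python
import csv
import io

def rows_from_csv(text):
    # Turn CSV text into (row_key, {b'family:qualifier': value}) pairs
    # in the shape HBase's put() expects
    reader = csv.DictReader(io.StringIO(text))
    for rec in reader:
        key = rec.pop('id')
        yield key, {('cf:%s' % k).encode(): v.encode() for k, v in rec.items()}

def load(text, host='localhost'):
    import happybase
    conn = happybase.Connection(host)   # connects to Thrift (default port 9090)
    table = conn.table('movies')        # table name is an assumption
    for key, cells in rows_from_csv(text):
        table.put(key.encode(), cells)
    conn.close()
```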
SOURCE
• https://www.digitalocean.com/community/tutorials/how-to-install-java-with-apt-on-ubuntu-18-04
• https://www.guru99.com/hbase-installation-guide.html