Created and owned by Europe SNH Analytics
Python Basics
First Line of Code – Hello World!
Python Code
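A minimal first program is a single call to the built-in print function:

print("Hello World!")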
Code Output
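Running the line above prints:

Hello World!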
Variables
In programming, variables help us save and reuse the values we need. We can assign different types of data to variables – integer, float, string, etc.
a_value=1
b_value=1.2
this_is_my_string="Hello World!" or this_is_my_string='Hello World!'
In Python you don't have to specify the type of a variable (as you do in Java or C#); after assignment the interpreter will automatically recognize the proper type. The name of the variable is chosen by the developer – always try to use names that suggest what the variable is holding, as it will help other people read your code.
For example, if you need to save the name and age of a user, it is better to create something like this:
user_age=10
user_name="John"
than something like this:
value_temp1=10
string_temp2="John"
Variables are also helpful when we need to make changes in the code:
If we need to change the values of a and b to 10 and 12, it is much easier to do it in a version that uses variables – we only need to change the first two lines – than in a version with hard-coded values where we need to change 4 lines (see the sketch below).
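A minimal sketch of the idea (the exact expressions are illustrative):

# with variables – to switch to 10 and 12 you edit only the first two lines
a_value=1
b_value=2
print(a_value + b_value)
print(a_value - b_value)
print(a_value * b_value)
print(a_value / b_value)

# without variables – the same change means editing all 4 lines below
print(1 + 2)
print(1 - 2)
print(1 * 2)
print(1 / 2)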
List
On top of variables, which help us save a single value, Python (like other languages) has structures which can hold more than one value – we call them lists.
You can declare an empty list as follows:
first_list=[]
or during declaration you can already assign values:
first_list=[1, 2, 3, 4]
If you first declare an empty list and then want to add a value, you use the method "append":
first_list.append(1)
Similar to variables, lists can hold integer, string or float values, and as with variables you don't need to specify the type of the list:
string_list=["apple", "banana", "pear"]
float_list=[1.3, 4.5, 6.7, 8.3]
To check how many elements are in the list you use the function "len":
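For example, using the list declared above:

len(string_list)   # returns 3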
To get a specific element from the list you use square brackets with an index – remember that list indices start from 0, so the first element has index 0 and the second element has index 1:
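For example, using first_list declared above:

first_list[0]   # first element
first_list[1]   # second element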
Dictionary
A dictionary is another way of storing your values – it stores them in key:value pairs.
You declare a dictionary with curly brackets: {}
empty_dictionary={}
or during declaration you can already assign values:
new_dictionary={"key1":"value1", "key2":"value2"}
For example, if you want to store targets per BU, you can declare a dictionary such as the one sketched below. After you declare it, whenever the code needs the target for Hair Care you can look it up by key:
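A sketch with illustrative BU names and target values (only "Hair Care" comes from the example above, the rest is made up):

targets_per_bu={"Hair Care": 95, "Baby Care": 90, "Fabric Care": 97}
targets_per_bu["Hair Care"]   # returns 95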
IF Statement
When you want to compare two values with each other and act depending on the result of the comparison, you use an IF statement. The values you compare need to be of the same type, so you can compare a string with a string and a number with a number.
if condition:
    what to do when the condition is true
else:
    what to do when the condition is false
Important: in Python a new block of code starts with indentation.
In the example below we are comparing "number1" with "number2":
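A minimal sketch (the assigned values are illustrative):

number1=10
number2=5
if number1 > number2:
    print("number1 is bigger than number2")
else:
    print("number1 is not bigger than number2")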
Loop
If you need to do something several times or want to iterate over a list's elements, the best way is to use loops.
The most popular loop in Python is for.
a. Let's print "Europe SNH" 10 times:
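For example, using the built-in range function:

for i in range(10):
    print("Europe SNH")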
If you don't need to use the loop counter in the code, you can also write the loop like this:
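By convention the unused counter is replaced with an underscore:

for _ in range(10):
    print("Europe SNH")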
b. Iterate over a list – let's assume we have the list below:
loop_list=[10, 5, 6, 4, 12, 18, 3, 2]
and we want to implement the logic that if a list element is lower than or equal to 5 it should be doubled, and if it is higher it should be zeroed out:
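One way to do this is to build a new list while iterating over loop_list (the result list name is our own choice):

result_list=[]
for element in loop_list:
    if element <= 5:
        result_list.append(element * 2)
    else:
        result_list.append(0)
print(result_list)   # [0, 10, 0, 8, 0, 0, 6, 4]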
Function
If you need to use some piece of code several times, it is a good idea to put it into a function.
A function in Python is defined with the keyword "def":
def function_name(arguments):
    function code
Let's create a function that sums up two elements:
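A minimal sketch (the argument names are illustrative):

def sum_function(a, b):
    result = a + b
    return result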
Our sum_function can also be written as below:
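For example, returning the expression directly instead of storing it in a variable first:

def sum_function(a, b):
    return a + b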
You can also create a function without a "return" statement:
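Such a function only performs an action, for example printing the result (the exact body is illustrative):

def print_sum(a, b):
    print(a + b)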
Or even without input arguments:
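For example:

def say_hello():
    print("Hello World!")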
Libraries
Instead of writing code from scratch, we can use code that someone has already written, tested, and optimized.
Let's assume that we need functionality to calculate the average value of a list:
values_list=[34, 10, 2, 3, 4, 55, 100, 4, 50]
In this case we are looking for the value 29.1111.
How can we solve this problem?
a. Write code from scratch
Let's create a function which will calculate the average value:
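A minimal sketch using the built-in sum and len functions (the function name is our own choice):

def average_function(input_list):
    return sum(input_list) / len(input_list)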
Let's see how it works on our list, and test it on one more example:
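Calling the function on values_list, and on a second, made-up list:

average_function(values_list)    # 29.1111...
average_function([1, 2, 3, 4])   # 2.5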
Works as expected.
But instead of writing our own code we can use libraries.
b. Libraries
For our case we can use the library numpy and its function "mean". numpy is one of the most popular libraries in Python:
https://numpy.org/doc/stable/
To use a library in the code we need to first import it:
import library_name
If you import it like this, then to use a function in the code you need to prefix it with the library name:
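For example:

import numpy
numpy.mean(values_list)   # 29.1111...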
If you put only the function name "mean" you will receive an error:
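For example:

mean(values_list)
# NameError: name 'mean' is not defined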
We can also use an alias name:
import library_name as alias
The most popular alias for numpy is "np". In this case you can write code like this:
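import numpy as np
np.mean(values_list)   # 29.1111...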
You can also import a single function from a library, and in this case you can use only the function name:
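from numpy import mean
mean(values_list)   # 29.1111...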
Another popular library is pandas, which enables you to load and process data:
https://pandas.pydata.org/
For example, to load a csv file:
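A sketch, with "pd" as the usual pandas alias and a placeholder file name:

import pandas as pd
df = pd.read_csv("my_file.csv")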
PROJECT
In the project we will take the flow from the KNIME training and codify it in Python:
1. The easiest way to load data from PS Data Hub into a data frame is to use the "spark" library and its "sql" method; to display the data frame we can use the command "display":
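A sketch – the table name is a placeholder for the PS Data Hub table you want to load:

df_spark = spark.sql("SELECT * FROM my_database.my_table")
display(df_spark)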
2. After the data is loaded into a spark data frame we can transfer it to a pandas data frame with the function "toPandas()":
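Continuing the sketch, with df as the name we give the pandas data frame:

df = df_spark.toPandas()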
3. To see sample data you can use the "head" method; by default it will display the first 5 rows:
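df.head()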
4. You can see that rows are numbered starting with 0. If you want to display more than 5 rows you can specify this in the "head" method; to see the first 15 rows:
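df.head(15)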
5. To see the data frame properties – data types, column names etc. – use the method "info":
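df.info()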
6. We see that all our numeric columns have the type object:
We need to change them into float columns.
General code is as below:
df['DataFrame Column'] = df['DataFrame Column'].astype(float)
In our case we have several columns, and we may need to use similar code elsewhere, so it is a good idea to create a function for this:
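A sketch of such a function, using the name object_to_float referenced below:

def object_to_float(data_frame, column_list):
    for column in column_list:
        data_frame[column] = data_frame[column].astype(float)
    return data_frame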
Once the function is created, we need to create the list of columns which we want to change from object to float:
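A sketch assuming the numeric columns are the SU measures listed in step 9 below (adjust the list to your data set):

float_columns = ["SU Unfilled Cases (exc. 3.4.2 3.4.3)", "SU Unfilled Cases", "SU Shipments"]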
With the function "len" we can check how many columns we want to change:
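len(float_columns)   # 3 for the list sketched above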
Now let's execute our function "object_to_float":
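df = object_to_float(df, float_columns)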
After this we can run the "info" method once again, and as you can see some columns changed their type from object to float:
7. After we change the type of the columns, we can run the method "describe", which will show us some useful statistics for the data frame:
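df.describe()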
By default, pandas will run it for numeric columns and show you basic statistics like average, count, standard deviation, or percentiles.
If you want to run it for all columns, add "include='all'" inside the describe method:
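df.describe(include='all')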
8. To see how many rows are in our data frame we can, similarly to a list, use the function "len". Or we can use the property "shape" to also see the number of columns:
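len(df)     # number of rows
df.shape    # (number of rows, number of columns)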
9. The next step for us is to filter for the columns we really need for our analysis/visualization (the selection itself is sketched after the list):
Geographic Group (Name)
SubSector (Name)
FPC (ID)
FPC (Long Name)
Category (Name)
Brand (Name)
Level 7 (ID)
Level 7 (Name)
Month number (i.e. 201302)
Day date (i.e. 01-JAN-2013)
SU Unfilled Cases (exc. 3.4.2 3.4.3)
SU Unfilled Cases
SU Shipments
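A sketch of the selection – the list name is our own choice, and the exact header spellings (in particular the month and day columns) should be checked against the real table:

columns_to_keep = ["Geographic Group (Name)", "SubSector (Name)", "FPC (ID)", "FPC (Long Name)",
                   "Category (Name)", "Brand (Name)", "Level 7 (ID)", "Level 7 (Name)",
                   "Month number", "Day date",
                   "SU Unfilled Cases (exc. 3.4.2 3.4.3)", "SU Unfilled Cases", "SU Shipments"]
df = df[columns_to_keep]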
10. After the column selection the data IS NOT yet aggregated accordingly, which means that our data set still contains data at Customer level, so the next step for us is to aggregate the data.
We want to sum all float columns and group them by the object columns:
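A sketch of the aggregation that produces the group_df data frame used below:

object_columns = [column for column in df.columns if column not in float_columns]
group_df = df.groupby(object_columns, as_index=False)[float_columns].sum()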
11. Now we can look at the column names:
a. some of them are not "user friendly", like Level 7, which is in reality the Ship From location
b. most of them contain spaces or bracket signs – it would be good to remove those as well (some systems may not allow you to use data where column headers contain special signs, spaces, etc.)
First let's create a dictionary with the new column names:
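A sketch of such a dictionary – only su_cuts_excl, su_cuts and smo are referenced later in this flow, the remaining new names are our own choice:

rename_dictionary = {"Geographic Group (Name)": "smo",
                     "SubSector (Name)": "subsector",
                     "FPC (ID)": "fpc_id",
                     "FPC (Long Name)": "fpc_name",
                     "Category (Name)": "category",
                     "Brand (Name)": "brand",
                     "Level 7 (ID)": "ship_from_id",
                     "Level 7 (Name)": "ship_from_name",
                     "Month number": "month",
                     "Day date": "day_date",
                     "SU Unfilled Cases (exc. 3.4.2 3.4.3)": "su_cuts_excl",
                     "SU Unfilled Cases": "su_cuts",
                     "SU Shipments": "su_shipments"}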
Then we will use the method "rename" to change the column headers. In the call we also put "inplace=True" in order to save the changed data frame under the old data frame name – group_df:
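group_df.rename(columns=rename_dictionary, inplace=True)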
inplace=True is the same as if we wrote:
group_df=group_df.rename(columns=rename_dictionary)
12. When we look at our data we see that the FPC code and Ship From contain "[" and "]" – those brackets do not carry any information, so it is good to remove them:
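A sketch using str.replace; the column names come from the rename dictionary sketched above, so adjust them to your own naming:

for column in ["fpc_id", "ship_from_id"]:
    group_df[column] = group_df[column].str.replace("[", "", regex=False).str.replace("]", "", regex=False)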
13. The next step for us is to do the mapping – we want to add the TDC Val Name to our data set
a. First load the data from the table userdb_eupscanalytics_im.python_train_md
b. Join the main data frame with the product mapping data frame:
For the join operation we will use the function "merge" from the pandas library:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
The first step is to import the library; for pandas the usual alias is "pd". After this we can use the functions from this library in our code:
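A sketch of the load and the join; the join keys (fpc_id on our side, material_id on the mapping side) are assumptions based on the columns mentioned in this flow:

import pandas as pd

prod_mapping_df = spark.sql("SELECT * FROM userdb_eupscanalytics_im.python_train_md").toPandas()
merged_df = pd.merge(group_df, prod_mapping_df, left_on="fpc_id", right_on="material_id", how="left")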
We don't need the column "material_id", so we can drop it:
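merged_df = merged_df.drop(columns=["material_id"])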
After the join operation it is good to check the row count in the new data set, to make sure that we don't have any duplications:
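For example:

len(merged_df)
len(group_df)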
Everything is fine – the new data set has the same number of rows as group_df, and we have one additional column with the TDC Val Name.
Another thing we should check after a join operation is missing values; we can easily do it with one line of code:
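A sketch, assuming the mapping column is called "TDC Val Name":

merged_df[merged_df["TDC Val Name"].isna()]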
We see that for some rows the system didn't find a TDC Val Name; let's replace the NaN values with the keyword "missing":
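merged_df["TDC Val Name"] = merged_df["TDC Val Name"].fillna("missing")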
Now when we filter for NaN we get an empty data set, and all NaN values have been replaced with "missing".
14. Now we can add a business rule to our flow – if SMO is equal to TURKEY & CCAR, as the final cut we should report su_cuts_excl, and in every other case su_cuts.
For training purposes, we will do it in two ways:
a. With loop
b. With Lambda function – recommended approach
Loop
First we will create an empty list to keep our results, then we will create a loop to iterate over all data frame rows, and as the last step we will add a new column with the temp_cuts values:
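A sketch, assuming the SMO column was renamed to "smo" earlier and naming the new column final_cuts_loop:

temp_cuts = []
for index, row in merged_df.iterrows():
    if row["smo"] == "TURKEY & CCAR":
        temp_cuts.append(row["su_cuts_excl"])
    else:
        temp_cuts.append(row["su_cuts"])
merged_df["final_cuts_loop"] = temp_cuts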
Lambda
Exactly the same result (and much faster) can be achieved with the line of code below:
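A sketch using apply with a lambda function (same column name assumptions as above):

merged_df["final_cuts_lambda"] = merged_df.apply(lambda row: row["su_cuts_excl"] if row["smo"] == "TURKEY & CCAR" else row["su_cuts"], axis=1)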
15. Final Data Set Clean up
a. Drop columns su_cuts and su_cuts_excl
b. Keep only one of the final_cuts columns and rename it to su_cuts
c. Rename the TDC Val Name column to tdc_name
Drop columns su_cuts and su_cuts_excl
Keep only one of the final_cuts columns and rename it to su_cuts
Rename the TDC Val Name column to tdc_name
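A sketch of the three operations, continuing with the column names assumed above:

# drop columns su_cuts and su_cuts_excl
merged_df = merged_df.drop(columns=["su_cuts", "su_cuts_excl"])

# keep only one of the final_cuts columns and rename it to su_cuts
merged_df = merged_df.drop(columns=["final_cuts_loop"])
merged_df = merged_df.rename(columns={"final_cuts_lambda": "su_cuts"})

# rename the TDC Val Name column to tdc_name
merged_df = merged_df.rename(columns={"TDC Val Name": "tdc_name"})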
16. Data Filter
a. Single column filter
b. Multiple columns filter with AND
c. Multiple columns filter with OR
Single Column Filter:
Multiple columns filter with AND
Multiple columns filter with OR
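Sketches of the three filter types (the filter values and column names are illustrative):

# single column filter
merged_df[merged_df["smo"] == "TURKEY & CCAR"]

# multiple columns filter with AND
merged_df[(merged_df["smo"] == "TURKEY & CCAR") & (merged_df["su_cuts"] > 0)]

# multiple columns filter with OR
merged_df[(merged_df["smo"] == "TURKEY & CCAR") | (merged_df["su_cuts"] > 0)]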
17. Data Save
a. Convert pandas data frame into spark data frame
b. Save as table in database or download as csv
Convert pandas data frame into spark data frame
Download CSV:
Save as table:
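A sketch of the save step; the output file name and table name are placeholders:

# convert pandas data frame into spark data frame
spark_df = spark.createDataFrame(merged_df)

# download as csv (from the pandas data frame)
merged_df.to_csv("final_data.csv", index=False)

# save as table in the database
spark_df.write.mode("overwrite").saveAsTable("my_database.my_output_table")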
18. Spark register temp table
Once you have a spark data frame you can register it as a temporary view and use it in the normal way in SQL code:
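A sketch; the view name is a placeholder:

spark_df.createOrReplaceTempView("my_temp_view")
display(spark.sql("SELECT * FROM my_temp_view"))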