Created and owned by Europe SNH Analytics
Python Basics
First Line of Code – Hello World!
Python Code
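A minimal first program is a single call to the built-in print function:

print("Hello World!")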
Code Output
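Running the line above prints:

Hello World!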
Variables
In programming, variables help us save and reuse the values we need. We can assign different types of data to variables – integer, float, string, etc.
a_value=1
b_value=1.2
this_is_my_string="Hello World!" or this_is_my_string='Hello World!'
In Python you don't have to specify the type of a variable (as you do in Java or C#); after assignment the interpreter will automatically recognize the proper type. The name of the variable is chosen by the developer – always try to use names that suggest what the variable is holding, as it will help other people read your code.
For example, if you need to save the name and age of a user, it is better to create something like this:
user_age=10
user_name="John"
than something like this:
value_temp1=10
string_temp2="John"
Variables are also helpful when we need to make changes in the code:
If we need to change the values of a and b to 10 and 12, it is much easier to do it in a version that uses variables – we only need to change the first two lines – than in a version with hard-coded values where we need to change 4 lines (see the sketch below).
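A minimal sketch of the idea (the exact expressions are illustrative):

# with variables – to switch to 10 and 12 you edit only the first two lines
a_value=1
b_value=2
print(a_value + b_value)
print(a_value - b_value)
print(a_value * b_value)
print(a_value / b_value)

# without variables – the same change means editing all 4 lines below
print(1 + 2)
print(1 - 2)
print(1 * 2)
print(1 / 2)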
List
On top of variables, which help us save a single value, Python (like other languages) has structures which can hold more than one value – we call them lists.
You can declare an empty list as follows:
first_list=[]
or during declaration you can already assign values:
first_list=[1, 2, 3, 4]
If you first declare an empty list and then want to add a value, you use the method "append":
first_list.append(1)
Similar to variables, lists can hold integer, string or float values, and as with variables you don't need to specify the type of the list:
string_list=["apple", "banana", "pear"]
float_list=[1.3, 4.5, 6.7, 8.3]
To check how many elements are in the list you use the function "len":
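For example, using the list declared above:

len(string_list)   # returns 3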
To get a specific element from the list you use square brackets with an index – remember that list indices start from 0, so the first element has index 0 and the second element has index 1:
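For example, using first_list declared above:

first_list[0]   # first element
first_list[1]   # second element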
Dictionary
A dictionary is another way of storing your values – it stores them in key:value pairs.
You declare a dictionary with curly brackets: {}
empty_dictionary={}
or during declaration you can already assign values:
new_dictionary={"key1":"value1", "key2":"value2"}
For example, if you want to store targets per BU, you can declare a dictionary such as the one sketched below. After you declare it, whenever the code needs the target for Hair Care you can look it up by key:
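A sketch with illustrative BU names and target values (only "Hair Care" comes from the example above, the rest is made up):

targets_per_bu={"Hair Care": 95, "Baby Care": 90, "Fabric Care": 97}
targets_per_bu["Hair Care"]   # returns 95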
IF Statement
When you want to compare two values with each other and act depending on the result of the comparison, you use an IF statement. The values you compare need to be of the same type, so you can compare a string with a string and a number with a number.
if condition:
    what to do when the condition is true
else:
    what to do when the condition is false
Important: in Python a new block of code starts with indentation.
In the example below we are comparing "number1" with "number2":
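A minimal sketch (the assigned values are illustrative):

number1=10
number2=5
if number1 > number2:
    print("number1 is bigger than number2")
else:
    print("number1 is not bigger than number2")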
Loop
If you need to do something several times or want to iterate over a list's elements, the best way is to use loops.
The most popular loop in Python is for.
a. Let's print "Europe SNH" 10 times:
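For example, using the built-in range function:

for i in range(10):
    print("Europe SNH")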
If you don't need to use the loop counter in the code, you can also write the loop like this:
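By convention the unused counter is replaced with an underscore:

for _ in range(10):
    print("Europe SNH")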
b. Iterate over a list – let's assume we have the list below:
loop_list=[10, 5, 6, 4, 12, 18, 3, 2]
and we want to implement the logic that if a list element is lower than or equal to 5 it should be doubled, and if it is higher it should be zeroed out:
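One way to do this is to build a new list while iterating over loop_list (the result list name is our own choice):

result_list=[]
for element in loop_list:
    if element <= 5:
        result_list.append(element * 2)
    else:
        result_list.append(0)
print(result_list)   # [0, 10, 0, 8, 0, 0, 6, 4]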
Function
If you need to use some piece of code several times, it is a good idea to put it into a function.
A function in Python is defined with the keyword "def":
def function_name(arguments):
    function code
Let's create a function that sums up two elements:
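A minimal sketch (the argument names are illustrative):

def sum_function(a, b):
    result = a + b
    return result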
Our sum_function can also be written as below:
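For example, returning the expression directly instead of storing it in a variable first:

def sum_function(a, b):
    return a + b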
You can also create a function without a "return" statement:
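Such a function only performs an action, for example printing the result (the exact body is illustrative):

def print_sum(a, b):
    print(a + b)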
Or even without input arguments:
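For example:

def say_hello():
    print("Hello World!")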
Libraries
Instead of writing code from scratch, we can use code that someone has already written, tested, and optimized.
Let's assume that we need functionality to calculate the average value of a list:
values_list=[34, 10, 2, 3, 4, 55, 100, 4, 50]
In this case we are looking for the value 29.1111.
How can we solve this problem?
a. Write code from scratch
Let's create a function which will calculate the average value:
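A minimal sketch using the built-in sum and len functions (the function name is our own choice):

def average_function(input_list):
    return sum(input_list) / len(input_list)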
Let's see how it works on our list, and test it on one more example:
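Calling the function on values_list, and on a second, made-up list:

average_function(values_list)    # 29.1111...
average_function([1, 2, 3, 4])   # 2.5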
Works as expected.
But instead of writing our own code we can use libraries.
b. Libraries
For our case we can use the library numpy and its function "mean". numpy is one of the most popular libraries in Python:
https://numpy.org/doc/stable/
To use a library in the code we need to first import it:
import library_name
If you import it like this, then to use a function in the code you need to prefix it with the library name:
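For example:

import numpy
numpy.mean(values_list)   # 29.1111...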
If you put only the function name "mean" you will receive an error:
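For example:

mean(values_list)
# NameError: name 'mean' is not defined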
We can also use an alias name:
import library_name as alias
The most popular alias for numpy is "np". In this case you can write code like this:
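import numpy as np
np.mean(values_list)   # 29.1111...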
You can also import a single function from a library, and in this case you can use only the function name:
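from numpy import mean
mean(values_list)   # 29.1111...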
Another popular library is pandas, which enables you to load and process data:
https://pandas.pydata.org/
For example, to load a csv file:
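A sketch, with "pd" as the usual pandas alias and a placeholder file name:

import pandas as pd
df = pd.read_csv("my_file.csv")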
PROJECT
In the project we will take the flow from the KNIME training and codify it in Python:
1. The easiest way to load data from PS Data Hub into a data frame is to use the "spark" library and its "sql" method; to display the data frame we can use the command "display":
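A sketch – the table name is a placeholder for the PS Data Hub table you want to load:

df_spark = spark.sql("SELECT * FROM my_database.my_table")
display(df_spark)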
2. After the data is loaded into a spark data frame we can transfer it to a pandas data frame with the function "toPandas()":
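Continuing the sketch, with df as the name we give the pandas data frame:

df = df_spark.toPandas()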
3. To see sample data you can use the "head" method; by default it will display the first 5 rows:
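df.head()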
4. You can see that rows are numbered starting with 0. If you want to display more than 5 rows you can specify this in the "head" method; to see the first 15 rows:
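df.head(15)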
5. To see the data frame properties – data types, column names etc. – use the method "info":
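df.info()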
6. We see that all our numeric columns have the type object:
We need to change them into float columns.
General code is as below:
df['DataFrame Column'] = df['DataFrame Column'].astype(float)
In our case we have several columns, and we may need to use similar code elsewhere, so it is a good idea to create a function for this:
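A sketch of such a function, using the name object_to_float referenced below:

def object_to_float(data_frame, column_list):
    for column in column_list:
        data_frame[column] = data_frame[column].astype(float)
    return data_frame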
Once the function is created, we need to create the list of columns which we want to change from object to float:
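A sketch assuming the numeric columns are the SU measures listed in step 9 below (adjust the list to your data set):

float_columns = ["SU Unfilled Cases (exc. 3.4.2 3.4.3)", "SU Unfilled Cases", "SU Shipments"]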
With the function "len" we can check how many columns we want to change:
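len(float_columns)   # 3 for the list sketched above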
Now let's execute our function "object_to_float":
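df = object_to_float(df, float_columns)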
After this we can run the "info" method once again, and as you can see some columns changed their type from object to float:
7. After we change the type of the columns, we can run the method "describe", which will show us some useful statistics for the data frame:
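df.describe()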
By default, pandas will run it for numeric columns and show you basic statistics like average, count, standard deviation, or percentiles.
If you want to run it for all columns, add "include='all'" inside the describe method:
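df.describe(include='all')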
8. To see how many rows are in our data frame we can, similarly to a list, use the function "len". Or we can use the property "shape" to also see the number of columns:
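len(df)     # number of rows
df.shape    # (number of rows, number of columns)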
9. The next step for us is to filter for the columns we really need for our analysis/visualization (the selection itself is sketched after the list):
Geographic Group (Name)
SubSector (Name)
FPC (ID)
FPC (Long Name)
Category (Name)
Brand (Name)
Level 7 (ID)
Level 7 (Name)
Month number (i.e. 201302)
Day date (i.e. 01-JAN-2013)
SU Unfilled Cases (exc. 3.4.2 3.4.3)
SU Unfilled Cases
SU Shipments
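A sketch of the selection – the list name is our own choice, and the exact header spellings (in particular the month and day columns) should be checked against the real table:

columns_to_keep = ["Geographic Group (Name)", "SubSector (Name)", "FPC (ID)", "FPC (Long Name)",
                   "Category (Name)", "Brand (Name)", "Level 7 (ID)", "Level 7 (Name)",
                   "Month number", "Day date",
                   "SU Unfilled Cases (exc. 3.4.2 3.4.3)", "SU Unfilled Cases", "SU Shipments"]
df = df[columns_to_keep]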
10. After the column selection the data IS NOT yet aggregated accordingly, which means that our data set still contains data at Customer level, so the next step for us is to aggregate the data.
We want to sum all float columns and group them by the object columns:
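A sketch of the aggregation that produces the group_df data frame used below:

object_columns = [column for column in df.columns if column not in float_columns]
group_df = df.groupby(object_columns, as_index=False)[float_columns].sum()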
11. Now we can look at the column names:
a. some of them are not "user friendly", like Level 7, which is in reality the Ship From location
b. most of them contain spaces or bracket signs – it would be good to remove those as well (some systems may not allow you to use data where column headers contain special signs, spaces, etc.)
First let's create a dictionary with the new column names:
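A sketch of such a dictionary – only su_cuts_excl, su_cuts and smo are referenced later in this flow, the remaining new names are our own choice:

rename_dictionary = {"Geographic Group (Name)": "smo",
                     "SubSector (Name)": "subsector",
                     "FPC (ID)": "fpc_id",
                     "FPC (Long Name)": "fpc_name",
                     "Category (Name)": "category",
                     "Brand (Name)": "brand",
                     "Level 7 (ID)": "ship_from_id",
                     "Level 7 (Name)": "ship_from_name",
                     "Month number": "month",
                     "Day date": "day_date",
                     "SU Unfilled Cases (exc. 3.4.2 3.4.3)": "su_cuts_excl",
                     "SU Unfilled Cases": "su_cuts",
                     "SU Shipments": "su_shipments"}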
Then we will use the method "rename" to change the column headers. In the call we also put "inplace=True" in order to save the changed data frame under the old data frame name – group_df:
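group_df.rename(columns=rename_dictionary, inplace=True)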
inplace=True is the same as if we wrote:
group_df=group_df.rename(columns=rename_dictionary)
12. When we look at our data we see that the FPC code and Ship From contain "[" and "]" – those brackets do not carry any information, so it is good to remove them:
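A sketch using str.replace; the column names come from the rename dictionary sketched above, so adjust them to your own naming:

for column in ["fpc_id", "ship_from_id"]:
    group_df[column] = group_df[column].str.replace("[", "", regex=False).str.replace("]", "", regex=False)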
13. The next step for us is to do the mapping – we want to add the TDC Val Name to our data set
a. First load the data from the table userdb_eupscanalytics_im.python_train_md
b. Join the main data frame with the product mapping data frame:
For the join operation we will use the function "merge" from the pandas library:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
The first step is to import the library; for pandas the usual alias is "pd". After this we can use the functions from this library in our code:
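A sketch of the load and the join; the join keys (fpc_id on our side, material_id on the mapping side) are assumptions based on the columns mentioned in this flow:

import pandas as pd

prod_mapping_df = spark.sql("SELECT * FROM userdb_eupscanalytics_im.python_train_md").toPandas()
merged_df = pd.merge(group_df, prod_mapping_df, left_on="fpc_id", right_on="material_id", how="left")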
We don't need the column "material_id", so we can drop it:
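merged_df = merged_df.drop(columns=["material_id"])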
After the join operation it is good to check the row count in the new data set, to make sure that we don't have any duplications:
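For example:

len(merged_df)
len(group_df)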
Everything is fine – the new data set has the same number of rows as group_df, and we have one additional column with the TDC Val Name.
Another thing we should check after a join operation is missing values; we can easily do it with one line of code:
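A sketch, assuming the mapping column is called "TDC Val Name":

merged_df[merged_df["TDC Val Name"].isna()]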
We see that for some rows the system didn't find a TDC Val Name; let's replace the NaN values with the keyword "missing":
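merged_df["TDC Val Name"] = merged_df["TDC Val Name"].fillna("missing")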
Now when we filter for NaN we get an empty data set, and all NaN values have been replaced with "missing".
14. Now we can add a business rule to our flow – if SMO is equal to TURKEY & CCAR, as the final cut we should report su_cuts_excl, and in every other case su_cuts.
For training purposes, we will do it in two ways:
a. With loop
b. With Lambda function – recommended approach
Loop
First we will create an empty list to keep our results, then we will create a loop to iterate over all data frame rows, and as the last step we will add a new column with the temp_cuts values:
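A sketch, assuming the SMO column was renamed to "smo" earlier and naming the new column final_cuts_loop:

temp_cuts = []
for index, row in merged_df.iterrows():
    if row["smo"] == "TURKEY & CCAR":
        temp_cuts.append(row["su_cuts_excl"])
    else:
        temp_cuts.append(row["su_cuts"])
merged_df["final_cuts_loop"] = temp_cuts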
Lambda
Exactly the same result (and much faster) can be achieved with the line of code below:
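A sketch using apply with a lambda function (same column name assumptions as above):

merged_df["final_cuts_lambda"] = merged_df.apply(lambda row: row["su_cuts_excl"] if row["smo"] == "TURKEY & CCAR" else row["su_cuts"], axis=1)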
15. Final Data Set Clean up
a. Drop columns su_cuts and su_cuts_excl
b. Keep only one of the final_cuts columns and rename it to su_cuts
c. Rename the TDC Val Name column to tdc_name
Drop columns su_cuts and su_cuts_excl
Keep only one of the final_cuts columns and rename it to su_cuts
Rename the TDC Val Name column to tdc_name
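A sketch of the three operations, continuing with the column names assumed above:

# drop columns su_cuts and su_cuts_excl
merged_df = merged_df.drop(columns=["su_cuts", "su_cuts_excl"])

# keep only one of the final_cuts columns and rename it to su_cuts
merged_df = merged_df.drop(columns=["final_cuts_loop"])
merged_df = merged_df.rename(columns={"final_cuts_lambda": "su_cuts"})

# rename the TDC Val Name column to tdc_name
merged_df = merged_df.rename(columns={"TDC Val Name": "tdc_name"})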
16. Data Filter
a. Single column filter
b. Multiple columns filter with AND
c. Multiple columns filter with OR
Single Column Filter:
Multiple columns filter with AND
Multiple columns filter with OR
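Sketches of the three filter types (the filter values and column names are illustrative):

# single column filter
merged_df[merged_df["smo"] == "TURKEY & CCAR"]

# multiple columns filter with AND
merged_df[(merged_df["smo"] == "TURKEY & CCAR") & (merged_df["su_cuts"] > 0)]

# multiple columns filter with OR
merged_df[(merged_df["smo"] == "TURKEY & CCAR") | (merged_df["su_cuts"] > 0)]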
17. Data Save
a. Convert pandas data frame into spark data frame
b. Save as table in database or download as csv
Convert pandas data frame into spark data frame
Download CSV:
Save as table:
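A sketch of the save step; the output file name and table name are placeholders:

# convert pandas data frame into spark data frame
spark_df = spark.createDataFrame(merged_df)

# download as csv (from the pandas data frame)
merged_df.to_csv("final_data.csv", index=False)

# save as table in the database
spark_df.write.mode("overwrite").saveAsTable("my_database.my_output_table")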
18. Spark register temp table
Once you have a spark data frame you can register it as a temporary view and use it in the normal way in SQL code:
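A sketch; the view name is a placeholder:

spark_df.createOrReplaceTempView("my_temp_view")
display(spark.sql("SELECT * FROM my_temp_view"))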