MLStack Cafe 2
Topics: Pandas
Answer:
We create a new column by assigning the output to the DataFrame under a new column name placed between
square brackets ( [] ).
Let's say we want to create a new column 'C' whose values are the multiplication of column 'B' with
column 'A' . The operation will be easy to implement and will be element-wise, so there's no need to loop
over rows.
import pandas as pd
Other arithmetic operators ( + , - , * , / ) and comparison operators ( < , > , == , … ) also work element-wise.
But if we need more advanced logic, we can use arbitrary Python code via apply() .
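A minimal sketch of the three ideas above (element-wise arithmetic, element-wise comparison, and apply() for custom logic). The data values and the extra column names 'A_gt_1' and 'label' are invented for illustration:

```python
import pandas as pd

# Hypothetical data; the column names 'A', 'B', 'C' follow the text above.
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Element-wise multiplication, assigned to a new column via [].
df["C"] = df["A"] * df["B"]

# Comparison operators are element-wise too and yield a boolean column.
df["A_gt_1"] = df["A"] > 1

# For logic that is awkward to vectorize, apply() runs Python code per row.
df["label"] = df.apply(
    lambda row: "big" if row["C"] >= 40 else "small", axis=1
)

print(df)
```

The vectorized forms are usually much faster than apply(), so apply() is best kept for cases that the built-in operators cannot express.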
Depending on the case, we can use rename with a dictionary or function to rename row labels or column
names according to the problem.
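As a rough illustration of both rename styles, assuming a small made-up DataFrame (the labels below are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

# Dictionary mapping: only the listed column names are renamed.
renamed_cols = df.rename(columns={"a": "alpha", "b": "beta"})

# Function mapping: the function is applied to every row label.
renamed_rows = df.rename(index=str.upper)

print(renamed_cols.columns.tolist())
print(renamed_rows.index.tolist())
```

rename returns a new object by default, so df itself keeps its original labels.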
Q2: How do you count unique values per group with Pandas? ☆
Topics: Pandas
Problem:
>>> data = {'ID': [123, 123, 123, 456, 456, 456, 456, 789, 789],
"domain":['vk.com', 'vk.com', 'twitter.com', 'vk.com','facebook.com',
'vk.com','google.com','twitter.com','vk.com']
}
>>> df = pd.DataFrame(data)
>>> df
ID domain
0 123 vk.com
1 123 vk.com
2 123 twitter.com
3 456 vk.com
4 456 facebook.com
5 456 vk.com
6 456 google.com
MLStack.Cafe - Kill Your Data Science & ML Interview
7 789 twitter.com
8 789 vk.com
Solution:
>>> df = df.groupby('domain')['ID'].nunique()
>>> df
domain
facebook.com 1
google.com 1
twitter.com 2
vk.com 3
Name: ID, dtype: int64
Topics: Pandas
Answer:
DataFrame.iloc is an indexer used to retrieve data from a DataFrame by integer position (from 0 to
length-1 of the axis); it may also be used with a boolean array. It accepts an integer, a list or array
of integers, a slice object, a boolean array, or a callable.
df.iloc[0]
df.iloc[-5:]
df.iloc[:, 2] # the : in the first position indicates all rows
df.iloc[:3, :3] # The upper-left 3 X 3 entries (assuming df has 3+ rows and columns)
DataFrame.loc gets rows (and/or columns) with particular labels. It accepts a single label, a list or
array of labels, a slice object with labels, a boolean array, or a callable.
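The difference is easiest to see on a DataFrame whose index is not 0..n-1. A small sketch, assuming a made-up frame with string row labels:

```python
import pandas as pd

# Hypothetical data with string labels so position and label differ visibly.
df = pd.DataFrame(
    {"x": [1, 2, 3], "y": [4, 5, 6]},
    index=["a", "b", "c"],
)

first_row = df.iloc[0]         # by position
same_row = df.loc["a"]         # by label
pos_slice = df.iloc[0:2]       # positions 0 and 1; the end is exclusive
label_slice = df.loc["a":"b"]  # labels 'a' through 'b'; the end is inclusive
cell = df.loc["b", "y"]        # single value by row and column label
```

Note the slicing asymmetry: iloc follows Python's half-open convention, while loc includes both endpoints.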
Q4: What are the operations that the Pandas groupby method is based
on? ☆☆
Topics: Pandas
Answer:
Topics: Pandas
Answer:
Using the .columns attribute of the DataFrame object, which returns the column labels of the DataFrame.
list(data.columns)
list(data.columns.values)
Using the built-in sorted() function, which returns the list of column names sorted in alphabetical order.
sorted(data)
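Put together on a small made-up DataFrame (the name data follows the snippets above, the contents are hypothetical), the options behave as follows:

```python
import pandas as pd

data = pd.DataFrame({"b": [1], "a": [2], "c": [3]})

cols = list(data.columns)      # columns in their original order
cols_alt = data.columns.tolist()

# Iterating over a DataFrame yields its column labels,
# which is why sorted(data) returns the sorted column names.
cols_sorted = sorted(data)

print(cols, cols_sorted)
```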
Q6: In Pandas, what do you understand as a bar plot and how can
you generate a bar plot visualization? ☆☆
Topics: Pandas
Answer:
A Bar Plot is a plot that presents categorical data with rectangular bars with lengths proportional to the
values that they represent.
A bar plot shows comparisons among discrete categories.
One axis of the plot shows the specific categories being compared, and the other axis represents a measured
value.
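One way to generate such a plot is DataFrame.plot.bar, which relies on matplotlib being installed. The sketch below uses invented data and column names, and selects the non-interactive Agg backend only so that it runs without a display:

```python
import matplotlib

matplotlib.use("Agg")  # assumption: no display available; drop this line in a notebook

import pandas as pd

# Hypothetical categorical data: one bar per fruit.
sales = pd.DataFrame(
    {"fruit": ["apple", "banana", "cherry"], "units": [30, 45, 12]}
)

# x holds the categories, y the measured value behind each bar's height.
ax = sales.plot.bar(x="fruit", y="units", legend=False)
ax.set_ylabel("units")

heights = [patch.get_height() for patch in ax.patches]
```

Using plot.barh instead would draw the same data as horizontal bars.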
Topics: Pandas
Answer:
DataFrame.iterrows is a generator which yields both the index and row (as a Series):
import pandas as pd
10 100
11 110
12 120
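A minimal sketch consistent with the output above, assuming a hypothetical DataFrame with two columns named c1 and c2 (the names and the pairs list are invented for illustration):

```python
import pandas as pd

# Hypothetical data chosen to match the printed values above.
df = pd.DataFrame({"c1": [10, 11, 12], "c2": [100, 110, 120]})

pairs = []
for index, row in df.iterrows():
    # row is a Series, so columns are accessed by label.
    print(row["c1"], row["c2"])
    pairs.append((index, row["c1"], row["c2"]))
```

iterrows builds a Series per row, so it tends to be slow on large frames; vectorized operations are usually preferable when they are available.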
Topics: Pandas
Answer:
You can use the attribute df.empty to check whether it's empty or not:
if df.empty:
print('DataFrame is empty!')
Q9: If we have a date column in our dataset, then how will you
perform Feature Engineering using Python? ☆☆
Answer:
From a date column, we can get lots of important features such as:
Moreover, we can extract the date, month, and year from that column also.
All these features can impact our prediction and make our model robust. For example, in a case study, the sales
of the business can be impacted by the month or day of the week.
To perform this kind of feature engineering in Python, we must first convert the date column to a datetime
type using the Pandas library (with pd.to_datetime() ).
The month, the day of the month, and the hour can then be extracted through the .dt accessor ( .dt.month , .dt.day , .dt.hour ).
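A short sketch of the conversion and extraction steps, assuming a hypothetical column named date with made-up timestamps:

```python
import pandas as pd

# Hypothetical raw data: timestamps stored as strings.
df = pd.DataFrame({"date": ["2021-03-15 08:30:00", "2021-12-24 17:45:00"]})

# Step 1: convert the column to a datetime dtype.
df["date"] = pd.to_datetime(df["date"])

# Step 2: derive features through the .dt accessor.
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df["hour"] = df["date"].dt.hour
df["day_of_week"] = df["date"].dt.dayofweek  # Monday=0, Sunday=6

print(df)
```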
Topics: Pandas
Answer:
The function used for sorting in pandas is called DataFrame.sort_values() . It is used to sort a DataFrame by its
column or row values. The function comes with a lot of parameters, but the most important ones to consider for
sort are:
by : The required by parameter specifies the column name(s) or index label(s) used to determine the sorted
order.
axis : specifies whether to sort along rows ( 0 , the default) or columns ( 1 ),
ascending : specifies whether to sort the dataframe in ascending or descending order. The default is
ascending ( True ). To sort in descending order, we need to specify ascending=False .
Example
>>> df = pd.DataFrame({
'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
'col2': [2, 1, 9, 8, 7, 4],
'col3': [0, 1, 9, 4, 2, 3],
'col4': ['a', 'B', 'c', 'D', 'e', 'F']
})
>>> df
col1 col2 col3 col4
0 A 2 0 a
1 A 1 1 B
2 B 9 9 c
3 NaN 8 4 D
4 D 7 2 e
5 C 4 3 F
# Sort by col1
>>> df.sort_values(by=['col1'])
col1 col2 col3 col4
0 A 2 0 a
1 A 1 1 B
2 B 9 9 c
5 C 4 3 F
4 D 7 2 e
3 NaN 8 4 D
# Sort descending
>>> df.sort_values(by='col1', ascending=False)
col1 col2 col3 col4
4 D 7 2 e
5 C 4 3 F
2 B 9 9 c
0 A 2 0 a
1 A 1 1 B
3 NaN 8 4 D
Topics: Pandas
Answer:
Using the to_datetime() function, we can convert not only str but also int , float , list , and other objects to
datetime . For example,
B object
C float64
I_date datetime64[ns]
dtype: object
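A hedged sketch of to_datetime() on several input types. The column name I_date is taken from the dtype listing above; the values and the other variable names are made up:

```python
import pandas as pd

from_str = pd.to_datetime("2021-03-15")                   # single string
from_list = pd.to_datetime(["2021-03-15", "2021-03-16"])  # list of strings
from_int = pd.to_datetime(1, unit="D")                    # integer days since 1970-01-01

# Converting a string column in place changes its dtype to a datetime64 type.
df = pd.DataFrame({"I_date": ["2021-03-15", "2021-03-16"], "C": [1.5, 2.5]})
df["I_date"] = pd.to_datetime(df["I_date"])

print(df.dtypes)
```

For integers and floats the unit argument tells pandas how to interpret the number (for example "s" for seconds or "D" for days since the Unix epoch).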
Q12: What does describe() percentiles values tell about our data?
☆☆
Topics: Pandas
Answer:
The percentiles describe the distribution of your data: 50% is the value in the middle of the
data, also known as the median. 25% and 75% mark the boundaries of the lower and upper quarters of the data. With these we can get an
idea of how skewed our data is.
If the mean is higher than the median, the data is typically right-skewed.
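To make this concrete, here is a small sketch on invented data with one large outlier, which pulls the mean above the median:

```python
import pandas as pd

# Hypothetical sample: mostly small values plus one large outlier.
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 50])

summary = s.describe()
median = summary["50%"]
mean = summary["mean"]

print(summary)
print("mean > median:", mean > median)
```

Custom cut points can be requested with the percentiles argument, for example s.describe(percentiles=[0.1, 0.9]).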
Topics: Pandas
Answer:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
Topics: Pandas
Answer:
In general, it is safer to work on copies than on original DataFrames, except when you know that you won't be
needing the original anymore and want to proceed with the manipulated version.
This is because in Pandas, indexing a DataFrame may return a view of the initial DataFrame rather than a copy.
Changing the subset can then change the initial DataFrame (or trigger a SettingWithCopyWarning ). Thus, you'd
want to use copy() if you want to make sure the initial DataFrame doesn't change.
Normally, you would still have some use for the original data frame to compare with the manipulated version,
etc. Therefore, depending on the case it's a good practice to work on copies and merge at the end.
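A minimal sketch of working on a copy, using made-up data and variable names:

```python
import pandas as pd

original = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# copy() makes a deep copy by default, so edits do not reach the original.
working = original.copy()
working.loc[0, "a"] = 100

print(original.loc[0, "a"], working.loc[0, "a"])
```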
Topics: Pandas
Answer:
The in operator in Python tests dictionary keys, not values. In Pandas, Series are dict-like, therefore, the
in operator on a Series tests for membership in the index, not membership among the values. If we want
to test for membership in the values, we use the method isin() .
For DataFrames , likewise, in applies to the column axis, testing for membership in the list of column
names.
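The behaviour described above can be checked on a small example (data and variable names are invented for illustration):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["x", "y", "z"])

label_in_s = "x" in s              # True: 'x' is an index label
value_in_s = 10 in s               # False: 10 is a value, not a label
value_found = s.isin([10]).any()   # True: isin() tests the values

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
has_column = "A" in df             # True: tests the column names
```

isin() returns a boolean Series aligned with the original, so it can also be used directly as a filter, e.g. s[s.isin([10, 30])].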
Q16: How can you find the row for which the value of a specific
column is max or min? ☆☆
Topics: Pandas
Problem:
Solution:
Maximal value:
>>> df['A'].idxmax()
4
Minimal value:
>>> df['A'].idxmin()
1
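idxmax() and idxmin() return index labels, which can be passed to loc to fetch the full rows. A sketch on hypothetical data, chosen so that the labels match the values shown above (column 'B' is invented):

```python
import pandas as pd

# Hypothetical data: the maximum of 'A' sits at label 4, the minimum at label 1.
df = pd.DataFrame({"A": [3, 1, 7, 5, 9], "B": list("vwxyz")})

max_label = df["A"].idxmax()
min_label = df["A"].idxmin()

# Use the labels with loc to retrieve the complete rows.
max_row = df.loc[max_label]
min_row = df.loc[min_label]
```

If several rows share the extreme value, idxmax() and idxmin() return the label of the first occurrence.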
Topics: Pandas
Answer:
In the first stage of the process, data contained in a pandas object, whether a Series , DataFrame , or
otherwise, is split into groups based on one or more keys that we provide.
The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its
rows (axis=0) or its columns (axis=1) , although grouping on columns via axis=1 is deprecated in recent pandas versions.
Once this is done, a function is applied to each group, producing a new value. Finally, the results of all those
function applications are combined into a result object. The form of the resulting object will usually depend
on what's being done to the data.
In the figure below, this process is illustrated for a simple group aggregation.
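The split, apply, and combine stages described above can be sketched by hand and compared with a single groupby call. The data and names below are hypothetical:

```python
import pandas as pd

df = pd.DataFrame(
    {"key": ["a", "b", "a", "b", "c"], "value": [1, 2, 3, 4, 5]}
)

# Split: iterate over one sub-frame per key.
# Apply: sum the values of each group.
pieces = {key: group["value"].sum() for key, group in df.groupby("key")}

# Combine: collect the per-group results into one object.
manual = pd.Series(pieces, name="value")

# groupby performs all three stages in one expression.
combined = df.groupby("key")["value"].sum()

print(combined)
```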
Topics: Pandas
Problem:
df = pd.DataFrame(['2017-12-05',
'2016-12-05',
'2017-12-05',
'2015-12-05',
'2017-12-06',
'2018-12-06',
'2019-12-05',
'2019-11-05',
'2020-12-05',
'2017-12-07'], columns=['date'])
Solution:
We first convert the date column from string dtype to datetime dtype and then we use value_counts() on the
year attribute.
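A sketch of that approach on the data from the problem statement (the variable name year_counts is an assumption):

```python
import pandas as pd

df = pd.DataFrame(['2017-12-05',
                   '2016-12-05',
                   '2017-12-05',
                   '2015-12-05',
                   '2017-12-06',
                   '2018-12-06',
                   '2019-12-05',
                   '2019-11-05',
                   '2020-12-05',
                   '2017-12-07'], columns=['date'])

# Convert the strings to datetime, then count occurrences of each year.
df['date'] = pd.to_datetime(df['date'])
year_counts = df['date'].dt.year.value_counts()

print(year_counts)
```

value_counts() sorts by frequency in descending order by default; chaining .sort_index() would order the result by year instead.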
Topics: Pandas
Answer:
A succinct way to convert a single column of boolean values to a column of integers 1 or 0 is:
df["somecolumn"] = df["somecolumn"].astype(int)
Topics: Pandas
Answer:
To replace missing values in a Pandas DataFrame we can use the fillna() function. Some of the options
available with this function are:
pad / ffill : propagate the last valid observation forward to the next valid one with df.fillna(method="pad") .
backfill / bfill : use the next valid observation to fill the missing value with df.fillna(method="bfill") .
Recent pandas versions deprecate the method argument of fillna() in favor of df.ffill() and df.bfill() .
Replace NaN with a scalar value with df.fillna(n) , where n can be int , str , etc.
Replace NaN with a PandasObject : the use case of this is to fill a DataFrame with the result of
applying a function to a column. For example, replace NaN values with the column means
( df.fillna(df.mean()) ).
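The options above can be compared side by side on a small made-up DataFrame. This sketch uses df.ffill() and df.bfill() for forward and backward filling; the data and variable names are hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, np.nan]}
)

forward = df.ffill()            # propagate the last valid value forward
backward = df.bfill()           # use the next valid value
scalar = df.fillna(0)           # replace NaN with a constant
by_mean = df.fillna(df.mean())  # replace NaN with each column's mean

print(forward, backward, scalar, by_mean, sep="\n\n")
```

Forward filling leaves a leading NaN untouched (and backward filling a trailing one), because there is no earlier (or later) valid value to propagate.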