
MLStack.Cafe - Kill Your Data Science & ML Interview

Q1: How to create new columns derived from existing columns in Pandas? ☆

Topics: Pandas

Answer:

We create a new column by assigning the output to the DataFrame with a new column name between the [] brackets.
Let's say we want to create a new column 'C' whose values are the element-wise product of column 'A' and column 'B' . The operation is easy to implement and works element-wise, so there is no need to loop over rows.

import pandas as pd

# Create example data
df = pd.DataFrame({
    "A": [420, 380, 390],
    "B": [50, 40, 45]
})

df["C"] = df["A"] * df["B"]

Other mathematical operators ( + , - , * , / ) and comparison operators ( < , > , == , … ) also work element-wise.
If we need more advanced logic, we can use arbitrary Python code via apply() .
Depending on the case, we can use rename with a dictionary or function to rename row labels or column
names according to the problem.
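For the apply() route mentioned above, a minimal sketch (the threshold and the "high"/"low" labels are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"A": [420, 380, 390], "B": [50, 40, 45]})

# Hypothetical custom logic: label a row "high" when A * B exceeds a threshold
df["label"] = df.apply(
    lambda row: "high" if row["A"] * row["B"] > 18000 else "low",
    axis=1,
)
print(df["label"].tolist())  # ['high', 'low', 'low']
```

Note that apply() with axis=1 runs Python code per row, so it is slower than vectorized operators and worth using only when element-wise operators cannot express the logic.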

Q2: How do you count unique values per group with Pandas? ☆

Topics: Pandas

Problem:

You are given the following dataframe:

>>> data = {'ID': [123, 123, 123, 456, 456, 456, 456, 789, 789],
...         'domain': ['vk.com', 'vk.com', 'twitter.com', 'vk.com', 'facebook.com',
...                    'vk.com', 'google.com', 'twitter.com', 'vk.com']}

>>> df = pd.DataFrame(data)
>>> df
ID domain
0 123 vk.com
1 123 vk.com
2 123 twitter.com
3 456 vk.com
4 456 facebook.com
5 456 vk.com
6 456 google.com


7 789 twitter.com
8 789 vk.com

You are required to count unique ID values in every domain .

Solution:

We can use the nunique() function:

>>> df = df.groupby('domain')['ID'].nunique()
>>> df
domain
facebook.com 1
google.com 1
twitter.com 2
vk.com 3
Name: ID, dtype: int64

Q3: How are iloc and loc different? ☆☆

Topics: Pandas

Answer:

DataFrame.iloc retrieves data from a DataFrame by integer position (from 0 to length-1 of the axis), but it may also be used with a boolean array. It accepts an integer, a list or array of integers, a slice object, a boolean array, or a function.

df.iloc[0]
df.iloc[-5:]
df.iloc[:, 2] # the : in the first position indicates all rows
df.iloc[:3, :3] # The upper-left 3 X 3 entries (assuming df has 3+ rows and columns)

DataFrame.loc gets rows (and/or columns) with particular labels. It accepts a single label, a list or array of labels, or a slice object with labels.

df = pd.DataFrame(index=['a', 'b', 'c'], columns=['time', 'date', 'name'])

df.loc['a'] # equivalent to df.iloc[0]
df.loc['b':, 'date'] # equivalent to df.iloc[1:, 1]

Q4: What are the operations that the Pandas groupby method is based on? ☆☆

Topics: Pandas

Answer:

Splitting the data into groups based on some criteria.
Applying a function to each group independently.
Combining the results into a data structure.
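The three operations above can be sketched on a toy frame (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "points": [1, 2, 3, 4],
})

# Split by "team", apply sum() to each group, combine into a new Series
totals = df.groupby("team")["points"].sum()
print(totals.to_dict())  # {'a': 3, 'b': 7}
```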


Q5: Describe how you will get the names of columns of a DataFrame in Pandas ☆☆

Topics: Pandas

Answer:

By simply iterating over the columns and printing the values:

for col in data.columns:
    print(col)

Using the .columns attribute of the DataFrame object, which returns the column labels of the DataFrame.

list(data.columns)

Using the columns.values attribute to return an array of column labels.

list(data.columns.values)

Using the sorted() function, which returns the list of columns sorted in alphabetical order.

sorted(data)

Q6: In Pandas, what do you understand as a bar plot and how can
you generate a bar plot visualization ☆☆

Topics: Pandas

Answer:

A Bar Plot is a plot that presents categorical data with rectangular bars with lengths proportional to the
values that they represent.
A bar plot shows comparisons among discrete categories.
One axis of the plot shows the specific categories being compared, and the other axis represents a measured
value.

# Code sample for how to plot
df.plot.bar(x='x_values', y='y_values')

Page 3 of 11
MLStack.Cafe - Kill Your Data Science & ML Interview

Q7: How would you iterate over rows in a DataFrame in Pandas? ☆☆

Topics: Pandas

Answer:

DataFrame.iterrows is a generator which yields both the index and row (as a Series):

import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

for index, row in df.iterrows():
    print(row['c1'], row['c2'])

10 100
11 110
12 120
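As a side note, iterrows can be slow on large DataFrames because it builds a Series per row; itertuples , which yields namedtuples, is usually faster. A small sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})

# itertuples yields one namedtuple per row; fields are accessed by name
total = 0
for row in df.itertuples(index=False):
    total += row.c1 + row.c2
print(total)  # 363
```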

Q8: How to check whether a Pandas DataFrame is empty? ☆☆

Topics: Pandas

Answer:

You can use the attribute df.empty to check whether it's empty or not:

if df.empty:
print('DataFrame is empty!')

Page 4 of 11
MLStack.Cafe - Kill Your Data Science & ML Interview

Q9: If we have a date column in our dataset, then how will you
perform Feature Engineering using Python? ☆☆

Topics: Pandas Dimensionality Reduction Feature Engineering

Answer:

From a date column, we can get lots of important features such as:

day of the week,


day of the month,
day of the quarter, and
day of the year, etc.

Moreover, we can extract the date, month, and year from that column also.

All these features can impact our prediction and make our model robust. For example, in a case study, the sales
of the business can be impacted by the month or day of the week.

To perform this kind of feature engineering in Python, we must convert the data type of the date column to a datetime type using the Pandas library as follows:

# convert date_column to datetime type
df.date_column = pd.to_datetime(df.date_column)

Now to extract the month, the day of the month, and the hour we use the following commands:

# extract month feature
months = df.date_column.dt.month

# extract day of month feature
day_of_months = df.date_column.dt.day

# extract hour feature
hours = df.date_column.dt.hour
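The other calendar features listed above come from the same dt accessor; a minimal sketch with made-up dates:

```python
import pandas as pd

df = pd.DataFrame({"date_column": ["2021-03-28 14:15:00", "2021-12-31 09:00:00"]})
df["date_column"] = pd.to_datetime(df["date_column"])

df["day_of_week"] = df["date_column"].dt.dayofweek   # Monday=0 ... Sunday=6
df["quarter"] = df["date_column"].dt.quarter
df["day_of_year"] = df["date_column"].dt.dayofyear
print(df[["day_of_week", "quarter", "day_of_year"]])
```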

Q10: How can you sort the DataFrame? ☆☆

Topics: Pandas

Answer:

The function used for sorting in pandas is called DataFrame.sort_values() . It is used to sort a DataFrame by its
column or row values. The function comes with a lot of parameters, but the most important ones to consider for
sort are:

by : the optional by parameter specifies the column/row label(s) used to determine the sort order.
axis : specifies whether to sort rows ( 0 ) or columns ( 1 ).
ascending : specifies whether to sort the dataframe in ascending or descending order. The default is ascending; to sort in descending order, we need to specify ascending=False .

Example


>>> df = pd.DataFrame({
'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
'col2': [2, 1, 9, 8, 7, 4],
'col3': [0, 1, 9, 4, 2, 3],
'col4': ['a', 'B', 'c', 'D', 'e', 'F']
})

>>> df
col1 col2 col3 col4
0 A 2 0 a
1 A 1 1 B
2 B 9 9 c
3 NaN 8 4 D
4 D 7 2 e
5 C 4 3 F

# Sort by col1
>>> df.sort_values(by=['col1'])
col1 col2 col3 col4
0 A 2 0 a
1 A 1 1 B
2 B 9 9 c
5 C 4 3 F
4 D 7 2 e
3 NaN 8 4 D

# Sort by multiple columns
>>> df.sort_values(by=['col1', 'col2'])
col1 col2 col3 col4
1 A 1 1 B
0 A 2 0 a
2 B 9 9 c
5 C 4 3 F
4 D 7 2 e
3 NaN 8 4 D

# Sort descending
>>> df.sort_values(by='col1', ascending=False)
col1 col2 col3 col4
4 D 7 2 e
5 C 4 3 F
2 B 9 9 c
0 A 2 0 a
1 A 1 1 B
3 NaN 8 4 D

Q11: How to convert str to datetime format in Pandas? ☆☆

Topics: Pandas

Answer:

Using the to_datetime() function we can convert not only str but also int , float , list and more objects to
datetime . For example,

>>> df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
...                    'B': ['one', 'one', 'two', 'three'],
...                    'C': np.random.randn(4),
...                    'I_date': ['28-03-2021 2:15:00 PM', '28-03-2021 2:17:28 PM',
...                               '28-03-2021 2:50:50 PM', '28-03-2021 2:50:50 PM']})

>>> df['I_date'] = pd.to_datetime(df['I_date'])

>>> df.dtypes
A object


B object
C float64
I_date datetime64[ns]
dtype: object
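When the string layout is known in advance, passing an explicit format avoids ambiguous day/month parsing and is usually faster; a small sketch with made-up timestamps:

```python
import pandas as pd

s = pd.Series(['28-03-2021 2:15:00 PM', '05-04-2021 9:30:10 AM'])

# Explicit format: day-month-year with a 12-hour clock and an AM/PM marker
parsed = pd.to_datetime(s, format='%d-%m-%Y %I:%M:%S %p')
print(parsed.dt.month.tolist())  # [3, 4]
```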

Q12: What does describe() percentiles values tell about our data?
☆☆

Topics: Pandas

Answer:

The percentiles describe the distribution of your data: the 50% value is the middle of the data, also known as the median. The 25% and 75% values are the borders of the lower and upper quarters of the data. With these we can get an idea of how skewed our data is.

If the mean is higher than the median, the data is right-skewed.
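A minimal sketch of reading skew from describe() output (the data values are made up):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])  # one large outlier pulls the mean up

stats = s.describe()
print(stats['50%'])   # 3.0  (median, unaffected by the outlier)
print(stats['mean'])  # 22.0 (mean > median: right-skewed)
```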

Q13: Define the different ways a DataFrame can be created in Pandas ☆☆

Topics: Pandas

Answer:

We can create a DataFrame using the following ways:

Constructing DataFrame from a dictionary:

>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
col1 col2
0 1 3
1 2 4

Constructing a DataFrame from a NumPy ndarray:

>>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
...                    columns=['a', 'b', 'c'])
>>> df2


a b c
0 1 2 3
1 4 5 6
2 7 8 9

Q14: Why should you make a copy of a DataFrame in Pandas? ☆☆

Topics: Pandas

Answer:

In general, it is safer to work on copies than on original DataFrames, except when you know that you won't be
needing the original anymore and want to proceed with the manipulated version.

This is because in Pandas, indexing a DataFrame can return a reference to the initial DataFrame, so changing the subset will change the initial DataFrame. Therefore, you'd want to use a copy if you want to make sure the initial DataFrame doesn't change.

Normally, you would still have some use for the original data frame to compare with the manipulated version,
etc. Therefore, depending on the case it's a good practice to work on copies and merge at the end.
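A minimal sketch of working on a copy (the column name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})

# An explicit copy is independent of the original
subset = df.copy()
subset["A"] = subset["A"] * 10

print(subset["A"].tolist())  # [10, 20, 30]
print(df["A"].tolist())      # [1, 2, 3]  (original unchanged)
```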

Q15: What does the in operator do in Pandas? ☆☆

Topics: Pandas

Answer:

The in operator in Python tests dictionary keys, not values. In Pandas, Series are dict-like, therefore, the
in operator on a Series tests for membership in the index, not membership among the values. If we want
to test for membership in the values, we use the method isin() .

For DataFrames , likewise, in applies to the column axis, testing for membership in the list of column
names.
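A minimal sketch of both behaviors (the index labels and values are made up):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print('a' in s)            # True  -- `in` tests membership in the index
print(10 in s)             # False -- 10 is a value, not an index label
print(s.isin([10]).any())  # True  -- isin() tests membership among the values

df = pd.DataFrame({'x': [1, 2]})
print('x' in df)           # True  -- for DataFrames, `in` tests column names
```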

Q16: How can you find the row for which the value of a specific
column is max or min? ☆☆

Topics: Pandas

Problem:

>>> import pandas as pd
>>> import numpy as np

>>> df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
>>> df
A B C
0 -0.068471 -0.006429 1.453785
1 -0.655960 0.084291 0.344351
2 -0.058856 0.025537 1.303488
3 -0.300120 -0.207405 1.108704
4 2.027010 0.190007 -0.064194

Solution:

Use the Pandas idxmax and idxmin functions. It's straightforward:


Maximal value:

>>> df['A'].idxmax()
4

Minimal value:

>>> df['A'].idxmin()
1

Q17: How does the groupby() method work in Pandas? ☆☆

Topics: Pandas

Answer:

In the first stage of the process, data contained in a pandas object, whether a Series , DataFrame , or
otherwise, is split into groups based on one or more keys that we provide.

The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its
rows (axis=0) or its columns (axis=1) .

Once this is done, a function is applied to each group, producing a new value. Finally, the results of all those
function applications are combined into a result object. The form of the resulting object will usually depend
on what's being done to the data.

[Figure: the split-apply-combine process illustrated for a simple group aggregation.]

Q18: How to get a count of the number of observations for each year in the example dataframe? ☆☆


Topics: Pandas

Problem:

df = pd.DataFrame(['2017-12-05',
'2016-12-05',
'2017-12-05',
'2015-12-05',
'2017-12-06',
'2018-12-06',
'2019-12-05',
'2019-11-05',
'2020-12-05',
'2017-12-07'], columns=['date'])

Solution:

We first convert the date column from string dtype to datetime dtype, and then use value_counts() on the year attribute.

>>> df['date'] = pd.to_datetime(df['date'])
>>> df['date'].dt.year.value_counts()
2017 4
2019 2
2016 1
2015 1
2018 1
2020 1
Name: date, dtype: int64

Q19: A column in a df has boolean True/False values, but for further calculations, we need 1/0 representation. How would you transform it? ☆☆

Topics: Pandas

Answer:

A succinct way to convert a single column of boolean values to a column of integers 1 or 0 is:

df["somecolumn"] = df["somecolumn"].astype(int)
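A minimal runnable sketch (the column name follows the snippet above):

```python
import pandas as pd

df = pd.DataFrame({"somecolumn": [True, False, True]})

# astype(int) maps True -> 1 and False -> 0
df["somecolumn"] = df["somecolumn"].astype(int)
print(df["somecolumn"].tolist())  # [1, 0, 1]
```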

Q20: Name some methods you know to replace NaN values of a DataFrame in Pandas ☆☆

Topics: Pandas

Answer:

To replace missing values in a Pandas DataFrame we can use the fillna() function. In Pandas, some methods available to use with this function are:

pad / ffill : propagate the last valid observation forward to the next valid one with df.fillna(method="pad") .
backfill / bfill : use the next valid observation to fill the missing value with df.fillna(method="bfill") .
Replace NaN with a scalar value with df.fillna(n) , where n can be int , str , etc.
Replace NaN with a PandasObject : the use case of this is to fill a DataFrame with the result of applying a function to a column, for example replacing NaN values with the mean of some column ( df.fillna(df.mean()) ).
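A minimal sketch of these options on a small Series; note that in recent Pandas versions the method= keyword is deprecated in favor of the dedicated ffill() / bfill() methods:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan])

print(s.fillna(0).tolist())         # [1.0, 0.0, 3.0, 0.0] -- scalar fill
print(s.ffill().tolist())           # [1.0, 1.0, 3.0, 3.0] -- forward fill
print(s.fillna(s.mean()).tolist())  # [1.0, 2.0, 3.0, 2.0] -- fill with the mean
```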
