P03 Introduction To Pandas Ans
Reference:
10 minutes to pandas
List of comprehensive pandas tutorials
In [1]: import pandas as pd
import numpy as np
The items in a Series object are referenced by an index and stored in values . By default, an item's position is used as its index . In the output below, the first column is the index of the data, whereas the second column [4, 7, -5, 3] stores the items (the values of the data) themselves.
In [2]:
s = pd.Series([4, 7, -5, 3])
s
Out[2]: 0 4
1 7
2 -5
3 3
dtype: int64
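The index and values attributes described above can be inspected directly. A minimal, self-contained sketch (variable names are illustrative):

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3])
print(list(s.index))   # default positional index: [0, 1, 2, 3]
print(list(s.values))  # the stored items: [4, 7, -5, 3]
```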
1. Initialize using a list and specify the index through the parameter index
2. Initialize using a dictionary. The index is specified by the key of the dictionary.
In [4]:
s = pd.Series([21, 'Two', 39, -4], index = ['ONE', 'TWO', 'THREE', 'FOUR'])
s
Out[4]: ONE 21
TWO Two
THREE 39
FOUR -4
dtype: object
In [5]:
data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
s = pd.Series(data)
s
Out[5]: Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
You can access the item of a series by means of its index name
In [6]:
s['Ohio']
Out[6]: 35000
You can also access the item of a series by means of its position
In [7]:
s[0]
Out[7]: 35000
In [8]:
s.index
Out[8]: Index(['Ohio', 'Texas', 'Oregon', 'Utah'], dtype='object')
In [9]:
s.index = ['Perak', 'Penang', 'Selangor', 'Melaka']
s
Perak 35000
Penang 71000
Selangor 16000
Melaka 5000
dtype: int64
In [10]:
s = pd.Series({'a' : 0.3, 'b' : 1., 'c' : -12., 'd' : -3., 'e' : 15})
s
Out[10]: a 0.3
b 1.0
c -12.0
d -3.0
e 15.0
dtype: float64
In [11]:
s[0]
Out[11]: 0.3
In [12]:
s[1:3]
Out[12]: b 1.0
c -12.0
dtype: float64
In [13]:
s[s > 0]
Out[13]: a 0.3
b 1.0
e 15.0
dtype: float64
In [14]:
s * 2
Out[14]: a 0.6
b 2.0
c -24.0
d -6.0
e 30.0
dtype: float64
In [15]:
np.exp(s)
Out[15]: a 1.349859e+00
b 2.718282e+00
c 6.144212e-06
d 4.978707e-02
e 3.269017e+06
dtype: float64
In [16]:
s.sum()
Out[16]: 1.3000000000000007
In [17]:
s.mean()
Out[17]: 0.2600000000000001
Exercise 1
Q1. Given the following table:
year population
1995 56656
2000 70343
2005 93420
2010 122330
2015 223234
(a) Create a dictionary named data where the key is the year and the value is the population. Then, use
data to create a series object named population .
In [18]:
data = {1995: 56656, 2000: 70343, 2005: 93420, 2010: 122330, 2015: 223234}
population = pd.Series(data)
population
1995 56656
2000 70343
2005 93420
2010 122330
2015 223234
dtype: int64
(b) Go through the index in the population and get all data for before 2010. (Hint: use for loop to loop
through index)
Expected output:
1995 : 56656
2000 : 70343
2005 : 93420
In [19]:
for year in population.index:
    if year < 2010:
        print(year, ':', population[year])
1995 : 56656
2000 : 70343
2005 : 93420
Q2. (a) Create the Series object X as shown below where each item is the square of the item's position. Try
to do this in a single line of code and do not use a dictionary to initialize the series. Use the parameter
index instead.
Hints: Use list comprehension to generate the list of indices where you should use 'Item{:d}'.format(i)
to generate the index at location i .
Item0 0
Item1 1
Item2 4
Item3 9
Item4 16
Item5 25
Item6 36
Item7 49
Item8 64
Item9 81
dtype: int32
In [20]:
X = pd.Series(np.arange(10)**2, index = ['Item{:d}'.format(i) for i in range(10)])
Out[20]: Item0 0
Item1 1
Item2 4
Item3 9
Item4 16
Item5 25
Item6 36
Item7 49
Item8 64
Item9 81
dtype: int32
(b) Select all items from X that are divisible by 3.
Ans:
Item0 0
Item3 9
Item6 36
Item9 81
dtype: int32
In [21]:
X[X%3 == 0]
Out[21]: Item0 0
Item3 9
Item6 36
Item9 81
dtype: int32
(c) Standardize the data as follows:
x*_i = (x_i - mean(X)) / std(X)
Ans:
Item0 -1.006893
Item1 -0.971564
Item2 -0.865575
Item3 -0.688927
Item4 -0.441620
Item5 -0.123654
Item6 0.264972
Item7 0.724257
Item8 1.254200
Item9 1.854803
dtype: float64
In [22]:
(X - X.mean())/X.std()
Out[22]: Item0 -1.006893
Item1 -0.971564
Item2 -0.865575
Item3 -0.688927
Item4 -0.441620
Item5 -0.123654
Item6 0.264972
Item7 0.724257
Item8 1.254200
Item9 1.854803
dtype: float64
Generating DataFrames
We can construct DataFrames from a NumPy array. The following code generates a DataFrame. By default, the index is the row position and the columns are the column positions of the samples.
In [23]:
df = pd.DataFrame(np.random.randn(6,4))
df
Out[23]: 0 1 2 3
We can specify the index names of the rows ( index ) and columns ( columns ).
In [24]:
df = pd.DataFrame(np.random.randn(6,4), index = ['r1', 'r2', 'r3', 'r4', 'r5', 'r6'],
                  columns = ['c1', 'c2', 'c3', 'c4'])
df
Out[24]: c1 c2 c3 c4
We can also construct DataFrames using a dictionary of lists. Each column can be of a different type. We can check the type of each column using the command df.dtypes .
In [25]:
data = {'year': [2000, 2001, 2003, 2004, 2005],
        'state': ['CityX', 'Tokyo', 'Kyoto', 'Hokaido', 'Osaka'],   # first entry not shown in the source; 'CityX' is a placeholder
        'population': [1.5, 1.7, 3.6, 2.4, 2.9]}                    # first entry not shown in the source; 1.5 is a placeholder
Most of the time, we would like to arrange the columns in a certain order. You can do so through the order you provide in the parameter columns .
In [26]:
df = pd.DataFrame(data, index = ['1st', '2nd', '3rd', '4th', '5th'], columns=['state', 'year', 'population'])
df
In [27]:
df.dtypes
year int64
population float64
dtype: object
In [28]:
df.index
In [29]:
df.columns
In [30]:
df.values
We can retrieve a specific item using the index and column names using the command .at[row_index,
column_index]
In [31]:
df.at['5th','state'] # Indexing order: [row, column]
Out[31]: 'Osaka'
In [32]:
df.at['5th', 'state'] = 'Nagasaki'
df
We can retrieve a specific item using the row and column positions using the command
.iat[row_position, column_position]
In [33]:
df.iat[4, 0]
Out[33]: 'Nagasaki'
In [34]:
df.iat[4, 0] = 'Osaka'
df
In [35]:
df['year'] # access column 'year' by dict-like notation
1st 2000
2nd 2001
3rd 2003
4th 2004
5th 2005
Name: year, dtype: int64
In [36]:
df['year'] = np.arange(2010, 2015) # modifies the content of column 'year'
df
To access multiple columns, pass a list containing the column names which we wish to access.
In [37]:
df[['state', 'population']]
(2) A column can also be accessed by attribute. Each column appears as an individual attribute of the DataFrame object.
In [38]:
df.state # Similar to df['state']
2nd Tokyo
3rd Kyoto
4th Hokaido
5th Osaka
In [39]:
df.year
1st 2010
2nd 2011
3rd 2012
4th 2013
5th 2014
In [40]:
df.population
2nd 1.7
3rd 3.6
4th 2.4
5th 2.9
In [41]:
df.loc[:, 'year']
1st 2010
2nd 2011
3rd 2012
4th 2013
5th 2014
In [42]:
df.loc[:, ['state', 'population']]
In [43]:
df.iloc[:, [1,2]]
Unlike NumPy arrays, we cannot use the row position or row index directly to access one row of a DataFrame.
In [44]:
#df[0] # This is not allowed. Will generate an error
(1) To access a DataFrame's row by its row position, we have to use slice indexing.
In [45]: df[0:1] # Access the first row
In [46]:
df[0:1] = [100, 'KL', 1200] # Change the content of the first row
df
In [47]:
df[1:3] # Read the second and third row
In [48]:
df[1:3] = [[200, 'TAPAH', 1300], [300, 'KAMPAR', 1400]] # Change the second and third rows
df
(2) We can also access specific rows of a DataFrame using .loc and .iloc attributes.
In [49]:
df.loc['3rd']
state 300
year KAMPAR
population 1400.0
Name: 3rd, dtype: object
In [50]:
df.iloc[2] # Read the third row (at position 2)
state 300
year KAMPAR
population 1400.0
Name: 3rd, dtype: object
Notes: Indexing can be quite confusing in pandas. In general, for columns, use the column name or a list of column names. For rows, use .loc on the index names, or .iloc or slicing on the row positions.
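The note above can be summarized with a small self-contained sketch (the frame below is illustrative, not the dataset used in this notebook):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}, index=['a', 'b', 'c'])
col = df['x']              # column: use the column name
row_by_name = df.loc['b']  # row: use .loc with the index name
row_by_pos = df.iloc[1]    # row: use .iloc with the position
print(row_by_name.equals(row_by_pos))  # True -- both refer to the same row
```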
In [52]:
df = pd.DataFrame(data, index = ['1st', '2nd', '3rd', '4th', '5th'], columns=['state', 'year',
df
In [53]:
df.iloc[[0,2], [1,2]] # Extract row 0 and 2 (`1st` and `3rd`), and column 1 and 2 (`year` a
In [54]:
df.loc[['1st','3rd'],['year','population']] # same as above
In [55]:
df.iloc[:, 1:3] # Extract items at all rows, 2nd and 3rd columns
In [56]:
df.iloc[1:3, :] # Extract items at 2nd and 3rd rows, all columns
Warning: Unlike lists in Python, pandas is designed to work with fully populated DataFrames. Minimize adding rows or columns to an existing DataFrame one at a time, as each addition may copy the whole frame.
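One common way to respect this warning is to accumulate the rows in an ordinary Python list first and build the DataFrame once at the end; a sketch under that assumption:

```python
import pandas as pd

rows = []                                   # collect plain dicts first
for i in range(3):
    rows.append({'id': i, 'score': i * 10})
df = pd.DataFrame(rows)                     # construct the fully populated frame once
print(df.shape)  # (3, 2)
```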
In [57]:
df.loc['6th'] = ['City Y', 2006, -2.0]
df
In [58]:
df['LargeCity'] = df['population'] > 2.5
df
df.drop returns a copy of the original Dataframe object. The original object remains unchanged. If you
wish to drop a row in the original object, set the parameter inplace to True.
In [59]:
df.drop('6th', axis = 0, inplace = True) # Drop the last row
df
In [60]:
df.drop(['3rd', '4th'], axis = 0, inplace = True) # Drop the 3rd and 4th row
df
In [61]:
print(df.drop('LargeCity', axis = 1)) # drop the column 'LargeCity' (not inplace)
df
In [62]:
df.drop(['year', 'population'], axis = 1) # drop the columns 'year' and 'population'
In [63]:
df = pd.DataFrame(np.random.uniform(-1, 1, (5, 3)), index = np.arange(5), columns = list('ABC'))
df
Out[63]: A B C
In [64]:
df['C']*10
Out[64]: 0 3.979915
1 -1.335913
2 -3.000748
3 -2.150644
4 -7.712557
In [65]:
1/df
Out[65]: A B C
In [66]:
df > 0 # Create an indicator matrix signifying whether each item in df is positive
Out[66]: A B C
In [67]:
df[df.A > 0] # Only retain rows with positive entries for column A
Out[67]: A B C
In [68]:
df[df > 0] # Only retain items that meet the criterion df > 0; items that fail become NaN
Out[68]: A B C
In [69]:
df['A'] > 0 # Filtering column A for values greater than 0
Out[69]: 0 True
1 False
2 False
3 True
4 False
In [70]:
np.exp(df) # Apply the exponential function on df
Out[70]: A B C
In [71]:
1/(1+np.exp(-df)) # Apply the sigmoid function on all items in df
Out[71]: A B C
In [72]:
df.mean() # get the mean of all columns
Out[72]: A -0.349208
B 0.462633
C -0.204399
dtype: float64
In [73]:
df.mean(axis = 1) # get the mean of all rows
Out[73]: 0 0.389441
1 -0.090459
2 -0.036764
3 -0.234218
4 -0.179623
dtype: float64
Exercise 2
Q1. Create the following DataFrame df to store the coursework and final examination marks of a group of students. You must use the ID as the index of the dataframe.
ID Name Programme Coursework Final
S2 Joseph CS 34.5 90
S3 Jonathan IA 25.5 30
S4 Linda IA 70.9 95
S5 Sophos CS 80.2 10
In [74]:
data = {'Name': ['Mandy', 'Joseph', 'Jonathan', 'Linda', 'Sophos'],
        'Programme': ['CS', 'CS', 'IA', 'IA', 'CS'],
        'Coursework': [65.0, 34.5, 25.5, 70.9, 80.2],   # Mandy's mark is not shown in the source; 65.0 is a placeholder
        'Final': [100, 90, 30, 95, 10]}
df = pd.DataFrame(data, index = ['S' + str(i) for i in range(1, 6)],
                  columns = ['Name', 'Programme', 'Coursework', 'Final'])
df
S2 Joseph CS 34.5 90
S3 Jonathan IA 25.5 30
S4 Linda IA 70.9 95
S5 Sophos CS 80.2 10
Hints: There are at least three ways to access a desired row. You can use either (1) .loc , (2) .iloc or (3)
slice indexing to do so.
Ans:
Name Jonathan
Programme IA
Coursework 25.5
Final 30
Hints: There are at least four different ways to access a desired column. You can use either (1) .loc , (2)
.iloc , (3) column indexing or (4) attribute.
Ans:
S1 100
S2 90
S3 30
S4 95
S5 10
In [76]:
# df.loc[:, 'Final']
# df.iloc[:, 1]
# df['Final']
df.Final
Out[76]: S1 100
S2 90
S3 30
S4 95
S5 10
Ans:
In [77]:
df[2:4]
S3 Jonathan IA 25.5 30
S4 Linda IA 70.9 95
Q5. Get the following information for all students: Name, Programme and Final.
Ans:
In [78]:
df[['Name', 'Programme', 'Final']]
S1 Mandy CS 100
S2 Joseph CS 90
S3 Jonathan IA 30
S4 Linda IA 95
S5 Sophos CS 10
Q6. Find all students whose final mark is less than 40.
Ans:
In [79]:
df[df.Final<40]
S3 Jonathan IA 25.5 30
S5 Sophos CS 80.2 10
Q7. Create a new column Average that averages the score of both the final and coursework marks.
Ans:
In [80]:
df['Average'] = (df.Final + df.Coursework)/2
df
In [81]:
df.drop(['S3', 'S5'], inplace=True)
df
The following code loads the file testresult_withheader.csv from the working directory. Open the file to see what we are reading. Note that the first row is the header, which specifies the name of each column.
The command .head() displays the first few (default 5) rows of the DataFrame. You can pass the number of rows as an argument, e.g. .head(2) .
In [82]:
loaded_df = pd.read_csv('testresult_withheader.csv')
loaded_df.head()
Notice that the first column of testresult_withheader.csv is used to index the dataset. We can specify this
through the parameter index_col .
In [83]:
df = pd.read_csv('testresult_withheader.csv', index_col=0)
df.head()
Some CSV files do not come with a header. Open the file testresult_noheader.csv in the working directory to see the file that we are going to load. Note that there is no header row in the file.
In [84]:
df = pd.read_csv('testresult_noheader.csv', header=None)
df.head()
Out[84]: 0 1 2 3 4 5 6 7
For files without headers, we may want to specify the column names ourselves. To do this, we can specify the names of the columns through the parameter names .
In [85]:
df = pd.read_csv('testresult_noheader.csv', names=['A', 'B', 'C', 'D', 'E', 'F', 'G'])
df.head()
Out[85]: A B C D E F G
In [86]:
df = pd.read_csv('testresult_withheader.csv', index_col = 0)
df
In [87]:
df.dtypes
last_name object
age float64
Gender object
State object
Test1 float64
Test2 object
dtype: object
This is because one of the samples has 'NAN' for Test2 . Pandas is unable to recognize it as a missing value, and has therefore treated the whole column as strings. By default, pandas will only consider the following entries as missing values:
1. empty space
2. NA
3. N/A
4. NaN
5. NULL
na_values
You can tell pandas to identify other types of missing value through the parameter na_values . The
following code shows how to specify the missing value in a DataFrame.
In [88]:
df = pd.read_csv('testresult_withheader.csv',
index_col = 0,
na_values=['NAN','-'])
df
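If testresult_withheader.csv is not at hand, the effect of na_values can be reproduced with a small in-memory CSV (the data below is made up for illustration):

```python
import pandas as pd
from io import StringIO

csv_text = "id,Test1,Test2\n1,50,NAN\n2,-,70\n"   # synthetic stand-in file
df = pd.read_csv(StringIO(csv_text), index_col=0, na_values=['NAN', '-'])
print(df.isnull().sum().sum())  # 2 -- both 'NAN' and '-' were parsed as missing
```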
In [89]:
df.dtypes
last_name object
age float64
Gender object
State object
Test1 float64
Test2 float64
dtype: object
We will demonstrate how to handle missing values later.
To view a small sample of a Series or DataFrame object, use df.head() to peek at the first few rows and df.tail() at the last few rows. The default number of rows to display is five, but you may pass in a custom number.
In [90]:
df = pd.read_csv('testresult_withheader.csv', index_col = 0, na_values=['NAN','-'])
In [91]:
df.tail(6) # Peek at the last 6 rows of df
We can use describe() to get statistics for all numerical columns of a DataFrame (excluding NaN data). Categorical columns are ignored.
In [92]:
stat = df.describe()
stat
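As a self-contained illustration of how describe() skips NaN entries and categorical columns (the toy frame below is not the test-result data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'mark': [50.0, np.nan, 70.0], 'grade': ['C', 'B', 'A']})
stat = df.describe()              # 'grade' (non-numeric) is dropped entirely
print(list(stat.columns))         # ['mark']
print(stat.loc['count', 'mark'])  # 2.0 -- the NaN entry is not counted
```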
In [93]:
df.Gender.value_counts()
Out[93]: Male 5
Female 4
In [94]:
df.State.value_counts()
Out[94]: Perak 3
Penang 3
Johor 2
Kedah 2
In [95]:
df = pd.DataFrame({"date": ["Jan", "Feb", "Mac", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
                   "book_sales": 2*np.arange(12) - 2 + 3*np.random.randn(12),
                   "cd_sales": np.arange(12) + 2*np.random.randn(12)})  # cd_sales definition not shown in the source; this line is a placeholder
df
In [96]:
import matplotlib.pyplot as plt # import Matplotlib plotting package
df.plot()
plt.show()
In [97]:
df.plot(x = 'date', y = ['book_sales', 'cd_sales'])
plt.show()
To plot multiple groups on the same axes, repeat the plot method, passing the target ax . Use different color and label keywords to distinguish the groups.
In [99]:
ax1 = df.plot(kind='scatter', x='book_sales', y='cd_sales', color='Red', label='CD')
In [100]:
s = pd.Series(range(-3, 4))
s
Out[100]: 0 -3
1 -2
2 -1
3 0
4 1
5 2
6 3
dtype: int64
In [101]: s[s > 0]
Out[101]: 4 1
5 2
6 3
dtype: int64
In [102]:
s[(s < -1) | (s > 0.5)]
Out[102]: 0 -3
1 -2
4 1
5 2
6 3
dtype: int64
In [103]:
s[(s >= -1) & (s <= 0.5)]
Out[103]: 2 -1
3 0
dtype: int64
In [104]:
s[~(s < 0)]
Out[104]: 3 0
4 1
5 2
6 3
dtype: int64
Filtering a DataFrame
We can also select samples (rows) that fulfil some desired criteria.
In [105]:
df = pd.read_csv('testresult_withheader.csv', index_col = 0, na_values=['NAN', 'NaN','-'])
df
The following code filters the list of persons who scored more than 50 for Test2.
In [107]:
df[df.Test2 > 50]
The following code filters the list of persons who scored more than 50 for both Test1 and Test2.
In [108]:
df[(df.Test1 > 50) & (df.Test2 > 50)]
In [109]:
df = pd.read_csv('testresult_withheader.csv', index_col = 0, na_values=['NAN','-'])
df
df.isnull() returns a boolean DataFrame of the same size as df , indicating whether each element is null.
In [110]:
df.isnull()
In [111]:
df.isnull().any()
last_name True
age True
Gender True
State False
Test1 True
Test2 True
dtype: bool
(2) Showing columns with missing values
In [112]:
columns_with_missing_values = df.columns[df.isnull().any()]
columns_with_missing_values
In [113]:
df[columns_with_missing_values]
The command df.dropna() only returns a copy of the result and does not update df itself. To update df , use the parameter inplace = True .
In [114]:
df.dropna(inplace = True)
df
In [115]:
df = pd.read_csv('testresult_withheader.csv', index_col = 0, na_values=['NAN', 'NaN','-'])
In [116]:
df.fillna(df.mean()) # replace missing values in numerical columns with their mean values
In [117]:
mean_of_test1 = df['Test1'].mean() # Fill missing values in column 'Test1'
df['Test1'].fillna(mean_of_test1, inplace = True)
df
We can save a DataFrame to a CSV file using .to_csv(filename) .
In [118]:
df.to_csv('testing.csv')
Exercise 3
The following questions process a dataset called chipotle.csv that records the orders received for different items sold by a company.
1. Each row records the details of a single transaction involving one particular item.
2. Each order can include multiple transactions and hence span multiple rows.
In [119]:
chipo = pd.read_csv('chipotle.csv')
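Since one order can span several rows, per-order figures require grouping by order_id. A toy sketch with made-up transactions in the same shape as chipotle.csv:

```python
import pandas as pd

chipo_toy = pd.DataFrame({'order_id': [1, 1, 2],
                          'quantity': [1, 2, 1],
                          'total_item_price': [2.39, 6.49, 10.98]})
# One order can span several rows, so per-order totals need a groupby
order_totals = chipo_toy.groupby('order_id')['total_item_price'].sum()
print(len(order_totals))          # 2 distinct orders
print(round(order_totals[1], 2))  # 8.88 -- order 1 spans two rows
```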
Ans:
In [120]:
chipo.head(10)
Answer: 4622
In [121]:
chipo.info()
#OR
chipo.shape[0]
<class 'pandas.core.frame.DataFrame'>
Out[121]: 4622
Answer: 5
In [122]:
chipo.shape[1]
Out[122]: 5
In [123]:
chipo.columns
Out[123]: Index(['order_id', 'quantity', 'item_name', 'choice_description',
'total_item_price'],
dtype='object')
Q6. How is the dataset indexed?
In [124]:
chipo.index
Q7. Get the list of all sold items and show the number of transactions involving them.
Answer:
...
In [125]:
chipo.item_name.value_counts()
Chips 211
Veggie Burrito 95
Barbacoa Burrito 91
Veggie Bowl 85
Carnitas Bowl 68
Barbacoa Bowl 66
Carnitas Burrito 59
Nantucket Nectar 27
Izze 20
Chicken Salad 9
Veggie Salad 6
Burrito 6
Steak Salad 4
Crispy Tacos 2
Salad 2
Bowl 2
Carnitas Salad 1
In [126]:
mostOrdered = chipo.item_name.value_counts()
Ans: 4972
In [127]:
chipo.quantity.sum()
Out[127]: 4972
Q10. How many orders were made in the period? Remember that each order can include multiple items and
hence span multiple rows. Hints: The number of orders is given by the number of unique order IDs.
Answer: 1834
In [128]:
total_orders = len(np.unique(chipo.order_id))
# OR
total_orders = chipo.order_id.value_counts().count()
total_orders
Out[128]: 1834
Answer: 18.795
In [129]:
chipo.total_item_price.sum() / total_orders
Out[129]: 18.795147219193023
Answer: 50
In [130]:
print(chipo.item_name.value_counts().count())
# OR
print(len(np.unique(chipo.item_name)))
50
50
Q13. How many transactions (rows) are there with total_item_price more than $10?
Answer: 1130
In [131]:
len(chipo[chipo.total_item_price > 10])
Out[131]: 1130
Answer: 4355
In [132]:
len(chipo[chipo.quantity == 1])
Out[132]: 4355
Q15. How many times people ordered more than one Canned Soda?
Answer: 20
In [133]:
len(chipo[(chipo.item_name == 'Canned Soda') & (chipo.quantity > 1)])
Out[133]: 20
Q16. Reload the dataset chipotle.csv to the variable chipo but you should also consider 'NIL' as the
missing value. Identify the columns with missing values.
Ans:
In [134]:
chipo = pd.read_csv('chipotle.csv', na_values=['NIL'])
chipo.info()
<class 'pandas.core.frame.DataFrame'>
Q17. Replace all the missing values in choice_description with the value 'NULL'.
Ans:
Check your solution using chipo.info() . Ensure that there is no more missing values for
choice_description .
In [135]:
chipo.choice_description.fillna('NULL', inplace=True)
chipo.info()
<class 'pandas.core.frame.DataFrame'>
Q18. Replace all the missing values in total_item_price with its mean value.
Ans:
Check your solution using chipo.info() . Make sure that there is no longer any columns with missing
values.
In [136]:
mean_value = chipo.total_item_price.mean()
chipo.total_item_price.fillna(mean_value, inplace=True)
chipo.info()
<class 'pandas.core.frame.DataFrame'>
Ans:
In [137]:
chipo.plot(kind='scatter', x='quantity', y='total_item_price')
plt.show()
It is also useful to build a crosstab (contingency table) to show the frequency of categorical columns in a DataFrame. Use pd.crosstab to generate the frequency table of the specified categorical columns.
In [138]:
data = {'state': ['Perak', 'Selangor', 'Perak', 'Selangor', 'Selangor', 'Selangor', 'Selangor', 'Kedah', 'Kedah', 'Perak'],
        'town' : ['Kampar', 'Jitra', 'Kampar', 'Rawang', 'Beruntung', 'Rawang', 'Rawang', 'Jitra', 'Kumit', 'Tapah'],
        'rating' : list('ABABCCBBCA') }
df = pd.DataFrame(data)
df
state town rating
0 Perak Kampar A
1 Selangor Jitra B
2 Perak Kampar A
3 Selangor Rawang B
4 Selangor Beruntung C
5 Selangor Rawang C
6 Selangor Rawang B
7 Kedah Jitra B
8 Kedah Kumit C
9 Perak Tapah A
In [139]:
pd.crosstab(df.rating, df.state)
state Kedah Perak Selangor
rating
A 0 3 0
B 1 0 3
C 1 0 2
In [140]:
pd.crosstab([df.state, df.town], df.rating)
Out[140]: rating A B C
state town
Kedah Jitra 0 1 0
Kumit 0 0 1
Perak Kampar 2 0 0
Tapah 1 0 0
Selangor Beruntung 0 0 1
Jitra 0 1 0
Rawang 0 2 1
Crosstabs can also be normalized to show percentages rather than counts using the normalize=True
argument.
In [141]:
pd.crosstab([df.state, df.town], df.rating, normalize = True)
Out[141]: rating A B C
state town
df.groupby
We can group a DataFrame based on one or more categorical columns. For the dataframe below, let's say we want to group the dengue cases by state.
The command df.groupby('State') generates the sample groups. Since there are two states in the data, there are two groups.
The command df.groupby('State').sum() performs the summation on each group. You can apply other types of statistical or arithmetic operations such as mean, std, max, min, etc.
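Since the cell below depends on a dengue-case table whose full contents are not reproduced here, the same idea can be sketched with made-up numbers:

```python
import pandas as pd

# Minimal stand-in for the dengue-case table (values are illustrative only)
df_toy = pd.DataFrame({'Town' : ['Kampar', 'Ipoh', 'Klang', 'Kajang'],
                       'State': ['Perak', 'Perak', 'Selangor', 'Selangor'],
                       '2013' : [600, 700, 300, 280]})
sums = df_toy.groupby('State')['2013'].sum()  # one row per state
print(sums['Perak'], sums['Selangor'])        # 1300 580
```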
In [142]:
df = pd.DataFrame({'Town' : ['Kampar', 'Ipoh', 'Taiping', 'Kuala Kangsar',
df
In [143]:
df.groupby('State').sum()
State
In [144]:
df.groupby('State').mean()
State
In [145]:
df.groupby('State')['2013'].sum()
Out[145]: State
Perak 2615
Selangor 1123
In [146]:
df.groupby('State')['2013'].mean()
Out[146]: State
Perak 653.75
Selangor 280.75
In [147]:
df.groupby(['State','DengueCase']).mean()
State DengueCase
Exercise 4
Q20. Create a dataframe to store the following data.
ID Name Programme Year Trimester CGPA
1 Tan Xuan Yong CS 2 1 2.3
2 Rifdean CE 1 2 3.2
3 Sharmila CS 2 1 2.1
4 Subramaniam CE 1 1 3.6
5 Bernard CE 2 2 4.0
In [148]:
df = pd.DataFrame({'Name': ['Tan Xuan Yong', 'Rifdean', 'Sharmila', 'Subramaniam', 'Bernard'],
                   'Programme': ['CS', 'CE', 'CS', 'CE', 'CE'],
                   'Year': [2,1,2,1,2],
                   'Trimester': [1,2,1,1,2],
                   'CGPA': [2.3, 3.2, 2.1, 3.6, 4.0]},
                  index = [1, 2, 3, 4, 5])
df
Name Programme Year Trimester CGPA
1 Tan Xuan Yong CS 2 1 2.3
2 Rifdean CE 1 2 3.2
3 Sharmila CS 2 1 2.1
4 Subramaniam CE 1 1 3.6
5 Bernard CE 2 2 4.0
Q21. Create a crosstab showing the frequency of the students by (1) Year and (2) Programme.
Answer:
Programme CE CS
Year
1 2 0
2 1 2
In [149]:
pd.crosstab(df.Year, df.Programme)
Out[149]: Programme CE CS
Year
1 2 0
2 1 2
Q22. Find the mean CGPA of each Programme.
Answer:
Programme
CE 3.6
CS 2.2
In [150]:
df.groupby('Programme')['CGPA'].mean()
Out[150]: Programme
CE 3.6
CS 2.2