0% found this document useful (0 votes)

90 views26 pages

Introduction To Pandas

The document introduces pandas, a Python library for data analysis. It provides two main data structures: Series, which is a one-dimensional labeled array; and DataFrames, which are two-dimensional labeled data structures. The document demonstrates how to create Series and DataFrames from various data types like lists, dictionaries, and how to perform basic operations on them like selecting columns, inserting new columns, and deleting columns.

Uploaded by

Okewunmi Paul

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

90 views26 pages

Introduction To Pandas

Uploaded by

Okewunmi Paul

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 26

5/9/2019 Introduction to Pandas

Pandas
pandas is a Python library for data analysis. It offers a number of data exploration, cleaning and
transformation operations that are critical in working with data in Python.

pandas build upon numpy and scipy providing easy-to-use data structures and data manipulation
functions with integrated indexing.

The main data structures pandas provides are Series and DataFrames. After a brief introduction to
these two data structures and data ingestion, the key features of pandas this notebook covers are:

Generating descriptive statistics on data

Data cleaning using built in pandas functions
Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data
Merging multiple datasets using dataframes

Additional Recommended Resources:

Python for Data Analysis by Wes McKinney

Python Data Science Handbook by Jake VanderPlas

Let's get started with our first pandas notebook!

Import Libraries
In [2]: import pandas as pd

Introduction to pandas Data Structures

pandas has two main data structures it uses, namely, Series and DataFrames.

pandas Series
pandas Series one-dimensional labeled array.

In [3]: ser = pd.Series([100, 'foo', 300, 'bar', 500], ['paul', 'bob', 'wale', 'dan', 'bo

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 1/26

5/9/2019 Introduction to Pandas

In [4]: ser

Out[4]: paul 100

bob foo
wale 300
dan bar
bola 500
dtype: object

In [5]: ser.index

Out[5]: Index(['paul', 'bob', 'wale', 'dan', 'bola'], dtype='object')

In [6]: ser.loc[['wale','bob']]

Out[6]: wale 300

bob foo
dtype: object

In [7]: ser[[4, 3, 1]]

Out[7]: bola 500

dan bar
bob foo
dtype: object

In [8]: ser.iloc[2]
Out[8]: 300

In [9]: 'bob' in ser

Out[9]: True

In [10]: ser

Out[10]: paul 100

bob foo
wale 300
dan bar
bola 500
dtype: object

In [11]: ser * 2
Out[11]: paul 200
bob foofoo
wale 600
dan barbar
bola 1000
dtype: object

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 2/26

5/9/2019 Introduction to Pandas

In [12]: ser[['bola', 'paul']] ** 2

Out[12]: bola 250000
paul 10000
dtype: object

pandas DataFrame
pandas DataFrame is a 2-dimensional labeled data structure.

Create DataFrame from dictionary of Python Series

In [13]: d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),

'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill

In [14]: df = pd.DataFrame(d)
print(df)
one two
apple 100.0 111.0
ball 200.0 222.0
cerill NaN 333.0
clock 300.0 NaN
dancy NaN 4444.0

In [15]: df.index
Out[15]: Index(['apple', 'ball', 'cerill', 'clock', 'dancy'], dtype='object')

In [16]: df.columns
Out[16]: Index(['one', 'two'], dtype='object')

In [17]: pd.DataFrame(d, index=['dancy', 'ball', 'apple'])

Out[17]:
one two

dancy NaN 4444.0

ball 200.0 222.0

apple 100.0 111.0

In [18]: pd.DataFrame(d, index=['dancy', 'ball', 'apple'], columns=['two', 'five'])

Out[18]:
two five

dancy 4444.0 NaN

ball 222.0 NaN

apple 111.0 NaN

Create DataFrame from list of Python dictionaries

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 3/26
5/9/2019 Introduction to Pandas

In [19]: data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]

In [20]: pd.DataFrame(data)
Out[20]:
alex alice dora ema joe

0 1.0 NaN NaN NaN 2.0

1 NaN 20.0 10.0 5.0 NaN

In [21]: pd.DataFrame(data, index=['orange', 'red'])

Out[21]:
alex alice dora ema joe

orange 1.0 NaN NaN NaN 2.0

red NaN 20.0 10.0 5.0 NaN

In [22]: pd.DataFrame(data, columns=['joe', 'dora','alice'])

Out[22]:
joe dora alice

0 2.0 NaN NaN

1 NaN 10.0 20.0

Basic DataFrame operations

In [23]: df

Out[23]:
one two

apple 100.0 111.0

ball 200.0 222.0

cerill NaN 333.0

clock 300.0 NaN

dancy NaN 4444.0

In [24]: df['one']

Out[24]: apple 100.0

ball 200.0
cerill NaN
clock 300.0
dancy NaN
Name: one, dtype: float64

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 4/26

5/9/2019 Introduction to Pandas

In [25]: df['three'] = df['one'] * df['two']

df
Out[25]:
one two three

apple 100.0 111.0 11100.0

ball 200.0 222.0 44400.0

cerill NaN 333.0 NaN

clock 300.0 NaN NaN

dancy NaN 4444.0 NaN

In [26]: df['flag'] = df['one'] > 250

df
Out[26]:
one two three flag

apple 100.0 111.0 11100.0 False

ball 200.0 222.0 44400.0 False

cerill NaN 333.0 NaN False

clock 300.0 NaN NaN True

dancy NaN 4444.0 NaN False

In [27]: three = df.pop('three')

In [28]: three
Out[28]: apple 11100.0
ball 44400.0
cerill NaN
clock NaN
dancy NaN
Name: three, dtype: float64

In [29]: df

Out[29]:
one two flag

apple 100.0 111.0 False

ball 200.0 222.0 False

cerill NaN 333.0 False

clock 300.0 NaN True

dancy NaN 4444.0 False

In [30]: del df['two']

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 5/26

5/9/2019 Introduction to Pandas

In [31]: df
Out[31]:
one flag

apple 100.0 False

ball 200.0 False

cerill NaN False

clock 300.0 True

dancy NaN False

In [32]: df.insert(2, 'copy_of_one', df['one'])

df
Out[32]:
one flag copy_of_one

apple 100.0 False 100.0

ball 200.0 False 200.0

cerill NaN False NaN

clock 300.0 True 300.0

dancy NaN False NaN

In [33]: df['one_upper_half'] = df['one'][:2]

df
Out[33]:
one flag copy_of_one one_upper_half

apple 100.0 False 100.0 100.0

ball 200.0 False 200.0 200.0

cerill NaN False NaN NaN

clock 300.0 True 300.0 NaN

dancy NaN False NaN NaN

Use Pandas to Read the Dataset

In this notebook, we will be using three CSV files:

ratings.csv : userId,movieId,rating, timestamp

tags.csv : userId,movieId, tag, timestamp
movies.csv : movieId, title, genres

Using the read_csv function in pandas, we will ingest these three files.

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 6/26

5/9/2019 Introduction to Pandas

In [34]: movies = pd.read_csv('movies.csv')

print(type(movies))
movies.head(15)
<class 'pandas.core.frame.DataFrame'>

Out[34]:
movieId title genres

0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy

1 2 Jumanji (1995) Adventure|Children|Fantasy

2 3 Grumpier Old Men (1995) Comedy|Romance

3 4 Waiting to Exhale (1995) Comedy|Drama|Romance

4 5 Father of the Bride Part II (1995) Comedy

5 6 Heat (1995) Action|Crime|Thriller

6 7 Sabrina (1995) Comedy|Romance

7 8 Tom and Huck (1995) Adventure|Children

8 9 Sudden Death (1995) Action

9 10 GoldenEye (1995) Action|Adventure|Thriller

10 11 American President, The (1995) Comedy|Drama|Romance

11 12 Dracula: Dead and Loving It (1995) Comedy|Horror

12 13 Balto (1995) Adventure|Animation|Children

13 14 Nixon (1995) Drama

14 15 Cutthroat Island (1995) Action|Adventure|Romance

In [35]: tags = pd.read_csv('tags.csv')

tags = tags.drop(["timestamp"], axis = 1)
tags.head()
Out[35]:
userId movieId tag

0 2 60756 funny

1 2 60756 Highly quotable

2 2 60756 will ferrell

3 2 89774 Boxing story

4 2 89774 MMA

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 7/26

5/9/2019 Introduction to Pandas

In [36]: ratings = pd.read_csv('ratings.csv', parse_dates=['timestamp'])

ratings = ratings.drop(['timestamp'], axis = 1)
ratings.head(20)
Out[36]:
userId movieId rating

0 1 1 4.0

1 1 3 4.0

2 1 6 4.0

3 1 47 5.0

4 1 50 5.0

5 1 70 3.0

6 1 101 5.0

7 1 110 4.0

8 1 151 5.0

9 1 157 5.0

10 1 163 5.0

11 1 216 5.0

12 1 223 3.0

13 1 231 5.0

14 1 235 4.0

15 1 260 5.0

16 1 296 3.0

17 1 316 3.0

18 1 333 5.0

19 1 349 4.0

Data Structures

Series

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 8/26

5/9/2019 Introduction to Pandas

In [37]: tags.head()
Out[37]:
userId movieId tag

0 2 60756 funny

1 2 60756 Highly quotable

2 2 60756 will ferrell

3 2 89774 Boxing story

4 2 89774 MMA

In [38]: #Extract 0th row: notice that it is infact a Series

row_0 = tags.iloc[0]
type(row_0)
Out[38]: pandas.core.series.Series

In [39]: print(row_0)

userId 2
movieId 60756
tag funny
Name: 0, dtype: object

In [40]: row_0.index

Out[40]: Index(['userId', 'movieId', 'tag'], dtype='object')

In [41]: row_0['userId']

Out[41]: 2

In [42]: 'rating' in row_0

Out[42]: False

In [43]: row_0.name
Out[43]: 0

In [44]: row_0 = row_0.rename('first_row')

row_0.name
Out[44]: 'first_row'

DataFrames

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 9/26

5/9/2019 Introduction to Pandas

In [45]: tags.head()

Out[45]:
userId movieId tag

0 2 60756 funny

1 2 60756 Highly quotable

2 2 60756 will ferrell

3 2 89774 Boxing story

4 2 89774 MMA

In [46]: tags.index
Out[46]: RangeIndex(start=0, stop=3683, step=1)

In [47]: tags.columns

Out[47]: Index(['userId', 'movieId', 'tag'], dtype='object')

In [48]: # Extract row 0, 11, 2000 from DataFrame

tags.iloc[ [0,11,2000] ]
Out[48]:
userId movieId tag

0 2 60756 funny

11 18 431 gangster

2000 474 5450 women

Descriptive Statistics
Let's look how the ratings are distributed!

In [49]: ratings['rating'].describe()

Out[49]: count 100836.000000

mean 3.501557
std 1.042529
min 0.500000
25% 3.000000
50% 3.500000
75% 4.000000
max 5.000000
Name: rating, dtype: float64

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 10/26

5/9/2019 Introduction to Pandas

In [50]: ratings.describe()
Out[50]:
userId movieId rating

count 100836.000000 100836.000000 100836.000000

mean 326.127564 19435.295718 3.501557

std 182.618491 35530.987199 1.042529

min 1.000000 1.000000 0.500000

25% 177.000000 1199.000000 3.000000

50% 325.000000 2991.000000 3.500000

75% 477.000000 8122.000000 4.000000

max 610.000000 193609.000000 5.000000

In [51]: ratings['rating'].mean()
Out[51]: 3.501556983616962

In [52]: ratings.mean()

Out[52]: userId 326.127564

movieId 19435.295718
rating 3.501557
dtype: float64

In [53]: ratings['rating'].min()
Out[53]: 0.5

In [54]: ratings['rating'].max()
Out[54]: 5.0

In [55]: ratings['rating'].std()

Out[55]: 1.0425292390605359

In [56]: ratings['rating'].mode()

Out[56]: 0 4.0
dtype: float64

In [57]: ratings.corr()
Out[57]:
userId movieId rating

userId 1.000000 0.006773 -0.049348

movieId 0.006773 1.000000 -0.004061

rating -0.049348 -0.004061 1.000000

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 11/26

5/9/2019 Introduction to Pandas

In [58]: filter_1 = ratings['rating'] > 5

#print(filter_1)
filter_1.any()
Out[58]: False

In [59]: filter_2 = ratings['rating'] > 0

filter_2.all()
Out[59]: True

Data Cleaning: Handling Missing Data

In [60]: movies.shape

Out[60]: (9742, 3)

In [61]: #is any row NULL ?

movies.isnull().any()
Out[61]: movieId False
title False
genres False
dtype: bool

Thats nice ! No NULL values !

In [62]: ratings.shape

Out[62]: (100836, 3)

In [63]: #is any row NULL ?

ratings.isnull().any()
Out[63]: userId False
movieId False
rating False
dtype: bool

Thats nice ! No NULL values !

In [64]: tags.shape
Out[64]: (3683, 3)

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 12/26

5/9/2019 Introduction to Pandas

In [65]: #is any row NULL ?

tags.isnull().any()
Out[65]: userId False
movieId False
tag False
dtype: bool

We have some tags which are NULL.

In [66]: tags = tags.dropna()

In [67]: #Check again: is any row NULL ?

tags.isnull().any()
Out[67]: userId False
movieId False
tag False
dtype: bool

In [68]: tags.shape
Out[68]: (3683, 3)

Thats nice ! No NULL values ! Notice the number of lines have reduced.

Data Visualization

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 13/26

5/9/2019 Introduction to Pandas

In [69]: %matplotlib inline

ratings.hist(column='rating', figsize=(15,10))
Out[69]: array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000096CE153CC0>]],
dtype=object)

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 14/26

5/9/2019 Introduction to Pandas

In [70]: ratings.boxplot(column='rating', figsize=(15,20))

Out[70]: <matplotlib.axes._subplots.AxesSubplot at 0x96ce192a58>

Slicing Out Columns

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 15/26

5/9/2019 Introduction to Pandas

In [71]: tags['tag'].head()

Out[71]: 0 funny
1 Highly quotable
2 will ferrell
3 Boxing story
4 MMA
Name: tag, dtype: object

In [72]: tags.head()

Out[72]:
userId movieId tag

0 2 60756 funny

1 2 60756 Highly quotable

2 2 60756 will ferrell

3 2 89774 Boxing story

4 2 89774 MMA

In [ ]:

In [73]: movies[['title','genres']].head()
Out[73]:
title genres

0 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy

1 Jumanji (1995) Adventure|Children|Fantasy

2 Grumpier Old Men (1995) Comedy|Romance

3 Waiting to Exhale (1995) Comedy|Drama|Romance

4 Father of the Bride Part II (1995) Comedy

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 16/26

5/9/2019 Introduction to Pandas

In [74]: ratings[-10:]

Out[74]:
userId movieId rating

100826 610 162350 3.5

100827 610 163937 3.5

100828 610 163981 3.5

100829 610 164179 5.0

100830 610 166528 4.0

100831 610 166534 4.0

100832 610 168248 5.0

100833 610 168250 5.0

100834 610 168252 5.0

100835 610 170875 3.0

In [75]: tags["tag"].head(15)

Out[75]: 0 funny
1 Highly quotable
2 will ferrell
3 Boxing story
4 MMA
5 Tom Hardy
6 drugs
7 Leonardo DiCaprio
8 Martin Scorsese
9 way too long
10 Al Pacino
11 gangster
12 mafia
13 Al Pacino
14 Mafia
Name: tag, dtype: object

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 17/26

5/9/2019 Introduction to Pandas

In [77]: tag_counts[:10].plot(kind='bar', figsize=(15,10))

Out[77]: <matplotlib.axes._subplots.AxesSubplot at 0x96ce4bd7b8>

Filters for Selecting Rows

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 18/26

5/9/2019 Introduction to Pandas

In [78]: is_highly_rated = ratings['rating'] >= 4.0

ratings[is_highly_rated][30:50]
Out[78]:
userId movieId rating

36 1 608 5.0

38 1 661 5.0

40 1 733 4.0

43 1 804 4.0

44 1 919 5.0

45 1 923 5.0

46 1 940 5.0

47 1 943 4.0

48 1 954 5.0

50 1 1023 5.0

51 1 1024 5.0

52 1 1025 5.0

53 1 1029 5.0

55 1 1031 5.0

56 1 1032 5.0

57 1 1042 4.0

58 1 1049 5.0

59 1 1060 4.0

60 1 1073 5.0

61 1 1080 5.0

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 19/26

5/9/2019 Introduction to Pandas

In [79]: is_animation = movies['genres'].str.contains('Animation')

movies[is_animation][5:15]
Out[79]:
movieId title genres

322 364 Lion King, The (1994) Adventure|Animation|Children|Drama|Musical|IMAX

483 551 Nightmare Before Christmas, The (1993) Animation|Children|Fantasy|Musical

488 558 Pagemaster, The (1994) Action|Adventure|Animation|Children|Fantasy

506 588 Aladdin (1992) Adventure|Animation|Children|Comedy|Musical

511 Snow White and the Seven Dwarfs

594 Animation|Children|Drama|Fantasy|Musical
(1937)

512 595 Beauty and the Beast (1991) Animation|Children|Fantasy|Musical|Romance|IMAX

513 596 Pinocchio (1940) Animation|Children|Fantasy|Musical

522 610 Heavy Metal (1981) Action|Adventure|Animation|Horror|Sci-Fi

527 616 Aristocats, The (1970) Animation|Children

534 631 All Dogs Go to Heaven 2 (1996) Adventure|Animation|Children|Fantasy|Musical|R...

In [80]: movies[is_animation].head(15)

Out[80]:
movieId title genres

0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy

12 13 Balto (1995) Adventure|Animation|Children

44 48 Pocahontas (1995) Animation|Children|Drama|Musical|Romance

205 239 Goofy Movie, A (1995) Animation|Children|Comedy|Romance

272 313 Swan Princess, The (1994) Animation|Children

322 364 Lion King, The (1994) Adventure|Animation|Children|Drama|Musical|IMAX

483 551 Nightmare Before Christmas, The (1993) Animation|Children|Fantasy|Musical

488 558 Pagemaster, The (1994) Action|Adventure|Animation|Children|Fantasy

506 588 Aladdin (1992) Adventure|Animation|Children|Comedy|Musical

511 Snow White and the Seven Dwarfs

594 Animation|Children|Drama|Fantasy|Musical
(1937)

512 595 Beauty and the Beast (1991) Animation|Children|Fantasy|Musical|Romance|IMAX

513 596 Pinocchio (1940) Animation|Children|Fantasy|Musical

522 610 Heavy Metal (1981) Action|Adventure|Animation|Horror|Sci-Fi

527 616 Aristocats, The (1970) Animation|Children

534 631 All Dogs Go to Heaven 2 (1996) Adventure|Animation|Children|Fantasy|Musical|R...

Group By and Aggregate

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 20/26
5/9/2019 Introduction to Pandas

In [81]: ratings_count = ratings[['movieId','rating']].groupby('rating').count()

ratings_count
Out[81]:
movieId

rating

0.5 1370

1.0 2811

1.5 1791

2.0 7551

2.5 5550

3.0 20047

3.5 13136

4.0 26818

4.5 8551

5.0 13211

In [82]: average_rating = ratings[['movieId','rating']].groupby('movieId').mean()

average_rating.head()
Out[82]:
rating

movieId

1 3.920930

2 3.431818

3 3.259615

4 2.357143

5 3.071429

In [83]: movie_count = ratings[['movieId','rating']].groupby('movieId').count()

movie_count.head()

Out[83]:
rating

movieId

1 215

2 110

3 52

4 7

5 49

Merge Dataframes
localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 21/26
5/9/2019 Introduction to Pandas

In [84]: tags.head()

Out[84]:
userId movieId tag

0 2 60756 funny

1 2 60756 Highly quotable

2 2 60756 will ferrell

3 2 89774 Boxing story

4 2 89774 MMA

In [85]: movies.head()

Out[85]:
movieId title genres

0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy

1 2 Jumanji (1995) Adventure|Children|Fantasy

2 3 Grumpier Old Men (1995) Comedy|Romance

3 4 Waiting to Exhale (1995) Comedy|Drama|Romance

4 5 Father of the Bride Part II (1995) Comedy

In [86]: t = movies.merge(tags, on='movieId', how='inner')

t.head()
Out[86]:
movieId title genres userId tag

0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 336 pixar

1 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 474 pixar

2 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 567 fun

3 2 Jumanji (1995) Adventure|Children|Fantasy 62 fantasy

4 2 Jumanji (1995) Adventure|Children|Fantasy 62 magic board game

Combine aggreagation, merging, and filters to get useful

analytics

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 22/26

5/9/2019 Introduction to Pandas

In [87]: avg_ratings = ratings.groupby('movieId', as_index=False).mean()

del avg_ratings['userId']
avg_ratings.head()
Out[87]:
movieId rating

0 1 3.920930

1 2 3.431818

2 3 3.259615

3 4 2.357143

4 5 3.071429

In [88]: movies.head()
Out[88]:
movieId title genres

0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy

1 2 Jumanji (1995) Adventure|Children|Fantasy

2 3 Grumpier Old Men (1995) Comedy|Romance

3 4 Waiting to Exhale (1995) Comedy|Drama|Romance

4 5 Father of the Bride Part II (1995) Comedy

In [89]: box_office = movies.merge(avg_ratings, on='movieId', how='inner')

box_office.head()

Out[89]:
movieId title genres rating

0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 3.920930

1 2 Jumanji (1995) Adventure|Children|Fantasy 3.431818

2 3 Grumpier Old Men (1995) Comedy|Romance 3.259615

3 4 Waiting to Exhale (1995) Comedy|Drama|Romance 2.357143

4 5 Father of the Bride Part II (1995) Comedy 3.071429

In [90]: is_highly_rated = box_office['rating'] >= 4.0

box_office[is_highly_rated][-5:]
Out[90]:
movieId title genres rating

9713 191005 Gintama (2017) Action|Adventure|Comedy|Sci-Fi 4.5

9716 193571 Silver Spoon (2014) Comedy|Drama 4.0

9717 193573 Love Live! The School Idol Movie (2015) Animation 4.0

9719 193581 Black Butler: Book of the Atlantic (2017) Action|Animation|Comedy|Fantasy 4.0

9723 193609 Andrew Dice Clay: Dice Rules (1991) Comedy 4.0

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 23/26

5/9/2019 Introduction to Pandas

In [91]: is_comedy = box_office['genres'].str.contains('Comedy')

box_office[is_comedy][:5]
Out[91]:
movieId title genres rating

0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 3.920930

2 3 Grumpier Old Men (1995) Comedy|Romance 3.259615

3 4 Waiting to Exhale (1995) Comedy|Drama|Romance 2.357143

4 5 Father of the Bride Part II (1995) Comedy 3.071429

6 7 Sabrina (1995) Comedy|Romance 3.185185

In [92]: box_office[is_comedy & is_highly_rated][-5:]

Out[92]:
movieId title genres rating

9708 190209 Jeff Ross Roasts the Border (2017) Comedy 4.0

9713 191005 Gintama (2017) Action|Adventure|Comedy|Sci-Fi 4.5

9716 193571 Silver Spoon (2014) Comedy|Drama 4.0

9719 193581 Black Butler: Book of the Atlantic (2017) Action|Animation|Comedy|Fantasy 4.0

9723 193609 Andrew Dice Clay: Dice Rules (1991) Comedy 4.0

Vectorized String Operations

In [93]: movies.head()

Out[93]:
movieId title genres

0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy

1 2 Jumanji (1995) Adventure|Children|Fantasy

2 3 Grumpier Old Men (1995) Comedy|Romance

3 4 Waiting to Exhale (1995) Comedy|Drama|Romance

4 5 Father of the Bride Part II (1995) Comedy

Split 'genres' into multiple columns

In [94]: movie_genres = movies['genres'].str.split('|', expand=True)

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 24/26

5/9/2019 Introduction to Pandas

In [95]: movie_genres[:10]

Out[95]:
0 1 2 3 4 5 6 7 8 9

0 Adventure Animation Children Comedy Fantasy None None None None None

1 Adventure Children Fantasy None None None None None None None

2 Comedy Romance None None None None None None None None

3 Comedy Drama Romance None None None None None None None

4 Comedy None None None None None None None None None

5 Action Crime Thriller None None None None None None None

6 Comedy Romance None None None None None None None None

7 Adventure Children None None None None None None None None

8 Action None None None None None None None None None

9 Action Adventure Thriller None None None None None None None

Add a new column for comedy genre flag

In [96]: movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')

In [97]: movie_genres[:10]

Out[97]:
0 1 2 3 4 5 6 7 8 9 isComedy

0 Adventure Animation Children Comedy Fantasy None None None None None True

1 Adventure Children Fantasy None None None None None None None False

2 Comedy Romance None None None None None None None None True

3 Comedy Drama Romance None None None None None None None True

4 Comedy None None None None None None None None None True

5 Action Crime Thriller None None None None None None None False

6 Comedy Romance None None None None None None None None True

7 Adventure Children None None None None None None None None False

8 Action None None None None None None None None None False

9 Action Adventure Thriller None None None None None None None False

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 25/26

5/9/2019 Introduction to Pandas

In [98]: movies['genres'].contains() == 'comedy'

---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-98-ac8ed1f9139a> in <module>()
----> 1 movies['genres'].contains() == 'comedy'

~\Anaconda3\anaconda\lib\site-packages\pandas\core\generic.py in __getattr__(se
lf, name)
3079 if name in self._info_axis:
3080 return self[name]
-> 3081 return object.__getattribute__(self, name)
3082
3083 def __setattr__(self, name, value):

AttributeError: 'Series' object has no attribute 'contains'

In [99]: movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
movieId 9742 non-null int64
title 9742 non-null object
genres 9742 non-null object
dtypes: int64(1), object(2)
memory usage: 228.4+ KB

In [ ]:

localhost:8888/notebooks/Desktop/AISaturday/Week-1-Pandas/Introduction to Pandas.ipynb 26/26

Veit Steinkamp - Python For Engineering and Scientific Computing-Rheinwerk Publishing (2024)
No ratings yet
Veit Steinkamp - Python For Engineering and Scientific Computing-Rheinwerk Publishing (2024)
515 pages
Data Science and AI Master's Program (With Unlimited Interview Calls)
No ratings yet
Data Science and AI Master's Program (With Unlimited Interview Calls)
52 pages
5.2 Data Visualization II
No ratings yet
5.2 Data Visualization II
39 pages
Group 8 - Mini Project Report
No ratings yet
Group 8 - Mini Project Report
16 pages
DSP Lab Manual: January 22, 2015
No ratings yet
DSP Lab Manual: January 22, 2015
20 pages
Lab Manual - NNDL
No ratings yet
Lab Manual - NNDL
63 pages
Logistic Regression For Malignancy Prediction in Cancer - by Luca Zammataro - Towards Data Science
No ratings yet
Logistic Regression For Malignancy Prediction in Cancer - by Luca Zammataro - Towards Data Science
32 pages
Deep Learning A Comprehensive Guide 1st Edition Vasudevan
No ratings yet
Deep Learning A Comprehensive Guide 1st Edition Vasudevan
60 pages
Predictive Modelling of Crime Dataset Using Data Mining
No ratings yet
Predictive Modelling of Crime Dataset Using Data Mining
16 pages
04 Proyek PDB
No ratings yet
04 Proyek PDB
39 pages
Data Analysis Using Python Day - 1 To Day - 4
No ratings yet
Data Analysis Using Python Day - 1 To Day - 4
30 pages
Numpy Lib
No ratings yet
Numpy Lib
19 pages
Machine Learning Notes
No ratings yet
Machine Learning Notes
21 pages
Info4604 Final Kylegragnola 1
No ratings yet
Info4604 Final Kylegragnola 1
20 pages
Neel Koshiya J
No ratings yet
Neel Koshiya J
18 pages
Ebook Python Interview Guide
No ratings yet
Ebook Python Interview Guide
15 pages
Data Visualization
No ratings yet
Data Visualization
18 pages
How Do I Extract Data From MATLAB Figures - MATLAB Answers - MATLAB Central
No ratings yet
How Do I Extract Data From MATLAB Figures - MATLAB Answers - MATLAB Central
6 pages
AI Lab Record For Class X
No ratings yet
AI Lab Record For Class X
11 pages
The Data Engineers Guide To Python For Snowflake
No ratings yet
The Data Engineers Guide To Python For Snowflake
15 pages
DAL Oral Question Bank
No ratings yet
DAL Oral Question Bank
7 pages
Ip-Class-Xii-Sample Question Paper
No ratings yet
Ip-Class-Xii-Sample Question Paper
12 pages
SciPy - Curve Fitting - GeeksforGeeks
No ratings yet
SciPy - Curve Fitting - GeeksforGeeks
11 pages
Pandas Basics
No ratings yet
Pandas Basics
21 pages
Computer Science
No ratings yet
Computer Science
25 pages
Lab06 - Questions: 1 CS 115 - Introduction To Programming in Python
No ratings yet
Lab06 - Questions: 1 CS 115 - Introduction To Programming in Python
2 pages
Ai Syllabus.
No ratings yet
Ai Syllabus.
23 pages
Class 11 Ip
No ratings yet
Class 11 Ip
2 pages
Machine Learning Assignment List
No ratings yet
Machine Learning Assignment List
2 pages
Chapter 2 Data Handling Using Pandas - I (DATA FRAME)
No ratings yet
Chapter 2 Data Handling Using Pandas - I (DATA FRAME)
15 pages
Cheat Sheet
No ratings yet
Cheat Sheet
10 pages
Pandas DataFrame1
No ratings yet
Pandas DataFrame1
22 pages
Name: Yandrapu Manoj Naidu Roll No: 20MDT1017: Choose Files
No ratings yet
Name: Yandrapu Manoj Naidu Roll No: 20MDT1017: Choose Files
7 pages
Pandas DataFrame
No ratings yet
Pandas DataFrame
70 pages
Python Pandas-Data Frames
No ratings yet
Python Pandas-Data Frames
41 pages
Cheat Sheet: The Pandas Dataframe Object: Preliminaries Get Your Data Into A Dataframe
100% (1)
Cheat Sheet: The Pandas Dataframe Object: Preliminaries Get Your Data Into A Dataframe
10 pages
Pandas - Ipynb - Colab
No ratings yet
Pandas - Ipynb - Colab
8 pages
Lecture 9 Pandas
No ratings yet
Lecture 9 Pandas
176 pages
DataFrame Ac Win Final
No ratings yet
DataFrame Ac Win Final
30 pages
Data Handing Using Pandas-I
100% (2)
Data Handing Using Pandas-I
46 pages
Python 3rd Unit Question and Answer
No ratings yet
Python 3rd Unit Question and Answer
25 pages
Pandas 1
No ratings yet
Pandas 1
89 pages
Pandas DataFrame Notes
100% (1)
Pandas DataFrame Notes
10 pages
Data Handling Using Pandas-1
No ratings yet
Data Handling Using Pandas-1
60 pages
Unit 4.2
No ratings yet
Unit 4.2
24 pages
Pandas
No ratings yet
Pandas
7 pages
Dataframe Ip
No ratings yet
Dataframe Ip
75 pages
04-Data Manipulation With Pandas
No ratings yet
04-Data Manipulation With Pandas
28 pages
Python Pandas ch-2
No ratings yet
Python Pandas ch-2
56 pages
Pandas
No ratings yet
Pandas
82 pages
SBLC 1
No ratings yet
SBLC 1
23 pages
Lab 9
No ratings yet
Lab 9
9 pages
Pandas
No ratings yet
Pandas
21 pages
P Unit-4 NP
No ratings yet
P Unit-4 NP
30 pages
Python Pandas Module - Introduction-07-11-2023
No ratings yet
Python Pandas Module - Introduction-07-11-2023
84 pages
Unit-4Introduction To Pandas
No ratings yet
Unit-4Introduction To Pandas
44 pages
09 - Pandas Slides
No ratings yet
09 - Pandas Slides
33 pages
Class 12 Practical File
No ratings yet
Class 12 Practical File
29 pages
Unit 2
No ratings yet
Unit 2
81 pages
Cheat Sheet: The Pandas Dataframe Object: Preliminaries Get Your Data Into A Dataframe
100% (1)
Cheat Sheet: The Pandas Dataframe Object: Preliminaries Get Your Data Into A Dataframe
12 pages
P03 Introduction To Pandas Ans
No ratings yet
P03 Introduction To Pandas Ans
45 pages
Loki Temp PPT Pandas 2
No ratings yet
Loki Temp PPT Pandas 2
31 pages
Pandas Cheat Sheet........
No ratings yet
Pandas Cheat Sheet........
11 pages
Ip Study
No ratings yet
Ip Study
18 pages
Exp3 Python
No ratings yet
Exp3 Python
15 pages
Python UnitIV
No ratings yet
Python UnitIV
20 pages
Pandas DataFrame Notes
No ratings yet
Pandas DataFrame Notes
10 pages
Pandas 1705297450
No ratings yet
Pandas 1705297450
21 pages
Pandas Library
No ratings yet
Pandas Library
5 pages
Pandas & Numpy
No ratings yet
Pandas & Numpy
32 pages
Python Data Frame New
No ratings yet
Python Data Frame New
32 pages
The Pandas Library
No ratings yet
The Pandas Library
39 pages
14 Pandas
No ratings yet
14 Pandas
25 pages
Lab-3 Pandas Library
No ratings yet
Lab-3 Pandas Library
14 pages
Pandas
No ratings yet
Pandas
27 pages
Pandas DataFrame Notes
No ratings yet
Pandas DataFrame Notes
13 pages
Unit 4
No ratings yet
Unit 4
36 pages
Python Pandas New Sylabus
No ratings yet
Python Pandas New Sylabus
53 pages
Pandas
No ratings yet
Pandas
16 pages
Class Notes: Class: XII Date: 7-Apr-2020 Subject: Informatics Practices Topic: 2. Python Pandas
No ratings yet
Class Notes: Class: XII Date: 7-Apr-2020 Subject: Informatics Practices Topic: 2. Python Pandas
4 pages
Pandas Dataframe Export The CSV File
No ratings yet
Pandas Dataframe Export The CSV File
9 pages
Pandas Handbook
No ratings yet
Pandas Handbook
33 pages
Pandas DataFrameObject
No ratings yet
Pandas DataFrameObject
4 pages
Pandas
No ratings yet
Pandas
5 pages
10 Lessons in Front-end
From Everand
10 Lessons in Front-end
Krasimir Tsonev
2/5 (1)
Pandas
No ratings yet
Pandas
8 pages
Block 1-Data Handling Using Pandas DataFrame
No ratings yet
Block 1-Data Handling Using Pandas DataFrame
17 pages
Getting Started with SAS Programming: Using SAS Studio in the Cloud
From Everand
Getting Started with SAS Programming: Using SAS Studio in the Cloud
Ron Cody
No ratings yet
Mastering Pandas in Python: Course Book
From Everand
Mastering Pandas in Python: Course Book
Pedro Martins
No ratings yet
TensorFlow深度学习项目实战: Chinese Edition
From Everand
TensorFlow深度学习项目实战: Chinese Edition
Posts & Telecom Press
No ratings yet