Introduction To Pandas
Pandas
pandas is a Python library for data analysis. It offers a number of data exploration, cleaning, and
transformation operations that are critical when working with data in Python.
pandas builds upon NumPy and SciPy, providing easy-to-use data structures and data manipulation
functions with integrated indexing.
The main data structures pandas provides are Series and DataFrames. After a brief introduction to
these two data structures and data ingestion, the key features of pandas this notebook covers are:
Import Libraries
In [2]: import pandas as pd
pandas Series
A pandas Series is a one-dimensional labeled array.
In [3]: ser = pd.Series([100, 'foo', 300, 'bar', 500], ['paul', 'bob', 'wale', 'dan', 'bo
In [4]: ser
In [5]: ser.index
In [6]: ser.loc[['wale','bob']]
In [8]: ser.iloc[2]
Out[8]: 300
Out[9]: True
In [10]: ser
In [11]: ser * 2
Out[11]: paul 200
bob foofoo
wale 600
dan barbar
bola 1000
dtype: object
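The Series operations above can be gathered into one short sketch. The last index label is truncated in the cell above; 'bola' is taken from the printed output and should be read as an assumption, as should the variable names introduced here.

```python
import pandas as pd

# Values and labels mirror the Series above; the final label 'bola' is an
# assumption read off the printed output, since the input cell is truncated.
ser = pd.Series([100, 'foo', 300, 'bar', 500],
                index=['paul', 'bob', 'wale', 'dan', 'bola'])

by_label = ser.loc[['wale', 'bob']]   # label-based selection, order preserved
by_position = ser.iloc[2]             # position-based selection
has_bob = 'bob' in ser                # membership checks the index labels
doubled = ser * 2                     # element-wise: ints double, strings repeat
```

Because the Series holds mixed types, its dtype is object and multiplication is applied per element, which is why 'foo' becomes 'foofoo'.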
pandas DataFrame
A pandas DataFrame is a 2-dimensional labeled data structure.
In [14]: df = pd.DataFrame(d)
print(df)
one two
apple 100.0 111.0
ball 200.0 222.0
cerill NaN 333.0
clock 300.0 NaN
dancy NaN 4444.0
In [15]: df.index
Out[15]: Index(['apple', 'ball', 'cerill', 'clock', 'dancy'], dtype='object')
In [16]: df.columns
Out[16]: Index(['one', 'two'], dtype='object')
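The cell that defines d is not shown. A minimal sketch, assuming d is a dict of Series with partly overlapping indexes (the values below are chosen to reproduce the printed DataFrame and are a hypothetical reconstruction):

```python
import pandas as pd

# Hypothetical reconstruction of `d`, chosen to match the printed output.
d = {
    'one': pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
    'two': pd.Series([111., 222., 333., 4444.],
                     index=['apple', 'ball', 'cerill', 'dancy']),
}

# The row index is the union of both Series indexes; labels missing from
# one Series are filled with NaN in that column.
df = pd.DataFrame(d)

# Passing `index` selects (and orders) a subset of the rows.
subset = pd.DataFrame(d, index=['dancy', 'ball', 'apple'])
```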
Out[17]:
one two
In [19]: data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]
In [20]: pd.DataFrame(data)
Out[20]:
alex alice dora ema joe
Out[21]:
alex alice dora ema joe
Out[22]:
joe dora alice
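A DataFrame can also be built from a list of dicts, as the cells above suggest. In this sketch the row labels 'orange' and 'red' are hypothetical; the column subset matches the last header shown above. Column order for the unrestricted case may differ between pandas versions (sorted in older releases, insertion order in newer ones).

```python
import pandas as pd

data = [{'alex': 1, 'joe': 2}, {'ema': 5, 'dora': 10, 'alice': 20}]

# One row per dict; keys missing from a dict become NaN in that row.
df_all = pd.DataFrame(data)

# Hypothetical row labels supplied through `index`.
df_labeled = pd.DataFrame(data, index=['orange', 'red'])

# `columns` keeps only the named keys, in the given order.
df_subset = pd.DataFrame(data, columns=['joe', 'dora', 'alice'])
```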
In [23]: df
Out[23]:
one two
In [24]: df['one']
In [28]: three
Out[28]: apple 11100.0
ball 44400.0
cerill NaN
clock NaN
dancy NaN
Name: three, dtype: float64
In [29]: df
Out[29]:
one two flag
In [31]: df
Out[31]:
one flag
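The cells that add and remove columns are not shown, so the following is only a sketch consistent with the outputs above: 'three' equals 'one' times 'two' (100 * 111 = 11100), and it is later removed and held in a variable. The definition of d and the 'flag' condition (one > 250) are assumptions.

```python
import pandas as pd

# Hypothetical `d`, matching the DataFrame printed earlier.
d = {
    'one': pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
    'two': pd.Series([111., 222., 333., 4444.],
                     index=['apple', 'ball', 'cerill', 'dancy']),
}
df = pd.DataFrame(d)

df['three'] = df['one'] * df['two']   # new column from existing ones
df['flag'] = df['one'] > 250          # assumed condition for the boolean column

three = df.pop('three')               # removes the column and returns it
del df['two']                         # removes the column in place
```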
Using the read_csv function in pandas, we will ingest these three files.
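As a minimal sketch of read_csv, the example below reads a tiny in-memory stand-in rather than the real files, so it runs without the dataset. The file names in the trailing comments are assumptions; the actual paths are not given here.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a file on disk.
sample_csv = io.StringIO(
    "userId,movieId,tag\n"
    "2,60756,funny\n"
    "2,89774,MMA\n"
)
tags_sample = pd.read_csv(sample_csv, sep=',')

# With the real files available (hypothetical names), the calls would be:
# movies = pd.read_csv('movies.csv', sep=',')
# tags = pd.read_csv('tags.csv', sep=',')
# ratings = pd.read_csv('ratings.csv', sep=',')
```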
Out[34]:
movieId title genres
userId movieId tag
0 2 60756 funny
4 2 89774 MMA
userId movieId rating
0 1 1 4.0
1 1 3 4.0
2 1 6 4.0
3 1 47 5.0
4 1 50 5.0
5 1 70 3.0
6 1 101 5.0
7 1 110 4.0
8 1 151 5.0
9 1 157 5.0
10 1 163 5.0
11 1 216 5.0
12 1 223 3.0
13 1 231 5.0
14 1 235 4.0
15 1 260 5.0
16 1 296 3.0
17 1 316 3.0
18 1 333 5.0
19 1 349 4.0
Data Structures
Series
In [37]: tags.head()
Out[37]:
userId movieId tag
0 2 60756 funny
4 2 89774 MMA
In [38]: row_0 = tags.iloc[0]
type(row_0)
Out[38]: pandas.core.series.Series
In [39]: print(row_0)
userId 2
movieId 60756
tag funny
Name: 0, dtype: object
In [40]: row_0.index
In [41]: row_0['userId']
Out[41]: 2
Out[42]: False
In [43]: row_0.name
Out[43]: 0
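The row-as-Series behaviour above can be reproduced on a small stand-in frame. The two rows are copied from the tags output shown above; the name tags_sample and the membership check on 'rating' are assumptions for illustration.

```python
import pandas as pd

# Two rows copied from the tags output shown above.
tags_sample = pd.DataFrame({
    'userId': [2, 2],
    'movieId': [60756, 89774],
    'tag': ['funny', 'MMA'],
})

row_0 = tags_sample.iloc[0]        # a single row comes back as a Series
row_labels = list(row_0.index)     # its index holds the column names
user_id = row_0['userId']          # values are looked up by column name
has_rating = 'rating' in row_0     # membership tests the index, not the values
row_name = row_0.name              # the Series name is the original row label
```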
DataFrames
In [45]: tags.head()
Out[45]:
userId movieId tag
0 2 60756 funny
4 2 89774 MMA
In [46]: tags.index
Out[46]: RangeIndex(start=0, stop=3683, step=1)
In [47]: tags.columns
In [48]: tags.iloc[ [0,11,2000] ]
Out[48]:
userId movieId tag
0 2 60756 funny
11 18 431 gangster
Descriptive Statistics
Let's look at how the ratings are distributed!
In [49]: ratings['rating'].describe()
In [50]: ratings.describe()
Out[50]:
userId movieId rating
In [51]: ratings['rating'].mean()
Out[51]: 3.501556983616962
In [52]: ratings.mean()
In [53]: ratings['rating'].min()
Out[53]: 0.5
In [54]: ratings['rating'].max()
Out[54]: 5.0
In [55]: ratings['rating'].std()
Out[55]: 1.0425292390605359
In [56]: ratings['rating'].mode()
Out[56]: 0 4.0
dtype: float64
In [57]: ratings.corr()
Out[57]:
userId movieId rating
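The summary statistics used above can be tried on a small, made-up ratings frame. The five rows below are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Illustrative ratings; not real data from the files.
ratings_sample = pd.DataFrame({
    'userId': [1, 1, 1, 2, 2],
    'movieId': [1, 3, 6, 1, 3],
    'rating': [4.0, 4.0, 5.0, 3.0, 4.0],
})

summary = ratings_sample['rating'].describe()   # count, mean, std, min, quartiles, max
mean_rating = ratings_sample['rating'].mean()
modes = ratings_sample['rating'].mode()         # a Series, since ties are possible
corr = ratings_sample.corr()                    # pairwise correlation of numeric columns
```

Note that mode() returns a Series rather than a scalar, which is why the output above shows an index of 0 next to the value.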
Out[60]: (9742, 3)
In [61]: movies.isnull().any()
Out[61]: movieId False
title False
genres False
dtype: bool
In [62]: ratings.shape
Out[62]: (100836, 3)
In [63]: ratings.isnull().any()
Out[63]: userId False
movieId False
rating False
dtype: bool
In [64]: tags.shape
Out[64]: (3683, 3)
In [65]: tags.isnull().any()
Out[65]: userId False
movieId False
tag False
dtype: bool
In [67]: tags.isnull().any()
Out[67]: userId False
movieId False
tag False
dtype: bool
In [68]: tags.shape
Out[68]: (3683, 3)
That's nice! No NULL values! Note that the number of rows would drop if any rows with missing values had been removed; here the shape stays at (3683, 3).
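To see what a non-clean case looks like, here is a small sketch with one deliberately missing tag. The frame is made up for illustration, and dropna() is one common way to remove such rows; whether it was the step used here is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing tag to show the effect of dropna().
tags_sample = pd.DataFrame({
    'userId': [2, 2, 18],
    'movieId': [60756, 89774, 431],
    'tag': ['funny', np.nan, 'gangster'],
})

has_nulls = tags_sample.isnull().any()   # per column: does it contain any NaN?
cleaned = tags_sample.dropna()           # drops every row that has a NaN
```

In this sketch the row count falls from 3 to 2 after dropna(), and isnull().any() on the cleaned frame reports False for every column.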
Data Visualization
In [69]: ratings.hist(column='rating', figsize=(15,10))
Out[69]: array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000096CE153CC0>]],
dtype=object)
In [71]: tags['tag'].head()
Out[71]: 0 funny
1 Highly quotable
2 will ferrell
3 Boxing story
4 MMA
Name: tag, dtype: object
In [72]: tags.head()
Out[72]:
userId movieId tag
0 2 60756 funny
4 2 89774 MMA
In [73]: movies[['title','genres']].head()
Out[73]:
title genres
In [74]: ratings[-10:]
Out[74]:
userId movieId rating
In [75]: tags["tag"].head(15)
Out[75]: 0 funny
1 Highly quotable
2 will ferrell
3 Boxing story
4 MMA
5 Tom Hardy
6 drugs
7 Leonardo DiCaprio
8 Martin Scorsese
9 way too long
10 Al Pacino
11 gangster
12 mafia
13 Al Pacino
14 Mafia
Name: tag, dtype: object
In [78]: ratings[is_highly_rated][30:50]
Out[78]:
userId movieId rating
36 1 608 5.0
38 1 661 5.0
40 1 733 4.0
43 1 804 4.0
44 1 919 5.0
45 1 923 5.0
46 1 940 5.0
47 1 943 4.0
48 1 954 5.0
50 1 1023 5.0
51 1 1024 5.0
52 1 1025 5.0
53 1 1029 5.0
55 1 1031 5.0
56 1 1032 5.0
57 1 1042 4.0
58 1 1049 5.0
59 1 1060 4.0
60 1 1073 5.0
61 1 1080 5.0
In [79]: movies[is_animation][5:15]
Out[79]:
movieId title genres
In [80]: movies[is_animation].head(15)
Out[80]:
movieId title genres
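The cells defining is_highly_rated and is_animation are not shown. A minimal sketch of how such boolean filters could be built, assuming a threshold of 4.0 (the filtered rows above show only ratings of 4.0 and 5.0) and a substring match on the genres column; the titles are placeholders.

```python
import pandas as pd

ratings_sample = pd.DataFrame({
    'userId': [1, 1, 1, 1],
    'movieId': [1, 3, 70, 223],
    'rating': [4.0, 4.0, 3.0, 3.0],
})
movies_sample = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Movie A', 'Movie B', 'Movie C'],   # placeholder titles
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance'],
})

# A comparison on a column yields a boolean Series aligned with the rows.
is_highly_rated = ratings_sample['rating'] >= 4.0            # assumed threshold
is_animation = movies_sample['genres'].str.contains('Animation')

# Indexing with the boolean Series keeps only the rows marked True.
highly_rated = ratings_sample[is_highly_rated]
animation = movies_sample[is_animation]
```

Slicing the filtered frame, as in [30:50] or [5:15] above, is positional and applies after the filter.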
rating
0.5 1370
1.0 2811
1.5 1791
2.0 7551
2.5 5550
3.0 20047
3.5 13136
4.0 26818
4.5 8551
5.0 13211
movieId
1 3.920930
2 3.431818
3 3.259615
4 2.357143
5 3.071429
Out[83]:
rating
movieId
1 215
2 110
3 52
4 7
5 49
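The grouped outputs above (counts per rating value, mean rating per movie, and number of ratings per movie) are consistent with groupby followed by an aggregation. The sketch below uses made-up rows, and the variable names are assumptions.

```python
import pandas as pd

# Made-up ratings for illustration.
ratings_sample = pd.DataFrame({
    'movieId': [1, 1, 2, 2, 2],
    'rating': [4.0, 5.0, 3.0, 3.0, 4.0],
})

# How many ratings were given at each rating value.
ratings_count = ratings_sample[['movieId', 'rating']].groupby('rating').count()

# Mean rating per movie, and number of ratings per movie.
average_rating = ratings_sample[['movieId', 'rating']].groupby('movieId').mean()
movie_count = ratings_sample[['movieId', 'rating']].groupby('movieId').count()
```

In each result the grouping column becomes the index, which matches the layout of the outputs above.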
Merge Dataframes
5/9/2019 Introduction to Pandas
In [84]: tags.head()
Out[84]:
userId movieId tag
0 2 60756 funny
4 2 89774 MMA
In [85]: movies.head()
Out[85]:
movieId title genres
movieId rating
0 1 3.920930
1 2 3.431818
2 3 3.259615
3 4 2.357143
4 5 3.071429
In [88]: movies.head()
Out[88]:
movieId title genres
Out[89]:
movieId title genres rating
In [90]: box_office[is_highly_rated][-5:]
Out[90]:
movieId title genres rating
9717 193573 Love Live! The School Idol Movie (2015) Animation 4.0
9719 193581 Black Butler: Book of the Atlantic (2017) Action|Animation|Comedy|Fantasy 4.0
9723 193609 Andrew Dice Clay: Dice Rules (1991) Comedy 4.0
In [91]: box_office[is_comedy][:5]
Out[91]:
movieId title genres rating
Out[92]:
movieId title genres rating
9708 190209 Jeff Ross Roasts the Border (2017) Comedy 4.0
9719 193581 Black Butler: Book of the Atlantic (2017) Action|Animation|Comedy|Fantasy 4.0
9723 193609 Andrew Dice Clay: Dice Rules (1991) Comedy 4.0
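The merge cell itself is not shown. A minimal sketch, assuming box_office comes from an inner merge of movies with per-movie average ratings on movieId, and that the last output combines the comedy and highly-rated filters with &. Titles, ratings, and the 4.0 threshold are illustrative assumptions.

```python
import pandas as pd

movies_sample = pd.DataFrame({
    'movieId': [1, 2, 3, 4],
    'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],   # placeholder titles
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance',
               'Comedy|Drama|Romance'],
})
avg_ratings = pd.DataFrame({
    'movieId': [1, 2, 3],
    'rating': [4.2, 4.5, 3.1],   # illustrative averages
})

# Inner merge keeps only movies present in both frames (movie 4 is dropped).
box_office = movies_sample.merge(avg_ratings, on='movieId', how='inner')

is_highly_rated = box_office['rating'] >= 4.0        # assumed threshold
is_comedy = box_office['genres'].str.contains('Comedy')

# Boolean Series combine element-wise with & (and), | (or), ~ (not).
highly_rated_comedies = box_office[is_comedy & is_highly_rated]
```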
Out[93]:
movieId title genres
In [95]: movie_genres[:10]
Out[95]:
0 1 2 3 4 5 6 7 8 9
0 Adventure Animation Children Comedy Fantasy None None None None None
1 Adventure Children Fantasy None None None None None None None
2 Comedy Romance None None None None None None None None
3 Comedy Drama Romance None None None None None None None
4 Comedy None None None None None None None None None
5 Action Crime Thriller None None None None None None None
6 Comedy Romance None None None None None None None None
7 Adventure Children None None None None None None None None
8 Action None None None None None None None None None
9 Action Adventure Thriller None None None None None None None
In [97]: movie_genres[:10]
Out[97]:
0 1 2 3 4 5 6 7 8 9 isComedy
0 Adventure Animation Children Comedy Fantasy None None None None None True
1 Adventure Children Fantasy None None None None None None None False
2 Comedy Romance None None None None None None None None True
3 Comedy Drama Romance None None None None None None None True
4 Comedy None None None None None None None None None True
5 Action Crime Thriller None None None None None None None False
6 Comedy Romance None None None None None None None None True
7 Adventure Children None None None None None None None None False
8 Action None None None None None None None None None False
9 Action Adventure Thriller None None None None None None None False
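The wide genre table above looks like the result of splitting the pipe-separated genres string into columns. A sketch under that assumption, using the first three genre strings from the output; the isComedy column is assumed to come from a substring test on the original column.

```python
import pandas as pd

movies_sample = pd.DataFrame({
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance'],
})

# expand=True spreads the pieces across columns 0, 1, 2, ...;
# shorter rows are padded with missing values.
movie_genres = movies_sample['genres'].str.split('|', expand=True)

# Boolean column marking rows whose genres string mentions Comedy.
movie_genres['isComedy'] = movies_sample['genres'].str.contains('Comedy')
```

The number of columns equals the largest number of genres on any single row, which is why the full dataset above produces ten of them.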
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-98-ac8ed1f9139a> in <module>()
----> 1 movies['genres'].contains() == 'comedy'
~\Anaconda3\anaconda\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
3079 if name in self._info_axis:
3080 return self[name]
-> 3081 return object.__getattribute__(self, name)
3082
3083 def __setattr__(self, name, value):
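The traceback above comes from calling contains() directly on a Series, which has no such method; string methods live under the .str accessor and take the pattern as an argument. A corrected sketch follows (the sample frame is an assumption; note the match is case-sensitive unless case=False is passed).

```python
import pandas as pd

movies_sample = pd.DataFrame({
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance'],
})

# Correct form: go through .str and pass the pattern to contains().
is_comedy = movies_sample['genres'].str.contains('Comedy')

# Lowercase 'comedy' only matches when case is ignored.
is_comedy_any_case = movies_sample['genres'].str.contains('comedy', case=False)
no_match = movies_sample['genres'].str.contains('comedy')
```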
In [99]: movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
movieId 9742 non-null int64
title 9742 non-null object
genres 9742 non-null object
dtypes: int64(1), object(2)
memory usage: 228.4+ KB