Funda 3

Foundations of data science
4.5 FANCY INDEXING

With NumPy array fancy indexing, an array can be indexed with another NumPy array, a Python list, or a sequence of integers, whose values select elements in the indexed array. Fancy indexing is like simple indexing, except that arrays of indices are passed in place of single scalars. This allows us to very quickly access and modify complicated subsets of an array's values. When using fancy indexing, the shape of the result reflects the shape of the index arrays, not the shape of the array being indexed:

    # Array as array index
    import numpy as np
    a = np.arange(1, 10)
    print(a)
    indices = np.array([4, 5, 6])
    print(a[indices])

OUTPUT:
    [1 2 3 4 5 6 7 8 9]
    [5 6 7]

Fancy indexing on multiple dimensions

    X = np.arange(12).reshape((3, 4))
    print(X)
    row = np.array([0, 1, 2])
    col = np.array([2, 1, 3])
    X[row, col]

OUTPUT:
    [[ 0  1  2  3]
     [ 4  5  6  7]
     [ 8  9 10 11]]
    array([ 2,  5, 11])

In the above example, the first value is X[0, 2], the second is X[1, 1], and the third is X[2, 3]. The pairing of indices in fancy indexing follows all the broadcasting rules. Note that with fancy indexing, the return value reflects the broadcast shape of the indices, rather than the shape of the array being indexed.

4.5.1 Combined Indexing

Fancy indexing can be combined with the other indexing schemes we have seen.

Example

    import numpy as np
    X = np.arange(12).reshape((3, 4))
    print("X:", X)
    X[2, [2, 0, 1]]

OUTPUT:
    X: [[ 0  1  2  3]
     [ 4  5  6  7]
     [ 8  9 10 11]]
    array([10,  8,  9])

Example: Selecting Random Points

Fancy indexing is used to select subsets of rows from a matrix. Consider an N by D matrix representing N points in D dimensions, such as the following points drawn from a two-dimensional normal distribution. Let's use fancy indexing to select 10 random points. We'll do this by first choosing 10 random indices with no repeats, and then using these indices to select a portion of the original array.

    import numpy as np
    rand = np.random.RandomState(42)
    mean = [0, 0]
    cov = [[1, 2], [2, 5]]
    X = rand.multivariate_normal(mean, cov, 100)
    print("X shape:", X.shape)
    # np.random is not seeded here, so the indices differ from run to run
    indices = np.random.choice(X.shape[0], 10, replace=False)
    print("Indices:", indices)
    selection = X[indices]  # fancy indexing here
    selection.shape

OUTPUT:
    X shape: (100, 2)
    Indices: [44 50 45 82 95 79 64 76 10 99]
    (10, 2)

This strategy is used to quickly partition datasets, as is often needed in train/test splitting for validation of statistical models and in sampling approaches to answering statistical questions.

4.5.2 Modifying Values with Fancy Indexing

Fancy indexing can be used to access parts of an array and also to modify parts of an array. For example, imagine we have an array of indices and we would like to set the corresponding items in an array to some value. Any assignment-type operator can be used for this.

    # Update the selected elements of an array
    x = np.arange(10)
    i = np.array([2, 1, 8, 4])  # indices of elements to be updated
    x[i] = 1000                 # new value
    print(x)

OUTPUT:
    [   0 1000 1000    3 1000    5    6    7 1000    9]

at() method of ufuncs

The at() method does an in-place application of the given operator at the specified indices (here, i) with the specified value (here, 10). The operation is repeated at every listed index, so an index that appears more than once is updated more than once.

    # Repeated operation
    x = np.zeros(10)
    i = np.array([2, 1, 8, 4])
    np.add.at(x, i, 10)  # 10 is added to all the elements indicated by index i
    print(x)

OUTPUT:
    [ 0. 10. 10.  0. 10.  0.  0.  0. 10.  0.]

4.5.3 Sorting Arrays

• Python has built-in sort and sorted functions to work with lists.
• NumPy's np.sort function is much more efficient and useful.
• By default np.sort uses an O[N log N] quicksort algorithm.
• mergesort and heapsort are also supported.
• The default quicksort is sufficient for most applications.

Sorting a NumPy array

    x = np.array([25, 1, 14, 3, 58])
    np.sort(x)

OUTPUT:
    array([ 1,  3, 14, 25, 58])

    # to sort the array in-place
    x.sort()
    print(x)

OUTPUT:
    [ 1  3 14 25 58]

    # argsort: returns the indices of the sorted elements
    x = np.array([2, 1, 4, 3, 5])
    i = np.argsort(x)
    print(i)

OUTPUT:
    [1 0 3 2 4]

The first element of this result gives the index of the smallest element, the second value gives the index of the second smallest, and so on. These indices can be used (via fancy indexing) to construct the sorted array.

Sorting along rows or columns

A useful feature of NumPy's sorting algorithms is the ability to sort along specific rows or columns of a multidimensional array using the axis argument. This treats each row or column as an independent array, and any relationships between the row or column values will be lost.

    # Sort an array with respect to specific rows and columns
    import numpy as np
    rand = np.random.RandomState(42)
    X = rand.randint(0, 10, (3, 4))
    # sort each column of X
    Y = np.sort(X, axis=0)
    # sort each row of X
    Z = np.sort(X, axis=1)
    print("Original array:\n", X)
    print("Sorted column\n", Y)
    print("Sorted Rows\n", Z)

OUTPUT:
    Original array:
     [[6 3 7 4]
     [6 9 2 6]
     [7 4 3 7]]
    Sorted column
     [[6 3 2 4]
     [6 4 3 6]
     [7 9 7 7]]
    Sorted Rows
     [[3 4 6 7]
     [2 6 6 9]
     [3 4 7 7]]

Partial Sorts

NumPy's np.partition function finds the K smallest values in an array. It takes an array and a number K and gives a new array with the smallest K values to the left of the partition, and the remaining values to the right, in arbitrary order.

    # Python program explaining the partition() function
    import numpy as np
    # input array
    in_arr = np.array([12, 30, 1, 25, 49, 9])
    print("Input array : ", in_arr)
    out_arr = np.partition(in_arr, 3)
    print("Output partitioned array : ", out_arr)

OUTPUT:
    Input array :  [12 30  1 25 49  9]
    Output partitioned array :  [ 1  9 12 25 49 30]

Similarly, a multidimensional array can be partitioned along an arbitrary axis. In the example below, the result is an array where the first two slots in each row contain the smallest values from that row, with the remaining values filling the remaining slots.

    import numpy as np
    rand = np.random.RandomState(42)
    X = rand.randint(0, 10, (3, 4))
    print("X:", X)
    Y = np.partition(X, 2, axis=1)
    print("Y:", Y)

OUTPUT:
    X: [[6 3 7 4]
     [6 9 2 6]
     [7 4 3 7]]
    Y: [[3 4 6 7]
     [2 6 6 9]
     [3 4 7 7]]

4.6 STRUCTURED ARRAYS

NumPy's structured arrays and record arrays provide efficient storage for compound, heterogeneous data. A NumPy structured array is used for grouping data of different types and sizes. It uses data containers called fields. Each field can contain data of a different type and size. Fields are accessed by name, for example data['name']; record arrays (np.recarray) additionally allow dot notation, for example data.name.

Properties of a structured array
• All structs in the array have the same number of fields.
• All structs have the same field names.

Create a structured array using a compound data type specification

    import numpy as np
    # Use a compound data type for structured arrays
    data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                              'formats': ('U10', 'i4', 'f8')})
    print("Data Type:", data.dtype)
    # store values in three separate lists
    name = ['Alice', 'Bob', 'Cathy', 'Doug']
    age = [25, 45, 37, 19]
    weight = [55.0, 85.5, 68.0, 61.5]
    # fill the array with the lists of values
    data['name'] = name
    data['age'] = age
    data['weight'] = weight
    print("Data:", data)
    # Get all names
    print("Names:", data['name'])
    # Get first row of data
    print("Data[0]:", data[0])  # the elements of the 0th record
    print("Sorting according to the age:", np.sort(data, order='age'))

OUTPUT:
    Data Type: [('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]
    Data: [('Alice', 25, 55. ) ('Bob', 45, 85.5) ('Cathy', 37, 68. ) ('Doug', 19, 61.5)]
    Names: ['Alice' 'Bob' 'Cathy' 'Doug']
    Data[0]: ('Alice', 25, 55.)
    Sorting according to the age: [('Doug', 19, 61.5) ('Alice', 25, 55. ) ('Cathy', 37, 68. ) ('Bob', 45, 85.5)]

Using Boolean masking to filter on age

    # Get names where age is under 30
    data[data['age'] < 30]['name']

OUTPUT:
    array(['Alice', 'Doug'], dtype='<U10')

Data is arranged together in one convenient block of memory.
• 'U10' translates to "Unicode string of maximum length 10"
• 'i4' translates to "4-byte (i.e., 32 bit) integer"
• 'f8' translates to "8-byte (i.e., 64 bit) float"

Structured array data types can be specified in a number of ways.

(a) Dictionary method

    >>> np.dtype({'names': ('name', 'age', 'weight'),
    ...           'formats': ('U10', 'i4', 'f8')})

OUTPUT:
    dtype([('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])

(b) List of tuples

    >>> np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])

OUTPUT:
    dtype([('name', 'S10'), ('age', '<i4'), ('weight', '<f8')])

(c) Comma-separated string

    >>> np.dtype('S10,i4,f8')

OUTPUT:
    dtype([('f0', 'S10'), ('f1', '<i4'), ('f2', '<f8')])

[…] and/or missing data.

• Offers a convenient storage interface for labeled data
• Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs
• Group-by operations are easy to perform
• Datasets are mutable with pandas, which allows new rows and columns to be added
• Easy to handle missing data
• Merge and join datasets
• Indexing and subsetting data

NumPy's ndarray data structure provides the essential features for the type of clean, well-organized data typically seen in numerical computing tasks. NumPy lacks flexibility (e.g., attaching labels to data, working with missing data, etc.) and is not well suited to operations that do not map onto element-wise broadcasting (e.g., groupings, pivots, etc.). Analyzing the less structured data available in many forms in the world around us is therefore difficult with NumPy alone. Pandas, and in particular its Series and DataFrame objects, builds on the NumPy array structure and provides efficient access to these sorts of "data munging" tasks that consume much of a data scientist's time.

Characteristics | NumPy Array | Pandas DataFrame
Homogeneity | Arrays consist of only homogeneous elements (elements of the same data type) | DataFrames have heterogeneous elements
Mutability | Arrays are mutable | DataFrames are mutable
Access | Array elements can be accessed using integer positions | DataFrames can be accessed using both integer position and index
Flexibility | Arrays do not have the flexibility to deal with dynamic data sequences and mixed data types | DataFrames have that flexibility
Data type | Arrays deal with numerical data | DataFrames deal with tabular data

Table 4.1: NumPy Array vs Pandas DataFrame

Data structure | Mutability | Homogeneity | Accessibility | Others
list | mutable | heterogeneous | integer position | Python built-in data structure
numpy.ndarray | mutable | homogeneous | integer position | higher-performance array calculation
pandas DataFrame | mutable | heterogeneous | integer position or index | …

Table 4.2: Python Li…

Installing and Using Pandas

Installation of Pandas on your system requires NumPy to be installed, and if building the library from source, requires the appropriate tools to compile the C and Cython sources on which Pandas is built.

Check the version of Pandas:

    >>> import pandas
    >>> pandas.__version__

We can import Pandas under the alias pd:

    import pandas as pd

4.7.1 Pandas Objects

Pandas objects are enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels. Pandas provides useful tools, methods, and functionality on top of the basic data structures. There are three fundamental Pandas data structures: the Series, DataFrame, and Index.

Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data. It can hold data types like integer, string, boolean, float, Python object, etc. A Pandas Series can hold only one data type at a time. The axis label of the data is known as the index of the series. The labels need not be unique but must be of a hashable type. The index of the series can be integer, string, and even time-series data. A Pandas Series is nothing but a column of an Excel sheet, with the row index being the index of the series.

Pandas Series() constructor

    pandas.Series(data, index, dtype, name, copy)

Parameters | Remarks
data : array-like, Iterable, dict, or scalar value | Contains data stored in Series
index : array-like or Index (1d) | Values must be hashable and have the same length as data. Non-unique index values are allowed. Will default to RangeIndex (0, 1, 2, ..., n) if not provided. If both a dict and an index sequence are used, the index will override the keys found in the dict.
dtype : str, numpy.dtype, or ExtensionDtype, optional | Data type for the output Series. If not specified, this will be inferred from data.
copy : bool, default False | Copy input data

Creating an empty Pandas Series

    import pandas as pd
    # dtype is given explicitly because the default for an empty Series depends on the pandas version
    empty_series = pd.Series(dtype='float64')
    print(empty_series)

Output:
    Series([], dtype: float64)

4.7.2 Create a Pandas Series from a list

A Pandas Series can be created from a Python list by passing the list to pd.Series(). In this case, pandas will set the default index of the Series:

    import numpy as np
    import pandas as pd
    data = pd.Series([0.25, 0.5, 0.75, 1.0])
    data

OUTPUT:
    0    0.25
    1    0.50
    2    0.75
    3    1.00
    dtype: float64

The Series object wraps both a sequence of values and a sequence of indices, which can be accessed with the values and index attributes. The values are simply a NumPy array. The index is an array-like object of type pd.Index, which makes the Series more general and flexible than the one-dimensional NumPy array.

    >>> data.values
    array([0.25, 0.5 , 0.75, 1.  ])
    >>> data.index
    RangeIndex(start=0, stop=4, step=1)

As with a NumPy array, data can be accessed by the associated index using the Python square-bracket notation:

    >>> data[1]
    0.5
    >>> data[1:3]
    1    0.50
    2    0.75
    dtype: float64

Explicit index definition

The explicit index definition gives the Series object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type. Strings can also be used as an index.

    >>> data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
    >>> data
    a    0.25
    b    0.50
    c    0.75
    d    1.00
    dtype: float64

And the item access works as expected:

    >>> data['c']
    0.75

Non-contiguous or non-sequential indices can also be used:

    >>> data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
    >>> data
    2    0.25
    5    0.50
    3    0.75
    7    1.00
    dtype: float64
And the item access works as expected:

    >>> data[5]
    0.5

Create a Pandas Series from a NumPy array

A Pandas Series can be created from a NumPy array by passing the array to pd.Series() as under:

    import numpy as np
    import pandas as pd
    data = np.array([10, 20, 30, 40, 50])
    a_series = pd.Series(data)
    print(a_series)

OUTPUT:
    0    10
    1    20
    2    30
    3    40
    4    50
    dtype: int64

Create a Pandas Series from a Python Dictionary

• A Pandas Series is like a specialization of a Python dictionary.
• A dictionary is a structure that maps arbitrary keys to a set of arbitrary values.
• A Series is a structure that maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for many operations, the type information of a Pandas Series makes it much more efficient than a Python dictionary for certain operations.

[Figure: a Python dictionary compared with a NumPy array]

A Pandas Series can be created from a dictionary by passing the dictionary to pd.Series(). The index of the Series will be the keys of the dictionary and the values of the Series will be the values of the dictionary.

    import pandas as pd
    data_dict = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}
    series = pd.Series(data_dict)  # create series from dictionary
    print(series)

Output:
    one      1
    two      2
    three    3
    four     4
    five     5
    dtype: int64

Create a Pandas Series from scalar data

The general form of the constructor call is:

    >>> pd.Series(data, index=index)

• Data can be a list or NumPy array, in which case index defaults to an integer sequence:

    >>> pd.Series([20, 40, 60])
    0    20
    1    40
    2    60
    dtype: int64

• Data can be a scalar, which is repeated to fill the specified index:

    >>> pd.Series(5, index=[10, 20, 40])

OUTPUT:
    10    5
    20    5
    40    5
    dtype: int64

• Data can be a dictionary, in which case index defaults to the dictionary keys. Older pandas versions sorted the keys; pandas 0.23 and later keep them in insertion order:

    >>> pd.Series({2: 'a', 1: 'b', 3: 'c'})
    2    a
    1    b
    3    c
    dtype: object

• Index can be explicitly set if a different result is preferred:

    >>> pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])

OUTPUT:
    3    c
    2    a
    dtype: object

Notice that in this case, the Series is populated only with the explicitly identified keys.

Function | Description
Series() | Constructor method to create a pandas Series
combine_first() | Combines two series into one
count() | Returns the number of non-NA/null observations in the Series
size | Attribute that returns the number of elements in the series
name | Attribute that assigns a name to a Series object, like a column name
is_unique | Attribute that is True if the values in the object are unique
idxmax() | Returns the index label of the highest value in a Series
idxmin() | Returns the index label of the lowest value in a Series
sort_values() | Sorts the values of a Series in ascending or descending order
sort_index() | Sorts a Series by its index instead of by its values
head() | Returns a specified number of rows from the beginning of a Series
tail() | Returns a specified number of rows from the end of a Series. The method returns a new Series
clip() | Clips values below and above the given minimum and maximum values
clip_lower() | Clips values below a passed lower value (removed in pandas 1.0; use clip(lower=...))
clip_upper() | Clips values above a passed upper value (removed in pandas 1.0; use clip(upper=...))
astype() | Changes the data type of a Series
tolist() | Converts a Series to a list
get() | Extracts values from a Series. This is an alternative syntax to the traditional bracket syntax
unique() | Returns the unique values in a particular column
factorize() | Gets the numeric representation of an array by identifying distinct values

4.7.3 Pandas DataFrame Object

The Pandas DataFrame is a primary data structure of pandas. It is a two-dimensional mutable array with both flexible row indices and flexible column names. A DataFrame can be considered a generalization of a NumPy array, or a specialization of a Python dictionary. It is similar to an Excel sheet or SQL table.

Different ways of creating a Pandas DataFrame

Pandas DataFrame() constructor

    pandas.DataFrame(data, index, columns, dtype, copy)

A DataFrame can be created from:
• Dict of 1D ndarrays, lists, dicts, or Series
• 2-D numpy.ndarray
• Another DataFrame

The parameters for the constructor of a Pandas DataFrame are detailed as under:

Parameters | Remarks
data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame | Dict can contain Series, arrays, constants, or list-like objects
index : Index or array-like | Index to use for the resulting frame
columns : Index or array-like | Column labels to use for the resulting frame
dtype : dtype, default None | Data type to force. Only a single dtype is allowed
copy : bool, default False | Copy data from inputs. Only affects DataFrame / 2d ndarray input

Empty Pandas DataFrame

Create an empty Pandas DataFrame using pd.DataFrame(). Columns can be supplied with the columns argument of the constructor or added later by assignment (df['name'] = values), and rows can then be appended. Assigning df.columns = [list of column names] to a frame that has no columns raises a length-mismatch error.

    >>> import pandas as pd
    >>> df = pd.DataFrame()
    >>> df
    Empty DataFrame
    Columns: []
    Index: []

Create a Pandas DataFrame from a single Series object

Create a Pandas DataFrame from a single Pandas Series by passing the series to pd.DataFrame(). The index of the series will be the index of the DataFrame, and pandas will automatically set 0 as the column name of the DataFrame when the Series has no name.
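The column-naming rule described above can be checked with a small sketch. The sample values and the variable names (numbers, df_unnamed, df_named) are hypothetical and chosen only for illustration; the sketch assumes a reasonably recent pandas version.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
numbers = pd.Series({"one": 1, "two": 2, "three": 3})

# Passing the Series directly: its index becomes the row index and,
# because the Series has no name, the single column is labelled 0.
df_unnamed = pd.DataFrame(numbers)
print(df_unnamed.columns.tolist())

# A Series that carries a name contributes that name as the column label.
df_named = pd.DataFrame(numbers.rename("Numbers"))
print(df_named.columns.tolist())
```

Naming the Series, or wrapping it in a dictionary whose key is the desired column name, is usually easier to read later than a column labelled 0.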
Example (the Series is passed inside a dictionary here, so the dictionary key 'Numbers' becomes the column name):

    import pandas as pd
    data_dict = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}
    numbers = pd.Series(data_dict)
    df = pd.DataFrame({'Numbers': numbers})
    print(df)

OUTPUT:
           Numbers
    one          1
    two          2
    three        3
    four         4
    five         5

DataFrame from multiple Pandas Series

Create a Pandas DataFrame from multiple Pandas Series by passing a dictionary of the series to pd.DataFrame() as below. The keys of the dictionary become the columns of the DataFrame:

    import pandas as pd
    dict1 = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}
    dict2 = {'one': 'Excellent', 'two': 'Very Good', 'three': 'Good',
             'four': 'Moderate', 'five': 'Satisfactory'}
    numbers1 = pd.Series(dict1)
    numbers2 = pd.Series(dict2)
    df = pd.DataFrame({'Numbers': numbers1, 'Grade': numbers2})
    print(df)

OUTPUT:
           Numbers         Grade
    one          1     Excellent
    two          2     Very Good
    three        3          Good
    four         4      Moderate
    five         5  Satisfactory

A DataFrame can be considered a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.

    >>> df.index    # rows
    Index(['one', 'two', 'three', 'four', 'five'], dtype='object')
    >>> df.columns  # columns
    Index(['Numbers', 'Grade'], dtype='object')

Create a Pandas DataFrame from a list of Python Dictionaries

We can create a Pandas DataFrame from Python dictionaries by passing a list of the dictionaries to pd.DataFrame(). The DataFrame is constructed with columns that are the union of the keys of the dictionaries, and missing values are filled in as NaN (i.e., "not a number").

    import pandas as pd
    df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}, {'a': 1, 'b': 3, 'c': 3}])
    print(df)

OUTPUT:
         a  b    c
    0  1.0  2  NaN
    1  NaN  3  4.0
    2  1.0  3  3.0

Create a Pandas DataFrame from a 2D NumPy array

A Pandas DataFrame can also be created from a two-dimensional NumPy array using the following code:

    import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.random.rand(3, 2))  # default columns and index are set as 0, 1, 2, ...
    print(df)

OUTPUT (values are random and differ between runs):
              0         1
    0  0.720348  0.514865
    1  0.265421  0.543528
    2  0.370571  …

Columns and index can be specified in the DataFrame as below. Here x and y are the column labels and a, b, and c are the index labels:

    df = pd.DataFrame(np.random.rand(3, 2), columns=['x', 'y'], index=['a', 'b', 'c'])
    print(df)

OUTPUT:
              x         y
    a  0.570220  0.141353
    b  0.805120  0.236790
    c  0.148473  0.961625

Create a Pandas DataFrame from a Dictionary of NumPy arrays or lists

Alternatively, a Pandas DataFrame can be created from a dictionary of ndarrays or lists. The keys of the dictionary become the columns of the DataFrame, and it has the default integer index if no index is passed.

    import pandas as pd
    data_dict = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]}
    df = pd.DataFrame(data_dict)
    print(df)

Output:
       one  two
    0  1.0  4.0
    1  2.0  3.0
    2  3.0  2.0
    3  4.0  1.0

Create a Pandas DataFrame from a NumPy structured array

We can create a Pandas DataFrame from a NumPy structured array using the following code. Structured arrays are ndarrays whose datatype is a composition of simpler datatypes.

    import pandas as pd
    import numpy as np
    # structured array; 'S10' is a byte string of length 10
    data = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])
    data[:] = [(1, 2., b'Good'), (2, 4., b'Day')]
    df = pd.DataFrame(data)  # DataFrame using the structured array
    print(df)

OUTPUT:
       A    B        C
    0  1  2.0  b'Good'
    1  2  4.0   b'Day'

4.7.4 Pandas Index Object

Both the Series and DataFrame objects contain an explicit index that allows us to reference and modify data. This Index object is a structure in itself, and it can be thought of either as an immutable array or as an ordered set (technically a multi-set, as Index objects may contain repeated values). Those views have some interesting consequences for the operations available on Index objects. As a simple example, let's construct an Index from a list of integers:

    >>> ind = pd.Index([2, 3, 5, 7, 11])
    >>> ind

OUTPUT:
    Int64Index([2, 3, 5, 7, 11], dtype='int64')

(pandas 2.0 and later display this as Index([2, 3, 5, 7, 11], dtype='int64').)

Index as immutable array

The Index in many ways operates like an array. For example, we can use standard Python indexing notation to retrieve values or slices:

    >>> ind[1]
    3
    >>> ind[::2]
    Int64Index([2, 5, 11], dtype='int64')

Index objects also have many of the attributes familiar from NumPy arrays:

    >>> print(ind.size, ind.shape, ind.ndim, ind.dtype)
    5 (5,) 1 int64

One difference between Index objects and NumPy arrays is that indices are immutable, that is, they cannot be modified via the normal means:

    >>> ind[1] = 0
    Traceback (most recent call last):
      ...
    TypeError: Index does not support mutable operations

This immutability makes it safer to share indices between multiple DataFrames and arrays, without the potential for side effects from inadvertent index modification.

Index as ordered set

Pandas objects are designed to facilitate operations such as joins across datasets, which depend on set arithmetic. The Index object follows many of the conventions used by Python's built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:

    >>> indA = pd.Index([1, 3, 5, 7, 9])
    >>> indB = pd.Index([2, 3, 5, 7, 11])
    >>> indA.intersection(indB)          # intersection
    Int64Index([3, 5, 7], dtype='int64')
    >>> indA.union(indB)                 # union
    Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
    >>> indA.symmetric_difference(indB)  # symmetric difference
    Int64Index([1, 2, 9, 11], dtype='int64')

Older pandas versions also accepted the operators indA & indB, indA | indB, and indA ^ indB for these set operations. Since pandas 2.0 those operators act element-wise instead, so the object methods shown above are the portable form.

4.8 DATA INDEXING AND SELECTION

Indexing in pandas refers to selecting specific rows and columns of data from a Series or DataFrame. Indexing can mean selecting all the rows and some of the columns, some of the rows and all of the columns, or some of each of the rows and columns. Indexing is also known as subset selection.

The Python and NumPy indexing operators [] and the attribute operator . (dot) are used to access pandas data structures across a wide range of use cases. The index is like an address: it is how any data point across the DataFrame or Series can be accessed. Both rows and columns have indexes. Object selection has several user-requested additions to support more explicit location-based indexing.

Axis labeling in pandas objects:
• Identifies data (i.e., provides metadata) using known indicators, important for analysis, visualization, and interactive console display
• Enables automatic and explicit data alignment
• Allows intuitive getting and setting of subsets of the data set

4.8.1 Data Selection in Series

A Series object acts like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.

Series as dictionary

Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values:

    import pandas as pd
    data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
    data

Out[]:
    a    0.25
    b    0.50
    c    0.75
    d    1.00
    dtype: float64

    >>> data['b']  # [ ] to access the value
    0.5

We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:

    >>> 'a' in data
    True
    >>> data.keys()  # using the dot (.) operator to call a method
    Index(['a', 'b', 'c', 'd'], dtype='object')
    >>> list(data.items())
    [('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

Series objects can be modified with a syntax similar to that of a dictionary. We can extend a Series by assigning to a new index value:

    >>> data['e'] = 1.25
    >>> data
    a    0.25
    b    0.50
    c    0.75
    d    1.00
    e    1.25
    dtype: float64

This easy mutability of the objects is a useful feature. Under the hood, Pandas makes the decisions about memory layout and data copying that might need to take place.

Series as one-dimensional array

A Series builds on this dictionary-like interface and provides array-style item selection via the same mechanisms as NumPy arrays, that is, slices, masking, and fancy indexing. Examples of these are as follows:

    # slicing by explicit index
    import pandas as pd
    data = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25], index=['a', 'b', 'c', 'd', 'e'])
    data['a':'c']  # from 'a' to 'c'

Output:
    a    0.25
    b    0.50
    c    0.75
    dtype: float64

    # slicing by implicit integer index
    data[0:2]  # positions 0 and 1

Output:
    a    0.25
    b    0.50
    dtype: float64

    # masking
    >>> data[(data > 0.3) & (data < 0.8)]  # values satisfying the given conditions
    b    0.50
    c    0.75
    dtype: float64

    # fancy indexing
    >>> data[['a', 'e']]
    a    0.25
    e    1.25
    dtype: float64

Note:
• Slicing with an explicit index (i.e., data['a':'c']): the final index is included in the slice.
• Slicing with an implicit index (i.e., data[0:2]): the final index is excluded from the slice.

4.8.2 Indexers: loc, iloc, and ix

Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series. Pandas supports three types of multi-axes indexing:

Indexing | Description
.loc | Label based
.iloc | Integer based
.ix | Both label and integer based (deprecated, and removed in pandas 1.0)

    import pandas as pd
    data = pd.Series(['apple', 'banana', 'apricot'], index=[1, 2, 3])
    print('Data:\n', data)
    # explicit index when indexing
    print('Data[1]:', data[1])
    # implicit index when slicing
    print('Data[1:3]:\n', data[1:3])

OUTPUT:
    Data:
     1      apple
    2     banana
    3    apricot
    dtype: object
    Data[1]: apple
    Data[1:3]:
     2     banana
    3    apricot
    dtype: object

4.8.2.1 loc attribute

The loc attribute allows indexing and slicing that always references the explicit index:

    import pandas as pd
    data = pd.Series(['apple', 'banana', 'apricot'], index=[1, 2, 3])
    print('Data:\n', data)
    # explicit index when indexing
    print('Data[1]:', data.loc[1])
    # explicit index when slicing
    print('Data[1:3]:\n', data.loc[1:3])

OUTPUT:
    Data:
     1      apple
    2     banana
    3    apricot
    dtype: object
    Data[1]: apple
    Data[1:3]:
     1      apple
    2     banana
    3    apricot
    dtype: object

4.8.2.2 iloc attribute

The iloc attribute allows indexing and slicing that always references the implicit Python-style index:

    import pandas as pd
    data = pd.Series(['apple', 'banana', 'apricot'], index=[1, 2, 3])
    print('Data:\n', data)
    # implicit index when indexing
    print('Data[1]:', data.iloc[1])
    # implicit index when slicing
    print('Data[1:3]:\n', data.iloc[1:3])

OUTPUT:
    Data:
     1      apple
    2     banana
    3    apricot
    dtype: object
    Data[1]: banana
    Data[1:3]:
     2     banana
    3    apricot
    dtype: object

4.8.3 Data Selection in DataFrame

A DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index.

DataFrame as a dictionary

    import pandas as pd
    dict1 = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}
    dict2 = {'one': 'Excellent', 'two': 'Very Good', 'three': 'Good',
             'four': 'Moderate', 'five': 'Satisfactory'}
    numbers1 = pd.Series(dict1)
    numbers2 = pd.Series(dict2)
    df = pd.DataFrame({'Numbers': numbers1, 'Grade': numbers2})
    print(df)

OUTPUT:
           Numbers         Grade
    one          1     Excellent
    two          2     Very Good
    three        3          Good
    four         4      Moderate
    five         5  Satisfactory

Accessing values using dictionary-style indexing

    >>> df['Numbers']  # the [ ] operator is used to access a column

OUTPUT:
    one      1
    two      2
    three    3
    four     4
    five     5
    Name: Numbers, dtype: int64

    >>> df.Numbers

OUTPUT:
    one      1
    two      2
    three    3
    four     4
    five     5
    Name: Numbers, dtype: int64

Dictionary-style syntax can also be used to modify the object, in this case adding a new column:

    >>> df['Remarks'] = df['Numbers'] * 10
    >>> df

OUT:
           Numbers         Grade  Remarks
    one          1     Excellent       10
    two          2     Very Good       20
    three        3          Good       30
    four         4      Moderate       40
    five         5  Satisfactory       50

4.8.3.1 .loc[]

Pandas provides various methods for purely label-based indexing. When slicing with labels, both the start bound and the stop bound are included. Integers are valid labels, but they refer to the label and not the position. .loc[] has multiple access methods, such as:

• A single label, e.g. 5 or 'a'
• A list or array of labels, e.g. ['a', 'b', 'c']
• A slice object with labels, e.g. 'a':'f'
• A boolean array (any NA values will be treated as False)
• A callable function with one argument

loc takes two single/list/range operators separated by ','.
The first one indicates the rows and the second one indicates the columns.

Select all rows for a specific column

    # import the pandas library, aliased as pd
    import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'b', 'c', 'd', 'e'],
                      columns=['A', 'B', 'C'])
    print(df)
    # select all rows for column A
    print(df.loc[:, 'A'])

OUTPUT (values are random and differ between runs):
              A         B         C
    a  1.072918  1.254569  0.093088
    b  1.388752  0.072105  0.779739
    c  0.493108  0.912124  0.027013
    d  0.948629  0.101997  0.136843
    e  0.339322  0.218347  0.62784
    a    1.072918
    b    1.388752
    c    0.493108
    d    0.948629
    e    0.339322
    Name: A, dtype: float64

Select all rows for multiple columns given in a list

    # import the pandas library, aliased as pd
    import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'b', 'c', 'd', 'e'],
                      columns=['A', 'B', 'C'])
    print('Data Frame:\n', df)
    # select all rows for multiple columns given in a list
    print('Selected:\n', df.loc[:, ['A', 'C']])

OUTPUT:
    Data Frame:
              A         B         C
    a  0.478075  1.331…    1.873309
    b  1.671238  0.932502  1.535282
    c  1.048747  1.283268  0.127417
    d  0.854136  0.180605  1.452383
    e  0.763562  0.428309  0.366394
    Selected:
              A         C
    a  0.478075  1.873309
    b  1.671238  1.535282
    c  1.048747  0.127417
    d  0.854136  1.452383
    e  0.763562  0.366394

Further selections with loc on a DataFrame df with rows a to e and columns A to C:

    # select a few rows for multiple columns, given as lists
    print(df.loc[['a', 'b'], ['A', 'C']])  # columns A and C for rows a and b
    # select a range of rows for all columns
    print(df.loc['a':'d'])  # all columns for rows a to d
    # get values with a boolean array
    print('SELECTED:\n', df.loc['a'] > 0)  # tests every value of row a

OUTPUT (data frame and boolean result):
    Data Frame:
               A         B          C
    a  -0.099012  1.936825  -0.623656
    b   0.843665  0.922914   0.378428
    c   0.053655  0.360925   1.159739
    d   0.154242  0.383701   0.633789
    e   0.711617  1.264535   0.933291
    SELECTED:
    A    False
    B     True
    C    False
    Name: a, dtype: bool

4.8.3.2 .iloc[]

Pandas provides various methods for purely integer-based indexing. As in Python and NumPy, the indexing is 0-based. The various access methods are as follows:

• An integer, e.g. 5
• A list or array of integers, e.g. [4, 3, 0]
• A slice object with ints, e.g. 1:7
• A boolean array (any NA values will be treated as False)
• A callable function with one argument

EXAMPLE: Select all rows for a specific column

    # import the pandas library, aliased as pd
    import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
    print('Data Frame:\n', df)
    # select all rows for a specific column
    print('Selected:\n', df.iloc[:, 2])  # all rows for column 2

OUTPUT:
    Data Frame:
              A         B          C
    0  0.427079  0.091721   0.577475
    1  0.739794  0.808951   0.494559
    2  0.859058  1.583702   0.385784
    3  2.624747  0.530093   0.058633
    4  0.358775  0.015555  -1.053216
    Selected:
    0    0.577475
    1    0.494559
    2    0.385784
    3    0.058633
    4   -1.053216
    Name: C, dtype: float64

Integer slicing

    import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])
    # integer slicing
    print('Selection 1:\n', df.iloc[:4])         # first 4 rows and all columns
    print('Selection 2:\n', df.iloc[1:5, 2:3])   # rows 1 to 4 and column 2

OUTPUT:
    Selection 1:
              A         B         C
    0  0.360723  1.637749  1.772226
    1  1.061229  0.920609  0.436158
    2  0.050478  1.000889  1.039163
    3  0.740353  0.946621  0.701014
    Selection 2:
              C
    1  0.436158
    2  1.039163
    3  0.701014
    4  0.789554

Difference between loc and iloc:
• loc selects rows and columns with specific labels
• iloc selects rows and columns at specific integer positions

[Figure: Pandas Selecting Data, loc[] and iloc[]. loc[] selects data via the index label, e.g. df.loc[row_label, col_label]; iloc[] selects data via the index position, e.g. df.iloc[row_position, col_position].]

4.8.3.3 .ix[]

Besides pure label-based and integer-based indexing, Pandas provided a hybrid method for selecting and subsetting an object using the .ix[] operator. The operator was deprecated in pandas 0.20 and removed in pandas 1.0, so the examples below run only on older versions; the equivalent .loc[] call is given in a comment.

    import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
    # integer slicing
    print(df.ix[:4])  # rows labelled 0 to 4 and all columns; same as df.loc[:4]

Its output is as follows (values are random and differ between runs):

              A         B         C         D
    0  0.699435  0.256039  1.27072   0.045195
    1  0.685354  0.990791  0.813012  0.631615
    2  0.783192  0.531378  0.025070  0.230806
    3  0.539042  1.284314  0.826977  0.026251
    4  1.044209  …         …         …

Index slicing

Slicing the data by specifying the required columns and rows:

    import pandas as pd
    import numpy as np
    df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
    # index slicing
    print(df.ix[:, 'A'])  # all rows and column A alone; same as df.loc[:, 'A']

Its output is as follows:

    0    0.699435
    1    0.685354
    2    0.783192
    3    0.539042
    4    1.044209
    5   -1.415411
    6    1.062095
    7    0.994204
    Name: A, dtype: float64

The explicit nature of loc and iloc makes them very useful in maintaining clean and readable code, especially in the case of integer indexes. They make the code easy to read and understand, and they prevent subtle errors due to the mixed indexing/slicing convention.

Any of the familiar NumPy-style data access patterns can be used within these indexers. For example, in the loc indexer we can combine masking and fancy indexing. These indexing conventions may also be used to set or modify values. By convention, indexing refers to columns, slicing refers to rows, and direct masking operations are also interpreted row-wise rather than column-wise.
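As a minimal sketch of the last point, the snippet below combines a boolean mask with a list of column labels inside loc, then uses the same indexer to modify values. The frame contents and the names df and subset are hypothetical, picked so that the result is reproducible rather than random.

```python
import pandas as pd

# Hypothetical sample frame with fixed values so the result is reproducible.
df = pd.DataFrame(
    {"A": [1, -2, 3, -4], "B": [10, 20, 30, 40], "C": [0.5, 1.5, 2.5, 3.5]},
    index=["a", "b", "c", "d"],
)

# Masking (rows where A is positive) combined with fancy indexing
# (an explicit list of column labels) in a single loc call.
subset = df.loc[df["A"] > 0, ["A", "C"]]
print(subset)

# The same indexer on the left-hand side modifies values in place:
# set B to 0 for every row where A is negative.
df.loc[df["A"] < 0, "B"] = 0
print(df)

# iloc is the positional counterpart: rows 0 and 1, columns 0 and 2.
print(df.iloc[0:2, [0, 2]])
```

Writing the row condition and the column list in one loc call avoids chained indexing such as df[mask]["B"] = 0, which may assign to a temporary copy instead of the original frame.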
