10 Min Pandas

for Data cleaning tips

Uploaded by

raj
10 minutes to pandas

This is a short introduction to pandas, geared mainly for new users. You can see more complex recipes in the Cookbook.

Customarily, we import as follows:

In [1]: import numpy as np
In [2]: import pandas as pd

Object creation

See the Data Structure Intro section.

Creating a Series by passing a list of values, letting pandas create a default integer index:

In [3]: s = pd.Series([1, 3, 5, np.nan, 6, 8])
In [4]: s
Out[4]:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

In [5]: dates = pd.date_range('20130101', periods=6)
In [6]: dates
Out[6]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
In [7]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
In [8]: df

Creating a DataFrame by passing a dict of objects that can be converted to series-like.

In [9]: df2 = pd.DataFrame({'A': 1.,
   ...:                     'B': pd.Timestamp('20130102'),
   ...:                     'C': pd.Series(1, index=list(range(4)), dtype='float32'),
   ...:                     'D': np.array([3] * 4, dtype='int32'),
   ...:                     'E': pd.Categorical(["test", "train", "test", "train"]),
   ...:                     'F': 'foo'})
In [10]: df2
Out[10]:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

The columns of the resulting DataFrame have different dtypes.

In [11]: df2.dtypes
Out[11]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

If you're using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here's a subset of the attributes that will be completed:

In [12]: df2.<TAB>  # noqa: E225, E999
df2.A                  df2.bool
df2.abs                df2.boxplot
df2.add                df2.C
df2.add_prefix         df2.clip
df2.add_suffix         df2.columns
df2.align              df2.copy
df2.all                df2.count
df2.any                df2.combine
df2.append             df2.D
df2.apply              df2.describe
df2.applymap           df2.diff
df2.B                  df2.duplicated

As you can see, the columns A, B, C, and D are automatically tab completed. E and F are there as well; the rest of the attributes have been truncated for brevity.

Viewing data

See the Basics section.
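One quick way to inspect a frame is to look at its per-column dtypes, as was done with df2 above. The following is a minimal sketch on a hypothetical three-column frame (the names frame, price, units and split are illustrative and do not appear in the original); it also uses select_dtypes to keep only the numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the column names are illustrative only.
frame = pd.DataFrame({
    "price": np.array([1.5, 2.5, 3.5], dtype="float32"),
    "units": np.array([3, 4, 5], dtype="int32"),
    "split": pd.Categorical(["test", "train", "test"]),
})

# One dtype per column, reported as a Series indexed by column name.
dtype_names = frame.dtypes.astype(str).to_dict()
print(dtype_names)

# select_dtypes keeps only the columns whose dtype matches the filter.
numeric_columns = list(frame.select_dtypes(include="number").columns)
print(numeric_columns)
```

The categorical column is left out of the numeric selection because its dtype is category, not a number type.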
Here is how to view the top and bottom rows of the frame:

In [13]: df.head()
In [14]: df.tail(3)

Display the index, columns:

In [15]: df.index
Out[15]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
In [16]: df.columns
Out[16]: Index(['A', 'B', 'C', 'D'], dtype='object')

DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.

For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn't require copying data.

In [17]: df.to_numpy()

For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive.

In [18]: df2.to_numpy()
Out[18]:
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

Note: DataFrame.to_numpy() does not include the index or column labels in the output.
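The dtype behaviour of to_numpy() described above can be checked directly. This is a small sketch with two hypothetical frames (floats and mixed are illustrative names, not from the original): one holds only floating-point columns, the other mixes floats with strings.

```python
import numpy as np
import pandas as pd

# All-float frame: a single NumPy dtype can hold every column.
floats = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

# Mixed frame: floats and strings share no common numeric dtype.
mixed = pd.DataFrame({"A": [1.0, 2.0], "F": ["foo", "bar"]})

floats_arr = floats.to_numpy()
mixed_arr = mixed.to_numpy()

print(floats_arr.dtype)  # float64
print(mixed_arr.dtype)   # object
```

In the mixed case every value becomes a Python object, which is why the conversion is comparatively expensive.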
describe() shows a quick statistic summary of your data:

In [19]: df.describe()

Transposing your data:

In [20]: df.T

Sorting by an axis:

In [21]: df.sort_index(axis=1, ascending=False)

Sorting by values:

In [22]: df.sort_values(by='B')

Selection

Note: While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, .at, .iat, .loc and .iloc.

See the indexing documentation Indexing and Selecting Data and MultiIndex / Advanced Indexing.

Getting

Selecting a single column, which yields a Series, equivalent to df.A:

In [23]: df['A']

Selecting via [], which slices the rows.

In [24]: df[0:3]
In [25]: df['20130102':'20130104']

Selection by label

See more in Selection by Label.

For getting a cross section using a label:

In [26]: df.loc[dates[0]]

Selecting on a multi-axis by label:

In [27]: df.loc[:, ['A', 'B']]

Showing label slicing, both endpoints are included:

In [28]: df.loc['20130102':'20130104', ['A', 'B']]

Reduction in the dimensions of the returned object:

In [29]: df.loc['20130102', ['A', 'B']]

For getting a scalar value:

In [30]: df.loc[dates[0], 'A']
Out[30]: 0.4691122999071863

For getting fast access to a scalar (equivalent to the prior method):

In [31]: df.at[dates[0], 'A']
Out[31]: 0.4691122999071863

Selection by position

See more in Selection by Position.

Select via the position of the passed integers:

In [32]: df.iloc[3]

By integer slices, acting similar to numpy/python:

In [33]: df.iloc[3:5, 0:2]
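Label slices with .loc include both endpoints, while integer slices with .iloc follow the usual Python convention of excluding the stop position. A minimal sketch of that difference, using a hypothetical frame with deterministic values (the name frame and its contents are illustrative):

```python
import pandas as pd

dates = pd.date_range("20130101", periods=6)
frame = pd.DataFrame(
    {"A": list(range(6)), "B": list(range(10, 16))},
    index=dates,
)

# Label slice: both '2013-01-02' and '2013-01-04' are included (3 rows).
by_label = frame.loc["20130102":"20130104", ["A", "B"]]

# Position slice: rows 1, 2 and 3; the stop position 4 is excluded.
by_position = frame.iloc[1:4, 0:2]

print(by_label.equals(by_position))  # True
```

To cover the same rows, the .iloc stop has to be one past the last wanted position, whereas the .loc stop is the last wanted label itself.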
By lists of integer position locations, similar to the numpy/python style:

In [34]: df.iloc[[1, 2, 4], [0, 2]]

For slicing rows explicitly:

In [35]: df.iloc[1:3, :]

For slicing columns explicitly:

In [36]: df.iloc[:, 1:3]

For getting a value explicitly:

In [37]: df.iloc[1, 1]

For getting fast access to a scalar (equivalent to the prior method):

In [38]: df.iat[1, 1]
Out[38]: -0.17321464905330858

Boolean indexing

Using a single column's values to select data.

In [39]: df[df['A'] > 0]

Selecting values from a DataFrame where a boolean condition is met.

In [40]: df[df > 0]

Using the isin() method for filtering:

In [41]: df2 = df.copy()
In [42]: df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
In [43]: df2
In [44]: df2[df2['E'].isin(['two', 'four'])]

Setting

Setting a new column automatically aligns the data by the indexes.

In [45]: s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
In [46]: s1
In [47]: df['F'] = s1

Setting values by label:

In [48]: df.at[dates[0], 'A'] = 0

Setting values by position:

In [49]: df.iat[0, 1] = 0

Setting by assigning with a NumPy array:

In [50]: df.loc[:, 'D'] = np.array([5] * len(df))

The result of the prior setting operations.

In [51]: df

A where operation with setting.

In [52]: df2 = df.copy()
In [53]: df2[df2 > 0] = -df2
In [54]: df2

Missing data

pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See the Missing Data section.

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.

In [55]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
In [56]: df1.loc[dates[0]:dates[1], 'E'] = 1
In [57]: df1

To drop any rows that have missing data.

In [58]: df1.dropna(how='any')

Filling missing data.

In [59]: df1.fillna(value=5)

To get the boolean mask where values are nan.

In [60]: pd.isna(df1)

Operations

See the Basic section on Binary Ops.
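As a brief illustration of a binary operation, here is a sketch of how pandas aligns two Series on their labels before adding them. The Series names and values are hypothetical. Labels present on only one side produce NaN, which later reductions such as mean() skip by default.

```python
import numpy as np
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
right = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Addition aligns on labels; "a" and "d" have no partner and become NaN.
total = left + right

# The method form accepts fill_value to treat a missing side as 0.
filled = left.add(right, fill_value=0)

# Reductions exclude NaN by default: (12 + 23) / 2.
average = total.mean()

print(total)
print(filled)
print(average)  # 17.5
```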
Stats

Operations in general exclude missing data.

Performing a descriptive statistic:

In [61]: df.mean()

Same operation on the other axis:

In [62]: df.mean(1)

Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension.

In [63]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
In [64]: s
Out[64]:
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64
In [65]: df.sub(s, axis='index')

Apply

Applying functions to the data:

In [66]: df.apply(np.cumsum)
In [67]: df.apply(lambda x: x.max() - x.min())

Histogramming

See more at Histogramming and Discretization.

In [68]: s = pd.Series(np.random.randint(0, 7, size=10))
In [69]: s
In [70]: s.value_counts()

String Methods

Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them). See more at Vectorized String Methods.

In [71]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
In [72]: s.str.lower()

Concat

pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

See the Merging section.

Concatenating pandas objects together with concat():

In [73]: df = pd.DataFrame(np.random.randn(10, 4))
In [74]: df

# break it into pieces
In [75]: pieces = [df[:3], df[3:7], df[7:]]
In [76]: pd.concat(pieces)

Note: Adding a column to a DataFrame is relatively fast. However, adding a row requires a copy, and may be expensive.
We recommend passing a pre-built list of records to the DataFrame constructor instead of building a DataFrame by iteratively appending records to it. See Appending to dataframe for more.

Join

SQL style merges. See the Database style joining section.

In [77]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
In [78]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
In [79]: left
Out[79]:
   key  lval
0  foo     1
1  foo     2
In [80]: right
Out[80]:
   key  rval
0  foo     4
1  foo     5
In [81]: pd.merge(left, right, on='key')
Out[81]:
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5

Another example that can be given is:

In [82]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
In [83]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
In [84]: left
Out[84]:
   key  lval
0  foo     1
1  bar     2
In [85]: right
Out[85]:
   key  rval
0  foo     4
1  bar     5
In [86]: pd.merge(left, right, on='key')
Out[86]:
   key  lval  rval
0  foo     1     4
1  bar     2     5

Grouping

By "group by" we are referring to a process involving one or more of the following steps:

• Splitting the data into groups based on some criteria
• Applying a function to each group independently
• Combining the results into a data structure

See the Grouping section.

In [87]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
   ....:                          'foo', 'bar', 'foo', 'foo'],
   ....:                    'B': ['one', 'one', 'two', 'three',
   ....:                          'two', 'two', 'one', 'three'],
   ....:                    'C': np.random.randn(8),
   ....:                    'D': np.random.randn(8)})
In [88]: df

Grouping and then applying the sum() function to the resulting groups.

In [89]: df.groupby('A').sum()

Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function.

In [90]: df.groupby(['A', 'B']).sum()

Reshaping

See the sections on Hierarchical Indexing and Reshaping.
In [91]: tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
   ....:                      'foo', 'foo', 'qux', 'qux'],
   ....:                     ['one', 'two', 'one', 'two',
   ....:                      'one', 'two', 'one', 'two']]))
In [92]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [93]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
In [94]: df2 = df[:4]
In [95]: df2

The stack() method "compresses" a level in the DataFrame's columns.

In [96]: stacked = df2.stack()
In [97]: stacked

With a "stacked" DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:

In [98]: stacked.unstack()
In [99]: stacked.unstack(1)
In [100]: stacked.unstack(0)

Pivot tables

See the section on Pivot Tables.

In [101]: df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
   .....:                    'B': ['A', 'B', 'C'] * 4,
   .....:                    'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
   .....:                    'D': np.random.randn(12),
   .....:                    'E': np.random.randn(12)})
In [102]: df
In [103]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

Time series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. See the Time Series section.

In [104]: rng = pd.date_range('1/1/2012', periods=100, freq='S')
In [105]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
In [106]: ts.resample('5Min').sum()

Time zone representation:

In [107]: rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
In [108]: ts = pd.Series(np.random.randn(len(rng)), rng)
In [109]: ts
In [110]: ts_utc = ts.tz_localize('UTC')
In [111]: ts_utc

Converting to another time zone:

In [112]: ts_utc.tz_convert('US/Eastern')

Converting between time span representations:

In [113]: rng = pd.date_range('1/1/2012', periods=5, freq='M')
In [114]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In [115]: ts
In [116]: ps = ts.to_period()
In [117]: ps
In [118]: ps.to_timestamp()

Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following the quarter end:

In [119]: prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
In [120]: ts = pd.Series(np.random.randn(len(prng)), prng)
In [121]: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
In [122]: ts.head()

Categoricals

pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the API documentation.

In [123]: df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
   .....:                    "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})

Convert the raw grades to a categorical data type.

In [124]: df["grade"] = df["raw_grade"].astype("category")
In [125]: df["grade"]
Out[125]:
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']

Rename the categories to more meaningful names (assigning to Series.cat.categories is in place).

In [126]: df["grade"].cat.categories = ["very good", "good", "very bad"]

Reorder the categories and simultaneously add the missing categories (methods under Series.cat return a new Series by default).

In [127]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium",
   .....:                                               "good", "very good"])
In [128]: df["grade"]
Out[128]:
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']

Sorting is per order in the categories, not lexical order.

In [129]: df.sort_values(by="grade")
Out[129]:
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good

Grouping by a categorical column also shows empty categories.

In [130]: df.groupby("grade").size()
Out[130]:
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

Plotting

See the Plotting docs.

We use the standard convention for referencing the matplotlib API:

In [131]: import matplotlib.pyplot as plt
In [132]: plt.close('all')
In [133]: ts = pd.Series(np.random.randn(1000),
   .....:                index=pd.date_range('1/1/2000', periods=1000))
In [134]: ts = ts.cumsum()
In [135]: ts.plot()

[Figure: line plot of the cumulative-sum series ts, January 2000 through 2002]

On a DataFrame, the plot() method is a convenience to plot all of the columns with labels:

In [136]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
   .....:                   columns=['A', 'B', 'C', 'D'])
In [137]: df = df.cumsum()
In [138]: plt.figure()
In [139]: df.plot()
In [140]: plt.legend(loc='best')

[Figure: line plot of the cumulative sums of columns A, B, C and D, with a legend]

Getting data in/out

CSV

Writing to a csv file.

In [141]: df.to_csv('foo.csv')

Reading from a csv file.

In [142]: pd.read_csv('foo.csv')
[1000 rows x 5 columns]

HDF5

Reading and writing to HDFStores.

Writing to a HDF5 Store.

In [143]: df.to_hdf('foo.h5', 'df')

Reading from a HDF5 Store.

In [144]: pd.read_hdf('foo.h5', 'df')
[1000 rows x 4 columns]

Excel

Reading and writing to MS Excel.

Writing to an excel file.

In [145]: df.to_excel('foo.xlsx', sheet_name='Sheet1')

Reading from an excel file.

In [146]: pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
[1000 rows x 5 columns]

Gotchas

If you are attempting to perform an operation you might see an exception like:

>>> if pd.Series([False, True, False]):
...     print("I was true")
Traceback
    ...
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

See Comparisons for an explanation and what to do.

See Gotchas as well.

© Copyright 2008-2020, the pandas development team. Created using Sphinx 3.1.1.
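The truth-value gotcha above can be reproduced and resolved in a few lines. This is a minimal sketch (the variable names are illustrative): it catches the ValueError raised by using a Series directly in an if statement, then shows the explicit reductions the error message suggests.

```python
import pandas as pd

flags = pd.Series([False, True, False])

# Using the Series itself as a condition is ambiguous and raises ValueError.
try:
    if flags:
        outcome = "reached"
except ValueError:
    outcome = "ValueError"

# State the intent explicitly instead.
any_true = bool(flags.any())   # is at least one element True?
all_true = bool(flags.all())   # are all elements True?
is_empty = flags.empty         # does the Series have no elements?

print(outcome, any_true, all_true, is_empty)
```

Which reduction to use depends on the question being asked; any() and all() give different answers for this Series.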
