
Lannet

The document provides 3 tasks related to data cleaning and feature engineering for machine learning: 1) Write a function to identify date columns in a dataframe and generate new columns with differences between date columns. 2) Write a function to deal with outliers in continuous variables by either removing or imputing them in a fast and robust manner. 3) Write a function to drop columns that have Pearson correlation greater than 0.85 to reduce redundancy in features. Consider efficiency, robustness, and ability to work on any dataset.

Uploaded by Abhyuday Shukla

The product that we are developing is an AutoML solution (automated machine learning). So, data cleaning and data manipulation are a big part of the process. A user can input any dataset, and we have to detect what's a date, what's a character, etc. Since no data is perfect, very simple data-cleaning code won't be able to read most of the variables and will remove those variables by the time it comes to predictive modelling. Consider this, and then try to seek answers to the tasks given.

Please use Google Colab, and if you feel any package is missing (that may be elsewhere), use an alternate package.

In case of any questions, you can make basic assumptions if you want to; we want to see how innovatively you think. We want to see your thinking abilities and how big you can think from this given information.

We are going to look into how deep you think and how robust the code is.

Only send us a Google Colab link, and for the theoretical part, put it in comments in the Colab itself. There is a Google response link: https://forms.gle/UKJc6NAVWkdVaPxz6 - fill in your name, email and Google Colab link here.

Abhyuday Shukla. 170420111050

1) Write a function in Python that takes a dataframe as input and identifies which columns have dates in them. Using these date columns, make new columns which are the differences between these columns, taking 2 at a time (for instance, if there are date1, date2 and date3 columns, the output should be like date1-date2, date2-date3, date1-date3). For this problem only, print out the data in the Colab.
Things to consider:
· Date columns might have some invalid entries in them
· Dates can be in different formats throughout the column
· Code should be efficient and fast
· Code should be well commented and easy to interpret
· Use Google Colab
· Code should be robust enough to run on any dataset
· Make a dummy dataset by yourself.
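
One possible way to approach task 1 is sketched below, under stated assumptions rather than as a reference solution. It parses each distinct text value on its own so that mixed formats and invalid entries inside one column do not break the whole column, then builds the pairwise differences in days. The function names (`parse_dates`, `find_date_columns`, `add_date_differences`) and the `min_parsed` cutoff of 0.6 are hypothetical choices made for illustration, and the sketch assumes timezone-naive dates.

```python
import itertools
import warnings

import pandas as pd


def parse_dates(series):
    """Parse values one by one so mixed formats in a column still work."""
    cache = {}  # each distinct string is parsed only once
    parsed = []
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # hide format-inference warnings
        for value in series:
            if pd.isna(value):
                parsed.append(pd.NaT)
                continue
            text = str(value).strip()
            if text not in cache:
                # Invalid entries become NaT instead of raising an error.
                cache[text] = pd.to_datetime(text, errors="coerce")
            parsed.append(cache[text])
    return pd.Series(parsed, index=series.index, dtype="datetime64[ns]")


def find_date_columns(df, min_parsed=0.6):
    """Return {column name: parsed datetime Series} for date-like columns.

    min_parsed is an assumed cutoff: the share of non-null values that
    must parse as dates before a text column is treated as a date column.
    """
    date_cols = {}
    for col in df.columns:
        s = df[col]
        if pd.api.types.is_datetime64_any_dtype(s):
            date_cols[col] = s
        elif pd.api.types.is_object_dtype(s) or pd.api.types.is_string_dtype(s):
            parsed = parse_dates(s)
            n_values = s.notna().sum()
            if n_values and parsed.notna().sum() / n_values >= min_parsed:
                date_cols[col] = parsed
    return date_cols


def add_date_differences(df, min_parsed=0.6):
    """Add one 'a-b' column (difference in days) per pair of date columns."""
    date_cols = find_date_columns(df, min_parsed)
    out = df.copy()
    for a, b in itertools.combinations(date_cols, 2):
        out[f"{a}-{b}"] = (date_cols[a] - date_cols[b]).dt.days
    return out, list(date_cols)


# Dummy dataset with mixed formats, an invalid entry and a missing value.
demo = pd.DataFrame({
    "date1": ["2021-01-10", "2021-02-15", "not a date", "2021-03-20"],
    "date2": ["2021-01-05", "10 Feb 2021", "2021-03-01", None],
    "name": ["widget", "gadget", "sprocket", "gizmo"],
    "amount": [1, 2, 3, 4],
})
result, detected = add_date_differences(demo)
print(detected)
print(result)
```

Parsing element by element is slower than a single vectorized call, which is why distinct values are cached; on very large columns a faster first pass with one fixed format may be worth trying before falling back to this. Numeric columns are deliberately skipped, since integers would otherwise be read as timestamps.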

2) Write a Python function which takes a dataframe as input and deals with the issue of outliers in all the continuous variables.
Things to consider:
· It's up to you how you want to deal with outliers. You can either remove them or impute them.
· We consider outliers to be incorrect entries and not the ones which are natural. For example, in the salary column, if there is a value of $1,000,000, then this value can be due to a natural cause (like it's a salary for a CEO) or it can be a case of incorrect entry (like someone put an extra zero). So, we are only after incorrect entries.
· The function should also identify which columns are continuous so that you can perform outlier removal on these columns
· The code must be very fast, so you cannot use multivariate approaches which are based on distance calculation between all points.
· Code should be well commented and easy to interpret
· Use Google Colab
· Code should be robust enough to run on any dataset, and the dataset on which we will test will not be a perfect dataset, as in the case of the real world
· Make a dummy dataset by yourself or pass any publicly available dataset to test out your logic
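
A minimal univariate sketch for task 2 follows, offered as one option among several. It treats a numeric column as continuous when it has enough distinct values, scores each value with a robust z-score built from the median and the median absolute deviation (MAD), and imputes extreme values with the median. The names `find_continuous_columns` and `impute_outliers`, the `min_unique` cutoff of 10 and the `threshold` of 5.0 are assumptions chosen for illustration; a high threshold is used on the guess that data-entry errors tend to sit much further out than natural extremes, which statistics alone cannot guarantee.

```python
import pandas as pd


def find_continuous_columns(df, min_unique=10):
    """Numeric, non-boolean columns with at least min_unique distinct values.

    min_unique is an assumed heuristic to skip numeric codes and ratings.
    """
    cols = []
    for col in df.columns:
        s = df[col]
        if pd.api.types.is_numeric_dtype(s) and not pd.api.types.is_bool_dtype(s):
            if s.nunique(dropna=True) >= min_unique:
                cols.append(col)
    return cols


def impute_outliers(df, threshold=5.0, min_unique=10):
    """Replace extreme values with the column median, one column at a time.

    Each column is scored with a robust z-score,
    0.6745 * (x - median) / MAD, which is far less distorted by the
    outliers themselves than the mean and standard deviation would be.
    """
    out = df.copy()
    report = {}  # column name -> number of values replaced
    for col in find_continuous_columns(out, min_unique):
        s = out[col].astype("float64")  # integer columns become float
        median = s.median()
        mad = (s - median).abs().median()
        if not mad > 0:  # MAD of 0 or NaN: no usable scale, skip the column
            continue
        robust_z = 0.6745 * (s - median) / mad
        mask = robust_z.abs() > threshold
        report[col] = int(mask.sum())
        out[col] = s.mask(mask, median)
    return out, report


# Dummy dataset: 20 plausible salaries plus one entry with extra zeros.
demo = pd.DataFrame({
    "salary": [40000 + 1000 * i for i in range(20)] + [50000000],
    "rating": [1, 2, 3] * 7,   # numeric but not continuous
    "city": ["x"] * 21,        # non-numeric, ignored
})
cleaned, report = impute_outliers(demo)
print(report)
print(cleaned["salary"].tail())
```

Every step is a per-column median, so the cost grows with the number of rows and columns rather than with the number of point pairs, which keeps it in line with the speed constraint above. Whether a flagged value is a typo or a genuine extreme still depends on domain knowledge, so returning the `report` for review may be safer than imputing silently.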

3) Write a function in Python that takes a dataframe as input and drops columns having Pearson correlation more than 0.85
Things to consider:
· Code should drop the least number of variables possible. (this is an important point)
· Code should be efficient and fast
· Code should be well commented and easy to interpret
· Use Google Colab
· Code should be robust enough to run on any dataset
· Make a dummy dataset by yourself or pass any publicly available dataset to test out your logic
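
For task 3, the requirement to drop as few variables as possible can be read as removing the smallest set of columns that touches every highly correlated pair. The sketch below uses a greedy heuristic for that: it repeatedly drops the column that still has the most correlated partners. This is an approximation and is not guaranteed to find the true minimum. The function name `drop_correlated_columns` is hypothetical, and comparing the absolute correlation against the threshold (so strong negative correlation also counts) is an assumption about what the task intends.

```python
import numpy as np
import pandas as pd


def drop_correlated_columns(df, threshold=0.85):
    """Greedily drop columns until no pair has |Pearson r| > threshold."""
    corr = df.select_dtypes(include="number").corr(method="pearson").abs()
    cols = list(corr.columns)
    # Boolean "too correlated" matrix; NaN correlations compare as False.
    adj = (corr > threshold).to_numpy(copy=True)
    np.fill_diagonal(adj, False)  # a column is not a partner of itself
    dropped = []
    while adj.any():
        degree = adj.sum(axis=0)       # remaining partners per column
        worst = int(degree.argmax())   # column involved in the most pairs
        dropped.append(cols[worst])
        adj[worst, :] = False          # its pairs are now resolved
        adj[:, worst] = False
    return df.drop(columns=dropped), dropped


# Dummy dataset built from orthogonal patterns so correlations are exact:
# corr(h, x) = corr(h, y) is about 0.894, corr(x, y) = 0.8, z is unrelated.
u = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
e1 = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)
e2 = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
demo = pd.DataFrame({
    "x": u + 0.5 * e1,
    "h": u,
    "y": u + 0.5 * e2,
    "z": e1 * e2,
})
reduced, dropped = drop_correlated_columns(demo)
print(dropped)
print(list(reduced.columns))
```

In this dummy dataset `h` is correlated above 0.85 with both `x` and `y`, while `x` and `y` stay below the threshold with each other, so dropping `h` alone resolves both pairs. A simpler rule that drops one column from every offending pair in column order would remove two columns here.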

Hint: There is no restriction on copying code from the internet, but remember that most of the code found over the internet,
· Works on near-perfect data, which is impossible in the real world
· Is not equipped to work on every dataset, which is central to our business model
· Has multiple flaws
We have specifically designed each question to see your thinking ability.
