
The Data Science Machine, or How To Engineer Feature Engineering

The document discusses a system called the Data Science Machine developed by MIT researchers that can perform feature engineering and complete an end-to-end data science pipeline to generate predictive models from raw data. The system has beaten nearly 70% of human teams in data science competitions by automating feature engineering and optimization. It uses an algorithm called Deep Feature Synthesis to intelligently identify and generate features from relational datasets and then performs iterative feature engineering and modeling to optimize predictive performance.


10/12/2016 The Data Science Machine, or How To Engineer Feature Engineering


http://www.kdnuggets.com/2015/10/datasciencemachine.html
The Data Science Machine, or How To Engineer Feature Engineering
Tags: Automated, Data Science, Feature Engineering, Feature Extraction, MIT

MIT researchers have developed what they refer to as the Data Science Machine, which combines feature engineering and an end-to-end data science pipeline into a system that beats nearly 70% of humans in competitions. Is this game-changing?

By Matthew Mayo, KDnuggets.

Recent research by MIT Master's student Max Kanter has led to the implementation of what he refers to as the 'Data Science Machine.' A paper on the Data Science Machine (DSM) and its underlying innovation, the Deep Feature Synthesis algorithm, by Kanter and Kalyan Veeramachaneni, his thesis supervisor at CSAIL, is set to be presented at the IEEE International Conference on Data Science and Advanced Analytics next week. Their paper 'Deep Feature Synthesis: Towards Automating Data Science Endeavors' is available online now.


The DSM is concisely described by Kanter & Veeramachaneni as "an automated system for generating predictive models from raw data," which combines the authors' innovative feature engineering approach with an end-to-end data science pipeline. The DSM has, thus far, managed to beat 68.9% of teams in the data science competitions it has entered. Perhaps most noteworthy, submissions which attain this success rate are generally completed in under 12 hours, as opposed to the months which teams of humans can labor for.

The DSM is premised on the observation that data science competition problems generally have the following properties in common: they are structured and relational, they model human interaction with a complex system, and there is an attempt made to predict some aspect of humanity.

Deep Feature Synthesis

As with any data science problem, features must first be identified from existing variables, or be created by leveraging existing variables. While conceding that feature engineering has made significant recent advancements in the areas of non-relational data such as text and images, Kanter & Veeramachaneni note that it is still this task that most relies on human intervention in the data science pipeline, and can be difficult and time-consuming even for seasoned data scientists. It is also this task that must most closely replicate the efficiency of a human being if it is to be truly automated.

Deep Feature Synthesis (DFS), the DSM's feature engineering algorithm, is strictly for relational datasets, and is used to automate the identification and generation of insight-eliciting features. DFS takes relational tables as input, and is able to process the various types of data held within such a data structure. To be successful, DFS aims to think like a data scientist, looking to turn insightful questions into input features.

The DFS algorithm walks the relationships between tables and applies mathematical functions as it does so, creating a final feature step by step. As it performs this walk, DFS stacks the calculations of these functions to a particular depth, and this is where the "Deep" in the name DFS is derived.
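The stacking idea can be made concrete with a small sketch. It is not the authors' implementation: the customers, orders, and items tables, the column names, and the `aggregate` helper are all hypothetical, and only the Python standard library is used. It shows one function (SUM) applied across the first relationship and a second function (MEAN) stacked on top of that result across the next relationship, giving a feature of depth 2.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy tables: customers -> orders -> items (foreign keys point up the chain).
orders = [
    {"order_id": 1, "customer_id": "a"},
    {"order_id": 2, "customer_id": "a"},
    {"order_id": 3, "customer_id": "b"},
]
items = [
    {"order_id": 1, "price": 10.0},
    {"order_id": 1, "price": 5.0},
    {"order_id": 2, "price": 20.0},
    {"order_id": 3, "price": 7.0},
]


def aggregate(rows, key, value, func):
    """Group rows by a foreign key and apply func to each group's values."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: func(v) for k, v in groups.items()}


# Depth 1: walk items -> orders and compute SUM(items.price) for every order.
order_total = aggregate(items, "order_id", "price", sum)

# Depth 2: walk orders -> customers and stack MEAN on top of the depth-1 feature,
# giving MEAN(orders.SUM(items.price)) for every customer.
orders_with_total = [
    {**order, "total": order_total[order["order_id"]]} for order in orders
]
customer_mean_total = aggregate(orders_with_total, "customer_id", "total", mean)

print(order_total)          # {1: 15.0, 2: 20.0, 3: 7.0}
print(customer_mean_total)  # {'a': 17.5, 'b': 7.0}
```

Swapping `sum` or `mean` for another aggregate (for example `max` or `len`) yields a different stacked feature from the same walk, which is how a small set of functions can produce a large candidate feature set.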

Depending on the input data types, a number of mathematical functions are applied at two distinct levels in the DSM: entity and relational. Entity-level features focus on conversion and translation functions, such as changing data representations, rounding numbers, and extracting existing generalized attributes into more numerous and concise attributes. Relational-level features are concerned with the relationships between entities in tables (think about your primary and foreign keys). These feature functions are then able to extract related data from other tables to associate with a given feature (for example, finding the max item price or item count associated with an order), data which could potentially be exploited as a useful feature to feed into a model.
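To illustrate the difference between the two levels, here is a minimal sketch using the order example from the text. The table layout, column names, and the specific transforms (weekday extraction, rounding) are assumptions chosen for illustration, not the DSM's actual function set.

```python
from datetime import date

# Hypothetical order and item tables; all names and values are illustrative only.
orders = [
    {"order_id": 1, "placed_on": date(2015, 10, 12), "shipping_cost": 4.49},
    {"order_id": 2, "placed_on": date(2015, 10, 17), "shipping_cost": 0.0},
]
items = [
    {"order_id": 1, "price": 10.0},
    {"order_id": 1, "price": 25.0},
    {"order_id": 2, "price": 8.0},
]

features = {}
for order in orders:
    # Relational level: follow the foreign key from items back to this order.
    prices = [item["price"] for item in items if item["order_id"] == order["order_id"]]
    features[order["order_id"]] = {
        # Entity level: transform values already present in the order row.
        "weekday": order["placed_on"].weekday(),  # Monday == 0
        "is_weekend": order["placed_on"].weekday() >= 5,
        "shipping_rounded": round(order["shipping_cost"]),
        # Relational level: aggregate the related item rows.
        "item_count": len(prices),
        "max_item_price": max(prices),
    }

for order_id, row in features.items():
    print(order_id, row)
```

Entity-level features need only the row itself, while relational-level features require a join on the key, which is why the latter depend on the dataset's relationships being declared.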

Machine Learning Pathway

To start off the DSM's machine learning pathway, one of the input features is chosen to model, which is referred to as the target value, and which is used to form the prediction problem. Appropriate features, known as predictors, are selected via metadata to help in the prediction process. The DSM then creates a pathway for data preprocessing, feature selection, dimensionality reduction, modeling, and evaluation, all of which is parameterized and available for reuse if necessary.

Parameter optimization is accomplished using a Copula Process, and an attempt is made to reduce the number of features by observing correlation. The reduced set of features is then tested on sample data, recombining them in different ways to optimize the accuracy of the predictions they yield. By its use of auto-tuning, which the authors argue is absolutely critical to its performance, the DSM was able to increase its score at all three of its competitions.
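The article does not spell out how correlation is used to shrink the feature set, so the following is only one plausible reading, offered as a sketch: a feature is dropped when it is almost perfectly correlated with a feature that has already been kept. The `prune_correlated` helper, the 0.95 threshold, and the sample columns are all assumptions, not details from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def prune_correlated(columns, threshold=0.95):
    """Keep a feature only if it is not strongly correlated with one already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature matrix: item_count_x2 is an exact multiple of item_count.
columns = {
    "item_count": [1, 2, 3, 4, 5],
    "item_count_x2": [2, 4, 6, 8, 10],
    "max_item_price": [9.0, 3.0, 7.5, 1.0, 4.0],
}
selected = prune_correlated(columns)
print(selected)  # ['item_count', 'max_item_price']
```

Because automatically generated features are often near-duplicates of one another (the same aggregate reached by different walks), a redundancy filter of this kind is a cheap first pass before any model is trained.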

Discussion

What this all seems to suggest, essentially, is this: The DSM uses intelligent relational database relation walking to help build and establish candidate features, narrows this feature set down by looking for correlated values, and uses combinatorics in what amounts to brute-force feature engineering, to apply iterative feature subsets to sample data while recombining them for optimization until the best possible solution is found.

To measure the DSM's performance, it was entered in competitions at KDD Cup 2014, IJCAI, and KDD Cup 2015, where, as mentioned, it outperformed more than 2/3 of the human competitors. Kanter & Veeramachaneni claim that even during its worst performance (IJCAI), the DSM still managed to frame the prediction problem in similar terms to human competitors, evidenced by the fact that it proceeded in the task by pursuing similar avenues of data modeling. In this same competition, it finished within approximately 0.04 AUC of the contest winner, suggesting that the DSM captured what could be considered the major aspects of the competition dataset.
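For readers unfamiliar with the metric, AUC (area under the ROC curve) is the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The sketch below computes it by direct pairwise comparison; the labels and scores are made up for illustration and have no connection to the competition results.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predictions from two models on the same six examples.
labels = [1, 0, 1, 0, 1, 0]
model_a = [0.9, 0.2, 0.8, 0.4, 0.7, 0.1]  # ranks every positive above every negative
model_b = [0.9, 0.2, 0.3, 0.4, 0.7, 0.1]  # misranks one positive/negative pair

gap = auc(labels, model_a) - auc(labels, model_b)
print(auc(labels, model_a), auc(labels, model_b), gap)
```

Since AUC ranges from 0.5 (random ranking) to 1.0 (perfect ranking), a gap of a few hundredths means the two models order most pairs of examples the same way.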

Kanter & Veeramachaneni argue that, while it cannot currently compete with the highest-performing human scientists, the DSM nevertheless has a role alongside them. Even though a number of humans beat out the DSM in each of its competitions, it was able to outperform the majority of them with considerably less effort (less than 12 hours versus months, in some cases). They suggest that, in light of this, it can be used for setting benchmarks as well as for fostering creativity. Front-loading feature engineering and generating candidate top-performing feature sets could allow humans to move on to rethinking the problem within hours, effectively starting with the DSM solution and moving forward from that point.

It should be noted that, while the DSM is impressive, it's hardly the first system aiming to automate machine learning. Other examples include many systems that automatically build models to bid on advertising, or KXEN Model Factory (now part of SAP), which offered Automated Model Building as early as 2010. Also, it is clear that the DSM is not useful for all types of data, and is a system focused solely on the exploitation of relational datasets. It is also yet to be shown that it can be effective on relational datasets that do not conform to the previously identified data science competition problem pattern.

The DSM has already been spun off into a startup called FeatureLab, touted as "Insights with an Interface," with Kanter as its CEO. The website states "Do more with your data, without more data scientists," and claims that it is the "best solution for companies looking to increase their data science resources." These are both bold claims, especially in light of the fact that none of the individual pieces of the DSM can really be considered breakthroughs. It is entirely possible that FeatureLab gets lost in a cloud of "business intelligence" service platforms. But Big Data is not going away, and feature engineering has been one of the hottest topics in machine learning over the past 12 months. It just may be that the DSM's particular combination of technologies, at what may end up being the right time, leads to a new way of thinking about data science.

Margo Seltzer, a Harvard computer science professor, has stated in reference to the DSM, "I think what they've done is going to become the standard quickly, very quickly." If this is the case, FeatureLab stands to be well positioned.

You can read more about Kanter & Veeramachaneni's Data Science Machine here.

Bio: Matthew Mayo is a computer science graduate student currently working on his thesis parallelizing machine learning algorithms. He is also a student of data mining, a data enthusiast, and an aspiring machine learning scientist.

Related:

3 Things About Data Science You Won't Find In Books
Seven Techniques for Data Dimensionality Reduction
Aug 2015 Analytics, Big Data, Data Mining, Data Science Acquisitions, Startup Roundup
