The document discusses the concepts of modeling in data science, distinguishing between statistical and algorithmic modeling, with a focus on their applications and limitations. It also addresses common myths in data science, emphasizing the crucial role of programmers in data collection, storage, processing, and modeling, while clarifying that machines primarily assist in executing tasks. Overall, it highlights the importance of human expertise in the data science process despite the reliance on machines for certain functions.
Lecture7 Myths of Data Science
Modelling in Data Science
& Myths in Data Science
• Dr Vatan Sehrawat
• Asst. Professor, Computer Sc. & Engg. Department
• RBS-SIET Zainabad
• [email protected]
• 8059211113

Modelling Data:
• Modelling means finding a function (for example a distribution function) that describes the relation between input and output. The function can be as simple as a linear equation or as complex as a quadratic, polynomial, sine or tangent function.
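As a rough illustration of "finding a function that relates input and output", the sketch below fits the simplest case, a straight line y = a*x + b, by ordinary least squares. The helper name `fit_line` and the sample data are assumptions made for this example; they are not part of the lecture.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Spread of the inputs around their mean.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    # Co-variation of inputs and outputs around their means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx          # slope
    b = y_bar - a * x_bar  # intercept
    return a, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

a, b = fit_line(xs, ys)
print(f"slope={a:.3f}, intercept={b:.3f}")
```

With real, noisy data the recovered slope and intercept would only approximate the underlying relation, and a more complex function (quadratic, polynomial, and so on) may be needed.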
• Modelling can therefore be divided into two classes:
• Statistical Modelling
• Algorithmic Modelling

Statistical Modelling:
• Models the underlying data distribution
• Models the underlying relations in the data
• Formulates and tests hypotheses
• Gives statistical guarantees (p-values, goodness-of-fit tests)
• Statistical models are simple, intuitive models suited to low-dimensional data, but they support robust statistical analysis.

Algorithmic Modelling:
• Finds the relation between input and output, i.e. Y = f(x)
• f(x) can be any function. In real-world data the function can be very complex. The ultimate goal is to estimate the function f using data and optimization techniques.
• Complex, flexible models
• Can work with high-dimensional data
• Not suitable for robust statistical analysis
• Focus is on prediction
• Data-hungry models

Myths of Data Science:
Machine does everything. (Let's debunk this myth.)

Collecting Data:
• What to collect? -> Programmer's job
• Where to collect? -> Programmer's job
• How to collect data? -> Programmer's job (by experimenting, etc.)
• Labelling data? -> Programmer's job
• Executing scripts? -> Machine's job (processing long, complex jobs)

Storing Data:
• What schema? -> Programmer designs
• Which file system? -> Programmer decides (but the machine provides system resources such as storage)

Processing Data:
• Domain knowledge is required for wrangling and munging data -> Programmer's job
• What data to clean? Programmer decides
• How to clean? Programmer has to know what to clean, using statistics
• Study and integrate: Programmer's job
• Multiple formats: Programmer decides which format to work with
• The machine helps by executing scripts that process large amounts of data.

Describing Data:
• Which columns? Programmer decides which columns hold usable data
• Which plots? Programmer decides, choosing a human-readable format
• Study trends? Programmer decides which trends to study, using the machine
• The machine executes the scripts over large amounts of data.

Modelling Data:
• Hypothesising, proposing models and overseeing training are all done by the programmer.
• Estimation: the parameters are learnt by the machine, which optimises them using a learning algorithm.
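The split of work in the modelling step can be sketched in a few lines. In this minimal example the programmer proposes the model (a straight line y = w*x + b), the loss (mean squared error) and the training settings; the machine only repeats the update rule to estimate w and b. Gradient descent is used here as one possible learning algorithm. The function name `train_linear`, the learning rate, the step count and the data are all assumptions chosen for illustration.

```python
def train_linear(xs, ys, lr=0.05, steps=2000):
    """Estimate w and b in y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # starting guess chosen by the programmer
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # The machine's part: repeat the same update many times.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 3x - 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x - 1.0 for x in xs]

w, b = train_linear(xs, ys)
print(f"w={w:.4f}, b={b:.4f}")
```

Everything that requires judgement (which model, which loss, how long to train, whether the result makes sense) stays with the programmer; a learning rate that is too large for the data, for instance, would make this loop diverge instead of converge.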