Module 1 introduces data science as a multidisciplinary field focused on extracting insights from data using statistical inference, modeling, and analysis. It discusses the current landscape of data science jobs, the importance of datafication, and the relationship between big data and data science. Key concepts include statistical inference, populations and samples, modeling, and the challenges of overfitting in predictive models.
MODULE 1
Introduction to Data Science
Introduction: What is Data Science? Big Data and Data Science hype, and getting past the hype. Why now? Datafication. Current landscape of perspectives. Skill sets needed. Statistical Inference: populations and samples, statistical modelling, probability distributions, fitting a model.

What is Data Science?
● Data science is the study of data to extract meaningful insights for business.
● It is a multidisciplinary approach that combines principles and practices from the fields of mathematics, statistics, artificial intelligence, and computer engineering to analyze large amounts of data.
● This analysis helps data scientists ask and answer questions such as what happened, why it happened, what will happen, and what can be done with the results.

Big Data and Data Science Hype
● There is a lack of definitions around the most basic terminology. What is "Big Data" anyway? What does "data science" mean? What is the relationship between Big Data and data science? Is data science the science of Big Data?
● Is data science only what goes on in tech companies like Google and Facebook? Why do many people refer to Big Data as crossing disciplines (astronomy, finance, tech, etc.) but to data science as taking place only in tech?
● Just how big is big? Or is "big" just a relative term? These terms are so ambiguous that they are well-nigh meaningless.
● There is a distinct lack of respect for the researchers in academia and industry labs who have been working on these problems for years, and whose work builds on decades (in some cases, centuries) of work by statisticians, computer scientists, mathematicians, engineers, and scientists of all types.
● Statisticians already feel that they are studying and working on the "Science of Data." That is their bread and butter.

Why Now?
● We have massive amounts of data about many aspects of our lives and, simultaneously, an abundance of inexpensive computing power.
● Shopping, communicating, reading news, listening to music, searching for information, expressing our opinions: all of this is being tracked online, as most people know.
● It is not just Internet data, though; it is finance, the medical industry, pharmaceuticals, bioinformatics, social welfare, government, education, retail, and the list goes on.
● Examples include Amazon's recommendation systems, friend recommendations on Facebook, and film and music recommendations.

Datafication
● Datafication is the process of "taking all aspects of life and turning them into data."
● As examples, Google's augmented-reality glasses datafy the gaze, Twitter datafies stray thoughts, and LinkedIn datafies professional networks.
● We are being datafied, or rather our actions are. When we "like" someone or something online, we intend to be datafied, or at least we should expect to be.
● When we walk around in a store, or even on the street, we are being datafied in a completely unintentional way, via sensors, cameras, or Google glasses.
● Once we datafy things, we can transform their purpose and turn the information into new forms of value.

The Current Landscape
➔ Data Science Jobs
• Data scientists are expected to be experts in computer science, statistics, communication, and data visualization, and to have extensive domain expertise.
➔ A Data Science Profile
• Computer science
• Math
• Statistics
• Machine learning
• Domain expertise
• Communication and presentation skills
• Data visualization

Statistical Inference
● Imagine spending 24 hours looking out the window and, for every minute, counting and recording the number of people who pass by.
● This overall process of going from the world to the data, and then from the data back to the world, is the field of statistical inference.
● More precisely, statistical inference is the discipline concerned with developing the procedures, methods, and theorems that allow us to extract meaning and information from data generated by stochastic (random) processes.

Populations and Samples
➔ Population:
• A population can be any set of objects or units, such as tweets, photographs, or stars.
• If we could measure or extract the characteristics of all those objects, we would have a complete set of observations, and the convention is to use N to represent the total number of observations in the population.
➔ Samples:
• When we take a sample, we take a subset of the units, of size n, in order to examine the observations and draw conclusions and make inferences about the population.
• There are different ways you might go about getting this subset of data, and you want to be aware of the sampling mechanism.

What is a model?
● Humans try to understand the world around them by representing it in different ways.
● Architects capture attributes of buildings through blueprints and three-dimensional, scaled-down versions.
● Molecular biologists capture protein structure with three-dimensional visualizations of the connections between amino acids.
● Statisticians and data scientists capture the uncertainty and randomness of data-generating processes with mathematical functions that express the shape and structure of the data itself.
● A model is our attempt to understand and represent the nature of reality through a particular lens, be it architectural, biological, or mathematical.

Statistical modeling
● Before you get too involved with the data and start coding, it is useful to draw a picture of what you think the underlying process might be with your model. What comes first? What influences what? What causes what? What is a test of that?
● Different people think in different ways. Some prefer to express these kinds of relationships in terms of math.
● The mathematical expressions will be general enough that they have to include parameters, but the values of these parameters are not yet known.
● In mathematical expressions, the convention is to use Greek letters for parameters and Latin letters for data. So, for example, if you have two columns of data, x and y, and you think there is a linear relationship, you would write down y = β0 + β1x. You do not yet know the actual values of β0 and β1, so they are the parameters.
● Other people prefer pictures and will first draw a diagram of data flow, possibly with arrows, showing how things affect other things or what happens over time. This gives them an abstract picture of the relationships before choosing equations to express them.
● One place to start is exploratory data analysis (EDA), covered in a later section. This entails making plots and building intuition for your particular dataset. EDA helps a lot, as do trial and error and iteration.

Probability distributions
● The classical example is the height of humans, which follows a normal distribution: a bell-shaped curve, also called a Gaussian distribution, named after Gauss.

Fitting a model
● Fitting a model means estimating the parameters of the model using the observed data. You are using your data as evidence to help approximate the real-world mathematical process that generated the data.
● Fitting the model often involves optimization methods and algorithms, such as maximum likelihood estimation, to estimate the parameters.
● Fitting the model is when you start actually coding: your code reads in the data, and you specify the functional form that you wrote down on paper.
● Then R or Python uses built-in optimization methods to give you the most likely values of the parameters given the data.
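The fitting step described above can be sketched in a few lines of Python. This is a minimal illustration, not from the source: it generates synthetic data from an assumed linear process with known parameters (β0 = 2, β1 = 3, chosen only for the demo) and recovers them with NumPy's least-squares polynomial fit.

```python
import numpy as np

# Synthetic data from a known linear process: y = 2 + 3x + Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=100)

# "Fitting the model": estimate the parameters β0 and β1 by least squares.
# np.polyfit returns coefficients from highest degree down: [β1, β0].
beta1, beta0 = np.polyfit(x, y, deg=1)

print(f"estimated intercept β0 ≈ {beta0:.2f}, slope β1 ≈ {beta1:.2f}")
```

The estimates should land close to the true values of 2 and 3, which illustrates the point above: with Gaussian noise, least squares coincides with maximum likelihood estimation.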
Overfitting
● Overfitting means that you used a dataset to estimate the parameters of your model, but the model is not good at capturing reality beyond your sampled data.
● You might discover this by using the model to predict labels for another set of data that you did not use to fit it, and finding that it does a poor job, as measured by an evaluation metric such as accuracy.
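Overfitting can be seen directly by evaluating a model on data it was not fit on, as described above. The sketch below (not from the source; the sine-shaped process, noise level, and polynomial degrees are arbitrary illustrative choices) fits a modest and a very flexible polynomial to a small training set and compares their errors on a held-out test set.

```python
import numpy as np

# Small training set and larger held-out test set from the same noisy process.
rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(0, 1, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=n)
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)

def held_out_mse(degree):
    # Fit a polynomial of the given degree to the training data,
    # then measure mean squared error on the unseen test data.
    coeffs = np.polyfit(x_train, y_train, degree)
    pred = np.polyval(coeffs, x_test)
    return float(np.mean((pred - y_test) ** 2))

mse_simple = held_out_mse(3)     # modest model
mse_flexible = held_out_mse(12)  # nearly interpolates the 15 training points

print(f"test MSE, degree 3:  {mse_simple:.3f}")
print(f"test MSE, degree 12: {mse_flexible:.3f}")
```

The degree-12 polynomial hugs the 15 training points almost exactly, yet does worse on the held-out data than the simpler fit: it has captured the noise in the sample, not the process that generated it.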