Datascience (Mod1)
Management.
Chapter 1: Introduction
Example: Data science can be used in health tech to predict patient outcomes, in retail to optimise
inventory, or in finance to identify fraudulent activities.
Example: Companies like Netflix and Amazon use big data to recommend movies and products,
but the actual value lies in how this data is analyzed and interpreted, not just the size of the data.
Example: GPS data from smartphones, social media activity, and customer transactions all provide
a wealth of data that can now be analyzed for trends and patterns.
The Current Landscape (with a Little History)
Data science has evolved significantly over time. In the early days, statistics and computational
tools were more isolated, and data analysis was mostly done by statisticians. Today, with the
explosion of data, there’s a greater emphasis on automation, machine learning, and real-time data
processing.
Data science roles can vary widely but often include positions like Data Analyst, Data Scientist,
Machine Learning Engineer, and Data Engineer. These roles require a mix of skills in programming
(Python, R), statistics, and machine learning, as well as domain expertise to effectively apply
methods to solve real-world problems.
A data scientist should be curious, analytical, and comfortable with ambiguity. They should possess
skills in programming, statistics, and communication, as well as an understanding of the business
problem they are solving. A combination of technical expertise and critical thinking is key.
In the age of big data, statistical thinking has become increasingly important. With large amounts of
data, it’s essential to think critically about how to sample the data, test hypotheses, and interpret
results. Traditional statistical methods may not always apply when dealing with big data, and the
complexity of models can sometimes lead to overfitting.
Statistical Inference
Statistical inference is the process of drawing conclusions or making predictions about a population
from a sample of data. It uses techniques such as hypothesis testing, confidence intervals, and
p-values to quantify how confident we can be about the larger group from which the sample is drawn.
Example: If a company wants to know whether a new marketing strategy increases sales, they
might sample data from a small group of customers and use statistical inference to estimate the
effect on the larger population of customers.
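The marketing example above can be sketched with a permutation test, one simple form of hypothesis
testing. This is a minimal illustration, not a prescribed method: the sales figures are made up, and
the function name permutation_test and its parameters are assumptions introduced here.

```python
import random
from statistics import mean

def permutation_test(control, treated, n_permutations=10_000, seed=0):
    """One-sided permutation test: is mean(treated) larger than mean(control)?

    Returns the observed difference in means and an approximate p-value.
    """
    rng = random.Random(seed)
    observed = mean(treated) - mean(control)
    pooled = list(control) + list(treated)
    n_treated = len(treated)
    at_least_as_large = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable,
        # so shuffle the pooled values and re-split them at random.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_treated]) - mean(pooled[n_treated:])
        if diff >= observed:
            at_least_as_large += 1
    # The add-one correction keeps the estimated p-value strictly above zero.
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value

# Hypothetical weekly sales per customer (invented numbers for illustration).
old_strategy = [20, 22, 19, 24, 21, 23, 20, 22]
new_strategy = [25, 27, 24, 29, 26, 28, 25, 27]

diff, p = permutation_test(old_strategy, new_strategy)
print(f"observed difference in mean sales: {diff:.2f}")
print(f"approximate one-sided p-value: {p:.4f}")
```

A small p-value suggests that a difference this large would rarely arise by chance if the new
strategy had no effect. It does not by itself show that the strategy caused the increase.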
A population is the entire set of data you want to learn about, while a sample is a subset of that
data. In data science, we often work with samples due to the impracticality of studying entire
populations. Proper sampling techniques are critical to ensure the sample is representative of the
population.
Example: If you're studying the income levels of all employees in a company, you might randomly
sample 100 employees and use their incomes to estimate income levels across the entire company,
provided the sample is representative.
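As a rough sketch of the sampling idea, the snippet below draws a simple random sample of 100
incomes from a synthetic population and estimates the population mean with an approximate 95%
confidence interval. The population size, income distribution, and variable names are all
assumptions made for illustration.

```python
import math
import random
from statistics import mean, stdev

rng = random.Random(42)

# Hypothetical population: annual incomes of 2,000 employees (synthetic data).
population = [rng.gauss(60_000, 12_000) for _ in range(2_000)]

# Simple random sample without replacement: every employee is equally likely
# to be selected, which is what makes the sample representative on average.
sample = rng.sample(population, k=100)

sample_mean = mean(sample)
# Standard error of the mean (ignoring the finite-population correction).
standard_error = stdev(sample) / math.sqrt(len(sample))
# Approximate 95% confidence interval using the normal critical value 1.96.
ci_low = sample_mean - 1.96 * standard_error
ci_high = sample_mean + 1.96 * standard_error

print(f"sample mean income: {sample_mean:,.0f}")
print(f"approximate 95% CI: ({ci_low:,.0f}, {ci_high:,.0f})")
print(f"population mean (normally unknown): {mean(population):,.0f}")
```

In practice the population mean is not available for comparison. It is printed here only because the
population is simulated.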
In big data, the concept of population and sample can become blurred because datasets may be large
enough to encompass entire populations. However, the challenge remains in selecting the right data
and not falling into the trap of overfitting the model to the entire dataset.
Modeling
Modeling is at the heart of data science. It involves creating mathematical representations of
relationships within the data to make predictions or discover patterns. There are two main types of
models:
1. Predictive models (e.g., regression, classification) that aim to predict future outcomes.
2. Descriptive models (e.g., clustering, association rules) that aim to discover patterns or
groupings within the data.
Example: A predictive model could be used to predict house prices based on factors like square
footage, location, and number of bedrooms, while a descriptive model could identify customer
segments based on purchasing behaviour.
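The two model types can be contrasted in a short sketch using only the Python standard library. The
predictive part fits a one-variable least-squares line for house price against square footage. The
descriptive part splits customers into two spending segments with a tiny one-dimensional k-means.
All numbers are invented, and the helper names fit_line and two_means are assumptions introduced for
this illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return y_bar - slope * x_bar, slope

def two_means(values, iterations=20):
    """1-D k-means with k=2, starting from the smallest and largest values.

    Assumes the values are not all identical, so neither group is empty.
    """
    low, high = min(values), max(values)
    for _ in range(iterations):
        # Assign each value to the nearer of the two current centres.
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        # Move each centre to the mean of its group.
        low, high = mean(group_low), mean(group_high)
    return group_low, group_high

# Predictive model: hypothetical square footage and sale prices.
square_feet = [800, 1000, 1200, 1500, 1800]
prices = [165_000, 198_000, 245_000, 296_000, 362_000]
intercept, slope = fit_line(square_feet, prices)
predicted = intercept + slope * 1300
print(f"predicted price for 1300 sq ft: {predicted:,.0f}")

# Descriptive model: hypothetical monthly spend per customer.
monthly_spend = [12, 15, 18, 20, 95, 110, 120, 130]
low_spenders, high_spenders = two_means(monthly_spend)
print("low-spend segment:", low_spenders)
print("high-spend segment:", high_spenders)
```

The regression answers a "what will happen" question for a new house, while the clustering only
describes structure already present in the data. A realistic version would use several features
(location, number of bedrooms) and a dedicated library.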
A data scientist’s role is to guide the entire data science process: understanding the problem,
collecting and cleaning the data, performing exploratory analysis, building models, and finally
communicating the results. Data scientists bridge the gap between technical teams and
decision-makers, ensuring the data is used effectively.