
CS250: Python for Data Science

This course includes the following units:

 Unit 1: What is Data Science?


 Unit 2: Python for Data Science
 Unit 3: The numpy Module
 Unit 4: Applied Statistics in Python
 Unit 5: The pandas Module
 Unit 6: Visualization
 Unit 7: Data Mining I – Supervised Learning
 Unit 8: Data Mining II – Clustering Techniques
 Unit 9: Data Mining III - Statistical Modeling
 Unit 10: Time Series Analysis

Upon successful completion of this course, you will be able to:

 use Google Colaboratory notebooks to implement and test Python programs;
 explain how Python programming is relevant to data science;
 construct and operate on arrays using the numpy module;
 apply Python modules for basic statistical computation;
 construct and operate on dataframes using the pandas module;
 apply the pandas module to interact with spreadsheet software;
 implement Python scripts for visualization using arrays and dataframes;
 apply the scikit-learn module to perform data mining;
 explain techniques for supervised and unsupervised learning;
 apply supervised learning techniques;
 apply unsupervised learning techniques;
 apply the scikit-learn module to build statistical models;
 implement Python scripts to perform regression analyses;
 apply the statsmodels module to build and analyze models for time
series analysis; and
 explain similarities and differences between AR, MA, and ARIMA
models.

1- History:
Data science is a discipline that incorporates varying degrees of Data
Engineering, Scientific Method, Math, Statistics, Advanced Computing,
Visualization, Hacker mindset, and Domain Expertise. A practitioner of Data
Science is called a Data Scientist. Data Scientists solve complex data analysis
problems.
Origins
The term "Data Science" was coined at the beginning of the 21st Century. It is
attributed to William S. Cleveland who, in 2001, wrote "Data Science: An
Action Plan for Expanding the Technical Areas of the Field of Statistics".

Development:
During the dot-com bubble (1998-2000), hard drives became inexpensive, leading
corporations and governments to buy many. As per a corollary of Parkinson's Law, data
expands to fill available disk space, creating a cycle of buying more disks and accumulating
more data, resulting in big data. Big data is vast and complex, requiring special management
tools. Companies like Google, Yahoo!, and Amazon developed cloud computing to handle
this, with MapReduce and Hadoop being key innovations. Hadoop's complexity led to the
creation of mass analytic tools with simpler interfaces, like recommender systems and
machine learning, requiring specialized knowledge. This specialization gave rise to data
scientists who analyze big data for new insights. Data science, ideally done in teams, tackles
large-scale problems that single individuals cannot manage alone. In summary: cheap disks →
big data → cloud computing → mass analytic tools → data scientists → data science teams
→ new analytic insights

(The "dot-com" bubble of 1998-2000 was a period of excessive speculation and investment in internet-
based companies, fueled by the rapid growth and adoption of the internet. Many investors poured
money into startups with ".com" in their names, leading to a surge in stock prices. However, many of
these companies had unsustainable business models and eventually failed. The bubble burst in 2000,
leading to a significant stock market crash and substantial financial losses for investors.)

(Parkinson's Law is an adage that states, "Work expands to fill the time available for its completion."
This means that if you allocate more time to a task, it will take longer to complete, often due to
procrastination, inefficient work habits, or unnecessary complexities.

A corollary of Parkinson's Law applies to data storage: "Data expands to fill the available disk space."
This means that as more storage becomes available, the amount of data stored increases accordingly,
often leading to more data accumulation than initially expected or necessary. The law highlights how
resources, whether time or storage space, tend to get fully utilized, often leading to inefficiency.)
Data Engineering:
Data Engineering is a key component of data science that involves acquiring,
ingesting, transforming, storing, and retrieving data, often accompanied by
adding metadata. A data engineer must manage these interconnected tasks as a
whole, understanding how data storage and retrieval impact ingestion and
processing.
Key Processes in Data Engineering:

1. Acquiring: Identifying data sources and obtaining data, which can come from
various places and in different formats, such as text, images, or sensor data.
2. Ingesting: Moving data into computer systems for analysis, considering data
volume, speed, and storage capacity.
3. Transforming: Converting raw data into a usable format for analysis, often
from CSV to structured formats like spreadsheets.
4. Metadata: Adding data about data, such as collection time, location, and other
relevant information, to enhance understanding and usability.
5. Storing: Choosing the appropriate storage system, like file systems for speed
or databases for functionality, based on data and analysis needs.
6. Retrieving: Extracting and querying data for analysis and visualization,
ensuring storage strategies align with retrieval requirements.
Example: For highway data, sensors might collect speed data in CSV format. This
data is ingested, transformed into a structured format, metadata is added, stored
in a database, and retrieved for analysis, such as calculating average speeds
during rush hours.
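To make this pipeline concrete, here is a minimal Python/pandas sketch of the highway example. The file name, column names, and the choice of SQLite for storage are assumptions made purely for illustration, not part of the course material.

# A minimal sketch of the highway-speed example using pandas.
# The file name and column names (sensor_id, timestamp, speed_mph) are
# hypothetical; real sensor feeds would differ.
import sqlite3
import pandas as pd

# Ingest: read raw CSV data produced by roadside sensors
df = pd.read_csv("highway_speeds.csv", parse_dates=["timestamp"])

# Transform: keep plausible readings and derive an hour-of-day column
df = df[df["speed_mph"].between(0, 120)]
df["hour"] = df["timestamp"].dt.hour

# Metadata: record dataset-level information (attrs is a simple in-memory store)
df.attrs["source"] = "I-80 roadside sensors"
df.attrs["collected"] = "2024-06"

# Store: write the cleaned table to a lightweight SQLite database
with sqlite3.connect("highway.db") as conn:
    df.to_sql("speeds", conn, if_exists="replace", index=False)

# Retrieve/analyze: average speed during the morning rush hour (7-9 am)
rush = df[df["hour"].between(7, 9)]
print(rush.groupby("sensor_id")["speed_mph"].mean())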

The Scientific Method


The Scientific Method is the scientific foundation of data science, involving the
acquisition of new knowledge through reasoning and empirical evidence from
testing hypotheses via repeatable experiments.
Key Elements of the Scientific Method:
1. Reasoning Principles:
- Inductive Reasoning: Deriving general principles from specific observations.
- Deductive Reasoning: Drawing specific conclusions from general principles.
- Example: "All known life depends on liquid water" (inductive) and "Socrates is
mortal" (deductive).

2. Empirical Evidence:
- Data obtained from observation or experiment, as opposed to logical
arguments or myths.
- Example: Galileo's telescope observations supporting Copernicus's
heliocentric theory versus Aristotle's geocentric model.

3. Hypothesis Testing:
- Involves two propositions: the null hypothesis (current understanding) and the
alternative hypothesis (new proposition).
- Example: In a trial, "the defendant is not guilty" (null hypothesis) and "the
defendant is guilty" (alternative hypothesis).

4. Repeatable Experiments:
- Methodical procedures that verify, falsify, or establish the validity of a
hypothesis, relying on repeatable methods and logical analysis.
- Example: Galileo's inclined plane experiment disproving Aristotle's theory of
falling bodies.
Role in Data Science:
Data scientists use the Scientific Method to critically evaluate evidence,
understand reasoning behind conclusions, test hypotheses, and ensure
experiments can be replicated to validate results.
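As a rough illustration of hypothesis testing in Python, the sketch below runs a two-sample t-test with scipy. The measurements are invented, and scipy is just one common choice; the point is the null hypothesis ("the two group means are equal") versus the alternative ("they differ").

# Hypothesis testing illustration (invented data, not from the course text).
from scipy import stats

group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
group_b = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2]

# Null hypothesis: equal means; alternative hypothesis: the means differ
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# A small p-value (e.g., < 0.05) is evidence against the null hypothesis
if p_value < 0.05:
    print("Reject the null hypothesis: the group means differ.")
else:
    print("Fail to reject the null hypothesis.")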

Math:
Mathematics, alongside statistics, forms the intellectual core of data science,
focusing on the study of quantity, structure, space, and change, especially when
applied to practical problems.

Key Elements of Mathematics in Data Science:

1. Quantity:
- Numbers: Representing data with various types of numbers (integers, fractions, real numbers, complex numbers).
- Example: Measuring highway lengths in miles (integers) and using arithmetic to analyze these quantities.

2. Structure:
- Internal Structure: Identifying and analyzing the internal structure of data through equations and relationships.
- Example: Understanding the structure of speed limits or lane widths on highways using algebra.

3. Space:
- Spatial Components: Investigating and representing the spatial aspects of data in two- or three-dimensional space.
- Example: Mapping highway segments' locations using latitude and longitude or analyzing the smoothness of highway surfaces with geometry and trigonometry.

4. Change:
- Dynamic Relationships: Describing how relationships between data points change over time or distance.
- Example: Studying how the sharpness of curves in a highway changes with speed limits or how asphalt depth affects traffic flow using calculus.

Role in Data Science:

Data scientists use mathematics to quantify and analyze data, understand its
structure, represent spatial relationships, and describe changes over time or
distance, enabling them to solve complex practical problems.

Statistics:
Statistics, together with mathematics, forms the intellectual foundation of data
science. It involves the collection, organization, analysis, and interpretation of
data to discover patterns, create models, and make future predictions.

Key Elements of Statistics in Data Science:

1. Collection:
- Designing Research: Creating research and experimental designs to ensure
data is collected in a way that allows valid conclusions.
- Example: Working with data engineers to develop procedures for data
generation.
2. Organization:
- Coding and Archiving Data: Ensuring data is coded, archived, and documented
appropriately for analysis and sharing.
- Example: Creating a data dictionary to specify variables, valid values, and
data formats, which data engineers use to develop a database schema.
3. Analysis:
- Summarizing and Modeling: Using descriptive and inferential statistics to
summarize data, test hypotheses, and create models.
- Example: Analyzing data to determine if there are significant differences
between groups or to identify correlations.
4. Interpretation:
- Reporting Results: Collaborating with subject matter experts and visual artists
to present data in comprehensible ways.
- Example: Creating tables and graphs to report results to stakeholders in an
understandable manner.

Role in Data Science:


Statisticians in data science ensure data is collected, organized, analyzed, and
interpreted correctly, enabling the discovery of insights, patterns, and
relationships that inform decision-making and predictions.

Advanced computing:
Advanced computing is the heavy lifting of data science, encompassing the
design, coding, testing, debugging, and maintenance of software to perform
specific operations.

Key Elements of Advanced Computing in Data Science:

1. Software Design:
o Process: Transforming software purpose and specifications into a
detailed plan, including components and algorithms.
o Example: Using modeling languages like UML to create software
designs, which programmers implement by writing source code.

2. Programming Language:
o Definition: Artificial languages designed to communicate
instructions to computers, controlling their behavior and external
devices.
o Example: Choosing between low-level languages (e.g., assembly)
and high-level languages (e.g., Java, Python, C++) to solve specific
problems.

3. Source Code:
o Definition: Collections of computer instructions written in human-
readable languages, translated into machine code for execution.
o Example: Using IDEs to type, debug, and execute source code, such
as the traditional "Hello World" program in Java and Python.

Role in Data Science:


Programmers in data science create, optimize, and maintain software that
processes data, leveraging their expertise in programming languages and
software design to solve complex computational problems efficiently.

Visualization:
Visualization is the "pretty face" of data science, focusing on the visual
representation of abstract data to enhance human understanding and cognition.

Key Elements of Visualization in Data Science:

1. Creative Process:
- Definition: Creating something original and worthwhile through divergent
thinking, conceptual blending, and honing.
- Role: Visual artists in data science explore multiple ways to present data and
refine visualizations through iterations.

2. Data Abstraction:
- Definition: Handling data meaningfully by visualizing manipulations like
aggregations, summarizations, correlations, and predictions, rather than raw
data.
- Role: Simplifying data content to make visualizations meaningful in the
context of the problem being addressed.

3. Informationally Interesting:
- Definition: Creating visuals that are not only informative but also aesthetically
pleasing and engaging, often incorporating elements of beauty such as symmetry
and harmony, with touches of surprise.
- Role: Making visualizations attractive to capture and retain human attention,
enhancing the communication of data insights.

Example:
A partial map of the Internet from early 2005 demonstrates effective
visualization. Each line represents connections between two IP addresses,
abstracting a subset of internet data. Through numerous iterations, a harmonious
color scheme and overall symmetry with surprising details (bright "stars") were
achieved, making the map both informative and visually engaging in the context
of understanding the World Wide Web.
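A minimal matplotlib sketch of this idea: plotting an aggregation (invented monthly averages) rather than raw data, with a title and axis labels chosen to make the abstraction readable.

# Visualizing an aggregation rather than raw data.
# The monthly averages below are invented purely for illustration.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
avg_speed = [61.2, 60.8, 63.5, 64.1, 62.9, 65.0]  # aggregated values, not raw readings

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(months, avg_speed, marker="o", color="steelblue")
ax.set_title("Average highway speed by month (illustrative data)")
ax.set_xlabel("Month")
ax.set_ylabel("Speed (mph)")
plt.tight_layout()
plt.show()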

The hacker mindset:


The hacker mindset is the "secret sauce" of data science, emphasizing creativity,
boldness, and persistence in solving data problems. According to Wikipedia,
hacking involves modifying, building, or creating software and hardware to
enhance performance or add new features. For data scientists, this mindset
extends to inventing new models, exploring new data structures, and creatively
combining multiple disciplines.

Key aspects of the hacker mindset include:

 Innovation: Creating and modifying systems and tools to improve functionality and solve unique problems.
 DIY Approach: Embracing a do-it-yourself attitude to develop
unconventional solutions.
 Collaboration: Working in hacker-like spaces to share ideas and develop
new analytic solutions collectively.
 Examples:
o Steve Wozniak's Apple I: A hand-made computer built from
surplus parts, leading to the formation of Apple Inc.
o Carnegie Mellon Internet Coke Machine: An internet-connected
vending machine that allowed students to check the temperature of
sodas remotely.
The hacker part of a data scientist asks, "Do we need to modify our tools or
create something new to solve our problem?" and "How can we combine different
disciplines to reach an insightful conclusion?"

Domain Expertise:
Domain Expertise is the glue that holds data science together. It involves having
proficiency and special knowledge in a particular area, known as subject matter
expertise (SME). Any field, such as medicine, politics, sciences, marketing,
information security, demographics, and literature, can be subject to data
science inquiry. A successful data science team must include at least one domain
expert.

Key points about domain expertise:

 Importance of Problem Identification: Knowing which problems are significant to solve.
 Defining Sufficient Answers: Understanding what constitutes an
adequate solution.
 Customer Insight: Knowing what information customers want and how to
present it effectively.
 Examples:
o Geographic Distribution of Soft Drink Terms: Edwin Chen at
Twitter visualized terms like "soda," "pop," and "coke" used in
different US regions. Understanding why these linguistic differences
exist requires insights from sociology, linguistics, history, and
anthropology.
o Nate Silver's Political Analysis: As a statistician and expert in US
politics, Nate Silver combines data with explanatory insights to
provide meaningful analyses, such as in his blog post "How
Romney’s Pick of a Running Mate Could Sway the Outcome."

The domain expert in data science asks, "What is important about the problem
we are solving?" and "What exactly should our customers know about our
findings?"

Assignment/Exercise Summary
Objective: Familiarize yourself with the R programming environment.

Steps:

1. Form Study Groups: Get into groups of 3 to 4 students.


2. Collaborative Learning: Work together in study sessions, explaining concepts to each other
and helping each other understand the material.
3. Google's R Style Guide:
o Print and read over the guide.
o Keep it for future reference as it will make more sense over time.

https://web.stanford.edu/class/cs109l/unrestricted/resources/google-style.html
4. Online Resources:
o Search for "introduction to R," "R tutorial," "R basics," and "list of R commands."
o Choose 4-5 websites and work through the first few examples on each site.
o Switch to another site if the current one becomes too confusing.

https://www.w3schools.com/r/default.asp

https://www.tutorialspoint.com/r/index.htm

https://www.codecademy.com/learn/learn-r

https://www.programiz.com/r

5. R Commands:
o Try the following commands in R:

library(help="utils")
library(help="stats")
library(help="datasets")
library(help="graphics")
demo()
demo(graphics)
demo(persp)

6. Short Program:
o Write a short program (5-7 lines) that executes without errors.
o Include the names of all contributors in the comment section.

7. Documentation:
o List the websites used, indicating which was the most helpful.
o List the top 10 unanswered questions the team has at the end of the study session.
The Impact of Data Science:
This chapter highlights the revolutionary impact of data science on different
sectors such as baseball, health, and robotics.

Moneyball

 Background: "Moneyball" is a book by Michael Lewis (2003) and a film


(2011) about the Oakland Athletics' baseball team and its general
manager, Billy Beane.
 Concept: The team used a sabermetric approach, relying on rigorous
statistical analysis to evaluate players, focusing on on-base and slugging
percentages over traditional metrics.
 Impact: Despite a limited budget, the Oakland A's competed successfully
with richer teams. This approach challenged traditional baseball wisdom
and led other MLB teams to adopt similar strategies.
 Themes: Insider vs. outsider dynamics, information democratization, and
the drive for efficiency in capitalism.
23 and Me

 Background: 23andMe is a personal genomics and biotechnology company that provides genetic testing and analysis.
 Services: Customers submit a saliva sample for DNA analysis, receiving
information on traits, genealogy, and health risks.
 Impact: The company has a significant database, aiding in research
initiatives and correlations between genetics and personal/social
behaviors. It has contributed to advances in understanding genetic
predispositions to diseases like Parkinson's.

Google's Driverless Car

 Background: Google's project, led by engineer Sebastian Thrun, aims to develop autonomous vehicles.
 Technologies: The system integrates data from Google Street View, AI
software, video cameras, LIDAR, radar, and position sensors.
 Achievements: Autonomous vehicles have driven thousands of miles with
minimal human intervention. This project has influenced legislation in
states like Nevada, Florida, and California.
 Challenges: The rapid advancement of technology outpaces existing laws,
necessitating new regulations for autonomous vehicles.

Assignment/Exercise

 Task: In groups, watch "Moneyball" and take notes on the impact of data
science in the film.
 Brainstorm: Discuss other areas where data science could be impactful
and consider potential counter-arguments.
 Presentation: Create a 4-slide presentation covering:
1. Chosen area of life.
2. How data science would make a difference.
3. Counter-arguments.
4. Group's conclusion on the viability of data science in that area.
Section 1

00:00:02

The speaker in the video is introducing the concept of data science for
beginners, emphasizing the importance of gathering data, which includes
both numerical and categorical information. They explain the distinction
between numbers (amounts or counts) and names (categorical variables),
highlighting that even slight changes in numbers may still be close to the
original value, while changing a name slightly can result in a completely
different entity. The speaker also discusses the complexity of data that blurs
the line between numbers and names, such as phone numbers and zip codes.
They mention the significance of identification numbers and the ability to
convert names with order into numbers for machine learning algorithms. The
speaker encourages viewers to explore tools for data collection and analysis,
referencing the Cortana analytics process. They stress the importance of
asking precise questions that can be answered with specific data, ensuring
that the target information is included in the dataset. If the target is not
present, they advise obtaining more data. Additionally, they explain the
process of organizing data into a table with one target value per row for
analysis.
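A small, hypothetical pandas sketch of the idea of converting names with a natural order into numbers for a machine learning algorithm, as the video describes; the column and category names are invented.

# Converting an ordered categorical ("name") column into numbers.
import pandas as pd

df = pd.DataFrame({"shirt_size": ["small", "large", "medium", "small", "large"]})

# Declare the natural order, then map each name to an integer code
sizes = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
df["shirt_size"] = df["shirt_size"].astype(sizes)
df["size_code"] = df["shirt_size"].cat.codes  # small=0, medium=1, large=2

print(df)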

Section 2

00:07:15

In this part of the video, the speaker discusses the process of organizing data
to have one instance of the target variable for each row. They explain how
data that doesn't naturally occur once per day, such as total users or
quantities that remain constant for a period, needs to be aggregated or
distributed to align with the rows. The speaker also mentions the importance
of computing values like days since a specific event, gathering external data,
estimating missing information, and checking data quality. They provide an
example of cleaning up data related to superheroes and super villains,
ensuring that all values are formatted consistently for machine learning
algorithms to interpret correctly. The speaker emphasizes the need to
thoroughly review and understand each column of the data to ensure its
accuracy and quality.
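A brief, hypothetical pandas sketch of that aggregation step: collapsing an event log into one row per day so there is one instance of the target per row. The data are invented.

# Reshaping data so there is one row per target value (here, one row per day).
import pandas as pd

events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 17:30",
        "2024-01-02 08:15", "2024-01-02 12:00", "2024-01-02 19:45",
    ]),
    "users": [120, 95, 130, 80, 110],
})

# Aggregate to one row per day: total users seen that day
daily = events.set_index("timestamp").resample("D")["users"].sum()
print(daily)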

Section 3

00:14:09

The speaker discusses the process of cleaning and interpreting data for
machine learning algorithms. They mention how they clean up data columns
to ensure uniform representation, such as identifying secret identities as yes
or no and categorizing an individual's ability to fly based on numerical values.
The speaker emphasizes the importance of unifying data standards for
effective interpretation by machine learning algorithms. They also touch on
feature engineering, which involves manipulating existing features to improve
predictive capabilities. An example is given where combining departure and
arrival times of a subway train helps predict the maximum speed reached
between stops. The speaker highlights the significance of data interaction in
enhancing predictive models and the concept of coefficient of determination
to evaluate model performance.

Section 4

00:21:40

The speaker discusses the process of feature engineering in machine learning, where they create a new feature by subtracting two existing
features to improve the predictive power of their model. They explain that
transforming features can help algorithms extract the necessary information
from the data. The video also touches on different ways of feature
engineering, including data-specific and domain-specific techniques. The
speaker briefly mentions deep learning and its ability to automatically learn
features from data. They emphasize the importance of asking sharp
questions, ensuring data quality, and using machine learning to answer
specific types of questions, such as regression for "how much" or "how many"
questions, classification for assigning categories, and grouping data into
similar clusters. Lastly, the speaker mentions the importance of identifying
anomalies or unusual patterns in the data.
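A hedged sketch of the subway example in pandas: subtracting the departure time from the arrival time yields a new travel-time feature, which can then be related to distance. The column names and values are invented.

# Feature engineering: combine two existing features into a more informative one.
import pandas as pd

trips = pd.DataFrame({
    "departure": pd.to_datetime(["2024-01-01 08:00:00", "2024-01-01 08:10:00"]),
    "arrival":   pd.to_datetime(["2024-01-01 08:04:30", "2024-01-01 08:13:45"]),
    "distance_km": [2.1, 1.6],
})

# New feature: travel time in seconds between stops
trips["travel_seconds"] = (trips["arrival"] - trips["departure"]).dt.total_seconds()

# A model could relate distance and travel time to estimate speed between stops
trips["avg_speed_kmh"] = trips["distance_km"] / (trips["travel_seconds"] / 3600)
print(trips[["travel_seconds", "avg_speed_kmh"]])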

Section 5

00:29:01

The video discusses the importance of analyzing unusual patterns in credit card transactions to detect potential issues. It also explains how
reinforcement learning can help machines make decisions in low-consequence
scenarios. The example of predicting the cost of a diamond based on its
weight is used to illustrate the concept of creating a model through linear
regression. The model simplifies the data by fitting a line through the data
points to make predictions.
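A minimal scikit-learn sketch of this idea: fitting a line through invented (weight, price) points, predicting from it, and reporting the coefficient of determination mentioned earlier. The numbers are illustrative only.

# Linear regression on invented diamond data (weight in carats vs. price).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

weights = np.array([[0.5], [0.8], [1.0], [1.2], [1.5], [2.0]])  # carats
prices = np.array([1500, 2900, 4100, 5300, 7200, 10400])        # dollars (made up)

model = LinearRegression().fit(weights, prices)
predicted = model.predict(weights)

print("Predicted price of a 1.35 carat diamond:", model.predict([[1.35]])[0])
print("Coefficient of determination (R^2):", r2_score(prices, predicted))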

Section 6

00:36:23

The video segment discusses the process of creating a linear regression model without using math or computer programs to estimate the cost of a 1.35 carat diamond. It explains the concept of a confidence interval and how
it helps in making more accurate predictions. The video emphasizes the
importance of having enough data to make informed decisions and how
interpreting the data can lead to making better choices. It also highlights the
importance of using the model or analysis results in practical ways, such as
creating a web service, making decisions, setting prices, publishing code,
writing reports, or building dashboards. Additionally, it mentions the need to
be cautious of potential gaps in using machine learning algorithms, as they
assume the world doesn't change, and if it does, the data may become
invalid.

Section 7

00:43:20

The speaker discusses three key gaps in machine learning. The first gap
highlights the importance of ensuring that data remains relevant in a
changing world, using the example of the impact of the September 11th
attacks on predictions made just before the event. The second gap
emphasizes the challenge of collecting sufficient data for certain complex
phenomena, such as global climate change. The third gap points out that
machine learning cannot determine causation, using examples like the
correlation between cheese consumption and deaths by bedsheet
entanglement. The speaker concludes by highlighting the role of human
insight and judgment in filling these gaps and making intuitive leaps in data
analysis.

WHAT IS CORTANA ANALYTICS PROCESS?

The Cortana Analytics Process (CAP) is a comprehensive framework provided by Microsoft for developing and deploying advanced analytics solutions. It leverages various Microsoft
technologies and services to create data-driven applications and insights. The process is
designed to help organizations turn data into intelligent action. Here are the key components
and steps involved in the Cortana Analytics Process:

1. Information Management

This stage focuses on collecting, storing, and managing data from various sources. It includes:

 Data Ingestion: Collecting data from various sources, such as databases, IoT devices, and
external APIs.
 Data Storage: Storing the ingested data in scalable and reliable storage solutions like Azure
Data Lake, SQL Database, or Blob Storage.
 Data Preparation: Cleaning, transforming, and organizing data for analysis using tools like
Azure Data Factory or Azure Databricks.

2. Big Data Stores

This component involves storing and processing large volumes of data. Key technologies
include:

 Azure Data Lake: A hyper-scale repository for big data analytics workloads.
 Azure SQL Data Warehouse: A fully managed, petabyte-scale data warehouse service.
 Azure Cosmos DB: A globally distributed, multi-model database service.
3. Machine Learning and Analytics

In this stage, advanced analytics and machine learning models are developed and applied to
the data. It includes:

 Azure Machine Learning: A service for building, training, and deploying machine learning
models.
 Azure HDInsight: A fully managed, full-spectrum, open-source analytics service for
enterprises.
 Azure Databricks: An Apache Spark-based analytics platform optimized for Azure.

4. Dashboards and Visualization

This involves creating interactive dashboards and visualizations to present insights derived
from the data. Tools include:

 Power BI: A suite of business analytics tools to analyze data and share insights.
 Azure Synapse Analytics: Integrates big data and data warehousing to offer dashboards and
interactive reports.

5. Intelligence and Insights

This stage focuses on deriving actionable insights and embedding intelligence into
applications. It includes:

 Azure Cognitive Services: A collection of APIs for adding cognitive features like vision,
speech, and language understanding to applications.
 Cortana Intelligence Suite: Integrates various analytics services to deliver comprehensive
intelligence solutions.

6. Action and Automation

The final stage involves automating actions based on insights and integrating them into
business processes. Key services include:

 Azure Logic Apps: A cloud service for automating workflows and integrating apps, data, and
services.
 Microsoft Flow: Now called Power Automate, it automates workflows between apps and
services to synchronize files, get notifications, and collect data.

Example Use Case

Consider a retail company wanting to improve its customer experience. Using the Cortana
Analytics Process, the company can:

1. Ingest customer interaction data from various touchpoints.


2. Store the data in Azure Data Lake.
3. Prepare the data using Azure Data Factory.
4. Analyze customer behavior and predict trends using Azure Machine Learning.
5. Visualize the insights using Power BI dashboards.
6. Automate personalized marketing campaigns using Azure Logic Apps based on the insights.

References

 Microsoft Cortana Intelligence Suite


 Azure Machine Learning
 Azure Data Lake
 Power BI
What's a typical Data Science process?

The process starts with an interesting question, often aligned to business goals. Available data is then cleaned and filtered. This may also involve collecting new data as relevant to the question. Data is analyzed to discover patterns and outliers. A model is built and validated, often using machine learning algorithms. The model is often refined in an iterative manner. The final step is to communicate the results. The results may inspire the data scientist to ask and investigate further questions.
As a beginner, what should be my learning path to become a data scientist?

One approach is to be practical and hands-on from the outset. Pick a topic you're passionate and curious about. Research available datasets. Tweet and discuss so that you get clarity. Start coding. Explore. Analyze. Build data pipelines for large datasets. Communicate your results. Repeat this with other datasets and build a public portfolio. Along the way, pick up all the skills you need.
You may instead prefer a more formal approach. You can learn the basics of languages such as R and Python. Follow this with additional packages/libraries particular to data science: (R) dplyr, ggplot2; (Python) NumPy, Pandas, matplotlib. Get introduced to statistics. From this foundation, start your journey into Machine Learning. To relate these to business goals, some recommend the book Data Science for Business by Provost and Fawcett. But you should put all this knowledge into practice by taking up projects with datasets and problems that interest you. At the University of Wisconsin, statistics is covered before programming. To become interdisciplinary, you may choose to learn aspects of data engineering (data warehousing, Big Data) and ethics.

Could you give some tips for a budding data scientist?


The following tips might help:

 Data science is about answering scientific questions with the help of data. Don't focus just on the aspect of handling data, dataset size or the tools.
 Understand the business, its products, customers and strategies. This will help you ask the right questions. Have constant interaction with business counterparts. Communicate with them in a language they can understand.
 Consider alternative approaches before selecting one that suits the problem. Likewise, select a suitable metric. Sometimes derived metrics may yield better prediction compared to available metrics.
 Understand the pros and cons of various ML algorithms before selecting one for your problem.
 Craft machine learning models from scratch. Don't just rely on premade templates and libraries. Test them to their limits to understand what's going to work, and where.
 Find a compromise between speed and perfection. On-time delivery should be preferred over extreme accuracy.
 Useful data is more important than lots of data. Use multiple data sources to better understand data and its discrepancies.
 Be connected with the data science community, be it via blogs, meetups, conferences or hackathons.
 Practice with open datasets. Learn from the solutions of others.

Asking a Question
Asking questions is central to data science, as different questions require
different analyses. For example, "How have house prices changed over
time?" differs from "How will this new law affect house prices?".
Understanding the research question determines the necessary data, the
patterns to look for, and how to interpret results. This book focuses on
three broad categories of questions: exploratory, inferential, and
predictive.

Exploratory Questions: These aim to uncover information about existing data. For instance, using environmental data to ask if average global
temperatures have risen in the past 40 years is exploratory. The goal is to
summarize and interpret trends in the data without predicting future
trends.

Inferential Questions: These quantify whether observed trends in a sample hold true for a larger population. For example, using data from a
sample of hospitals to ask if air pollution correlates with lung disease for
the entire US population. Inferential questions seek to infer trends beyond
the sample data.
Predictive Questions: These aim to forecast trends for individuals based
on patterns observed in data. For example, predicting how likely someone
is to vote based on their income and education.

As data analysis progresses, refining research questions is common. Each refinement requires reconsidering the type of question being asked.

Obtaining Data
This stage involves acquiring and understanding how the data were
collected. The type of research questions that can be answered depends
significantly on the data collection method. When data are costly and
difficult to gather, a precise research question is defined first. When data
are abundant and easily accessible, the analysis might start with obtaining
data, exploring it, and then formulating questions.

Understanding the Data


After obtaining data, the next step is to understand it through exploratory
data analysis (EDA). This includes creating plots to identify patterns and
summarizing the data visually. It also involves identifying and addressing
data issues such as missing or anomalous values. This stage is iterative,
often leading back to revising research questions or obtaining additional
data.

Understanding the World


In this stage, conclusions are drawn about the larger population. This
involves using statistical techniques like A/B testing, confidence intervals,
and regression models. The goal is to quantify the extent to which trends
observed in the sample data can be generalized to the entire population.
This stage is essential for answering inferential and predictive research
questions.

Google Flu Trends:


Digital Epidemiology and GFT: Digital epidemiology utilizes data
generated outside the traditional public health system to analyze disease
patterns and health dynamics. Google Flu Trends (GFT) was an early
example, launched in 2007. By analyzing search queries related to flu
symptoms, GFT aimed to estimate flu cases in real time. Initially, GFT
generated excitement about the potential of big data in public health.

Failures of GFT: Despite early success, GFT struggled to maintain accuracy. During the 2011-2012 flu season, it frequently overestimated flu
cases compared to traditional data from the CDC, which collects data from
labs nationwide. Over 108 weeks, GFT overestimated CDC figures 100
times. A model using three-week-old CDC data and seasonal trends proved
more reliable than GFT’s real-time estimates.

Lessons Learned:

 Data Quality: GFT's reliance solely on search queries overlooked critical information available through traditional methods.
 Integration of Data Sources: Combining GFT data with CDC data
improved predictions, highlighting the value of integrating multiple
data sources.
 Importance of Data Scope: The GFT example underscores the
need to understand the relationship between data, the subject of
investigation, and the research question. Misalignment can lead to
inaccurate conclusions and overstated findings.

Key Takeaways:

 Big Data Limitations: More data does not always mean better
insights. Data scope and quality are crucial.
 Combining Approaches: Integrated methods often yield better
results than single data sources.
 Framework Understanding: Properly aligning data with the
research question is essential to avoid misleading conclusions.

Questions and Data Scope: Target Population, Access Frame, Sample:
Key Points:

1. Initial Steps in Data Life Cycle: Begin by expressing the question of interest in the context of the subject area and consider its
connection to the collected data. This step is essential before
analysis or modeling to avoid misalignment between the question
and the data.
2. Target Population: The group you aim to describe and draw
conclusions about. An element or unit in this population can be a
person, tweet, voter, etc.
3. Access Frame: The accessible subset of the target population.
Ideally, it aligns perfectly with the target population, but often it
doesn't. Some units in the access frame may not belong to the
population, and some units in the population may not be accessible.
4. Sample: A subset of units from the access frame used for
measurement and analysis. The sample should be representative of
the target population to ensure accurate conclusions.
5. Representativeness: The alignment between the access frame
and the target population, and the method of selecting units, are
crucial for representativeness. Bias in sampling methods can lead to
unrepresentative data.
6. Time and Place Considerations: Temporal and spatial patterns in
data need consideration. For example, drug trial effectiveness or
environmental health data may vary by location and time, affecting
conclusions.
7. Data Collection Purpose: Understand who collected the data and
why, especially with passively collected data. This helps in assessing
the suitability of the data for the question at hand.

Examples:

1. Wikipedia Contributors:
o Question: Do informal awards increase the activity of
Wikipedia contributors?
o Target Population: Active contributors to Wikipedia (top 1%
of contributors).
o Access Frame: Contributors who hadn't received an informal
incentive recently.
o Sample: 200 randomly selected contributors from the access
frame, observed for 90 days.

2. Election Polling:
o Question: Who will win the election?
o Target Population: Voters in the 2016 US presidential
election.
o Access Frame: Likely voters with landline or mobile phones.
o Sample: People randomly selected via a dialing scheme.

3. Environmental Health:
o Question: How do environmental hazards impact health?
o Target Population: Residents of California.
o Access Frame: Census tracts in California.
o Sample: Census tracts with aggregated data.

Conclusion:

Understanding the connection between the target population, access frame, and sample is essential for valid data analysis. Consider the scope
and limitations of your data, including potential biases,
representativeness, and the context of data collection, to ensure accurate
and meaningful conclusions.

Here's a concise explanation of the differences between the target population, access frame, and sample:

1. Target Population:
o Definition: The entire group of individuals or elements about
which you want to draw conclusions.
o Example: All voters in a country, all residents in a city, or all
users of a social media platform.
o Purpose: The target population is the primary focus of your
study, the group you want to understand or make predictions
about.

2. Access Frame:
o Definition: The subset of the target population that is
accessible for data collection. It includes all the units that you
can realistically reach or measure.
o Example: Voters with registered phone numbers, residents
who visit a specific clinic, or users who are active on the social
media platform in the past month.
o Purpose: The access frame defines the practical boundary
within which you can collect data. It may not perfectly match
the target population due to limitations in data collection
methods.

3. Sample:
o Definition: A subset of units selected from the access frame
to be measured or observed. The sample is used to infer
conclusions about the entire target population.
o Example: 1,000 randomly selected voters from the registered
phone numbers, 500 patients visiting the clinic in a month, or
10,000 active social media users in the past month.
o Purpose: The sample provides the actual data points for
analysis. It should be representative of the access frame to
ensure that conclusions drawn are valid for the target
population.

Visualizing the Relationships

 Target Population: Broadest scope (all potential units of interest).


 Access Frame: Narrower scope (those units you can actually
reach).
 Sample: Narrowest scope (units you actually measure or observe).

Example Scenario

 Target Population: All voters in a country.


 Access Frame: Voters with registered phone numbers.
 Sample: 1,000 randomly selected voters from the registered phone
numbers.

Importance of Each Component

 Ensuring the access frame closely matches the target population helps minimize biases and improves the representativeness of the sample.
 A sample that is representative of the access frame allows for
accurate inferences about the target population.
Understanding these differences is crucial for designing robust data
collection methods and ensuring the validity of your analysis and
conclusions.

Questions and Data Scope: Accuracy Summary:


In data analysis, ensuring accuracy is crucial, especially when a census or
measurement aims to capture an entire population. Ideal scenarios, such
as perfect questionnaires or flawless instruments, are rare. Most situations
require quantifying the accuracy of measurements to generalize findings.
Accuracy is divided into bias and variance (precision). Bias represents
systematic errors, while variance indicates the spread of measurements
around the true value. Reducing bias and variance enhances data
accuracy.

Types of Bias:

1. Coverage Bias: Occurs when the access frame doesn't include the
entire target population. For example, a survey via cell-phone calls
excludes those without phones.
2. Selection Bias: Happens when the sampling mechanism favors
certain units. Convenience samples are a common example.
3. Non-response Bias: Involves unit non-response (when selected
individuals don't participate) or item non-response (when specific
questions are unanswered).
4. Measurement Bias: Results from systematic errors in
measurement tools or survey questions.

Accuracy in Polls: The 2016 US Presidential Election highlighted non-response and measurement biases, leading to inaccurate predictions. Over-representation of college-educated voters and last-minute preference changes skewed poll results.

Types of Variation:

1. Sampling Variation: Arises from using chance to select a sample.


2. Assignment Variation: Results from random assignment of units
in experiments.
3. Measurement Error: Involves variability in repeated
measurements of the same object.

Urn Model Analogy: This model helps estimate variation by using a chance mechanism to draw samples and assign treatments, as demonstrated in a Wikipedia experiment.

Understanding bias and variation, along with employing robust protocols, enhances the accuracy and reliability of data analysis.
Summary: Questions and Data Scope:
Before engaging in data cleaning, exploration, and analysis, it's crucial to
consider the data's source. If you didn't collect the data yourself, ask:

Who collected the data?


Why were the data collected?
These questions help determine if the data are suitable for your analysis.

Scope of the Data


Understanding the temporal and spatial aspects of data collection is
essential:

When were the data collected?


Where were the data collected?
This ensures your findings are relevant and comparable to your context.

Core Questions about Data Scope


What is the target population (or unknown parameter value)?
How was the target accessed?
What methods were used to select samples/take measurements?
What instruments were used and how were they calibrated?
Answering these questions helps evaluate the reliability and
generalizability of your findings.

Framework and Concepts


This chapter introduces terminology and frameworks to assess data
quality, identify bias, and understand variance:

Scope Diagram: Shows the overlap between target population, access frame, and sample.

Dart Board Analogy: Describes an instrument's bias and variance.


Urn Model: Helps in scenarios involving chance mechanisms for sampling,
dividing experimental groups, or taking measurements.
These tools assist in identifying data limitations and judging their
usefulness. The next chapter will delve deeper into the urn model to
quantify accuracy and design simulation studies.

Simulation and Data Design: Simulating Election Polls:


Context
In the U.S. presidential election, the president is chosen by the Electoral
College, with each state having a certain number of votes. Polls help
identify battleground states where the results could be close. The 2016
election saw many incorrect predictions despite accurate outcomes in
most states, highlighting the impact of small biases in polling.
Bias refers to a systematic error or deviation from the true value or the expected result due to some
influence or prejudice. It can affect data collection, analysis, interpretation, and generalization of
results in various fields.

Key Points
Electoral Process: The Electoral College votes determine the president, not
the popular vote. States usually award all their electoral votes to the
candidate who wins the popular vote within the state.

2016 Election Analysis:


Pollsters correctly predicted 46 of 50 states.
The four battleground states (Florida, Michigan, Pennsylvania, Wisconsin)
had narrow margins.
Education bias in polling underestimated Trump's support.

Simulation Study:
Scenario 1: No bias. Polls are representative, with each of the 1,500
sampled voters reflecting actual voter preferences.
Scenario 2: Slight education bias favoring Clinton by 0.5 percentage
points.
Using the urn model, simulations showed how often polls predicted the
correct outcome.

Urn Model:
Simulates election polls by drawing a sample of voters (marbles) from an
urn representing the entire population of voters.
Results are calculated using multivariate hypergeometric distribution.
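A hedged numpy sketch of such an urn-model simulation: drawing 1,500 voters without replacement from a large urn follows a hypergeometric distribution, and repeating the draw shows how often a poll of that size calls a narrow race correctly. The population size and vote split below are illustrative, not the actual election figures.

# Urn-model poll simulation (illustrative numbers, not real election data).
import numpy as np

rng = np.random.default_rng(seed=42)

n_voters = 6_000_000          # size of the "urn" (illustrative)
true_share = 0.503            # true share for candidate A (illustrative narrow margin)
n_a = int(n_voters * true_share)
n_b = n_voters - n_a
sample_size = 1_500           # poll size

# Number of candidate-A voters in each simulated poll
# (sampling without replacement -> hypergeometric distribution)
polls = rng.hypergeometric(ngood=n_a, nbad=n_b, nsample=sample_size, size=100_000)

# How often does the poll call the race correctly (A ahead in the sample)?
correct = np.mean(polls > sample_size / 2)
print(f"Fraction of simulated polls predicting the true winner: {correct:.2f}")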

Results:
Without bias, Trump was predicted to win about 60% of the time.
With a small bias, the prediction accuracy dropped, and Trump was
predicted to win only 45% of the time.
Larger samples (12,000 voters) reduced sampling error but did not
eliminate the effect of bias.

Implications:
Bigger polls reduce sampling error but do not address bias.
Pollsters need to improve methods to reduce bias.
Polls remain useful but must account for potential biases.
Conclusion:
Simulation studies can help understand polling accuracy and the effects of
biases. They show that even small biases can significantly impact
predictions, and larger sample sizes do not necessarily overcome these
biases. Improving polling methodologies to account for biases is crucial for
accurate predictions.
Simulation and Data Design: Simulating a Randomized Trial:
Vaccine Efficacy
Randomized Controlled Trials (RCTs) Overview

 In RCTs, participants are randomly assigned to treatment or control groups.
 This random assignment helps to control for bias and allows for a
clear comparison between the groups.

Detroit Mayor's Decision on Vaccine Efficacy

 In March 2021, Detroit Mayor Mike Duggan declined a shipment of Johnson & Johnson vaccines due to its reported 66% efficacy, compared to Moderna and Pfizer's 95% efficacy.
 The CDC, however, considers a 66% efficacy rate effective, hence
the vaccine received emergency approval.

Scope of Clinical Trials

 Johnson & Johnson trial:


o Conducted with 43,738 participants from eight countries.
o Included a diverse group with about 40% having comorbidities.
o Conducted during a period when a more contagious variant
was spreading.
 Moderna and Pfizer trials:
o Primarily conducted in the US.
o Also had about 40% participants with comorbidities.
o Conducted earlier in the pandemic during a period of lower
infection rates.

The Urn Model in Simulations

 An urn model is used to simulate the assignment of treatment and placebo in the trial.
 In the J&J trial, 468 participants contracted COVID-19 (117 in the
treatment group and 351 in the control group).
 An urn model with 43,738 marbles (468 labeled "sick" and the rest
"healthy") is used to simulate this process.
 Drawing half the marbles simulates the assignment to treatment and
placebo groups.

Simulation Results

 The simulation showed that drawing 117 or fewer "sick" marbles out
of 21,869 in 500,000 trials was extremely rare if the vaccine were
ineffective.
 The rarity of this outcome suggests the vaccine's efficacy.
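A sketch of how this simulation could be run in numpy, using the counts reported above (468 cases among 43,738 participants, half assigned to treatment, 500,000 simulated trials); the random seed is an arbitrary choice.

# If the vaccine did nothing, how often would random assignment alone put
# 117 or fewer of the 468 cases into the treatment group?
import numpy as np

rng = np.random.default_rng(seed=0)

n_participants = 43_738
n_sick = 468
n_treatment = n_participants // 2   # 21,869 drawn into the treatment group

# Number of "sick" marbles landing in the treatment group in each simulated trial
sick_in_treatment = rng.hypergeometric(
    ngood=n_sick, nbad=n_participants - n_sick, nsample=n_treatment, size=500_000
)

p_extreme = np.mean(sick_in_treatment <= 117)
print(f"Proportion of simulated trials with <= 117 cases in treatment: {p_extreme}")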
Calculating Vaccine Efficacy (VE)

 VE is calculated using the formula:

VE = (risk among unvaccinated group − risk among vaccinated group) / (risk among unvaccinated group)

 For the J&J trial (the two groups were the same size, so case counts can stand in for risks): VE = (351 − 117) / 351 ≈ 0.6667, or 66.67%.
 The CDC's standard for VE is 50%. For J&J, none of the simulations resulted in 156 or fewer cases in the treatment group, affirming its efficacy.

Efficacy Against Severe Cases

 J&J vaccine showed over 80% efficacy in preventing severe COVID-19 cases.
 No deaths were observed in the treatment group.

Conclusion

 The urn model and random assignment in clinical trials help assess
the efficacy of treatments.
 Considering the scope and context of data is crucial for accurate
comparisons between different studies.
 After understanding these factors, Mayor Duggan retracted his
statement, acknowledging the efficacy and safety of the J&J vaccine.

This simulation example illustrates the importance of randomization, data scope, and appropriate statistical models in evaluating vaccine efficacy and making informed public health decisions.
Assignment/Exercise:
Now, let's replace that first plot command with the following.

Now, let's try changing the colors.


Now, let's change the size of the points in the plot.

Finally, let's draw a line through the points.


Assignment/Exercise :
1. Find Fisher's Iris Data Set on Wikipedia.
2. Copy the data table and paste it into Microsoft Excel, Apple Numbers,
or Google Docs Spreadsheet
3. Save the dataset in Comma Separated Value (CSV) format on your
desktop, with a filename of "iris.csv"
4. Read the dataset into R
5. Inspect the data, make sure it is all there, then look at the data using
the summary(), table(), and plot() functions
The random function returns random values between 0 and 1, while the seed function determines the initial state of the pseudo-random number generator.
