CS250
1. History:
Data science is a discipline that incorporates varying degrees of Data
Engineering, Scientific Method, Math, Statistics, Advanced Computing,
Visualization, Hacker mindset, and Domain Expertise. A practitioner of Data
Science is called a Data Scientist. Data Scientists solve complex data analysis
problems.
Origins
The term "Data Science" was coined at the beginning of the 21st Century. It is
attributed to William S. Cleveland who, in 2001, wrote "Data Science: An
Action Plan for Expanding the Technical Areas of the Field of Statistics".
Development:
During the dot-com bubble (1998-2000), hard drives became inexpensive, leading
corporations and governments to buy many. As per a corollary of Parkinson's Law, data
expands to fill available disk space, creating a cycle of buying more disks and accumulating
more data, resulting in big data. Big data is vast and complex, requiring special management
tools. Companies like Google, Yahoo!, and Amazon developed cloud computing to handle
this, with MapReduce and Hadoop being key innovations. Hadoop's complexity led to the
creation of mass analytic tools with simpler interfaces, like recommender systems and
machine learning, requiring specialized knowledge. This specialization gave rise to data
scientists who analyze big data for new insights. Data science, ideally done in teams, tackles
large-scale problems that no single individual can manage alone. In summary: cheap disks →
big data → cloud computing → mass analytic tools → data scientists → data science teams
→ new analytic insights
(The "dot-com" bubble of 1998-2000 was a period of excessive speculation and investment in internet-
based companies, fueled by the rapid growth and adoption of the internet. Many investors poured
money into startups with ".com" in their names, leading to a surge in stock prices. However, many of
these companies had unsustainable business models and eventually failed. The bubble burst in 2000,
leading to a significant stock market crash and substantial financial losses for investors.)
(Parkinson's Law is an adage that states, "Work expands to fill the time available for its completion."
This means that if you allocate more time to a task, it will take longer to complete, often due to
procrastination, inefficient work habits, or unnecessary complexities.
A corollary of Parkinson's Law applies to data storage: "Data expands to fill the available disk space."
This means that as more storage becomes available, the amount of data stored increases accordingly,
often leading to more data accumulation than initially expected or necessary. The law highlights how
resources, whether time or storage space, tend to get fully utilized, often leading to inefficiency.)
Data Engineering:
Data Engineering is a key component of data science that involves acquiring,
ingesting, transforming, storing, and retrieving data, often accompanied by
adding metadata. A data engineer must manage these interconnected tasks as a
whole, understanding how data storage and retrieval impact ingestion and
processing.
Key Processes in Data Engineering:
1. Acquiring: Identifying data sources and obtaining data, which can come from
various places and in different formats, such as text, images, or sensor data.
2. Ingesting: Moving data into computer systems for analysis, considering data
volume, speed, and storage capacity.
3. Transforming: Converting raw data into a usable format for analysis, often
from CSV to structured formats like spreadsheets.
4. Metadata: Adding data about data, such as collection time, location, and other
relevant information, to enhance understanding and usability.
5. Storing: Choosing the appropriate storage system, like file systems for speed
or databases for functionality, based on data and analysis needs.
6. Retrieving: Extracting and querying data for analysis and visualization,
ensuring storage strategies align with retrieval requirements.
Example: For highway data, sensors might collect speed data in CSV format. This
data is ingested, transformed into a structured format, metadata is added, stored
in a database, and retrieved for analysis, such as calculating average speeds
during rush hours.
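A minimal R sketch of this pipeline, assuming a hypothetical file speeds.csv with columns sensor_id, timestamp, and speed_mph (names invented for illustration):
# Ingest: read the raw sensor file (hypothetical name and columns)
speeds <- read.csv("speeds.csv", stringsAsFactors = FALSE)
# Transform: parse timestamps and flag rush-hour records (7-9 AM, 4-6 PM)
speeds$timestamp <- as.POSIXct(speeds$timestamp)
hr <- as.integer(format(speeds$timestamp, "%H"))
speeds$rush_hour <- hr %in% c(7, 8, 16, 17)
# Metadata: attach simple data-about-data to the data frame
attr(speeds, "collected_at") <- Sys.time()
attr(speeds, "location") <- "Highway sensors, northbound"
# Retrieve/analyze: average rush-hour speed per sensor
aggregate(speed_mph ~ sensor_id, data = speeds[speeds$rush_hour, ], FUN = mean)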
Scientific Method:
2. Empirical Evidence:
- Data obtained from observation or experiment, as opposed to logical
arguments or myths.
- Example: Galileo's telescope observations supporting Copernicus's
heliocentric theory versus Aristotle's geocentric model.
3. Hypothesis Testing:
- Involves two propositions: the null hypothesis (current understanding) and the
alternative hypothesis (new proposition).
- Example: In a trial, "the defendant is not guilty" (null hypothesis) and "the
defendant is guilty" (alternative hypothesis).
4. Repeatable Experiments:
- Methodical procedures that verify, falsify, or establish the validity of a
hypothesis, relying on repeatable methods and logical analysis.
- Example: Galileo's inclined plane experiment disproving Aristotle's theory of
falling bodies.
Role in Data Science:
Data scientists use the Scientific Method to critically evaluate evidence,
understand reasoning behind conclusions, test hypotheses, and ensure
experiments can be replicated to validate results.
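A minimal sketch of hypothesis testing in R on simulated data (the groups and values are invented for illustration): the null hypothesis is that the two group means are equal, the alternative is that they differ.
set.seed(42)
group_a <- rnorm(30, mean = 50, sd = 5)   # simulated measurements for group A
group_b <- rnorm(30, mean = 53, sd = 5)   # simulated measurements for group B
result <- t.test(group_a, group_b)        # null: equal means; alternative: different means
result$p.value                            # a small p-value is evidence against the null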
Math:
Mathematics, alongside statistics, forms the intellectual core of data science,
focusing on the study of quantity, structure, space, and change, especially when
applied to practical problems.
1. Quantity: quantifying and analyzing amounts and counts in data.
2. Structure: understanding how data elements are organized and related.
3. Space: representing spatial relationships in data.
4. Change: describing how quantities change over time or distance.
Data scientists use mathematics to quantify and analyze data, understand its
structure, represent spatial relationships, and describe changes over time or
distance, enabling them to solve complex practical problems.
Statistics:
Statistics, together with mathematics, forms the intellectual foundation of data
science. It involves the collection, organization, analysis, and interpretation of
data to discover patterns, create models, and make future predictions.
1. Collection:
- Designing Research: Creating research and experimental designs to ensure
data is collected in a way that allows valid conclusions.
- Example: Working with data engineers to develop procedures for data
generation.
2. Organization:
- Coding and Archiving Data: Ensuring data is coded, archived, and documented
appropriately for analysis and sharing.
- Example: Creating a data dictionary to specify variables, valid values, and
data formats, which data engineers use to develop a database schema.
3. Analysis:
- Summarizing and Modeling: Using descriptive and inferential statistics to
summarize data, test hypotheses, and create models.
- Example: Analyzing data to determine if there are significant differences
between groups or to identify correlations.
4. Interpretation:
- Reporting Results: Collaborating with subject matter experts and visual artists
to present data in comprehensible ways.
- Example: Creating tables and graphs to report results to stakeholders in an
understandable manner.
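A small R sketch of these activities using the built-in mtcars dataset (chosen only for illustration):
data(mtcars)
summary(mtcars$mpg)                         # descriptive statistics: summarize fuel economy
sd(mtcars$mpg)                              # spread of the variable
cor.test(mtcars$mpg, mtcars$wt)             # inferential statistics: is mpg related to weight?
aggregate(mpg ~ cyl, data = mtcars, mean)   # a simple table for reporting results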
Advanced computing:
Advanced computing is the heavy lifting of data science, encompassing the
design, coding, testing, debugging, and maintenance of software to perform
specific operations.
1. Software Design:
o Process: Transforming software purpose and specifications into a
detailed plan, including components and algorithms.
o Example: Using modeling languages like UML to create software
designs, which programmers implement by writing source code.
2. Programming Language:
o Definition: Artificial languages designed to communicate
instructions to computers, controlling their behavior and external
devices.
o Example: Choosing between low-level languages (e.g., assembly)
and high-level languages (e.g., Java, Python, C++) to solve specific
problems.
3. Source Code:
o Definition: Collections of computer instructions written in human-
readable languages, translated into machine code for execution.
o Example: Using IDEs to type, debug, and execute source code, such
as the traditional "Hello World" program in Java and Python.
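The notes show the program in Java and Python; since this course's exercises use R, the equivalent R source code would simply be:
# The traditional first program, written in R
print("Hello, World!")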
Visualization:
Visualization is the "pretty face" of data science, focusing on the visual
representation of abstract data to enhance human understanding and cognition.
1. Creative Process:
- Definition: Creating something original and worthwhile through divergent
thinking, conceptual blending, and honing.
- Role: Visual artists in data science explore multiple ways to present data and
refine visualizations through iterations.
2. Data Abstraction:
- Definition: Handling data meaningfully by visualizing manipulations like
aggregations, summarizations, correlations, and predictions, rather than raw
data.
- Role: Simplifying data content to make visualizations meaningful in the
context of the problem being addressed.
3. Informationally Interesting:
- Definition: Creating visuals that are not only informative but also aesthetically
pleasing and engaging, often incorporating elements of beauty such as symmetry
and harmony, with touches of surprise.
- Role: Making visualizations attractive to capture and retain human attention,
enhancing the communication of data insights.
Example:
A partial map of the Internet from early 2005 demonstrates effective
visualization. Each line represents connections between two IP addresses,
abstracting a subset of internet data. Through numerous iterations, a harmonious
color scheme and overall symmetry with surprising details (bright "stars") were
achieved, making the map both informative and visually engaging in the context
of understanding the structure of the Internet.
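As a hedged sketch of data abstraction in a visualization (the page-view data below are simulated, not taken from the Internet map example), aggregate before plotting rather than plotting raw records:
set.seed(1)
views <- data.frame(day = 1:84, count = rpois(84, lambda = 200))   # simulated daily counts
views$week <- ceiling(views$day / 7)
weekly <- aggregate(count ~ week, data = views, FUN = sum)          # abstraction: weekly totals
plot(weekly$week, weekly$count, type = "b", col = "steelblue",
     xlab = "Week", ylab = "Total page views",
     main = "Weekly page views (aggregated, not raw, data)")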
Domain Expertise:
Domain Expertise is the glue that holds data science together. It involves having
proficiency and special knowledge in a particular area, known as subject matter
expertise (SME). Any field, such as medicine, politics, sciences, marketing,
information security, demographics, and literature, can be subject to data
science inquiry. A successful data science team must include at least one domain
expert.
The domain expert in data science asks, "What is important about the problem
we are solving?" and "What exactly should our customers know about our
findings?"
Assignment/Exercise Summary
Objective: Familiarize yourself with the R programming environment.
Steps:
https://web.stanford.edu/class/cs109l/unrestricted/resources/google-style.html
4. Online Resources:
o Search for "introduction to R," "R tutorial," "R basics," and "list of R commands."
o Choose 4-5 websites and work through the first few examples on each site.
o Switch to another site if the current one becomes too confusing.
https://www.w3schools.com/r/default.asp
https://www.tutorialspoint.com/r/index.htm
https://www.codecademy.com/learn/learn-r
https://www.programiz.com/r
5. R Commands:
o Try the following commands in R:
library(help="utils")
library(help="stats")
library(help="datasets")
library(help="graphics")
demo()
demo(graphics)
demo(persp)
6. Short Program:
o Write a short program (5-7 lines) that executes without errors (a sample sketch follows this list).
o Include the names of all contributors in the comment section.
7. Documentation:
o List the websites used, indicating which was the most helpful.
o List the top 10 unanswered questions the team has at the end of the study session.
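One possible answer to step 6, offered only as an example sketch (the contributor line is a placeholder):
# Contributors: Student A, Student B, Student C (replace with your team's names)
x <- 1:10                      # the integers 1 through 10
y <- x^2                       # their squares
cat("Mean of the squares:", mean(y), "\n")
plot(x, y, type = "b", main = "x versus x squared")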
The Impact of Data Science:
This chapter highlights the revolutionary impact of data science on different
sectors such as baseball, health, and robotics.
Moneyball
Assignment/Exercise
Task: In groups, watch "Moneyball" and take notes on the impact of data
science in the film.
Brainstorm: Discuss other areas where data science could be impactful
and consider potential counter-arguments.
Presentation: Create a 4-slide presentation covering:
1. Chosen area of life.
2. How data science would make a difference.
3. Counter-arguments.
4. Group's conclusion on the viability of data science in that area.
Section 1
00:00:02
The speaker in the video is introducing the concept of data science for
beginners, emphasizing the importance of gathering data, which includes
both numerical and categorical information. They explain the distinction
between numbers (amounts or counts) and names (categorical variables),
highlighting that even slight changes in numbers may still be close to the
original value, while changing a name slightly can result in a completely
different entity. The speaker also discusses the complexity of data that blurs
the line between numbers and names, such as phone numbers and zip codes.
They mention the significance of identification numbers and the ability to
convert names with order into numbers for machine learning algorithms. The
speaker encourages viewers to explore tools for data collection and analysis,
referencing the Cortana analytics process. They stress the importance of
asking precise questions that can be answered with specific data, ensuring
that the target information is included in the dataset. If the target is not
present, they advise obtaining more data. Additionally, they explain the
process of organizing data into a table with one target value per row for
analysis.
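A minimal R sketch of the idea that "names with order" can be converted to numbers (the size labels are invented for illustration):
sizes <- factor(c("small", "medium", "large", "medium"),
                levels = c("small", "medium", "large"), ordered = TRUE)
as.integer(sizes)   # small = 1, medium = 2, large = 3: ordered names become numbers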
Section 2
00:07:15
In this part of the video, the speaker discusses the process of organizing data
to have one instance of the target variable for each row. They explain how
data that doesn't naturally occur once per day, such as total users or
quantities that remain constant for a period, needs to be aggregated or
distributed to align with the rows. The speaker also mentions the importance
of computing values like days since a specific event, gathering external data,
estimating missing information, and checking data quality. They provide an
example of cleaning up data related to superheroes and super villains,
ensuring that all values are formatted consistently for machine learning
algorithms to interpret correctly. The speaker emphasizes the need to
thoroughly review and understand each column of the data to ensure its
accuracy and quality.
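A hedged R sketch of the aggregation step described here, using an invented event log reduced to one target row per day:
events <- data.frame(date  = as.Date(c("2024-01-01", "2024-01-01", "2024-01-02")),
                     users = c(10, 15, 8))                  # invented event-level records
daily <- aggregate(users ~ date, data = events, FUN = sum)  # one row per day
launch <- as.Date("2023-12-25")                             # hypothetical reference event
daily$days_since_launch <- as.numeric(daily$date - launch)  # "days since a specific event"
daily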
Section 3
00:14:09
The speaker discusses the process of cleaning and interpreting data for
machine learning algorithms. They mention how they clean up data columns
to ensure uniform representation, such as identifying secret identities as yes
or no and categorizing an individual's ability to fly based on numerical values.
The speaker emphasizes the importance of unifying data standards for
effective interpretation by machine learning algorithms. They also touch on
feature engineering, which involves manipulating existing features to improve
predictive capabilities. An example is given where combining departure and
arrival times of a subway train helps predict the maximum speed reached
between stops. The speaker highlights the significance of data interaction in
enhancing predictive models and the concept of coefficient of determination
to evaluate model performance.
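A sketch of the subway example with simulated numbers (not the video's data): combine departure and arrival times into a travel-time feature, fit a simple model, and read off the coefficient of determination.
set.seed(7)
trips <- data.frame(departure = runif(50, 0, 60))           # minutes past the hour
trips$arrival   <- trips$departure + runif(50, 2, 6)
trips$max_speed <- 80 - 5 * (trips$arrival - trips$departure) + rnorm(50, sd = 2)
trips$travel_time <- trips$arrival - trips$departure        # engineered feature
model <- lm(max_speed ~ travel_time, data = trips)
summary(model)$r.squared                                    # coefficient of determination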
Section 4
00:21:40
Section 5
00:29:01
Section 6
00:36:23
Section 7
00:43:20
The speaker discusses three key gaps in machine learning. The first gap
highlights the importance of ensuring that data remains relevant in a
changing world, using the example of the impact of the September 11th
attacks on predictions made just before the event. The second gap
emphasizes the challenge of collecting sufficient data for certain complex
phenomena, such as global climate change. The third gap points out that
machine learning cannot determine causation, using examples like the
correlation between cheese consumption and deaths by bedsheet
entanglement. The speaker concludes by highlighting the role of human
insight and judgment in filling these gaps and making intuitive leaps in data
analysis.
1. Information Management
This stage focuses on collecting, storing, and managing data from various sources. It includes:
Data Ingestion: Collecting data from various sources, such as databases, IoT devices, and
external APIs.
Data Storage: Storing the ingested data in scalable and reliable storage solutions like Azure
Data Lake, SQL Database, or Blob Storage.
Data Preparation: Cleaning, transforming, and organizing data for analysis using tools like
Azure Data Factory or Azure Databricks.
2. Big Data Stores
This component involves storing and processing large volumes of data. Key technologies
include:
Azure Data Lake: A hyper-scale repository for big data analytics workloads.
Azure SQL Data Warehouse: A fully managed, petabyte-scale data warehouse service.
Azure Cosmos DB: A globally distributed, multi-model database service.
3. Machine Learning and Analytics
In this stage, advanced analytics and machine learning models are developed and applied to
the data. It includes:
Azure Machine Learning: A service for building, training, and deploying machine learning
models.
Azure HDInsight: A fully managed, full-spectrum, open-source analytics service for
enterprises.
Azure Databricks: An Apache Spark-based analytics platform optimized for Azure.
4. Dashboards and Visualizations
This involves creating interactive dashboards and visualizations to present insights derived
from the data. Tools include:
Power BI: A suite of business analytics tools to analyze data and share insights.
Azure Synapse Analytics: Integrates big data and data warehousing to offer dashboards and
interactive reports.
5. Intelligence
This stage focuses on deriving actionable insights and embedding intelligence into
applications. It includes:
Azure Cognitive Services: A collection of APIs for adding cognitive features like vision,
speech, and language understanding to applications.
Cortana Intelligence Suite: Integrates various analytics services to deliver comprehensive
intelligence solutions.
6. Action and Automation
The final stage involves automating actions based on insights and integrating them into
business processes. Key services include:
Azure Logic Apps: A cloud service for automating workflows and integrating apps, data, and
services.
Microsoft Flow: Now called Power Automate, it automates workflows between apps and
services to synchronize files, get notifications, and collect data.
Consider a retail company wanting to improve its customer experience. Using the Cortana
Analytics Process, the company can:
One approach is to be practical and hands-on from the outset. Pick a topic
you're passionate and curious about. Research available datasets.
Tweet and discuss your ideas to get clarity. Start coding. Explore. Analyze.
Build data pipelines for large datasets. Communicate your results. Repeat
this with other datasets and build a public portfolio. Along the way, pick up
all the skills you need.
You may instead prefer a more formal approach. You can learn the basics
of languages such as R and Python. Follow this with additional
packages/libraries particular to data science: (R) dplyr, ggplot2; (Python)
NumPy, Pandas, matplotlib. Get introduced to statistics. From this
foundation, start your journey into Machine Learning. To relate these to
business goals, some recommend the book Data Science for Business by
Provost and Fawcett. But you should put all this knowledge into practice
by taking up projects with datasets and problems that interest you. At the
University of Wisconsin, statistics is covered before programming. To
become interdisciplinary, you may choose to learn aspects of data
engineering (data warehousing, Big Data) and ethics.
Asking a Question
Asking questions is central to data science, as different questions require
different analyses. For example, "How have house prices changed over
time?" differs from "How will this new law affect house prices?".
Understanding the research question determines the necessary data, the
patterns to look for, and how to interpret results. This book focuses on
three broad categories of questions: exploratory, inferential, and
predictive.
Obtaining Data
This stage involves acquiring and understanding how the data were
collected. The type of research questions that can be answered depends
significantly on the data collection method. When data are costly and
difficult to gather, a precise research question is defined first. When data
are abundant and easily accessible, the analysis might start with obtaining
data, exploring it, and then formulating questions.
Lessons Learned:
Key Takeaways:
Big Data Limitations: More data does not always mean better
insights. Data scope and quality are crucial.
Combining Approaches: Integrated methods often yield better
results than single data sources.
Framework Understanding: Properly aligning data with the
research question is essential to avoid misleading conclusions.
Examples:
1. Wikipedia Contributors:
o Question: Do informal awards increase the activity of
Wikipedia contributors?
o Target Population: Active contributors to Wikipedia (top 1%
of contributors).
o Access Frame: Contributors who hadn't received an informal
incentive recently.
o Sample: 200 randomly selected contributors from the access
frame, observed for 90 days.
2. Election Polling:
o Question: Who will win the election?
o Target Population: Voters in the 2016 US presidential
election.
o Access Frame: Likely voters with landline or mobile phones.
o Sample: People randomly selected via a dialing scheme.
3. Environmental Health:
o Question: How do environmental hazards impact health?
o Target Population: Residents of California.
o Access Frame: Census tracts in California.
o Sample: Census tracts with aggregated data.
Conclusion:
1. Target Population:
o Definition: The entire group of individuals or elements about
which you want to draw conclusions.
o Example: All voters in a country, all residents in a city, or all
users of a social media platform.
o Purpose: The target population is the primary focus of your
study, the group you want to understand or make predictions
about.
2. Access Frame:
o Definition: The subset of the target population that is
accessible for data collection. It includes all the units that you
can realistically reach or measure.
o Example: Voters with registered phone numbers, residents
who visit a specific clinic, or users who are active on the social
media platform in the past month.
o Purpose: The access frame defines the practical boundary
within which you can collect data. It may not perfectly match
the target population due to limitations in data collection
methods.
3. Sample:
o Definition: A subset of units selected from the access frame
to be measured or observed. The sample is used to infer
conclusions about the entire target population.
o Example: 1,000 randomly selected voters from the registered
phone numbers, 500 patients visiting the clinic in a month, or
10,000 active social media users in the past month.
o Purpose: The sample provides the actual data points for
analysis. It should be representative of the access frame to
ensure that conclusions drawn are valid for the target
population.
Example Scenario
Types of Bias:
1. Coverage Bias: Occurs when the access frame doesn't include the
entire target population. For example, a survey via cell-phone calls
excludes those without phones.
2. Selection Bias: Happens when the sampling mechanism favors
certain units. Convenience samples are a common example.
3. Non-response Bias: Involves unit non-response (when selected
individuals don't participate) or item non-response (when specific
questions are unanswered).
4. Measurement Bias: Results from systematic errors in
measurement tools or survey questions.
Types of Variation:
Key Points
Electoral Process: The Electoral College votes determine the president, not
the popular vote. States usually award all their electoral votes to the
candidate who wins the popular vote within the state.
Simulation Study:
Scenario 1: No bias. Polls are representative, with each of the 1,500
sampled voters reflecting actual voter preferences.
Scenario 2: Slight education bias favoring Clinton by 0.5 percentage
points.
Using the urn model, simulations showed how often polls predicted the
correct outcome.
Urn Model:
Simulates election polls by drawing a sample of voters (marbles) from an
urn representing the entire population of voters.
Results are calculated using the multivariate hypergeometric distribution.
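A simplified two-candidate urn-model sketch in R (the vote share and population size are illustrative, not the 2016 figures; third-party votes are ignored, so the ordinary hypergeometric distribution is used rather than the multivariate version):
population  <- 6e6                               # hypothetical number of voters in the urn
trump_votes <- round(population * 0.505)         # hypothetical true Trump share
other_votes <- population - trump_votes
poll_size   <- 1500
sims <- rhyper(100000, trump_votes, other_votes, poll_size)  # Trump marbles in each simulated poll
mean(sims / poll_size > 0.5)                     # fraction of polls predicting a Trump win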
Results:
Without bias, Trump was predicted to win about 60% of the time.
With a small bias, the prediction accuracy dropped, and Trump was
predicted to win only 45% of the time.
Larger samples (12,000 voters) reduced sampling error but did not
eliminate the effect of bias.
Implications:
Bigger polls reduce sampling error but do not address bias.
Pollsters need to improve methods to reduce bias.
Polls remain useful but must account for potential biases.
Conclusion:
Simulation studies can help understand polling accuracy and the effects of
biases. They show that even small biases can significantly impact
predictions, and larger sample sizes do not necessarily overcome these
biases. Improving polling methodologies to account for biases is crucial for
accurate predictions.
Simulation and Data Design: Simulating a Randomized Trial:
Vaccine Efficacy
Randomized Controlled Trials (RCTs) Overview
Simulation Results
The simulation showed that, across 500,000 simulated trials, drawing 117 or
fewer "sick" marbles into the vaccine group of 21,869 was extremely rare if
the vaccine were ineffective.
The rarity of this outcome suggests the vaccine's efficacy.
Calculating Vaccine Efficacy (VE)
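The notes do not spell the calculation out; the standard definition is the relative reduction in risk:
VE = (risk in unvaccinated − risk in vaccinated) / risk in unvaccinated = 1 − (risk in vaccinated / risk in unvaccinated)
For example, if 2% of the placebo group and 0.5% of the vaccine group become sick, VE = 1 − 0.005/0.02 = 0.75, i.e., 75% efficacy.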
Conclusion
The urn model and random assignment in clinical trials help assess
the efficacy of treatments.
Considering the scope and context of data is crucial for accurate
comparisons between different studies.
After understanding these factors, Mayor Duggan retracted his
statement, acknowledging the efficacy and safety of the J&J vaccine.