
CS250: Python for Data Science

This course includes the following units:

 Unit 1: What is Data Science?


 Unit 2: Python for Data Science
 Unit 3: The numpy Module
 Unit 4: Applied Statistics in Python
 Unit 5: The pandas Module
 Unit 6: Visualization
 Unit 7: Data Mining I – Supervised Learning
 Unit 8: Data Mining II – Clustering Techniques
 Unit 9: Data Mining III - Statistical Modeling
 Unit 10: Time Series Analysis

Upon successful completion of this course, you will be able to:

 use Google Colaboratory notebooks to implement and test Python programs;
 explain how Python programming is relevant to data science;
 construct and operate on arrays using the numpy module;
 apply Python modules for basic statistical computation;
 construct and operate on dataframes using the pandas module;
 apply the pandas module to interact with spreadsheet software;
 implement Python scripts for visualization using arrays and dataframes;
 apply the scikit-learn module to perform data mining;
 explain techniques for supervised and unsupervised learning;
 apply supervised learning techniques;
 apply unsupervised learning techniques;
 apply the scikit-learn module to build statistical models;
 implement Python scripts to perform regression analyses;
 apply the statsmodels module to build and analyze models for time
series analysis; and
 explain similarities and differences between AR, MA, and ARIMA
models.

1- History:
Data science is a discipline that incorporates varying degrees of Data
Engineering, Scientific Method, Math, Statistics, Advanced Computing,
Visualization, Hacker mindset, and Domain Expertise. A practitioner of Data
Science is called a Data Scientist. Data Scientists solve complex data analysis
problems.
Origins
The term "Data Science" was coined at the beginning of the 21st Century. It is
attributed to William S. Cleveland who, in 2001, wrote "Data Science: An
Action Plan for Expanding the Technical Areas of the Field of Statistics".

Development:
During the dot-com bubble (1998-2000), hard drives became inexpensive, leading
corporations and governments to buy many. As per a corollary of Parkinson's Law, data
expands to fill available disk space, creating a cycle of buying more disks and accumulating
more data, resulting in big data. Big data is vast and complex, requiring special management
tools. Companies like Google, Yahoo!, and Amazon developed cloud computing to handle
this, with MapReduce and Hadoop being key innovations. Hadoop's complexity led to the
creation of mass analytic tools with simpler interfaces, like recommender systems and
machine learning, requiring specialized knowledge. This specialization gave rise to data
scientists who analyze big data for new insights. Data science, ideally done in teams, tackles
large-scale problems that single individuals cannot manage alone. In summary: cheap disks →
big data → cloud computing → mass analytic tools → data scientists → data science teams
→ new analytic insights

(The "dot-com" bubble of 1998-2000 was a period of excessive speculation and investment in internet-
based companies, fueled by the rapid growth and adoption of the internet. Many investors poured
money into startups with ".com" in their names, leading to a surge in stock prices. However, many of
these companies had unsustainable business models and eventually failed. The bubble burst in 2000,
leading to a significant stock market crash and substantial financial losses for investors.)

(Parkinson's Law is an adage that states, "Work expands to fill the time available for its completion."
This means that if you allocate more time to a task, it will take longer to complete, often due to
procrastination, inefficient work habits, or unnecessary complexities.

A corollary of Parkinson's Law applies to data storage: "Data expands to fill the available disk space."
This means that as more storage becomes available, the amount of data stored increases accordingly,
often leading to more data accumulation than initially expected or necessary. The law highlights how
resources, whether time or storage space, tend to get fully utilized, often leading to inefficiency.)
Data Engineering:
Data Engineering is a key component of data science that involves acquiring,
ingesting, transforming, storing, and retrieving data, often accompanied by
adding metadata. A data engineer must manage these interconnected tasks as a
whole, understanding how data storage and retrieval impact ingestion and
processing.
Key Processes in Data Engineering:

1. Acquiring: Identifying data sources and obtaining data, which can come from
various places and in different formats, such as text, images, or sensor data.
2. Ingesting: Moving data into computer systems for analysis, considering data
volume, speed, and storage capacity.
3. Transforming: Converting raw data into a usable format for analysis, often
from CSV to structured formats like spreadsheets.
4. Metadata: Adding data about data, such as collection time, location, and other
relevant information, to enhance understanding and usability.
5. Storing: Choosing the appropriate storage system, like file systems for speed
or databases for functionality, based on data and analysis needs.
6. Retrieving: Extracting and querying data for analysis and visualization,
ensuring storage strategies align with retrieval requirements.
Example: For highway data, sensors might collect speed data in CSV format. This
data is ingested, transformed into a structured format, metadata is added, stored
in a database, and retrieved for analysis, such as calculating average speeds
during rush hours.
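To make this pipeline concrete, here is a minimal Python/pandas sketch of the highway example. The file name, column names, and the choice of SQLite for storage are assumptions made purely for illustration, not part of the course material.

# A minimal sketch of the highway-speed example using pandas.
# The file name and column names (sensor_id, timestamp, speed_mph) are
# hypothetical; real sensor feeds would differ.
import sqlite3
import pandas as pd

# Ingest: read raw CSV data produced by roadside sensors
df = pd.read_csv("highway_speeds.csv", parse_dates=["timestamp"])

# Transform: keep plausible readings and derive an hour-of-day column
df = df[df["speed_mph"].between(0, 120)]
df["hour"] = df["timestamp"].dt.hour

# Metadata: record dataset-level information (attrs is a simple in-memory store)
df.attrs["source"] = "I-80 roadside sensors"
df.attrs["collected"] = "2024-06"

# Store: write the cleaned table to a lightweight SQLite database
with sqlite3.connect("highway.db") as conn:
    df.to_sql("speeds", conn, if_exists="replace", index=False)

# Retrieve/analyze: average speed during the morning rush hour (7-9 am)
rush = df[df["hour"].between(7, 9)]
print(rush.groupby("sensor_id")["speed_mph"].mean())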

The Scientific Method


The Scientific Method is the scientific foundation of data science, involving the
acquisition of new knowledge through reasoning and empirical evidence from
testing hypotheses via repeatable experiments.
Key Elements of the Scientific Method:
1. Reasoning Principles:
- Inductive Reasoning: Deriving general principles from specific observations.
- Deductive Reasoning: Drawing specific conclusions from general principles.
- Example: "All known life depends on liquid water" (inductive) and "Socrates is
mortal" (deductive).

2. Empirical Evidence:
- Data obtained from observation or experiment, as opposed to logical
arguments or myths.
- Example: Galileo's telescope observations supporting Copernicus's
heliocentric theory versus Aristotle's geocentric model.

3. Hypothesis Testing:
- Involves two propositions: the null hypothesis (current understanding) and the
alternative hypothesis (new proposition).
- Example: In a trial, "the defendant is not guilty" (null hypothesis) and "the
defendant is guilty" (alternative hypothesis).

4. Repeatable Experiments:
- Methodical procedures that verify, falsify, or establish the validity of a
hypothesis, relying on repeatable methods and logical analysis.
- Example: Galileo's inclined plane experiment disproving Aristotle's theory of
falling bodies.
Role in Data Science:
Data scientists use the Scientific Method to critically evaluate evidence,
understand reasoning behind conclusions, test hypotheses, and ensure
experiments can be replicated to validate results.
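As a rough illustration of hypothesis testing in Python, the sketch below runs a two-sample t-test with scipy. The measurements are invented, and scipy is just one common choice; the point is the null hypothesis ("the two group means are equal") versus the alternative ("they differ").

# Hypothesis testing illustration (invented data, not from the course text).
from scipy import stats

group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
group_b = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2]

# Null hypothesis: equal means; alternative hypothesis: the means differ
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# A small p-value (e.g., < 0.05) is evidence against the null hypothesis
if p_value < 0.05:
    print("Reject the null hypothesis: the group means differ.")
else:
    print("Fail to reject the null hypothesis.")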

Math:
Mathematics, alongside statistics, forms the intellectual core of data science,
focusing on the study of quantity, structure, space, and change, especially when
applied to practical problems.

Key Elements of Mathematics in Data Science:

1. Quantity:
- Numbers: Representing data with various types of numbers (integers, fractions, real numbers, complex numbers).
- Example: Measuring highway lengths in miles (integers) and using arithmetic to analyze these quantities.

2. Structure:
- Internal Structure: Identifying and analyzing the internal structure of data through equations and relationships.
- Example: Understanding the structure of speed limits or lane widths on highways using algebra.

3. Space:
- Spatial Components: Investigating and representing the spatial aspects of data in two- or three-dimensional space.
- Example: Mapping highway segments' locations using latitude and longitude or analyzing the smoothness of highway surfaces with geometry and trigonometry.

4. Change:
- Dynamic Relationships: Describing how relationships between data points change over time or distance.
- Example: Studying how the sharpness of curves in a highway changes with speed limits or how asphalt depth affects traffic flow using calculus.

Role in Data Science:

Data scientists use mathematics to quantify and analyze data, understand its
structure, represent spatial relationships, and describe changes over time or
distance, enabling them to solve complex practical problems.

Statistics:
Statistics, together with mathematics, forms the intellectual foundation of data
science. It involves the collection, organization, analysis, and interpretation of
data to discover patterns, create models, and make future predictions.

Key Elements of Statistics in Data Science:

1. Collection:
- Designing Research: Creating research and experimental designs to ensure
data is collected in a way that allows valid conclusions.
- Example: Working with data engineers to develop procedures for data
generation.
2. Organization:
- Coding and Archiving Data: Ensuring data is coded, archived, and documented
appropriately for analysis and sharing.
- Example: Creating a data dictionary to specify variables, valid values, and
data formats, which data engineers use to develop a database schema.
3. Analysis:
- Summarizing and Modeling: Using descriptive and inferential statistics to
summarize data, test hypotheses, and create models.
- Example: Analyzing data to determine if there are significant differences
between groups or to identify correlations.
4. Interpretation:
- Reporting Results: Collaborating with subject matter experts and visual artists
to present data in comprehensible ways.
- Example: Creating tables and graphs to report results to stakeholders in an
understandable manner.

Role in Data Science:


Statisticians in data science ensure data is collected, organized, analyzed, and
interpreted correctly, enabling the discovery of insights, patterns, and
relationships that inform decision-making and predictions.

Advanced computing:
Advanced computing is the heavy lifting of data science, encompassing the
design, coding, testing, debugging, and maintenance of software to perform
specific operations.

Key Elements of Advanced Computing in Data Science:

1. Software Design:
o Process: Transforming software purpose and specifications into a
detailed plan, including components and algorithms.
o Example: Using modeling languages like UML to create software
designs, which programmers implement by writing source code.

2. Programming Language:
o Definition: Artificial languages designed to communicate
instructions to computers, controlling their behavior and external
devices.
o Example: Choosing between low-level languages (e.g., assembly)
and high-level languages (e.g., Java, Python, C++) to solve specific
problems.

3. Source Code:
o Definition: Collections of computer instructions written in human-
readable languages, translated into machine code for execution.
o Example: Using IDEs to type, debug, and execute source code, such
as the traditional "Hello World" program in Java and Python.

Role in Data Science:


Programmers in data science create, optimize, and maintain software that
processes data, leveraging their expertise in programming languages and
software design to solve complex computational problems efficiently.

Visualization:
Visualization is the "pretty face" of data science, focusing on the visual
representation of abstract data to enhance human understanding and cognition.

Key Elements of Visualization in Data Science:

1. Creative Process:
- Definition: Creating something original and worthwhile through divergent
thinking, conceptual blending, and honing.
- Role: Visual artists in data science explore multiple ways to present data and
refine visualizations through iterations.

2. Data Abstraction:
- Definition: Handling data meaningfully by visualizing manipulations like
aggregations, summarizations, correlations, and predictions, rather than raw
data.
- Role: Simplifying data content to make visualizations meaningful in the
context of the problem being addressed.

3. Informationally Interesting:
- Definition: Creating visuals that are not only informative but also aesthetically
pleasing and engaging, often incorporating elements of beauty such as symmetry
and harmony, with touches of surprise.
- Role: Making visualizations attractive to capture and retain human attention,
enhancing the communication of data insights.

Example:
A partial map of the Internet from early 2005 demonstrates effective
visualization. Each line represents connections between two IP addresses,
abstracting a subset of internet data. Through numerous iterations, a harmonious
color scheme and overall symmetry with surprising details (bright "stars") were
achieved, making the map both informative and visually engaging in the context
of understanding the World Wide Web.
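A minimal matplotlib sketch of this idea: plotting an aggregation (invented monthly averages) rather than raw data, with a title and axis labels chosen to make the abstraction readable.

# Visualizing an aggregation rather than raw data.
# The monthly averages below are invented purely for illustration.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
avg_speed = [61.2, 60.8, 63.5, 64.1, 62.9, 65.0]  # aggregated values, not raw readings

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(months, avg_speed, marker="o", color="steelblue")
ax.set_title("Average highway speed by month (illustrative data)")
ax.set_xlabel("Month")
ax.set_ylabel("Speed (mph)")
plt.tight_layout()
plt.show()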

The hacker mindset:


The hacker mindset is the "secret sauce" of data science, emphasizing creativity,
boldness, and persistence in solving data problems. According to Wikipedia,
hacking involves modifying, building, or creating software and hardware to
enhance performance or add new features. For data scientists, this mindset
extends to inventing new models, exploring new data structures, and creatively
combining multiple disciplines.

Key aspects of the hacker mindset include:

 Innovation: Creating and modifying systems and tools to improve functionality and solve unique problems.
 DIY Approach: Embracing a do-it-yourself attitude to develop
unconventional solutions.
 Collaboration: Working in hacker-like spaces to share ideas and develop
new analytic solutions collectively.
 Examples:
o Steve Wozniak's Apple I: A hand-made computer built from
surplus parts, leading to the formation of Apple Inc.
o Carnegie Mellon Internet Coke Machine: An internet-connected
vending machine that allowed students to check the temperature of
sodas remotely.
The hacker part of a data scientist asks, "Do we need to modify our tools or
create something new to solve our problem?" and "How can we combine different
disciplines to reach an insightful conclusion?"

Domain Expertise:
Domain Expertise is the glue that holds data science together. It involves having
proficiency and special knowledge in a particular area, known as subject matter
expertise (SME). Any field, such as medicine, politics, sciences, marketing,
information security, demographics, and literature, can be subject to data
science inquiry. A successful data science team must include at least one domain
expert.

Key points about domain expertise:

 Importance of Problem Identification: Knowing which problems are significant to solve.
 Defining Sufficient Answers: Understanding what constitutes an
adequate solution.
 Customer Insight: Knowing what information customers want and how to
present it effectively.
 Examples:
o Geographic Distribution of Soft Drink Terms: Edwin Chen at
Twitter visualized terms like "soda," "pop," and "coke" used in
different US regions. Understanding why these linguistic differences
exist requires insights from sociology, linguistics, history, and
anthropology.
o Nate Silver's Political Analysis: As a statistician and expert in US
politics, Nate Silver combines data with explanatory insights to
provide meaningful analyses, such as in his blog post "How
Romney’s Pick of a Running Mate Could Sway the Outcome."

The domain expert in data science asks, "What is important about the problem
we are solving?" and "What exactly should our customers know about our
findings?"

Assignment/Exercise Summary
Objective: Familiarize yourself with the R programming environment.

Steps:

1. Form Study Groups: Get into groups of 3 to 4 students.


2. Collaborative Learning: Work together in study sessions, explaining concepts to each other
and helping each other understand the material.
3. Google's R Style Guide:
o Print and read over the guide.
o Keep it for future reference as it will make more sense over time.

https://web.stanford.edu/class/cs109l/unrestricted/resources/google-style.html
4. Online Resources:
o Search for "introduction to R," "R tutorial," "R basics," and "list of R commands."
o Choose 4-5 websites and work through the first few examples on each site.
o Switch to another site if the current one becomes too confusing.

https://www.w3schools.com/r/default.asp

https://www.tutorialspoint.com/r/index.htm

https://www.codecademy.com/learn/learn-r

https://www.programiz.com/r

5. R Commands:
o Try the following commands in R:

library(help="utils")
library(help="stats")
library(help="datasets")
library(help="graphics")
demo()
demo(graphics)
demo(persp)

6. Short Program:
o Write a short program (5-7 lines) that executes without errors.
o Include the names of all contributors in the comment section.

7. Documentation:
o List the websites used, indicating which was the most helpful.
o List the top 10 unanswered questions the team has at the end of the study session.
The Impact of Data Science:
This chapter highlights the revolutionary impact of data science on different
sectors such as baseball, health, and robotics.

Moneyball

 Background: "Moneyball" is a book by Michael Lewis (2003) and a film


(2011) about the Oakland Athletics' baseball team and its general
manager, Billy Beane.
 Concept: The team used a sabermetric approach, relying on rigorous
statistical analysis to evaluate players, focusing on on-base and slugging
percentages over traditional metrics.
 Impact: Despite a limited budget, the Oakland A's competed successfully
with richer teams. This approach challenged traditional baseball wisdom
and led other MLB teams to adopt similar strategies.
 Themes: Insider vs. outsider dynamics, information democratization, and
the drive for efficiency in capitalism.
23 and Me

 Background: 23andMe is a personal genomics and biotechnology company that provides genetic testing and analysis.
 Services: Customers submit a saliva sample for DNA analysis, receiving
information on traits, genealogy, and health risks.
 Impact: The company has a significant database, aiding in research
initiatives and correlations between genetics and personal/social
behaviors. It has contributed to advances in understanding genetic
predispositions to diseases like Parkinson's.

Google's Driverless Car

 Background: Google's project, led by engineer Sebastian Thrun, aims to develop autonomous vehicles.
 Technologies: The system integrates data from Google Street View, AI
software, video cameras, LIDAR, radar, and position sensors.
 Achievements: Autonomous vehicles have driven thousands of miles with
minimal human intervention. This project has influenced legislation in
states like Nevada, Florida, and California.
 Challenges: The rapid advancement of technology outpaces existing laws,
necessitating new regulations for autonomous vehicles.

Assignment/Exercise

 Task: In groups, watch "Moneyball" and take notes on the impact of data
science in the film.
 Brainstorm: Discuss other areas where data science could be impactful
and consider potential counter-arguments.
 Presentation: Create a 4-slide presentation covering:
1. Chosen area of life.
2. How data science would make a difference.
3. Counter-arguments.
4. Group's conclusion on the viability of data science in that area.
Section 1

00:00:02

The speaker in the video is introducing the concept of data science for
beginners, emphasizing the importance of gathering data, which includes
both numerical and categorical information. They explain the distinction
between numbers (amounts or counts) and names (categorical variables),
highlighting that even slight changes in numbers may still be close to the
original value, while changing a name slightly can result in a completely
different entity. The speaker also discusses the complexity of data that blurs
the line between numbers and names, such as phone numbers and zip codes.
They mention the significance of identification numbers and the ability to
convert names with order into numbers for machine learning algorithms. The
speaker encourages viewers to explore tools for data collection and analysis,
referencing the Cortana analytics process. They stress the importance of
asking precise questions that can be answered with specific data, ensuring
that the target information is included in the dataset. If the target is not
present, they advise obtaining more data. Additionally, they explain the
process of organizing data into a table with one target value per row for
analysis.
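A small, hypothetical pandas sketch of the idea of converting names with a natural order into numbers for a machine learning algorithm, as the video describes; the column and category names are invented.

# Converting an ordered categorical ("name") column into numbers.
import pandas as pd

df = pd.DataFrame({"shirt_size": ["small", "large", "medium", "small", "large"]})

# Declare the natural order, then map each name to an integer code
sizes = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
df["shirt_size"] = df["shirt_size"].astype(sizes)
df["size_code"] = df["shirt_size"].cat.codes  # small=0, medium=1, large=2

print(df)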

Section 2

00:07:15

In this part of the video, the speaker discusses the process of organizing data
to have one instance of the target variable for each row. They explain how
data that doesn't naturally occur once per day, such as total users or
quantities that remain constant for a period, needs to be aggregated or
distributed to align with the rows. The speaker also mentions the importance
of computing values like days since a specific event, gathering external data,
estimating missing information, and checking data quality. They provide an
example of cleaning up data related to superheroes and super villains,
ensuring that all values are formatted consistently for machine learning
algorithms to interpret correctly. The speaker emphasizes the need to
thoroughly review and understand each column of the data to ensure its
accuracy and quality.
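A brief, hypothetical pandas sketch of that aggregation step: collapsing an event log into one row per day so there is one instance of the target per row. The data are invented.

# Reshaping data so there is one row per target value (here, one row per day).
import pandas as pd

events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 17:30",
        "2024-01-02 08:15", "2024-01-02 12:00", "2024-01-02 19:45",
    ]),
    "users": [120, 95, 130, 80, 110],
})

# Aggregate to one row per day: total users seen that day
daily = events.set_index("timestamp").resample("D")["users"].sum()
print(daily)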

Section 3

00:14:09

The speaker discusses the process of cleaning and interpreting data for
machine learning algorithms. They mention how they clean up data columns
to ensure uniform representation, such as identifying secret identities as yes
or no and categorizing an individual's ability to fly based on numerical values.
The speaker emphasizes the importance of unifying data standards for
effective interpretation by machine learning algorithms. They also touch on
feature engineering, which involves manipulating existing features to improve
predictive capabilities. An example is given where combining departure and
arrival times of a subway train helps predict the maximum speed reached
between stops. The speaker highlights the significance of data interaction in
enhancing predictive models and the concept of coefficient of determination
to evaluate model performance.

Section 4

00:21:40

The speaker discusses the process of feature engineering in machine learning, where they create a new feature by subtracting two existing
features to improve the predictive power of their model. They explain that
transforming features can help algorithms extract the necessary information
from the data. The video also touches on different ways of feature
engineering, including data-specific and domain-specific techniques. The
speaker briefly mentions deep learning and its ability to automatically learn
features from data. They emphasize the importance of asking sharp
questions, ensuring data quality, and using machine learning to answer
specific types of questions, such as regression for "how much" or "how many"
questions, classification for assigning categories, and grouping data into
similar clusters. Lastly, the speaker mentions the importance of identifying
anomalies or unusual patterns in the data.
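A hedged sketch of the subway example in pandas: subtracting the departure time from the arrival time yields a new travel-time feature, which can then be related to distance. The column names and values are invented.

# Feature engineering: combine two existing features into a more informative one.
import pandas as pd

trips = pd.DataFrame({
    "departure": pd.to_datetime(["2024-01-01 08:00:00", "2024-01-01 08:10:00"]),
    "arrival":   pd.to_datetime(["2024-01-01 08:04:30", "2024-01-01 08:13:45"]),
    "distance_km": [2.1, 1.6],
})

# New feature: travel time in seconds between stops
trips["travel_seconds"] = (trips["arrival"] - trips["departure"]).dt.total_seconds()

# A model could relate distance and travel time to estimate speed between stops
trips["avg_speed_kmh"] = trips["distance_km"] / (trips["travel_seconds"] / 3600)
print(trips[["travel_seconds", "avg_speed_kmh"]])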

Section 5

00:29:01

The video discusses the importance of analyzing unusual patterns in credit card transactions to detect potential issues. It also explains how
reinforcement learning can help machines make decisions in low-consequence
scenarios. The example of predicting the cost of a diamond based on its
weight is used to illustrate the concept of creating a model through linear
regression. The model simplifies the data by fitting a line through the data
points to make predictions.
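A minimal scikit-learn sketch of this idea: fitting a line through invented (weight, price) points, predicting from it, and reporting the coefficient of determination mentioned earlier. The numbers are illustrative only.

# Linear regression on invented diamond data (weight in carats vs. price).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

weights = np.array([[0.5], [0.8], [1.0], [1.2], [1.5], [2.0]])  # carats
prices = np.array([1500, 2900, 4100, 5300, 7200, 10400])        # dollars (made up)

model = LinearRegression().fit(weights, prices)
predicted = model.predict(weights)

print("Predicted price of a 1.35 carat diamond:", model.predict([[1.35]])[0])
print("Coefficient of determination (R^2):", r2_score(prices, predicted))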

Section 6

00:36:23

The video segment discusses the process of creating a linear regression model without using math or computer programs to estimate the cost of a 1.35 carat diamond. It explains the concept of a confidence interval and how
it helps in making more accurate predictions. The video emphasizes the
importance of having enough data to make informed decisions and how
interpreting the data can lead to making better choices. It also highlights the
importance of using the model or analysis results in practical ways, such as
creating a web service, making decisions, setting prices, publishing code,
writing reports, or building dashboards. Additionally, it mentions the need to
be cautious of potential gaps in using machine learning algorithms, as they
assume the world doesn't change, and if it does, the data may become
invalid.

Section 7

00:43:20

The speaker discusses three key gaps in machine learning. The first gap
highlights the importance of ensuring that data remains relevant in a
changing world, using the example of the impact of the September 11th
attacks on predictions made just before the event. The second gap
emphasizes the challenge of collecting sufficient data for certain complex
phenomena, such as global climate change. The third gap points out that
machine learning cannot determine causation, using examples like the
correlation between cheese consumption and deaths by bedsheet
entanglement. The speaker concludes by highlighting the role of human
insight and judgment in filling these gaps and making intuitive leaps in data
analysis.

WHAT IS CORTANA ANALYTICS PROCESS?

The Cortana Analytics Process (CAP) is a comprehensive framework provided by Microsoft for developing and deploying advanced analytics solutions. It leverages various Microsoft
technologies and services to create data-driven applications and insights. The process is
designed to help organizations turn data into intelligent action. Here are the key components
and steps involved in the Cortana Analytics Process:

1. Information Management

This stage focuses on collecting, storing, and managing data from various sources. It includes:

 Data Ingestion: Collecting data from various sources, such as databases, IoT devices, and
external APIs.
 Data Storage: Storing the ingested data in scalable and reliable storage solutions like Azure
Data Lake, SQL Database, or Blob Storage.
 Data Preparation: Cleaning, transforming, and organizing data for analysis using tools like
Azure Data Factory or Azure Databricks.

2. Big Data Stores

This component involves storing and processing large volumes of data. Key technologies
include:

 Azure Data Lake: A hyper-scale repository for big data analytics workloads.
 Azure SQL Data Warehouse: A fully managed, petabyte-scale data warehouse service.
 Azure Cosmos DB: A globally distributed, multi-model database service.
3. Machine Learning and Analytics

In this stage, advanced analytics and machine learning models are developed and applied to
the data. It includes:

 Azure Machine Learning: A service for building, training, and deploying machine learning
models.
 Azure HDInsight: A fully managed, full-spectrum, open-source analytics service for
enterprises.
 Azure Databricks: An Apache Spark-based analytics platform optimized for Azure.

4. Dashboards and Visualization

This involves creating interactive dashboards and visualizations to present insights derived
from the data. Tools include:

 Power BI: A suite of business analytics tools to analyze data and share insights.
 Azure Synapse Analytics: Integrates big data and data warehousing to offer dashboards and
interactive reports.

5. Intelligence and Insights

This stage focuses on deriving actionable insights and embedding intelligence into
applications. It includes:

 Azure Cognitive Services: A collection of APIs for adding cognitive features like vision,
speech, and language understanding to applications.
 Cortana Intelligence Suite: Integrates various analytics services to deliver comprehensive
intelligence solutions.

6. Action and Automation

The final stage involves automating actions based on insights and integrating them into
business processes. Key services include:

 Azure Logic Apps: A cloud service for automating workflows and integrating apps, data, and
services.
 Microsoft Flow: Now called Power Automate, it automates workflows between apps and
services to synchronize files, get notifications, and collect data.

Example Use Case

Consider a retail company wanting to improve its customer experience. Using the Cortana
Analytics Process, the company can:

1. Ingest customer interaction data from various touchpoints.


2. Store the data in Azure Data Lake.
3. Prepare the data using Azure Data Factory.
4. Analyze customer behavior and predict trends using Azure Machine Learning.
5. Visualize the insights using Power BI dashboards.
6. Automate personalized marketing campaigns using Azure Logic Apps based on the insights.

References

 Microsoft Cortana Intelligence Suite


 Azure Machine Learning
 Azure Data Lake
 Power BI
What's a typical Data Science process?

The process starts with an interesting question, often aligned to business goals. Available data is then cleaned and filtered. This may also involve collecting new data as relevant to the question. Data is analyzed to discover patterns and outliers. A model is built and validated, often using machine learning algorithms. The model is often refined in an iterative manner. The final step is to communicate the results. The results may inspire the data scientist to ask and investigate further questions.
As a beginner, what should be my learning path to become a data scientist?

One approach is to be practical and hands-on from the outset. Pick a topic you're passionate and curious about. Research available datasets. Tweet and discuss so that you get clarity. Start coding. Explore. Analyze. Build data pipelines for large datasets. Communicate your results. Repeat this with other datasets and build a public portfolio. Along the way, pick up all the skills you need.
You may instead prefer a more formal approach. You can learn the basics of languages such as R and Python. Follow this with additional packages/libraries particular to data science: (R) dplyr, ggplot2; (Python) NumPy, Pandas, matplotlib. Get introduced to statistics. From this foundation, start your journey into Machine Learning. To relate these to business goals, some recommend the book Data Science for Business by Provost and Fawcett. But you should put all this knowledge into practice by taking up projects with datasets and problems that interest you. At the University of Wisconsin, statistics is covered before programming. To become interdisciplinary, you may choose to learn aspects of data engineering (data warehousing, Big Data) and ethics.

Could you give some tips for a budding data scientist?


The following tips might help:

 Data science is about answering scientific questions with the help of data. Don't focus just on the aspect of handling data, dataset size or the tools.
 Understand the business, its products, customers and strategies. This will help you ask the right questions. Have constant interaction with business counterparts. Communicate with them in a language they can understand.
 Consider alternative approaches before selecting one that suits the problem. Likewise, select a suitable metric. Sometimes derived metrics may yield better prediction compared to available metrics.
 Understand the pros and cons of various ML algorithms before selecting one for your problem.
 Craft machine learning models from scratch. Don't just rely on premade templates and libraries. Test them to their limits to understand what's going to work, and where.
 Find a compromise between speed and perfection. On-time delivery should be preferred over extreme accuracy.
 Useful data is more important than lots of data. Use multiple data sources to better understand data and its discrepancies.
 Be connected with the data science community, be it via blogs, meetups, conferences or hackathons.
 Practice with open datasets. Learn from the solutions of others.

Asking a Question
Asking questions is central to data science, as different questions require
different analyses. For example, "How have house prices changed over
time?" differs from "How will this new law affect house prices?".
Understanding the research question determines the necessary data, the
patterns to look for, and how to interpret results. This book focuses on
three broad categories of questions: exploratory, inferential, and
predictive.

Exploratory Questions: These aim to uncover information about existing data. For instance, using environmental data to ask if average global
temperatures have risen in the past 40 years is exploratory. The goal is to
summarize and interpret trends in the data without predicting future
trends.

Inferential Questions: These quantify whether observed trends in a sample hold true for a larger population. For example, using data from a
sample of hospitals to ask if air pollution correlates with lung disease for
the entire US population. Inferential questions seek to infer trends beyond
the sample data.
Predictive Questions: These aim to forecast trends for individuals based
on patterns observed in data. For example, predicting how likely someone
is to vote based on their income and education.

As data analysis progresses, refining research questions is common. Each refinement requires reconsidering the type of question being asked.

Obtaining Data
This stage involves acquiring and understanding how the data were
collected. The type of research questions that can be answered depends
significantly on the data collection method. When data are costly and
difficult to gather, a precise research question is defined first. When data
are abundant and easily accessible, the analysis might start with obtaining
data, exploring it, and then formulating questions.

Understanding the Data


After obtaining data, the next step is to understand it through exploratory
data analysis (EDA). This includes creating plots to identify patterns and
summarizing the data visually. It also involves identifying and addressing
data issues such as missing or anomalous values. This stage is iterative,
often leading back to revising research questions or obtaining additional
data.

Understanding the World


In this stage, conclusions are drawn about the larger population. This
involves using statistical techniques like A/B testing, confidence intervals,
and regression models. The goal is to quantify the extent to which trends
observed in the sample data can be generalized to the entire population.
This stage is essential for answering inferential and predictive research
questions.

Google Flu Trends:


Digital Epidemiology and GFT: Digital epidemiology utilizes data
generated outside the traditional public health system to analyze disease
patterns and health dynamics. Google Flu Trends (GFT) was an early
example, launched in 2007. By analyzing search queries related to flu
symptoms, GFT aimed to estimate flu cases in real time. Initially, GFT
generated excitement about the potential of big data in public health.

Failures of GFT: Despite early success, GFT struggled to maintain accuracy. During the 2011-2012 flu season, it frequently overestimated flu
cases compared to traditional data from the CDC, which collects data from
labs nationwide. Over 108 weeks, GFT overestimated CDC figures 100
times. A model using three-week-old CDC data and seasonal trends proved
more reliable than GFT’s real-time estimates.

Lessons Learned:

 Data Quality: GFT's reliance solely on search queries overlooked critical information available through traditional methods.
 Integration of Data Sources: Combining GFT data with CDC data
improved predictions, highlighting the value of integrating multiple
data sources.
 Importance of Data Scope: The GFT example underscores the
need to understand the relationship between data, the subject of
investigation, and the research question. Misalignment can lead to
inaccurate conclusions and overstated findings.

Key Takeaways:

 Big Data Limitations: More data does not always mean better
insights. Data scope and quality are crucial.
 Combining Approaches: Integrated methods often yield better
results than single data sources.
 Framework Understanding: Properly aligning data with the
research question is essential to avoid misleading conclusions.

Questions and Data Scope: Target Population, Access Frame, Sample:
Key Points:

1. Initial Steps in Data Life Cycle: Begin by expressing the question of interest in the context of the subject area and consider its
connection to the collected data. This step is essential before
analysis or modeling to avoid misalignment between the question
and the data.
2. Target Population: The group you aim to describe and draw
conclusions about. An element or unit in this population can be a
person, tweet, voter, etc.
3. Access Frame: The accessible subset of the target population.
Ideally, it aligns perfectly with the target population, but often it
doesn't. Some units in the access frame may not belong to the
population, and some units in the population may not be accessible.
4. Sample: A subset of units from the access frame used for
measurement and analysis. The sample should be representative of
the target population to ensure accurate conclusions.
5. Representativeness: The alignment between the access frame
and the target population, and the method of selecting units, are
crucial for representativeness. Bias in sampling methods can lead to
unrepresentative data.
6. Time and Place Considerations: Temporal and spatial patterns in
data need consideration. For example, drug trial effectiveness or
environmental health data may vary by location and time, affecting
conclusions.
7. Data Collection Purpose: Understand who collected the data and
why, especially with passively collected data. This helps in assessing
the suitability of the data for the question at hand.

Examples:

1. Wikipedia Contributors:
o Question: Do informal awards increase the activity of
Wikipedia contributors?
o Target Population: Active contributors to Wikipedia (top 1%
of contributors).
o Access Frame: Contributors who hadn't received an informal
incentive recently.
o Sample: 200 randomly selected contributors from the access
frame, observed for 90 days.

2. Election Polling:
o Question: Who will win the election?
o Target Population: Voters in the 2016 US presidential
election.
o Access Frame: Likely voters with landline or mobile phones.
o Sample: People randomly selected via a dialing scheme.

3. Environmental Health:
o Question: How do environmental hazards impact health?
o Target Population: Residents of California.
o Access Frame: Census tracts in California.
o Sample: Census tracts with aggregated data.

Conclusion:

Understanding the connection between the target population, access frame, and sample is essential for valid data analysis. Consider the scope
and limitations of your data, including potential biases,
representativeness, and the context of data collection, to ensure accurate
and meaningful conclusions.

Here's a concise explanation of the differences between the target population, access frame, and sample:

1. Target Population:
o Definition: The entire group of individuals or elements about
which you want to draw conclusions.
o Example: All voters in a country, all residents in a city, or all
users of a social media platform.
o Purpose: The target population is the primary focus of your
study, the group you want to understand or make predictions
about.

2. Access Frame:
o Definition: The subset of the target population that is
accessible for data collection. It includes all the units that you
can realistically reach or measure.
o Example: Voters with registered phone numbers, residents
who visit a specific clinic, or users who are active on the social
media platform in the past month.
o Purpose: The access frame defines the practical boundary
within which you can collect data. It may not perfectly match
the target population due to limitations in data collection
methods.

3. Sample:
o Definition: A subset of units selected from the access frame
to be measured or observed. The sample is used to infer
conclusions about the entire target population.
o Example: 1,000 randomly selected voters from the registered
phone numbers, 500 patients visiting the clinic in a month, or
10,000 active social media users in the past month.
o Purpose: The sample provides the actual data points for
analysis. It should be representative of the access frame to
ensure that conclusions drawn are valid for the target
population.

Visualizing the Relationships

 Target Population: Broadest scope (all potential units of interest).


 Access Frame: Narrower scope (those units you can actually
reach).
 Sample: Narrowest scope (units you actually measure or observe).

Example Scenario

 Target Population: All voters in a country.


 Access Frame: Voters with registered phone numbers.
 Sample: 1,000 randomly selected voters from the registered phone
numbers.

Importance of Each Component

 Ensuring the access frame closely matches the target population helps minimize biases and improves the representativeness of the sample.
 A sample that is representative of the access frame allows for
accurate inferences about the target population.
Understanding these differences is crucial for designing robust data
collection methods and ensuring the validity of your analysis and
conclusions.

Questions and Data Scope: Accuracy Summary:


In data analysis, ensuring accuracy is crucial, especially when a census or
measurement aims to capture an entire population. Ideal scenarios, such
as perfect questionnaires or flawless instruments, are rare. Most situations
require quantifying the accuracy of measurements to generalize findings.
Accuracy is divided into bias and variance (precision). Bias represents
systematic errors, while variance indicates the spread of measurements
around the true value. Reducing bias and variance enhances data
accuracy.

Types of Bias:

1. Coverage Bias: Occurs when the access frame doesn't include the
entire target population. For example, a survey via cell-phone calls
excludes those without phones.
2. Selection Bias: Happens when the sampling mechanism favors
certain units. Convenience samples are a common example.
3. Non-response Bias: Involves unit non-response (when selected
individuals don't participate) or item non-response (when specific
questions are unanswered).
4. Measurement Bias: Results from systematic errors in
measurement tools or survey questions.

Accuracy in Polls: The 2016 US Presidential Election highlighted non-response and measurement biases, leading to inaccurate predictions. Over-representation of college-educated voters and last-minute preference changes skewed poll results.

Types of Variation:

1. Sampling Variation: Arises from using chance to select a sample.


2. Assignment Variation: Results from random assignment of units
in experiments.
3. Measurement Error: Involves variability in repeated
measurements of the same object.

Urn Model Analogy: This model helps estimate variation by using a chance mechanism to draw samples and assign treatments, as demonstrated in a Wikipedia experiment.

Understanding bias and variation, along with employing robust protocols, enhances the accuracy and reliability of data analysis.
Summary: Questions and Data Scope:
Before engaging in data cleaning, exploration, and analysis, it's crucial to
consider the data's source. If you didn't collect the data yourself, ask:

Who collected the data?


Why were the data collected?
These questions help determine if the data are suitable for your analysis.

Scope of the Data


Understanding the temporal and spatial aspects of data collection is
essential:

When were the data collected?


Where were the data collected?
This ensures your findings are relevant and comparable to your context.

Core Questions about Data Scope


What is the target population (or unknown parameter value)?
How was the target accessed?
What methods were used to select samples/take measurements?
What instruments were used and how were they calibrated?
Answering these questions helps evaluate the reliability and
generalizability of your findings.

Framework and Concepts


This chapter introduces terminology and frameworks to assess data
quality, identify bias, and understand variance:

Scope Diagram: Shows the overlap between target population, access frame, and sample.

Dart Board Analogy: Describes an instrument's bias and variance.


Urn Model: Helps in scenarios involving chance mechanisms for sampling,
dividing experimental groups, or taking measurements.
These tools assist in identifying data limitations and judging their
usefulness. The next chapter will delve deeper into the urn model to
quantify accuracy and design simulation studies.

Simulation and Data Design: Simulating Election Polls:


Context
In the U.S. presidential election, the president is chosen by the Electoral
College, with each state having a certain number of votes. Polls help
identify battleground states where the results could be close. The 2016
election saw many incorrect predictions despite accurate outcomes in
most states, highlighting the impact of small biases in polling.
Bias refers to a systematic error or deviation from the true value or the expected result due to some
influence or prejudice. It can affect data collection, analysis, interpretation, and generalization of
results in various fields.

Key Points
Electoral Process: The Electoral College votes determine the president, not
the popular vote. States usually award all their electoral votes to the
candidate who wins the popular vote within the state.

2016 Election Analysis:


Pollsters correctly predicted 46 of 50 states.
The four battleground states (Florida, Michigan, Pennsylvania, Wisconsin)
had narrow margins.
Education bias in polling underestimated Trump's support.

Simulation Study:
Scenario 1: No bias. Polls are representative, with each of the 1,500
sampled voters reflecting actual voter preferences.
Scenario 2: Slight education bias favoring Clinton by 0.5 percentage
points.
Using the urn model, simulations showed how often polls predicted the
correct outcome.

Urn Model:
Simulates election polls by drawing a sample of voters (marbles) from an
urn representing the entire population of voters.
Results are calculated using multivariate hypergeometric distribution.
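A hedged numpy sketch of such an urn-model simulation: drawing 1,500 voters without replacement from a large urn follows a hypergeometric distribution, and repeating the draw shows how often a poll of that size calls a narrow race correctly. The population size and vote split below are illustrative, not the actual election figures.

# Urn-model poll simulation (illustrative numbers, not real election data).
import numpy as np

rng = np.random.default_rng(seed=42)

n_voters = 6_000_000          # size of the "urn" (illustrative)
true_share = 0.503            # true share for candidate A (illustrative narrow margin)
n_a = int(n_voters * true_share)
n_b = n_voters - n_a
sample_size = 1_500           # poll size

# Number of candidate-A voters in each simulated poll
# (sampling without replacement -> hypergeometric distribution)
polls = rng.hypergeometric(ngood=n_a, nbad=n_b, nsample=sample_size, size=100_000)

# How often does the poll call the race correctly (A ahead in the sample)?
correct = np.mean(polls > sample_size / 2)
print(f"Fraction of simulated polls predicting the true winner: {correct:.2f}")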

Results:
Without bias, Trump was predicted to win about 60% of the time.
With a small bias, the prediction accuracy dropped, and Trump was
predicted to win only 45% of the time.
Larger samples (12,000 voters) reduced sampling error but did not
eliminate the effect of bias.

Implications:
Bigger polls reduce sampling error but do not address bias.
Pollsters need to improve methods to reduce bias.
Polls remain useful but must account for potential biases.
Conclusion:
Simulation studies can help understand polling accuracy and the effects of
biases. They show that even small biases can significantly impact
predictions, and larger sample sizes do not necessarily overcome these
biases. Improving polling methodologies to account for biases is crucial for
accurate predictions.
Simulation and Data Design: Simulating a Randomized Trial:
Vaccine Efficacy
Randomized Controlled Trials (RCTs) Overview

 In RCTs, participants are randomly assigned to treatment or control groups.
 This random assignment helps to control for bias and allows for a
clear comparison between the groups.

Detroit Mayor's Decision on Vaccine Efficacy

 In March 2021, Detroit Mayor Mike Duggan declined a shipment of Johnson & Johnson vaccines due to its reported 66% efficacy, compared to Moderna and Pfizer's 95% efficacy.
 The CDC, however, considers a 66% efficacy rate effective, hence
the vaccine received emergency approval.

Scope of Clinical Trials

 Johnson & Johnson trial:


o Conducted with 43,738 participants from eight countries.
o Included a diverse group with about 40% having comorbidities.
o Conducted during a period when a more contagious variant
was spreading.
 Moderna and Pfizer trials:
o Primarily conducted in the US.
o Also had about 40% participants with comorbidities.
o Conducted earlier in the pandemic during a period of lower
infection rates.

The Urn Model in Simulations

 An urn model is used to simulate the assignment of treatment and placebo in the trial.
 In the J&J trial, 468 participants contracted COVID-19 (117 in the
treatment group and 351 in the control group).
 An urn model with 43,738 marbles (468 labeled "sick" and the rest
"healthy") is used to simulate this process.
 Drawing half the marbles simulates the assignment to treatment and
placebo groups.

Simulation Results

 The simulation showed that drawing 117 or fewer "sick" marbles out
of 21,869 in 500,000 trials was extremely rare if the vaccine were
ineffective.
 The rarity of this outcome suggests the vaccine's efficacy.
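A sketch of how this simulation could be run in numpy, using the counts reported above (468 cases among 43,738 participants, half assigned to treatment, 500,000 simulated trials); the random seed is an arbitrary choice.

# If the vaccine did nothing, how often would random assignment alone put
# 117 or fewer of the 468 cases into the treatment group?
import numpy as np

rng = np.random.default_rng(seed=0)

n_participants = 43_738
n_sick = 468
n_treatment = n_participants // 2   # 21,869 drawn into the treatment group

# Number of "sick" marbles landing in the treatment group in each simulated trial
sick_in_treatment = rng.hypergeometric(
    ngood=n_sick, nbad=n_participants - n_sick, nsample=n_treatment, size=500_000
)

p_extreme = np.mean(sick_in_treatment <= 117)
print(f"Proportion of simulated trials with <= 117 cases in treatment: {p_extreme}")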
Calculating Vaccine Efficacy (VE)

 VE is calculated using the formula:

VE = (risk among unvaccinated group − risk among vaccinated group) / (risk among unvaccinated group)

 For the J&J trial (the two groups were the same size, so case counts can stand in for risks): VE = (351 − 117) / 351 ≈ 0.6667, or 66.67%.
 The CDC's standard for VE is 50%. For J&J, none of the simulations resulted in 156 or fewer cases in the treatment group, affirming its efficacy.

Efficacy Against Severe Cases

 J&J vaccine showed over 80% efficacy in preventing severe COVID-19 cases.
 No deaths were observed in the treatment group.

Conclusion

 The urn model and random assignment in clinical trials help assess
the efficacy of treatments.
 Considering the scope and context of data is crucial for accurate
comparisons between different studies.
 After understanding these factors, Mayor Duggan retracted his
statement, acknowledging the efficacy and safety of the J&J vaccine.

This simulation example illustrates the importance of randomization, data scope, and appropriate statistical models in evaluating vaccine efficacy and making informed public health decisions.
Assignment/Exercise:
Now, let's replace that first plot command with the following.

Now, let's try changing the colors.


Now, let's change the size of the points in the plot.

Finally, let's draw a line through the points.


Assignment/Exercise :
1. Find Fisher's Iris Data Set on Wikipedia.
2. Copy the data table and paste it into Microsoft Excel, Apple Numbers,
or Google Docs Spreadsheet
3. Save the dataset in Comma Separated Value (CSV) format on your
desktop, with a filename of "iris.csv"
4. Read the dataset into R
5. Inspect the data, make sure it is all there, then look at the data using
the summary(), table(), and plot() functions
The random function returns random values between 0 and 1, while the seed function determines the initial state of the pseudo-random number generator.
