Module 4.1 - Data Science

The document discusses the significance of data science in both commercial and scientific contexts, highlighting the vast growth of data and the need for effective analysis. It outlines the essential skills for data scientists, the differences between computer scientists and scientists, and the importance of asking the right questions. Additionally, it presents various applications of data science, including predictive modeling, clustering, and anomaly detection, along with potential career paths in the field.

Uploaded by

alaramadan06
Copyright © All Rights Reserved

Data Science

Large-scale Data is Everywhere!

● There has been enormous data growth in both commercial and scientific databases due to advances in data generation and collection technologies
● New mantra
– Gather whatever data you can whenever and wherever possible.
● Expectations
– Gathered data will have value either for the purpose collected or for a purpose not envisioned.

[Figure panels: E-Commerce; Cyber Security; Social Networking: Twitter; Traffic Patterns; Sensor Networks; Computational Simulations]

Why Data Science? Commercial Viewpoint

● Lots of data is being collected and warehoused
– Web data
• Google has petabytes of web data
• Facebook has billions of active users
– Purchases at department/grocery stores, e-commerce
• Amazon handles millions of visits/day
– Bank/credit card transactions
● Computers have become cheaper and more powerful
● Competitive pressure is strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Data Science? Scientific Viewpoint

● Data collected and stored at enormous speeds
– Remote sensors on a satellite
• NASA EOSDIS archives over petabytes of earth science data / year
– Telescopes scanning the skies
• Sky survey data
– High-throughput biological data
– Scientific simulations
• Terabytes of data generated in a few hours
● Data science helps scientists
– In automated analysis of massive datasets
– In hypothesis formation

[Figure panels: fMRI Data from Brain; Sky Survey Data; Gene Expression Data; Surface Temperature of Earth]

Great Opportunities to Solve Society’s Major Problems

● Improving health care and reducing costs
● Predicting the impact of climate change
● Finding alternative/green energy sources
● Reducing hunger and poverty by increasing agriculture production

What is Data Science?
Like any emerging field, it isn’t yet well defined,
but incorporates elements of:
● Exploratory Data Analysis and Visualization
● Machine Learning and Statistics
● High-Performance Computing technologies
for dealing with scale.
Skill Sets for Data Science
Appreciating Data
Computer Scientists do not naturally appreciate
data: it’s just stuff to run through a program.
The usual way to test algorithm performance is
to run the implementation on “random data”.
But interesting data sets are a scarce resource,
which requires hard work and imagination to
obtain.
Computer vs. Real Scientists (1)
● Scientists strive to understand the
complicated and messy natural world, while
computer scientists build their own clean and
organized virtual worlds. Thus:
● Nothing is ever completely true or false in
science, while everything is either true or
false in Computer Science / Mathematics.
Computer vs. Real Scientists (2)
● Scientists are data-driven, while computer
scientists are algorithm-driven.
● Scientists obsess about discovering things,
which computer scientists invent rather than
discover.
● Scientists are comfortable with the idea that
data has errors; computer scientists are not.
Genius vs. Wisdom
Software developers are hired to produce code.
Data Scientists are hired to produce insights.
Genius shows in finding the right answer!!!
Wisdom shows in avoiding the wrong answers.
Data science (like most things) benefits more
from wisdom than from genius.
Developing Wisdom
● Wisdom comes from experience.
● Wisdom comes from general knowledge.
● Wisdom comes from listening to others.
● Wisdom comes from humility, observing how
often you have been wrong and why/how.
I seek to pass on wisdom, through my experience
of the difficulty of making good predictions.
Developing Curiosity
● The good data scientist develops a curiosity
about the domain/application they are
working in.
● They talk shop with the people whose data
they are working on.
● They read the newspaper every day, to get a
broader perspective on the world.
Asking Good Questions
Software developers are not encouraged to ask
questions, but data scientists are:
● What exciting things might you be able to
learn from a given data set?
● What things do you/your people really want
to know?
● What data sets might get you there?
Let’s Practice Asking Questions!
Who, What, Where, When, and Why on the
following datasets:
● Baseball-reference.com
● Google ngrams
● NYC taxi cab records
Baseball-Reference.com: biosketch
Statistical Record of Play
Summary statistics of each year’s batting, pitching, and fielding record, with teams and awards.
Baseball Questions
● How best to measure an individual player’s skill,
value, or performance?
● How fairly do trades between teams work out?
● What is the trajectory of a player’s
performance as they mature and age?
● To what extent does batting performance
correlate with the position played?
Demographic Questions
● Do left-handed people have shorter lifespans
than right-handers?
● How often do people return to where they
were born?
● Do player salaries reflect past, present, or
future performance?
● Are heights and weights increasing in the
population?
Google Ngrams
● Presents an annual time series of how often
every “popular” phrase of 1 to 5 words
occurs in scanned books.
● “Popular” means the phrase appears more than 40 times in total.
● Google has scanned about 15% of all books
ever published, making this resource quite
comprehensive.
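
A minimal sketch of how a yearly n-gram time series of this kind could be tabulated. The toy corpus, the helper names `ngrams` and `ngram_counts_by_year`, and the whitespace tokenization are assumptions made for illustration; this is not a description of how Google builds its dataset.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return all contiguous n-word phrases in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts_by_year(corpus, max_n=5):
    """Map year -> Counter of every phrase of 1..max_n words seen that year."""
    counts = defaultdict(Counter)
    for year, text in corpus:
        tokens = text.lower().split()  # naive tokenization (assumption)
        for n in range(1, max_n + 1):
            counts[year].update(ngrams(tokens, n))
    return counts

# Hypothetical toy corpus of (publication year, text) pairs.
corpus = [
    (1990, "data mining finds patterns"),
    (2015, "data science and data mining"),
    (2015, "data science finds insights"),
]

counts = ngram_counts_by_year(corpus)
# Annual time series for one phrase; a missing phrase counts as 0.
series = {year: counts[year]["data science"] for year in sorted(counts)}
print(series)
```

A real analysis would also divide each count by the total number of phrases published that year, so that the series reflects relative frequency and not just the growth in the number of scanned books.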
Google Ngram Viewer
Ngram Questions
● How has the amount of cursing changed
over time?
● What is the lifespan of fame and
technologies? Is it increasing/decreasing?
● How often do new words emerge? Do they
stay in common usage?
● What words are associated with other words,
i.e. can you build a language model?
NYC Taxi Cab Data
● Gives driver/owner, pickup/dropoff location,
and fare data for every taxi trip taken.
● Data obtained from NYC via a Freedom of
Information Act (FOIA) request
Taxicab Questions
● How much do drivers make each night?
● How far do they travel?
● How much slower is traffic during rush hour?
● Where are people traveling to/from at
different times of the day?
● Do faster drivers get tipped better?
● Where should drivers go to pick up their next
fare?
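
As one way to approach the rush-hour question above, the sketch below compares average speed by pickup hour. The record layout (hour, miles, minutes), the sample values, and the function name `mean_speed_by_hour` are all hypothetical; the real taxi data would need parsing and cleaning first.

```python
from collections import defaultdict

# Hypothetical trip records: (pickup hour 0-23, distance in miles, duration in minutes).
trips = [
    (8, 2.0, 20.0),
    (8, 3.0, 30.0),
    (14, 2.0, 10.0),
    (14, 4.0, 20.0),
]

def mean_speed_by_hour(trips):
    """Return hour -> average speed in mph (total miles / total hours)."""
    miles = defaultdict(float)
    minutes = defaultdict(float)
    for hour, dist, dur in trips:
        miles[hour] += dist
        minutes[hour] += dur
    return {h: miles[h] / (minutes[h] / 60.0) for h in miles if minutes[h] > 0}

speeds = mean_speed_by_hour(trips)
# Fractional slowdown of the 8:00 hour relative to the 14:00 hour.
slowdown = 1.0 - speeds[8] / speeds[14]
print(speeds, slowdown)
```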
Machine Learning Tasks …

[Figure: machine learning tasks applied to data]

Predictive Modeling: Classification
● Find a model for the class attribute as a function of the values of other attributes

[Figure: model for predicting credit worthiness, with the class attribute marked]

Classification Example

[Figure: a training set is used to learn a classifier (the model), which is then applied to a test set]

Introduction to Data Mining, 2nd Edition, Tan, Steinbach, Karpatne, Kumar
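
The training-set / test-set workflow can be sketched in a few lines. This example swaps in a nearest-centroid classifier purely because it is short; the slides do not name a specific algorithm. The credit records, feature choice, and function names are hypothetical.

```python
import random

def train_nearest_centroid(rows, labels):
    """Learn one centroid (mean feature vector) per class."""
    sums, counts = {}, {}
    for x, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def predict(model, x):
    """Assign x to the class whose centroid is closest (squared Euclidean)."""
    def dist(y):
        return sum((a - b) ** 2 for a, b in zip(x, model[y]))
    return min(model, key=dist)

# Hypothetical records: (income in $1000s, years employed) -> credit worthy?
data = [((20, 1), "no"), ((25, 2), "no"), ((30, 1), "no"), ((22, 3), "no"),
        ((80, 10), "yes"), ((90, 12), "yes"), ((75, 8), "yes"), ((85, 15), "yes")]

random.Random(0).shuffle(data)   # fixed seed so the split is repeatable
split = int(0.75 * len(data))    # 75% training set, 25% test set
train, test = data[:split], data[split:]

model = train_nearest_centroid([x for x, _ in train], [y for _, y in train])
accuracy = sum(predict(model, x) == y for x, y in test) / len(test)
print(accuracy)
```

The key point is that accuracy is measured only on records the model never saw during training.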
Examples of Classification Task

● Classifying credit card transactions as legitimate or fraudulent
● Classifying land covers (water bodies, urban areas, forests, etc.) using satellite data
● Categorizing news stories as finance, weather, entertainment, sports, etc.
● Identifying intruders in cyberspace
● Predicting tumor cells as benign or malignant
● Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil

09/09/2020
Classification: Application 1

● Fraud Detection
– Goal: Predict fraudulent cases in credit card transactions.
– Approach:
• Use credit card transactions and the information on the account holder as attributes (when the customer buys, what they buy, how often they pay on time, etc.).
• Label past transactions as fraud or fair transactions. This forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card transactions on an account.
Classification: Application 2

● Churn prediction for telephone customers
– Goal: To predict whether a customer is likely to be lost to a competitor.
– Approach:
• Use detailed records of transactions with each of the past and present customers to find attributes (how often the customer calls, where they call, what time of day they call most, their financial status, marital status, etc.).
• Label the customers as loyal or disloyal.
• Find a model for loyalty.

Classification: Application 3
● Sky Survey Cataloging
– Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
– 3000 images with 23,040 x 23,040 pixels per image.
– Approach:
• Segment the image.
• Measure image attributes (features), 40 of them per object.
• Model the class based on these features.
• Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!

Classifying Galaxies
Courtesy: http://aps.umn.edu

Class:
• Stages of Formation (Early, Intermediate, Late)
Attributes:
• Image features
• Characteristics of light waves received, etc.
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB

Regression

● Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
● Extensively studied in the statistics and neural network fields.
● Examples:
– Predicting sales amounts of a new product based on advertising expenditure.
– Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices.

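
For the linear case, the sales-versus-advertising example can be sketched with ordinary least squares on a single predictor. The data values and the function name `fit_line` are hypothetical, chosen so the fit is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: advertising spend vs. sales (both in $1000s).
ad_spend = [10.0, 20.0, 30.0, 40.0]
sales = [25.0, 45.0, 65.0, 85.0]  # constructed as 5 + 2 * spend

intercept, slope = fit_line(ad_spend, sales)
predicted = intercept + slope * 50.0  # predicted sales at a spend of 50
print(intercept, slope, predicted)
```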
Clustering

● Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]

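
One common way to find such groups is K-means. The sketch below is a bare-bones version of Lloyd's algorithm on hypothetical 2-D points; the function names, the fixed iteration count, and the seeded random initialization are simplifying assumptions.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate point assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```

Each iteration reduces (or leaves unchanged) the total intra-cluster squared distance, which is the quantity the definition above asks to be small.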
Applications of Cluster Analysis
● Understanding
– Customer profiling for targeted marketing
– Group related documents for browsing
– Group genes and proteins that have similar functionality
– Group stocks with similar price fluctuations
● Summarization
– Reduce the size of large data sets

[Figure courtesy: Michael Eisen]

[Figure: Clusters for Raw SST and Raw NPP, plotted by longitude (-180 to 180) and latitude (-90 to 90). Legend: Land Cluster 1, Land Cluster 2, Sea Cluster 1, Sea Cluster 2, Ice or No NPP.]
Use of K-means to partition Sea Surface Temperature (SST) and Net Primary Production (NPP) into clusters that reflect the Northern and Southern Hemispheres.
Clustering: Application 1

● Market Segmentation:
– Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
– Approach:
• Collect different attributes of customers based on their geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.
Clustering: Application 2

● Document Clustering:
– Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
– Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.

Enron email dataset

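
A similarity measure based on term frequencies can be sketched as cosine similarity between word-count vectors. Cosine similarity is one common choice, not necessarily the one used for the Enron example; the three short documents and the function names are hypothetical.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each (lowercased, whitespace-split) term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents.
docs = {
    "a": "energy trading contract energy prices",
    "b": "energy prices and trading desk",
    "c": "lunch meeting on friday",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

sim_ab = cosine_similarity(tf["a"], tf["b"])  # shared vocabulary
sim_ac = cosine_similarity(tf["a"], tf["c"])  # no shared terms
print(sim_ab, sim_ac)
```

A clustering algorithm can then group documents whose pairwise similarity is high. In practice, very common words such as "and" are usually down-weighted or removed first.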
Deviation/Anomaly/Change Detection

● Detect significant deviations from normal behavior
● Applications:
– Credit Card Fraud Detection
– Network Intrusion Detection
– Identify anomalous behavior from sensor networks for monitoring and surveillance.
– Detecting changes in the global forest cover.

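
A very simple way to flag "significant deviations from normal behavior" is a z-score test: mark values that lie many standard deviations from the mean. The transaction amounts, the threshold of 2.0, and the function name are assumptions for illustration; production fraud detection uses far richer models.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return (index, value) pairs whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical: nothing deviates
    return [(i, v) for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

# Hypothetical card transaction amounts; one is far from the rest.
amounts = [12.0, 15.0, 11.0, 14.0, 13.0, 12.5, 950.0, 13.5]
outliers = zscore_outliers(amounts)
print(outliers)
```

Note that a large outlier inflates the standard deviation itself, so with small samples the threshold has to be modest; robust alternatives based on the median avoid this problem.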
Motivating Challenges

● Scalability
● High Dimensionality
● Heterogeneous and Complex Data
● Data Ownership and Distribution
● Non-traditional Analysis

DS Career path

Introduction
• Graduates of a data science program
will mostly, and preferably, work as
Data Scientists
• Data Scientists can work in any type
of organization:
– Private
– Governmental
– Not-for-Profit


Industries
• Any organization can benefit from the
data it has, so data scientists can
work in any industry:
– Financial institutions (e.g., banks)
– Government agencies (e.g., Civil Status and
Passports Department and Police Department)
– Healthcare (e.g., hospitals)
– Online platforms (e.g., Uber)
– Large retailers (e.g., Carrefour and Amazon)
– Agricultural companies
– And many more …


Data Scientist Responsibilities
• Data scientists usually need to build
models from verified and validated data
sets
• These models will be used by the
employer to predict, recommend, or
evaluate any future business decision


Data Scientist Responsibilities
• For example, a data scientist, working
for a hospital, can build a data model
that predicts the best treatment for a
specific patient
• The data scientist will use the data
that was collected by the hospital
about the patients and the treatments
that worked and did not work for them
in the past.


Data Scientist Responsibilities
• As another example, a data
scientist working for the police
department can build a data model
that predicts the location and time of
the next crime before it happens
• The data scientist will use the data
that was collected by the police
department about previous crimes
to build the proposed model


Data Scientist Responsibilities
• As another example, a data
scientist working for a large retailer
can build a data model that predicts
the demand for certain products and
services
• The data scientist will use the data
that was collected by the retailer about
previous purchasing transactions
• The data scientist may also use data that is
provided by external entities


Data Scientist Responsibilities
• Before building the model, data scientists
usually need to clean and normalize the
data
• Data could be collected from internal
and/or external sources
• Data scientists need to communicate
with the data management team to make
sure that the necessary data is being
collected
– The data compliance department should be
involved to make sure that data collection is
properly handled from a legal perspective
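
As a rough illustration of the "clean and normalize" step, the sketch below drops incomplete records and rescales each column to the range [0, 1] (min-max normalization). The records, the choice to drop rows with missing values, and the function names are assumptions; real projects may impute missing values or use other scaling schemes.

```python
def clean(rows):
    """Drop records that have any missing (None) field."""
    return [r for r in rows if all(v is not None for v in r)]

def min_max_normalize(rows):
    """Rescale every column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

# Hypothetical records: (age in years, annual income); one income is missing.
raw = [(30, 50000.0), (45, None), (60, 90000.0), (45, 70000.0)]
normalized = min_max_normalize(clean(raw))
print(normalized)
```

Putting features on a common scale keeps a large-valued column such as income from dominating distance-based models.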



More Opportunities
• In addition to working as data
scientists, graduates of a data science
program can work as software
development engineers
• In this field, they will mostly specialize
in developing platforms that help data
scientists in their jobs
• They can also develop dashboards
that present business intelligence
charts and reports to users


CIS Career Path
Introduction
• Graduates of a Computer Information
Systems (CIS) program can pursue a
job in one of the following fields:
– Business Analysis
– Software Development
– System Implementation


Introduction
• CIS is an interdisciplinary program
that encompasses technology and
business courses
• This makes the graduates of this
program knowledgeable about how
business works and how technology
can make businesses more efficient
and more effective



Introduction
• People who know only the
technology will face the
following issues while working in the
software development field:
– Difficulty in developing software that
satisfies the business requirements
– Difficulty in architecting software
systems according to international
standards
– Difficulty in maintaining existing systems
due to lack of knowledge about the
business behind them


Example
• The CIS program exposes students to
healthcare information systems
• When a CIS graduate joins a software
development team that is responsible
for developing an electronic health
record (EHR) system, they will already be
aware of the features and functionality
of the proposed system


You as a Business Analyst
• You will help customers define their
requirements for any proposed
software system
• Because you are already aware of
how existing systems work, you can
make notes and suggestions on what
the proposed software system should
look like
• Also, it is less likely that you will
misinterpret the requirements provided
by customers
You as a Software Developer
• You will write code to make a software
system
• Because you are already aware of
how business works, you will be able
to choose the right architecture for the
system
• The right architecture is one that
supports any future improvements
without making radical changes to the
existing architecture
You as a System Implementer
• You will help users use the software
system the right way
• Because you are already aware of
how business works, you will be able
to provide very helpful advice on
how the software should be used and
utilized