
2 DS # 1 Introduction To DS

Data science, 2nd lecture: introduction to data science (PDF)

Uploaded by

mussaratk485


9/14/2019

Data Science “Core Components”

Data Science
23

Data Science “Use-case Implementation”


Data Science “Process Flow Diagram”


Media Use-case


Example “K-Means Clustering”
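As a minimal sketch of the technique this slide names, the standard k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its cluster) can be written in plain Python. The sample points, the function names, and the fixed random seed are all assumptions made for illustration; they do not come from the lecture.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(cluster):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def assign(points, centroids):
    """Assignment step: put each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, k, max_iterations=20, seed=0):
    """Run the assign/update loop until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        # Update step: an empty cluster keeps its previous centroid.
        updated = [mean_point(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

# Hypothetical 2-D data with two well-separated groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, clusters = kmeans(points, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

In practice a library implementation would usually be preferred; the sketch only makes the two alternating steps visible.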


Market Basket Use-case


Example “Association Rule Mining”
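A small sketch of association rule mining for a market-basket setting, under stated assumptions: the transactions, the thresholds, and the function names below are invented for illustration, and only rules between single items are considered. Support is the fraction of baskets containing an itemset; confidence of A → B is the count of baskets with both A and B divided by the count of baskets with A.

```python
from itertools import combinations

# Hypothetical shopping baskets (not taken from the lecture).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(set(itemset) <= basket for basket in transactions)

def pair_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        pair_count = count({a, b}, transactions)
        if pair_count / n < min_support:
            continue  # the pair is not frequent enough
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_count / count({lhs}, transactions)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_count / n, confidence))
    return rules

rules = pair_rules(transactions)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

A full Apriori implementation would extend the same support test to larger itemsets, pruning any candidate that contains an infrequent subset.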


Health Care Use-case


Example “Parallel Processing”


Social Media Use-case


Example “Naïve Bayes Classifier”
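One way to picture a Naïve Bayes classifier in a social media setting is a tiny text classifier. The sketch below assumes a multinomial model with add-one (Laplace) smoothing; the training posts, labels, and function names are hypothetical and are not taken from the lecture.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Collect the counts the classifier needs from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def predict(text, model):
    """Return the label with the highest log-posterior score."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # log P(label) + sum of log P(word | label), assuming independent words
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labelled posts.
examples = [
    ("great phone love it", "positive"),
    ("love this great camera", "positive"),
    ("terrible battery hate it", "negative"),
    ("hate this awful screen", "negative"),
]
model = train(examples)
print(predict("love the great screen", model))
print(predict("awful battery", model))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.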


Data Scientists
• A data scientist is
  • A practitioner with sufficient knowledge of the overlapping areas of expertise in:
    • Business needs,
    • Domain knowledge,
    • Analytical skills, and
    • Programming expertise
  • Who manages the end-to-end scientific method across the big data lifecycle in order to
    • Bring structure to the data,
    • Find compelling patterns in it, and
    • Advise executives on the implications for products, processes, and decisions

Introduction to Hadoop (https://hadoop.apache.org/)

• The Apache Hadoop project develops open-source software for
  • Reliable, scalable, distributed computing
• The Hadoop framework
  • Allows distributed processing of large data sets across clusters of computers using simple programming models
  • Can also process small data sets in parallel
  • Uses the MapReduce framework
    • Introduced by Google in 2004; it provides
      • A parallel processing model and
      • An associated implementation
    • Adopted by Apache
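The MapReduce model mentioned above can be sketched as a word count in plain Python. This is a single-machine illustration under stated assumptions: it does not use any Hadoop API, the function names and sample documents are invented for the example, and a real cluster would run the map and reduce calls in parallel on different nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

# Hypothetical input splits.
documents = ["big data needs big clusters", "Hadoop processes big data"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle_phase(mapped)
word_counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(word_counts)
```

Because each map call sees only its own split and each reduce call sees only one key, both phases can be distributed without shared state, which is what lets the framework scale across a cluster.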

Hadoop Key Characteristics

To experience the power of Hadoop, one needs data in the terabyte range, where an RDBMS fails or takes hours and Hadoop finishes in a couple of minutes.

Works on both large and small data sets


Introduction to R
• R
  • Is an open-source programming language
  • Is freely available
  • Has GUI support and is easy to learn
  • Is a software environment for statistical computing and graphics
  • Has advanced graphics for information representation
  • Is widely used among statisticians and data miners
  • Has a lot of packages
  • Allows multiple ways to do the same thing
  • Needs the command line for customization
  • Can be connected to many database engines


Introduction to R (Cont…)

• R provides support for email lists, Twitter, etc.
• Users can create functions easily
• R can easily interface with procedures written in C, C++, and Fortran for efficiency
• R & Hadoop integration
  • RHadoop supports Hadoop functionality
  • Developed by Revolution Analytics
  • Contains three main packages
    • rmr: Hadoop MapReduce functionality
    • rhdfs: HDFS file management functionality
    • rhbase: HBase database management functionality

Introduction to Mahout (https://mahout.apache.org/)

• Mahout is used to create
  • Scalable,
  • Performant (efficient)
  • Machine learning applications
• Can be used to create intelligent applications
• Supports implementations of
  • Clustering
  • Classification
  • Collaborative filtering

Database vs. Data Science

               Databases                          Data Science
Data Value     “Precious”                         “Cheap”
Data Volume    Modest                             Massive
Examples       Bank records, personnel records,   Online clicks, GPS logs, tweets,
               census, medical records            building sensor readings
Priorities     Consistency, error recovery,       Speed, availability,
               auditability                       query richness
Structured     Strongly (schema)                  Weakly or none (text)
Properties     Transactions, ACID*                CAP* theorem (2/3),
                                                  eventual consistency
Realizations   SQL                                NoSQL: Riak, Memcached, Apache River,
                                                  MongoDB, CouchDB, HBase, Cassandra, …

Machine Learning vs. Data Science

Machine Learning                          Data Science
Develop new (individual) models           Explore many models, build and tune hybrids
Prove mathematical properties of models   Understand empirical properties of models
Improve/validate on a few, relatively     Develop/use tools that can handle
clean, small datasets                     massive datasets
Publish a paper                           Take action!


Concentration of Data Science

• Computer Science
• Mathematics and Applied Mathematics
• Applied Statistics
• Data Analysis
• Solid Programming Skills
  • R, Python, Julia, SQL, etc.
• Data Mining
• Database Storage and Management
• Machine Learning and Discovery


Tools for Data Scientists

• Cloud infrastructure
  • Such as Apache Hadoop, Spark, Cloudera, Amazon Web Services, Unix shell/awk/gawk, 1010data, Hortonworks, Pivotal, and MapR. Most traditional IT vendors have migrated their services and platforms to support the cloud.
• Data/application integration
  • Including Ab Initio, Informatica, IBM InfoSphere DataStage, Oracle Data Integrator, SAP Data Integrator, Apatar, CloverETL, Information Builders, Jitterbit, Adeptia Integration Suite, DMExpress Syncsort, Pentaho Data Integration, and Talend [Review 2016].
• Master data management
  • Typical software and platforms include IBM InfoSphere Master Data Management Server, Informatica MDM,

Tools for Data Scientists (Cont…)

  • Microsoft Master Data Services, Oracle Master Data Management Suite, SAP NetWeaver Master Data Management tool, Teradata Warehousing, TIBCO MDM, Talend MDM, and Black Watch Data.
• Data preparation and processing
  • 29 data preparation tools and platforms were listed, such as Platfora, Paxata, Teradata Loom, IBM SPSS, Informatica Rev, Omniscope, Alpine Chorus, KNIME, Wrangler Enterprise, and Wrangler.
• Analytics
  • Well-recognized commercial tools include SAS Enterprise Miner, IBM SPSS Modeler and SPSS Statistics, MATLAB, and RapidMiner.

References
• Some references for this chapter are:
  • www.edureka.in/data-science

