2 DS # 1 Introduction To DS

Data science, 2nd lecture: introduction to data science (PDF)

Uploaded by

mussaratk485

9/14/2019

Data Science “Core Components”

Data Science
23

Data Science “Use-case Implementation”


Data Science “Process Flow Diagram”


Media Use-case


Example “K-Means Clustering”
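As a minimal sketch of the technique this slide names, the block below implements plain K-means (Lloyd's algorithm) on 2-D points using only the Python standard library. The function name `kmeans`, the sample points, and the starting centroids are hypothetical and chosen for illustration; they are not taken from the lecture.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Cluster points around the given starting centroids (Lloyd's algorithm)."""
    centroids = [tuple(c) for c in centroids]
    k = len(centroids)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean = tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[i])  # empty cluster: keep old centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two visibly separated groups of points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

The starting centroids are passed in explicitly so the run is deterministic. Real implementations usually pick them at random or with a seeding scheme, and the result can depend on that choice.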


Market Basket Use-case


Example “Association Rule Mining”
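To make the market basket idea concrete, the sketch below computes the two standard measures behind association rules, support and confidence, and lists single-item rules that pass both thresholds. The baskets, the function names, and the threshold values are hypothetical; this is an illustration of the measures, not a full Apriori implementation.

```python
from itertools import combinations

# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def pair_rules(transactions, min_support, min_confidence):
    """Return (lhs, rhs, confidence) for single-item rules lhs -> rhs."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        if support({a, b}, transactions) < min_support:
            continue  # the pair is too rare to be interesting
        for lhs, rhs in ((a, b), (b, a)):
            conf = confidence({lhs}, {rhs}, transactions)
            if conf >= min_confidence:
                rules.append((lhs, rhs, conf))
    return rules


rules = pair_rules(transactions, min_support=0.6, min_confidence=0.9)
```

With these baskets, every transaction containing beer also contains diapers, so the rule beer -> diapers has confidence 1.0, while the reverse rule is weaker because diapers also appear without beer.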


Health Care Use-case


Example “Parallel Processing”


Social Media Use-case


Example “Naïve Bayes Classifier”
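A Naïve Bayes classifier for short social media posts could look roughly like the sketch below. It is a small multinomial Naïve Bayes with add-one (Laplace) smoothing written with the standard library only. The training posts, the labels, and the function names `train` and `predict` are hypothetical and serve only to illustrate the technique.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs is a list of (text, label) pairs; returns the counts used to predict."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(text, model):
    """Return the label with the highest log posterior score."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior of the class
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # skip words never seen during training
            # Add-one smoothing keeps unseen (word, label) pairs above zero.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled posts.
docs = [
    ("love this great phone", "pos"),
    ("great camera love it", "pos"),
    ("terrible battery hate it", "neg"),
    ("hate this awful screen", "neg"),
]
model = train(docs)
```

The "naïve" part is the assumption that words are independent given the class, which is why the per-word log likelihoods are simply added.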


Data Scientists
• A data scientist is a practitioner with sufficient knowledge of the overlapping areas of expertise in:
  ◦ Business needs
  ◦ Domain knowledge
  ◦ Analytical skills
  ◦ Programming expertise
• The data scientist manages the end-to-end scientific method in the big data lifecycle in order to:
  ◦ Bring structure to the data
  ◦ Find compelling patterns in it
  ◦ Advise executives on the implications for products, processes, and decisions

Introduction to Apache Hadoop (https://hadoop.apache.org/)

• The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing
• The Hadoop framework
  ◦ Allows distributed processing of large data sets across clusters of computers using simple programming models
  ◦ Can also process small data sets in parallel
  ◦ Uses the MapReduce framework
    ▪ Introduced by Google in 2004 as a parallel processing model with an associated implementation
    ▪ Adopted by Apache
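The MapReduce model mentioned above can be sketched in a few lines. The block below simulates the three stages (map, shuffle, reduce) for a word count inside a single Python process. It does not use Hadoop's API, and the function names and input documents are hypothetical. On a real cluster the map and reduce calls would run on different machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    mapped = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical input: each string stands in for one input split.
counts = word_count(["big data big clusters", "data science"])
```

Because each map call looks at one document and each reduce call looks at one key, both stages can be spread across many workers without shared state, which is what makes the model scale.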

Hadoop Key Characteristics

To experience the power of Hadoop, one needs data in the terabyte range, where an RDBMS fails or takes hours and Hadoop can finish in a couple of minutes.

Hadoop works on both large and small data sets.


Introduction to R
• R
  ◦ Is an open-source programming language
  ◦ Is freely available
  ◦ Has GUI support and is easy to learn
  ◦ Is a software environment for statistical computing and graphics
  ◦ Has advanced graphics for representing information
  ◦ Is widely used among statisticians and data miners
  ◦ Has a large number of packages
  ◦ Allows multiple ways to do the same thing
  ◦ Requires the command line for customization
  ◦ Can be connected to many database engines


Introduction to R (Cont…)

• R provides support through mailing lists, Twitter, etc.
• Users can create functions easily
• R can interface with procedures written in C, C++, and Fortran for efficiency
• R and Hadoop integration
  ◦ RHadoop supports Hadoop functionality
  ◦ Developed by Revolution Analytics
  ◦ Contains three main packages
    ▪ rmr: Hadoop MapReduce functionality
    ▪ rhdfs: HDFS file management functionality
    ▪ rhbase: HBase database management functionality

Introduction to Apache Mahout (https://mahout.apache.org/)

• Mahout is used to create scalable, performant (efficient) machine learning applications
• It can be used to create intelligent applications
• It supports implementations for
  ◦ Clustering
  ◦ Classification
  ◦ Collaborative filtering
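Of the three techniques listed, collaborative filtering is sketched below in its simplest user-based form: a missing rating is predicted as the similarity-weighted average of other users' ratings. This is plain Python rather than Mahout's API, and the ratings, user names, and function names are hypothetical. Measuring cosine similarity only over co-rated items is a simplifying assumption.

```python
import math

# Hypothetical ratings: user -> {item: rating on a 1 to 5 scale}.
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "D": 5},
    "cat": {"A": 1, "C": 5, "D": 2},
}


def cosine(u, v):
    """Cosine similarity computed over the items both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def predict_rating(user, item, ratings):
    """Similarity-weighted average of other users' ratings for item."""
    weighted_sum = weight_total = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        sim = cosine(ratings[user], their_ratings)
        weighted_sum += sim * their_ratings[item]
        weight_total += sim
    return weighted_sum / weight_total if weight_total else None


# Ann has not rated item D; estimate it from Bob and Cat.
estimate = predict_rating("ann", "D", ratings)
```

Ann's ratings resemble Bob's much more than Cat's, so the estimate lands closer to Bob's rating of D than to Cat's.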

Database vs. Data Science


Aspect | Databases | Data Science
Data value | “Precious” | “Cheap”
Data volume | Modest | Massive
Examples | Bank records, personnel records, census, medical records | Online clicks, GPS logs, tweets, building sensor readings
Priorities | Consistency, error recovery, auditability | Speed, availability, query richness
Structured | Strongly (schema) | Weakly or none (text)
Properties | Transactions, ACID | CAP theorem (2 of 3), eventual consistency
Realizations | SQL | NoSQL: Riak, Memcached, Apache River, MongoDB, CouchDB, HBase, Cassandra, …

Machine Learning vs. Data Science

Machine Learning | Data Science
Develop new (individual) models | Explore many models, build and tune hybrids
Prove mathematical properties of models | Understand empirical properties of models
Improve/validate on a few, relatively clean, small datasets | Develop/use tools that can handle massive datasets
Publish a paper | Take action!


Concentration of Data Science


• Computer science
• Mathematics and applied mathematics
• Applied statistics
• Data analysis
• Solid programming skills
  ◦ R, Python, Julia, SQL, etc.
• Data mining
• Database storage and management
• Machine learning and discovery


Tools for Data Scientists


• Cloud infrastructure
  ◦ Such as Apache Hadoop, Spark, Cloudera, Amazon Web Services, Unix shell/awk/gawk, 1010data, Hortonworks, Pivotal, and MapR. Most traditional IT vendors have migrated their services and platforms to support the cloud.
• Data/application integration
  ◦ Including Ab Initio, Informatica, IBM InfoSphere DataStage, Oracle Data Integrator, SAP Data Integrator, Apatar, CloverETL, Information Builders, Jitterbit, Adeptia Integration Suite, DMExpress Syncsort, Pentaho Data Integration, and Talend [Review 2016].
• Master data management
  ◦ Typical software and platforms include IBM InfoSphere Master Data Management Server, Informatica MDM, Microsoft Master Data Services, Oracle Master Data Management Suite, SAP NetWeaver Master Data Management tool, Teradata Warehousing, TIBCO MDM, Talend MDM, and Black Watch Data.
• Data preparation and processing
  ◦ 29 data preparation tools and platforms were listed, such as Platfora, Paxata, Teradata Loom, IBM SPSS, Informatica Rev, Omniscope, Alpine Chorus, Knime, Wrangler Enterprise, and Wrangler.
• Analytics
  ◦ Well-recognized commercial tools include SAS Enterprise Miner, IBM SPSS Modeler and SPSS Statistics, MATLAB, and RapidMiner.

References
• Some references for this chapter are:
  ◦ www.edureka.in/data-science
