1.introduction To Bigdata Chap1

This document provides an overview of an introductory course on big data. The course will be taught over 16 weeks and cover topics such as the definition of big data, data visualization, statistical modeling, machine learning algorithms, and trends in big data applications. Students will learn computational approaches through hands-on exercises using the R programming language. Evaluation will be based on exams, quizzes, assignments, and attendance. Reference materials will be provided online.

Uploaded by

Snoussi Oussama

DCCS208(02) Korea University 2019 Fall

Introduction
to Big Data
Chapter 1 & 2 (Week 1)
Course overview & introduction
Asst. Prof. Minseok Seo
[email protected]
Course Overview
Introduction to Big Data 01
Contents

1. Course Overview
 Brief introduction of professor & course
 Object & Aim of the course
 Assignments & Quiz
 Evaluation

2. Introduction to Big Data


 Definition of Big Data
 Key techniques in Data Science
 Core technology of Informatics
Course Overview
Course information

Introduction to Big Data, DCCS208(02), Fall 2019.

 Lecture time: Wed. (6,7) and Thu. (6)

 Location: Wed. (7-310) and Thu. (7-315)

 Completion division: Major elective subject

 Level: Junior / Senior

copyrightⓒ 2018 All rights reserved by Korea University 4 / 20


Course Overview
Definition of Big Data (Cont.)

VS.

Which is bigger, elephant or rat?

Definition of Big Data (Cont.)

 What is Data?

Columns are attributes (dimensions, features, variables); rows are objects (samples, individuals).

ID          Height   Weight   Age
Student 1   189 cm   81 kg    24
Student 2   210 cm   90 kg    26
Student 3   191 cm   92 kg    27
…           …        …        …
Student N   162 cm   71 kg    21

Definition of Big Data (Cont.)

 In a narrow sense, Big Data refers only to a large sample size.

 In a broad sense, Big Data refers to both a large sample size and high dimensionality.
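
As a small illustration of the two senses, the sketch below builds a toy data frame from the four student rows shown in the table earlier and reads off its sample size and dimensionality. The object name `students` and the column names are assumptions made for this example only.

```r
# Toy data frame mirroring the student table: rows are objects (samples),
# columns are attributes (features). Only the four rows shown on the slide are used.
students <- data.frame(
  height_cm = c(189, 210, 191, 162),
  weight_kg = c(81, 90, 92, 71),
  age       = c(24, 26, 27, 21),
  row.names = c("Student 1", "Student 2", "Student 3", "Student N")
)

n <- nrow(students)  # sample size (number of objects)
p <- ncol(students)  # dimensionality (number of attributes)

cat("sample size n =", n, "\n")
cat("dimensionality p =", p, "\n")
```

In the narrow sense only `n` would need to be large; in the broad sense either `n` or `p` (or both) can make the data "big".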

Definition of Big Data (Cont.)

 3V’s (Volume, Velocity, and Variety)

Definition of Big Data (Cont.)

 5V’s (Volume, Velocity, Variety, Veracity, and Value)

 Volume: Data size
 Velocity: Data production speed
 Variety: Data originating from various sources
 Veracity: Data accuracy (trustworthiness)
 Value: Data value

Relationship between Big-data & Data Science

 The amount of data and information is not directly correlated with knowledge generation.

 But the demand for data scientists will keep growing.

Job market of Big data

Furht B., Villanustre F. (2016) Introduction to Big Data. In: Big Data Technologies and Applications. Springer, Cham

It is time to prepare academic courses that cultivate data analysts in step with this demand.

Object & Aim of the course

 Students who complete this course, Introduction to Big Data, are expected to learn:

 Concept of Big Data
 Basic skills in Data Science
 Computational approaches for Big Data
 Statistical approaches for Big Data
 R programming
 Visualization for Big Data

Course schedule (Before Mid-term exam)

Week  Period         Study Contents
1     09.02 - 09.08  Introduction to Big Data & Data Science
2     09.09 - 09.15  Overall workflow, computer software issues, and applications in the Big Data era
3     09.16 - 09.22  Introduction to R programming
4     09.23 - 09.29  Descriptive & Fundamental Statistics
5     09.30 - 10.06  Understanding Data Structures (Types of random variable)
6     10.07 - 10.13  Data Visualization
7     10.14 - 10.20  Preprocessing of Big Data (Quality Control and Prescreening)
8     10.21 - 10.27  Mid-term Exam

Course schedule (After Mid-term exam)

Week  Period         Study Contents
9     10.28 - 11.03  Parallel and Distributed Processing for Big Data
10    11.04 - 11.10  Statistical Estimation & Modeling
11    11.11 - 11.17  Computational approach for statistical modeling with robustness
12    11.18 - 11.24  Clustering analysis (Unsupervised learning methods)
13    11.25 - 12.01  Classification analysis (Supervised learning methods)
14    12.02 - 12.08  Algorithms of Dimensionality Reduction for Big Data
15    12.09 - 12.15  Trends in various academic & industrial fields for application of Big Data
16    12.16 - 12.22  Final Exam

Two types of lectures per week

        Wed.                Thu.
Length  2 hrs               1 hr
Type    Lecture for theory  Hands-on lecture

The methodology learned in the theory class will be practiced in the computer lab on Thursday.

 There are two representative computer languages for Big Data analysis: R and Python.

 R will be used in this class.

 No prior knowledge of the R language is required, because I plan to provide example code for students to practice with.

https://cran.r-project.org/

Exam, Quiz, and Homework

Midterm and Final exams

 There will be two exams.

 The exams will test your understanding of the basic computational and statistical algorithms.

Quiz
 There will be two simple in-class quizzes to check students' learning progress (one before and one after the midterm).

Homework
 There will be four assignments.

 Each will be a report on the theory and practice of data analysis learned in class.

Evaluation plan

[Pie chart of grade weights. Categories: Midterm, Final, Quiz, Assignment, Attendance. Slice labels: 10%, 30%, 20%, 10%, 30%.]

 Absolute grading system


Score ≥ 95, you will get A+
Score ≥ 90, you will get A
Score ≥ 85, you will get B+
and...
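
The absolute grading rule can be sketched as a small function. This is only an illustration: the function name `letter_grade` is hypothetical, and only the three cutoffs listed above are encoded, so any score below 85 is returned as `NA` rather than guessed.

```r
# Map a total score (0-100) to a letter grade using only the three cutoffs
# stated on the slide; anything below 85 is left unspecified.
letter_grade <- function(score) {
  if (score >= 95) {
    "A+"
  } else if (score >= 90) {
    "A"
  } else if (score >= 85) {
    "B+"
  } else {
    NA_character_  # cutoffs below B+ are not listed on the slide
  }
}

# Apply the rule to a few example scores
example_grades <- sapply(c(97, 92.5, 85, 70), letter_grade)
print(example_grades)
```

Because the system is absolute, a grade depends only on the student's own score, not on how classmates performed.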

Textbook

 No textbook

 This course will proceed based on the presentation slides.

 I will upload the presentation slides to Blackboard and my homepage.

Homepage: https://scholar.harvard.edu/msseo
Teaching >> Introduction to Big Data >> Related Materials

 Reference 1 (Kor. Version)
R for Practical Data Analysis
(online textbook and free)
http://r4pda.co.kr/pdf/r4pda_2014_03_02.pdf
 Reference 2 (Eng. Version)
Introduction to Data Science by Rafael A. Irizarry, 2019.
(online textbook and free)
https://rafalab.github.io/dsbook/
 Reference 3 (Eng. Version)
R for Data Science by Hadley Wickham and Garrett Grolemund.
(online textbook and free)
https://r4ds.had.co.nz/

Contact information

 Prof. Minseok Seo


Location: 7-203
Tel: 044-860-1379
Email: [email protected]

 TA. Heechan Chae


Location: 7-328
Email: [email protected]

 If you have any questions about the course please email me and I will reply as
soon as I see it.

 If you need to meet in person, please make an appointment by email first.

 I am available Mon 12:00 - 17:00 | Wed 10:00 - 13:00 | Thu 10:00 - 13:00.

End of
Orientation
Contents

1. Course Overview
 Brief introduction of professor & course
 Object & Aim of the course
 Assignments & Quiz
 Evaluation

2. Introduction to Big Data


 Concept of Big Data
 Key techniques in Data Science for Big data
Characteristics of Big Data
Recap: the concept of Big Data

 5V’s (Volume, Velocity, Variety, Veracity, and Value)

 Volume: Data size
 Velocity: Data production speed
 Variety: Data originating from various sources
 Veracity: Data accuracy (trustworthiness)
 Value: Data value


Petabyte era

1 PB = 1,000,000,000,000,000 B = 10^15 bytes = 1,000 terabytes (TB)

1,000 PB = 1 exabyte (EB)

 [Company logo] transferred about 197 PB of data through its network each day (2018)

 [Company logo] processed about 24 petabytes daily (2009)

In fact, we can say that we have already entered the exabyte era.
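
The unit arithmetic on this slide can be checked in a few lines of R. This is a rough sketch using decimal (SI) units as on the slide; the variable names are assumptions for illustration, and the 197 PB per day figure is simply the 2018 number quoted above.

```r
# Decimal (SI) units, as on the slide: 1 PB = 10^15 bytes.
bytes_per_tb <- 1e12
bytes_per_pb <- 1e15
bytes_per_eb <- 1000 * bytes_per_pb

tb_per_pb <- bytes_per_pb / bytes_per_tb    # terabytes in one petabyte

# At 197 PB per day (the 2018 figure on the slide), how long until
# the cumulative traffic reaches one exabyte?
pb_per_day <- 197
days_to_one_eb <- (bytes_per_eb / bytes_per_pb) / pb_per_day

cat("1 PB =", tb_per_pb, "TB\n")
cat("Days to move 1 EB at 197 PB/day:", round(days_to_one_eb, 2), "\n")
```

At that rate an exabyte accumulates in under a week, which is the sense in which the exabyte era has already arrived.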

Characteristics of Big Data
How do you recognize if it's big data or not?

Computer scientist

"My computer is low on memory for handling this data!!"
That is Big Data.

"No!!!! This data is over 2 TB. Where do I store it?????"
That is Big Data.

In short, if you run into trouble processing data on your computer (to the point of a meltdown), Big Data is likely the cause.
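
A back-of-the-envelope check for the "low on memory" case is sketched below. It assumes a dense numeric matrix, which R stores at 8 bytes per value; the helper name `estimate_gb` and the example sizes are hypothetical.

```r
# Estimate the memory needed to hold an n-by-p numeric (double) matrix.
# R stores each double in 8 bytes; the small fixed object header is ignored.
estimate_gb <- function(n, p, bytes_per_value = 8) {
  n * p * bytes_per_value / 1e9
}

# Hypothetical data set: 10 million samples with 1,000 attributes
needed_gb <- estimate_gb(1e7, 1e3)
cat("Approximate memory needed:", needed_gb, "GB\n")

# Compare against a small vector that R can actually measure
x <- numeric(1e6)
print(object.size(x), units = "MB")
```

If the estimate exceeds the RAM of the machine, the data cannot be loaded in one piece, which is the computer scientist's symptom described above.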


Statistician

"When does this calculation end? I have already been waiting for 10 years ..."
That is Big Data.

"Dimensionality is too high!!!! I can’t build a statistical model using this data!!!"
That is Big Data.

In short, if you run into trouble analyzing data on your computer (to the point of a meltdown), Big Data is likely the cause.
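
The "dimensionality is too high" complaint can be reproduced on a tiny scale. The sketch below uses simulated data (an assumption, not course data) with more attributes than samples and shows that ordinary least squares via `lm()` cannot estimate every coefficient.

```r
set.seed(1)                       # reproducible random numbers
n <- 10                           # sample size
p <- 20                           # dimensionality (more attributes than samples)

X <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- rnorm(n)

fit <- lm(y ~ X)                  # ordinary least squares with an intercept

# With p + 1 unknowns and only n observations, not every coefficient
# can be estimated; lm() reports the rest as NA.
n_unestimated <- sum(is.na(coef(fit)))
cat("Coefficients requested:", p + 1, "\n")
cat("Coefficients that could not be estimated:", n_unestimated, "\n")
```

This is one reason dimensionality reduction and feature selection appear later in the schedule.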

Core technologies of Big Data era
IT technologies that resolve issues arising from Big Data

[Diagram with two sides, Software and Hardware, listing:]

 Prescreening techniques
 Data visualization
 Feature selection
 Parallel processing
 Cloud computing
 Distributed processing

Difficulties arise in both hardware and software.

But students can tackle the software difficulties.
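
As one example of a software-side remedy, here is a minimal prescreening sketch. The variance filter shown is just one simple choice among many, and the function name `prescreen_by_variance`, the threshold, and the toy data are assumptions for illustration.

```r
# Prescreening sketch: drop attributes whose variance is (near) zero,
# since they cannot help distinguish one sample from another.
prescreen_by_variance <- function(data, min_var = 1e-8) {
  variances <- vapply(data, var, numeric(1))
  data[, variances > min_var, drop = FALSE]
}

set.seed(42)
toy <- data.frame(
  informative = rnorm(100),
  constant    = rep(5, 100),   # carries no information
  also_useful = runif(100)
)

screened <- prescreen_by_variance(toy)
print(names(screened))
```

Removing uninformative attributes before modeling reduces dimensionality without any extra hardware.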

Computational language for Big Data
R and Python

 There are two representative computer languages for Big Data analysis: R and Python.

 The R programming language (free and relatively easy) will be used for the hands-on lectures.

 Let’s visit the R homepage:

https://cran.r-project.org/

Install R
(Step 1) Download the R installer

(Step 2) Download RStudio

 Download RStudio from https://www.rstudio.com/products/rstudio/download/

(Step 3) Install R and RStudio
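
Once both installers have finished, a quick sanity check is to type a couple of lines into the R console (or the Console pane of RStudio). This is only a suggested check, not part of the original installation steps.

```r
# Run these lines in the R console (or the RStudio console) after installing.
cat(R.version.string, "\n")      # prints the installed R version

# A trivial computation confirms the interpreter responds
result <- 1 + 1
print(result)
```

If a version string and the number 2 appear, R is installed and working.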

What is R
 R is an interpreted computer language.

 Procedures written in C, C++, and other languages can be interfaced for efficiency.

 System commands can be called from within R.

 R is used for data manipulation, statistics, and graphics.
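
The three uses named in the last bullet can be seen together in a short session. The sketch below relies on `mtcars`, a data set that ships with R; choosing it is an assumption for illustration, and the plot is sent to a null device so that no file is written.

```r
# Built-in data set shipped with R: 32 cars, 11 attributes
data(mtcars)

# Data manipulation: keep cars with 4 cylinders
four_cyl <- subset(mtcars, cyl == 4)

# Statistics: mean fuel economy and a simple linear model
mean_mpg <- mean(four_cyl$mpg)
fit <- lm(mpg ~ wt, data = mtcars)

# Graphics: draw a scatter plot to a null device (no file is written)
pdf(NULL)
plot(mtcars$wt, mtcars$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(fit)
invisible(dev.off())

cat("4-cylinder cars:", nrow(four_cyl), "\n")
cat("Mean mpg (4 cyl):", round(mean_mpg, 2), "\n")
```

Because R is interpreted, each line can also be run one at a time in the console to inspect intermediate results.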

R, S, and S-plus (History of R)
 S: an interactive environment for data analysis developed at Bell Laboratories since 1976
1988 - S2: RA Becker, JM Chambers, A Wilks
1992 - S3: JM Chambers, TJ Hastie
1998 - S4: JM Chambers

 Exclusively licensed by AT&T/Lucent to Insightful Corporation, Seattle WA. Product name: “S-plus”. Implementation languages: C, Fortran.

 R: initially written by Ross Ihaka and Robert Gentleman at the Dept. of Statistics, University of Auckland, New Zealand, during the 1990s.

 Since 1997: international “R-core” team of ca. 15 people with access to a common CVS archive.

What R does and does not
 What R can do
(1) Data handling and storage: numeric, textual
(2) Matrix algebra
(3) Hash tables and regular expressions
(4) High-level data analytic and statistical functions
(5) OOP (classes)
(6) Graphics
(7) Programming language: loops, branching, subroutines, etc.

 What R does not do
(1) R is not a database, but it connects to DBMSs
(2) R has no GUI, but it connects to Java and Tcl/Tk
(3) R is fundamentally slow, but it lets you call your own C/C++ code
(4) R has no spreadsheet view of data, but it connects to Excel/MS Office
(5) R has no professional or commercial support

 But all R users in the world are developers (the power of collective intelligence).

 If you make a meaningful package, you can publish it at any time, within 1 second.

 Therefore, the latest algorithms become available in R faster than in any other programming language.
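
A few of the listed capabilities can be tried directly in base R. The sketch below touches matrix algebra, a hash-table-like environment, a regular expression, and basic control flow; all object and function names are made up for this example.

```r
# Matrix algebra: solve the 2x2 linear system A x = b
A <- matrix(c(2, 1, 1, 3), nrow = 2)
b <- c(3, 5)
x <- solve(A, b)

# Hash table: an environment maps string keys to values
ages <- new.env()
assign("Student 1", 24, envir = ages)
assign("Student 2", 26, envir = ages)
age_of_two <- get("Student 2", envir = ages)

# Regular expression: which labels contain a digit?
has_digit <- grepl("[0-9]+", c("Student 1", "Student N"))

# Programming constructs: a function (subroutine) with a loop and a branch
count_even <- function(values) {
  total <- 0
  for (v in values) {
    if (v %% 2 == 0) total <- total + 1
  }
  total
}

print(x)
print(age_of_two)
print(has_digit)
print(count_even(1:10))
```

Everything here uses only functions that come with a standard R installation, so no extra packages are needed.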



End of Slide
