
Introduction to Big Data with Spark and Hadoop

Module 1 Glossary: What Is Big Data?


Welcome! This alphabetized glossary contains many of the terms used in this course, along with additional industry-recognized terms not used in the course videos. These terms are essential to recognize when working in the industry, participating in user groups, and taking other professional certificate programs.

Estimated reading time: 12 minutes

Term Definition

Apache Spark An open-source, in-memory application framework used for distributed data processing and iterative analysis of large data sets.
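
For illustration, the in-memory, distributed style of processing Spark supports can be sketched in a few lines of PySpark; the input file data.txt is a placeholder, and this is only a minimal sketch, not a complete application.

# Minimal PySpark sketch: count word frequencies across a distributed data set.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# "data.txt" is a placeholder path; Spark reads it into a distributed collection.
lines = spark.read.text("data.txt").rdd.map(lambda row: row[0])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.take(10))  # first ten (word, count) pairs, computed in memory
spark.stop()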

Apache HBase A robust, column-oriented NoSQL datastore that runs on top of the Hadoop Distributed File System (HDFS) as part of the Hadoop ecosystem, designed for random, real-time access to very large tables.

Business intelligence (BI) Encompasses various tools and methodologies designed to convert data into actionable insights efficiently.

Big data Data sets whose volume, velocity, or variety exceeds the capacity of conventional relational databases to capture, manage, and process with minimal latency. Key characteristics of big data include substantial volume, high velocity, and diverse variety.

Big data analytics Uses advanced analytic techniques against large, diverse big data sets that include structured, semi-structured, and unstructured data from different sources and sizes, from terabytes to zettabytes. It helps companies gain insights from the data collected by IoT devices.

Big data programming tools The final component of big data commercial tools. These programming tools perform large-scale analytical tasks and operationalize big data, and they provide all necessary functions for data collection, cleaning, exploration, modeling, and visualization. Popular languages used for this programming include R, Python, SQL, Scala, and Julia.

Committer Most open-source projects have formal processes for contributing code and include various levels of influence and obligation to the project: committer, contributor, user, and user group. Typically, committers can modify the code directly.

Cloud computing Allows customers to access infrastructure and applications over the internet without needing on-premises installation and maintenance. By leveraging cloud computing, companies can utilize server capacity on demand and rapidly scale up to handle the extensive computational requirements of processing large data sets and executing complex mathematical models.

Cloud providers Offer essential infrastructure and support, providing shared computing resources encompassing computing power, storage, networking, and analytical software. These providers also offer software-as-a-service solutions that enable enterprises to gather, process, and visualize data efficiently. Prominent examples of cloud service providers include AWS, IBM, GCP, and Oracle.

Extract, transform, and load (ETL) process A systematic approach that involves extracting data from various sources, transforming it to meet specific requirements, and loading it into a data warehouse or another centralized data repository.
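
For illustration only, a minimal Python sketch of the three ETL stages; the file sales_raw.csv, the column names, and the SQLite warehouse are placeholders standing in for real sources and repositories.

# ETL sketch: extract from a CSV source, transform the rows, load into a repository.
import csv
import sqlite3

# Extract: read raw rows from a placeholder source file.
with open("sales_raw.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: clean and convert fields to meet the target schema's requirements.
cleaned = [(r["order_id"], r["customer"].strip().title(), float(r["amount"]))
           for r in rows if r["amount"]]

# Load: write the transformed rows into a centralized repository (here, SQLite).
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (order_id TEXT, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)
conn.commit()
conn.close()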

Hadoop An open-source software framework that provides dependable distributed processing for large data sets through the utilization of simplified programming models.

Hadoop Distributed File System (HDFS) A file system distributed across multiple file servers, allowing programmers to access or store files from any network or computer. It is the storage layer of Hadoop. It works by splitting files into blocks, creating replicas of the blocks, and storing them on different machines. It is built to access streaming data seamlessly and uses a command-line interface to interact with Hadoop.
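
For illustration, the HDFS command-line interface can be driven from Python as in the sketch below; the local file data.txt and the directory /user/student are placeholders, and the hdfs command must be available on a Hadoop node.

# Sketch: interacting with HDFS through its command-line interface.
import subprocess

# Copy a local file into HDFS; HDFS splits it into blocks and replicates them.
subprocess.run(["hdfs", "dfs", "-put", "data.txt", "/user/student/data.txt"], check=True)

# List the directory to confirm the file is now stored in HDFS.
subprocess.run(["hdfs", "dfs", "-ls", "/user/student"], check=True)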

Hive A data warehouse infrastructure employed for data querying and analysis, featuring an SQL-like interface. It facilitates report generation and utilizes a declarative programming language, enabling users to specify the data they want to retrieve.
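
As a small illustration of the declarative, SQL-like style of querying, the sketch below submits a query through PySpark's Hive support; the table sales and its columns are placeholders, and a configured Hive metastore is assumed.

# Sketch: a declarative, SQL-like query run against a Hive table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HiveQuery").enableHiveSupport().getOrCreate()

# Users state what data they want; the engine decides how to retrieve it.
report = spark.sql("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
""")
report.show()
spark.stop()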

Internet of Things (IoT) A system of physical objects connected through the internet. A thing or device can be a smart device in our homes or a personal communication device such as a smartphone or computer. These devices collect and transfer massive amounts of data over the internet without manual intervention by using embedded technologies.

Machine data Refers to information generated by various sources, including Internet of Things (IoT) sensors embedded in industrial equipment, as well as weblogs that capture user behavior and interactions.

Map The map phase of MapReduce converts a set of data into another set of data in which individual elements are broken down into tuples (key-value pairs).

MapReduce A programming model and processing technique for distributed computing, based on Java. It splits the data into smaller units and processes big data in parallel. It was the first method used to query data stored in HDFS, and it allows massive scalability across hundreds or thousands of servers in a Hadoop cluster.
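
To make the map and reduce phases concrete, here is a small pure-Python sketch of the classic word-count pattern; it only illustrates the key-value flow and is not Hadoop's Java API.

# Pure-Python sketch of the MapReduce word-count pattern (illustrative only).
from itertools import groupby
from operator import itemgetter

documents = ["big data tools", "big data analytics"]

# Map phase: each input element is converted into (key, value) tuples.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the tuples by key so each reducer sees one key at a time.
mapped.sort(key=itemgetter(0))

# Reduce phase: combine the values for each key into a single result.
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}

print(counts)  # {'analytics': 1, 'big': 2, 'data': 2, 'tools': 1}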

NoSQL databases Databases built from the ground up to store and process vast amounts of data at scale, supporting a growing number of modern businesses. They store data in non-tabular formats, such as documents, rather than in relational tables. Types of NoSQL databases include pure document databases, key-value stores, wide-column databases, and graph databases; examples include MongoDB, CouchDB, Cassandra, and Redis.
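
As a small illustration of document-style storage, the sketch below keeps an entire record, including its nested line items, in a single document; the use of MongoDB through pymongo, the connection string, and the database and collection names are all assumptions made for the example.

# Sketch: storing one record as a single document in a document database.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes a local MongoDB server
orders = client["shop"]["orders"]

# The whole order lives in one document rather than being split across tables.
orders.insert_one({
    "order_id": "A-1001",
    "customer": {"name": "A. Rahman", "city": "Kuala Lumpur"},
    "items": [{"sku": "BK-42", "qty": 2}, {"sku": "PN-07", "qty": 1}],
})

print(orders.find_one({"order_id": "A-1001"}))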

Open-source software Not only is the runnable version of the code free, but the source code is also completely open, meaning that every line of code is available for people to view, use, and reuse as needed.

Price analytics Helps businesses understand market segmentation, identify the best price points for a product line, and perform margin analysis for maximum profitability.

Relational databases Databases in which data is structured in tables with rows and columns. These tables are interconnected using primary and foreign keys to establish relationships across the data set.
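
A minimal sketch of two related tables, using Python's built-in sqlite3 module for illustration; the table and column names are placeholders.

# Sketch: two tables linked by a primary key / foreign key relationship.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount      REAL
    );
""")
conn.execute("INSERT INTO customers VALUES (1, 'A. Rahman')")
conn.execute("INSERT INTO orders VALUES (100, 1, 250.0)")

# Join the tables through the shared key to relate each order to its customer.
for row in conn.execute(
        "SELECT c.name, o.amount FROM orders o "
        "JOIN customers c ON o.customer_id = c.customer_id"):
    print(row)  # ('A. Rahman', 250.0)
conn.close()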

Sentiment analysis Utilizes social media conversations to gain insights into consumer opinions about a product. It is used to develop effective marketing strategies and establish customer connections based on their sentiments and preferences.

Social data Comes from the likes, tweets and retweets, comments, video uploads, and general media that are uploaded and shared via the world's favorite social media platforms. By contrast, machine-generated data and business-generated data are data that organizations generate within their own operations.

Transactional data Generated from all the daily transactions that take place both online and offline, such as invoices, payment orders, storage records, and delivery receipts.

Velocity The speed at which data arrives. Velocity is one of the four main components used to describe the dimensions of big data.

Volume The increase in the amount of data stored over time. Volume is one of the four main components used to describe the dimensions of big data.

Variety The diversity of data or the various data forms that need to be stored. Variety is one of the four main components used to describe the dimensions of big data.

Veracity The certainty or accuracy of data; with a large amount of data available, it is difficult to determine whether the data collected is accurate. Veracity is one of the four main components used to describe the dimensions of big data.

Yet Another Resource Negotiator (YARN) Serves as the resource manager bundled with Hadoop and is typically the default resource manager for numerous big data applications, such as Hive and Spark. While it remains a robust resource manager, more contemporary container-based resource managers, such as Kubernetes, are gradually emerging as the new standard practice in the field.
