Data Science & Big Data Analysis Module 1,2,3,4,5
# Data Mining:-Data mining is the process of searching and analyzing a large amount of
raw data in order to identify patterns and extract useful information.
-Companies use data mining software to learn more about their customers.
-It can help them to develop more effective marketing strategies, increase sales, and
decrease costs.
-Data mining is sometimes referred to as "knowledge discovery in databases" (KDD).
-It is also used in credit risk management, fraud detection, and spam filtering.
-The knowledge discovery process includes Data cleaning, Data integration, Data selection,
Data transformation, Data mining, Pattern evaluation, and Knowledge presentation.
→Steps of Data mining/ KDD process:-The process begins with determining the KDD
objectives.
-and ends with the implementation of the discovered knowledge.
-Steps Involved in KDD Process are;
● Data Cleaning:- This is also known as data cleansing.
-This is the phase in which noisy, inconsistent and irrelevant data are removed from the
collection.
("Noise" is a random error or variance in a measured variable.)
● Data integration:- In this step, multiple data sources may be combined and put into a
single source (a data warehouse).
● Data selection:- In this step, the data relevant to the analysis is decided on and retrieved
from the data collection.
● Data transformation:- This is the phase in which the selected data is transformed into
forms appropriate for the mining procedure,
eg: by performing summary or aggregation operations.
● Data mining:-It is the crucial step in which intelligent techniques are applied to
extract patterns which are potentially useful.
● Pattern evaluation:- It identifies the truly interesting patterns representing knowledge
based on given interestingness measures.
● Knowledge presentation:-This is the final phase, in which the discovered knowledge
is visually represented to the user,
using visualization and knowledge representation techniques to present
the mined knowledge to the user.
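A minimal Python sketch of these KDD phases on a toy table (assuming pandas and
scikit-learn are available; the column names and values are invented for illustration, not
taken from the notes):

```python
# Sketch of the KDD phases on invented data.
import pandas as pd
from sklearn.cluster import KMeans

# Data cleaning: remove noisy / inconsistent records (here, rows with missing values).
raw = pd.DataFrame({
    "age":    [23, 45, None, 31, 52, 29],
    "income": [28000, 64000, 41000, None, 88000, 35000],
})
clean = raw.dropna()

# Data selection and transformation: keep the relevant attributes and scale them.
selected = clean[["age", "income"]]
transformed = (selected - selected.mean()) / selected.std()

# Data mining: apply an intelligent technique (clustering here) to extract patterns.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(transformed)

# Pattern evaluation / knowledge presentation: show the discovered groups to the user.
clean = clean.assign(cluster=model.labels_)
print(clean)
```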
*Types of data that can be mined / types of databases used in Data
Mining:-The most basic forms of data for mining applications are given below.
-Different kinds of data can be mined; some examples are mentioned below.
1. Database Data :-A database system, also called a database management system
(DBMS), stores a collection of data that are related to one another.
-It also has a set of software programs that are used to manage the data and provide
easy access to it.
-These software programs serve many purposes, including defining the structure of the
database and making sure that the stored information remains secure and consistent.
(The relational database also comes under this.)
-A relational database is a collection of tables, each with a unique name and a set of
attributes, which can store rows or records of large data sets.
-Every record stored in a table has a unique key. An entity-relationship (ER) model is created
to provide a representation of a relational database in terms of entities and the
relationships that exist between them.
2. Data Warehouses:- A data warehouse is a single data storage location that collects
data from multiple sources and then stores it under a unified schema.
-When data is stored in a data warehouse, it undergoes cleaning, integration,
transformation, loading, and periodic refreshing.
-Data stored in a data warehouse is organized around major subjects and is usually
summarized.
-If you want information on data that was stored 6 or 12 months back, you will get it
in the form of a summary.
-For example, rather than storing the details of each sales transaction, the data
warehouse may store a summary of the transactions per item (see the aggregation sketch
after this list).
4. Other Kinds of Data:-We have a lot of other types of data as well that are known for
their structure, semantic meanings, and versatility.
-They are used in a lot of applications. Here are a few of those data types: data
streams, engineering design data, sequence data, graph data, spatial data,
multimedia data, and more.
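As mentioned under Data Warehouses above, the warehouse often keeps a per-item summary
rather than every transaction. A small pandas sketch of that idea (the items and amounts are
invented):

```python
# Summarising transactions per item, the way a warehouse might store them (toy data).
import pandas as pd

transactions = pd.DataFrame({
    "item":   ["pen", "pen", "book", "book", "pen"],
    "amount": [10, 12, 150, 140, 11],
})

# Instead of keeping every sales transaction, keep one summary row per item.
summary = transactions.groupby("item")["amount"].agg(["count", "sum"])
print(summary)
```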
*data mining Technologies:-Data mining has incorporated many techniques from other
domain fields like machine learning, statistics, information retrieval, data warehouse,
pattern recognition, algorithms, and high-performance computing.
-The major technologies utilized in data mining are;
1. Machine Learning: -It can automatically learn based on the given input data and
make intelligent decisions.
-There are similarities and interrelations between machine learning and data mining.
-For classification and clustering tasks, machine learning techniques are often applied to
improve prediction accuracy.
-These are the problems in machine learning that are highly related to data mining:
● Supervised learning
● Unsupervised learning
● Semi-supervised learning
● Active learning
2. Information Retrieval: -This technique searches for information in documents,
which may be text or multimedia, or may reside on the Web.
-The most widely used information retrieval approach is the probabilistic model.
-Information retrieval combined with data mining techniques is used for finding
relevant topics in documents or on the web.
3. Statistics:-Data mining has a natural connection with statistics.
-Statistics are useful for pattern mining.
-When a statistical model is used on a large data set, the computational complexity and cost increase.
-When data mining is used to handle large real-time and streamed data, computation
costs increase dramatically.
4. Database System & Data warehouse:- google
*Data mining Applications:-Here is the list of areas where data mining is widely used
● Healthcare:-Data mining has a lot of promise for improving healthcare systems.
-It identifies best practices for improving treatment and lowering costs using data
and analytics.
-Machine learning, soft computing, data visualization, and statistics are among the
data mining techniques used by researchers.
-The processes developed help ensure that patients receive appropriate care at the right
place and at the right time.
-Healthcare insurers can employ data mining to detect fraud and misuse.
● Banking and Finance:-The banking industry is now dealing with and managing
massive volumes of data and transaction information as a result of digitalization.
-With its capacity to detect patterns, causalities, market risks, and other connections
that are critical for managers to be aware of, data mining applications in banking can
easily be the suitable answer.
● Market Basket Analysis:-Market Basket Analysis is a method for analyzing the
purchases made by a consumer in a supermarket.
-This notion identifies a customer's habit of regular purchases.
● Criminal Investigation:-Data mining activities are also used in criminology, which is
the study of crime characteristics.
-First, text-based crime reports need to be converted into word processing files.
-Then, the identification and crime-matching process takes place by
discovering patterns in massive stores of data.
1.2 Noisy Data: Noisy data is meaningless data that can't be interpreted by
machines.
-It can be generated due to faulty data collection, data entry errors, etc.
-It can be handled in the following ways:
● Binning Method: This method works on sorted data in order to smooth it.
-The whole data is divided into segments of equal size, and then various methods
are performed to complete the task.
-Each segment is handled separately.
-One can replace all data in a segment by its mean, or boundary values can be used
to complete the task (see the sketch after this list).
OR
-This method is used to smooth or handle noisy data.
-First the data is sorted, and then the sorted values are separated and stored in the
form of bins.
-There are three methods for smoothing the data in the bins.
-Smoothing by bin mean: in this method, the values in the bin are replaced by the
mean value of the bin;
-Smoothing by bin median: in this method, the values in the bin are replaced by the
median value;
-Smoothing by bin boundary: in this method, the minimum and maximum values of the
bin are taken as the bin boundaries, and each value is replaced by the closest
boundary value.
● Regression: Here data can be made smooth by fitting it to a regression
function.
-The regression used may be linear (having one independent variable)
or multiple (having multiple independent variables).
● Clustering: This approach groups similar data into clusters.
-The outliers may go undetected, or they will fall outside the clusters.
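As referenced in the Binning Method bullet above, here is a rough Python sketch of
equal-size binning with the three smoothing rules (the twelve sorted values are a made-up
toy list, not data from the notes):

```python
# Equal-size binning with smoothing by bin mean, bin median and bin boundary (toy data).
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bin_size = 4
bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

for b in bins:
    by_mean     = [round(sum(b) / len(b), 1)] * len(b)        # smoothing by bin mean
    by_median   = [sorted(b)[len(b) // 2]] * len(b)           # smoothing by bin median
    by_boundary = [b[0] if v - b[0] <= b[-1] - v else b[-1]   # smoothing by bin boundary
                   for v in b]
    print(b, by_mean, by_median, by_boundary)
```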
2. data integration:-Data integration combines data from multiple sources to form a
coherent data store.
-These sources may include multiple databases, data cubes, or flat files.
-There are multiple issues to consider during data integration. They are;
2.1 Schema integration and object matching:-Schema integration is used to
merge two or more existing database schemas into a single schema.
-Schema integration and object matching can be complex.
-For example, matching entity identifiers (emp_id in one database and
emp_no in another database); such issues can be prevented using metadata.
2.2 Redundancy:-Redundancy is another issue.
-An attribute may be redundant if it can be derived or obtained from another
attribute.
-Some redundancies can be detected by correlation analysis.
2.3 Detection and resolution of data value conflicts:-This is the third
important issue in data integration.
-Attribute values from different sources may differ for the same entity.
3. Data reduction:-Data reduction is a process that reduces the volume of original data
and represents it in a much smaller volume.
-Data reduction techniques are used to obtain a reduced representation of the dataset
that is much smaller in volume while maintaining the integrity of the original data.
-By reducing the data, the efficiency of the data mining process is improved.
-Data reduction does not affect the result obtained from data mining.
-That means the result obtained from data mining before and after data reduction is the
same or almost the same.
-The strategies or techniques of data reduction in data mining are:
3.1 Data cube aggregation:-This technique is used to aggregate data in a
simpler form.
-Aggregation operations are applied to the data in the construction of a data
cube.
3.2 Attribute subset selection:-where irrelevant, weakly relevant, or
redundant attributes or dimensions may be detected and removed.
3.3 Dimensionality reduction:-These mechanisms are used to reduce the data
set size.
3.4 Numerosity reduction:-In this reduction technique the actual data is
replaced with mathematical models or a smaller representation of the data
instead of the actual data.
-This technique includes two types: parametric and non-parametric
numerosity reduction.
-Parametric numerosity reduction stores only the model parameters instead of the
original data.
-Non-parametric methods include clustering, histograms, and sampling.
→The sampling methods are (see the sketch after this list):
1.Simple random sample without replacement (SRSWOR)
2.Simple random sample with replacement (SRSWR)
3.Cluster sample
4.Stratified sample
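A small pandas sketch of these sampling methods (the toy DataFrame and the sample sizes
are assumptions for illustration):

```python
# SRSWOR, SRSWR and stratified sampling on a toy dataset.
import pandas as pd

df = pd.DataFrame({"id": range(1, 11), "grade": list("AABBBCCCDD")})

srswor = df.sample(n=4, replace=False, random_state=0)  # simple random sample without replacement
srswr  = df.sample(n=4, replace=True,  random_state=0)  # simple random sample with replacement
stratified = df.groupby("grade").sample(n=1, random_state=0)  # one row from each stratum (grade)
# A cluster sample would instead pick a few whole groups at random and keep all their rows.

print(srswor, srswr, stratified, sep="\n\n")
```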
● Binning:-It is a data smoothing technique that helps to group a huge number of
continuous values into a smaller number of bins.
-For example, if we have data about a group of students and we want to arrange
their marks into a smaller number of mark intervals, we can make bins of grades:
one bin for grade A, one for grade B, one for C, one for D, and one for grade F (see
the sketch below).
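A quick sketch of that grade-binning example with pandas (the mark thresholds below are
invented):

```python
# Binning continuous marks into a small number of grade bins.
import pandas as pd

marks = pd.Series([35, 48, 62, 71, 83, 90, 55, 67], name="mark")
grades = pd.cut(marks, bins=[0, 40, 55, 70, 85, 100],
                labels=["F", "D", "C", "B", "A"]).rename("grade")
print(pd.concat([marks, grades], axis=1))
```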
*Data objects:-
*Differentiate data warehouse and database
___________________________________________________________________________
Module -2
➢ CONFIDENCE:-It is calculated to check whether the product sales are popular
through individual sales or through combined sales.
-That is, it is calculated as combined transactions / individual transactions:
Confidence = freq(A,B)/freq(A)
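A tiny Python sketch of the formula on invented transactions (the item names are
assumptions, not from the notes):

```python
# Confidence(A -> B) = freq(A, B) / freq(A), computed on toy transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

freq_A  = sum(1 for t in transactions if {"bread"} <= t)          # individual sales of A
freq_AB = sum(1 for t in transactions if {"bread", "milk"} <= t)  # combined sales of A and B

confidence = freq_AB / freq_A
print(f"confidence(bread -> milk) = {confidence:.2f}")  # 2/3 = 0.67
```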
-Scan database D for the creation of C1 and compare each candidate support count (sup.count)
with the minimum support count (which is 2).
-Here there is no support count (sup.count) which is less than 2,
-so collect all itemsets whose count is greater than or equal to the minimum support (which
is 2); this set is denoted by L1.
-Generate table C2 from L1 (with two-item sets).
-Scan table C2 against database D for the support counts
(eg: the combination {l1,l2} appears in database D 4 times).
-Compare the support counts of C2 with the minimum support count
(which is 2).
-The itemsets {l1,l4}, {l3,l4}, {l3,l5} and {l4,l5} have support counts
(sup.count) less than the minimum count 2,
-so remove them and form L2 (see the sketch below).
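A rough Python sketch of the C1 → L1 → C2 → L2 steps above. The database D table itself is
not reproduced in these notes, so the transactions below are the standard textbook example
with items l1–l5; they match the counts quoted above ({l1,l2} has support 4, and {l1,l4},
{l3,l4}, {l3,l5}, {l4,l5} fall below the minimum support of 2):

```python
# First two Apriori passes: candidate generation and pruning with min support = 2.
from itertools import combinations

D = [
    {"l1", "l2", "l5"}, {"l2", "l4"}, {"l2", "l3"},
    {"l1", "l2", "l4"}, {"l1", "l3"}, {"l2", "l3"},
    {"l1", "l3"}, {"l1", "l2", "l3", "l5"}, {"l1", "l2", "l3"},
]
min_sup = 2

def support(itemset):
    """Number of transactions in D that contain the whole itemset."""
    return sum(1 for t in D if itemset <= t)

# C1 -> L1: count every 1-itemset and keep those meeting the minimum support.
items = sorted({i for t in D for i in t})
L1 = [frozenset({i}) for i in items if support({i}) >= min_sup]

# L1 -> C2 -> L2: join L1 with itself to get 2-itemsets, then prune by support count.
C2 = [frozenset(c) for c in combinations(sorted({i for s in L1 for i in s}), 2)]
L2 = [c for c in C2 if support(c) >= min_sup]

print("L1:", [sorted(s) for s in L1])
print("L2:", [sorted(s) for s in L2])
```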
-The first step is to scan the database to find the occurrences of the itemsets in the
database.
-This step is the same as the first step of Apriori.
- The count of 1-itemsets in the database is called support count or frequency of 1-itemset.
-According to the counts, the item with the highest count must come first in the transaction
items. Eg: consider the first transaction's items {b,a}; here a's count is 8 and b's count is 7,
-so {b,a} is rearranged to {a,b}.
-We need to rearrange all of them according to their counts.
-From then on we do not use the transaction items column; instead we use the rearranged
transaction items column.
-After completing all the transactions in the transaction table we get the FP tree.
-We can draw arrows between corresponding nodes.
-Once we obtain the FP tree, we find the frequent itemsets which end with e.
-To find the itemsets ending with e, we need to consider the paths in the FP tree
which end with e.
-These are the paths ending with e:
{acd:1},{ad:1},{bc:1}
-With these paths we count each item appearing in the paths.
For example we have 2 a in the paths,
and 1 b, 2 c, and 2 d.
-Our support count is 2, so we cancel items whose count is less than
2; here b:1, so we cancel it.
-In the next column we try all the possibilities from
the "count of each item in path" column that
end with e.
-These are the combinations that can be formed
from column 3.
-We cancel all the combinations whose counts are less than the support count,
and at last we form the frequent itemsets ending with e (see the sketch below).
-Next we form the frequent itemsets ending with d, c, b and a.
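A short sketch of the counting and pruning step for the paths ending with e (the path counts
are the ones listed above; the minimum support is 2):

```python
# Conditional pattern base for e: count items along the prefix paths, then prune.
min_sup = 2
paths_ending_with_e = {("a", "c", "d"): 1, ("a", "d"): 1, ("b", "c"): 1}

counts = {}
for path, n in paths_ending_with_e.items():
    for item in path:
        counts[item] = counts.get(item, 0) + n   # weight each item by its path count

frequent = {item: c for item, c in counts.items() if c >= min_sup}
print(frequent)  # b is dropped (count 1); a, c, d survive and can extend itemsets ending with e
```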
*Cluster:-Clustering groups data into clusters so that objects in the same group are
similar to one another.
-Clustering helps to split data into several subsets.
-Each of these subsets contains data similar to each other, and these subsets are called
clusters.
-Clustering analysis is widely used in areas such as data analysis, market research, pattern
recognition, and image processing.
-Clustering analysis has been a challenging problem in data mining due to its variety of
applications.
→Application of clustering:- Clustering analysis is broadly used in many applications
such as market research, pattern recognition, data analysis, and image processing.
-In the field of biology, clustering can be used to derive plant and animal
taxonomies and to categorize genes with similar functionalities.
-It also helps in the identification of groups of houses in a city according to house type,
value, and geographic location.
-Clustering also helps in classifying documents on the web for information discovery.
-Clustering is also used in outlier detection applications such as detection of credit card
fraud.
→Requirement of clusters / properties
● Scalability − We need highly scalable clustering algorithms to deal with large
Databases.
● Ability to deal with different kinds of attributes − Algorithms should be capable of
being applied to any kind of data, such as interval-based (numerical), categorical,
and binary data.
● Discovery of clusters with attribute shape − The clustering algorithm should be
capable of detecting clusters of arbitrary shape. They should not be bounded to only
distance measures that tend to find spherical clusters of small size.
● High dimensionality − The clustering algorithm should not only be able to handle
low-dimensional data but also the high dimensional space.
● Ability to deal with noisy data − Databases contain noisy, missing or erroneous data.
Some algorithms are sensitive to such data and may lead to poor quality clusters.
*Main memory-based clustering algorithms typically operate on either of the following two
data structures.
● Data matrix:-This represents n objects, such as persons,
with p variables such as age, height, gender and so on.
-The structure is in the form of a relational table, or an n-by-p
matrix (n objects * p variables).
● Dissimilarity matrix:-This stores a collection of proximities that are available for all
pairs of n objects.
-It is often represented by an n-by-n table (see the sketch below).
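A small numpy sketch of the two structures (the objects and their attribute values are
invented):

```python
# Data matrix (n objects x p variables) and the derived dissimilarity matrix (n x n).
import numpy as np

# Data matrix: 3 objects described by p = 2 variables, e.g. [age, height].
X = np.array([
    [25, 170.0],
    [32, 165.0],
    [41, 180.0],
])

# Dissimilarity matrix: pairwise Euclidean distances d(i, j), with zeros on the diagonal.
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))
print(D)
```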
___________________________________________________________________________
Module-3-Introduction to Data Science
*Data science:-Data science is a deep study of the massive amount of data, which
involves extracting meaningful insights from raw, structured, and unstructured data that is
processed using the scientific method, different technologies, and algorithms.
-Data science and big data are used almost everywhere in both commercial and
noncommercial fields.
-In commercial companies, almost every industry uses data science and big data to
understand their customers, processes, staff, competition, and products.
-Many companies use data science to offer customers a better user experience, as well as to
cross-sell, up-sell, and personalize their offerings.
*Facets of data science:-In data science and big data you’ll come across many different
types of data, and each of them tends to require different tools and techniques.
-The main categories of data are these: (7 types)
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
1. Structured:-Structured data means the data will be arranged in rows and columns.
-Structured data is data that depends on a data model and resides in a fixed field
within a record.
-It's often easy to store structured data in tables within databases or Excel files.
-SQL, or Structured Query Language, is the preferred way to manage and query data
in databases.
-You may also come across structured data that gives you a hard time storing it in
a traditional relational database.
-Hierarchical data such as a family tree is one such example.
2. Unstructured:-Unstructured data is data that isn’t easy to fit into a data model.
-because the content is context-specific or varying.
-One example of unstructured data is your regular email.
-A human-written email is also a perfect example of natural language data.
3. Natural language:-Natural language is a special type of unstructured data; it’s
challenging to process because it requires knowledge of specific data science
techniques and linguistics.
-For example, conversations carried out through letters, email, text messages, essays, etc.
are all represented in natural language.
4. Machine-generated:-Machine-generated data is data that's automatically created by a
computer, process, application or other machine without human intervention.
-Machine-generated data is becoming a major data resource and will continue to do
so.
-Due to the huge amount and speed of machine data, highly scalable technologies
are required for analysis.
5. Graph-based:-"Graph data" can be a confusing term because any data can be shown in a
graph.
-The graph structures use nodes, edges, and properties to represent and store
graphical data.
-Graph-based data is a natural way to represent social networks, and its structure
allows you to calculate specific metrics such as the shortest path between two
people etc.
-Examples of graph-based data can be found on many social media websites.
6. Audio, video, and images:-Audio, image, and video are data types that create
specific challenges to a data scientist.
-Tasks that are easy for humans, such as recognizing objects in pictures, turn out to be
challenging for computers.
7. Streaming:-While streaming data can take almost any of the previous forms, it has an
extra property: the data flows into the system when an event happens, instead of being
loaded into a data store in a batch.
*Data preparation:- (the rest of this topic is covered earlier in these notes)
-The steps involved in preprocessing are:
● Data cleaning
● Data transformation
● Combining data
-One table contains the observations from the month of January and the second table
contains the observations from the month of February.
The result of appending these tables is a larger one with the observations from
January as well as February.
* Big Data in healthcare (9 marks) :-One of the most notable areas where data analysis
is making big changes is healthcare.
-In fact, healthcare analytics has the potential to reduce costs of treatment, predict
outbreaks of epidemics, avoid preventable diseases, and improve the quality of life in
general.
-The average human lifespan is increasing across the world population, which poses new
challenges to today’s treatment delivery methods.
-Health professionals, just like business entrepreneurs, are capable of collecting massive
amounts of data and looking for the best strategies to use these numbers.
-Big data is revolutionizing the healthcare industry and changing how we think about patient
care.
-In this case, big data refers to the vast amounts of data generated by healthcare systems
and patients, including electronic health records, claims data, and patient-generated data.
-With the ability to collect, manage, and analyze vast amounts of data, healthcare
organizations can now identify patterns, trends, and insights that can inform
decision-making and improve patient outcomes.
-big data also poses challenges and limitations that must be addressed.
-Such challenges and limitations include managing and analyzing vast amounts of data, and
ethical considerations, such as patient privacy.
-Data collection and management are crucial aspects of using big data in healthcare
decision-making.
-The types of data collected in healthcare include electronic health records (EHRs), claims
data, and patient-generated data.
-EHRs contain a wide range of patient information, such as medical history, medications, and
lab results, which can be used to identify patterns and trends in patient care.
-On the other hand, claims data includes information about insurance claims, such as the
cost of treatments, and can be used to identify patterns in healthcare spending.
-Patient-generated data, including data from wearables, surveys, and patient-reported
outcomes, can provide valuable insights into patients’ experiences and preferences.
-To overcome these challenges, healthcare organizations must have robust data
management and security systems in place.
-This includes investing in data integration and warehousing tools to enable data to be easily
integrated and analyzed.
→ The three V's of Big Data (3 marks):-The 3 V's (volume, velocity and variety) are three
defining properties of big data.
-Volume refers to the amount of data,
-velocity refers to the speed of data processing,
-and variety refers to the number of types of data.
→ Explain Big Data and algorithmic trading (3 marks):-Big data is a collection of data from
many different sources and is often described by the three V's: volume, variety and velocity.
-The data in the big data contains greater variety, arriving in increasing volumes and with
more velocity.
-Algorithmic trading is the use of computer programs for entering trading orders, in which
computer programs decide on almost every aspect of the order, including the timing, price,
and quantity of the order etc.
→ Applications of Big Data (3) :-The term Big Data refers to large amounts of
complex and unprocessed data.
-Nowadays, companies use Big Data to make their businesses more informed and to allow them
to make better business decisions.
1. Financial and banking sector:-Big data analytics helps banks understand customer
behaviour on the basis of investment patterns, shopping trends, motivation to invest, and
inputs obtained from personal or financial backgrounds.
2. Healthcare:-Big data has started making a massive difference in the healthcare
sector with the help of predictive analytics; medical professionals and health care
personnel can now provide personalized healthcare to individual patients.
3. Government and Military:-The government and military also use these technologies at
high rates.
4. E-commerce:-E-commerce is also an application of Big Data. Maintaining
relationships with customers is essential for the e-commerce industry.
-E-commerce websites use many marketing ideas to retail merchandise to customers,
manage transactions, and implement better, more innovative strategies to improve
their businesses with Big Data.
5. Social Media:-Social media is the largest data generator. Statistics show
that around 500+ terabytes of fresh data are generated from social media daily,
particularly on Facebook.
-The data mainly contains videos, photos, message exchanges, etc.
6. Advertisement (3 marks):-Big data allows your company to accumulate more data on
your visitors so you can target consumers with tailored advertisements that they are
more likely to view.
-Advertisers identify their target audience based on demographics, past customers,
and other factors that might suggest the user would be interested in the ad.
___________________________________________________________________________
Module - 4
*Big data technologies :-This technology is primarily designed to analyze, process and
extract information from a large data set and a huge set of extremely complex structures.
-This is very difficult for traditional data processing software to deal with.
-Big Data technology is primarily classified into the following two types:
1. Operational Big Data Technologies:-This type of big data technology mainly includes
the basic day-to-day data that people generate and process.
-Typically, operational big data includes daily data such as online
transactions, social media platforms, etc.
Eg: online ticket booking systems, e.g., buses, trains, flights, and movies, etc.
-Online trading or shopping on e-commerce websites like Amazon, Flipkart,
Walmart, etc.
2. Analytical Big Data Technologies:-Analytical Big Data is commonly referred to as an
improved version of Big Data Technologies.
-This type of big data technology is a bit complicated when compared with
operational-big data.
-Analytical big data is mainly used when performance criteria are in use and
important real-time business decisions are made based on reports created by
analyzing operational big data.
Eg: stock market data,
-weather forecasting data and time-series analysis.
(Write both types if the question asks about big data technologies.)
–Top Big Data Technologies:- We can categorize the leading big data
technologies into the following four sections:
● Data Storage:-The leading Big Data Technologies that come under Data Storage are;
➢ Hadoop:-When it comes to handling big data, Hadoop is one of the leading
technologies that come into play.
-This technology is based entirely on map-reduce architecture .
-Hadoop is also best suited for storing and analyzing the data from various
machines with a faster speed and low cost.
-That is why Hadoop is known as one of the core components of big data
technologies.
-Hadoop is written in Java programming language.
➢ MongoDB:-MongoDB is another important component of big data
technologies in terms of storage.
-No relational or RDBMS properties apply to MongoDB because it
is a NoSQL database.
-The structure of the data storage in MongoDB is also different from
traditional RDBMS databases.
-This enables MongoDB to hold massive amounts of data.
● Data Mining:-These are the leading Big Data Technologies that come under Data
Mining:
➢ Presto:-Presto is an open-source, Java-based distributed SQL query engine.
-The size of the data sources it queries can vary from gigabytes to petabytes.
- Companies like Repro, Netflix, Airbnb, Facebook and Checkr are using this
big data technology and making good use of it.
● Data Analytics:-These are the leading Big Data Technologies that come under Data
Analytics:
➢ Apache Kafka:-Apache Kafka is a popular streaming platform.
-This streaming platform is primarily known for its three core capabilities:
publisher, subscriber and consumer.
-It is referred to as a distributed streaming platform.
-It is written in Java language.
-Some top companies using the Apache Kafka platform include Twitter,
Spotify, Netflix, Yahoo, LinkedIn etc.
➢ R-Language:-R is a programming language mainly used for
statistical computing and graphics.
-It is a free software environment used by leading data miners, practitioners
and statisticians.
-The language is primarily beneficial in the development of statistical
software and data analytics.
● Data Visualization:-These are the leading Big Data Technologies that come under Data
Visualization:
➢ Tableau:-Tableau is one of the fastest and most powerful data visualization
tools used by leading business intelligence industries.
-It helps in analyzing data at a very fast speed.
-Tableau is developed and maintained by a company named Tableau Software.
-It is written using multiple languages, such as Python, C, C++, and Java.
➢ Plotly:-As the name suggests, Plotly is best suited for plotting or creating
graphs and relevant components at a faster speed in an efficient way.
*Explain:-
1. Structuring Big Data / Types of Big Data:-Big data can be divided into the
following three categories:
● Structured:-Any data that can be stored, accessed and processed in the form
of a fixed format is termed 'structured' data.
-It is in a tabular form.
-Structured Data is stored in the relational database management system.
-Examples Of Structured Data
-An ‘Employee’ table in a database is an example of Structured Data
*Hadoop:-Hadoop is an open source framework from Apache and is used to store, process
and analyze data which are very huge in volume.
-Hadoop is written in Java and is not OLAP (online analytical processing).
It is used for offline processing.
-It is being used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more.
-Moreover it can be scaled up just by adding nodes in the cluster.
→Features of Hadoop:
● It is fault tolerant.
● It is highly available.
● Its programming is easy.
● It has huge, flexible storage.
● It is low cost.
→Hadoop Architecture:-
-The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS
(Hadoop Distributed File System)
-A Hadoop cluster consists of a single master and multiple slave nodes.
-The master node includes Job Tracker, Task Tracker, NameNode, and DataNode whereas the
slave node includes DataNode and TaskTracker.
-There are three components of Hadoop:
1. Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit.
2. Hadoop MapReduce - Hadoop MapReduce is the processing unit.
3. Hadoop YARN - Yet Another Resource Negotiator (YARN) is a resource management
unit.
● Input splitting:-A MapReduce job in Big Data is divided into fixed-size pieces called
input splits.
-An input split is a chunk of the input that is consumed by a single map task.
● Mapping:-It is the very first phase in the execution of a map-reduce program.
-In this phase the data in each split is passed to a mapping function to produce output values.
● Shuffling:-It consumes the output of the mapping phase.
-Its duty is to gather the appropriate outcomes from the mapping step.
● Reducing:-It takes the output of the shuffling phase.
-It combines the values from the shuffling phase and returns a single output value
(see the sketch below).
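A pure-Python sketch of the four phases on a toy word-count job (no Hadoop involved; the
input lines are invented):

```python
# Input splitting -> mapping -> shuffling -> reducing, illustrated with word counting.
from collections import defaultdict

# Input splitting: the input is divided into fixed-size pieces, one per map task.
splits = ["big data is big", "data is everywhere"]

# Mapping: each split is passed to a map function that emits (key, value) pairs.
mapped = [(word, 1) for split in splits for word in split.split()]

# Shuffling: gather the map outputs and group the values by key.
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# Reducing: combine the grouped values and return a single output value per key.
reduced = {key: sum(values) for key, values in shuffled.items()}
print(reduced)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```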
>MapReduce Architecture:-The entire job is divided into tasks.
-There are two types of tasks namely, Mapping tasks and Reducing tasks.
-Mapping tasks splits the input data and performs mapping while reducing tasks performs
shuffling and aggregates the shuffling values and returns a single output value, thereby
reducing the data.
-The execution of these two tasks is controlled by two entities:
1. Job Tracker - It acts like a master and plays the role of scheduling jobs and tracking
the jobs assigned to Task Tracker.
2. Multiple Task Trackers - They act like slaves.
-Each tracks the jobs and reports the status of the jobs to the master (job tracker).
-In every execution there is only one job tracker and multiple task trackers.
● Client: The MapReduce client is the one who brings the Job to the MapReduce for
processing.
-There can be multiple clients available that continuously send jobs for processing to
the Hadoop MapReduce Manager.
● Job: The MapReduce Job is the actual work that the client wanted to do which is
comprised of so many smaller tasks that the client wants to process or execute.
● Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
● Job-Parts: The task or sub-jobs that are obtained after dividing the main job.
-The result of all the job-parts combined to produce the final output.
● Input Data: The data set that is fed to the MapReduce for processing.
● Output Data: The final result is obtained after the processing.
● Resource Manager: It is the master daemon of YARN and is responsible for resource
assignment and management among all the applications.
-Whenever it receives a processing request, it forwards it to the corresponding node
manager and allocates the resources, fulfilling the request.
● Node Manager:-Its primary job is to keep up with the resource manager.
→Features of HBase
● HBase is linearly scalable.
● It has automatic failure support.
● It provides consistent read and writes.
● It integrates with Hadoop, both as a source and a destination.
● It has an easy Java API for clients.
● It provides data replication across clusters.
→Applications of HBase
● It is used whenever there is a need to write heavy applications.
● HBase is used whenever we need to provide fast random access to available data.
● Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
___________________________________________________________________________
Module -5
→Advantage of RDBMS
● Flexibility
● Ease of use
● Collaboration.
● Built-in security.
● Better data integrity.
● Multi user access.
→DisAdvantage of RDBMS
● Expensive
● Difficult to recover lost data
● Complex software.
→Features of RDBMS:
● Offers information to be saved in the tables.
● Numerous users can access it together which is managed by a single user.
● Virtual tables are available for storing the insightful data.
● In order to exclusively find out the rows, the primary key is used.
● The data are always saved in rows and columns.
● To retrieve the information the indexes are used.
● Columns are being shared between tables using the keys.
*CAP theorem :-The three letters in CAP refer to three desirable properties of distributed
systems with replicated data:
● consistency (among replicated copies),
● availability (of the system for read and write operations)
● partition tolerance (in the face of the nodes in the system being partitioned by a
network fault).
-The CAP theorem states that it is not possible to guarantee all three of the desirable
properties – consistency, availability, and partition tolerance at the same time in a
distributed system with data replication.
-The theorem states that networked shared-data systems can only strongly support two of
the following three properties:
1. Consistency:-A system is said to be consistent if all
nodes see the same data at the same time.
-Simply, if we perform a read operation on a
consistent system, it should return the value of the most
recent write operation.
-This means that the read should cause all nodes to
return the same data, i.e., the value of the most recent write.
2. Availability:-A system is said to be available if every request
received by a non-failing node results in a response,
even while some nodes are down.
3. Partition Tolerance:-This condition states that the system does not fail,
regardless of if messages are dropped or delayed between
nodes in a system.
-Partition tolerance has become more of a necessity
than an option in distributed systems.
-It is made possible by sufficiently replicating records
across combinations of nodes and networks.
-Features of NoSQL :-
1. Dynamic schema: NoSQL databases do not have a fixed schema and can
accommodate changing data structures without the need for migrations or schema
alterations.
2. Horizontal scalability: NoSQL databases are designed to scale out by adding more
nodes to a database cluster, making them well-suited for handling large amounts of
data and high levels of traffic.
3. Document-based: Some NoSQL databases, such as MongoDB, use a document-based
data model, where data is stored in semi-structured format, such as JSON or BSON.
4. Key-value-based: Other NoSQL databases, such as Redis, use a key-value data model,
where data is stored as a collection of key-value pairs (see the sketch after this list).
5. Column-based: Some NoSQL databases, such as Cassandra, use a column-based data
model, where data is organized into columns instead of rows.
6. Distributed and high availability: NoSQL databases are often designed to be highly
available and to automatically handle node failures and data replication across
multiple nodes in a database cluster.
7. Flexibility: NoSQL databases allow developers to store and retrieve data in a flexible
and dynamic manner, with support for multiple data types and changing data
structures.
8. Performance: NoSQL databases are optimized for high performance and can handle
a high volume of reads and writes, making them suitable for big data and real-time
applications.
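A small Python sketch contrasting the key-value and document models described above
(plain in-memory structures standing in for Redis/MongoDB; all keys and values are
invented):

```python
# Key-value vs document data models, illustrated with ordinary Python structures.
import json

# Key-value model (Redis-style): an opaque value stored and looked up by its key.
kv_store = {"session:42": "user=arya;cart=3"}
print(kv_store["session:42"])

# Document model (MongoDB-style): semi-structured JSON documents with a dynamic schema.
orders = [
    {"_id": 1, "customer": "arya", "items": ["pen", "book"]},
    {"_id": 2, "customer": "bran", "items": ["ink"], "coupon": "NEW10"},  # extra field, no schema change
]
print(json.dumps(orders, indent=2))
```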
3. Fair Scheduler:-The Fair Scheduler is very similar to the Capacity
Scheduler.
-The priority of the job is kept in consideration.
-With the help of the Fair Scheduler, the YARN applications can share the resources in a
large Hadoop Cluster, and these resources are maintained dynamically, so there is no
need for prior capacity.
-Eg: Suppose there are two queues, A and B.
Job 1 is submitted to Queue A, and since the
cluster is empty, Job 1 utilizes all the cluster resources.
After some time Job 2 is submitted to Queue B; then fair-share
preemption occurs and both jobs 1 and 2 are
allocated equal resources in their respective queues.
Meanwhile, Job 3 is submitted to Queue B, and since
one job is already running there, the scheduler assigns a fair share to both the jobs in
Queue B with equal resources. This way the Fair Scheduler ensures that all the jobs are
provided with the required resources.
*Hive:-Hive is a data warehouse and an ETL tool which provides an SQL-like interface
between the user and the Hadoop distributed file system (HDFS).
-It is built on top of Hadoop.
-It facilitates reading, writing and handling wide datasets stored in distributed storage
and queried using Structured Query Language (SQL) syntax.
-It is not built for Online Transactional Processing (OLTP) workloads.
-Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and
User Defined Functions (UDF).
→Features of Hive:-These are the following features of Hive:
● Hive is fast and scalable.
● It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce
or Spark jobs.
● It is capable of analyzing large datasets stored in HDFS.
● It allows different storage types such as plain text, RCFile, and HBase.
● It uses indexing to accelerate queries.
● It can operate on compressed data stored in the Hadoop ecosystem.
● It supports user-defined functions (UDFs) where user can provide its functionality.
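A minimal sketch of issuing an HQL query from Python, assuming the third-party PyHive
client library and a HiveServer2 instance reachable at localhost:10000; the sales table and
its columns are invented for illustration.

```python
# Querying Hive over HiveServer2 with PyHive (hypothetical table and columns).
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# HQL looks like SQL but is implicitly translated into MapReduce or Spark jobs.
cursor.execute("SELECT item, SUM(amount) FROM sales GROUP BY item")
for row in cursor.fetchall():
    print(row)
```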
→built-in functions of Hive:-These are functions that are already available in Hive.
-First, we have to check the application requirement, and then we can use these built-in
functions in our applications.
-We can call these functions directly in our application.
-These are some of the functions:
1. Date Functions:-It is used for performing Date Manipulations and Converting Date
types from one type to another type.
Name     Return type    Description
hour     int            returns the hour of a timestamp
minute   int            returns the minute of a timestamp
second   int            returns the second of a timestamp
→ Data types in Hive:-There are two categories of Hive Data types that are primitive data
type and complex data type
-Data Types in Hive specifies the column/field type in the Hive table.
-It specifies the type of values that can be inserted into the specified column.
1. Primitive Type
● Integral Types:-Integer type data can be specified using integral data types,
INT. When the data range exceeds the range of INT, you need to use BIGINT
and if the data range is smaller than the INT, you use SMALLINT. TINYINT is
smaller than SMALLINT.
● String Types:-String type data types can be specified using single quotes (' ')
or double quotes (" "). It contains two data types: VARCHAR and CHAR.
● Date/Time Types:-The DATE value is used to specify a particular year, month
and day, in the form YYYY-MM-DD. However, it does not provide the time of
the day. The range of the DATE type lies between 0000-01-01 and 9999-12-31.
2. Complex Type
*Variables, properties and queries in Hive:- google
3 mark questions
→issues with relational model and non-relational model
___________________________________________________________________________