
Data science

1. What are the different types of machine learning? Explain with examples.


• Machine learning involves showing a large volume of data to a machine so
that it can learn and make predictions, find patterns, or classify data.
• The three machine learning types are supervised, unsupervised, and
reinforcement learning.
1. Supervised learning:- is a type of machine learning where a computer
algorithm learns to make predictions or decisions based on labeled data.
In simple terms, it's like teaching a computer to recognize patterns by
showing it examples with clear labels or instructions.

Here's an example to help illustrate this: Imagine you want to teach a computer to distinguish between pictures of cats and dogs. You start by
gathering a bunch of photos, and for each photo, you tell the computer
whether it's a picture of a cat or a dog. These labels are the "supervision"
in supervised learning.

2. Unsupervised learning:- is a type of machine learning where a computer algorithm tries to find patterns or structure in a dataset without any explicit guidance or labels provided by humans. In other words, it's about the computer learning on its own by exploring the data. For example, a clustering algorithm can group customers into segments based on their purchasing behaviour without being told in advance what the groups should be.

3. Reinforcement learning:- is a type of machine learning where an agent (like a computer program or robot) learns to make decisions by interacting
with an environment. The agent takes actions to maximize a reward signal
it receives based on those actions.

Here's an example: Imagine a computer program that plays a game, like chess. The program is the agent, and the chessboard is the environment.
The program takes actions (moves on the chessboard) and receives a
reward (winning or losing the game) based on those actions.
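
To make the supervised-learning idea concrete, here is a minimal, illustrative scikit-learn sketch (the numeric features and toy labels are invented for this example, not taken from the notes above):

# Toy supervised learning: numeric features stand in for "pictures", labels are cat (0) or dog (1)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = [[4.0, 6.5], [3.5, 6.0], [25.0, 12.0], [30.0, 13.5], [5.0, 7.0], [22.0, 11.0]]  # [weight_kg, ear_length_cm]
y = [0, 0, 1, 1, 0, 1]  # 0 = cat, 1 = dog (these labels are the "supervision")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(model.predict(X_test))        # predicted labels for unseen examples
print(model.score(X_test, y_test))  # accuracy on the held-out set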

2. Explain linear regression with a plotted best-fit line.


• Linear regression is a type of statistical analysis used to predict
the relationship between two variables. It assumes a linear
relationship between the independent variable and the
dependent variable, and aims to find the best-fitting line that
describes the relationship.

• The regression line is sometimes called the "line of best fit" because it is the line that fits best when drawn through the points: it minimizes the sum of the squared distances between the actual values and the values predicted by the line (the least-squares criterion).
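
A minimal sketch of fitting and plotting a best-fit line with NumPy and Matplotlib (the x and y values are invented sample data):

import numpy as np
import matplotlib.pyplot as plt

# Toy data (invented for illustration): hours studied vs. exam score
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([35, 45, 50, 58, 62, 70, 74, 83])

# Fit a straight line y = m*x + c by least squares
m, c = np.polyfit(x, y, deg=1)

plt.scatter(x, y, label="actual data")
plt.plot(x, m * x + c, color="red", label="best-fit line")
plt.xlabel("hours studied")
plt.ylabel("exam score")
plt.legend()
plt.show()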

3. What is a type I error with respect to the null hypothesis? Explain with an example.


• The term type I error is a statistical concept that refers to the incorrect rejection of a true null hypothesis. Put simply, a type I error is a false positive result. Because hypothesis testing always involves some degree of uncertainty, the risk of a type I error can never be fully avoided, only controlled through the chosen significance level. A null hypothesis is established before a test begins and typically assumes there is no effect or no cause-and-effect relationship between the tested item and the stimulus; a type I error means rejecting that assumption even though it is actually true.
• Ex:- Criminal trials: Type I errors commonly occur in criminal trials, where juries are required to come up with a verdict of either innocent or guilty. In this case, the null hypothesis is that the person is innocent, while the alternative is that they are guilty. A jury commits a type I error if it finds the person guilty and sends them to jail despite the person actually being innocent.
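
A small, hedged illustration with SciPy: the significance level alpha chosen before the test is exactly the probability of committing a type I error when the null hypothesis is true (the sample data below is randomly generated for the example):

# alpha is the accepted probability of a type I error when the null hypothesis is true
import numpy as np
from scipy import stats

np.random.seed(0)
sample = np.random.normal(loc=100, scale=15, size=30)  # null hypothesis: true mean = 100

t_stat, p_value = stats.ttest_1samp(sample, popmean=100)
alpha = 0.05  # accepted type I error rate

if p_value < alpha:
    print("Reject the null hypothesis (this decision risks being a type I error)")
else:
    print("Fail to reject the null hypothesis")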

4. Explain normal distribution with its mean and standard deviation.


A normal distribution, often called a "bell curve," is a common statistical pattern that many
things in the world follow. It's a way to describe how data is spread out. Here's how it works
in simple words:

1. The Bell Shape: A normal distribution looks like a symmetrical, bell-shaped curve when
you plot it on a graph. It's called "normal" because it's so commonly seen in nature.
2. Mean (Average): In a normal distribution, the middle of the curve represents the average
or mean value. Most data points are clustered around this average.
3. Standard Deviation: The standard deviation is a measure of how spread out the data is. If
the standard deviation is small, data points are closely packed around the mean. If it's large,
data points are more spread out.

Imagine you're looking at the heights of a group of people. In a normal distribution:
- The peak of the curve represents the most common height, which is the mean (average)
height of the group.
- As you move away from the peak in either direction, the number of people with heights
decreases.
- The standard deviation tells you how much the heights vary. If it's small, most people have
heights very close to the mean. If it's large, some people are very tall, and some are very
short.
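
An illustrative sketch plotting a normal distribution for the height example, assuming a mean of 170 cm and a standard deviation of 10 cm (invented values):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

mean, std = 170, 10           # assumed average height and spread
x = np.linspace(mean - 4 * std, mean + 4 * std, 200)

plt.plot(x, norm.pdf(x, loc=mean, scale=std))   # the bell-shaped curve
plt.axvline(mean, color="red", linestyle="--", label="mean")
plt.xlabel("height (cm)")
plt.ylabel("probability density")
plt.legend()
plt.show()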

5. What is accuracy, precision, recall and F1 score.


1. Accuracy: Accuracy is a measure of how many correct predictions or
decisions you make out of all the predictions or decisions you've made.
It's like asking, "How many times was I right?" High accuracy means
you're making a lot of correct calls.

2. Precision: Precision focuses on how many of the items you identified as positive (e.g., correctly identifying a disease) were actually correct. It's like
asking, "When I said 'yes,' how often was I right?" High precision means that
when you make a positive prediction, you're usually right.

3. Recall: Recall looks at how many of the actual positive items (e.g., people
who have a disease) you successfully identified. It's like asking, "Out of all
the 'yes' cases, how many did I catch?" High recall means you're good at
finding all the positive cases.

4. F1 Score: The F1 score combines precision and recall into a single number.
It's a way to balance these two aspects. It's like finding a compromise
between being right (precision) and finding as many positives as possible
(recall). A high F1 score means you're good at both identifying the right
cases and catching most of them.
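
A minimal sketch computing all four metrics with scikit-learn on a small set of invented true and predicted labels:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual labels (invented)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model predictions (invented)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))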
6. Differentiate between data science, data engineering, machine learning and
deep learning.
1. Data Science: Data science is like being a detective for data. Data
scientists collect, clean, and analyze data to discover insights and make
predictions. They use math and statistics to find patterns in data, which can
help businesses and organizations make informed decisions.

2. Data Engineering: Data engineering is like building the infrastructure for data scientists to work with. Data engineers design and maintain systems to
collect, store, and process data efficiently. They make sure the data is ready
for analysis, so data scientists can do their work effectively.

3. Machine Learning: Machine learning is about teaching computers to learn from data. It's like training a dog to perform tricks. Instead of giving explicit
instructions, you provide data and let the computer learn patterns and
make predictions on its own. It's used in recommendation systems, image
recognition, and more.

4. Deep Learning: Deep learning is a type of machine learning that's inspired by the structure of the human brain. It's like teaching a computer to think in
layers, just like our brain's neurons. Deep learning is excellent for tasks like
understanding natural language (NLP), speech recognition, and complex
image analysis.

7. Detail note on types of data visualisation:


The art of presenting your data and information as graphs, charts, or maps is known as data visualization. Data visualization is the graphical representation of information and data; by using visual elements like charts, graphs, and maps, it makes trends, outliers, and patterns in data easier to see and understand.

a. Line Charts: A line plot is created by connecting a series of data points with straight lines, usually with time periods or another ordered variable on the x-axis.

b. Tables: A table consists of rows and columns used to compare variables. Tables can show a great deal of information in a structured way.

c. Pie Charts: The "pie chart" is also known as a "circle chart"; it divides a circular statistical graphic into sectors or sections to illustrate numerical proportions.

d. Scatter Plots and Bar Charts: Scatter plots show the relationship between two variables as individual points; bar charts compare values across categories.

e. Heatmaps: A heatmap is a graphical representation of data that uses a system of color coding to represent different values. In a correlation heatmap, the color intensity indicates how strong the correlation is; whether light or dark stands for high correlation depends on the chosen color scale.

8. Explain any 2 plotting libraries for data visualisation in python.


Matplotlib and Seaborn are python libraries that are used for data visualization. They
have inbuilt modules for plotting different graphs.
1. Matplotlib:- is like a magic tool for creating all sorts of visual representations,
such as graphs and charts, from your data. It's a library in Python that helps
you turn your boring numbers and information into colorful and easy-to-
understand pictures. Whether you want to draw line charts, bar graphs,
scatter plots, or any other kind of visualization, Matplotlib can make it happen
with just a few lines of code. It's like turning your data into beautiful pictures
that tell a story.

2. Seaborn:- is another data visualization library in Python, built on top of Matplotlib, that acts like a stylish assistant to it. It helps you create even more beautiful and sophisticated
charts and graphs with less effort. Think of it as a decorator for your data
visuals. Seaborn can make your plots look more polished and appealing by
adding color schemes, better default settings, and convenient functions for
specific types of charts. It's like having a designer help you make your data
look more elegant and professional.
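
A short, illustrative sketch showing both libraries in use: a basic Matplotlib line chart and a Seaborn scatter plot on one of Seaborn's built-in example datasets (the sales numbers are invented):

import matplotlib.pyplot as plt
import seaborn as sns

# Matplotlib: a simple line chart
months = [1, 2, 3, 4, 5, 6]
sales = [10, 12, 9, 15, 18, 21]
plt.plot(months, sales, marker="o")
plt.xlabel("month")
plt.ylabel("sales")
plt.title("Matplotlib line chart")
plt.show()

# Seaborn: a scatter plot with its nicer default styling
tips = sns.load_dataset("tips")
sns.scatterplot(data=tips, x="total_bill", y="tip")
plt.title("Seaborn scatter plot")
plt.show()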
9. Explain various encoding techniques.
In data science, encoding is the process of converting categorical data, such as labels, categories, or other characters and symbols, into a specified (usually numerical) format so that machine learning algorithms can work with it.

1. One-Hot Encoding (or Dummy Encoding): One-hot encoding is like giving each category its own binary code. Suppose you have a "Color" feature with
categories like "Red," "Blue," and "Green." One-hot encoding turns this into
separate columns for each color, filled with 0s and 1s. For example, "Red"
becomes [1, 0, 0], "Blue" becomes [0, 1, 0], and "Green" becomes [0, 0, 1]. It's
a way to represent categorical data as binary values, making it suitable for
machine learning algorithms.
2. Label Encoding (or Ordinal Encoding): Label encoding is like assigning a
unique number to each category. For instance, if you have a "Size" feature with
categories "Small," "Medium," and "Large," label encoding could transform
them into 1, 2, and 3, respectively. It's used when there's an inherent order or
ranking among categories. However, be careful when using it for non-ordinal
data, as it might mislead some algorithms into thinking there's an order when
there isn't.
3. Target Encoding: Target encoding is like using the outcome (the "target") to
encode categories. Let's say you have a "City" feature and you want to predict
the price of houses. Instead of one-hot encoding or label encoding "City,"
target encoding calculates the average price for each city and assigns that as
the encoding. So, if houses in "City A" have an average price of $300,000, all
instances of "City A" would be replaced with $300,000. It can help capture the
relationship between the category and the target variable, but it should be
used cautiously to avoid data leakage and overfitting.
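
A hedged sketch of the first two techniques using pandas and scikit-learn on invented data (note that LabelEncoder assigns codes alphabetically, so a manual mapping is often preferable for truly ordinal features):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"],
                   "Size": ["Small", "Large", "Medium", "Small"]})

# One-hot encoding: one binary column per colour
print(pd.get_dummies(df["Color"], prefix="Color"))

# Label encoding: each size category becomes an integer (assigned alphabetically here)
df["Size_encoded"] = LabelEncoder().fit_transform(df["Size"])
print(df[["Size", "Size_encoded"]])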
10.Write a note on application of data science.
Data science is like a superpower for solving real-world problems with the
help of data. It has applications in many areas of our lives, making things
better and more efficient. Here are some simple examples:

1. Healthcare: Data science helps doctors predict diseases early, recommends personalized treatments, and manages hospital resources
more effectively.

2. Finance: Banks use data science to detect fraud, predict market trends,
and offer personalized financial advice.

3. Retail: Data science suggests products you might like, optimizes supply
chains, and helps stores decide where to open new branches.

4. Entertainment: Streaming platforms use data to recommend movies and songs you'd enjoy, and studios use it to predict which movies will be a hit.

5. Transportation: Ride-sharing services use data science for route optimization, and cities use it to manage traffic and public transportation.
6. Agriculture: Farmers use data to improve crop yields and reduce waste.
7. Weather Forecasting: Meteorologists use data science to make more
accurate weather predictions.

In simple terms, data science is about using data to make things work better
and smarter in almost every aspect of our lives, from healthcare and finance
to shopping and entertainment. It helps us make informed decisions and
improve the way we live and work.

11. Short note on statistics and data collection:


Statistics is like a detective tool for making sense of data. It helps us gather,
organize, and understand information about the world.

Data Collection:-is like collecting clues for a detective. It's the process of
gathering information, numbers, or facts from various sources. This can be
done through surveys, observations, experiments, or even just finding
existing data. Think of it as gathering pieces of a puzzle to eventually see the
bigger picture.

Once we have collected the data, we can use statistics to analyze and draw
conclusions from it. It's like the detective putting together all the clues to
solve a mystery. In the end, statistics helps us make sense of the data we
collect, allowing us to make informed decisions and understand the world
better.
12.What is random forest.
In data science, a Random Forest is a popular machine learning algorithm
used for both classification and regression tasks. It's an ensemble learning
method, which means it combines the predictions of multiple individual
models (decision trees) to make more accurate and robust predictions.

Here's how a Random Forest works in data science:

1. Ensemble of Decision Trees: A Random Forest consists of a collection of decision trees. Each decision tree is trained on a random subset of the data
and a random subset of the features. This randomness is what makes it
"random."

2. Voting: When you want to make a prediction, each decision tree in the
forest makes its prediction. For classification tasks, it might "vote" for a
particular class, and for regression tasks, it gives a numeric prediction.

3. Majority Vote (or Average): The final prediction or output is determined by taking a majority vote (in classification) or an average (in regression) of
the predictions made by all the individual decision trees.

Random Forests are known for their ability to handle complex data and
high-dimensional feature sets. They are robust, less prone to overfitting,
and are widely used in various data science applications, including image
classification, recommendation systems, and financial modeling, among
others.
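
A minimal, illustrative sketch of a Random Forest classifier with scikit-learn, using its built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 decision trees, each trained on a random bootstrap sample and random feature subsets
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

print("Test accuracy:", forest.score(X_test, y_test))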
13. Explain any five applications of data science.
Ans: Refer to the answer to Question 10.
14. Describe the data science process.
The data science lifecycle revolves around the use of machine learning and different analytical strategies to produce insights from data. The complete process includes a number of steps like data cleaning, preparation, modelling, model evaluation, etc.

1. Business Understanding:
- Think of data scientists as the "why" people, making sure decisions
are backed by data.
- Identify the problem you want to solve and set clear project
objectives.

2. Data Mining:
- Start gathering data to work with.
- Consider where the data lives, how to obtain it, and the most
efficient way to store and access it.
- Use tools like MySQL, Beautiful Soup, or Google Analytics for
Firebase.

3. Data Cleaning:
- This step takes time but is crucial.
- Fix inconsistencies, deal with missing data, and ensure your data is
ready for analysis.
- Tools like Pandas and Dplyr can help in this stage.

4. Data Exploration:
- Time to analyze! Explore patterns and biases in your clean data.
- Use tools like Pandas for basic analysis and create visualizations to
understand the story in your data.

5. Feature Engineering:
- Features are measurable properties in your data.
- Transform raw data into informative features.
- Use domain knowledge to select and create features.
- Check out tools like sklearn for feature selection and engineering.

6. Predictive Modeling:
- This is where machine learning comes in.
- Choose a model based on your problem and data.
- Evaluate the model's success using techniques like k-fold cross-
validation.
- Tools like Azure Cheat Sheet and SAS Cheat Sheet can help pick the
right algorithm.

7. Data Visualization:
- Communicate your insights visually.
- Use data visualization to bridge communication gaps between
different stakeholders.
- Combine skills from communication, psychology, statistics, and art to
present data effectively.

15. Classify any five data science toolkits.

Here are five data science toolkits in simple words:

1. Python: Python is like the Swiss Army knife of data science. It's a versatile
and powerful programming language with many libraries and tools (e.g.,
NumPy, pandas, scikit-learn) that make it perfect for tasks like data analysis,
machine learning, and data visualization.

2. R: R is like a specialized toolbox for statistics and data analysis. It's known
for its wide range of statistical packages and data visualization capabilities,
making it a favorite among statisticians and data analysts.

3. Jupyter Notebook: Jupyter Notebook is like a digital lab notebook for data
scientists. It allows you to write and run code in an interactive and
organized way, making it great for documenting and sharing your data
analysis.
4. Tableau: Tableau is like a magic wand for creating beautiful data
visualizations and dashboards. You can turn your data into interactive charts
and graphs without needing to write code.

5. Hadoop: Hadoop is like a massive data storage and processing system. It's
designed to handle and analyze large datasets, making it essential for big
data and distributed computing tasks in data science.

These toolkits are essential for data scientists, helping them collect, analyze,
and visualize data effectively.

16. List data collection tools and explain any five of them.


Data collection in data science can involve various tools and methods,
depending on the specific data sources and needs. Here's a list of some
common data collection tools and techniques:

1. Surveys and Questionnaires: Tools like Google Forms, SurveyMonkey, and Typeform are used to create and distribute surveys to collect structured responses from participants.

2. Web Scraping: Tools like Beautiful Soup and Scrapy (Python libraries) are used to extract data from websites and web pages.

3. Social Media APIs: Platforms like Twitter, Facebook, and Instagram provide APIs (Application Programming Interfaces) for developers to collect data from their platforms.

4. IoT Devices: Internet of Things (IoT) sensors and devices collect data
from the physical world, including temperature sensors, GPS trackers, and
smart home devices.

5. Mobile Apps and SDKs: Mobile app analytics tools and software
development kits (SDKs) collect data from app usage, user interactions, and
device information.

6. Logs and Server Data.
7. Databases.
8. Publicly Available Datasets.
9. Content Management Systems (CMS).
10. Retail and Point of Sale (POS) Systems.
11. Network Traffic Data.
12. Image and Video Capture Devices.
13. Audio Recording Devices.
14. Physical Sensors.
15. Cloud Services.

Data collection in data science often involves using a combination of these tools and methods to acquire the necessary data for analysis and modeling.
The choice of tools depends on the specific data sources and objectives of a
data science project.
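
As a hedged illustration of one of these tools, here is a minimal web-scraping sketch with requests and Beautiful Soup; the URL is just a placeholder, and real scraping should respect a site's terms of service and robots.txt:

import requests
from bs4 import BeautifulSoup

# Placeholder URL used only for illustration
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Collect the text and link target of every anchor tag on the page
for link in soup.find_all("a"):
    print(link.get_text(strip=True), "->", link.get("href"))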

17. Write a note on categories of APIs.

In data science, APIs (Application Programming Interfaces) are like communication channels that allow different software systems to talk to each other and share information. There are several categories of APIs that serve specific purposes:

1. Data Retrieval APIs: These APIs are like data fetchers. They help you get
data from various sources, like social media platforms, databases, or
websites. For example, a Twitter API can fetch tweets, and a weather API
can provide weather data for analysis.

2. Data Processing APIs: These APIs are like data transformers. They allow
you to perform operations on data, such as cleaning, filtering, or
transforming it into a different format. This is essential for preparing data
for analysis.

3. Machine Learning APIs: These APIs are like AI assistants. They provide
pre-built machine learning models that you can use for tasks like image
recognition, text analysis, or predictive modeling. For example, Google's
Cloud Vision API can identify objects in images.
4. Visualization APIs: These APIs are like artists. They help you create
beautiful charts, graphs, and visualizations to represent your data. You can
use libraries like D3.js or Plotly to display data in an understandable way.

5. Geospatial APIs: These APIs are like digital maps. They allow you to work
with location-based data, such as mapping addresses, finding distances, or
analyzing geographic patterns. Google Maps API is a popular example.

6. Natural Language Processing (NLP) APIs: These APIs are like language
interpreters. They help you understand and work with text data, including
tasks like sentiment analysis, language translation, and text summarization.
An example is the Natural Language API by Google.

7. Social Media APIs: These APIs connect to social platforms like Facebook,
Twitter, or Instagram. They enable you to interact with social media data,
such as posting updates, fetching user profiles, or analyzing trends.

Each category of API serves a specific role in the data science process,
making it easier to collect, process, analyze, and present data for various
applications and industries.
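
As a hedged illustration of a data retrieval API call, the sketch below uses the requests library; the endpoint, parameters, and response fields are hypothetical placeholders rather than a real service, and a real API would usually also require an API key:

import requests

response = requests.get(
    "https://api.example.com/v1/weather",      # hypothetical endpoint, not a real service
    params={"city": "Mumbai", "units": "metric"},
    timeout=10,
)
response.raise_for_status()
data = response.json()          # most data retrieval APIs return JSON
print(data)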

18.Write a note on data cleaning.


Data cleaning in data science is like tidying up a messy room before you can
work or play in it. When you collect data, it can be messy and full of errors,
just like a room can be cluttered. Data cleaning is the process of finding and
fixing these issues to make the data usable.

Here's what data cleaning involves:

1. Removing Duplicates: If you have the same data repeated, you should get
rid of the extras. It's like finding and removing identical toys from your
room.

2. Dealing with Missing Data: Sometimes, data can have gaps or missing
pieces. You need to figure out what should go in those gaps, just like finding
the missing pieces of a puzzle.

3. Correcting Errors: Data can have mistakes, like typos or wrong values. It's
like fixing broken toys or cleaning dirty ones to make them work properly.

4. Handling Outliers: Sometimes, there are data points that are very
different from the rest, like a giant toy in a collection of small ones. You
decide whether to keep or remove them.

5. Formatting and Standardizing: Data should follow a consistent format, just like organizing your toys in a specific order. You make sure everything is
in the right place.

Data cleaning ensures that the data you use for analysis and decision-
making is accurate, reliable, and ready for action, just like a tidy room sets
the stage for play and productivity.
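
A minimal pandas sketch of these cleaning steps on a small, invented DataFrame:

import pandas as pd

df = pd.DataFrame({
    "name": ["Asha", "Asha", "Ravi", "Meera"],
    "age": [25, 25, None, 32],
    "city": ["Mumbai ", "Mumbai ", "pune", "Delhi"],
})

df = df.drop_duplicates()                          # 1. remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())   # 2. deal with missing data
df["city"] = df["city"].str.strip().str.title()    # 3./5. correct errors and standardize formatting
print(df)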

19.Short note on various methods of data analysis.

Diagnostic Analysis, Predictive Analysis, Prescriptive Analysis, Text Analysis, and Statistical Analysis are the most commonly used types of data analytics. Data analysis in data science involves various methods to make sense of data. Here are some common methods in simple words:
1.Predictive Analysis: Think of it as forecasting the future. You build models
that use historical data to make predictions about what might happen next,
like predicting sales or customer behavior.

2.Diagnostic Analysis: This method is like finding out why something went
wrong. You investigate data to understand the causes of specific events or
problems, such as identifying the reasons for a drop in website traffic.

3.Prescriptive Analysis: It's like getting advice from your data. You use data
to recommend specific actions or strategies to achieve a desired outcome,
like suggesting changes to improve business performance.

4. Text Analysis (Natural Language Processing): This method is about understanding and extracting meaning from text data, like analyzing
customer reviews or sentiment in social media posts.

5. Spatial Analysis (Geospatial Analysis): Think of it as location-based analysis. You work with geographic data to answer questions related to
location, like optimizing delivery routes or mapping disease outbreaks.

20.Explain decision tree algorithm.


• Decision Tree is a Supervised learning technique that can be used for
both classification and Regression problems
• In a Decision tree, there are two nodes, which are the Decision
Node and Leaf Node. Decision nodes are used to make any decision
and have multiple branches, whereas Leaf nodes are the output of
those decisions and do not contain any further branches.

• It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the
root node, which expands on further branches and constructs a tree-
like structure.
• To build a tree, we use the CART algorithm, which stands for Classification And Regression Tree.

Root Node: Root node is from where the decision tree starts. It represents
the entire dataset, which further gets divided into two or more homogeneous
sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be
segregated further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node
into sub-nodes according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from
the tree.
Parent/Child node: The root node of the tree is called the parent node, and
other nodes are called the child nodes.
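
A minimal scikit-learn sketch (scikit-learn's decision trees use an optimised version of CART); the depth limit acts as a simple form of pruning:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # depth limit keeps the tree small
tree.fit(X, y)

# Print the learned tree: the first split is the root node, leaf nodes give the class outputs
print(export_text(tree))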

21. What are the different functions of pandas? Explain with examples.

Function: Pandas read_csv()
Description: retrieves data from a CSV file in the form of a DataFrame.
Example:
import pandas as pd
df = pd.read_csv("people.csv")
print(df.head())

Function: Pandas head()
Description: returns the top n (5 by default) rows of a DataFrame or Series.
Example: DataFrame.head(n=5), where n is an integer value, the number of rows to be returned.

Function: Pandas tail()
Description: returns the bottom n (5 by default) rows of a DataFrame or Series.
Example: DataFrame.tail(n=5), where n is an integer value, the number of rows to be returned.

Function: Pandas copy()
Description: copies a DataFrame in pandas.

Function: Pandas info()
Description: generates a summary of the DataFrame, including information about columns with their names, their datatypes, and missing values.

Attribute: Pandas size (an attribute, not a method)
Description: returns the number of elements: for a Series, the number of rows; for a DataFrame, the number of rows times the number of columns.
Example:
size = data.size
print("Size = {}".format(size))
# Output: Size = 1224
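
A small runnable sketch tying these pandas functions together (the CSV file name follows the example above and is assumed to exist):

import pandas as pd

df = pd.read_csv("people.csv")   # load a CSV into a DataFrame (file assumed to exist)

print(df.head(3))    # first 3 rows
print(df.tail(3))    # last 3 rows
df.info()            # column names, dtypes, missing-value counts
backup = df.copy()   # independent copy of the DataFrame
print("Size =", df.size)   # rows x columns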
22. What is the difference between a scatter plot and a bar chart? Explain with figures.

Scatter plot:- plots individual data points as dots on two numeric axes, so you can see the relationship (correlation, clusters, or outliers) between two variables.

Bar chart:- uses rectangular bars whose lengths are proportional to the values they represent, so you can compare quantities across discrete categories.
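
An illustrative Matplotlib sketch that produces both figures side by side (the data is invented):

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot: relationship between two numeric variables
hours = [1, 2, 3, 4, 5, 6]
scores = [40, 48, 55, 61, 70, 78]
ax1.scatter(hours, scores)
ax1.set_title("Scatter plot")
ax1.set_xlabel("hours studied")
ax1.set_ylabel("exam score")

# Bar chart: comparison across categories
fruits = ["apple", "banana", "mango"]
sales = [30, 45, 25]
ax2.bar(fruits, sales)
ax2.set_title("Bar chart")
ax2.set_ylabel("units sold")

plt.tight_layout()
plt.show()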
23. How do you do numerical analysis with NumPy? What is its role in array operations?
NumPy is a Python library used for working with arrays.
Numerical analysis with NumPy involves using the NumPy library in Python to
perform various numerical operations efficiently, especially when dealing with arrays
or large datasets

Array operations and their functions:
• Array Creation Functions: np.array(), np.zeros(), np.ones(), np.empty(), etc.
• Array Manipulation Functions: np.reshape(), np.transpose(), etc.
• Array Mathematical Functions: np.add(), np.subtract(), np.sqrt(), np.power(), etc.
• Array Statistical Functions: np.median(), np.mean(), np.std(), and np.var().
• Array Input and Output Functions: np.save(), np.load(), np.loadtxt(), etc.
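
A short, illustrative sketch of these NumPy operations on a small invented array:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # array creation
b = np.ones((2, 3))

print(np.add(a, b))            # element-wise mathematics
print(np.sqrt(a))
print(a.reshape(3, 2))         # array manipulation
print(np.mean(a), np.std(a))   # statistical functions

np.save("my_array.npy", a)     # input/output: save to disk...
print(np.load("my_array.npy")) # ...and load it back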

24. What is a heatmap? What is its importance in finding correlation?


A heatmap is like a color-coded table that helps you see relationships and
patterns in data, especially when you're looking for correlations between
different variables. In a heatmap, colors represent the strength of
relationships: darker colors (often red or blue) show strong relationships,
while lighter colors show weak or no relationships.

In data science, heatmaps are important for finding correlations because they make it easy to spot which variables (like features or attributes) are
closely related and which aren't. For example, in a dataset of students' test
scores, you could use a heatmap to quickly see if there's a strong correlation
between the amount of time students study and their test performance. If
the correlation is strong, you'd see a dark color, and if it's weak, you'd see a
light color.

Heatmaps provide a visual way to understand data and identify which variables are worth exploring further or including in predictive models. They
help data scientists and analysts quickly pinpoint potential cause-and-effect
relationships or dependencies within the data.
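
An illustrative sketch of a correlation heatmap with Seaborn, using invented study-time data along the lines of the example above:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "study_hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "sleep_hours": [8, 7, 7, 6, 6, 5, 5, 4],
    "test_score":  [42, 50, 55, 60, 66, 71, 75, 82],
})

corr = df.corr()                       # pairwise correlation matrix
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heatmap")
plt.show()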
25.Explain about the architecture of Apache Hadoop with all components.
Apache Hadoop is an open-source, Java-based software platform that manages data processing and storage for big data applications.
It is an open-source framework from Apache, used to store, process, and analyze data that is very large in volume.
Hadoop is written in Java and is not an OLAP (online analytical processing) system; it is used for batch/offline processing.
It is used by companies such as Facebook, Yahoo, Twitter, and LinkedIn, and it can be scaled up simply by adding nodes to the cluster.
The Hadoop architecture mainly consists of four components:
HDFS - Hadoop Distributed File System. HDFS is a Java-based system that allows large
data sets to be stored across nodes in a cluster in a fault-tolerant manner.

YARN - Yet Another Resource Negotiator. YARN performs two operations: job scheduling and resource management. The purpose of the job scheduler is to divide a big task into small jobs so that each job can be assigned to various slaves in a Hadoop cluster and processing can be maximized. The job scheduler also keeps track of which job is important, which job has more priority, dependencies between the jobs, and other information like job timing. The resource manager manages all the resources that are made available for running a Hadoop cluster.

MapReduce - MapReduce is a programming model that runs on top of the YARN framework. Its major feature is to perform distributed processing in parallel across a Hadoop cluster, which is what makes Hadoop work so fast.

MapReduce has two main tasks, divided phase-wise: in the first phase the Map step is applied, and in the next phase the Reduce step is applied.

Hadoop Common - Hadoop Common provides a set of shared libraries and utilities that support the other Hadoop modules. These utilities are used by HDFS, YARN, and MapReduce for running the cluster.
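
To make the Map and Reduce phases concrete, here is a tiny pure-Python word-count sketch that mimics the two phases; real Hadoop MapReduce jobs are normally written in Java or run via Hadoop Streaming, so this is only an illustration of the idea:

from collections import defaultdict

documents = ["big data is big", "hadoop processes big data"]

# Map phase: emit (word, 1) pairs from each document
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle/Reduce phase: group by key and sum the counts
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))   # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}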
26. What are the extract, transform and load (ETL) processes in a data warehouse? Explain with a stepwise figure.
• ETL stands for Extract, Transform, Load and it is a process used in data
warehousing to extract data from various sources, transform it into a format
suitable for loading into a data warehouse, and then load it into the
warehouse.
• The process of ETL can be broken down into the following three stages:

• Extraction:
The first step of the ETL process is extraction. In this step, data from various source systems (which can be in various formats like relational databases, NoSQL, XML, and flat files) is extracted into the staging area.
• Transformation:
The second step of the ETL process is transformation. In this step, a
set of rules or functions are applied on the extracted data to convert it
into a single standard format.
• Loading:
The third and final step of the ETL process is loading. In this step, the
transformed data is finally loaded into the data warehouse.
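
A hedged, minimal ETL sketch using pandas and SQLite; the file names, column names, and table are placeholders invented for the example:

import sqlite3
import pandas as pd

# Extract: read raw data from a source file (assumed to exist)
raw = pd.read_csv("sales_raw.csv")

# Transform: clean and standardize into a single consistent format
raw = raw.drop_duplicates()
raw["amount"] = raw["amount"].fillna(0)            # placeholder column names
raw["sale_date"] = pd.to_datetime(raw["sale_date"])

# Load: write the transformed data into the warehouse table
conn = sqlite3.connect("warehouse.db")
raw.to_sql("sales", conn, if_exists="append", index=False)
conn.close()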

27.Explain two roles of big data in real life with examples.


Big Data plays crucial roles in various real-life scenarios, driving
insights and innovations. Here are two examples illustrating its roles:

1. Retail Shopping Magic:

• Role of Big Data:


• Data Collection: Imagine a huge store collecting data on what you buy,
when you buy it, and even what you look at online.
• Analysis: Big Data looks at all this info to find trends, like which
products are super popular and when people like to shop.
• Predicting the Future: Using this data, the store predicts what you
might want to buy next, and when. It's like a shopping crystal ball!
• No More Empty Shelves: Big Data helps make sure the store has just
enough of the right stuff, so shelves are never too empty or too full.
• Example:
• You love buying comfy sweaters in the winter. The store, using Big
Data, knows this and makes sure there are plenty of cozy sweaters in
stock before the chilly season hits. It's like they read your mind!

2. Health Wizardry:

• Role of Big Data:


• Checking Your Health History: Imagine your health records, your genes,
and even how many steps you take—all of this becoming a part of a
giant health database.
• Guessing Future Health: Big Data uses this info to guess if you might get
sick in the future. It's like a health crystal ball!
• Personal Medicine Plan: Based on this, your doctor can create a unique
health plan just for you. It's not a one-size-fits-all; it's made just for you.
• Discovering New Medicines: Big Data helps scientists find new
medicines faster by looking at lots of data and finding patterns.
• Example:
• Let's say Big Data sees that people with a certain gene might be more
likely to have a health issue. Your doctor, using this info, can create a
health plan that suits you perfectly, like a health superhero!
