Data Science
• The regression line is sometimes called the "line of best fit" because it is the line
that fits the points most closely when drawn through them. It is the line that minimizes
the sum of the squared vertical distances between the actual scores and the predicted scores.
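As a minimal sketch (with made-up example scores), the least-squares line can be fit with NumPy's polyfit:

```python
import numpy as np

# Made-up example scores (x = hours studied, y = exam score)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([52, 55, 61, 64, 70], dtype=float)

# Fit a degree-1 polynomial: slope and intercept of the least-squares line
slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept

# The line minimizes the sum of squared vertical distances (residuals)
residual_ss = np.sum((y - predicted) ** 2)
print(f"y = {slope:.2f}x + {intercept:.2f}, residual sum of squares = {residual_ss:.2f}")
```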
1. The Bell Shape: A normal distribution looks like a symmetrical, bell-shaped curve when
you plot it on a graph. It's called "normal" because it's so commonly seen in nature.
2. Mean (Average): In a normal distribution, the middle of the curve represents the average
or mean value. Most data points are clustered around this average.
3. Standard Deviation: The standard deviation is a measure of how spread out the data is. If
the standard deviation is small, data points are closely packed around the mean. If it's large,
data points are more spread out.
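A small sketch using randomly generated (assumed) data, showing the mean, the standard deviation, and the rule of thumb that about 68% of a normal sample falls within one standard deviation of the mean:

```python
import numpy as np

rng = np.random.default_rng(42)
# Simulate 10,000 values from a normal distribution with mean 100 and std 15
data = rng.normal(loc=100, scale=15, size=10_000)

mean = data.mean()
std = data.std()
# For a normal distribution, roughly 68% of values fall within one std of the mean
within_one_std = np.mean((data > mean - std) & (data < mean + std))

print(f"mean = {mean:.1f}, std = {std:.1f}, share within 1 std = {within_one_std:.2%}")
```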
3. Recall: Recall looks at how many of the actual positive items (e.g., people
who have a disease) you successfully identified. It's like asking, "Out of all
the 'yes' cases, how many did I catch?" High recall means you're good at
finding all the positive cases.
4. F1 Score: The F1 score combines precision and recall into a single number.
It's a way to balance these two aspects. It's like finding a compromise
between being right (precision) and finding as many positives as possible
(recall). A high F1 score means you're good at both identifying the right
cases and catching most of them.
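A minimal sketch with made-up labels, using scikit-learn's metrics (the y_true/y_pred values are just placeholders):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up example: 1 = has the disease, 0 = does not
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # model catches 3 of the 4 real cases

print("Precision:", precision_score(y_true, y_pred))  # of predicted 'yes', how many were right
print("Recall:   ", recall_score(y_true, y_pred))     # of actual 'yes', how many were caught
print("F1 score: ", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```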
6. Differentiate between data science, data engineering, machine learning and
deep learning.
1. Data Science: Data science is like being a detective for data. Data
scientists collect, clean, and analyze data to discover insights and make
predictions. They use math and statistics to find patterns in data, which can
help businesses and organizations make informed decisions.
c. Pie Charts: The "pie chart" is also known as a "circle chart"; it divides a
circular statistical graphic into sectors or slices to illustrate numerical
proportions.
d. Scatter Plots and Bar Charts: A scatter plot shows the relationship between two
numeric variables, while a bar chart compares the values of a quantity across categories.
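A small matplotlib sketch (with assumed example data) drawing the three chart types:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Pie chart: shares of a whole
axes[0].pie([40, 30, 20, 10], labels=["A", "B", "C", "D"], autopct="%1.0f%%")
axes[0].set_title("Pie chart")

# Scatter plot: relationship between two numeric variables
axes[1].scatter([1, 2, 3, 4, 5], [2, 4, 5, 4, 6])
axes[1].set_title("Scatter plot")

# Bar chart: comparing quantities across categories
axes[2].bar(["Mon", "Tue", "Wed"], [5, 7, 3])
axes[2].set_title("Bar chart")

plt.tight_layout()
plt.show()
```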
2. Finance: Banks use data science to detect fraud, predict market trends,
and offer personalized financial advice.
3. Retail: Data science suggests products you might like, optimizes supply
chains, and helps stores decide where to open new branches.
In simple terms, data science is about using data to make things work better
and smarter in almost every aspect of our lives, from healthcare and finance
to shopping and entertainment. It helps us make informed decisions and
improve the way we live and work.
Data Collection: Data collection is like collecting clues for a detective. It's the process of
gathering information, numbers, or facts from various sources. This can be
done through surveys, observations, experiments, or even just finding
existing data. Think of it as gathering pieces of a puzzle to eventually see the
bigger picture.
Once we have collected the data, we can use statistics to analyze and draw
conclusions from it. It's like the detective putting together all the clues to
solve a mystery. In the end, statistics helps us make sense of the data we
collect, allowing us to make informed decisions and understand the world
better.
12. What is a random forest?
In data science, a Random Forest is a popular machine learning algorithm
used for both classification and regression tasks. It's an ensemble learning
method, which means it combines the predictions of multiple individual
models (decision trees) to make more accurate and robust predictions.
2. Voting: When you want to make a prediction, each decision tree in the
forest makes its prediction. For classification tasks, it might "vote" for a
particular class, and for regression tasks, it gives a numeric prediction.
Random Forests are known for their ability to handle complex data and
high-dimensional feature sets. They are robust, less prone to overfitting,
and are widely used in various data science applications, including image
classification, recommendation systems, and financial modeling, among
others.
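A minimal classification sketch with scikit-learn's RandomForestClassifier (the dataset and parameter values are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 100 decision trees; for classification, each tree votes and the majority class wins
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("Test accuracy:", forest.score(X_test, y_test))
```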
13. Explain any five applications of data science.
Ans: Refer to the answer to Q.10.
14. Describe the data science process.
The data science lifecycle revolves around the use of machine learning and different
analytical strategies to produce insights and predictions from data.
The complete process includes a number of steps such as data cleaning, preparation,
modelling, and model evaluation.
1. Business Understanding:
- Think of data scientists as the "why" people, making sure decisions
are backed by data.
- Identify the problem you want to solve and set clear project
objectives.
2. Data Mining:
- Start gathering data to work with.
- Consider where the data lives, how to obtain it, and the most
efficient way to store and access it.
- Use tools like MySQL, Beautiful Soup, or Google Analytics for
Firebase.
3. Data Cleaning:
- This step takes time but is crucial.
- Fix inconsistencies, deal with missing data, and ensure your data is
ready for analysis.
- Tools like Pandas and Dplyr can help in this stage.
4. Data Exploration:
- Time to analyze! Explore patterns and biases in your clean data.
- Use tools like Pandas for basic analysis and create visualizations to
understand the story in your data.
5. Feature Engineering:
- Features are measurable properties in your data.
- Transform raw data into informative features.
- Use domain knowledge to select and create features.
- Check out tools like sklearn for feature selection and engineering.
6. Predictive Modeling:
- This is where machine learning comes in.
- Choose a model based on your problem and data.
- Evaluate the model's success using techniques like k-fold cross-validation
(a minimal end-to-end sketch follows this list).
- Tools like the Azure and SAS machine-learning cheat sheets can help pick the
right algorithm.
7. Data Visualization:
- Communicate your insights visually.
- Use data visualization to bridge communication gaps between
different stakeholders.
- Combine skills from communication, psychology, statistics, and art to
present data effectively.
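To tie the steps together, here is a minimal end-to-end sketch (the dataset, column names, and feature are assumptions, not from the notes): cleaning with pandas, a simple engineered feature, and k-fold cross-validation of a model.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Data mining / cleaning: a small, made-up dataset with a missing value
df = pd.DataFrame({
    "age":    [25, 32, 47, 51, None, 38, 29, 60],
    "income": [30, 42, 58, 60, 48,   52, 35, 70],
    "bought": [0,  0,  1,  1,  0,    1,  0,  1],
})
df["age"] = df["age"].fillna(df["age"].median())   # handle missing data

# Feature engineering: a simple derived feature
df["income_per_age"] = df["income"] / df["age"]

# Predictive modelling with k-fold cross-validation
X = df[["age", "income", "income_per_age"]]
y = df["bought"]
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=4)
print("Cross-validated accuracy:", scores.mean())
```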
1. Python: Python is like the Swiss Army knife of data science. It's a versatile
and powerful programming language with many libraries and tools (e.g.,
NumPy, pandas, scikit-learn) that make it perfect for tasks like data analysis,
machine learning, and data visualization.
2. R: R is like a specialized toolbox for statistics and data analysis. It's known
for its wide range of statistical packages and data visualization capabilities,
making it a favorite among statisticians and data analysts.
3. Jupyter Notebook: Jupyter Notebook is like a digital lab notebook for data
scientists. It allows you to write and run code in an interactive and
organized way, making it great for documenting and sharing your data
analysis.
4. Tableau: Tableau is like a magic wand for creating beautiful data
visualizations and dashboards. You can turn your data into interactive charts
and graphs without needing to write code.
5. Hadoop: Hadoop is like a massive data storage and processing system. It's
designed to handle and analyze large datasets, making it essential for big
data and distributed computing tasks in data science.
These toolkits are essential for data scientists, helping them collect, analyze,
and visualize data effectively.
2. **Web Scraping:** Tools like Beautiful Soup and Scrapy (Python libraries)
are used to extract data from websites and web pages (a minimal sketch appears after this list).
4. **IoT Devices:** Internet of Things (IoT) sensors and devices collect data
from the physical world, including temperature sensors, GPS trackers, and
smart home devices.
5. **Mobile Apps and SDKs:** Mobile app analytics tools and software
development kits (SDKs) collect data from app usage, user interactions, and
device information.
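As referenced above, a minimal web-scraping sketch with requests and Beautiful Soup (the URL is a generic placeholder, not a real data source):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL for illustration only
url = "https://example.com"
response = requests.get(url, timeout=10)

soup = BeautifulSoup(response.text, "html.parser")

# Extract the page title and all link targets
print("Title:", soup.title.string if soup.title else None)
for link in soup.find_all("a"):
    print("Link:", link.get("href"))
```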
1. Data Retrieval APIs: These APIs are like data fetchers. They help you get
data from various sources, like social media platforms, databases, or
websites. For example, a Twitter API can fetch tweets, and a weather API
can provide weather data for analysis.
2. Data Processing APIs: These APIs are like data transformers. They allow
you to perform operations on data, such as cleaning, filtering, or
transforming it into a different format. This is essential for preparing data
for analysis.
3. Machine Learning APIs: These APIs are like AI assistants. They provide
pre-built machine learning models that you can use for tasks like image
recognition, text analysis, or predictive modeling. For example, Google's
Cloud Vision API can identify objects in images.
4. Visualization APIs: These APIs are like artists. They help you create
beautiful charts, graphs, and visualizations to represent your data. You can
use libraries like D3.js or Plotly to display data in an understandable way.
5. Geospatial APIs: These APIs are like digital maps. They allow you to work
with location-based data, such as mapping addresses, finding distances, or
analyzing geographic patterns. Google Maps API is a popular example.
6. Natural Language Processing (NLP) APIs: These APIs are like language
interpreters. They help you understand and work with text data, including
tasks like sentiment analysis, language translation, and text summarization.
An example is the Natural Language API by Google.
7. Social Media APIs: These APIs connect to social platforms like Facebook,
Twitter, or Instagram. They enable you to interact with social media data,
such as posting updates, fetching user profiles, or analyzing trends.
Each category of API serves a specific role in the data science process,
making it easier to collect, process, analyze, and present data for various
applications and industries.
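A minimal sketch of calling a data-retrieval API with the requests library (the endpoint and parameters are hypothetical placeholders; a real service documents its own URL and query fields):

```python
import requests

# Hypothetical weather API endpoint, for illustration only
url = "https://api.example.com/v1/weather"
params = {"city": "Pune", "units": "metric"}

response = requests.get(url, params=params, timeout=10)
response.raise_for_status()   # fail loudly on HTTP errors

data = response.json()        # most data APIs return JSON
print(data)
```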
1. Removing Duplicates: If you have the same data repeated, you should get
rid of the extras. It's like finding and removing identical toys from your
room.
2. Dealing with Missing Data: Sometimes, data can have gaps or missing
pieces. You need to figure out what should go in those gaps, just like finding
the missing pieces of a puzzle.
3. Correcting Errors: Data can have mistakes, like typos or wrong values. It's
like fixing broken toys or cleaning dirty ones to make them work properly.
4. Handling Outliers: Sometimes, there are data points that are very
different from the rest, like a giant toy in a collection of small ones. You
decide whether to keep or remove them.
Data cleaning ensures that the data you use for analysis and decision-
making is accurate, reliable, and ready for action, just like a tidy room sets
the stage for play and productivity.
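A small pandas sketch (on made-up data) covering the four cleaning steps above:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Asha", "Asha", "Ravi", "Meera", "John"],
    "age":  [25, 25, None, 31, 400],            # missing value and an obvious outlier
    "city": ["pune", "pune", "Mumbai", "Delhi", "Pune"],
})

df = df.drop_duplicates()                          # 1. remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())   # 2. fill missing data
df["city"] = df["city"].str.title()                # 3. correct inconsistent values
df = df[df["age"] < 120]                           # 4. drop an impossible outlier

print(df)
```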
2. Diagnostic Analysis: This method is like finding out why something went
wrong. You investigate data to understand the causes of specific events or
problems, such as identifying the reasons for a drop in website traffic.
3. Prescriptive Analysis: It's like getting advice from your data. You use data
to recommend specific actions or strategies to achieve a desired outcome,
like suggesting changes to improve business performance.
Root Node: Root node is from where the decision tree starts. It represents
the entire dataset, which further gets divided into two or more homogeneous
sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be
segregated further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node
into sub-nodes according to the given conditions.
Branch/Sub Tree: A subtree formed by splitting a node of the tree.
Pruning: Pruning is the process of removing the unwanted branches from
the tree.
Parent/Child node: A node that is divided into sub-nodes is called the parent node,
and the sub-nodes are called its child nodes.
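A minimal scikit-learn sketch that trains a small decision tree and prints its structure, so the root node, splits, and leaf nodes are visible (the built-in iris dataset is used purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# The printed rules show the root node, the splits, and the leaf nodes
print(export_text(tree, feature_names=list(iris.feature_names)))
```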
Pandas read_csv() function: reads a CSV file into a DataFrame.
Example:
import pandas as pd
df = pd.read_csv("people.csv")
print(df.head())

Pandas copy() function: to copy a DataFrame in Pandas.

Pandas size attribute: gives the total number of elements in the DataFrame.
Example:
# printing size
print("Size = {}".format(size))
Output: Size = 1224
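A short runnable sketch combining the functions above on a small constructed DataFrame (so it does not depend on a people.csv file):

```python
import pandas as pd

# Constructed DataFrame instead of pd.read_csv("people.csv")
df = pd.DataFrame({"name": ["Asha", "Ravi", "Meera"], "age": [25, 30, 28]})

backup = df.copy()          # copy(): an independent copy of the DataFrame
print(df.head())            # head(): first rows of the DataFrame
print("Size =", df.size)    # size: total number of elements (rows x columns)
```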
22. What is the difference between a scatter plot and a bar chart? Explain with figures.
Scatter plot: plots individual points for pairs of numeric values, showing the relationship between two variables. [figure not reproduced]
Bar chart: uses rectangular bars whose heights compare a quantity across discrete categories. [figure not reproduced]
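Since the original figures are not reproduced here, a small matplotlib sketch (with assumed data) draws both chart types side by side:

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))

# Scatter plot: each point pairs two numeric values, revealing their relationship
hours = [1, 2, 3, 4, 5, 6]
marks = [35, 45, 50, 62, 70, 78]
ax1.scatter(hours, marks)
ax1.set(title="Scatter plot", xlabel="Hours studied", ylabel="Marks")

# Bar chart: bar heights compare a quantity across discrete categories
subjects = ["Maths", "Science", "English"]
averages = [72, 65, 80]
ax2.bar(subjects, averages)
ax2.set(title="Bar chart", xlabel="Subject", ylabel="Average marks")

plt.tight_layout()
plt.show()
```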
23. How do you do numerical analysis with NumPy? What is its role in array functioning?
NumPy is a Python library used for working with arrays.
Numerical analysis with NumPy involves using the NumPy library in Python to
perform various numerical operations efficiently, especially when dealing with arrays
or large datasets.
Array manipulation functions: np.reshape(), np.transpose(), etc.
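A minimal NumPy sketch of numerical operations and array manipulation (the values are arbitrary):

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

# Vectorised numerical analysis: operations apply to the whole array at once
print(a.mean(), a.sum(), a.std())

# Array manipulation
print(np.reshape(a, (3, 2)))   # change the shape without changing the data
print(np.transpose(a))         # swap rows and columns

# Element-wise arithmetic and linear algebra
print(a * 2)
print(a @ a.T)                 # matrix product (2x3 times 3x2 -> 2x2)
```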
YARN - Yet Another Resource Negotiator. YARN performs two operations: job
scheduling and resource management. The purpose of the job scheduler is to divide a
big task into small jobs so that each job can be assigned to different slaves in a
Hadoop cluster and processing can be maximized. The job scheduler also keeps track of
which jobs are important, which jobs have higher priority, dependencies between the
jobs, and other information such as job timing. The resource manager manages all the
resources made available for running the Hadoop cluster.
MapReduce - MapReduce is a programming model built on top of the YARN framework.
Its major feature is performing distributed processing in parallel across a Hadoop
cluster, which is what makes Hadoop work so fast.
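This is not actual Hadoop code, but a pure-Python sketch of the MapReduce idea: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group (word count is the classic example).

```python
from collections import defaultdict

documents = ["big data is big", "hadoop handles big data"]

# Map phase: emit (word, 1) for every word in every document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate each group (here, sum the counts)
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'handles': 1}
```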
Hadoop Common - Hadoop Common provides a set of shared libraries and utilities
that support the other Hadoop modules.
These utilities are used by HDFS, YARN, and MapReduce for running
the cluster.
26. What are the extract, transform, and load processes in a data warehouse?
Explain with a stepwise figure.
• ETL stands for Extract, Transform, Load and it is a process used in data
warehousing to extract data from various sources, transform it into a format
suitable for loading into a data warehouse, and then load it into the
warehouse.
• The process of ETL can be broken down into the following three stages:
• Extraction:
The first step of the ETL process is extraction. In this step, data is
extracted from various source systems, which can be in formats such as
relational databases, NoSQL, XML, and flat files, into the staging
area.
• Transformation:
The second step of the ETL process is transformation. In this step, a
set of rules or functions is applied to the extracted data to convert it
into a single standard format.
• Loading:
The third and final step of the ETL process is loading. In this step, the
transformed data is finally loaded into the data warehouse.
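A minimal ETL sketch with pandas and SQLite (the table contents, column names, and transformation rules are assumptions for illustration):

```python
import sqlite3
import pandas as pd

# Extract: a small in-memory table stands in for source systems such as CSV files or databases
source = pd.DataFrame({
    "customer": ["Asha", "Ravi", "Meera"],
    "amount":   ["1,200", "850", "2,300"],   # amounts arrive as text with commas
})

# Transform: convert every record into a single standard format
source["amount"] = source["amount"].str.replace(",", "").astype(float)
source["customer"] = source["customer"].str.upper()

# Load: write the transformed data into the warehouse table
conn = sqlite3.connect("warehouse.db")
source.to_sql("sales", conn, if_exists="replace", index=False)
print(pd.read_sql("SELECT * FROM sales", conn))
conn.close()
```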
2. Health Wizardry: