
Unit 1

18 March 2023 11:39 AM

➤ Basics and Need of Data Science and Big Data


Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and
systems to extract knowledge and insights from structured and unstructured data. It combines
skills and knowledge from mathematics, statistics, computer science, and domain expertise to
make sense of complex data sets. The field is essential for businesses and organizations to make
informed decisions based on data-driven insights.
On the other hand, big data refers to extremely large and complex data sets that cannot be
processed using traditional data processing tools. Big data is characterized by the volume,
velocity, and variety of data. The data can come from various sources, including social media,
sensors, online transactions, and other digital sources.
The need for data science and big data arises for the following reasons:
• Business Intelligence: Data science helps businesses extract meaningful insights from large
volumes of data. These insights can be used to optimize operations, improve customer
experience, and develop new products and services.
• Decision Making: Big data can be used to make more informed and accurate decisions. By
analyzing large data sets, decision-makers can gain a better understanding of the current state of
affairs, identify trends and patterns, and make predictions about the future.
• Competitive Advantage: Companies that can leverage big data and data science gain a
competitive advantage over their rivals. By analyzing data sets, they can identify new markets,
develop innovative products and services, and improve their overall performance.
• Improved Efficiency: Data science and big data can help organizations streamline their
operations and improve efficiency. By analyzing data sets, companies can identify areas where
they can reduce waste and improve their processes.
In conclusion, data science and big data are essential for businesses and organizations to remain
competitive and make informed decisions based on data-driven insights. The field is constantly
evolving, and new tools and techniques are being developed to help organizations harness the
power of big data.

➤ Applications of Data Science


Data science has a wide range of applications across various fields, including healthcare, finance,
marketing, sports, transportation, and more. Here are some examples of how data science is used
in different industries:
• Healthcare: Data science is used in healthcare to improve patient care, reduce costs, and optimize
hospital operations. Data scientists analyze electronic health records, medical images, and other
health-related data to identify patterns and make predictions about patient outcomes.
• Finance: Data science is used in finance to improve risk management, fraud detection, and
portfolio optimization. Data scientists use machine learning algorithms to analyze financial data
and make predictions about stock prices, identify investment opportunities, and assess risk.
• Marketing: Data science is used in marketing to better understand consumer behavior and
preferences. By analyzing customer data, businesses can develop personalized marketing
campaigns, optimize pricing strategies, and improve customer retention.
• Sports: Data science is used in sports to analyze player performance and improve team strategies.
Data scientists analyze data from wearable technology, video footage, and other sources to make
predictions about player injuries, game outcomes, and team performance.
• Transportation: Data science is used in transportation to optimize logistics and reduce costs. Data
scientists use machine learning algorithms to analyze traffic patterns, predict maintenance needs,
and optimize route planning.
In conclusion, data science has a wide range of applications across various industries and fields.
By leveraging data and machine learning algorithms, businesses and organizations can make
informed decisions, optimize their operations, and gain a competitive advantage.

➤ Data Explosion
Data explosion refers to the rapid growth of data generated by humans and machines in various
forms, such as text, images, videos, and more. This growth has been fueled by the increasing use
of digital technologies, the internet, and the proliferation of connected devices, sensors, and IoT
devices.
Here are some of the factors contributing to the data explosion:
• Social Media: The popularity of social media platforms has led to the creation of vast amounts of
user-generated content, such as photos, videos, and messages.
• Internet of Things (IoT): IoT devices, such as sensors and connected devices, generate massive
amounts of data in real time.
• Machine-generated data: The growth of automated processes and machine learning has led to an
increase in machine-generated data, such as log files, sensor data, and transactional data.
• Cloud Computing: The availability of cloud computing has made it easier for organizations to
store and analyze large amounts of data.
The data explosion presents both opportunities and challenges. On one hand, it has created new
opportunities for businesses and organizations to gain insights into customer behavior, improve
operational efficiency, and develop innovative products and services. On the other hand, it has
also created challenges in data management, storage, and security.
The increasing complexity of data and the need for real­time analysis require advanced
technologies and data science expertise to turn the data into actionable insights.
In conclusion, the data explosion is a significant trend that is transforming the way organizations
operate and make decisions. As the amount of data continues to grow, businesses and
organizations need to embrace data science and adopt new technologies to turn data into
insights and gain a competitive advantage.

➤ 5 V's of Big Data


The 5 V's of big data are volume, velocity, variety, veracity, and value. These five characteristics
describe the unique features of big data and are essential for understanding how to manage and
leverage large data sets.
• Volume: This refers to the vast amount of data generated from various sources, such as social
media, IoT devices, and other digital platforms. The amount of data generated is so large that
traditional data processing tools are not sufficient to handle it.
• Velocity: This refers to the speed at which data is generated and needs to be processed in real
time. Data is constantly streaming in from various sources, and businesses need to process it
quickly to make informed decisions.
• Variety: This refers to the different types and formats of data, such as structured, unstructured,
and semi-structured data. The variety of data sources creates challenges in data processing and
analysis.
• Veracity: This refers to the accuracy and quality of the data. As the amount of data increases, it
becomes more challenging to ensure that the data is accurate and free from errors.
• Value: This refers to the ability to derive meaningful insights and value from the data. Businesses
and organizations need to ensure that the insights generated from big data analysis are
actionable and add value to the organization.
In conclusion, the 5 V's of big data describe the unique characteristics of large and complex data
sets. Understanding these characteristics is essential for organizations to manage, process, and
derive insights from big data.
➤ Relationship between Data Science and Information Science
Data science and information science are related fields that share some similarities but have
different focuses.
Information science focuses on the study of how information is created, organized, stored,
retrieved, and disseminated. Information science is concerned with the principles and practices of
managing information in various formats, such as text, images, and multimedia.
Data science, on the other hand, is concerned with the process of extracting insights and
knowledge from large and complex data sets. Data science involves the use of statistical and
machine learning techniques to identify patterns, make predictions, and develop models to solve
complex problems.
While information science and data science have different focuses, they share some common
areas of interest, such as data management, data analysis, and data visualization. Both fields also
use similar tools and technologies, such as databases, data warehouses, and data mining
algorithms.
In conclusion, information science and data science are related fields that share some common
areas of interest and tools. While information science focuses on the principles and practices of
managing information, data science focuses on the process of extracting insights and knowledge
from large and complex data sets.

➤ Business Intelligence vs Data Science


Business Intelligence (BI) and Data Science are two different approaches to analyzing data to
support business decision-making.
Business Intelligence is concerned with extracting data from various sources, transforming it into
a structured format, and presenting it in a way that is easy to understand and actionable. BI uses
tools such as dashboards, reports, and scorecards to provide insights into an organization's
performance and help business leaders make informed decisions. BI focuses on historical data
analysis and visualization to monitor business operations and performance, identify trends, and
track progress towards goals.
Data Science, on the other hand, involves the use of advanced statistical and machine learning
techniques to extract insights and knowledge from large and complex data sets. Data Science
goes beyond descriptive analytics (what happened) and provides predictive analytics (what could
happen) and prescriptive analytics (what should be done). Data Science uses techniques such as
clustering, classification, and regression analysis to identify patterns, make predictions, and
develop models to solve complex problems. Data Science focuses on finding actionable insights
that can drive business decisions and improve business performance.
In summary, Business Intelligence and Data Science are two different approaches to analyzing
data for decision-making purposes. Business Intelligence focuses on descriptive analytics and
provides historical data analysis and visualization to monitor business operations and
performance, while Data Science goes beyond that by providing predictive and prescriptive
analytics to drive business decisions and improve business performance.

➤ Data Science Life Cycle


The Data Science Life Cycle is a process used by data scientists to approach and solve complex
data problems. The life cycle typically consists of the following stages:
• Problem Definition: This is the initial stage where the problem or question is defined. The data
scientist works with the stakeholders to understand the problem, identify the variables involved,
and define the project goals.
• Data Collection: In this stage, the data scientist identifies the sources of data that are needed to
solve the problem. This can include collecting data from various sources, such as databases, APIs,
web scraping, or even manual data entry.
• Data Preparation: Once the data is collected, the data scientist needs to clean and prepare it for
analysis. This involves tasks such as handling missing data, dealing with outliers, and
transforming the data into a suitable format.
• Data Exploration: This stage involves using statistical techniques and visualization tools to explore
the data and identify patterns or trends. The data scientist identifies the variables that are most
relevant to the problem and tries to understand the relationships between them.
• Data Modeling: In this stage, the data scientist develops a model or algorithm that can solve the
problem. This involves selecting an appropriate statistical or machine learning algorithm and
training it on the data.
• Model Evaluation: Once the model is developed, it needs to be evaluated to ensure that it is
accurate and effective. This involves testing the model on a separate set of data and validating its
performance.
• Deployment: Finally, the model is deployed into a production environment where it can be used
to solve the problem. This can involve integrating the model into existing software or developing
a new system that uses the model.
In conclusion, the Data Science Life Cycle is a systematic approach to solving complex data
problems that involves several stages, including problem definition, data collection, data
preparation, data exploration, data modeling, model evaluation, and deployment.
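
As a rough illustration of the later stages of the life cycle (data preparation, modeling, and
evaluation), the Python sketch below trains and scores a simple classifier with scikit-learn. The
Iris dataset and the logistic regression model are placeholder choices for the example, not part of
the life cycle itself.

# Minimal sketch of the modeling and evaluation stages, assuming scikit-learn is installed.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data collection: a built-in toy dataset stands in for the collected data.
X, y = load_iris(return_X_y=True)

# Model evaluation needs a held-out test set that the model never saw during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Data modeling: select and train an algorithm.
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Model evaluation: validate performance on the separate test set.
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))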

➤ Need of Data Wrangling


Data wrangling, also known as data munging or data pre-processing, is the process of cleaning and
transforming raw data into a format that can be easily analyzed. Here are some reasons why data
wrangling is necessary:
• Data Quality: Data quality is crucial for accurate analysis. Data wrangling helps identify and
remove errors, inconsistencies, and missing data that can lead to inaccurate analysis.
• Data Integration: In many cases, data is collected from multiple sources and in different formats.
Data wrangling helps to integrate this data into a unified format that can be used for analysis.
• Data Exploration: Data exploration involves identifying patterns, trends, and relationships within
the data. Data wrangling is necessary to clean and transform the data in a way that makes it
easier to explore.
• Data Modeling: Data modeling involves developing predictive models and algorithms to make
decisions based on the data. Data wrangling is necessary to prepare the data for modeling and to
select appropriate features for the model.
• Efficiency: Data wrangling can save time and resources by automating repetitive tasks such as
data cleaning, formatting, and integration. This frees up time for data analysts to focus on
higher-level tasks such as analysis and modeling.
In conclusion, data wrangling is essential for ensuring the accuracy, reliability, and quality of data
used for analysis. Data wrangling helps to transform raw data into a format that can be easily
analyzed and used to make informed decisions.

➤ Data Cleaning
Data cleaning, also known as data cleansing or data scrubbing, is the process of identifying and
correcting or removing errors, inconsistencies, and inaccuracies in data. Here are some common
steps involved in data cleaning:
• Identify missing data: Missing data can lead to inaccurate analysis, so it's important to identify
missing data and determine the best course of action, such as imputing missing values or
removing incomplete records.
• Remove duplicates: Duplicate data can skew analysis results, so it's important to identify and
remove duplicate records.
• Correct data errors: Data errors can include typos, incorrect values, and inconsistencies in data
formatting. These errors can be corrected manually or through automated methods.
• Address outliers: Outliers are data points that are significantly different from other data points in
the dataset. Outliers can be a result of data entry errors or genuine anomalies, so it's important
to identify and address them appropriately.
• Standardize data: Data standardization involves converting data into a consistent format. This can
include converting data to a consistent measurement unit or formatting data in a consistent way.
• Check for consistency: Inconsistencies in data can arise from different sources or data entry
errors. It's important to check for consistency in data values, formats, and units.
• Validate data: Data validation involves verifying the accuracy and consistency of data against
external sources or using statistical methods.
Data cleaning is an important step in the data analysis process, as it ensures that the data used
for analysis is accurate, reliable, and consistent. Clean data can lead to more accurate analysis
results and better-informed decision-making.
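
The Python sketch below illustrates a few of these steps with pandas (duplicates, missing values,
inconsistent formatting, and out-of-range values); the DataFrame and its column names are made
up for the example.

# Illustrative data-cleaning steps with pandas; the "age" and "city" columns are hypothetical.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":  [25, np.nan, 40, 40, 200],                 # a missing value and an out-of-range value
    "city": ["Pune", "pune ", "Mumbai", "Mumbai", "Delhi"],
})

df = df.drop_duplicates()                              # remove duplicate records
df["age"] = df["age"].fillna(df["age"].median())       # impute missing data with the median
df["city"] = df["city"].str.strip().str.title()        # standardize inconsistent formatting
df = df[df["age"].between(0, 120)]                     # drop values outside the expected range
print(df)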

➤ Data Integration
Data integration is the process of combining data from different sources into a unified format
that can be easily analyzed. Here are some common steps involved in data integration:
• Identify data sources: The first step in data integration is to identify the different data sources
that will be used. These data sources can include databases, spreadsheets, APIs, and other data
repositories.
• Determine data formats: Each data source may have its own unique data format, such as CSV,
Excel, JSON, or XML. It's important to determine the data format of each data source to ensure
that they can be integrated properly.
• Create a data model: A data model is a blueprint of how the data will be integrated. The data
model should identify the relationships between different data sources and define how the data
will be transformed and combined.
• Transform and clean data: Each data source may have its own unique data structure and
formatting, so it's important to transform and clean the data to ensure that it's consistent and
can be easily analyzed. This can include data cleaning, normalization, and standardization.
• Integrate data: Once the data has been transformed and cleaned, it can be integrated into a
unified format. This can include merging data from different sources, joining tables, and creating
a master data file.
• Validate and verify data: After data integration, it's important to validate and verify the data to
ensure that it's accurate and complete. This can include data quality checks, testing data against
expected outcomes, and reviewing data with subject matter experts.
Data integration is important because it allows organizations to access and analyze data from
multiple sources, providing a more complete and accurate picture of their operations. By
integrating data, organizations can gain insights into customer behavior, market trends, and
operational performance, leading to better-informed decision-making.
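
As a small sketch of the integration step itself, the pandas example below joins two hypothetical
sources (customers and orders) on a shared key; in practice these tables would come from
different databases or files.

# Combining two hypothetical data sources on a common key with pandas.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Pune", "Mumbai", "Delhi"],
})
orders = pd.DataFrame({
    "order_id": [101, 102, 103],
    "customer_id": [1, 1, 3],
    "amount": [250.0, 90.0, 480.0],
})

# A left join keeps every order and attaches the matching customer attributes.
unified = orders.merge(customers, on="customer_id", how="left")
print(unified)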

➤ Data Reduction
Data reduction is the process of reducing the amount of data in a dataset while preserving its
important features. Here are some common methods for data reduction:
• Sampling: Sampling involves selecting a subset of the data for analysis. Simple random sampling
selects a random subset of the data, while stratified sampling divides the data into subgroups
(strata) and samples from each, so that every subgroup is represented.
• Aggregation: Aggregation involves combining multiple data points into a single summary value.
For example, sales data can be aggregated by month or quarter to provide a summary of sales
activity over a specific period of time.
• Dimensionality reduction: Dimensionality reduction involves reducing the number of variables in
a dataset while preserving its important features. This can be achieved through methods such as
principal component analysis (PCA) or factor analysis.
• Feature selection: Feature selection involves selecting a subset of the most relevant features
from a dataset for analysis. This can be achieved through methods such as mutual information,
correlation analysis, or lasso regression.
• Clustering: Clustering involves grouping similar data points together to reduce the overall size of
the dataset. This can be achieved with methods such as k-means clustering or hierarchical clustering.
Data reduction is important because it can help to reduce the computational resources required
for analysis, making it easier to process and analyze large datasets. By reducing the amount of
data, it can also improve the efficiency and accuracy of analysis and modeling, as well as reduce
the risk of overfitting or bias in the analysis.
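
The sketch below shows two of these methods with pandas and scikit-learn: simple random
sampling of rows and PCA for dimensionality reduction. The randomly generated data is only a
stand-in for a real dataset.

# Sampling and dimensionality reduction on synthetic data.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 10)), columns=[f"x{i}" for i in range(10)])

# Simple random sampling: keep 10% of the rows.
sample = df.sample(frac=0.1, random_state=0)

# Dimensionality reduction: project 10 variables down to 3 principal components.
pca = PCA(n_components=3)
reduced = pca.fit_transform(df)

print(sample.shape, reduced.shape)    # (100, 10) (1000, 3)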

➤ Data Transformation
Data transformation is the process of converting data from one format or structure to another to
prepare it for analysis or to integrate it with other data sources. Here are some common methods
for data transformation:
• Data cleaning: Data cleaning involves identifying and correcting errors or inconsistencies in the
data. This can include removing duplicates, filling in missing values, and correcting data that is
outside of expected ranges.
• Data normalization: Data normalization involves converting data into a standard format to
eliminate redundancy and improve consistency. This can include converting data to a common
unit of measure, scaling data to a common range, and converting categorical data to numerical
values.
• Data aggregation: Data aggregation involves summarizing or combining data from multiple
sources into a single dataset. This can include aggregating data by time period, region, or other
factors to provide a more comprehensive view of the data.
• Data encoding: Data encoding involves converting data into a machine-readable format that can
be easily processed by algorithms or models. This can include encoding categorical data using
one-hot encoding or binary encoding.
• Data discretization: Data discretization involves converting continuous data into discrete intervals
or categories. This can make it easier to analyze the data and can be useful when working with
data that has a large number of values or when the values are not evenly distributed.
Data transformation is important because it helps to prepare data for analysis or modeling,
making it easier to interpret and draw insights from. By converting data into a standard format, it
can also make it easier to integrate data from multiple sources, reducing the risk of errors or
inconsistencies in the analysis.
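
The pandas and scikit-learn sketch below shows two of these transformations, scaling a numeric
column to a common range and one-hot encoding a categorical column; the column names are
hypothetical.

# Normalization and encoding on a small, made-up DataFrame.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "salary": [30000, 52000, 75000, 120000],
    "dept":   ["HR", "IT", "IT", "Sales"],
})

# Normalization: scale "salary" into the range [0, 1].
df["salary_scaled"] = MinMaxScaler().fit_transform(df[["salary"]]).ravel()

# Encoding: convert the categorical "dept" column into one-hot (0/1) columns.
df = pd.get_dummies(df, columns=["dept"])
print(df)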

➤ Data Discretization
Data discretization is the process of converting continuous data into discrete intervals or
categories. This can be useful when working with data that has a large number of values or when
the values are not evenly distributed. Here are some common methods for data discretization:
• Equal-width binning: Equal-width binning involves dividing the range of values into equal-width
intervals or bins. This can be useful when the data has a uniform distribution.
• Equal-frequency binning: Equal-frequency binning involves dividing the data into intervals or bins
such that each bin contains an equal number of data points. This can be useful when the data has
a non-uniform distribution.
• K-means clustering: K-means clustering involves grouping similar data points together based on
their distance from the center of the cluster. This can be used to identify natural groupings in the
data and to create discrete categories based on those groupings.
• Decision trees: Decision trees involve dividing the data into branches based on a series of
decisions or criteria. This can be used to create discrete categories based on the criteria used in
the decision tree.
Data discretization is important because it can help to simplify complex data and make it easier to
analyze. By converting continuous data into discrete intervals or categories, it can also make it
easier to interpret the data and to identify patterns or trends in the data.
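
A short pandas sketch of the first two methods, equal-width binning with cut and equal-frequency
binning with qcut; the ages are sample values chosen for the example.

# Equal-width vs. equal-frequency binning of a continuous variable.
import pandas as pd

ages = pd.Series([18, 22, 25, 27, 31, 35, 42, 58, 63, 80])

equal_width = pd.cut(ages, bins=4)     # 4 bins of equal width across the range of ages
equal_freq = pd.qcut(ages, q=4)        # 4 bins with (roughly) equal numbers of data points

print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))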
