Computational Data Science - Unit 1

The document provides a comprehensive overview of data science, its importance, lifecycle, and the role of data scientists in various industries. It distinguishes between structured and unstructured data, outlines the prerequisites for data science, and highlights its applications across sectors such as healthcare, finance, and logistics. Additionally, it contrasts data science with business intelligence, emphasizing the advanced analytical capabilities of data science.

Uploaded by

brainx Magic

COMPUTATIONAL DATA SCIENCE (ITITE68)

UNIT – 1

What is Data Science, importance of data science, big data and data science, the current
scenario, industry perspective. Types of Data: Structured vs. Unstructured Data,
Quantitative vs. Categorical Data, Big Data vs. Little Data. Data science process, role
of a Data Scientist.

Data science is an essential part of many industries today, given the massive amounts of data
that are produced, and is one of the most debated topics in IT circles. Its popularity has grown
over the years, and companies have started implementing data science techniques to grow their
business and increase customer satisfaction.

What Is Data Science?

Data science is the domain of study that deals with vast volumes of data using modern tools
and techniques to find unseen patterns, derive meaningful information, and make business
decisions. Data science uses complex machine learning algorithms to build predictive models.

The data used for analysis can come from many different sources and be presented in various
formats.

The Data Science Lifecycle

Data science’s lifecycle consists of five distinct stages, each with its own tasks:

1. Capture: Data Acquisition, Data Entry, Signal Reception, Data Extraction. This
stage involves gathering raw structured and unstructured data.

2. Maintain: Data Warehousing, Data Cleansing, Data Staging, Data Processing, Data
Architecture. This stage covers taking the raw data and putting it in a form that can
be used.

3. Process: Data Mining, Clustering/Classification, Data Modeling, Data
Summarization. Data scientists take the prepared data and examine its patterns,
ranges, and biases to determine how useful it will be in predictive analysis.

4. Analyze: Exploratory/Confirmatory, Predictive Analysis, Regression, Text Mining,
Qualitative Analysis. Here is the real meat of the lifecycle. This stage involves
performing the various analyses on the data.

5. Communicate: Data Reporting, Data Visualization, Business Intelligence, Decision
Making. In this final step, analysts prepare the analyses in easily readable forms such
as charts, graphs, and reports.
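
The five stages above can be sketched as a tiny pipeline. This is only an illustrative sketch using the Python standard library: the records, field names, and function names are hypothetical, and each function stands in for what would be a much larger step in a real project.

```python
from collections import Counter
from statistics import mean

# Hypothetical raw records; in practice these would come from files, APIs, or sensors.
RAW_RECORDS = [
    {"customer": "a", "amount": "120.50"},
    {"customer": "b", "amount": ""},  # missing value
    {"customer": "a", "amount": "80.00"},
    {"customer": "c", "amount": "200.25"},
]


def capture():
    """Capture: gather the raw data (here, a hard-coded list)."""
    return list(RAW_RECORDS)


def maintain(records):
    """Maintain: cleanse the raw data and put it in a usable form."""
    cleaned = []
    for row in records:
        if row["amount"]:  # drop rows with a missing amount
            cleaned.append({"customer": row["customer"], "amount": float(row["amount"])})
    return cleaned


def process(records):
    """Process: summarise the prepared data (number of orders per customer)."""
    return Counter(row["customer"] for row in records)


def analyze(records):
    """Analyze: compute a simple statistic on the prepared data."""
    return mean(row["amount"] for row in records)


def communicate(order_counts, average_amount):
    """Communicate: present the results in an easily readable form."""
    lines = [f"Average order amount: {average_amount:.2f}"]
    for customer, count in sorted(order_counts.items()):
        lines.append(f"Customer {customer}: {count} order(s)")
    return "\n".join(lines)


clean = maintain(capture())
report = communicate(process(clean), analyze(clean))
print(report)
```

Keeping each stage as a separate function mirrors the lifecycle: the output of one stage is the input of the next.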

Prerequisites for Data Science

Here are some of the technical concepts you should know about before starting to learn
data science.

1. Machine Learning

Machine learning is the backbone of data science. Data Scientists need to have a solid grasp of
ML in addition to basic knowledge of statistics.

2. Modeling

Mathematical models enable you to make quick calculations and predictions based on what
you already know about the data. Modeling is also a part of Machine Learning and involves
identifying which algorithm is the most suitable to solve a given problem and how to train these
models.

3. Statistics

Statistics are at the core of data science. A sturdy handle on statistics can help you extract more
intelligence and obtain more meaningful results.

4. Programming

Some level of programming is required to execute a successful data science project. The most
common programming languages are Python and R. Python is especially popular because it’s
easy to learn, and it supports multiple libraries for data science and ML.

5. Databases

A capable data scientist needs to understand how databases work, how to manage them, and
how to extract data from them.
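
As a minimal sketch of extracting data from a database, the example below uses Python's built-in sqlite3 module with an in-memory database. The `sales` table, its columns, and the rows are hypothetical and exist only for illustration.

```python
import sqlite3

# An in-memory database keeps the example self-contained (nothing is written to disk).
conn = sqlite3.connect(":memory:")

# Hypothetical table of sales transactions.
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 100.0), ("south", 250.0), ("north", 50.0)],
)
conn.commit()

# Extract a summary: total sales per region, sorted by region name.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)

conn.close()
```

The `?` placeholders let the database driver handle quoting, which is safer than building SQL strings by hand.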

Who Oversees the Data Science Process?

Business Managers

The business managers are the people in charge of overseeing the data science process.
Their primary responsibility is to collaborate with the data science team to characterise the
problem and establish an analytical method. A business manager may oversee the marketing,
finance, or sales department, and report to an executive in charge of the department. Their goal
is to ensure projects are completed on time by collaborating closely with data scientists and IT
managers.

IT Managers

Following them are the IT managers. The longer an IT manager has been with the organisation,
the more important their responsibilities tend to be. They are
primarily responsible for developing the infrastructure and architecture to enable data science
activities. They constantly monitor data science teams and resource them accordingly to ensure
that they operate efficiently and safely. They may also be in charge of creating and maintaining
IT environments for data science teams.

Data Science Managers

The data science managers make up the final section of the team. They primarily trace and
supervise the working procedures of all data science team members. They also manage and
keep track of the day-to-day activities of the data science teams. They are team builders
who can blend project planning and monitoring with team growth.

What is a Data Scientist?

Data scientists are among the most recent analytical data professionals who have the technical
ability to handle complicated issues as well as the desire to investigate what questions need to
be answered. They're a mix of mathematicians, computer scientists, and trend forecasters.
They're also in high demand and well-paid because they work in both the business and IT
sectors.

On a daily basis, a data scientist may do the following tasks:

1. Discover patterns and trends in datasets to get insights.

2. Create forecasting algorithms and data models.

3. Improve the quality of data or product offerings by utilising machine learning
techniques.

4. Distribute suggestions to other teams and top management.

5. In data analysis, use data tools such as R, SAS, Python, or SQL.

6. Stay on top of innovations in the field of data science.

What Does a Data Scientist Do?

Now that you know what data science is, you may be wondering what exactly the job involves.
A data scientist analyzes business data to extract meaningful insights. In
other words, a data scientist solves business problems through a series of steps, including:

• Before tackling the data collection and analysis, the data scientist determines the
problem by asking the right questions and gaining understanding.

• The data scientist then determines the correct set of variables and data sets.

• The data scientist gathers structured and unstructured data from many disparate
sources (enterprise data, public data, etc.).

• Once the data is collected, the data scientist processes the raw data and converts it
into a format suitable for analysis. This involves cleaning and validating the data to
guarantee uniformity, completeness, and accuracy.

• After the data has been rendered into a usable form, it’s fed into the analytic
system—ML algorithm or a statistical model. This is where the data scientists
analyze and identify patterns and trends.

• Once the analysis is complete, the data scientist interprets the results to
find opportunities and solutions.

• The data scientists finish the task by preparing the results and insights to share with
the appropriate stakeholders and communicating the results.
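
The cleaning and validation step in the list above can be sketched as follows. This is a simplified illustration: the field names (`name`, `age`), the accepted age range, and the sample rows are assumptions made for the example, not rules from any particular project.

```python
def clean_records(records):
    """Split raw rows into cleaned rows and rejected rows.

    Uniformity:   names are trimmed and given consistent capitalisation.
    Completeness: rows with a missing name or age are rejected.
    Accuracy:     ages must be whole numbers in an assumed plausible range (1-119).
    """
    cleaned, rejected = [], []
    for row in records:
        name = str(row.get("name", "")).strip().title()
        age_text = str(row.get("age", "")).strip()

        if not name or not age_text:  # completeness check
            rejected.append(row)
            continue
        if not age_text.isdigit() or not 0 < int(age_text) < 120:  # accuracy check
            rejected.append(row)
            continue

        cleaned.append({"name": name, "age": int(age_text)})
    return cleaned, rejected


# Hypothetical raw input with typical problems: stray spaces, mixed case,
# a missing value, an impossible value, and a number stored as an integer.
raw = [
    {"name": "  alice ", "age": "34"},
    {"name": "BOB", "age": ""},
    {"name": "carol", "age": "-5"},
    {"name": "dave", "age": 41},
]

cleaned, rejected = clean_records(raw)
print(cleaned)
print(len(rejected), "row(s) rejected")
```

Keeping the rejected rows, rather than silently dropping them, makes it possible to report how much data was lost during cleaning.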

Use of Data Science

1. Data science may detect patterns in seemingly unstructured or unconnected data,
allowing conclusions and predictions to be made.

2. Tech businesses that acquire user data can utilise strategies to transform that data
into valuable or profitable information.

3. Data science has also made inroads into the transportation industry, such as with
driverless cars, which may help lower the number of accidents. For example, training
data such as highway speed limits and busy streets is supplied to the algorithm and
examined using data science approaches.

4. Data Science applications provide a better level of therapeutic customisation
through genetics and genomics research.

Difference Between Business Intelligence and Data Science

Business intelligence is a combination of the strategies and technologies used for the analysis
of business data/information. Like data science, it can provide historical, current, and predictive
views of business operations. However, there are some key differences.
| Business Intelligence | Data Science |
|---|---|
| Uses structured data | Uses both structured and unstructured data |
| Analytical in nature - provides a historical report of the data | Scientific in nature - performs an in-depth statistical analysis on the data |
| Use of basic statistics with emphasis on visualization (dashboards, reports) | Leverages more sophisticated statistical and predictive analysis and machine learning (ML) |
| Compares historical data to current data to identify trends | Combines historical and current data to predict future performance and outcomes |

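
The contrast between describing historical data and predicting future outcomes can be illustrated with a very small model. The sketch below fits a straight line by ordinary least squares using only plain Python. The monthly sales figures are hypothetical, and a real project would normally use a statistics or ML library rather than hand-written formulas.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both without the 1/n factor).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Use the fitted line to predict y for a new x."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical historical data: month number and units sold in that month.
months = [1.0, 2.0, 3.0, 4.0]
units = [3.0, 5.0, 7.0, 9.0]

# A reporting view summarises the past; a predictive view extrapolates it.
historical_total = sum(units)
model = fit_line(months, units)
forecast_month_5 = predict(model, 5.0)

print("Total so far:", historical_total)
print("Forecast for month 5:", forecast_month_5)
```

Here the historical total is a report on what already happened, while the fitted line is a model that produces an estimate for a month that has not happened yet.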
Applications of Data Science

Data science has found its applications in almost every industry.

1. Healthcare

Healthcare companies are using data science to build sophisticated medical instruments to
detect and cure diseases.

2. Gaming

Video and computer games are now being created with the help of data science and that has
taken the gaming experience to the next level.

3. Image Recognition

Identifying patterns in images and detecting objects in an image is one of the most popular data
science applications.

4. Recommendation Systems

Netflix and Amazon give movie and product recommendations based on what you like to
watch, purchase, or browse on their platforms.

5. Logistics

Data Science is used by logistics companies to optimize routes to ensure faster delivery of
products and increase operational efficiency.

6. Fraud Detection

Banking and financial institutions use data science and related algorithms to detect fraudulent
transactions.

7. Internet Search

When we think of search, we immediately think of Google. However, there are other
search engines, such as Yahoo, DuckDuckGo, Bing, AOL, Ask, and others, that employ data
science algorithms to offer the best results for a search query in a matter of seconds. Given
that Google handles more than 20 petabytes of data per day, it would not be the 'Google'
we know today if data science did not exist.

8. Speech recognition

Speech recognition is dominated by data science techniques, and we can see these algorithms
at work in our daily lives. Have you ever needed the help of a virtual speech
assistant like Google Assistant, Alexa, or Siri? Their voice recognition technology
operates behind the scenes, attempting to interpret and evaluate your words and deliver
useful results. Image recognition may also be seen on social media platforms
such as Facebook, Instagram, and Twitter. When you upload a picture of yourself with someone
on your list, these applications can recognise them and tag them.

9. Targeted Advertising

If you thought search was the most essential use of data science, consider the whole digital
marketing spectrum. From display banners on various websites to digital billboards at airports,
data science algorithms are used to decide almost everything. This is why digital
advertisements have a far higher CTR (Click-Through Rate) than traditional advertisements. They
can be customised based on a user's prior behaviour. That is why you may see adverts for Data
Science Training Programs while another person sees an advertisement for clothes in the same
region at the same time.

10. Airline Route Planning

Data science makes it easier for the airline industry to predict flight delays, which is
helping it grow. It also helps airlines decide whether to fly directly to the destination or to
make a stop in between; for example, a flight from Delhi to the United States of America can
either fly direct or stop over on the way.

11. Augmented Reality

Last but not least, the final data science application appears to be the most fascinating for
the future: augmented reality. There is also a fascinating relationship between data science
and virtual reality. A virtual reality headset
incorporates computing expertise, algorithms, and data to create the best viewing experience
possible. The popular game Pokemon GO is a minor step in that direction: it gives players the
ability to wander about and look at Pokemon on walls, streets, and other surfaces where they
do not really exist. The makers of
this game chose the locations of the Pokemon and gyms using data from Ingress, the previous
app from the same company.
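
To make one of the applications above concrete, here is a deliberately simple sketch of the idea behind fraud detection: flag transactions whose amount is unusually far from the average. Real systems use far richer features and models. The amounts and the threshold of 2 standard deviations are assumptions chosen for this example.

```python
from statistics import mean, pstdev


def flag_outliers(amounts, threshold=2.0):
    """Return the amounts whose z-score exceeds the threshold in absolute value."""
    average = mean(amounts)
    spread = pstdev(amounts)  # population standard deviation
    if spread == 0:
        return []  # all amounts identical, nothing stands out
    return [a for a in amounts if abs(a - average) / spread > threshold]


# Hypothetical card transactions: mostly small purchases and one very large one.
amounts = [20.0, 25.0, 22.0, 19.0, 24.0, 21.0, 23.0, 500.0]

suspicious = flag_outliers(amounts)
print(suspicious)
```

A z-score measures how many standard deviations a value lies from the mean, so the threshold controls how unusual a transaction must be before it is flagged.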

Example of Data Science

Here are some brief overviews of a couple of use cases, showing data science’s versatility.

• Law Enforcement: In this scenario, data science is used to help police in Belgium
better understand where and when to deploy personnel to prevent crime. With only
limited resources and a large area to cover, the police used data science dashboards and
reports to increase the officers’ situational awareness, allowing a police force that is spread
thin to maintain order and anticipate criminal activity.

• Pandemic Fighting: The state of Rhode Island wanted to reopen schools, but was
naturally cautious, considering the ongoing COVID-19 pandemic. The state used
data science to expedite case investigations and contact tracing, enabling a small
staff to handle an overwhelming number of concerned calls from citizens. This
information helped the state set up a call center and coordinate preventative
measures.

• Driverless Vehicles: Lunewave, a sensor manufacturing company, was looking for
a way to make sensor technology more cost-effective and accurate. They turned to
data science and machine learning to train their sensors to be safer and more reliable,
as well as using data to improve their 3D-printed sensor manufacturing process.

• Entertainment: Data science enables streaming services to follow and evaluate what
consumers view, which aids in the creation of new TV series and films. Data-driven
algorithms are also utilised to provide tailored suggestions based on the watching
history of a user.

• Finance: Banks and credit card firms mine and analyse data in order to detect
fraudulent activities, manage financial risks on loans and credit lines, and assess
client portfolios in order to uncover upselling possibilities.

• Manufacturing: Data science applications in manufacturing include supply chain
management and distribution optimization, as well as predictive maintenance to
anticipate probable equipment faults in facilities before they occur.

• Healthcare: Machine learning models and other data science components are used
by hospitals and other healthcare providers to automate X-ray analysis and assist
doctors in diagnosing illnesses and planning treatments based on previous patient
outcomes.

• Retail: Retailers evaluate client behaviour and purchasing trends in order to provide
individualised product suggestions as well as targeted advertising, marketing, and
promotions. Data science also assists them in managing product inventories and
supply chains in order to keep items in stock.
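
The tailored suggestions mentioned in the entertainment and retail examples can be sketched with a simple overlap-based recommender. This is only an illustration of the idea, not how any streaming service actually works. The users, titles, and scoring rule are hypothetical.

```python
from collections import Counter

# Hypothetical watch histories: user -> set of titles already watched.
HISTORY = {
    "ana": {"Drama A", "Thriller B", "Comedy C"},
    "ben": {"Drama A", "Thriller B", "Sci-Fi D"},
    "cleo": {"Comedy C", "Sci-Fi D", "Documentary E"},
}


def recommend(user, history, top_n=2):
    """Suggest unseen titles, weighted by how much each other user's taste overlaps."""
    seen = history[user]
    scores = Counter()
    for other, titles in history.items():
        if other == user:
            continue
        overlap = len(seen & titles)  # number of titles both users watched
        if overlap == 0:
            continue  # ignore users with nothing in common
        for title in titles - seen:  # only titles the user has not watched
            scores[title] += overlap
    # Sort by score (highest first), then alphabetically to break ties.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:top_n]]


print(recommend("ana", HISTORY))
```

Titles watched by users with similar histories receive higher scores, which is the basic intuition behind collaborative filtering.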

What is data and computational science?

Data Science is the art of generating insight, knowledge and predictions by processing of data
gathered about a system or a process. Computational Science is the art of developing validated
(simulation) models in order to gain a better understanding of a phenomenon (systems or
processes).

WHAT IS COMPUTATIONAL DATA SCIENCE?


Computational Data Science combines aspects of statistics, computer science, mathematics and
machine learning to identify trends, make predictions, and solve problems. Computational data
science uses algorithms and data structures to store, manipulate, visualize and learn from large
data sets.

WHAT DOES A COMPUTATIONAL DATA SCIENTIST DO?
Computational Data Scientists work in a wide variety of industries. Some common tasks
include:
• Collecting and categorizing large datasets
• Cleaning and validating data to ensure accuracy, completeness and uniformity
• Identifying patterns and trends in data sets
• Devising models and algorithms to uncover hidden meaning
• Forecasting future trends and results
• Training intelligent systems
• Producing summarizations and visualizations of datasets and communicating results to
stakeholders
• Discovering solutions and opportunities through an understanding of data sets

Difference between Structured data and Unstructured data

Structured Data

➢ Data that is to the point, factual, and highly organized is referred to as structured
data. It is quantitative in nature, i.e., it contains measurable values such as
numbers, dates, and times.
➢ It is easy to search and analyze structured data. Structured data exists in a predefined
format. A relational database consisting of tables with rows and columns is one of the
best examples of structured data.
➢ Structured data generally exists in tables such as Excel files and Google Sheets
spreadsheets. SQL (Structured Query Language) is used for managing structured data.
SQL was developed at IBM in the 1970s and is mainly used to handle relational
databases and data warehouses.
➢ Structured data is highly organized and easy for machines to process. Common
applications of relational databases with structured data include sales transactions,
airline reservation systems, inventory control, and others.

Unstructured Data

➢ Unstructured data includes log files, audio files, image files, and similar
unstructured files. Some organizations have a great deal of data available but do not
know how to derive value from it because the data is raw.
➢ Unstructured data is data that lacks any predefined model or format. It requires a lot
of storage space, and it is hard to keep secure. It cannot be presented in a data
model or schema.
➢ That is why managing, analyzing, or searching unstructured data is hard. It resides
in various formats such as text, images, audio and video files, etc. It is qualitative
in nature and is sometimes stored in a non-relational (NoSQL) database.
➢ It is not stored in relational databases, so it is hard for both computers and humans to
interpret. The limitations of unstructured data include the need for data science experts
and specialized tools to manipulate the data.
➢ The amount of unstructured data is much greater than the amount of structured or
semi-structured data.
➢ Examples of human-generated unstructured data are text files, email, social media,
mobile data, business applications, and others.
➢ Machine-generated unstructured data includes satellite images, scientific data,
sensor data, digital surveillance, and many more.
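
Log files are a common case where structure has to be extracted from raw text before analysis. The sketch below turns free-form log lines into structured rows with a regular expression. The log format, field names, and sample lines are assumptions made for this example.

```python
import re

# Hypothetical log lines; the last one does not follow the expected layout.
LOG_LINES = [
    "2024-01-15 10:32:07 ERROR payment-service Timeout contacting gateway",
    "2024-01-15 10:32:09 INFO auth-service User login succeeded",
    "malformed line without the expected fields",
]

# Assumed layout: date, time, level, service name, then a free-text message.
LOG_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<service>\S+) "
    r"(?P<message>.+)$"
)


def parse_logs(lines):
    """Convert matching log lines into dictionaries; skip lines that do not match."""
    rows = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            rows.append(match.groupdict())
    return rows


rows = parse_logs(LOG_LINES)
for row in rows:
    print(row["level"], row["service"], row["message"])
```

Once the lines are parsed into rows with named fields, they can be loaded into a table and queried like any other structured data.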

Structured data v/s Unstructured data

| On the basis of | Structured data | Unstructured data |
|---|---|---|
| Technology | It is based on a relational database. | It is based on character and binary data. |
| Flexibility | Structured data is less flexible and schema-dependent. | There is an absence of schema, so it is more flexible. |
| Scalability | It is hard to scale database schema. | It is more scalable. |
| Robustness | It is very robust. | It is less robust. |
| Performance | Here, we can perform a structured query that allows complex joining, so the performance is higher. | While textual queries are possible on unstructured data, the performance is lower than for semi-structured and structured data. |
| Nature | Structured data is quantitative, i.e., it consists of hard numbers or things that can be counted. | It is qualitative, as it cannot be processed and analyzed using conventional tools. |
| Format | It has a predefined format. | It has a variety of formats, i.e., it comes in a variety of shapes and sizes. |
| Analysis | It is easy to search. | Searching for unstructured data is more difficult. |

Categorical Data vs. Quantitative Data: What’s the Difference?

What is categorical data

Categorical data refers to values that are divided into groups, or categories, such as gender,
country of origin, or eye color. It is often expressed in non-numeric forms such as words, letters,
or symbols instead of numerical values. Categorical data can provide insight into the
characteristics of different groups, or populations. For example, a survey may reveal that males
are more likely to purchase a certain product than females.

When to use categorical data

As data science has grown in prominence, categorical data has become increasingly important.
Categorical data is often used when trying to ascertain correlation between different variables,
such as whether certain behaviors or characteristics are associated with particular outcomes. It
can also be used to help understand trends and patterns that may exist within a population. For
example, a study may use it to determine whether certain demographic factors, such as age,
income level, or even IT knowledge level (e.g. knowing how to fix corrupted
Windows files), predict certain behaviors. On top of that, categorical data can be used to
segment customers into different groups for targeted marketing campaigns.

Benefits of using categorical data

There are plenty of advantages to using categorical data. It is much easier to interpret and
analyze than quantitative data, which makes it an ideal choice for people without a strong
background in mathematics or statistics. Since it’s non-numeric, it allows researchers and
analysts to gain insight into the data without having to run complex and expensive quantitative
analyses.

What is quantitative data

Quantitative data, on the other hand, refers to numerical measurements - it does not involve
grouping values into categories. Quantitative data can be used to measure changes or trends
over time. For example, consumer behavior trends such as the number of people aged between
18-25 who use a smartphone app can be tracked to measure uptake and usage over different
periods of time.

When to use quantitative data

Analyzing quantitative data can allow us to see how variables are related. It can be used to
measure changes in behavior, attitudes, or preferences over time. Quantitative data is often
used in research studies and surveys that involve collecting numerical data from participants.
It is also useful for setting objectives and targets as well as tracking performance.

Benefits of using quantitative data

The main advantage of using quantitative data is that it allows researchers and analysts to make
predictions based on patterns and trends they observe in the data. This can help companies
make better decisions when it comes to product development, marketing strategies, and
customer service. And quantitative data provides a level of accuracy that is often not achievable
with categorical data due to its numerical nature.

Differences between categorical and quantitative data

When it comes to differences between categorical and quantitative data, the most significant is
in the way each type of data is analyzed. Categorical data can be analyzed by counting the
frequency of each group, while quantitative data requires mathematical operations such as
summation or averaging to determine meaningful correlations. Furthermore, categorical data
has limited application in statistical analysis and modeling since it provides less information
than quantitative data.

Categorical data is used to describe characteristics of a population based on non-numeric values
while quantitative data is used to measure numerical values over time or to compare different
groups. Categorical data can provide insights into how different populations interact with each
other, which can be used for targeted marketing, while quantitative data can be used for
predictive analysis and setting objectives.
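
The difference in how the two types are analyzed can be shown in a few lines of Python: categorical values are counted, while quantitative values are summed or averaged. The survey responses below are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey responses.
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue"]  # categorical
ages = [23, 35, 31, 28, 44, 19]                                    # quantitative

# Categorical data: count the frequency of each category.
color_counts = Counter(eye_colors)
most_common_color, most_common_count = color_counts.most_common(1)[0]

# Quantitative data: arithmetic operations such as sum and mean are meaningful.
total_age = sum(ages)
average_age = mean(ages)

print(dict(color_counts))
print(most_common_color, most_common_count)
print(total_age, average_age)
```

Averaging the eye colors would be meaningless, which is why the type of a variable determines which summaries are appropriate.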

Difference Between Small Data and Big Data

Small Data: It can be defined as small datasets that are capable of impacting decisions in
the present: anything that is currently ongoing and whose data can be accumulated in an
Excel file. Small data is also helpful in making decisions, but it is not intended to have a
large impact on the business; its effect is limited to a short span of time.
In a nutshell, data that is simple enough for human understanding, in a volume and
structure that makes it accessible, concise, and workable, is known as small data.

Big Data: It can be represented as large chunks of structured and unstructured data. The
amount of data stored is immense. It is therefore important for analysts to dig through it
thoroughly to make it relevant and useful for proper business decisions.
In short, datasets that are so huge and complex that conventional data processing
techniques cannot manage them are known as big data.

| Feature | Small Data | Big Data |
|---|---|---|
| Technology | Traditional | Modern |
| Collection | Generally, it is obtained in an organized manner and then inserted into the database | Collection is done using pipelines with queues such as AWS Kinesis or Google Pub/Sub to balance high-speed data |
| Volume | Data in the range of tens or hundreds of gigabytes | Size of data is more than terabytes |
| Analysis Areas | Data marts (Analysts) | Clusters (Data Scientists), Data marts (Analysts) |
| Quality | Contains less noise, as data is collected in a controlled manner | Usually, the quality of data is not guaranteed |
| Processing | It requires batch-oriented processing pipelines | It has both batch and stream processing pipelines |
| Database | SQL | NoSQL |
| Velocity | A regulated and constant flow of data; data aggregation is slow | Data arrives at extremely high speeds; large volumes of data are aggregated in a short time |
| Structure | Structured data in tabular format with fixed schema (relational) | A wide variety of data sets, including tabular data, text, audio, images, video, logs, JSON, etc. (non-relational) |
| Scalability | They are usually vertically scaled | They are mostly based on horizontally scaling architectures, which give more versatility at a lower cost |
| Query Language | Only SQL | Python, R, Java, SQL |
| Hardware | A single server is sufficient | Requires more than one server |
| Value | Business intelligence, analysis and reporting | Complex data mining techniques for pattern finding, recommendation, prediction, etc. |
| Optimization | Data can be optimized manually (human powered) | Requires machine learning techniques for data optimization |
| Storage | Storage within enterprises, local servers, etc. | Usually requires distributed storage systems on cloud or in external file systems |
| People | Data Analysts, Database Administrators and Data Engineers | Data Scientists, Data Analysts, Database Administrators and Data Engineers |
| Security | Security practices for small data include user privileges, data encryption, hashing, etc. | Securing big data systems is much more complicated. Best security practices include data encryption, cluster network isolation, strong access control protocols, etc. |
| Nomenclature | Database, Data Warehouse, Data Mart | Data Lake |
| Infrastructure | Predictable resource allocation, mostly vertically scalable hardware | More agile infrastructure with horizontally scalable hardware |
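
The Processing row above distinguishes batch pipelines from stream pipelines. The following sketch contrasts the two styles on the same calculation, an average. It is a toy illustration in plain Python with hypothetical sensor readings; real big data systems would use dedicated batch and streaming frameworks.

```python
from statistics import mean


def batch_mean(values):
    """Batch style: the whole dataset is available before processing starts."""
    return mean(values)


def running_mean(stream):
    """Stream style: update the result as each value arrives, without storing them all."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count  # current average after this value


# Hypothetical sensor readings.
readings = [10.0, 20.0, 30.0, 40.0]

batch_result = batch_mean(readings)
stream_results = list(running_mean(iter(readings)))

print(batch_result)
print(stream_results)
```

The streaming version keeps only a count and a running total in memory, so it can handle data that arrives continuously or is too large to hold at once. Its final value matches the batch result.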
