FDS - Unit-I - Notes
Big Data refers to large, diversified sets of data originating from multiple channels: social media
platforms, websites, electronic check-ins, sensors, product purchases, call logs, and more. Big
Data has three defining characteristics: volume, velocity, and variety.
Big Data allows companies to improve their products and create tailored
marketing by gaining a 360-degree view of their customers’ behavior and
motivations.
It enables businesses or service providers to monitor fraudulent activities in real time by
identifying unusual patterns and behavior with the help of Predictive Analytics.
It drives supply chain efficiencies by collecting and analyzing data to determine if
products are reaching their destination in the desired conditions to attract
customers’ interest.
Predictive analysis allows businesses to scan and analyze social media feeds to
understand the sentiment among customers.
Companies that collect large amounts of data have a better chance to explore untapped
areas and to conduct deeper, richer analysis that benefits all stakeholders.
The faster and better a business understands its customers, the greater the benefits it
reaps. Big Data is used to train Machine Learning models to identify patterns and
make informed decisions with minimal or no human intervention.
As an example of Big Data tooling, the Hevo data pipeline platform advertises the following
features:
Completely Automated: The Hevo platform can be set up in just a few minutes
and requires minimal maintenance.
Transformations: Hevo provides preload transformations through Python code.
It also allows you to run transformation code for each event in the pipelines you
set up. You need to edit the properties of the event object received in the
transform method as a parameter to carry out the transformation. Hevo also
offers drag and drop transformations like Date and Control Functions, JSON, and
Event Manipulation to name a few. These can be configured and tested before
putting them to use.
Connectors: Hevo supports 100+ integrations to SaaS platforms, files,
databases, analytics, and BI tools. It supports various destinations including
Google BigQuery, Amazon Redshift, Snowflake Data Warehouses; Amazon S3
Data Lakes; and MySQL, MongoDB, TokuDB, DynamoDB, PostgreSQL
databases to name a few.
Real-Time Data Transfer: Hevo provides real-time data migration, so you always
have analysis-ready data.
100% Complete & Accurate Data Transfer: Hevo’s robust infrastructure
ensures reliable data transfer with zero data loss.
Scalable Infrastructure: Hevo has in-built integrations for 100+ sources like
Google Analytics, that can help you scale your data infrastructure as required.
24/7 Live Support: The Hevo team is available round the clock to extend
exceptional support to you through chat, email, and support calls.
Schema Management: Hevo takes away the tedious task of schema
management and automatically detects the schema of incoming data, mapping it
to the destination schema.
Live Monitoring: Hevo allows you to monitor the data flow so you can check
where your data is at a particular point in time.
Data Science Introduction
Data Science is a combination of multiple disciplines that uses statistics, data analysis, and
machine learning to analyze data and to extract knowledge and insights from it.
In today’s world, a large amount of data is generated daily, and the main challenge is to deal
with this data and extract insights from it to help organizations and businesses. This is
where Data Science comes in: it combines data and finds patterns in it with the help of
skills from computer science, mathematics, statistics, information visualization, graphics,
and business.
Data Science is about finding patterns in data through analysis, and making future predictions.
Data Science is used in many industries in the world today, e.g. banking, consultancy,
healthcare, and manufacturing.
Data Science can be applied in nearly every part of a business where data is available. Examples
are:
Consumer goods
Stock markets
Industry
Politics
Logistic companies
E-commerce
How Does a Data Scientist Work?
A Data Scientist requires expertise in several backgrounds:
Machine Learning
Statistics
Programming (Python or R)
Mathematics
Databases
A Data Scientist must find patterns within the data. Before the patterns can be found, the data
must be organized in a standard format.
One purpose of Data Science is to structure data, making it interpretable and easy to work with.
Data can be categorized into two groups:
Structured data
Unstructured data
Unstructured Data
Unstructured data is not organized. We must organize the data for analysis purposes.
Structured Data
Structured data is organized and easier to work with.
Example of an array:
[80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
The following example shows how to create an array in Python:
Example
array = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
print(array)
It is common to work with very large data sets in Data Science.
In this tutorial we will try to make it as easy as possible to understand the concepts of Data
Science. We will therefore work with a small data set that is easy to interpret.
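To illustrate why structured data is easy to work with, the small numeric data set shown above can be summarized directly with plain Python (a minimal sketch; no libraries are assumed):

```python
# The structured data set from the example above
data = [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]

# Because the data is structured (a uniform list of numbers),
# standard summary statistics can be computed directly.
count = len(data)
mean = sum(data) / count
smallest, largest = min(data), max(data)

print(f"count={count}, mean={mean}, min={smallest}, max={largest}")
# count=10, mean=102.5, min=80, max=125
```

The same operations on unstructured data (free text, images) would first require an organizing step, which is exactly the point made above.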
The following are some primary motives for the use of Data science technology:
1. It helps to convert large quantities of raw and unstructured data into
meaningful insights.
2. It assists in making predictions, for example from surveys, elections, etc.
3. It helps in automating transportation, such as developing self-driving cars,
which can be called the future of transportation.
4. Companies are shifting towards Data Science and adopting this technology.
Amazon, Netflix, and others, which handle huge volumes of data, use data
science algorithms to deliver a better customer experience.
The Lifecycle of Data Science
1. Business Understanding: The complete cycle revolves around the business goal:
what will you solve if you do not have a specific problem? It is extremely
important to understand the business objective clearly, because that will be
the ultimate aim of the analysis. Only with a good understanding can we set a
precise goal for the analysis that is in sync with the business objective. You
need to understand whether the customer wants to minimize losses, predict the
price of a commodity, etc.
2. Data Understanding: After business understanding, the next step is data
understanding. This involves collecting all the available data. Here you need
to work closely with the business team, as they know what data is present,
what data could be used for this business problem, and other relevant details.
This step includes describing the data, its structure, its relevance, and its
data types. Explore the data using graphical plots; basically, extract
whatever information you can about the data simply by exploring it.
3. Preparation of Data: Next comes the data preparation stage. This consists of
steps like selecting the relevant data, integrating the data by merging data
sets, cleaning it, treating missing values by either removing or imputing
them, removing inaccurate data, and checking for outliers using box plots and
handling them. Construct new data and derive new features from existing ones.
Format the data into the preferred structure and remove unwanted columns and
features. Data preparation is the most time-consuming, yet arguably the most
important, step in the complete life cycle. Your model will only be as
accurate as your data.
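The missing-value treatment described in the data preparation step can be sketched in plain Python. The column of ages below is made up for illustration; real projects would typically use a library such as pandas:

```python
import statistics

# Hypothetical column with missing values recorded as None
ages = [25, 30, None, 22, None, 28, 95]

# Impute missing values with the mean of the observed values
# (removal is the other common option mentioned above)
observed = [a for a in ages if a is not None]
mean_age = statistics.mean(observed)  # mean of 25, 30, 22, 28, 95 -> 40
cleaned = [a if a is not None else mean_age for a in ages]

print(cleaned)  # [25, 30, 40, 22, 40, 28, 95]
```

Note that the value 95 stands out from the rest; the outlier check with box plots mentioned above would flag such values for separate handling.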
4. Exploratory Data Analysis: This step involves getting some idea about the
solution and the factors affecting it before building the actual model. The
distribution of data within the different variables is explored graphically
using bar graphs, and relations between different features are captured
through graphical representations like scatter plots and heat maps. Many data
visualization techniques are used extensively to explore each feature
individually and in combination with other features.
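The relation that a scatter plot shows visually can also be quantified. As a sketch with made-up data, the Pearson correlation coefficient between two features can be computed by hand:

```python
import statistics

# Hypothetical paired observations of two features
x = [1, 2, 3, 4, 5]       # e.g. years of experience
y = [30, 35, 40, 45, 50]  # e.g. salary in thousands

# Pearson correlation: covariance divided by the product of
# the two standard deviations (computed from sums of squares)
mx, my = statistics.mean(x), statistics.mean(y)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
sx = sum((a - mx) ** 2 for a in x) ** 0.5
sy = sum((b - my) ** 2 for b in y) ** 0.5
r = cov / (sx * sy)

print(round(r, 3))  # perfectly linear data gives r = 1.0
```

A value of r near +1 or -1 indicates a strong linear relation; values near 0 indicate little linear relation, which is exactly what a scatter plot or heat map conveys graphically.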
5. Data Modeling: Data modeling is the heart of data analysis. A model takes the
prepared data as input and produces the desired output. This step involves
choosing the appropriate kind of model, depending on whether the problem is a
classification, regression, or clustering problem. After choosing the model
family, we need to carefully select and implement the algorithms within that
family. We need to tune the hyperparameters of each model to achieve the
desired performance. We also need to make sure there is the right balance
between performance and generalizability: we do not want the model to memorize
the data and then perform poorly on new data.
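As a minimal sketch of the modeling step, here is a regression model (simple linear regression via the closed-form least-squares solution) fitted to made-up data; real projects would use a library such as scikit-learn:

```python
# Hypothetical training data following the rule y = 2x + 1
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]

n = len(x)
mx = sum(x) / n
my = sum(y) / n

# Closed-form slope and intercept for simple linear regression
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) \
        / sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx

print(slope, intercept)        # 2.0 1.0
print(slope * 5 + intercept)   # prediction for x = 5 -> 11.0
```

The last line is the "model gives the preferred output" part: once fitted, the model predicts values for inputs it has not seen.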
6. Model Evaluation: Here the model is evaluated to check whether it is ready to
be deployed. The model is tested on unseen data and evaluated on a carefully
chosen set of evaluation metrics. We also need to make sure that the model
conforms to reality. If we do not achieve a satisfactory result in the
evaluation, we have to iterate over the entire modeling process until the
desired level of the metrics is achieved. Any data science solution, such as a
machine learning model, must, like a human, evolve: it must be able to improve
itself with new data and adapt to new evaluation metrics. We can build
multiple models for a given phenomenon, but many of them may be imperfect;
model evaluation helps us select and build the ideal model.
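The evaluation step can be sketched with one common metric, accuracy, computed on hypothetical held-out labels (real evaluations combine several metrics, as noted above):

```python
# Hypothetical true labels of unseen test data and the
# model's predictions for the same examples
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy: fraction of predictions that match the true labels
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

print(accuracy)  # 6 of 8 predictions match -> 0.75
```

If 0.75 falls short of the desired level of the metric, the modeling process is iterated, exactly as the step above describes.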
7. Model Deployment: After rigorous evaluation, the model is finally deployed in
the desired form and channel. This is the last step in the data science life
cycle. Each step described above must be worked through carefully: if any step
is performed improperly, it affects the next step, and the entire effort goes
to waste. For example, if data is not collected properly, you will lose
information and will not build an ideal model. If the data is not cleaned
properly, the model will not work. If the model is not evaluated properly, it
will fail in the real world. From business understanding to model deployment,
every step must be given appropriate attention, time, and effort.
What is Data?
Data is an extremely important factor when it comes to gaining insights about a specific
topic, study, research, or even people. This is why it is regarded as a vital component of
all the systems that make up our world today.
In fact, data offers a broad range of applications and uses in the modern age. So
whether or not you’re considering digital transformation, data collection is an aspect that
you should never brush off, especially if you want to get insights, make forecasts, and
manage your operations in a way that creates significant value.
However, many people are still confused when they encounter the idea of data
collection.
Let us understand:
While techniques and goals may vary per field, the general data collection methods
used in the process are essentially the same. In other words, there are specific
standards that need to be strictly followed and implemented to make sure that data is
collected accurately.
Not to mention, if the appropriate procedures are not given importance, a variety of
problems might arise and impact the study or research being conducted.
The most common risk is the inability to identify answers and draw correct conclusions
for the study, as well as failure to validate if the results are correct. These risks may also
result in questionable research, which can greatly affect your credibility.
So before you start collecting data, you have to rethink and review all of your research
goals. Start by creating a checklist of your objectives. Here are some important
questions to take into account:
Take note that bad data can never be useful. This is why you have to ensure that you
collect only high-quality data. But to help you gain more confidence when it comes to
collecting the data you need for your research, let’s go through each question presented
above.
Identifying exactly what you want to achieve in your research can significantly help you
collect the most relevant data you need. Besides, clear goals always provide clarity to
what you are trying to accomplish. With clear objectives, you can easily identify what
you need and determine what’s most useful to your research.
Data can be divided into two major categories: qualitative data and quantitative data.
Qualitative data is the classification given to a set of data that refers to immeasurable
attributes. Quantitative data, on the other hand, can be measured using numbers.
Based on the goal of your research, you can either collect qualitative data or
quantitative data; or a combination of both.
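The qualitative/quantitative distinction above can be sketched in code. The helper below is hypothetical (not from any library) and uses a simple rule: a column whose values are all numeric is treated as quantitative, otherwise as qualitative:

```python
# Hypothetical helper that labels a column of collected data as
# quantitative (numeric, measurable) or qualitative (categorical)
def classify(values):
    numeric = all(isinstance(v, (int, float)) for v in values)
    return "quantitative" if numeric else "qualitative"

print(classify([172.5, 160.0, 181.2]))   # heights -> quantitative
print(classify(["red", "blue", "red"]))  # colors  -> qualitative
```

Real data sets need more care (for example, numbers used as mere labels, like postal codes, are still qualitative), but the rule captures the basic distinction.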
There are specific types of data collection methods that can be used to acquire, store,
and process the data. If you’re not familiar with any of these methods, keep reading as
we will tackle each of them in the latter part of this article. But to give you a quick
overview, here are some of the most common data collection methods that you can
utilize:
Experiment
Survey
Observation
Ethnography
Secondary data collection
Archival research
Interview/focus group
Note: We will discuss these methods more in the Data Collection Methods + Examples
section of this article.
Regardless of the field, data collection offers heaps of benefits. To help you become
attuned to these advantages, we’ve listed some of the most notable ones below:
1. Collecting good data is extremely helpful when it comes to identifying and verifying
various problems, perceptions, theories, and other factors that can impact your
business.
2. It allows you to focus your time and attention on the most important aspects of your
business.
3. It helps you understand your customers better. Collecting data allows your company to
truly understand what your consumers expect from you, the unique products or services
they desire, and how they want to connect with your brand as a whole.
4. Collecting data allows you to study and analyze trends better.
5. Data collection enables you to make more effective decisions and come up with
solutions to common industry problems.
6. It allows you to resolve problems and improve your products or services based on data
collected.
7. Accurate data collection can help build trust, establish productive and professional
discussions, and win the support of important decision-makers and investors.
8. When engaging with key decision-makers, collecting, monitoring, and assessing data on
a regular basis may offer businesses reliable, relevant information.
9. Collecting relevant data can positively influence your marketing campaigns, which can
help you develop new strategies in the future.
10. Data collection enables you to satisfy customer expectations for personalized messages
and recommendations.
These are just a few of the many benefits of data collection in general. In fact, there are
still a lot of advantages when it comes to collecting consumer data that you can benefit
from.
Introduction – Importance of Data
“Data is the new oil.” Today, data is everywhere, in every field. Whether you are a data scientist,
marketer, businessman, data analyst, researcher, or in any other profession, you need to
work or experiment with raw or structured data. This data is so important to us that it becomes
important to handle and store it properly, without any error. While working on this data, it is
important to know the types of data to process them and get the right results. There are two
types of data: Qualitative and Quantitative data, which are further classified into:
Nominal data.
Ordinal data.
Discrete data.
Continuous data.
So there are 4 Types of Data: Nominal, Ordinal, Discrete, and Continuous.
Businesses now run on data, and most companies use data to gain insights, create and launch
campaigns, design strategies, launch products and services, or try out different things. According
to one report, at least 2.5 quintillion bytes of data are produced every day.
Types of Data
Qualitative or Categorical Data
Qualitative or Categorical Data is data that can’t be measured or counted in the form of numbers.
These types of data are sorted by category, not by number. That’s why it is also known as
Categorical Data. These data consist of audio, images, symbols, or text. The gender of a person,
i.e., male, female, or others, is qualitative data.
Qualitative data tells about the perception of people. This data helps market researchers
understand the customers’ tastes and then design their ideas and strategies accordingly.
Nominal Data
Nominal Data is used to label variables without any order or quantitative value. The color of hair
can be considered nominal data, as one color can’t be compared with another color.
The name “nominal” comes from the Latin name “nomen,” which means “name.” With the help
of nominal data, we can’t do any numerical tasks or can’t give any order to sort the data. These
data don’t have any meaningful order; their values are distributed into distinct categories.
Ordinal Data
Ordinal data is qualitative data whose values have some kind of relative position. These
kinds of data can be considered “in-between” qualitative and quantitative data. Ordinal data
only shows sequence and cannot be used for statistical analysis. Compared to nominal data,
ordinal data have a kind of order that nominal data lack.
Difference between Nominal and Ordinal Data
Nominal data can’t be quantified and has no intrinsic ordering; ordinal data gives some kind of
sequential order by position on a scale.
Nominal data is qualitative (categorical) data; ordinal data is said to be “in-between”
qualitative and quantitative data.
Nominal data cannot be used to compare items with one another; ordinal data can help to
compare one item with another by ranking or ordering.
Examples of nominal data: eye color, housing style, gender, hair color, religion, marital status,
ethnicity, etc. Examples of ordinal data: economic status, customer satisfaction, education
level, letter grades, etc.
Quantitative Data
Quantitative data can be expressed in numerical values, which makes it countable and suitable
for statistical data analysis. These kinds of data are also known as Numerical data. It answers
questions like “how much,” “how many,” and “how often.” For example, the price of a phone,
a computer’s RAM, and the height or weight of a person all fall under quantitative data.
Quantitative data can be used for statistical manipulation. These data can be represented on a
wide variety of graphs and charts, such as bar graphs, histograms, scatter plots, boxplots, pie
charts, line graphs, etc.
Examples of Quantitative Data:
Discrete Data
The term discrete means distinct or separate. Discrete data contain values that fall under
integers or whole numbers. The total number of students in a class is an example of discrete
data. These data can’t be broken into decimal or fractional values. Discrete data are countable
and have finite values; their subdivision is not possible. These data are represented mainly by
a bar graph, number line, or frequency table.
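The frequency-table representation mentioned above is easy to build for discrete data. The shoe sizes below are made up for illustration:

```python
from collections import Counter

# Discrete data: shoe sizes sold in a day (countable whole values)
sizes = [7, 8, 8, 9, 7, 10, 8, 9, 8]

# A frequency table maps each distinct value to its count
freq = Counter(sizes)

for size in sorted(freq):
    print(size, freq[size])
# 7 2
# 8 4
# 9 2
# 10 1
```

The same counts are what a bar graph of this data would display, one bar per distinct value.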
Continuous Data
The key difference between discrete and continuous data is that discrete data contain integer
or whole-number values, while continuous data store fractional numbers to record things such
as temperature, height, width, time, speed, etc.
Examples of continuous data:
Height of a person
Speed of a vehicle
“Time-taken” to finish the work
Wi-Fi frequency
Market share price
Difference between Discrete and Continuous Data
Discrete data are countable and finite (whole numbers or integers); continuous data are
measurable and come in the form of fractions or decimals.
Discrete data are represented mainly by bar graphs; continuous data are represented in the
form of a histogram.
Discrete values cannot be divided into smaller subdivisions; continuous values can be divided
into smaller subdivisions.
Discrete data have spaces between the values; continuous data are in the form of a continuous
sequence.
Examples of discrete data: total students in a class, number of days in a week, size of a shoe,
etc. Examples of continuous data: temperature of a room, the weight of a person, length of an
object, etc.
Conclusion
In this article, we have discussed the data types and their differences. Working on data is crucial
because we need to figure out what kind of data it is and how to use it to get valuable output out
of it. It is also important to know what kind of plot is suitable for which data category; it helps in
data analysis and visualization. Working with data requires good data science skills and a deep
understanding of different types of data and how to work with them.
Different types of data are used in research, analysis, statistical analysis, data visualization, and
data science. This data helps a company analyze its business, design its strategies, and help build
a successful data-driven decision-making process.