Unit II
Big data refers to extremely large and diverse collections of structured, unstructured,
and semi-structured data that continue to grow exponentially over time. These
datasets are so large and complex in volume, velocity, and variety that traditional
data management systems cannot store, process, or analyze them.
Data can be gathered from two places: internal and external sources. The
information collected from internal sources is called “primary data,” while the
information gathered from outside references is called “secondary data.”
The data sampling method uses both kinds of statistical data. Usually, a statistical
survey is carried out as a sample survey: sample data is collected and then analyzed
according to a statistical analysis plan. Surveys can also be conducted using the
questionnaire method.
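The sample survey idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population is made-up data, and the function name `sample_mean`, the sample size, and the fixed seed are hypothetical choices, not part of any prescribed method.

```python
import random
import statistics

def sample_mean(population, sample_size, seed=0):
    """Estimate the population mean from a simple random sample."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # Draw without replacement: no unit is surveyed twice.
    sample = rng.sample(population, sample_size)
    return statistics.mean(sample), sample

# Hypothetical population of 1,000 values (made-up data for illustration).
population = list(range(1, 1001))

estimate, sample = sample_mean(population, sample_size=100)
true_mean = statistics.mean(population)

print("Estimate from 100 sampled units:", estimate)
print("Mean of the whole population:   ", true_mean)
```

The estimate comes from only a tenth of the units, which is why a sample survey is usually far cheaper than measuring everyone.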
A census is conducted across a country for official purposes. Respondents are asked
questions, which they answer in person or over the phone. However, a census takes
a lot of time and effort because it covers the whole population rather than a
sample.
Internal data sources include accounting resources, sales force reports,
internal experts, and miscellaneous reports. Practical business intelligence relies on
the synergy between analytics and reporting, where analytics uncovers valuable
insights, and reporting communicates these findings to stakeholders.
Data from external sources is harder to gather because it is much more varied
and can come from many different sources. External data can be grouped as
follows:
Government publications
Researchers can get a massive amount of information from government sources.
Also, you can get much of this information for free on the Internet.
Non-government publications
Researchers can also find industry-related information in non-government
publications. The only research problem with non-government publications is that
their data may sometimes be biased.
Syndicate services
Some companies offer syndicate services: they collect and organize the same
marketing information for all their clients. They gather information from households
through surveys, mail diary panels, and electronic services, and from businesses
such as wholesalers, industrial firms, and retailers.
Lower mailing costs: Accurate customer data reduces the amount of undeliverable
mail. Less undeliverable mail saves you money in postage cost—no more resending
packages that didn’t make it to their destinations. You may even qualify for discounts
on postage rates from the U.S. Postal Service if you consistently use accurate and
correctly formatted addresses.
Improved customer relations: Reliable data lets you get to know your customers. This
keeps you from sending messages they don’t want and helps you spot their needs in
advance. By meeting—and exceeding—customer expectations, you create goodwill
and a strong relationship with your brand.
More consistent data: Organizations with several points of entry for their customers
often face the problem of inconsistent data across the organization. Inconsistent data
leads to duplicate records, missed opportunities for messaging, and departmental data
silos. Part of managing data quality is making data available across your organization
to open communication between teams and keep records consistent.
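As a rough illustration of how inconsistent entry points create duplicate records, the sketch below merges customer records whose email addresses differ only in case or surrounding spaces. The records, field names, and the helper names `normalize_email` and `deduplicate` are all hypothetical, and real matching rules are usually more involved than this.

```python
def normalize_email(email):
    """Put an email address into one canonical form for comparison."""
    return email.strip().lower()

def deduplicate(records):
    """Keep the first record seen for each normalized email address."""
    seen = {}
    for record in records:
        key = normalize_email(record["email"])
        seen.setdefault(key, record)  # later duplicates are ignored
    return list(seen.values())

# Made-up records entered through three different channels.
records = [
    {"name": "A. Kumar", "email": "a.kumar@example.com", "source": "web form"},
    {"name": "Anil Kumar", "email": " A.Kumar@Example.com ", "source": "call centre"},
    {"name": "B. Rao", "email": "b.rao@example.com", "source": "store"},
]

unique = deduplicate(records)
print(len(records), "records in,", len(unique), "records out")
```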
Smart devices
Social media
But raw data can be hard to comprehend and use. Hence, data scientists
prepare and present data in the right context. They give it a visual form so
that decision-makers can identify the relationships between data and detect
hidden patterns or trends. Data visualization creates stories that advance
business intelligence and support data-driven decision-making and strategic
planning.
Types of Data
Structured Data
Structured data is organized in a predefined format, such as rows and columns, and is
easily managed using traditional data management tools such as spreadsheets and
relational databases. Structured data is often quantitative, consisting of numbers,
percentages, and other numerical values, although it can also hold categorical or
text fields. Because of its organized nature, structured data is relatively easy to
analyze using statistical methods such as regression analysis or correlation analysis.
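To make the point about correlation analysis concrete, here is a minimal sketch that computes the Pearson correlation coefficient over two columns of a small table. The column names `ad_spend` and `units_sold` and the values are invented for illustration, and the helper `pearson` is written out by hand rather than taken from a library.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two columns of equal length, at least 2 rows")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return covariance / (spread_x * spread_y)

# A tiny structured table: every row has the same two numeric fields.
rows = [
    {"ad_spend": 10, "units_sold": 25},
    {"ad_spend": 20, "units_sold": 45},
    {"ad_spend": 30, "units_sold": 65},
    {"ad_spend": 40, "units_sold": 85},
]

r = pearson([row["ad_spend"] for row in rows],
            [row["units_sold"] for row in rows])
print("Correlation between ad_spend and units_sold:", r)
```

Because every row has the same fields, pulling out a column is a one-line operation, which is what makes this kind of analysis straightforward on structured data.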
Unstructured Data
Unstructured data has no predefined format or organization, making it difficult to
manage using traditional data management tools. Examples of unstructured data
include social media posts, emails, images, and videos. Unstructured data is
typically qualitative, that is, descriptive and narrative. Analyzing it requires
advanced analytics techniques such as natural language processing (NLP), for
example sentiment analysis.
Semi-Structured Data
Semi-structured data has elements of both structured and unstructured data. It is
partially organized, but not to the extent that it fits a fixed schema such as a
database table. Examples of semi-structured data include XML and JSON files, which
use tags or keys to label their contents but can also hold free-form text and fields
that vary from record to record. Analyzing semi-structured data typically
requires a combination of traditional data management tools and advanced analytics
techniques.
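The short Python sketch below shows what "partially organized" can look like in practice: three JSON records share some keys but not others, and one way to analyze them with tabular tools is to flatten them into rows with a common set of columns. The records and field names are made up for illustration.

```python
import json

# Made-up JSON: every record has "id" and "name", other fields vary.
raw = """
[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "phones": ["555-0100", "555-0101"]},
  {"id": 3, "name": "Mei", "address": {"city": "Pune"}, "notes": "prefers email"}
]
"""

records = json.loads(raw)

# Collect every key that appears in any record.
all_fields = sorted({key for record in records for key in record})

# Flatten to a table: missing fields become None.
table = [{field: record.get(field) for field in all_fields}
         for record in records]

print("Columns:", all_fields)
for row in table:
    print(row)
```

Nested values such as the phone list and the address object stay as they are here; deciding how to split them into further columns is where the extra analysis effort usually goes.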
Big Data
Big Data is a term used to describe large and complex data sets that cannot be
processed using traditional data management tools. Big Data includes a variety of data
types, including structured, unstructured, and semi-structured data. A main
challenge of analyzing Big Data is its volume: the amount of data is too large to be
stored and processed on a single machine with conventional tools. Analyzing Big Data
requires specialized tools and frameworks such as Apache Hadoop or Apache Spark.
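Tools of this kind generally split the data into chunks, process each chunk independently, and then combine the partial results. The sketch below imitates that map-and-reduce pattern on a single machine in plain Python. It does not use the Hadoop or Spark APIs, and the function names, chunk size, and input lines are hypothetical.

```python
from collections import Counter
from functools import reduce

def map_chunk(lines):
    """Map step: count the words in one chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def reduce_counts(left, right):
    """Reduce step: merge two partial word counts."""
    return left + right

def word_count(lines, chunk_size=2):
    # Split the input into chunks; a cluster would send these to different machines.
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    partial_counts = map(map_chunk, chunks)
    return reduce(reduce_counts, partial_counts, Counter())

lines = [
    "big data is big",
    "data grows fast",
    "spark and hadoop process big data",
]

totals = word_count(lines)
print(totals.most_common(2))
```

Each chunk is counted without looking at the others, so in a real cluster the map step could run on many machines at once, with only the small partial counts sent over the network.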