File 1
Big data analytics describes the process of uncovering trends, patterns, and correlations in
large amounts of raw data to help make data-informed decisions. These processes use familiar
statistical analysis techniques—like clustering and regression—and apply them to more extensive
datasets with the help of newer tools.
1. Collect Data
Data collection looks different for every organization. With today’s technology, organizations
can gather both structured and unstructured data from a variety of sources — from cloud storage to
mobile applications to in-store IoT sensors and beyond. Some data will be stored in data
warehouses where business intelligence tools and solutions can access it easily. Raw or unstructured
data that is too diverse or complex for a warehouse may be assigned metadata and stored in a data
lake.
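The metadata-tagging step above can be sketched in a few lines. This is a minimal illustration, not any particular data lake's API: the function name `tag_for_data_lake` and the metadata fields are hypothetical, chosen only to show raw data being wrapped with descriptive metadata before landing in a lake.

```python
import json
from datetime import datetime, timezone

def tag_for_data_lake(raw_payload: bytes, source: str) -> dict:
    """Wrap a raw payload with descriptive metadata before storing it."""
    return {
        "source": source,  # e.g., an in-store IoT sensor or mobile app
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "size_bytes": len(raw_payload),
        "payload": raw_payload.decode("utf-8", errors="replace"),
    }

record = tag_for_data_lake(b'{"temp_c": 21.4}', source="store-42/shelf-sensor")
print(json.dumps(record, indent=2))
```

The metadata (source, ingestion time, size) is what later lets analysts find and interpret otherwise opaque raw objects in the lake.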
2. Process Data
Once data is collected and stored, it must be organized properly to get accurate results on
analytical queries, especially when it’s large and unstructured. Available data is growing
exponentially, making data processing a challenge for organizations. One processing option is batch
processing, which looks at large data blocks over time. Batch processing is useful when there is a
longer turnaround time between collecting and analyzing data. Stream processing looks at small
batches of data at once, shortening the delay time between collection and analysis for quicker
decision-making. Stream processing is more complex and often more expensive.
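The contrast between the two processing styles can be sketched as follows. This is a toy illustration (the event values and function names are made up): batch processing produces one answer after the whole block is collected, while stream processing yields an updated answer as each event arrives.

```python
from statistics import mean

events = [3, 7, 2, 9, 4, 6]  # e.g., per-minute sales counts

# Batch processing: accumulate the whole block, then analyze it in one pass.
def batch_average(all_events):
    return mean(all_events)

# Stream processing: update a running result as each event arrives,
# so an answer is available long before the batch window closes.
def stream_averages(event_iter):
    total, count = 0, 0
    for value in event_iter:
        total += value
        count += 1
        yield total / count  # current best estimate after each event

print(batch_average(events))          # one answer at the end
print(list(stream_averages(events)))  # an answer after every event
```

Both approaches converge on the same final average; the difference is how soon a usable answer exists, which is exactly the trade-off the paragraph above describes.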
3. Clean Data
Data big or small requires scrubbing to improve data quality and get stronger results; all data
must be formatted correctly, and any duplicative or irrelevant data must be eliminated or accounted
for. Dirty data can obscure and mislead, creating flawed insights.
4. Analyze Data
Getting big data into a usable state takes time. Once it’s ready, advanced analytics processes can
turn big data into big insights. Some of these big data analysis methods include:
Data mining sorts through large datasets to identify patterns and relationships by identifying
anomalies and creating data clusters.
Predictive analytics uses an organization’s historical data to make predictions about the
future, identifying upcoming risks and opportunities.
Deep learning imitates human learning patterns by using artificial intelligence and machine
learning to layer algorithms and find patterns in the most complex and abstract data.
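As a concrete taste of the clustering step that data mining builds on, here is a tiny one-dimensional k-means sketch. It is deliberately simplified (real work would use a library such as scikit-learn); the function name and sample readings are made up for illustration.

```python
def kmeans_1d(values, k=2, iters=20):
    """Tiny 1-D k-means: group values into k clusters by repeatedly
    assigning each value to its nearest centroid and re-averaging."""
    centroids = sorted(values)[:: max(1, len(values) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

readings = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids, clusters = kmeans_1d(readings)
print(sorted(centroids))  # [1.5, 10.5]
```

The two centroids reveal the two natural groups in the readings; values far from every centroid would be candidates for the anomalies the text mentions.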
3. Big data analytics tools and technology
Big data analytics cannot be narrowed down to a single tool or technology. Instead, several types
of tools work together to help you collect, process, cleanse, and analyze big data. Some of the major
players in big data ecosystems are listed below.
Hadoop is an open-source framework that efficiently stores and processes big datasets on
clusters of commodity hardware. This framework is free and can handle large amounts of
structured and unstructured data, making it a valuable mainstay for any big data operation.
NoSQL databases are non-relational data management systems that do not require a fixed
schema, making them a great option for big, raw, unstructured data. NoSQL stands for “not
only SQL,” and these databases can handle a variety of data models.
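The schema-less idea can be illustrated with plain Python dictionaries standing in for documents (this mimics the document model in stores like MongoDB but uses no real database; the collection and field names are hypothetical):

```python
# Two "documents" in the same collection need not share a schema:
docs = [
    {"_id": 1, "user": "ada", "likes": 42},
    {"_id": 2, "user": "grace", "bio": "compiler pioneer", "tags": ["cobol"]},
]

def find(collection, **criteria):
    """Match documents on whatever fields they happen to have."""
    return [d for d in collection
            if all(d.get(k) == v for k, v in criteria.items())]

print(find(docs, user="grace"))  # matches even though its fields differ
```

A relational table would force both rows into one fixed set of columns; here each document carries only the fields that make sense for it.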
MapReduce is an essential component of the Hadoop framework, serving two functions. The
first is mapping, which filters and distributes data to various nodes within the cluster. The second
is reducing, which organizes and aggregates the results from each node to answer a query.
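The two phases can be mimicked on a single machine with a classic word count. This is a single-process sketch of the idea, not Hadoop itself: each input line stands in for a block handled by one node.

```python
from collections import defaultdict
from itertools import chain

lines = ["big data big insights", "data lake data warehouse"]

# Map: each "node" turns its block of input into (key, value) pairs.
mapped = chain.from_iterable(
    ((word, 1) for word in line.split()) for line in lines
)

# Shuffle/Reduce: group the pairs by key and combine the values per key.
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))
# {'big': 2, 'data': 3, 'insights': 1, 'lake': 1, 'warehouse': 1}
```

In a real cluster the map outputs are produced on many machines in parallel and shuffled across the network before reduction; the logic per key is the same.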
YARN stands for “Yet Another Resource Negotiator.” It is another component of
second-generation Hadoop. This cluster management technology helps with job scheduling and
resource management in the cluster.
Spark is an open-source cluster computing framework that uses implicit data parallelism and
fault tolerance to provide an interface for programming entire clusters. Spark can handle both
batch and stream processing for fast computation.
Tableau is an end-to-end data analytics platform that allows you to prep, analyze,
collaborate, and share your big data insights. Tableau excels in self-service visual analysis,
allowing people to ask new questions of governed big data and easily share those insights
across the organization.
Common examples of unstructured and semi-structured data include:
Social media: Social media messages carry a semi-structured component (e.g., fields that do
not conform to a rigid data model but have some structure), but the content of each social
media message itself is unstructured.
Email: While we sometimes consider email semi-structured, its message fields are free-text
fields that are not easily analyzed. Email content may also include video, audio, or photos,
making the messages unstructured.
Text files: Almost all traditional business files — including word processing documents (e.g.,
Google Docs or Microsoft Word), presentations (e.g., Microsoft PowerPoint), notes, and
PDFs — are classified as unstructured data.
Survey responses: When open-ended feedback is gathered via survey (e.g., text box) or
through respondents selecting "liked" photos, unstructured data is being gathered.
Scientific data: Scientific data can include field surveys, space exploration, seismic imagery,
atmospheric data, topographic data, weather data, and medical data. While these types of data
may have a base structure for collection, the data itself is often unstructured and may not lend
itself to traditional analysis tools and dashboards.
Machine and sensor data: Billions of small files from IoT (Internet of Things) devices, such
as mobile phones and iPads, generate significant amounts of unstructured data. Business
systems’ log files, which are not consistent in structure, also create vast amounts of
unstructured data.
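The distinction the social media item draws, between a semi-structured envelope and an unstructured body, can be seen in a small example. The JSON message below is invented for illustration; its named fields can be queried directly, while the free-text body needs text-processing techniques rather than a fixed schema.

```python
import json

post = json.loads("""
{
  "user": "ada",
  "posted_at": "2024-05-01T12:00:00Z",
  "likes": 17,
  "text": "Loved the keynote!! See everyone at the workshop tomorrow :)"
}
""")

# The envelope is semi-structured: named fields we can query directly.
print(post["user"], post["likes"])

# The message body is unstructured free text: even something as basic
# as a word count requires parsing it, not reading a column.
word_count = len(post["text"].split())
print(word_count)  # 10
```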