Adbase Presentation Group 4
Masea Yvone
A data warehouse is a type of data management system that is designed to enable and support business intelligence (BI) activities, especially
analytics. Data warehouses are solely intended to perform queries and analysis and often contain large amounts of historical data. The data
within a data warehouse is usually derived from a wide range of sources such as application log files and transaction applications.
A data warehouse centralizes and consolidates large amounts of data from multiple sources. Its analytical capabilities allow organizations to
derive valuable business insights from their data to improve decision-making. Over time, it builds a historical record that can be invaluable
to data scientists and business analysts. Because of these capabilities, a data warehouse can be considered an organization’s “single source of
truth.”
Data analysis, or analytics, is the process of examining data sets (in the form of text, audio and video) and drawing conclusions about the information they contain, most commonly through specialised systems, software, and methods. Data analytics technologies are used on an industrial scale across commercial industries, as they enable organisations to make calculated, informed business decisions.
Managing Large Datasets
Traditional tools and infrastructure do not work effectively for large, diverse and quickly generated data sets. For an organization to use the full potential of such data, it is important to find a new approach to capturing, storing and analyzing data. Big data analysis technologies use the power of a distributed network of computing resources and a zero-access architecture: distributed computing architectures and non-relational (NoSQL) databases change the way data is managed and analyzed. Innovative servers and scalable in-memory analysis solutions allow optimization of computing power, enabling scaling, reliability and lower maintenance costs for the most demanding analysis tasks.
Methods For Managing Large Datasets
Analysis in operational memory (RAM)
Processing big data sets in main memory can significantly improve the performance and speed of large-scale analysis. In-memory processing technology allows real-time, fact-based decision-making.
Processing in main memory removes one of the basic limitations of many solutions for analyzing and processing big data sets: the high latency and I/O bottlenecks caused by accessing data on disk-based mass storage. In-memory processing keeps all related data in the RAM of the computer system. Access to data is much faster, so analysis can be performed almost instantly, meaning business information is available almost immediately.
In-memory processing technology makes it possible to move entire databases or data warehouses into RAM, allowing quick analysis of an entire big data set. In-memory analysis integrates analytical applications and in-memory databases on dedicated servers. It is an ideal solution for analytical scenarios with high computational requirements that involve real-time data processing. Examples of in-memory database solutions are SQL Server Analysis Services and Hyper (Tableau's in-memory data engine).
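To make the idea concrete, the following is a minimal sketch of in-memory analysis in Python using pandas. The file name sales.csv and its region and amount columns are illustrative assumptions, not part of the products mentioned above. Example (Python):

# Minimal in-memory analysis sketch (assumptions: a hypothetical sales.csv
# with "region" and "amount" columns; pandas is installed).
import pandas as pd

# Load the entire data set into RAM once; subsequent queries avoid disk I/O.
sales = pd.read_csv("sales.csv")

# Aggregations run directly against the in-memory table, so results are
# available almost immediately for interactive, near-real-time analysis.
revenue_by_region = (
    sales.groupby("region")["amount"]
         .sum()
         .sort_values(ascending=False)
)
print(revenue_by_region.head(10))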
NoSQL databases
Non-relational databases come in four main types of store: key-value, column, graph and document. They provide high-performance, highly available storage at large scale. Such databases are useful for handling huge data streams and flexible schemas with short response times. NoSQL databases use a distributed, fault-tolerant architecture that ensures system reliability and scalability. Examples of NoSQL databases are Apache HBase, Apache Cassandra, MongoDB, and Azure DocumentDB.
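As a rough illustration of the document-store flavour, the sketch below uses MongoDB through pymongo. The connection string, database name and the events collection are hypothetical assumptions for the example only. Example (Python):

# Illustrative document-store sketch using MongoDB via pymongo (assumptions:
# a MongoDB instance reachable at localhost:27017 and a hypothetical
# "events" collection; install with `pip install pymongo`).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["analytics"]

# Documents are schema-flexible: each event can carry different fields.
db.events.insert_one({"user_id": 42, "action": "click", "page": "/pricing"})
db.events.insert_one({"user_id": 7, "action": "purchase", "amount": 99.0})

# Query by key/value predicates; indexes keep this fast at scale.
for event in db.events.find({"action": "click"}):
    print(event)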
Columnar database
Columnar databases store data by column rather than by row. They reduce the number of data items read during query processing and provide high performance when executing a large number of concurrent queries. Column-oriented analytical databases are read-optimized environments that offer better cost-effectiveness and scalability than traditional RDBMS systems. They are used for enterprise data stores and other applications with a large number of queries, and they are optimized for storing and retrieving data for advanced analysis. Amazon Redshift, Vertica Analytics Platform and MariaDB are examples of leading column-oriented databases.
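The column-oriented idea can be sketched with the Apache Parquet format via pyarrow; the file name and column names below are assumptions for illustration, not tied to any of the products listed above. Example (Python):

# Illustrative columnar-storage sketch using Apache Parquet via pyarrow
# (assumptions: hypothetical file and column names; pyarrow is installed).
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table in a column-oriented format.
table = pa.table({
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "a"],
    "amount":   [10.0, 25.5, 7.25],
})
pq.write_table(table, "orders.parquet")

# A columnar reader can fetch only the columns a query needs,
# which is what reduces the number of data items read per query.
amounts_only = pq.read_table("orders.parquet", columns=["amount"])
print(amounts_only.to_pandas()["amount"].sum())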
Graph databases and analytical tools for processing big data
Graph databases are a type of NoSQL database. They are particularly useful for related data with a large number of relationships, or when the relationships are more important than the individual objects. Graph data structures are flexible, which facilitates data merging and modelling; queries are faster, and modelling and visualization are more intuitive. Many big data sets have a graph nature. Graph databases operate independently or in conjunction with other graph tools, such as graph visualization and analysis applications or machine learning applications. In the latter case, graph databases make it possible to analyze and predict relationships in order to solve many different problems.
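A tiny, relationship-centric query can be sketched in Python with networkx; the social graph below is a hypothetical example, and a production system would use a dedicated graph database, but the style of query is the same. Example (Python):

# Illustrative graph-analysis sketch using networkx (assumptions: a tiny
# hypothetical social graph built in memory).
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "dave"),
    ("alice", "dave"),
])

# Relationship-centric queries: neighbours and shortest connection paths.
print(list(g.neighbors("alice")))
print(nx.shortest_path(g, "bob", "dave"))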
Extracting, Transformation and Loading (ETL)
Extract, Transform, Load (ETL) operations aggregate, pre-process and save data. However, traditional ETL solutions cannot handle the volume, velocity, and variety of big data sets. The Hadoop platform stores and processes big data in a distributed environment, which makes it possible to divide incoming data streams into fragments for parallel processing of large data sets. The built-in scalability of the Hadoop architecture speeds up ETL tasks, significantly reducing analysis time.
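The three ETL stages can be illustrated with a minimal, single-machine sketch using only the Python standard library; the transactions.csv file, its columns, and the SQLite target are hypothetical. A distributed platform such as Hadoop performs the same stages split across many nodes. Example (Python):

# Minimal ETL sketch (assumptions: a hypothetical transactions.csv with
# "id", "amount" and "currency" columns; SQLite stands in for a warehouse).
import csv
import sqlite3

# Extract: read raw rows from the source file.
with open("transactions.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: keep valid records and normalise amounts to floats.
clean = [
    (int(r["id"]), float(r["amount"]), r["currency"].upper())
    for r in rows
    if r["amount"] and float(r["amount"]) > 0
]

# Load: write the cleaned records into an analytical store.
con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS transactions (id INTEGER, amount REAL, currency TEXT)")
con.executemany("INSERT INTO transactions VALUES (?, ?, ?)", clean)
con.commit()
con.close()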
Big data analysis
Big data is characterised by the three V’s: the major volume of data, the velocity at which it’s
processed, and the wide variety of data. It’s because of the second descriptor, velocity, that
data analytics has expanded into the technological fields of machine learning and artificial
intelligence. Alongside the evolving computer-based analysis techniques that big data harnesses, analysis also relies on traditional statistical methods. Ultimately, data analysis techniques function within an organisation in two ways: big data is analysed by streaming data as it emerges, and by performing batch analyses of data as it accumulates, in order to look for behavioural patterns and trends. As the generation of data increases, so will the various techniques that manage it. The more insightful data becomes in its speed, scale, and depth, the more it fuels innovation.
Big Data Analysis Techniques
1. A/B testing
This involves comparing a control group with a variety of test groups, in order to discern what
treatments or changes will improve a given objective variable. McKinsey gives the example of
analysing what copy, text, images, or layout will improve conversion rates on an e-commerce
site. Big data once again fits into this model, as it can test huge numbers of variations; however, meaningful results can only be achieved if the groups are large enough to detect significant differences.
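As a rough sketch of how such a comparison is evaluated, the example below runs a two-sample z-test on conversion rates with statsmodels; the conversion counts and group sizes are hypothetical. Example (Python):

# Illustrative A/B-test evaluation (assumptions: hypothetical conversion
# counts for a control and a variant page; requires statsmodels).
from statsmodels.stats.proportion import proportions_ztest

conversions = [480, 560]      # converted users in control (A) and variant (B)
visitors    = [10000, 10000]  # group sizes; groups must be large enough

# Two-sample z-test on the difference in conversion rates.
stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The difference in conversion rate is statistically significant.")
else:
    print("No significant difference detected; keep testing or enlarge the groups.")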
2. Data fusion and data integration
Data fusion and integration combine a set of techniques that analyse and integrate data from multiple sources and solutions; the resulting insights are more efficient and potentially more accurate than those developed from a single source of data.
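A minimal sketch of integrating two sources on a shared key is shown below; the CRM and web-analytics tables are hypothetical examples. Example (Python):

# Illustrative data-fusion sketch: integrating two hypothetical sources
# (a CRM export and web-analytics logs) on a shared customer key.
import pandas as pd

crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["enterprise", "smb", "smb"],
})
web = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 3, 3],
    "page_views": [5, 2, 4, 1, 7, 3],
})

# Aggregate one source, then join it to the other to build a richer view
# than either source could provide on its own.
views = web.groupby("customer_id", as_index=False)["page_views"].sum()
fused = crm.merge(views, on="customer_id", how="left")
print(fused)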
3. Data mining
A common tool used within big data analytics, data mining extracts patterns from large data sets
by combining methods from statistics and machine learning, within database management. An
example would be when customer data is mined to determine which segments are most likely to
react to an offer.
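To illustrate the segmentation example, the sketch below clusters customers with k-means in scikit-learn; the spend and frequency figures are made-up assumptions. Example (Python):

# Illustrative data-mining sketch: clustering customers into segments with
# k-means (assumptions: hypothetical spend/frequency features; scikit-learn).
import numpy as np
from sklearn.cluster import KMeans

# Each row: [annual_spend, purchase_frequency] for one customer.
customers = np.array([
    [200, 2], [250, 3], [220, 2],       # low-spend, infrequent buyers
    [1200, 15], [1100, 14], [1300, 18]  # high-spend, frequent buyers
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(model.labels_)  # segment assignment per customer

# A new customer can be assigned to a segment, e.g. to judge whether an
# offer is likely to get a reaction from that segment.
print(model.predict([[1150, 16]]))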
4. Machine learning
Well known within the field of artificial intelligence, machine learning is also used for data
analysis. Emerging from computer science, it works with computer algorithms to produce
assumptions based on data. It provides predictions that would be impossible for human analysts.
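A minimal sketch of a model learning from historical data and producing predictions is given below; the tiny churn data set is a hypothetical assumption. Example (Python):

# Illustrative machine-learning sketch: fit a model on labelled history and
# predict for unseen cases (assumptions: tiny hypothetical churn data set).
from sklearn.linear_model import LogisticRegression

# Features per customer: [months_active, support_tickets]
X = [[1, 5], [2, 4], [3, 6], [24, 0], [30, 1], [36, 0]]
y = [1, 1, 1, 0, 0, 0]  # 1 = churned, 0 = stayed

model = LogisticRegression().fit(X, y)

# Predicted churn probability for an unseen customer.
print(model.predict_proba([[4, 3]])[0][1])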
5. Natural language processing (NLP).
Known as a subspecialty of computer science, artificial intelligence, and linguistics, this data
analysis tool uses algorithms to analyse human (natural) language.
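As a very small sketch of the first steps of such analysis, the example below tokenises free text and counts term frequencies with the standard library; real NLP pipelines add stages such as stop-word removal, stemming or sentiment models, and the review texts here are invented. Example (Python):

# Illustrative NLP sketch: tokenise text and count term frequencies
# (assumptions: two hypothetical customer reviews).
import re
from collections import Counter

reviews = [
    "The delivery was fast and the support team was helpful.",
    "Support was slow to respond but the product itself is great.",
]

tokens = []
for text in reviews:
    # Lower-case and split on non-letter characters.
    tokens.extend(re.findall(r"[a-z]+", text.lower()))

print(Counter(tokens).most_common(5))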
6. Statistics.
This technique works to collect, organise, and interpret data within surveys and experiments, using both descriptive and inferential statistical methods.
Big Data Applications
The primary goal of big data applications is to help companies make more informed business decisions by analyzing large volumes of data. This data can include web server logs, Internet clickstream data, social media content and activity reports, text from customer emails, mobile phone call details and machine data captured by multiple sensors. Organisations from different domains are investing in big data applications to examine large data sets and uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information.
Big Data Applications: Healthcare
Big data analytics has improved healthcare by enabling personalized medicine and prescriptive analytics. Researchers are mining the data to see which treatments are more effective for particular conditions, identify patterns related to drug side effects, and gain other important information that can help patients and reduce costs.
With the added adoption of mHealth, eHealth and wearable technologies, the volume of data is increasing at an exponential rate. This includes electronic health record data, imaging data, patient-generated data, sensor data, and other forms of data.
By mapping healthcare data onto geographical data sets, it is possible to predict diseases that will escalate in specific areas. Based on these predictions, it is easier to strategize diagnostics and plan for stocking serums and vaccines.
Big Data Applications: Call Center Analytics
What’s going on in a customer’s call center is often a great barometer and influencer of market
sentiment, but without a Big Data solution, much of the insight that a call center can provide
will be overlooked or discovered too late. Big Data solutions can help identify recurring
problems or customer and staff behaviour patterns on the fly not only by making sense of
time/quality resolution metrics but also by capturing and processing call content itself.
Big Data Applications: Marketing and Advertising
Marketers have begun to use facial recognition software to learn how well their advertising
succeeds or fails at stimulating interest in their products. A recent study published in the
Harvard Business Review looked at what kinds of advertisements compelled viewers to
continue watching and what turned viewers off. Among their tools was “a system that analyses
facial expressions to reveal what viewers are feeling.” The research was designed to discover
what kinds of promotions induced watchers to share the ads with their social network, helping
marketers create ads most likely to “go viral” and improve sales.
Data Product Development
Data products take on all different shapes and forms. However, they all have similar
goals.
These goals revolve around providing insights, better services, and helping
companies improve their operations. Making a great data product means having a
clear set of questions you want your end-users to be able to answer with your
product or clear goals that help provide information and better services.
This is the first of several articles in which we will discuss developing data products.
This is part of our current push to develop a guide to help companies of all sizes
improve their data strategy.
Step 1: Conceptualizing the Product
This introductory step needs to take place before data acquisition. It requires
conceptualizing the information product, along with identifying the required data
resources. The process involves product definition, data investigation (which should
include sourcing data creatively), and establishing the framework necessary to produce a
prototype.
Creating successful data products like dashboards, data APIs and algorithms first requires
that you have a clear business problem you are trying to solve.
Step 2: Data Acquisition
Defining your final product, what it will do, and generally how it will look is a necessary step before setting up an acquisition plan:
What will it generally look like?
How will users interact with it?
Will it be an API, a dashboard, a model, or a report?
With those questions answered, much of the subsequent data refining can be achieved with automated tools. Real-time machine learning and algorithmic processing of data elements can categorize, correlate, personalize, profile, and search data quickly to create meaningful models that have significant value for consumers.
Data Product Life Cycle
Experiment: Which use case are we trying to solve? Which data do we have available to solve it? I run a few notebooks, execute a few ad-hoc queries, or maybe visualise a few things in a BI tool to understand better what I'm looking for.
Implement: Once I know what I'm looking for, I need to implement it. That means I want to run it regularly, in batch or in real time. Mature data products are built using software engineering best practices, with version control, tests, modular code, and all that good stuff (see the sketch after this list).
Deploy: Ok, it works on my machine. Ship it. Data products only add value once they run
in production. So we need to version our code, build artifacts and actually do deployments
to development, acceptance or production accounts.
Monitor: Once it’s deployed, you need to monitor the performance of the data product,
both from a technical and business perspective.
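The sketch below illustrates the Implement stage referenced above: a small, modular transformation that can live in version control, be unit-tested, and be scheduled as a batch job. The order records and the metric itself are hypothetical; the test can be run with pytest. Example (Python):

# Illustrative "Implement" sketch: a testable, modular batch transformation
# (assumptions: hypothetical order records and a made-up revenue metric).
def daily_revenue(orders):
    """Sum order amounts per day, ignoring cancelled orders."""
    totals = {}
    for order in orders:
        if order.get("status") == "cancelled":
            continue
        day = order["date"]
        totals[day] = totals.get(day, 0.0) + order["amount"]
    return totals


def test_daily_revenue_ignores_cancelled_orders():
    orders = [
        {"date": "2024-01-01", "amount": 10.0, "status": "paid"},
        {"date": "2024-01-01", "amount": 5.0, "status": "cancelled"},
        {"date": "2024-01-02", "amount": 7.5, "status": "paid"},
    ]
    assert daily_revenue(orders) == {"2024-01-01": 10.0, "2024-01-02": 7.5}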
Managing and analyzing large datasets, big data applications, and product development is currently a necessity. In the world of technology, various methods have evolved and are applied to manage and analyze large datasets. Some of these methods include automation, visualization, indexing, compression, version control, and metadata recording, to mention but a few. It is the purpose of this paper to unearth the existing feasible methods of analyzing large datasets, big data applications, and product development.
As the name suggests, big data represents large amounts of data that is unmanageable using traditional
software or internet-based platforms. It surpasses the traditionally used amount of storage, processing and
analytical power. Douglas Laney observed that big data was growing in three different dimensions namely,
volume, velocity and variety, known as the 3 Vs (Laney D, 2001). The ‘big’ part of big data is indicative of
its large volume. In addition to volume, the big data description also includes velocity and variety. Velocity
indicates the speed or rate of data collection and making it accessible for further analysis; while, variety
remarks on the different types of organized and unorganized data that any firm or system can collect, such
as transaction-level data, video, audio, text or log files. These three Vs have become the standard definition of big data. Although other people have added several other Vs to this definition (Mauro AD, Greco M, Grimaldi M, 2016), the most widely accepted fourth V remains veracity.
In view of the above, multifaceted methods have been developed in an attempt to manage and analyze each attribute of large datasets. Among these methods, automation is regarded as key to managing large datasets. Shoaib Mufti postulated that big data sets are too large to comb through manually, so automation is key. The Allen Institute for Brain Science in Seattle, Washington, uses a template for brain-cell and genetics data that accepts information only in the correct format and type. When it is time to integrate those data into a larger database or collection, data-quality assurance steps are automated using Apache Spark and Apache HBase, two open-source tools, to validate and repair data in real time. This set of software tools is able to validate and ingest data and runs in the cloud, which allows end users to scale data easily. At the same institute, the Open Connectome Project also provides automated quality assurance, which generates visualizations of summary statistics that users can inspect before moving forward with their analyses. This scenario shows that automation is key to reducing information overload, and it increases the speed of real-time decision-making.
Researchers can use version-control systems to see how a file has changed over time and who made the modifications. However, some
systems restrict the file sizes one can use. According to Alyssa Goodman, an astrophysicist and data-visualization specialist at Harvard
University, Harvard Dataverse (which is open to all scholars) and Zenodo can be used for version control of big files. Another option
is Dat, a free peer-to-peer network for sharing and versioning files of any size. According to Andrew Osheroff, a primary software
developer at Dat in Copenhagen, the system keeps a tamper-proof history of all the operations you conduct on your file. According to Dat product manager Karissa McKelvey, users can direct the system to archive a copy of each version of a file. Dat is currently a command-line utility, but the team that developed it hopes to release a more user-friendly version.
Computational notebooks such as Jupyter Notebook combine code, results and text in a single document, and researchers can place such documents on or near the data servers to do remote analyses and explore the data. Because Jupyter Notebook is not particularly accessible to researchers who might be uncomfortable using a command line, more user-friendly platforms can bridge the gap, including Terra and Seven Bridges Genomics.
Big data applications can help companies make better business decisions by analyzing large volumes of data and discovering hidden patterns. These data sets might come from social media, data captured by sensors, website logs, and customer feedback. Organizations are spending huge amounts on big data applications to discover hidden patterns, unknown associations, market trends, consumer preferences, and other valuable business information.
There has been a significant improvement in the healthcare domain through personalized medicine and prescriptive analytics, thanks to the role of big data systems. Researchers analyze the data to determine the best treatment for a particular disease, the side effects of drugs, health-risk forecasts, and so on. Mobile health applications and wearable devices are causing the available data to grow at an exponential rate (Doyle-Lindrud S, 2015).
It is possible to predict a disease outbreak by mapping healthcare data and
geographical data. Once predicted, containment of the outbreak can be handled
and plans to eradicate the disease made. Among other benefits of EHRs is that
healthcare professionals have an improved access to the entire medical history of a
patient. The information includes medical diagnoses, prescriptions, data related to
known allergies, demographics, clinical narratives, and the results obtained from
various laboratory tests. The recognition and treatment of medical conditions is thus more time efficient due to a reduction in the lag time of previous test results. In such a setting, data warehouses store massive amounts of data generated from various sources, and this data is processed using analytic pipelines to obtain smarter and more affordable healthcare options.
The media and entertainment industries are creating, advertising, and distributing their content using new business
models. This is due to customer requirements to view digital content from any location and at any time. The introduction
of online TV shows, streaming channels such as Netflix, etc. is proving that customers are interested not only in watching TV but also in accessing content from any location. The media houses are targeting audiences by predicting what they would
like to see, how to target the ads, content monetization, etc. Big data systems are thus increasing the revenues of such
media houses by analyzing viewer patterns.
IoT devices generate continuous data and send them to a server on a daily basis. These data are mined to map the interconnectivity of devices. This mapping can be put to good use by government agencies and by a range of companies to increase their effectiveness. IoT is finding applications in smart irrigation systems, traffic management, crowd management, and more.
Predictive manufacturing can help increase efficiency by producing more goods while minimizing machine downtime. This requires massive quantities of data for such industries, and sophisticated forecasting tools follow an organized process to extract valuable information from these data. The following are some of the major advantages of employing big data applications in manufacturing industries: higher product quality, fault tracking, supply planning, output prediction, increased energy efficiency, testing and simulation of new manufacturing processes, and large-scale customization of manufacturing.
By adopting big data systems, the government can attain efficiencies in terms of cost, output, and novelty. Since
the same data set is used in many applications, many departments can work in association with each other.
Government plays an important role in innovation by acting in all these domains.
Big data applications can be applied in each and every field. Some of the major areas where big data finds
applications include: agriculture, aviation, cyber security and intelligence, crime prediction and prevention, e-
commerce, fake news detection, fraud detection, pharmaceutical drug evaluation, scientific research, weather
forecasting, and tax compliance.
The best logical approach for analyzing huge volumes of complex big data is to distribute and process it in
parallel on multiple nodes. However, the size of data is usually so large that thousands of computing machines are
required to distribute and finish processing in a reasonable amount of time. When working with hundreds or
thousands of nodes, one has to handle issues like how to parallelize the computation, distribute the data, and
handle failures. One of the most popular open-source distributed platforms for this purpose is Hadoop (Shvachko K, et al., 2010). Hadoop implements the MapReduce algorithm for processing and generating large datasets.
MapReduce uses map and reduce primitives to map each logical record in the input into a set of intermediate key/value pairs, and a reduce operation combines all the values that share the same key (Dean J, Ghemawat S, 2008). It efficiently parallelizes the computation, handles failures, and schedules inter-machine communication across large-scale clusters of machines. The Hadoop Distributed File System (HDFS) is the file-system component that provides scalable, efficient, and replica-based storage of data at the various nodes that form part of a cluster (Shvachko K, et al., 2010). Hadoop has other tools that enhance the storage and processing components; therefore, many large companies like Yahoo, Facebook, and others have rapidly adopted it. Hadoop has enabled researchers to use data sets otherwise impossible
to handle. Many large projects, like the determination of a correlation between the air quality
data and asthma admissions, drug development using genomic and proteomic data, and other
such aspects of healthcare are implementing Hadoop. Therefore, with the implementation of the Hadoop system, healthcare analytics will not be held back.
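To make the map and reduce primitives concrete, the sketch below is a minimal single-machine simulation of the MapReduce word-count pattern; the input lines are invented, and on a real Hadoop cluster the same phases are distributed across many nodes and HDFS blocks. Example (Python):

# Minimal single-machine simulation of the MapReduce word-count pattern
# (assumptions: two hypothetical input lines; no cluster required).
from collections import defaultdict

lines = [
    "big data needs distributed processing",
    "hadoop processes big data with mapreduce",
]

# Map: emit intermediate (key, value) pairs from each logical record.
intermediate = []
for line in lines:
    for word in line.split():
        intermediate.append((word, 1))

# Shuffle: group all values that share the same key.
grouped = defaultdict(list)
for key, value in intermediate:
    grouped[key].append(value)

# Reduce: combine the values for each key into a final result.
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)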
Apache Spark is another open source alternative to Hadoop. It is a unified engine for distributed data
processing that includes higher-level libraries for supporting SQL queries (Spark SQL), streaming
data (Spark Streaming), machine learning (MLlib) and graph processing (GraphX) Zaharia M, et al
(2016). These libraries help increase developer productivity because the programming interface requires less coding effort, and they can be seamlessly combined to create many types of complex computations. By implementing Resilient Distributed Datasets (RDDs), Spark supports in-memory processing that can make it about 100× faster than Hadoop in multi-pass analytics on smaller datasets (Gopalani S, Arora R, 2015). This is especially true when the data size is smaller than the available memory (Saouabi M, Ezzati A, 2017), which indicates that processing really big data with Apache Spark would require a large amount of memory. Since the cost of memory is higher than that of hard drives, MapReduce is expected to be more cost-effective than Apache Spark for large datasets. Similarly, Apache Storm was developed to provide a real-time framework for data stream processing. This platform supports most programming languages and offers good horizontal scalability and built-in fault-tolerance for big data analysis.
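A brief sketch of Spark's higher-level interfaces is shown below, combining the DataFrame API and Spark SQL; it assumes a local Spark installation, and the in-line sample data stands in for a real distributed data set. Example (Python):

# Illustrative PySpark sketch: DataFrame operations plus Spark SQL
# (assumptions: PySpark installed locally; tiny hypothetical event data).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", "click", 1), ("bob", "purchase", 99), ("alice", "purchase", 49)],
    ["user", "action", "value"],
)

# DataFrame API: filter and aggregate; Spark parallelises this over RDDs.
purchases = df.filter(F.col("action") == "purchase")
purchases.groupBy("user").agg(F.sum("value").alias("total")).show()

# Spark SQL over the same data via a temporary view.
df.createOrReplaceTempView("events")
spark.sql("SELECT action, COUNT(*) AS n FROM events GROUP BY action").show()

spark.stop()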
In order to tackle big data challenges and perform smoother analytics, various companies
have implemented AI to analyze published results, textual data, and image data to obtain
meaningful outcomes. IBM Corporation is one of the biggest and experienced players in
this sector to provide healthcare analytics services commercially. IBM’s Watson Health is
an AI platform to share and analyze health data among hospitals, providers and
researchers. Similarly, Flatiron Health provides technology-oriented services in healthcare analytics, especially focused on cancer research. Other big companies such as Oracle Corporation and Google Inc. are also focusing on developing cloud-based storage and distributed computing power platforms. Interestingly, in recent years, several
companies and start-ups have also emerged to provide health care-based analytics and
solutions.
Quantum computing is picking up and seems to be a potential solution for big data analysis. For
example, identification of rare events, such as the production of Higgs bosons at the Large Hadron
Collider (LHC) can now be performed using quantum approaches (Mott A, et al, 2017). At the LHC, huge amounts of collision data (1 PB/s) are generated that need to be filtered and analyzed. One such
approach, the quantum annealing for ML (QAML) that implements a combination of ML and quantum
computing with a programmable quantum annealer, helps reduce human intervention and increase the
accuracy of assessing particle-collision data. In another example, the quantum support vector machine
was implemented for both training and classification stages to classify new data Rebentrost P, Mohseni
M, Lloyd S (2014). Such quantum approaches could find applications in many areas of science Mott A,
et al (2017). Indeed, recurrent quantum neural network (RQNN) was implemented to increase signal
separability in electroencephalogram (EEG) signals Gandhi V, et al (2014). Similarly, quantum
annealing was applied to intensity modulated radiotherapy (IMRT) beamlet intensity optimization
Nazareth DP, Spaans JD (2015). Similarly, there exist more applications of quantum approaches
regarding healthcare e.g. quantum sensors and quantum microscopes Reardon S (2017).
Step 1: Conceptualizing the Product
Before jumping in, organizations must identify an information product that meets a need from the marketplace. This
introductory step needs to take place before data acquisition. It requires conceptualizing the information product, along with
identifying the required data resources. The process involves product definition, data investigation (which should include sourcing data creatively), and establishing the framework necessary to produce a prototype. Once this set of requirements is
met, the remaining steps of the development process can be carried out more efficiently. For example, once managers know
the data elements that will go into the product, storage and retrieval can be streamlined.
An interesting example is CarMD.com Corp., an Irvine, California-based company that provides services that leverage
automotive diagnostic information. The original idea was to provide diagnostic capabilities that led consumers to auto repair
estimates and potential service providers. One of the company’s products compares the data extracted from onboard
computers in cars against online auto repair databases and offers consumers information on auto maintenance.
Step 2: Data Acquisition
Once the conceptual model has been worked out, data acquisition can be pursued in a more efficient
manner. Organizations tend to acquire or accumulate data that corresponds to their functional
activities. However, given the vast amounts of data being generated by information devices and the
data available from public sources, the acquisition process needs to connect the requirements of the
conceptual model to data that will create the product. In addition to acquiring structured data (for
example, customer purchase records), companies should also consider using unstructured sources
(for instance, social media comments) that might be able to add value. Companies should be
prepared to look within and outside their own systems for such data.
Step 3: Refinement
Although Meyer and Zack’s data refinement process remains quite relevant, it has to be augmented to facilitate new data
sources and to take advantage of advanced analytic methods. The original model talked about the importance of being
able to “glean further meaning from combinations of individual [data] elements.” Today, much data refining is achieved
with automated tools. Real-time machine learning and algorithmic processing of data elements can categorize, correlate,
personalize, profile, and search data quickly to create meaningful models that have significant value for consumers.
For example, Passur Aerospace Inc., based in Stamford, Connecticut, uses both its own data and public data to develop
scheduling information for airlines and travelers. Drawing on publicly available data on weather, flight schedules, and
other factors, along with its own internal data based on radar statistic feeds, it generates flight arrival estimates. Applying
advanced analytics, Passur’s arrival estimates outperform ones based on traditional techniques.
Step 4: Storage and Retrieval
Storage and retrieval are as important as ever. However, retrieval in today’s environment must incorporate
advancements in query and search processing capabilities (for instance, making use of algorithms) that can access
more granular levels of data. Traditional storage techniques need to be augmented by new technologies such as
map reduction (a software framework for distributed processing of large data sets on computer clusters of
commodity hardware) and parallel processing capabilities to manage larger and faster-moving data sources. Many
organizations store data in relatively unstructured formats when they initially capture it, refining it over time.
Data storage, retrieval, and processing are increasingly taking place in the cloud rather than on a company’s
premises. This not only provides companies with flexibility in their technology infrastructures but also can make
it easier for them to combine internal and external data.
Step 5: Distribution
The distribution options for information products have shifted dramatically from the earlier menu of possibilities, some of which (such
as fax and CD-ROM) have been superseded by the Web. Timing and frequency remain critical aspects of distribution; data products
must be continuously available and updated in near-real time. In the digital economy, online media (such as websites and portals) fully
address the required level of continuous accessibility to information products. However, Web access via traditional computers is quickly
being overtaken by mobile access via smartphones, tablets, and apps. As a result, providers of information products that are distributed
via mobile devices need to revamp their content formats and design.
At the same time, distributing data products through the cloud adds a new dimension to the question of how frequently information
needs to be updated for users. Consider, for example, a business-to-business case involving a shipping service provider that offers
information products including en-route metrics, such as estimates of time to delivery. Assuming the data is available, the frequency and
timeliness of such information — generated through GPS traffic information, location data, and analytics — can be close to real time.
Step 6: Presentation
In the original Meyer-Zack model, information products gained value from the context of their
use. The user interface mattered — and the easier products were to use, the more valuable they
were. Although the digital economy places heavier emphasis on analytics than on simple data
provision, there are some important constants. While standard reporting (that is, simple
information products) continues to meet the needs of many consumers, more advanced analytics-
based products such as forecasts, predictions, and probabilities (such as real-time calculations
generated through machine learning) can lead to differentiation and competitive advantage.
Step 7: Market Feedback
The competitive nature of the information product space, availability of new data sources, and demand for timely
decision support require an ongoing emphasis on innovation and on monitoring product usage. Adding this step at this
stage of the analytics-based data product development process is consistent with the iterative nature of product
development in a “lean startup” context. Once again, the evolution of new technologies has provided a mechanism for
facilitating a feedback and information extraction process from the marketplace. New forms of market research are
capable of leveraging social media platforms (for example, business Facebook pages) to listen to the marketplace.
Interactive blogs and flash surveys can be utilized to assess customer perceptions of existing information products. New
features of online information products can be tested in a matter of hours with A/B or multivariate online testing
approaches. Both user correspondence and digital metrics on product use (for instance, views, clicks, downloads, and
bounces) can be analyzed to enhance products continuously.
A Structured Approach to Stakeholder Involvement
In order to achieve effective results from the implementation of the product development model, stakeholder
involvement is essential. Having particular types of input at different stages of the product development process is
important. Therefore, companies need to develop some degree of structure for stakeholder input.
During the stage when the product is being conceptualized, it’s important to have involvement from three specific
groups: subject matter experts at the business level (who can help determine the feasibility of the product design);
managers of existing and complementary information products (who can help companies avoid cannibalization and
duplication); and marketing people (who can help assess the nature and scale of consumer demand). These
individuals can assist in providing the framework for designing or upgrading existing products to add value to meet
market needs.
For the data acquisition and the storage and refinement stages, stakeholder involvement should expand to
include legal representatives, who can speak to data ownership, privacy, and use issues; IT personnel, who can
provide input on hardware and software requirements for data products and also help in developing and
improving the functionality of the product; and data managers and analytics and data scientists to assist in
product platform execution. It is critical to involve analytics and data science professionals to help in
structuring and analyzing data.
For the distribution and presentation stages, the stakeholders should again include marketing people (who can
help sort out consumer/user needs for the initial product launch and subsequent product releases) and IT
personnel (who can deal with hardware and software issues in product functionality during the product rollout).
Conclusion
Tool selection is a necessary first step. Often, the choice of tools is decided well in advance of the
specific project of interest. Organizations make the decision to use SAS, SPSS, R, or even Excel for all
their data analysis needs. Since specific applications of those tools are not all known in advance, the
choice is made for one-size-fits-all needs. If given the choice or flexibility to choose other tools, think
carefully about the capabilities needed. If running R, is there concern about running out of memory given
the size of data? Can these concerns be addressed with a better server, more memory, or other tools such
as Hadoop? Even tools such as SAS come with practical concerns that should be carefully considered.
Implementation of artificial intelligence (AI) algorithms and novel fusion algorithms
would be necessary to make sense from this large amount of data. Indeed, it would be a
great feat to achieve automated decision-making by the implementation of machine
learning (ML) methods like neural networks and other AI techniques. However, in the absence of appropriate software and hardware support, big data can be quite hazy. We need to
develop better techniques to handle this ‘endless sea’ of data and smart web applications
for efficient analysis to gain workable insights. With proper storage and analytical tools in
hand, the information and insights derived from big data can make the critical social
infrastructure components and services (like healthcare, safety or transportation) more
aware, interactive and efficient. In addition, visualization of big data in a user-friendly
manner will be a critical factor for societal development.
References
Laney D. 3D data management: controlling data volume, velocity, and variety, Application delivery strategies.
Stamford: META Group Inc; 2001
Mauro AD, Greco M, Grimaldi M. A formal definition of big data based on its essential features. Libr Rev.
2016;65(3):122–35.
Rebentrost P, Mohseni M, Lloyd S. Quantum support vector machine for big data classification. Phys Rev Lett.
2014;113(13):130503.
SparkSeq is an efficient and cloud-ready platform based on the Apache Spark framework and the Hadoop library that is used for interactive analysis of genomic data with nucleotide precision.
SAMQA identifies errors and ensures the quality of large-scale genomic data. This tool was
originally built for the National Institutes of Health Cancer Genome Atlas project to identify
and report errors including sequence alignment/map [SAM] format error and empty reads.
ART can simulate profiles of read errors and read lengths for data obtained using high-throughput sequencing platforms.
DistMap is another toolkit used for distributed short-read mapping based on Hadoop cluster that aims to cover a
wider range of sequencing applications. For instance, one of its applications namely the BWA mapper can perform
500 million read pairs in about 6 h, approximately 13 times faster than a conventional single-node mapper.
SeqWare is a query engine based on the Apache HBase database system that enables access to large-scale whole-genome datasets by integrating genome browsers and tools.
CloudBurst is a parallel computing model utilized in genome mapping experiments to improve the scalability of
reading large sequencing data.
Hydra uses the Hadoop-distributed computing framework for processing large peptide
and spectra databases for proteomics datasets. This specific tool is capable of performing
27 billion peptide scorings in less than 60 min on a Hadoop cluster.