Adbase Presentation Group 4
Masea Yvone
A data warehouse is a type of data management system that is designed to enable and support business intelligence (BI) activities, especially
analytics. Data warehouses are solely intended to perform queries and analysis and often contain large amounts of historical data. The data
within a data warehouse is usually derived from a wide range of sources such as application log files and transaction applications.
A data warehouse centralizes and consolidates large amounts of data from multiple sources. Its analytical capabilities allow organizations to
derive valuable business insights from their data to improve decision-making. Over time, it builds a historical record that can be invaluable
to data scientists and business analysts. Because of these capabilities, a data warehouse can be considered an organization’s “single source of
truth.”
Data analysis, or analytics, is the process of examining data sets (in the form of text, audio and video) and drawing conclusions about the information they contain, most commonly through specialised systems, software, and methods. Data analytics technologies are used on an industrial scale across commercial industries, as they enable organisations to make calculated, informed business decisions.
Managing Large Datasets
Traditional tools and infrastructure do not work effectively for large, diverse and quickly generated data sets. For an organization to use the full potential of such data, it is important to find a new approach to capturing, storing and analyzing data. Big data analysis technologies use the power of a distributed network of computing resources and a zero-access architecture: distributed computing architectures and non-relational (NoSQL) databases change the way data is managed and analyzed. Innovative servers and scalable in-memory analysis solutions allow optimization of computing power, enabling scaling, reliability and lower maintenance costs for the most demanding analysis tasks.
Methods For Managing Large Datasets
Analysis in operational memory (RAM)
Processing big data sets in main memory can significantly improve the performance and speed of large-scale analysis. In-memory processing technology allows real-time, fact-based decision-making.
Processing in main memory removes one of the basic limitations of many solutions for analyzing and processing big data sets: the high latency and I/O bottlenecks caused by accessing data on disk-based mass storage. In-memory processing keeps all related data in the RAM of the computer system. Access to data is much faster, so analysis can be performed almost instantly, meaning business information is available almost immediately.
In-memory processing technology makes it possible to move entire databases or data warehouses into RAM, allowing quick analysis of an entire big data set. In-memory analysis integrates analytical applications and in-memory databases on dedicated servers. It is an ideal solution for analytical scenarios with high computational requirements that involve real-time data processing. Examples of in-memory database solutions are SQL Server Analysis Services and Hyper (Tableau's in-memory data engine).
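To make the idea concrete, the following is a minimal sketch of in-memory analysis in Python using pandas. The file name sales.csv and its region and amount columns are illustrative assumptions, not part of the products mentioned above. Example (Python):

# Minimal in-memory analysis sketch (assumptions: a hypothetical sales.csv
# with "region" and "amount" columns; pandas is installed).
import pandas as pd

# Load the entire data set into RAM once; subsequent queries avoid disk I/O.
sales = pd.read_csv("sales.csv")

# Aggregations run directly against the in-memory table, so results are
# available almost immediately for interactive, near-real-time analysis.
revenue_by_region = (
    sales.groupby("region")["amount"]
         .sum()
         .sort_values(ascending=False)
)
print(revenue_by_region.head(10))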
NoSQL databases
Non-relational databases come in four main types of store: key-value, column, graph and document. They provide high-performance, highly available storage at large scale. Such databases are useful for handling huge data streams and flexible schemas with short response times. NoSQL databases use a distributed, fault-tolerant architecture that ensures system reliability and scalability. Examples of NoSQL databases are Apache HBase, Apache Cassandra, MongoDB, and Azure DocumentDB.
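As a rough illustration of the document-store flavour, the sketch below uses MongoDB through pymongo. The connection string, database name and the events collection are hypothetical assumptions for the example only. Example (Python):

# Illustrative document-store sketch using MongoDB via pymongo (assumptions:
# a MongoDB instance reachable at localhost:27017 and a hypothetical
# "events" collection; install with `pip install pymongo`).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["analytics"]

# Documents are schema-flexible: each event can carry different fields.
db.events.insert_one({"user_id": 42, "action": "click", "page": "/pricing"})
db.events.insert_one({"user_id": 7, "action": "purchase", "amount": 99.0})

# Query by key/value predicates; indexes keep this fast at scale.
for event in db.events.find({"action": "click"}):
    print(event)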
Columnar database
Columnar databases store data by column rather than by row. They reduce the number of data items read during query processing and provide high performance when executing a large number of concurrent queries. Column-oriented analytical databases are read-optimized environments that offer better cost-effectiveness and scalability than traditional RDBMS systems. They are used for enterprise data stores and other applications with a large number of queries, and they are optimized for storing and retrieving data for advanced analysis. Amazon Redshift, Vertica Analytics Platform and MariaDB are examples of leading column-oriented databases.
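The column-oriented idea can be sketched with the Apache Parquet format via pyarrow; the file name and column names below are assumptions for illustration, not tied to any of the products listed above. Example (Python):

# Illustrative columnar-storage sketch using Apache Parquet via pyarrow
# (assumptions: hypothetical file and column names; pyarrow is installed).
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table in a column-oriented format.
table = pa.table({
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "a"],
    "amount":   [10.0, 25.5, 7.25],
})
pq.write_table(table, "orders.parquet")

# A columnar reader can fetch only the columns a query needs,
# which is what reduces the number of data items read per query.
amounts_only = pq.read_table("orders.parquet", columns=["amount"])
print(amounts_only.to_pandas()["amount"].sum())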
Graph databases and analytical tools for processing big data
Graph databases are a type of NoSQL database. They are particularly useful for related data with a large number of relationships, or when the relationships are more important than the individual objects. Graph data structures are flexible, which facilitates data merging and modelling; queries are faster, and modelling and visualization are more intuitive. Many big data sets have a graph nature. Graph databases operate independently or in conjunction with other graph tools, such as graph visualization and analysis applications or machine learning applications. In the latter case, graph databases make it possible to analyze and predict relationships in order to solve many different problems.
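A tiny, relationship-centric query can be sketched in Python with networkx; the social graph below is a hypothetical example, and a production system would use a dedicated graph database, but the style of query is the same. Example (Python):

# Illustrative graph-analysis sketch using networkx (assumptions: a tiny
# hypothetical social graph built in memory).
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "dave"),
    ("alice", "dave"),
])

# Relationship-centric queries: neighbours and shortest connection paths.
print(list(g.neighbors("alice")))
print(nx.shortest_path(g, "bob", "dave"))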
Extracting, Transformation and Loading (ETL)
Extract, Transform, Load (ETL) operations aggregate, pre-process and save data. However, traditional ETL solutions cannot handle the volume, velocity, and variety of big data sets. The Hadoop platform stores and processes big data in a distributed environment, which makes it possible to divide incoming data streams into fragments for parallel processing of large data sets. The built-in scalability of the Hadoop architecture speeds up ETL tasks, significantly reducing analysis time.
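The three ETL stages can be illustrated with a minimal, single-machine sketch using only the Python standard library; the transactions.csv file, its columns, and the SQLite target are hypothetical. A distributed platform such as Hadoop performs the same stages split across many nodes. Example (Python):

# Minimal ETL sketch (assumptions: a hypothetical transactions.csv with
# "id", "amount" and "currency" columns; SQLite stands in for a warehouse).
import csv
import sqlite3

# Extract: read raw rows from the source file.
with open("transactions.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: keep valid records and normalise amounts to floats.
clean = [
    (int(r["id"]), float(r["amount"]), r["currency"].upper())
    for r in rows
    if r["amount"] and float(r["amount"]) > 0
]

# Load: write the cleaned records into an analytical store.
con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS transactions (id INTEGER, amount REAL, currency TEXT)")
con.executemany("INSERT INTO transactions VALUES (?, ?, ?)", clean)
con.commit()
con.close()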
Big data analysis
Big data is characterised by the three V’s: the major volume of data, the velocity at which it’s
processed, and the wide variety of data. It’s because of the second descriptor, velocity, that
data analytics has expanded into the technological fields of machine learning and artificial
intelligence. Alongside the evolving computer-based analysis techniques that big data harnesses, analysis also relies on traditional statistical methods. Ultimately, data analysis techniques function within an organisation in two ways: big data is analysed by streaming data as it emerges, and by performing batch analyses of data as it accumulates, in order to look for behavioural patterns and trends. As the generation of data increases, so will the various techniques that manage it. The more insightful data becomes in its speed, scale, and depth, the more it fuels innovation.
Big Data Analysis Techniques
1. A/B testing
This involves comparing a control group with a variety of test groups, in order to discern what
treatments or changes will improve a given objective variable. McKinsey gives the example of
analysing what copy, text, images, or layout will improve conversion rates on an e-commerce
site. Big data once again fits into this model, as it can test huge numbers of variations; however, meaningful results can only be achieved if the groups are large enough to detect significant differences.
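As a rough sketch of how such a comparison is evaluated, the example below runs a two-sample z-test on conversion rates with statsmodels; the conversion counts and group sizes are hypothetical. Example (Python):

# Illustrative A/B-test evaluation (assumptions: hypothetical conversion
# counts for a control and a variant page; requires statsmodels).
from statsmodels.stats.proportion import proportions_ztest

conversions = [480, 560]      # converted users in control (A) and variant (B)
visitors    = [10000, 10000]  # group sizes; groups must be large enough

# Two-sample z-test on the difference in conversion rates.
stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The difference in conversion rate is statistically significant.")
else:
    print("No significant difference detected; keep testing or enlarge the groups.")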
2. Data fusion and data integration
Data fusion and integration combine a set of techniques that analyse and integrate data from multiple sources and solutions; the resulting insights are more efficient and potentially more accurate than those developed from a single source of data.
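A minimal sketch of integrating two sources on a shared key is shown below; the CRM and web-analytics tables are hypothetical examples. Example (Python):

# Illustrative data-fusion sketch: integrating two hypothetical sources
# (a CRM export and web-analytics logs) on a shared customer key.
import pandas as pd

crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["enterprise", "smb", "smb"],
})
web = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 3, 3],
    "page_views": [5, 2, 4, 1, 7, 3],
})

# Aggregate one source, then join it to the other to build a richer view
# than either source could provide on its own.
views = web.groupby("customer_id", as_index=False)["page_views"].sum()
fused = crm.merge(views, on="customer_id", how="left")
print(fused)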
3. Data mining
A common tool used within big data analytics, data mining extracts patterns from large data sets
by combining methods from statistics and machine learning, within database management. An
example would be when customer data is mined to determine which segments are most likely to
react to an offer.
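To illustrate the segmentation example, the sketch below clusters customers with k-means in scikit-learn; the spend and frequency figures are made-up assumptions. Example (Python):

# Illustrative data-mining sketch: clustering customers into segments with
# k-means (assumptions: hypothetical spend/frequency features; scikit-learn).
import numpy as np
from sklearn.cluster import KMeans

# Each row: [annual_spend, purchase_frequency] for one customer.
customers = np.array([
    [200, 2], [250, 3], [220, 2],       # low-spend, infrequent buyers
    [1200, 15], [1100, 14], [1300, 18]  # high-spend, frequent buyers
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(model.labels_)  # segment assignment per customer

# A new customer can be assigned to a segment, e.g. to judge whether an
# offer is likely to get a reaction from that segment.
print(model.predict([[1150, 16]]))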
4. Machine learning
Well known within the field of artificial intelligence, machine learning is also used for data
analysis. Emerging from computer science, it works with computer algorithms to produce
assumptions based on data. It provides predictions that would be impossible for human analysts.
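A minimal sketch of a model learning from historical data and producing predictions is given below; the tiny churn data set is a hypothetical assumption. Example (Python):

# Illustrative machine-learning sketch: fit a model on labelled history and
# predict for unseen cases (assumptions: tiny hypothetical churn data set).
from sklearn.linear_model import LogisticRegression

# Features per customer: [months_active, support_tickets]
X = [[1, 5], [2, 4], [3, 6], [24, 0], [30, 1], [36, 0]]
y = [1, 1, 1, 0, 0, 0]  # 1 = churned, 0 = stayed

model = LogisticRegression().fit(X, y)

# Predicted churn probability for an unseen customer.
print(model.predict_proba([[4, 3]])[0][1])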
5. Natural language processing (NLP).
Known as a subspecialty of computer science, artificial intelligence, and linguistics, this data
analysis tool uses algorithms to analyse human (natural) language.
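As a very small sketch of the first steps of such analysis, the example below tokenises free text and counts term frequencies with the standard library; real NLP pipelines add stages such as stop-word removal, stemming or sentiment models, and the review texts here are invented. Example (Python):

# Illustrative NLP sketch: tokenise text and count term frequencies
# (assumptions: two hypothetical customer reviews).
import re
from collections import Counter

reviews = [
    "The delivery was fast and the support team was helpful.",
    "Support was slow to respond but the product itself is great.",
]

tokens = []
for text in reviews:
    # Lower-case and split on non-letter characters.
    tokens.extend(re.findall(r"[a-z]+", text.lower()))

print(Counter(tokens).most_common(5))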
6. Statistics.
This technique works to collect, organise, and interpret data within surveys and experiments, using both descriptive and inferential statistical methods.
Big Data Applications
The primary goal of big data applications is to help companies make more informed business decisions by analyzing large volumes of data. This data can include web server logs, Internet clickstream data, social media content and activity reports, text from customer emails, mobile phone call details and machine data captured by multiple sensors. Organisations from different domains are investing in big data applications to examine large data sets and uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information.
Big Data Applications: Healthcare
Big data analytics has improved healthcare by enabling personalized medicine and prescriptive analytics. Researchers are mining the data to see which treatments are more effective for particular conditions, identify patterns related to drug side effects, and gain other important information that can help patients and reduce costs.
With the added adoption of mHealth, eHealth and wearable technologies, the volume of data is increasing at an exponential rate. This includes electronic health record data, imaging data, patient-generated data, sensor data, and other forms of data.
By mapping healthcare data onto geographical data sets, it is possible to predict diseases that will escalate in specific areas. Based on these predictions, it is easier to strategize diagnostics and plan for stocking serums and vaccines.
Big Data Applications: Call Center Analytics
What’s going on in a customer’s call center is often a great barometer and influencer of market
sentiment, but without a Big Data solution, much of the insight that a call center can provide
will be overlooked or discovered too late. Big Data solutions can help identify recurring
problems or customer and staff behaviour patterns on the fly not only by making sense of
time/quality resolution metrics but also by capturing and processing call content itself.
Big Data Applications: Marketing and Advertising
Marketers have begun to use facial recognition software to learn how well their advertising
succeeds or fails at stimulating interest in their products. A recent study published in the
Harvard Business Review looked at what kinds of advertisements compelled viewers to
continue watching and what turned viewers off. Among their tools was “a system that analyses
facial expressions to reveal what viewers are feeling.” The research was designed to discover
what kinds of promotions induced watchers to share the ads with their social network, helping
marketers create ads most likely to “go viral” and improve sales.
Data Product Development
Data products take on all different shapes and forms. However, they all have similar
goals.
These goals revolve around providing insights, better services, and helping
companies improve their operations. Making a great data product means having a
clear set of questions you want your end-users to be able to answer with your
product or clear goals that help provide information and better services.
This is the first of several articles in which we will discuss developing data products.
This is part of our current push to develop a guide to help companies of all sizes
improve their data strategy.
Step 1: Conceptualizing the Product
This introductory step needs to take place before data acquisition. It requires
conceptualizing the information product, along with identifying the required data
resources. The process involves product definition, data investigation (which should
include sourcing data creatively), and establishing the framework necessary to produce a
prototype.
Creating successful data products like dashboards, data APIs and algorithms first requires
that you have a clear business problem you are trying to solve.
Step 2: Data Acquisition
Defining your final product, what it will do, and generally how it will look is a necessary step before setting up an acquisition plan:
What will it generally look like?
How will users interact with it?
Will it be an API, a dashboard, a model, or a report?
With those questions answered, much of the subsequent data refining can be achieved with automated tools. Real-time machine learning and algorithmic processing of data elements can categorize, correlate, personalize, profile, and search data quickly to create meaningful models that have significant value for consumers.
Data Product Life Cycle
Experiment: Which use case are we trying to solve? Which data do we have available to solve it? I run a few notebooks, execute a few ad-hoc queries, or maybe visualise a few things in a BI tool to understand better what I'm looking for.
Implement: Once I know what I'm looking for, I need to implement it. That means I want to run it regularly, in batch or in real time. Mature data products are built using software engineering best practices, with version control, tests, modular code, and all that good stuff (see the sketch after this list).
Deploy: Ok, it works on my machine. Ship it. Data products only add value once they run
in production. So we need to version our code, build artifacts and actually do deployments
to development, acceptance or production accounts.
Monitor: Once it’s deployed, you need to monitor the performance of the data product,
both from a technical and business perspective.
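The sketch below illustrates the Implement stage referenced above: a small, modular transformation that can live in version control, be unit-tested, and be scheduled as a batch job. The order records and the metric itself are hypothetical; the test can be run with pytest. Example (Python):

# Illustrative "Implement" sketch: a testable, modular batch transformation
# (assumptions: hypothetical order records and a made-up revenue metric).
def daily_revenue(orders):
    """Sum order amounts per day, ignoring cancelled orders."""
    totals = {}
    for order in orders:
        if order.get("status") == "cancelled":
            continue
        day = order["date"]
        totals[day] = totals.get(day, 0.0) + order["amount"]
    return totals


def test_daily_revenue_ignores_cancelled_orders():
    orders = [
        {"date": "2024-01-01", "amount": 10.0, "status": "paid"},
        {"date": "2024-01-01", "amount": 5.0, "status": "cancelled"},
        {"date": "2024-01-02", "amount": 7.5, "status": "paid"},
    ]
    assert daily_revenue(orders) == {"2024-01-01": 10.0, "2024-01-02": 7.5}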
Managing and analyzing large datasets, big data applications, and product development is currently a necessity. In the world of technology, various methods have evolved and are applied to manage and analyze large datasets. Some of these methods include automation, visualization, indexing, compression, version control, and metadata recording, to mention but a few. It is the purpose of this paper to unearth the existing feasible methods of analyzing large datasets, big data applications, and product development.
As the name suggests, big data represents large amounts of data that is unmanageable using traditional
software or internet-based platforms. It surpasses the traditionally used amount of storage, processing and
analytical power. Douglas Laney observed that big data was growing in three different dimensions namely,
volume, velocity and variety, known as the 3 Vs (Laney D, 2001). The ‘big’ part of big data is indicative of
its large volume. In addition to volume, the big data description also includes velocity and variety. Velocity
indicates the speed or rate of data collection and making it accessible for further analysis; while, variety
remarks on the different types of organized and unorganized data that any firm or system can collect, such
as transaction-level data, video, audio, text or log files. These three Vs have become the standard definition of big data. Although other people have added several other Vs to this definition (Mauro AD, Greco M, Grimaldi M, 2016), the most widely accepted fourth V remains veracity.
In view of the above, multifaceted methods have been developed in an attempt to manage and analyze each attribute of large datasets. Among these methods, automation is regarded as key to managing large datasets. Shoaib Mufti postulated that big data sets are too large to comb through manually, so automation is key. The Allen Institute for Brain Science in Seattle, Washington, uses a template for brain-cell and genetics data that accepts information only in the correct format and type. When it is time to integrate those data into a larger database or collection, data-quality assurance steps are automated using Apache Spark and Apache HBase, two open-source tools, to validate and repair data in real time. This set of software tools is able to validate and ingest data and runs in the cloud, which allows end users to scale data easily. At the same institute, the Open Connectome Project also provides automated quality assurance, which generates visualizations of summary statistics that users can inspect before moving forward with their analyses. This scenario shows that automation is key to reducing information overload, and it increases the speed of real-time decision-making.
Researchers can use version-control systems to see how a file has changed over time and who made the modifications. However, some
systems restrict the file sizes one can use. According to Alyssa Goodman, an astrophysicist and data-visualization specialist at Harvard
University, Harvard Dataverse (which is open to all scholars) and Zenodo can be used for version control of big files. Another option
is Dat, a free peer-to-peer network for sharing and versioning files of any size. According to Andrew Osheroff, a primary software
developer at Dat in Copenhagen, the system keeps a tamper-proof history of all the operations you conduct on your file. According to Dat product manager Karissa McKelvey, users can direct the system to archive a copy of each version of a file. Dat is currently a command-line utility, but the team that developed it hopes to release a more user-friendly version.
Computational notebooks such as Jupyter Notebook combine code, results and text in a single document, and researchers can place such documents on or near the data servers to do remote analyses and explore the data. Because Jupyter Notebook is not particularly accessible to researchers who might be uncomfortable using a command line, more user-friendly platforms can bridge the gap, including Terra and Seven Bridges Genomics.
Big data applications can help companies make better business decisions by analyzing large volumes of data and discovering hidden patterns. These data sets might come from social media, data captured by sensors, website logs, and customer feedback. Organizations are spending huge amounts on big data applications to discover hidden patterns, unknown associations, market trends, consumer preferences, and other valuable business information.
There has been a significant improvement in the healthcare domain through personalized medicine and prescriptive analytics, thanks to the role of big data systems. Researchers analyze the data to determine the best treatment for a particular disease, the side effects of drugs, health-risk forecasts, and so on. Mobile health applications and wearable devices are causing the available data to grow at an exponential rate (Doyle-Lindrud S, 2015).
It is possible to predict a disease outbreak by mapping healthcare data and
geographical data. Once predicted, containment of the outbreak can be handled
and plans to eradicate the disease made. Among other benefits of EHRs is that
healthcare professionals have an improved access to the entire medical history of a
patient. The information includes medical diagnoses, prescriptions, data related to
known allergies, demographics, clinical narratives, and the results obtained from
various laboratory tests. The recognition and treatment of medical conditions is thus more time efficient due to a reduction in the lag time of previous test results. In such a setting, data warehouses store massive amounts of data generated from various sources, and this data is processed using analytic pipelines to obtain smarter and more affordable healthcare options.
The media and entertainment industries are creating, advertising, and distributing their content using new business
models. This is due to customer requirements to view digital content from any location and at any time. The introduction
of online TV shows, streaming channels such as Netflix, etc. is proving that customers are interested not only in watching TV but also in accessing content from any location. The media houses are targeting audiences by predicting what they would
like to see, how to target the ads, content monetization, etc. Big data systems are thus increasing the revenues of such
media houses by analyzing viewer patterns.
IoT devices generate continuous data and send them to a server on a daily basis. These data are mined to map the interconnectivity of devices. This mapping can be put to good use by government agencies and by a range of companies to increase their effectiveness. IoT is finding applications in smart irrigation systems, traffic management, crowd management, and more.
Predictive manufacturing can help increase efficiency by producing more goods while minimizing machine downtime. This requires massive quantities of data for such industries, and sophisticated forecasting tools follow an organized process to extract valuable information from these data. The following are some of the major advantages of employing big data applications in manufacturing industries: higher product quality, fault tracking, supply planning, output prediction, increased energy efficiency, testing and simulation of new manufacturing processes, and large-scale customization of manufacturing.
By adopting big data systems, the government can attain efficiencies in terms of cost, output, and novelty. Since
the same data set is used in many applications, many departments can work in association with each other.
Government plays an important role in innovation by acting in all these domains.
Big data applications can be applied in each and every field. Some of the major areas where big data finds
applications include: agriculture, aviation, cyber security and intelligence, crime prediction and prevention, e-
commerce, fake news detection, fraud detection, pharmaceutical drug evaluation, scientific research, weather
forecasting, and tax compliance.
The best logical approach for analyzing huge volumes of complex big data is to distribute and process it in
parallel on multiple nodes. However, the size of data is usually so large that thousands of computing machines are
required to distribute and finish processing in a reasonable amount of time. When working with hundreds or
thousands of nodes, one has to handle issues like how to parallelize the computation, distribute the data, and
handle failures. One of the most popular open-source distributed platforms for this purpose is Hadoop (Shvachko K, et al., 2010). Hadoop implements the MapReduce algorithm for processing and generating large datasets.
MapReduce uses map and reduce primitives to map each logical record in the input into a set of intermediate key/value pairs, and a reduce operation combines all the values that share the same key (Dean J, Ghemawat S, 2008). It efficiently parallelizes the computation, handles failures, and schedules inter-machine communication across large-scale clusters of machines. The Hadoop Distributed File System (HDFS) is the file-system component that provides scalable, efficient, and replica-based storage of data at the various nodes that form part of a cluster (Shvachko K, et al., 2010). Hadoop has other tools that enhance the storage and processing components; therefore, many large companies like Yahoo, Facebook, and others have rapidly adopted it. Hadoop has enabled researchers to use data sets otherwise impossible
to handle. Many large projects, like the determination of a correlation between the air quality
data and asthma admissions, drug development using genomic and proteomic data, and other
such aspects of healthcare are implementing Hadoop. Therefore, with the implementation of the Hadoop system, healthcare analytics will not be held back.
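To make the map and reduce primitives concrete, the sketch below is a minimal single-machine simulation of the MapReduce word-count pattern; the input lines are invented, and on a real Hadoop cluster the same phases are distributed across many nodes and HDFS blocks. Example (Python):

# Minimal single-machine simulation of the MapReduce word-count pattern
# (assumptions: two hypothetical input lines; no cluster required).
from collections import defaultdict

lines = [
    "big data needs distributed processing",
    "hadoop processes big data with mapreduce",
]

# Map: emit intermediate (key, value) pairs from each logical record.
intermediate = []
for line in lines:
    for word in line.split():
        intermediate.append((word, 1))

# Shuffle: group all values that share the same key.
grouped = defaultdict(list)
for key, value in intermediate:
    grouped[key].append(value)

# Reduce: combine the values for each key into a final result.
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)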
Apache Spark is another open source alternative to Hadoop. It is a unified engine for distributed data
processing that includes higher-level libraries for supporting SQL queries (Spark SQL), streaming
data (Spark Streaming), machine learning (MLlib) and graph processing (GraphX) Zaharia M, et al
(2016). These libraries help increase developer productivity because the programming interface requires less coding effort, and they can be seamlessly combined to create many types of complex computations. By implementing Resilient Distributed Datasets (RDDs), Spark supports in-memory processing that can make it about 100× faster than Hadoop in multi-pass analytics on smaller datasets (Gopalani S, Arora R, 2015). This is especially true when the data size is smaller than the available memory (Saouabi M, Ezzati A, 2017), which indicates that processing really big data with Apache Spark would require a large amount of memory. Since the cost of memory is higher than that of hard drives, MapReduce is expected to be more cost-effective than Apache Spark for large datasets. Similarly, Apache Storm was developed to provide a real-time framework for data stream processing. This platform supports most programming languages and offers good horizontal scalability and built-in fault-tolerance for big data analysis.
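A brief sketch of Spark's higher-level interfaces is shown below, combining the DataFrame API and Spark SQL; it assumes a local Spark installation, and the in-line sample data stands in for a real distributed data set. Example (Python):

# Illustrative PySpark sketch: DataFrame operations plus Spark SQL
# (assumptions: PySpark installed locally; tiny hypothetical event data).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", "click", 1), ("bob", "purchase", 99), ("alice", "purchase", 49)],
    ["user", "action", "value"],
)

# DataFrame API: filter and aggregate; Spark parallelises this over RDDs.
purchases = df.filter(F.col("action") == "purchase")
purchases.groupBy("user").agg(F.sum("value").alias("total")).show()

# Spark SQL over the same data via a temporary view.
df.createOrReplaceTempView("events")
spark.sql("SELECT action, COUNT(*) AS n FROM events GROUP BY action").show()

spark.stop()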
In order to tackle big data challenges and perform smoother analytics, various companies
have implemented AI to analyze published results, textual data, and image data to obtain
meaningful outcomes. IBM Corporation is one of the biggest and experienced players in
this sector to provide healthcare analytics services commercially. IBM’s Watson Health is
an AI platform to share and analyze health data among hospitals, providers and
researchers. Similarly, Flatiron Health provides technology-oriented services in healthcare analytics, especially focused on cancer research. Other big companies such as Oracle Corporation and Google Inc. are also focusing on developing cloud-based storage and distributed computing power platforms. Interestingly, in recent years, several
companies and start-ups have also emerged to provide health care-based analytics and
solutions.
Quantum computing is picking up and seems to be a potential solution for big data analysis. For
example, identification of rare events, such as the production of Higgs bosons at the Large Hadron
Collider (LHC) can now be performed using quantum approaches (Mott A, et al, 2017). At the LHC, huge amounts of collision data (1 PB/s) are generated that need to be filtered and analyzed. One such
approach, the quantum annealing for ML (QAML) that implements a combination of ML and quantum
computing with a programmable quantum annealer, helps reduce human intervention and increase the
accuracy of assessing particle-collision data. In another example, the quantum support vector machine
was implemented for both training and classification stages to classify new data Rebentrost P, Mohseni
M, Lloyd S (2014). Such quantum approaches could find applications in many areas of science Mott A,
et al (2017). Indeed, recurrent quantum neural network (RQNN) was implemented to increase signal
separability in electroencephalogram (EEG) signals Gandhi V, et al (2014). Similarly, quantum
annealing was applied to intensity modulated radiotherapy (IMRT) beamlet intensity optimization
Nazareth DP, Spaans JD (2015). Similarly, there exist more applications of quantum approaches
regarding healthcare e.g. quantum sensors and quantum microscopes Reardon S (2017).
Step 1: Conceptualizing the Product
Before jumping in, organizations must identify an information product that meets a need from the marketplace. This
introductory step needs to take place before data acquisition. It requires conceptualizing the information product, along with
identifying the required data resources. The process involves product definition, data investigation (which should include sourcing data creatively), and establishing the framework necessary to produce a prototype. Once this set of requirements is
met, the remaining steps of the development process can be carried out more efficiently. For example, once managers know
the data elements that will go into the product, storage and retrieval can be streamlined.
An interesting example is CarMD.com Corp., an Irvine, California-based company that provides services that leverage
automotive diagnostic information. The original idea was to provide diagnostic capabilities that led consumers to auto repair
estimates and potential service providers. One of the company’s products compares the data extracted from onboard
computers in cars against online auto repair databases and offers consumers information on auto maintenance.
Step 2: Data Acquisition
Once the conceptual model has been worked out, data acquisition can be pursued in a more efficient
manner. Organizations tend to acquire or accumulate data that corresponds to their functional
activities. However, given the vast amounts of data being generated by information devices and the
data available from public sources, the acquisition process needs to connect the requirements of the
conceptual model to data that will create the product. In addition to acquiring structured data (for
example, customer purchase records), companies should also consider using unstructured sources
(for instance, social media comments) that might be able to add value. Companies should be
prepared to look within and outside their own systems for such data.
Step 3: Refinement
Although Meyer and Zack’s data refinement process remains quite relevant, it has to be augmented to facilitate new data
sources and to take advantage of advanced analytic methods. The original model talked about the importance of being
able to “glean further meaning from combinations of individual [data] elements.” Today, much data refining is achieved
with automated tools. Real-time machine learning and algorithmic processing of data elements can categorize, correlate,
personalize, profile, and search data quickly to create meaningful models that have significant value for consumers.
For example, Passur Aerospace Inc., based in Stamford, Connecticut, uses both its own data and public data to develop
scheduling information for airlines and travelers. Drawing on publicly available data on weather, flight schedules, and
other factors, along with its own internal data based on radar statistic feeds, it generates flight arrival estimates. Applying
advanced analytics, Passur’s arrival estimates outperform ones based on traditional techniques.
Step 4: Storage and Retrieval
Storage and retrieval are as important as ever. However, retrieval in today’s environment must incorporate
advancements in query and search processing capabilities (for instance, making use of algorithms) that can access
more granular levels of data. Traditional storage techniques need to be augmented by new technologies such as
map reduction (a software framework for distributed processing of large data sets on computer clusters of
commodity hardware) and parallel processing capabilities to manage larger and faster-moving data sources. Many
organizations store data in relatively unstructured formats when they initially capture it, refining it over time.
Data storage, retrieval, and processing are increasingly taking place in the cloud rather than on a company’s
premises. This not only provides companies with flexibility in their technology infrastructures but also can make
it easier for them to combine internal and external data.
Step 5: Distribution
The distribution options for information products have shifted dramatically from the earlier menu of possibilities, some of which (such
as fax and CD-ROM) have been superseded by the Web. Timing and frequency remain critical aspects of distribution; data products
must be continuously available and updated in near-real time. In the digital economy, online media (such as websites and portals) fully
address the required level of continuous accessibility to information products. However, Web access via traditional computers is quickly
being overtaken by mobile access via smartphones, tablets, and apps. As a result, providers of information products that are distributed
via mobile devices need to revamp their content formats and design.
At the same time, distributing data products through the cloud adds a new dimension to the question of how frequently information
needs to be updated for users. Consider, for example, a business-to-business case involving a shipping service provider that offers
information products including en-route metrics, such as estimates of time to delivery. Assuming the data is available, the frequency and
timeliness of such information — generated through GPS traffic information, location data, and analytics — can be close to real time.
Step 6: Presentation
In the original Meyer-Zack model, information products gained value from the context of their
use. The user interface mattered — and the easier products were to use, the more valuable they
were. Although the digital economy places heavier emphasis on analytics than on simple data
provision, there are some important constants. While standard reporting (that is, simple
information products) continues to meet the needs of many consumers, more advanced analytics-
based products such as forecasts, predictions, and probabilities (such as real-time calculations
generated through machine learning) can lead to differentiation and competitive advantage.
Step 7: Market Feedback
The competitive nature of the information product space, availability of new data sources, and demand for timely
decision support require an ongoing emphasis on innovation and on monitoring product usage. Adding this step at this
stage of the analytics-based data product development process is consistent with the iterative nature of product
development in a “lean startup” context. Once again, the evolution of new technologies has provided a mechanism for
facilitating a feedback and information extraction process from the marketplace. New forms of market research are
capable of leveraging social media platforms (for example, business Facebook pages) to listen to the marketplace.
Interactive blogs and flash surveys can be utilized to assess customer perceptions of existing information products. New
features of online information products can be tested in a matter of hours with A/B or multivariate online testing
approaches. Both user correspondence and digital metrics on product use (for instance, views, clicks, downloads, and
bounces) can be analyzed to enhance products continuously.
A Structured Approach to Stakeholder Involvement
In order to achieve effective results from the implementation of the product development model, stakeholder
involvement is essential. Having particular types of input at different stages of the product development process is
important. Therefore, companies need to develop some degree of structure for stakeholder input.
During the stage when the product is being conceptualized, it’s important to have involvement from three specific
groups: subject matter experts at the business level (who can help determine the feasibility of the product design);
managers of existing and complementary information products (who can help companies avoid cannibalization and
duplication); and marketing people (who can help assess the nature and scale of consumer demand). These
individuals can assist in providing the framework for designing or upgrading existing products to add value to meet
market needs.
For the data acquisition and the storage and refinement stages, stakeholder involvement should expand to
include legal representatives, who can speak to data ownership, privacy, and use issues; IT personnel, who can
provide input on hardware and software requirements for data products and also help in developing and
improving the functionality of the product; and data managers and analytics and data scientists to assist in
product platform execution. It is critical to involve analytics and data science professionals to help in
structuring and analyzing data.
For the distribution and presentation stages, the stakeholders should again include marketing people (who can
help sort out consumer/user needs for the initial product launch and subsequent product releases) and IT
personnel (who can deal with hardware and software issues in product functionality during the product rollout).
Conclusion
Tool selection is a necessary first step. Often, the choice of tools is decided well in advance of the
specific project of interest. Organizations make the decision to use SAS, SPSS, R, or even Excel for all
their data analysis needs. Since specific applications of those tools are not all known in advance, the
choice is made for one-size-fits-all needs. If given the choice or flexibility to choose other tools, think
carefully about the capabilities needed. If running R, is there concern about running out of memory given
the size of data? Can these concerns be addressed with a better server, more memory, or other tools such
as Hadoop? Even tools such as SAS come with practical concerns that should be carefully considered.
Implementation of artificial intelligence (AI) algorithms and novel fusion algorithms
would be necessary to make sense from this large amount of data. Indeed, it would be a
great feat to achieve automated decision-making by the implementation of machine
learning (ML) methods like neural networks and other AI techniques. However, in the absence of appropriate software and hardware support, big data can be quite hazy. We need to
develop better techniques to handle this ‘endless sea’ of data and smart web applications
for efficient analysis to gain workable insights. With proper storage and analytical tools in
hand, the information and insights derived from big data can make the critical social
infrastructure components and services (like healthcare, safety or transportation) more
aware, interactive and efficient. In addition, visualization of big data in a user-friendly
manner will be a critical factor for societal development.
References
Laney D. 3D data management: controlling data volume, velocity, and variety, Application delivery strategies.
Stamford: META Group Inc; 2001
Mauro AD, Greco M, Grimaldi M. A formal definition of big data based on its essential features. Libr Rev.
2016;65(3):122–35.
Rebentrost P, Mohseni M, Lloyd S. Quantum support vector machine for big data classification. Phys Rev Lett.
2014;113(13):130503.
SparkSeq is an efficient and cloud-ready platform based on the Apache Spark framework and the Hadoop library that is used for interactive analysis of genomic data with nucleotide precision.
SAMQA identifies errors and ensures the quality of large-scale genomic data. This tool was
originally built for the National Institutes of Health Cancer Genome Atlas project to identify
and report errors including sequence alignment/map [SAM] format error and empty reads.
ART can simulate profiles of read errors and read lengths for data obtained using high-throughput sequencing platforms.
DistMap is another toolkit used for distributed short-read mapping based on Hadoop cluster that aims to cover a
wider range of sequencing applications. For instance, one of its applications namely the BWA mapper can perform
500 million read pairs in about 6 h, approximately 13 times faster than a conventional single-node mapper.
SeqWare is a query engine based on the Apache HBase database system that enables access to large-scale whole-genome datasets by integrating genome browsers and tools.
CloudBurst is a parallel computing model utilized in genome mapping experiments to improve the scalability of
reading large sequencing data.
Hydra uses the Hadoop-distributed computing framework for processing large peptide
and spectra databases for proteomics datasets. This specific tool is capable of performing
27 billion peptide scorings in less than 60 min on a Hadoop cluster.