Unit I - BDA
UNIT-I
Big data:
What is Big Data?
Big Data is a collection of large datasets that cannot be processed using traditional computing
techniques. For example, the volume of data that Facebook or YouTube needs to collect and manage on
a daily basis can fall under the category of Big Data. However, Big Data is not only about scale and
volume; it also involves one or more of the following aspects: Velocity, Variety, Volume, and
Complexity.
1) Variety
Variety of Big Data refers to structured, unstructured, and semi-structured data that is gathered from multiple
sources. While in the past data could only be collected from spreadsheets and databases, today data comes
in an array of forms such as emails, PDFs, photos, videos, audio, social media posts, and much more.
2) Velocity
Velocity essentially refers to the speed at which data is being created in real time. In a broader perspective, it
comprises the rate of change, the linking of incoming data sets arriving at varying speeds, and activity bursts.
3) Volume
We already know that Big Data indicates huge ‘volumes’ of data being generated on a daily basis
from various sources such as social media platforms, business processes, machines, networks, human
interactions, etc. Such a large amount of data is stored in data warehouses.
Structured
By structured data, we mean data that can be processed, stored, and retrieved in a fixed format. It
refers to highly organized information that can be readily and seamlessly stored and accessed from a
database by simple search engine algorithms. For instance, the employee table in a company database
is structured: the employee details, their job positions, their salaries, and so on are all present in an
organized manner.
Unstructured
Unstructured data refers to the data that lacks any specific form or structure whatsoever. This makes
it very difficult and time-consuming to process and analyze unstructured data. Email is an example of
unstructured data.
Semi-structured
Semi-structured data pertains to data containing both of the formats mentioned above, that is,
structured and unstructured data. To be precise, it refers to data that, although not classified
under a particular repository (database), nevertheless contains vital information or tags that segregate
individual elements within the data.
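To make the distinction concrete, here is a minimal, hypothetical illustration (in Python) of the same employee information expressed as structured, semi-structured, and unstructured data; all names and values are made up.

# Structured: a fixed-format record, like one row of an employee table.
structured_row = ("E101", "Asha Rao", "Data Engineer", 65000)

# Semi-structured: not stored in a relational table, but keys (tags)
# still segregate the individual elements within the data.
semi_structured_doc = {
    "employee_id": "E101",
    "name": "Asha Rao",
    "skills": ["Python", "Hadoop"],            # variable-length, nested field
    "contact": {"email": "asha@example.com"},  # nested sub-document
}

# Unstructured: free text with no predefined fields at all.
unstructured_note = "Asha joined in June and has been mentoring the new analytics hires."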
The process of extracting and analyzing data from extensive big data sources is complex
and can be frustrating and time-consuming. These complications can be resolved if organizations
encompass all the necessary considerations of big data, take into account relevant data sources, and deploy
them in a manner which is well tuned to their organizational goals.
Before the modern day ubiquity of online and mobile applications, databases processed
straightforward, structured data. Data models were relatively simple and described a set of relationships
between different data types in the database.
Unstructured data, in contrast, refers to data that doesn’t fit neatly into the traditional row and
column structure of relational databases. Examples of unstructured data include: emails, videos, audio files,
web pages, and social media messages. In today’s world of Big Data, most of the data that is created is
unstructured with some estimates of it being more than 95% of all data generated.
As a result, enterprises are looking to this new generation of databases, known as NoSQL, to address
unstructured data. MongoDB stands as a leader in this movement with over 10 million downloads and
hundreds of thousands of deployments. As a document database with flexible schema, MongoDB was built
specifically to handle unstructured data. MongoDB’s flexible data model allows development without a
predefined schema, which is particularly valuable when most of the data in your system is unstructured.
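As a rough sketch of what that flexible data model looks like in practice, the Python snippet below uses PyMongo to store documents with different fields in a single collection. It assumes a MongoDB server running locally and the pymongo package installed; the database, collection, and field names are hypothetical.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client["demo_db"]["messages"]

# No schema is declared up front: documents in the same collection can carry
# different fields, which suits unstructured or rapidly evolving data.
collection.insert_one({"type": "email", "subject": "Invoice", "body": "Please see the attachment."})
collection.insert_one({"type": "social_post", "text": "Loving the new release!", "likes": 42})

# Query by whatever fields a given kind of document happens to have.
for doc in collection.find({"type": "email"}):
    print(doc["subject"])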
With the 1880 census, just counting people was not enough information for the U.S. government to
work with—particular elements, such as age, sex, occupation, education level, and even the “number of
insane people in household,” had to be accounted for. That information had intrinsic value to the process,
but only if it could be tallied, tabulated, analyzed, and presented. New methods of relating the data to other
data collected came into being, such as associating occupations with geographic areas, birth rates with
education levels, and countries of origin with skill sets.
The 1880 census truly yielded a mountain of data to deal with, yet only severely limited technology
was available to do any of the analytics. The problem of Big Data could not be solved for the 1880 census,
so it took over seven years to manually tabulate and report on the data.
Data Growth
Data growing at such a quick rate makes it a challenge to find insights from it. More and more data is
generated every second, and the data that is actually relevant and useful has to be picked out for
further analysis.
Storage
Such a large amount of data is difficult for organizations to store and manage without appropriate tools and
technologies.
Syncing Across Data Sources
When organisations import data from different sources, the data from one source might not
be as up to date as the data from another source.
Security
Huge amounts of data in organisations can easily become a target for advanced persistent threats, so here lies
another challenge for organisations: keeping their data secure through proper authentication, data encryption, etc.
Unreliable Data
Big data cannot be guaranteed to be 100 percent accurate. It might contain redundant or incomplete
data, along with contradictions.
Miscellaneous Challenges
These are some other challenges that come forward while dealing with big data, like the integration of
data, skill and talent availability, solution expenses and processing a large amount of data in time and
with accuracy so that the data is available for data consumers whenever they need it.
Basis of Comparison: Small Data vs Big Data

Definition
Small Data: Data that is ‘small’ enough for human comprehension, in a volume and format that makes it accessible, informative and actionable.
Big Data: Data sets that are so large or complex that traditional data processing applications cannot deal with them.

Data Source
Small Data: Data from traditional enterprise systems such as Enterprise Resource Planning (ERP) and Customer Relationship Management (CRM), financial data like general ledger data, and payment transaction data from websites.
Big Data: Purchase data from point-of-sale systems, clickstream data from websites, GPS stream data (mobility data sent to a server), and social media data (Facebook, Twitter).

Velocity
Small Data: Controlled and steady data flow; data accumulation is slow.
Big Data: Data can arrive at very fast speeds; enormous data can accumulate within very short periods of time.

Time Variance
Small Data: Historical data is equally valid, as it represents solid business interactions.
Big Data: In some cases, data gets old soon (e.g. fraud detection).
Investments made in an EPM (Enterprise Process Management) implementation can be very expensive, so it
is imperative that every asset is leveraged. Youngsoft’s team of experienced professionals provides a wide
range of services across the suite of EPM products, consistently delivering solutions that reduce costs,
increase profits, and improve overall efficiency.
Data Science is the combination of statistics, mathematics, programming, problem-solving, capturing data in
ingenious ways, the ability to look at things differently, and the activity of cleansing, preparing, and aligning
the data.
In simple terms, it is the umbrella of techniques used when trying to extract insights and information from
data.
Applications of Data Science
Internet Search
Search engines make use of data science algorithms to deliver the best results for search queries in a
fraction of a second.
Digital Advertisements
The entire digital marketing spectrum uses data science algorithms, from display banners to
digital billboards. This is the main reason digital ads get a higher click-through rate (CTR) than traditional
advertisements.
Recommender Systems
Recommender systems not only make it easy to find relevant products among the billions of products
available but also add a lot to the user experience. A lot of companies use these systems to promote their
products and make suggestions in accordance with the user’s demands and the relevance of the information. The
recommendations are based on the user’s previous search results, as sketched below.
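One simple way such history-based suggestions can work is to recommend products that are frequently viewed together with the ones a user has already looked at. The Python sketch below illustrates this idea with made-up interaction data; it is only an illustration, not the method any particular company uses.

from collections import Counter
from itertools import combinations

# Toy interaction history: each inner list is one user's viewed products.
histories = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["laptop", "mouse"],
    ["phone", "case"],
]

# Count how often each pair of products is viewed together.
co_occurrence = Counter()
for items in histories:
    for a, b in combinations(sorted(set(items)), 2):
        co_occurrence[(a, b)] += 1

def recommend(viewed, top_n=2):
    """Score candidate products by how often they co-occur with viewed ones."""
    scores = Counter()
    for (a, b), count in co_occurrence.items():
        if a in viewed and b not in viewed:
            scores[b] += count
        elif b in viewed and a not in viewed:
            scores[a] += count
    return [item for item, _ in scores.most_common(top_n)]

print(recommend({"phone"}))  # e.g. ['case', 'charger']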
Big data analytics helps organizations harness their data and use it to identify new opportunities. That, in
turn, leads to smarter business moves, more efficient operations, higher profits and happier customers. In his
report Big Data in Big Companies, IIA Director of Research Tom Davenport interviewed more than 50
businesses to understand how they used big data. He found they got value in the following ways:
1. Cost reduction. Big data technologies such as Hadoop and cloud-based analytics bring significant
cost advantages when it comes to storing large amounts of data – plus they can identify more
efficient ways of doing business.
2. Faster, better decision making. With the speed of Hadoop and in-memory analytics, combined
with the ability to analyze new sources of data, businesses are able to analyze information
immediately – and make decisions based on what they’ve learned.
3. New products and services. With the ability to gauge customer needs and satisfaction through
analytics comes the power to give customers what they want. Davenport points out that with big data
analytics, more companies are creating new products to meet customers’ needs.
Types of data analytics
There are 4 different types of analytics.
Descriptive analytics
Descriptive analytics answers the question of what happened. Let us bring an example from ScienceSoft’s
practice: having analyzed monthly revenue and income per product group, and the total quantity of metal
parts produced per month, a manufacturer was able to answer a series of ‘what happened’ questions and
decide on focus product categories.
Descriptive analytics juggles raw data from multiple data sources to give valuable insights into the past.
However, these findings simply signal that something is wrong or right, without explaining why. For this
reason, our data consultants don’t recommend that highly data-driven companies settle for descriptive
analytics only; they would rather combine it with other types of data analytics.
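As a minimal illustration of descriptive analytics, the pandas snippet below summarises made-up sales records into monthly revenue per product group, answering a ‘what happened’ question; the figures are purely illustrative.

import pandas as pd

sales = pd.DataFrame({
    "month":         ["2023-01", "2023-01", "2023-02", "2023-02"],
    "product_group": ["bolts",   "gears",   "bolts",   "gears"],
    "revenue":       [12000,     30000,     15000,     28000],
})

# Aggregate raw records into a summary of what happened in the past.
summary = sales.groupby(["month", "product_group"])["revenue"].sum().unstack()
print(summary)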
Diagnostic analytics
At this stage, historical data can be measured against other data to answer the question of why something
happened. For example, you can check ScienceSoft’s BI demo to see how a retailer can drill the sales and
gross profit down to categories to find out why they missed their net profit target. Another flashback to our
data analytics projects: in the healthcare industry, customer segmentation coupled with several filters
(like diagnoses and prescribed medications) made it possible to identify the influence of medications.
Diagnostic analytics gives in-depth insights into a particular problem. At the same time, a company should
have detailed information at its disposal; otherwise, data collection may turn out to be time-consuming and
have to be done individually for every issue.
Predictive analytics
Predictive analytics tells what is likely to happen. It uses the findings of descriptive and diagnostic analytics
to detect clusters and exceptions, and to predict future trends, which makes it a valuable tool for forecasting.
Check ScienceSoft’s case study to get details on how advanced data analytics allowed a leading FMCG
company to predict what they could expect after changing brand positioning.
Predictive analytics belongs to the advanced analytics types and brings many advantages, such as sophisticated
analysis based on machine or deep learning and the proactive approach that predictions enable. However, our
data consultants state it clearly: a forecast is just an estimate whose accuracy highly depends on data
quality and the stability of the situation, so it requires careful treatment and continuous optimization.
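To give a flavour of forecasting, here is a deliberately simple Python sketch that fits a linear trend to twelve months of hypothetical sales and extrapolates one month ahead; real predictive analytics would typically use richer features and machine-learning or deep-learning models, and, as noted above, the result is only an estimate.

import numpy as np
from sklearn.linear_model import LinearRegression

months = np.arange(1, 13).reshape(-1, 1)   # months 1..12 as the single feature
sales = 100 + 5 * months.ravel() + np.random.default_rng(0).normal(0, 3, size=12)

model = LinearRegression().fit(months, sales)
forecast = model.predict([[13]])           # extrapolate to month 13
print(f"Forecast for next month: {forecast[0]:.1f}")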
Prescriptive analytics
The purpose of prescriptive analytics is to literally prescribe what action to take to eliminate a future
problem or take full advantage of a promising trend. An example of prescriptive analytics from our project
portfolio: a multinational company was able to identify opportunities for repeat purchases based on
customer analytics and sales history.
Prescriptive analytics uses advanced tools and technologies, like machine learning, business rules and
algorithms, which makes it sophisticated to implement and manage. Besides, this state-of-the-art type of
data analytics requires not only historical internal data but also external information due to the nature of
algorithms it’s based on. That is why, before deciding to adopt prescriptive analytics, ScienceSoft strongly
recommends weighing the required efforts against the expected added value.
As data sets are becoming bigger and more diverse, there is a big challenge to incorporate them into an
analytical platform. If this is overlooked, it will create gaps and lead to wrong messages and insights.
The analysis of data is important to make the voluminous amount of data being produced every minute
useful. With the exponential rise of data, a huge demand for Big Data scientists and Big Data analysts has
been created in the market. It is important for business organizations to hire data scientists with varied
skills, as the job of a data scientist is multidisciplinary. Another major challenge faced by businesses is
the shortage of professionals who understand Big Data analysis. There is a sharp shortage of data scientists
in comparison to the massive amount of data being produced.
It is imperative for business organizations to gain important insights from Big Data analytics, and it is also
important that only the relevant departments have access to this information. A big challenge faced by
companies in Big Data analytics is bridging this wide gap in an effective manner.
It is hardly surprising that data is growing with every passing day. This simply indicates that business
organizations need to handle a large amount of data on a daily basis. The amount and variety of data available
these days can overwhelm any data engineer, which is why it is considered vital to make data accessibility
easy and convenient for brand owners and managers.
With the rise of Big Data, new technologies and companies are being developed every day. However, a big
challenge faced by companies in Big Data analytics is to find out which technology will be best
suited to them without introducing new problems and potential risks.
Business organizations are growing at a rapid pace. As companies and large business organizations grow,
the amount of data they produce also increases. Storing this massive amount of data is
becoming a real challenge for everyone. Popular data storage options like data lakes and warehouses are
commonly used to gather and store large quantities of unstructured and structured data in their native format.
The real problem arises when a data lake or warehouse tries to combine unstructured and inconsistent data
from diverse sources: it encounters errors. Missing data, inconsistent data, logic conflicts, and duplicate
data all result in data quality challenges.
Once business enterprises discover how to use Big Data, it brings them a wide range of possibilities and
opportunities. However, it also involves potential risks when it comes to the
privacy and the security of the data. The Big Data tools used for analysis and storage utilize data from
disparate sources. This eventually leads to a high risk of exposure of the data, making it vulnerable. Thus,
the rise of voluminous amounts of data increases privacy and security concerns.
The Importance of Big Data Analytics
Driven by specialized analytics systems and software, as well as high-powered computing systems, big data
analytics offers various business benefits, including:
Big data analytics applications enable big data analysts, data scientists, predictive modelers, statisticians and
other analytics professionals to analyze growing volumes of structured transaction data, plus other forms of
data that are often left untapped by conventional BI and analytics programs. This encompasses a mix of
semi-structured and unstructured data -- for example, internet clickstream data, web server logs, social
media content, text from customer emails and survey responses, mobile phone records, and machine data
captured by sensors connected to the internet of things (IoT).
We will now discuss the terminology related to the Big Data ecosystem. This will give you a complete understanding
of Big Data and its terms.
Over time, Hadoop has become the nucleus of the Big Data ecosystem, where many new technologies have
emerged and been integrated with Hadoop. So it is important that we first understand and appreciate the
nucleus of modern Big Data architecture.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data
sets across clusters of computers, using simple programming models. It is designed to scale up from single
servers to thousands of machines, each offering local computation and storage.
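As a small taste of that programming model, the sketch below is the classic word count written in Python for Hadoop Streaming, which runs ordinary programs over distributed data via stdin/stdout; the file name and the submission command are illustrative and depend on the local installation.

# wordcount.py: run with "map" as the argument for the map phase and
# "reduce" for the reduce phase.
import sys

def mapper():
    # Emit "word<TAB>1" for every word in this node's input split.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so all counts for a word arrive together.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

A job could then be submitted with Hadoop Streaming along the lines of:
hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper "python3 wordcount.py map" -reducer "python3 wordcount.py reduce"
(the exact jar path varies by installation).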