Unit 1 - DSA

Big data refers to huge amounts of data that are difficult to process manually. Data science uses big data to build predictive models through data analysis and machine learning. It involves collecting, cleaning, analyzing and applying data to solve problems. The data analytics lifecycle includes 6 phases - discovery, data preparation, model planning, model building, communicating results, and operationalizing models. Each phase involves specific tasks like defining objectives, loading and exploring data, selecting algorithms, developing and testing models, and implementing results. The overall process aims to gain insights from data to make informed decisions.


BIG DATA VS DATA SCIENCE:

Big Data: Big data refers to the huge, voluminous data, information, and statistics acquired by large organizations and ventures. Because big data is too large to process manually, specialized software and data storage systems are created to handle it. It is used to discover patterns and trends and to make decisions about human behavior and interaction with technology.

Data Science: Data Science is a field or domain that involves working with huge amounts of data and using them to build descriptive, predictive, and prescriptive analytical models. It is about capturing data, analyzing it (building and validating models), and utilizing it (deploying the best model). It sits at the intersection of data and computing, blending Computer Science, Business, and Statistics.

Below are the main differences between Data Science and Big Data:

1. Data Science is an area of study; Big Data is a technique to collect, maintain, and process huge volumes of information.
2. Data Science is about the collection, processing, analysis, and utilization of data in various operations; Big Data is about extracting vital and valuable information from huge amounts of data.
3. Data Science is a field of study, like Computer Science, Applied Statistics, or Applied Mathematics; Big Data is a technique for tracking and discovering trends in complex data sets.
4. The goal of Data Science is to build data-dominant products for a venture; the goal of Big Data is to make data more vital and usable by extracting only the important information.
5. Tools mainly used in Data Science include SAS, R, Python, etc.; tools mostly used in Big Data include Hadoop, Spark, Flink, etc.
6. Data Science is a superset of Big Data, since it also covers data scraping, cleaning, and visualization; Big Data is a subset of Data Science, covering the mining activities in the Data Science pipeline.
7. Data Science is mainly used for scientific purposes; Big Data is mainly used for business purposes and customer satisfaction.
8. Data Science broadly focuses on the science of the data; Big Data is more involved with the processes of handling voluminous data.
DATA ANALYTICS LIFE CYCLE OVERVIEW:

Phase 1 - Discovery: In Phase 1, the team learns the business domain, including relevant history, such as whether the organization or business unit has attempted similar projects in the past from which it can learn. The team assesses the resources available to support the project in terms of people, technology, time, and data. Important activities in this phase include framing the business problem as an analytics challenge that can be addressed in subsequent phases and formulating initial hypotheses (IHs) to test and begin learning the data.

Phase 2 - Data preparation: Phase 2 requires the presence of an analytic sandbox, in which the team can work with data and perform analytics for the duration of the project. The team needs to execute extract, load, and transform (ELT) or extract, transform, and load (ETL) to get data into the sandbox. The combination of ELT and ETL is sometimes abbreviated as ETLT. Data should be transformed in the ETLT process so the team can work with it and analyze it. In this phase, the team also needs to familiarize itself with the data thoroughly and take steps to condition the data.

Phase 3 - Model planning: Phase 3 is model planning, where the team determines the methods,
techniques, and workflow it intends to follow for the subsequent model building phase. The team
explores the data to learn about the relationships between variables and subsequently selects key
variables and the most suitable models.

Phase 4 - Model building: In Phase 4, the team develops data sets for testing, training, and
production purposes. In addition, in this phase the team builds and executes models based on the
work done in the model planning phase. The team also considers whether its existing tools will
suffice for running the models, or if it will need a more robust environment for executing models
and work flows (for example, fast hardware and parallel processing, if applicable).

Phase 5 - Communicate results: In Phase 5, the team, in collaboration with major stakeholders, determines whether the results of the project are a success or a failure based on the criteria developed
in Phase 1. The team should identify key findings, quantify the business value, and develop a
narrative to summarize and convey findings to stakeholders.

Phase 6 - Operationalize: In Phase 6, the team delivers final reports, briefings, code, and
technical documents. In addition, the team may run a pilot project to implement the models in a
production environment.
Fig: Overview of Data Analytics Lifecycle

PHASES OF DATA ANALYTICS LIFE CYCLE

Data Discovery
This is the initial phase to set your project's objectives and find ways to achieve a
complete data analytics lifecycle. Start with defining your business domain and ensure
you have enough resources (time, technology, data, and people) to achieve your goals.
The biggest challenge in this phase is to accumulate enough information. You need to
draft an analytic plan, which requires some serious leg work.
 Accumulate resources
First, you have to analyze the models you have intended to develop. Then determine how
much domain knowledge you need to acquire for fulfilling those models.
The next important thing to do is assess whether you have enough skills and resources to
bring your projects to fruition.
 Frame the issue
Problems are most likely to occur while meeting your client's expectations. Therefore,
you need to identify the issues related to the project and explain them to your clients.
This process is called "framing." You have to prepare a problem statement explaining the
current situation and challenges that can occur in the future. You also need to define the
project's objective, including the success and failure criteria for the project.
 Formulate initial hypothesis
Once you gather all the clients' requirements, you have to develop initial hypotheses after
exploring the initial data.
Data Preparation and Processing
The Data preparation and processing phase involves collecting, processing, and
conditioning data before moving to the model building process.
 Identify data sources
You have to identify various data sources and analyze how much and what kind of data
you can accumulate within a given timeframe. Evaluate the data structures, explore their
attributes and acquire all the tools needed.
Collection of data

You can collect data using three methods:


 Data acquisition: You can collect data through external sources.
 Data Entry: You can prepare data points through digital systems or manual entry as
well.
 Signal reception: You can accumulate data from digital devices such as IoT devices and
control systems.

Model Planning
This is a phase where you have to analyze the quality of data and find a suitable model
for your project.
 Loading Data in Analytics Sandbox
An analytics sandbox is a part of data lake architecture that allows you to store and process
large amounts of data. It can efficiently process a large range of data such as big data,
transactional data, social media data, web data, and many more. It is an environment that
allows your analysts to schedule and process data assets using the data tools of their choice.
The best part of the analytics sandbox is its agility. It empowers analysts to process data in
real-time and get essential information within a short duration.
Data are loaded in the sandbox in three ways:
 ETL − Team specialists make the data comply with the business rules before loading it
in the sandbox.
 ELT − The data is loaded into the sandbox and then transformed as per the business rules.
 ETLT − It comprises two levels of data transformation, including ETL and ELT both.
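The ETL and ELT routes above can be sketched in miniature. This is an illustrative pure-Python toy, not any specific tool's API: the "sandbox" is just a list, and the business rule (keep only records with a positive amount) is invented for the example.

```python
# Minimal sketch contrasting ETL and ELT. The sandbox is a plain list;
# the business rule is a hypothetical one: keep only positive amounts.

def transform(records):
    """Apply the business rule: drop records with non-positive amounts."""
    return [r for r in records if r["amount"] > 0]

def etl(source, sandbox):
    """ETL: transform first, then load the conforming data."""
    sandbox.extend(transform(source))

def elt(source, sandbox):
    """ELT: load the raw data first, then transform inside the sandbox."""
    sandbox.extend(source)
    sandbox[:] = transform(sandbox)

raw = [{"id": 1, "amount": 250}, {"id": 2, "amount": -40}, {"id": 3, "amount": 90}]

etl_box, elt_box = [], []
etl(raw, etl_box)
elt(raw, elt_box)

# Both routes end with the same conditioned data; they differ only in
# *where* the transformation happens relative to the load.
print(etl_box == elt_box)  # True
```

ETLT simply chains the two: a first transformation before loading, then a second one inside the sandbox.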
The data you have collected may contain unnecessary features or null values, and it may come in a form too complex to work with directly. This is where data exploration can help you uncover the hidden trends in the data.
Steps involved in data exploration:

 Data identification
 Univariate Analysis
 Multivariate Analysis
 Filling Null values
 Feature engineering
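The exploration steps above can be sketched with pandas on a toy DataFrame. This is only an illustrative walk-through; the column names and values are invented for the example.

```python
# Hedged sketch of the data exploration steps on a tiny invented dataset.
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, None, 41, 38],
    "income": [30000, 45000, 52000, None, 61000],
})

# Data identification: shape and column types
print(df.shape, dict(df.dtypes))

# Univariate analysis: summary statistics for a single column
print(df["age"].describe())

# Multivariate analysis: correlation between variables
print(df[["age", "income"]].corr())

# Filling null values: here, with each column's median
df = df.fillna(df.median(numeric_only=True))

# Feature engineering: derive a new feature from existing ones
df["income_per_year_of_age"] = df["income"] / df["age"]

print(df.isna().sum().sum())  # 0 — no nulls remain after conditioning
```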
For model planning, data analysts often use regression techniques, decision trees, neural networks, etc. Tools mostly used for model planning and execution include R and PL/R, WEKA, Octave, Statistica, and MATLAB.
Model Building
Model building is the process of deploying the planned model in a real-time environment. It allows analysts to solidify their decision-making by gaining in-depth analytical information. This is a repetitive process, as you constantly have to add new features as required by your customers.
Your aim here is to forecast business decisions and customize market strategies and develop
tailor-made customer interests. This can be done by integrating the model into your existing
production domain.
In some cases, a specific model aligns perfectly with the business objectives and data; in others, it takes more than one try. As you start exploring the data, you need to run particular algorithms and compare the outputs with your objectives. In some cases, you may even have to run different variants of a model simultaneously until you achieve the desired results.
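Running several model variants and comparing them against an objective can be sketched as follows. This is a pure-Python toy (the two "models" and the data are invented); real projects would use the tools named earlier, such as R or Python libraries.

```python
# Illustrative sketch: fit two model variants on training data, then pick
# the one with the lowest mean absolute error on a small held-out set.

def mean_model(train):
    """Baseline variant: always predict the training mean."""
    avg = sum(y for _, y in train) / len(train)
    return lambda x: avg

def linear_model(train):
    """Crude slope-through-origin fit, for illustration only."""
    slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
    return lambda x: slope * x

def mae(model, data):
    """Mean absolute error of a model over (x, y) pairs."""
    return sum(abs(model(x) - y) for x, y in data) / len(data)

train = [(1, 2.1), (2, 3.9), (3, 6.2)]
test = [(4, 8.1), (5, 9.8)]

candidates = {"mean": mean_model(train), "linear": linear_model(train)}
scores = {name: mae(m, test) for name, m in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # the variant that best matches the objective
```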
Result Communication and Publication

This is the phase where you have to communicate the data analysis to your clients. It involves several intricate processes in which you have to present information to clients in a lucid manner. Your clients don't have enough time to determine which data is essential, so you must do an impeccable job of grabbing their attention.
Check the data accuracy
Does the data provide the information you expected? If not, you have to run other processes to resolve the issue. You need to ensure the data you process provides consistent information. This will help you build a convincing argument while summarizing your findings.
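A basic consistency check of this kind can be sketched in a few lines. The field names and figures below are hypothetical; the point is simply to validate that summary numbers hold together before they go into a report.

```python
# Hedged sketch of a sanity check on processed figures before reporting.
# Records and field names are invented for illustration.

records = [
    {"region": "EMEA", "wins": 3, "entries": 20},
    {"region": "APAC", "wins": 5, "entries": 25},
]

def consistent(r):
    # A win count can never be negative or exceed the number of entries.
    return 0 <= r["wins"] <= r["entries"]

issues = [r["region"] for r in records if not consistent(r)]
print(issues)  # [] — an empty list means the figures are internally consistent
```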
Highlight important findings
Every piece of data plays a role in building an efficient project, but some data carries more potent information that can truly serve your audience. While summarizing your findings, try to categorize them into different key points.
Determine the most appropriate communication format
How you communicate your findings says a lot about you as a professional. We recommend visual presentations and animations, as they help you convey information much faster. However, sometimes you need to go old-school as well. For instance, your clients may have to carry the findings in physical format, or pick out certain information to share with others.
Operationalize

As soon as you prepare a detailed report including your key findings, documents, and briefings, your data analytics lifecycle comes close to its end. The next step is to measure the effectiveness of your analysis before submitting the final reports to your stakeholders.
In this process, you have to move the sandbox data and run it in a live environment. Then you have to closely monitor the results, ensuring they match your expected goals. If the findings fit your objective, you can finalize the report. Otherwise, you have to take a step back in the data analytics lifecycle and make changes.

Case Study: Global Innovation Network and Analysis (GINA)

EMC's Global Innovation Network and Analysis (GINA) team is a group of senior technologists
located in centers of excellence (COEs) around the world. This team's charter is to engage
employees across global COEs to drive innovation, research, and university partnerships. In
2012, a newly hired director wanted to improve these activities and provide a mechanism to track
and analyze the related information. In addition, this team wanted to create more robust
mechanisms for capturing the results of its informal conversations with other thought leaders
within EMC, in academia, or in other organizations, which could later be mined for insights.

The GINA team thought its approach would provide a means to share ideas globally and increase
knowledge sharing among GINA members who may be separated geographically. It planned to
create a data repository containing both structured and unstructured data to accomplish three
main goals.
 Store formal and informal data.
 Track research from global technologists.
 Mine the data for patterns and insights to improve the team's operations and
strategy.

The GINA case study provides an example of how a team applied the Data Analytics Lifecycle to analyze innovation data at EMC. Innovation is typically a difficult concept to measure, and this team wanted to look for ways to use advanced analytical methods to identify key innovators within the company.

Phase 1: Discovery
In the GINA project's discovery phase, the team began identifying data sources. Although GINA
was a group of technologists skilled in many different aspects of engineering, it had some data
and ideas about what it wanted to explore but lacked a formal team that could perform these
analytics. After consulting with various experts including Tom Davenport, a noted expert in
analytics at Babson College, and Peter Gloor, an expert in collective intelligence and creator of
COINs (Collaborative Innovation Networks) at MIT, the team decided to crowdsource the work
by seeking volunteers within EMC. Here is a list of how the various roles on the working team
were fulfilled.
 Business User, Project Sponsor, Project Manager: Vice President from Office of
the CTO
 Business Intelligence Analyst: Representatives from IT
 Data Engineer and Database Administrator (DBA): Representatives from IT
 Data Scientist: Distinguished Engineer, who also developed the social graphs
shown in the GINA case study

The project sponsor's approach was to leverage social media and blogging to accelerate the
collection of innovation and research data worldwide and to motivate teams of "volunteer" data
scientists at worldwide locations. Given that he lacked a formal team, he needed to be
resourceful about finding people who were both capable and willing to volunteer their time to
work on interesting problems. Data scientists tend to be passionate about data, and the project
sponsor was able to tap into this passion of highly talented people to accomplish challenging
work in a creative way.

The data for the project fell into two main categories. The first category represented five years
of idea submissions from EMC's internal innovation contests, known as the Innovation Road
map (formerly called the Innovation Showcase). The Innovation Road map is a formal, organic
innovation process whereby employees from around the globe submit ideas that are then vetted
and judged. The best ideas are selected for further incubation. As a result, the data is a mix of
structured data, such as idea counts, submission dates, inventor names, and unstructured content,
such as the textual descriptions of the ideas themselves.

The second category of data encompassed minutes and notes representing innovation and
research activity from around the world. This also represented a mix of structured and
unstructured data. The structured data included attributes such as dates, names, and geographic
locations. The unstructured documents contained the "who, what, when, and where" information
that represents rich data about knowledge growth and transfer within the company. This type of
information is often stored in business silos that have little to no visibility across disparate
research teams.

The 10 main IHs that the GINA team developed were as follows:
 IH1: Innovation activity in different geographic regions can be mapped to corporate strategic directions.
 IH2: The length of time it takes to deliver ideas decreases when global knowledge transfer occurs as part of the idea delivery process.
 IH3: Innovators who participate in global knowledge transfer deliver ideas more quickly than those who do not.
 IH4: An idea submission can be analyzed and evaluated for the likelihood of receiving funding.
 IH5: Knowledge discovery and growth for a particular topic can be measured and compared across geographic regions.
 IH6: Knowledge transfer activity can identify research-specific boundary spanners in disparate regions.
 IH7: Strategic corporate themes can be mapped to geographic regions.
 IH8: Frequent knowledge expansion and transfer events reduce the time it takes to generate a corporate asset from an idea.
 IH9: Lineage maps can reveal when knowledge expansion and transfer did not (or has not) resulted in a corporate asset.
 IH10: Emerging research topics can be classified and mapped to specific ideators, innovators, boundary spanners, and assets.

The GINA IHs can be grouped into two categories:
 Descriptive analytics of what is currently happening to spark further creativity, collaboration, and asset generation
 Predictive analytics to advise executive management of where it should be investing in the future

Phase 2: Data Preparation

The team partnered with its IT department to set up a new analytics
sandbox to store and experiment on the data. During the data exploration exercise, the data
scientists and data engineers began to notice that certain data needed conditioning and
normalization. In addition, the team realized that several missing data sets were critical to testing
some of the analytic hypotheses.

As the team explored the data, it quickly realized that if it did not have data of sufficient quality
or could not get good quality data, it would not be able to perform the subsequent steps in the
lifecycle process. As a result, it was important to determine what level of data quality and
cleanliness was sufficient for the project being undertaken. In the case of the GINA, the team
discovered that many of the names of the researchers and people interacting with the universities
were misspelled or had leading and trailing spaces in the data store. Seemingly small problems
such as these in the data had to be addressed in this phase to enable better analysis and data
aggregation in subsequent phases.
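The kind of conditioning described above — trimming stray whitespace and normalizing case so records for the same researcher aggregate correctly — can be sketched in a few lines. The names below are invented for illustration.

```python
# Minimal sketch of name conditioning: strip leading/trailing spaces,
# collapse internal runs of spaces, and normalize capitalization so that
# variant spellings collapse to one record key.

def normalize_name(name):
    """Trim and collapse whitespace, then title-case the name."""
    return " ".join(name.split()).title()

raw_names = ["  alice chen", "Alice  Chen ", "ALICE CHEN"]
cleaned = {normalize_name(n) for n in raw_names}
print(cleaned)  # {'Alice Chen'} — three variants collapse to one key
```

Real projects would also need fuzzy matching for genuine misspellings; this sketch handles only the whitespace and casing issues mentioned in the case study.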

Phase 3: Model Planning

In the GINA project, for much of the dataset, it seemed feasible to use
social network analysis techniques to look at the networks of innovators within EMC. In other
cases, it was difficult to come up with appropriate ways to test hypotheses due to the lack of data.
In one case (IH9), the team made a decision to initiate a longitudinal study to begin tracking data
points over time regarding people developing new intellectual property. This data collection
would enable the team to test the following two ideas in the future:
 IH8: Frequent knowledge expansion and transfer events reduce the amount of time it takes to generate a corporate asset from an idea.
 IH9: Lineage maps can reveal when knowledge expansion and transfer did not (or has not) resulted in a corporate asset.
For the longitudinal study being proposed, the team needed to establish goal criteria for the
study. Specifically, it needed to determine the end goal of a successful idea that had traversed the
entire journey. The parameters related to the scope of the study included the following
considerations:

 Identify the right milestones to achieve this goal.


 Trace how people move ideas from each milestone toward the goal.
 Once this is done, trace ideas that die, and trace others that reach the goal.
Compare the journeys of ideas that make it and those that do not.
 Compare the times and the outcomes using a few different methods (depending on
how the data is collected and assembled). These could be as simple as t-tests or
perhaps involve different types of classification algorithms
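The simplest comparison mentioned above, a t-test, can be sketched with the standard library. The day counts are invented for illustration; a real study would use the collected longitudinal data (and a library such as SciPy for p-values).

```python
# Hedged sketch of a two-sample (Welch's) t statistic comparing
# idea-to-asset times for two hypothetical groups of ideas.
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

with_transfer = [120, 95, 110, 100, 105]      # days to asset, with knowledge transfer
without_transfer = [180, 150, 200, 170, 160]  # days to asset, without

t = welch_t(with_transfer, without_transfer)
# A large negative t suggests shorter delivery times with transfer,
# consistent with the direction hypothesized in IH8.
print(round(t, 2))
```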

Phase 4: Model Building

In Phase 4, the GINA team employed several analytical methods. This
included work by the data scientist using Natural Language Processing (NLP) techniques on the textual descriptions of the Innovation Road map ideas. In addition, he conducted social network analysis using R and RStudio, and then he developed social graphs and visualizations of the network of communications related to innovation using R's ggplot2 package. Examples of this
work are shown in Figures 2-10 and 2-11.

FIGURE 2-10: Social graph [27] visualization of idea submitters and finalists

FIGURE 2-11: Social graph visualization of top innovation influencers

Each color represents an innovator from a different country. The large dots with red circles
around them represent hubs. A hub represents a person with high connectivity and a high
"betweenness" score. The cluster in Figure 2-11 contains geographic variety, which is critical to
prove the hypothesis about geographic boundary spanners. One person in this graph has an
unusually high score when compared to the rest of the nodes in the graph. The data scientist
identified this person and ran a query against his name within the analytic sandbox. These
actions yielded the following information about this research scientist (from the social graph),
which illustrated how influential he was within his business unit and across many other areas of
the company worldwide:
 In 2011, he attended the ACM SIGMOD conference, which is a top-tier
conference on large-scale data management problems and databases.
 He visited employees in France who are part of the business unit for EMC's
content management teams within Documentum (now part of the Information
Intelligence Group, or IIG).
 He presented his thoughts on the SIGMOD conference at a virtual brown bag
session attended by three employees in Russia, one employee in Cairo, one
employee in Ireland, one employee in India, three employees in the United States,
and one employee in Israel.
 In 2012, he attended the SDM 2012 conference in California.
 On the same trip he visited innovators and researchers at EMC federated
companies, Pivotal and VMware.
 Later on that trip he stood before an internal council of technology leaders and
introduced two of his researchers to dozens of corporate innovators and researchers.

This finding suggests that at least part of the initial hypothesis is correct: the data can identify innovators who span different geographies and business units. The team used Tableau software for data visualization and exploration and used the Pivotal Greenplum database as the main data repository and analytics engine.
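The "betweenness" score used above to spot hubs counts how often a node sits on shortest paths between other nodes. A brute-force sketch on a tiny invented collaboration graph illustrates the idea (fine for illustration, far too slow for real networks, which would use a library such as NetworkX or R's igraph):

```python
# Brute-force betweenness on a tiny undirected graph: node B bridges the
# most shortest paths, so it emerges as the hub. Names are invented.
from collections import deque
from itertools import permutations

edges = [("A", "B"), ("B", "C"), ("B", "D"), ("D", "E"), ("C", "E")]
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

def shortest_paths(src, dst):
    """All shortest paths from src to dst via breadth-first search."""
    paths, best = [], None
    queue = deque([[src]])
    while queue:
        path = queue.popleft()
        if best is not None and len(path) > best:
            break  # all remaining queued paths are longer than the shortest
        node = path[-1]
        if node == dst:
            best = len(path)
            paths.append(path)
            continue
        for nxt in graph[node]:
            if nxt not in path:
                queue.append(path + [nxt])
    return paths

# Credit each intermediate node a fraction of every shortest path it lies on.
betweenness = {n: 0.0 for n in graph}
for s, t in permutations(graph, 2):
    paths = shortest_paths(s, t)
    for p in paths:
        for mid in p[1:-1]:
            betweenness[mid] += 1 / len(paths)

hub = max(betweenness, key=betweenness.get)
print(hub)  # "B" — the person bridging the most shortest paths
```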

Phase 5: Communicate Results

In Phase 5, the team found several ways to cull results of the analysis and identify the most
impactful and relevant findings. This project was considered successful in identifying boundary
spanners and hidden innovators. As a result, the CTO office launched longitudinal studies to
begin data collection efforts and track innovation results over longer periods of time. The GINA
project promoted knowledge sharing related to innovation and researchers spanning multiple
areas within the company and outside of it. GINA also enabled EMC to cultivate additional
intellectual property that led to additional research topics and provided opportunities to forge
relationships with universities for joint academic research in the fields of Data Science and Big
Data. In addition, the project was accomplished with a limited budget, leveraging a volunteer
force of highly skilled and distinguished engineers and data scientists. One of the key findings
from the project is that there was a disproportionately high density of innovators in Cork,
Ireland. Each year, EMC hosts an innovation contest, open to employees to submit innovation
ideas that would drive new value for the company. When looking at the data in 2011, 15% of the
finalists and 15% of the winners were from Ireland. These are unusually high numbers, given the
relative size of the Cork COE compared to other larger centers in other parts of the world. After
further research, it was learned that the COE in Cork, Ireland had received focused training in
innovation from an external consultant, which was proving effective. The Cork COE came up with more innovation ideas,
and better ones, than it had in the past, and it was making larger contributions to innovation at
EMC. It would have been difficult, if not impossible, to identify this cluster of innovators
through traditional methods or even anecdotal, word-of-mouth feedback. Applying social network
analysis enabled the team to find a pocket of people within EMC who were making
disproportionately strong contributions. These findings were shared internally through
presentations and conferences and promoted through social media and blogs.

Phase 6: Operationalize

Running analytics against a sandbox filled with notes, minutes, and presentations from
innovation activities yielded great insights into EMC's innovation culture. Key findings from the
project include these:
• The CTO office and GINA need more data in the future, including a marketing initiative to
convince people to inform the global community on their innovation/research activities.
• Some of the data is sensitive, and the team needs to consider security and privacy related to the
data, such as who can run the models and see the results.
• In addition to running models, a parallel initiative needs to be created to improve basic
Business Intelligence activities, such as dashboards, reporting, and queries on research activities
worldwide.
• A mechanism is needed to continually reevaluate the model after deployment. Assessing the
benefits is one of the main goals of this stage, as is defining a process to retrain the model as
needed.
