Data Analytics - Unit - 1
Data Collection
Data collection is the process of acquiring, extracting, and storing voluminous amounts of data, which may be in structured or unstructured form such as text, video, audio, XML files, records, or image files, for use in later stages of data analysis.
In big data analysis, data collection is the initial step, carried out before the patterns or useful information in the data are analyzed. The data to be analyzed must be collected from different valid sources.
The data collected at this stage is known as raw data. Raw data is not immediately useful; cleaning out the impurities and using the data for further analysis produces information, and the information obtained in turn becomes knowledge. Knowledge can take many forms, such as business knowledge about the sales of enterprise products, knowledge of disease treatment, and so on. The main goal of data collection is therefore to collect information-rich data.
The collected data is mainly divided into two types:
1. Primary data: Raw, original data extracted directly from official sources is known as primary data. This type of data is collected directly through techniques such as questionnaires, interviews, and surveys. The data collected must match the demands and requirements of the target audience on which the analysis is performed; otherwise it becomes a burden during data processing. A few methods of collecting primary data:
1. Interview method:
In this method, data is collected by interviewing the target audience: the person conducting the interview is the interviewer and the person who answers is the interviewee. Basic business- or product-related questions are asked and recorded in the form of notes, audio, or video, and this data is stored for processing. Interviews can be structured or unstructured, and may be personal or formal interviews conducted over the telephone, face to face, by email, etc.
2. Survey method:
The survey method is a research process in which a list of relevant questions is asked and the answers are recorded in the form of text, audio, or video. Surveys can be conducted in both online and offline modes, for example through website forms and email. The survey answers are then stored for data analysis. Examples are online surveys or surveys conducted through social media polls.
3. Observation method:
The observation method is a method of data collection in which the researcher keenly observes the behavior and practices of the target audience using a data collection tool and stores the observed data as text, audio, video, or other raw formats. In this method, data is gathered by observation rather than by directly posing questions to the participants, for example by observing a group of customers and their behavior towards a product. The data obtained is then sent for processing.
4. Experimental method:
The experimental method is the process of collecting data by performing experiments, research, and investigation. The most frequently used experimental designs are CRD, RBD, LSD, and FD.
CRD – Completely Randomized Design is a simple experimental design used in data analytics that is based on randomization and replication. It is mostly used for comparing experiments.
RBD – Randomized Block Design is an experimental design in which the experiment is divided into small units called blocks. Random experiments are performed on each of the blocks and results are drawn using a technique known as analysis of variance (ANOVA). RBD originated in the agriculture sector.
LSD – Latin Square Design is an experimental design similar to CRD and RBD but arranged in rows and columns. It is an N×N square arrangement with an equal number of rows and columns, in which each letter occurs only once in each row and each column. Hence differences can be found easily, with fewer errors in the experiment. A Sudoku puzzle is an example of a Latin square design.
FD – Factorial Design is an experimental design in which each experiment has two or more factors, each with several possible values (levels), and trials are performed across the combinations of these factors. A short sketch of the LSD and RBD ideas follows this list.
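As a rough illustration of two of these designs, here is a minimal Python sketch (all numbers are made up) that builds a small Latin square by cyclically shifting treatment labels, so each treatment occurs exactly once per row and per column as in LSD, and then runs a simplified one-way ANOVA with scipy.stats.f_oneway on hypothetical block measurements, the kind of comparison an RBD analysis relies on (a full RBD analysis would separate block and treatment effects).

    # Minimal sketch of the LSD and RBD ideas; all numbers are hypothetical.
    import numpy as np
    from scipy.stats import f_oneway

    # Latin Square Design: each treatment appears once per row and once per column.
    treatments = np.array(list("ABCD"))
    n = len(treatments)
    latin_square = np.array([np.roll(treatments, -i) for i in range(n)])  # cyclic shifts
    print(latin_square)
    for i in range(n):
        assert len(set(latin_square[i, :])) == n   # no repeats in row i
        assert len(set(latin_square[:, i])) == n   # no repeats in column i

    # Randomized Block Design: compare blocks using analysis of variance (ANOVA).
    block_1 = [20.1, 19.8, 21.0, 20.5]   # made-up measurements for block 1
    block_2 = [22.3, 21.9, 22.8, 22.1]   # made-up measurements for block 2
    block_3 = [19.5, 20.0, 19.2, 19.9]   # made-up measurements for block 3
    f_stat, p_value = f_oneway(block_1, block_2, block_3)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")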
2. Secondary data: Secondary data is data that has already been collected and is reused for some valid purpose. This type of data is derived from previously recorded primary data, and it comes from two types of sources: internal sources and external sources.
Internal source:
This type of data can easily be found within the organization, such as market records, sales records, transactions, customer data, accounting resources, etc. The cost and time required to obtain data from internal sources are low.
External source:
Data that cannot be found within the organization and must be obtained through external third-party resources is external source data. The cost and time required are higher because such sources contain huge amounts of data. Examples of external sources are government publications, news publications, the Registrar General of India, the Planning Commission, the International Labour Bureau, syndicate services, and other non-governmental publications.
Other sources:
Sensor data: With the advancement of IoT devices, the sensors in these devices collect data that can be used for sensor data analytics to track the performance and usage of products.
Satellite data: Satellites collect terabytes of images and data on a daily basis through surveillance cameras, which can be used to extract useful information.
Web traffic: Thanks to fast and cheap internet, data in many formats uploaded by users to different platforms can be collected, with their permission, for data analysis. Search engines also provide data on the most frequently searched keywords and queries.
Structured and Unstructured Data
Structured data is data that has been predefined and formatted to a set
structure before being placed in data storage, which is often referred to as
schema-on-write. The best example of structured data is the relational
database: the data has been formatted into precisely defined fields, such as
credit card numbers or addresses, in order to be easily queried with SQL.
Structured data is an old, familiar friend: it is the basis for inventory control systems and ATMs, and it can be human- or machine-generated. The main drawback of structured data is its lack of flexibility, since the structure must be defined before the data is stored.
Unstructured data is data stored in its native format and not processed
until it is used, which is known as schema-on-read. It comes in a myriad of file
formats, including email, social media posts, presentations, chats, IoT sensor
data, and satellite imagery.
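To make the schema-on-write versus schema-on-read distinction concrete, the following minimal Python sketch (with made-up records) stores structured rows in a predefined SQLite table, while an unstructured JSON document is kept as raw text and only interpreted when it is read.

    # Sketch: schema-on-write (structured) vs. schema-on-read (unstructured); data is invented.
    import json
    import sqlite3

    # Schema-on-write: the table structure is defined before any data is stored.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
    conn.execute("INSERT INTO customers VALUES (?, ?, ?)", (1, "Asha", "Pune"))
    conn.commit()
    for row in conn.execute("SELECT name, city FROM customers"):  # queries rely on fixed fields
        print(row)

    # Schema-on-read: the raw document is stored as-is and parsed only when used.
    raw_document = '{"user": "Asha", "post": "Great product!", "tags": ["review"]}'
    parsed = json.loads(raw_document)   # the "schema" is decided here, at read time
    print(parsed["user"], parsed["tags"])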
As with structured data, unstructured data has strengths and weaknesses for specific business needs. Its main benefit is flexibility, since no structure is imposed on the data until it is read. There are also cons to using unstructured data: it requires specific expertise and specialized tools in order to be used to its fullest potential.
1. Requires data science expertise: The largest drawback to unstructured
data is that data science expertise is required to prepare and analyze the
data. A standard business user cannot use unstructured data as it is, due to
its undefined/non-formatted nature. Using unstructured data requires understanding not only the topic or area of the data but also how the pieces of data relate to each other to make it useful.
2. Specialized tools: In addition to the required expertise, unstructured data requires specialized tools to manipulate it. Standard tools are intended for use with structured data, which leaves a data manager with limited choices of products for unstructured data, some of which are still in their infancy.
Examples of unstructured data
Common examples of unstructured data include email, social media posts, presentations, chats, IoT sensor data, and satellite imagery.
Structured data vs. unstructured data comes down to the data types that can be used, the level of data expertise required to use them, and schema-on-write versus schema-on-read.
Currently, the marketplace is flooded with numerous open-source and commercially available big data platforms. They boast different features and capabilities for use in a big data environment.
5 V’s of Big Data
Volume
The name Big Data itself refers to enormous size. Big Data is the vast volume of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and many more.
For example, Facebook generates approximately a billion messages, records around 4.5 billion "Like" clicks, and receives more than 350 million new posts each day. Big data technologies can handle such large amounts of data.
Variety
Big Data can be structured, unstructured, or semi-structured, and is collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data comes in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.
The data is categorized as below; a short sketch of loading structured and semi-structured data follows the list:
a. Structured data: Structured data has a defined schema with all the required columns. It is in tabular form and is stored in a relational database management system.
b. Semi-structured data: In semi-structured data, the schema is not rigidly defined; examples are JSON, XML, CSV, TSV, and email. Unlike the structured data that OLTP (Online Transaction Processing) systems are built to work with, it is not stored in relations, i.e., tables.
c. Unstructured data: All unstructured files, such as log files, audio files, and image files, are included in unstructured data. Some organizations have a great deal of data available, but they do not know how to derive value from it because the data is raw.
d. Quasi-structured data: This format contains textual data with inconsistent formats that can be structured only with some effort, time, and tools.
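The short Python sketch below (with invented file contents) shows what handling this variety can look like in practice: a structured CSV-style table is read into a fixed set of columns, while a semi-structured JSON record with nested fields is flattened with pandas.

    # Sketch: loading structured (tabular) and semi-structured (JSON) data; content is invented.
    import io
    import json
    import pandas as pd

    # Structured data: fixed columns in tabular form.
    csv_text = "order_id,amount,region\n1,250,North\n2,410,South\n"
    orders = pd.read_csv(io.StringIO(csv_text))
    print(orders)

    # Semi-structured data: nested fields without a rigid table schema.
    json_text = '{"order_id": 3, "customer": {"name": "Ravi", "city": "Delhi"}, "items": ["pen", "book"]}'
    record = json.loads(json_text)
    flat = pd.json_normalize(record)     # flattens nested keys such as customer.name
    print(flat.columns.tolist())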
Veracity
Veracity means how reliable the data is. Because incoming data is often noisy or inconsistent, there are many ways to filter or translate it; veracity is about being able to handle and manage such data efficiently, which is also essential in business development. An example of data with uncertain veracity is Facebook posts with hashtags.
Value
Value is an essential characteristic of big data. It is not the raw data that we process or store that matters; it is the valuable and reliable data that we store, process, and analyze.
Velocity
Velocity plays an important role compared to the others. Velocity refers to the speed at which data is created in real time. It covers the rate at which incoming data sets arrive, their rate of change, and bursts of activity. A primary aspect of Big Data is providing the demanded data rapidly.
Big data velocity deals with the speed at which data flows from sources such as application logs, business processes, networks, social media sites, sensors, mobile devices, etc.
Big Data Platforms vs. Data Lake vs. Data Warehouse
Big Data, at its core, refers to technologies that handle large volumes of data too
complex to be processed by traditional databases. However, it is a very broad term,
functioning as an umbrella term for more specific solutions such as Data Lake and Data
Warehouse.
A Data Lake is a scalable storage repository that not only holds large volumes of raw data in its native format but also enables organizations to prepare that data for further usage. Data coming into a Data Lake does not have to be collected for a specific purpose from the beginning; the purpose can be defined later. Because the data does not need to undergo an initial transformation process, it can be loaded faster.
In Data Lakes, data is gathered in its native formats, which provides more opportunities for exploration, analysis, and further operations, as all data requirements can be tailored on a case-by-case basis; once a schema has been developed, it can be kept for future use or discarded.
Compared to Data Lakes, Data Warehouses represent a more traditional and restrictive approach.
A Data Warehouse is a scalable data repository holding large volumes of data, but its environment is far more structured than that of a Data Lake. Data collected in a Data Warehouse is already pre-processed, which means it is no longer in its native format. Data requirements must be known and set up front to make sure the models and schemas produce usable data for all users.
Key differences between Data Lake and Data Warehouse
Analytic processes and tools
Data Analytics Process
The process of data analysis, or alternatively the data analysis steps, involves gathering all the information, processing it, exploring the data, and using it to find patterns and other insights.
The process of data analysis consists of:
Data Requirement Gathering: Ask yourself why you’re doing this analysis, what type
of data you want to use, and what data you plan to analyze.
Data Collection: Guided by your identified requirements, it’s time to collect the data
from your sources. Sources include case studies, surveys, interviews, questionnaires,
direct observation, and focus groups. Make sure to organize the collected data for
analysis.
Data Cleaning: Not all of the data you collect will be useful, so it's time to clean it up. This is where you remove white spaces, duplicate records, and basic errors; a short pandas sketch of this step follows the list. Data cleaning is mandatory before sending the information on for analysis.
Data Analysis: Here is where you use data analysis software and other tools to help you
interpret and understand the data and arrive at conclusions. Data analysis
tools include Excel, Python, R, Looker, Rapid Miner, Chartio, Metabase, Redash,
and Microsoft Power BI.
Data Interpretation: Now that you have your results, you need to interpret them and
come up with the best courses of action based on your findings.
Data Visualization: Data visualization is a fancy way of saying, “graphically show your
information in a way that people can read and understand it.” You can use charts, graphs,
maps, bullet points, or a host of other methods. Visualization helps you derive valuable
insights by helping you compare datasets and observe relationships.
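As a minimal, hypothetical example of the data cleaning step mentioned above, the pandas sketch below trims white space, fixes a basic type error, and removes duplicate records from a few invented survey rows before a first simple analysis.

    # Sketch of basic data cleaning with pandas; the rows are invented.
    import pandas as pd

    raw = pd.DataFrame({
        "name": ["  Asha ", "Ravi", "Ravi", "Meena"],
        "age":  ["34", "41", "41", "29"],        # ages arrived as text
        "city": ["Pune", "Delhi ", "Delhi ", "Chennai"],
    })

    clean = raw.copy()
    clean["name"] = clean["name"].str.strip()    # remove stray white space
    clean["city"] = clean["city"].str.strip()
    clean["age"] = clean["age"].astype(int)      # fix the basic type error
    clean = clean.drop_duplicates()              # remove duplicate records

    print(clean)
    print("Average age:", clean["age"].mean())   # a first, simple analysis step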
Reporting Vs Analysis
When you have a set of data stored somewhere and you are happy with the structure
(e.g. you have already cleaned or enhanced your dataset), it is time to make something
out of it. So here comes the “reporting” part. Data reporting is about taking the
available information (e.g. your dataset), organizing it, and displaying it in a well-
structured and digestible format we call “reports”. You can present data from various
sources, making it available for anyone to analyze it.
Reporting is a great way to help the internal teams and experts answer the question
of what is happening.
Analytics is a much wider term. It encompasses reporting, as you cannot talk about analytics without proper reporting. Having said that, for proper decision-making you will need much more than that.
Analytics is about diving deeper into your data and reports in order to look for insights.
It’s actually an attempt to answer why something is happening. Analytics powers up
decision-making as the main goal is to make sense of the data explaining the reason
behind the reported numbers. Last but not least, in the context of reporting vs. analytics,
you will find that analytics includes recommendations as well. After you analyze your
data and know why something is happening, your aim is to determine a course of action
to either improve something or provide a solution.
As discussed, to do a proper analysis you will need well-designed reports. As opposed
to reporting where data is just grouped up and presented, analytics rests on dashboards
that allow you to dive deeper into existing numbers and look for insights.
Value: reporting transforms your data into information, while analytics transforms that information into insights and recommendations.
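The contrast can be sketched in a few lines of pandas with made-up sales rows: the "report" simply groups and presents totals (what happened), while the "analytics" step drills into one month to look for the reason behind the number (why it happened).

    # Sketch: reporting (what happened) vs. an analytic drill-down (why); data is made up.
    import pandas as pd

    sales = pd.DataFrame({
        "month":  ["Jan", "Jan", "Feb", "Feb", "Feb"],
        "region": ["North", "South", "North", "South", "South"],
        "amount": [500, 450, 520, 200, 180],
    })

    # Reporting: organize and present the information.
    report = sales.groupby("month", sort=False)["amount"].sum()
    print(report)                                   # Jan 950, Feb 900

    # Analytics: drill deeper to explain the February dip.
    feb_breakdown = sales[sales["month"] == "Feb"].groupby("region")["amount"].sum()
    print(feb_breakdown)                            # the South region accounts for the decline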
Applications of Data Analytics
1. Transportation
One can use data analytics to reduce traffic congestion and improve travel by improving transportation systems and their intelligence. It works by analyzing enormous volumes of data to build alternative routes that relieve traffic congestion, which in turn reduces road accidents. Likewise, travel companies can obtain buyers' preferences from social media and other sources to improve their packages. This improves the travel experience of buyers and grows companies' customer bases. For example, data analytics was used to solve the transportation problem of 18 million people in London during the 2012 Olympics.
2. Education
Policymakers can use data analytics to improve learning curricula and management
decisions. These applications would improve both learning experiences and
administrative management. To improve the curriculum, we can collect preference data
from each student to build curricula. This would create a better system where students
use different ways to learn the same content. Also, quality data obtained from students can support better resource allocation and more sustainable management decisions. For example, data analytics can let administrators know which facilities students use less or which subjects they are barely interested in.
3. Internet web search results
Search engines like Google, Amazon e-commerce search, Bing, etc., use analytics to
arrange data and deliver the best search results. This implies that data analytics is used
in most search engine operations. When storing web data, data analytics gathers
massive volumes of data submitted by different pages and groups them according to
keywords. In each group, analytics also helps rank web pages according to relevance.
Likewise, every word the searcher enters is a keyword in delivering search results. Data
analytics is again used to search a particular group of web pages to provide the one that
matches the keyword intent best.
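A toy Python version of this keyword grouping and ranking might look like the sketch below: pages (with invented contents) are grouped into an inverted index by keyword, and results for a query term are ranked by a very crude relevance score, the term's frequency on the page.

    # Toy sketch of keyword grouping (inverted index) and relevance ranking; pages are invented.
    from collections import defaultdict

    pages = {
        "page1": "data analytics improves business decisions with data",
        "page2": "travel packages and holiday deals",
        "page3": "big data analytics for transportation and traffic data data",
    }

    # Group pages by keyword.
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.split():
            index[word].add(page_id)

    def search(keyword):
        """Rank matching pages by a simple relevance score: keyword frequency."""
        matches = index.get(keyword, set())
        scored = [(pages[p].split().count(keyword), p) for p in matches]
        return [p for _, p in sorted(scored, reverse=True)]

    print(search("data"))   # page3 (3 occurrences) ranks above page1 (2 occurrences)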
6. Security
Security personnel use data analytics (especially predictive analytics) to anticipate future crimes or security breaches. They can also investigate past or ongoing attacks. Analytics makes it possible to analyze how IT systems were breached during an attack, other plausible weaknesses, and the behavior of end users or devices involved in a security breach.
Some cities use data analytics to monitor areas with high crime rates. They monitor
crime patterns and predict future crime possibilities from these patterns. This helps
maintain a safe city without risking police officers’ lives.
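As a rough, hypothetical illustration of predictive analytics in this security setting, the sketch below fits a logistic regression on made-up login features (number of failed attempts and an off-hours flag) and scores a new event for breach risk; real security models use far richer features and data.

    # Hypothetical sketch: scoring login events for breach risk with logistic regression.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Made-up training data: [failed_attempts, off_hours (0/1)] per login session.
    X = np.array([[0, 0], [1, 0], [0, 1], [7, 1], [9, 1], [6, 0], [8, 1], [1, 1]])
    y = np.array([0, 0, 0, 1, 1, 1, 1, 0])   # 1 = past breach, 0 = normal activity

    model = LogisticRegression().fit(X, y)

    # Score a new event: many failed attempts, outside working hours.
    new_event = np.array([[10, 1]])
    risk = model.predict_proba(new_event)[0, 1]
    print(f"Estimated breach risk: {risk:.2f}")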
7. Fraud detection
Phase 1: Discovery –
Phase 6: Operationalize
The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening it to the full enterprise of users.
This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and to make adjustments before full deployment.
The team delivers the final reports, briefings, and code.
Free or open source tools – Octave, WEKA, SQL, MADlib.
Key Roles of a Successful Data Analytics Project
Certain key roles are required for a data science team to function fully and execute analytics projects successfully. The key roles are seven in number.
Each role is crucial to developing a successful analytics project. There is no hard and fast rule for the listed seven roles; fewer or more people may fill them depending on the scope of the project, the skills of the participants, and the organizational structure.
Example –
For a small, versatile team, the listed seven roles may be fulfilled by only three to four people, whereas a large project may, on the contrary, require 20 or more people to fill them.
Key Roles for a Data Analytics Project:
1. Business User :
The business user is the one who understands the main domain area of the project and typically benefits from its results.
This user advises and consults the team working on the project about the value of the results obtained and how the outputs will be put to use.
A business manager, line manager, or deep subject matter expert in the project domain usually fulfills this role.
2. Project Sponsor :
The Project Sponsor is the person responsible for initiating the project.
The Project Sponsor provides the actual requirements for the project and presents the core business issue.
He or she generally provides the funding and gauges the degree of value from the final outputs of the team working on the project.
This person frames the prime concerns and shapes the desired outputs.
3. Project Manager :
This person ensures that the key milestones and purpose of the project are met on time and with the expected quality.
6. Data Engineer :
The data engineer has deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides support for data intake into the analytic sandbox.
The data engineer works closely with the data scientist to help shape the data into the correct form for analysis.
7. Data Scientist :
The data scientist provides subject matter expertise for analytical techniques and data modelling, and applies the correct analytical techniques to a given business issue.
He or she ensures that the overall analytical objectives are met.
Data scientists design and apply analytical methods and work with the data available for the concerned project.