DA Unit 2
INTRODUCTION:
Data has been the buzzword for ages now. Whether the data is generated by large-scale
enterprises or by an individual, every aspect of it needs to be analysed to derive value from it.
Data analytics offers several benefits:
1. Improved Decision Making: Data analytics eliminates guesswork and manual tasks, be it
choosing the right content, planning marketing campaigns, or developing products. Organizations
can use the insights they gain from data analytics to make informed decisions, leading to
better outcomes.
2. Better Customer Service: Data analytics lets you tailor customer service to customers'
needs. It also provides personalization and builds stronger relationships with customers.
Analysed data can reveal information about customers' interests, concerns, and more, helping you
give better recommendations for your products and services.
3. Efficient Operations: With the help of data analytics, you can streamline your processes, save
money, and boost production. With an improved understanding of what your audience wants, you
spend less time creating ads and content that are not in line with your audience's interests.
4. Effective Marketing: Data analytics gives you valuable insights into how your campaigns are
performing. This helps in fine-tuning them for optimal outcomes. Additionally, you can also find
potential customers who are most likely to interact with a campaign and convert into leads.
The next step in understanding data analytics is to learn how data is analysed in organizations.
The data analytics lifecycle involves a few steps, outlined below.
1. Understand the problem: Understanding the business problem, defining the organizational
goals, and planning a lucrative solution is the first step in the analytics process. E-commerce
companies often encounter issues such as predicting the return of items, giving relevant product
recommendations, and so on.
2. Data Collection: Next, you need to collect transactional business data and customer-related
information from the past few years to address the problems your business is facing. The data can
have information about the total units that were sold for a product, the sales and profit that were
made, and when the order was placed. Past data plays a crucial role in shaping the future of a
business.
3. Data Cleaning: Now, all the data you collect will often be disorderly, messy, and contain
unwanted missing values. Such data is not suitable or relevant for performing data analysis. Hence,
you need to clean the data to remove unwanted, redundant, and missing values to make it ready
for analysis.
4. Data Exploration and Analysis: After you gather the right data, the next vital step is to
perform exploratory data analysis. You can use data visualization and business intelligence tools,
data mining techniques, and predictive modelling to analyze, visualize, and predict future outcomes
from this data. Applying these methods can tell you the impact and relationship of a certain feature
as compared to other variables (a short sketch of this step follows the list below).
5. Interpret the results: The final step is to interpret the results and validate if the outcomes
meet your expectations. You can find out hidden patterns and future trends. This will help you gain
insights that will support you with appropriate data-driven decision making.
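As a brief, hedged illustration of the cleaning and exploration steps above, the sketch below uses pandas on a hypothetical orders.csv file; the file name and its columns (units_sold, sales, profit, order_date) are assumptions made for this example.

import pandas as pd

# 2. Data collection: load past transactional data (hypothetical file).
df = pd.read_csv("orders.csv", parse_dates=["order_date"])

# 3. Data cleaning: drop duplicates, remove fully empty rows,
#    and fill missing numeric values with each column's mean.
df = df.drop_duplicates().dropna(how="all")
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# 4. Data exploration: summary statistics and simple relationships.
print(df.describe())                # distribution of each numeric column
print(df[numeric_cols].corr())      # relationship between features
print(df.groupby(df["order_date"].dt.to_period("M"))["sales"].sum())  # monthly sales trend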
The tools used in Data Analytics
With the increasing demand for data analytics in the market, many tools with various
functionalities have emerged for this purpose. Ranging from open-source platforms to user-friendly
commercial products, the top tools in the data analytics market are as follows.
R programming – This tool is the leading analytics tool used for statistics and data modelling. R
compiles and runs on various platforms such as UNIX, Windows, and Mac OS. It also provides tools to
automatically install all packages as per user requirements.
Tableau Public – This is free software that connects to any data source, such as Excel or a corporate
data warehouse. It then creates visualizations, maps, dashboards, etc., with real-time updates on
the web.
QlikView – This tool offers in-memory data processing with the results delivered to the end-users
quickly. It also offers data association and data visualization with data being compressed to almost
10% of its original size.
SAS – A programming language and environment for data manipulation and analytics, this tool is
easily accessible and can analyse data from different sources.
Microsoft Excel – This tool is one of the most widely used tools for data analytics. Mostly used for
clients' internal data, it summarizes the data and previews it with pivot tables.
RapidMiner – A powerful, integrated platform that can connect to many data source types, such as
Access, Excel, Microsoft SQL, Teradata, Oracle, Sybase, etc. This tool is mostly used for predictive
analytics, such as data mining, text analytics, and machine learning.
KNIME – Konstanz Information Miner (KNIME) is an open-source data analytics platform, which
allows you to analyse and model data. With the benefit of visual programming, KNIME provides a
platform for reporting and integration through its modular data pipeline concept.
OpenRefine – Also known as Google Refine, this data cleaning software helps you clean up data
for analysis. It is used for cleaning messy data, transforming data, and parsing data from
websites.
Apache Spark – One of the most widely used large-scale data processing engines, this tool executes
applications in Hadoop clusters up to 100 times faster in memory and 10 times faster on disk. It is
also popular for data pipelines and machine learning model development.
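As an illustration of how Spark is commonly used for such workloads, here is a minimal PySpark sketch; it assumes the pyspark package is installed and uses a hypothetical sales.csv file with made-up column names.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-summary").getOrCreate()

# Load a hypothetical sales file; header and schema options are assumptions.
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Aggregate total sales per product in a distributed fashion.
summary = sales.groupBy("product").agg(F.sum("sales").alias("total_sales"))
summary.orderBy(F.desc("total_sales")).show()

spark.stop()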
Data analytics is used in almost every sector of business; let's discuss a few of them:
1. Retail: Data analytics helps retailers understand their customer needs and buying habits to
predict trends, recommend new products, and boost their business. It also helps them optimize the
supply chain and day-to-day retail operations.
2. Healthcare: Healthcare industries analyse patient data to provide lifesaving diagnoses and
treatment options. Data analytics help in discovering new drug development methods as well.
3. Manufacturing: Using data analytics, manufacturing sectors can discover new cost-saving
opportunities. They can solve complex supply chain issues, labour constraints, and equipment
breakdowns.
4. Banking sector: Banking and financial institutions use analytics to find probable loan
defaulters and the customer churn rate. It also helps in detecting fraudulent transactions
immediately.
5. Logistics: Logistics companies use data analytics to develop new business models and optimize
routes. This, in turn, ensures that the delivery reaches on time in a cost-efficient manner.
Need for Business Modelling
Using big data as a fundamental factor in decision making requires new capabilities, and most firms
are still far from accessing all of their data resources. Companies in various sectors have acquired
crucial insight from the structured data collected from different enterprise systems and analysed by
commercial database management systems. For example:
1.) Facebook and Twitter are used to gauge the instantaneous influence of campaigns and to examine
consumer opinion about products.
2.) Some companies, such as Amazon, eBay, and Google, regarded as early leaders, examine the
factors that control performance to determine what raises sales revenue and user interactivity.
Hadoop is an open-source software platform that enables the processing of large data sets in a
distributed computing environment. Work in this area discusses concepts related to big data and the
rules for building, organizing, and analysing huge data sets in the business environment; it proposes
a three-layer architecture and points to graphical tools for exploring and representing unstructured
data, and it describes how famous companies could improve their business. For example, Google,
Twitter, and Facebook focus on processing big data within the cloud environment.
The Map() step: Each worker node applies the Map() function to the local data and writes the output
to a temporary storage space. The Map() code is run exactly once for each K1 key value, generating
output that is organized by key values K2. A master node arranges it so that for redundant copies of
input data only one is processed.
The Shuffle() step: The map output is sent to the reduce processors, which assign the K2 key value
that each processor should work on, and provide that processor with all of the map-generated data
associated with that key value, such that all data belonging to one key are located on the same
worker node.
The Reduce() step: Worker nodes process each group of output data (per key) in parallel, executing
the user-provided Reduce() code; each function is run exactly once for each K2 key value produced
by the map step.
Produce the final output: The MapReduce system collects all of the reduce outputs and sorts them by
K2 to produce the final outcome.
Fig. 2.4 shows the classical "word count problem" using the MapReduce paradigm. As shown in
Fig. 2.4, initially a process splits the data into a subset of chunks that will later be processed by the
mappers. Once the key/value pairs are generated by the mappers, a shuffling process is used to mix
(combine) these key values, placing the same keys on the same worker node. Finally, the reduce
functions are used to count the words, generating a common output as the result of the algorithm. As
a result of the execution of mappers/reducers, the output is a sorted list of word counts from the
original text input.
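To make the flow concrete, the sketch below simulates the Map(), Shuffle(), and Reduce() steps of the word count problem in plain Python. It is a single-process illustration of the paradigm, not a distributed implementation, and the input text is made up.

from collections import defaultdict

# Made-up input, split into chunks as a real system would split the file.
chunks = ["deer bear river", "car car river", "deer car bear"]

# Map() step: emit (word, 1) pairs for every word in each chunk.
mapped = []
for chunk in chunks:
    for word in chunk.split():
        mapped.append((word, 1))

# Shuffle() step: group all values belonging to the same key together.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce() step: run the reduce function once per key (here, a sum).
counts = {key: sum(values) for key, values in groups.items()}

# Final output: a sorted list of word counts, as in Fig. 2.4.
for word in sorted(counts):
    print(word, counts[word])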
IBM and Microsoft are prominent representatives. IBM offers many big data options that enable
users to store, manage, and analyse data through various resources; it performs well in the business
intelligence and healthcare areas. Compared with IBM, Microsoft has also shown powerful work in
the area of cloud computing activities and techniques. Another example is Facebook and Twitter,
which collect various data from users' profiles and use it to increase their revenue.
Big data analytics and business intelligence are closely related fields that have become widely
significant in business and academia; companies are constantly trying to derive insight from data
growing along the three V's (variety, volume, and velocity) to support decision making.
Databases & Types of Data and variables
Database Management System: DBMS is a software or set of Programs used to define, construct and
manipulate the data.
Relational Database Management System: RDBMS is a software system used to maintain relational
databases. Many relational database systems provide the option of using SQL to query and maintain
the data.
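As a small illustration of the define/construct/manipulate cycle, the sketch below uses Python's built-in sqlite3 module as an example RDBMS; the table and its columns are invented for this example.

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Define: create a relation with a fixed schema.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Construct: insert rows.
cur.executemany("INSERT INTO customers (name, city) VALUES (?, ?)",
                [("Asha", "Pune"), ("Ravi", "Delhi")])

# Manipulate: query the data with SQL.
cur.execute("SELECT name FROM customers WHERE city = ?", ("Pune",))
print(cur.fetchall())

conn.commit()
conn.close()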
NoSQL
NoSQL Database is a non-relational Data Management System, that does not require a fixed schema.
It avoids joins, and is easy to scale. The major purpose of using a NoSQL database is for distributed
data stores with humongous data storage needs. NoSQL is used for Big data and real-time web apps.
For example, companies like Twitter, Facebook and Google collect terabytes of user data every single
day.
NoSQL stands for "Not Only SQL" or "Not SQL." Though a better term would be "NoREL",
NoSQL caught on. Carlo Strozzi introduced the NoSQL concept in 1998.
Traditional RDBMS uses SQL syntax to store and retrieve data for further insights. Instead, a NoSQL
database system encompasses a wide range of database technologies that can store structured,
semi-structured, unstructured and polymorphic data.
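To illustrate what the absence of a fixed schema means in practice, the sketch below represents two user records with different fields as JSON-style documents (plain Python dictionaries here; a real document store would hold them in a collection).

import json

# Two "documents" describing users; the fields are not identical, which a
# fixed relational schema would only allow via NULL-filled columns.
users = [
    {"_id": 1, "name": "Asha", "interests": ["cricket", "movies"]},
    {"_id": 2, "name": "Ravi", "email": "ravi@example.com",
     "last_login": "2024-01-15"},
]

# A document store persists each record as-is; here we just serialize to JSON.
for doc in users:
    print(json.dumps(doc))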
Variables:
Data consist of individuals and variables that give us information about those individuals. An
individual can be an object or a person. A variable is an attribute, such as a measurement or a label.
Variables fall into two broad types:
1. Quantitative data
2. Categorical data
Quantitative Variables: Quantitative data contains numerical values that can be added, subtracted,
divided, etc.
Categorical Variables: Categorical data contains labels or names that place each individual into one
of several groups or categories.
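A small pandas sketch (the individuals and columns are made up) showing how quantitative and categorical variables behave differently in a dataset:

import pandas as pd

# Made-up individuals with categorical and quantitative variables.
people = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena"],        # label (identifier)
    "city": ["Pune", "Delhi", "Chennai"],     # categorical variable
    "age": [34, 29, 41],                      # quantitative variable
    "income": [52000.0, 61000.0, 58000.0],    # quantitative variable
})

print(people.dtypes)                      # object columns vs numeric columns
print(people[["age", "income"]].mean())   # arithmetic only makes sense on quantitative data
print(people["city"].value_counts())      # categorical data is summarized by counts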
Missing Imputations:
1. MCAR
Data which is Missing Completely At Random has nothing systematic about which observations are
missing values. There is no relationship between missingness and either observed or unobserved
covariates.
2. MAR
Missing At Random is weaker than MCAR. The missingness is still random, but due entirely to observed
variables. For example, those from a lower socioeconomic status may be less willing to provide salary
information (but we know their SES status). The key is that the missingness is not due to the values
which are not observed. MCAR implies MAR but not vice-versa.
3. MNAR
If the data are Missing Not At Random, then the missingness depends on the values of the missing
data. Censored data falls into this category. For example, individuals who are heavier are less likely to
report their weight. Another example, the device measuring some response can only measure values
above .5. Anything below that is missing.
1. Ignore the tuple: This is usually done when the class label is missing (assuming the mining task
involves classification). This method is not very effective, unless the tuple contains several attributes
with missing values. It is especially poor when the percentage of missing values per attribute varies
considerably.
2. Fill in the missing value manually: In general, this approach is time-consuming and may not be
feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same
constant, such as a label like “Unknown” or -∞. If missing values are replaced by, say, “Unknown,”
then the mining program may mistakenly think that they form an interesting concept, since they all
have a value in common-that of “Unknown.” Hence, although this method is simple, it is not
foolproof.
4. Use the attribute mean to fill in the missing value: Take the average value of that particular
attribute and use it to replace the missing values in that attribute column (see the sketch after this
list).
5. Use the attribute mean for all samples belonging to the same class as the given tuple:
For example, if classifying customers according to credit risk, replace the missing value with the
average income value for customers in the same credit risk category as that of the given tuple.
6. Use the most probable value to fill in the missing value: This may be determined with regression,
inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the
other customer attributes in your data set, you may construct a decision tree to predict the missing
values for income.
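The pandas sketch below illustrates strategies 3, 4, and 5 from the list above; the credit_risk, income, and segment columns are made up for this example.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income": [52000.0, np.nan, 31000.0, np.nan, 48000.0],
    "segment": ["A", None, "B", "B", None],
})

# 3. Global constant: replace missing categorical values with "Unknown".
df["segment"] = df["segment"].fillna("Unknown")

# 4. Attribute mean: replace missing income with the overall mean income.
df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

# 5. Class-conditional mean: replace missing income with the mean income
#    of tuples in the same credit-risk class.
df["income_class_filled"] = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)

print(df)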
Network model
Entity-relationship model