Unit - 1 Notes - Introduction to Data Analytics
Categorical data can be divided into two types:
1. Nominal
2. Ordinal
Nominal Data Type
• These are sets of values that don’t possess a natural ordering.
• For example, the color of a smartphone can be considered a nominal data type, as we can’t compare one color with another.
• It is not possible to state that ‘Red’ is greater than ‘Blue’.
• Another example is the gender of a person, where we can’t rank male, female, or others.
Ordinal Data Type
• These types of values have a natural ordering while maintaining their class of values.
• If we consider the sizes offered by a clothing brand, we can easily sort them according to their name tags in the order small < medium < large.
• The grading system used to mark candidates in a test can also be considered an ordinal data type, where A+ is definitely better than a B grade.
Encoding Categorical Data
• These categories help us decide which encoding strategy can be applied to which type of data.
• For the ordinal data type, label encoding can be applied, which is a form of integer encoding.
One-Hot Encoding Scheme
• In one-hot encoding, we create a new variable for each level of a categorical feature.
• Each category is mapped to a binary variable containing either 0 or 1, as in the sketch below.
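A minimal sketch of one-hot encoding using pandas (the `color` column and its values are made-up examples; `pd.get_dummies` is a standard pandas function):

```python
import pandas as pd

# Made-up nominal feature: smartphone colors (no natural ordering).
df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Red"]})

# Each level of 'color' becomes its own 0/1 column.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)
print(one_hot)
#    color_Blue  color_Green  color_Red
# 0           0            0          1
# 1           1            0          0
# 2           0            1          0
# 3           0            0          1
```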
Ordinal Encoding
• An ordinal encoding involves mapping each unique label to an integer value (see the sketch below).
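A minimal sketch of ordinal encoding using the clothing-size example from above; the hand-written mapping preserves the natural order of the labels:

```python
sizes = ["small", "medium", "large", "medium"]

# Map each unique label to an integer that respects small < medium < large.
order = {"small": 0, "medium": 1, "large": 2}
encoded = [order[s] for s in sizes]
print(encoded)  # [0, 1, 2, 1]
```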
Numerical data can be divided into two types:
1. Discrete
2. Continuous
Discrete Data Type
• Numerical values that are integers or whole numbers are placed under this category.
Data can be collected from two types of sources:
1. Internal Source
2. External Source
Internal Source
• When data are collected from reports and
records of the organization itself, it is known
as the internal source.
Based on structure, data can be classified as:
• Structured data
• Semi-structured data
• Quasi-structured data
• Unstructured data
Structured Data
• It is the data containing a defined data type,
format, and structure.
• Many Big Data solutions and tools have the ability to ‘read’
and process either JSON or XML.
• And its volumes are growing rapidly, many times faster than the rate of growth of structured databases.
This shows that this kind of data analytics application can make cities safer without police officers putting their lives at risk.
Applications of Data Analytics
Transportation
• A few years back, at the London Olympics, there was a need to handle over 18 million journeys made by fans in the city of London, and fortunately, it was sorted out.
• How was this feat achieved? TfL (Transport for London) and the train operators made use of data analytics to ensure the large number of journeys went smoothly.
• They were able to input data from events that took place and forecast the number of people who were going to travel; transport was run efficiently and effectively so that athletes and spectators could be transported to and from the respective stadiums.
Applications of Data Analytics
• Managing risk in the insurance industry.
• Web Provisioning: the key component of this is being able to shift bandwidth to the right time and location. This can only be achieved by the use of data.
• A study carried out recently showed that a lack of investment in technology was the cause of customer dissatisfaction among the present generation of insurance customers, because they prefer using mobile and online channels, social media, and other recent mediums to interact with their agents.
• However, the older generation still prefers the use of the telephone.
• The need to sort, organize, analyze, and present this critical data in a systematic manner has led to the rise of Big Data.
What is Big Data?
• The term Big Data refers to all the data that is being
generated across the globe at an unprecedented
rate.
• Big data platforms are also delivered through the cloud, where the provider offers all-inclusive big data solutions and services.
What is Hadoop?
• Apache Hadoop is an open source framework that is used to
efficiently store and process large datasets ranging in size
from gigabytes to petabytes of data.
• Velocity: the rate at which new data is being generated, thanks to our dependence on the internet, sensors, and machine-to-machine data; it is also important to parse Big Data in a timely manner.
Approaches for processing Big Data include:
– Cloud Computing
– Grid Computing
– MapReduce
Massively Parallel Processing Systems (MPP)
• MPP (massively parallel processing) is the
coordinated processing of a program by multiple
processors.
• These processors work on different parts of the program, with each processor using its own operating system and memory.
• Typically, MPP processors communicate using
some messaging interface.
• In some implementations, up to 200 or more
processors can work on the same application.
Handling of Big Data using MPP
• Big data is split into many parts, and the processors work in parallel on each part of the data, as in the sketch below.
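A minimal sketch of the MPP idea using Python’s multiprocessing module: the data is split into parts, each worker process (with its own memory) handles one part, and the partial results are then combined. The data and the summing task are made up for illustration:

```python
from multiprocessing import Pool

def process_part(part):
    # Each worker handles its own slice of the data independently.
    return sum(part)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_parts = 4
    chunk = len(data) // n_parts
    parts = [data[i * chunk:(i + 1) * chunk] for i in range(n_parts)]

    # Four processes work on the four parts in parallel.
    with Pool(n_parts) as pool:
        partial_sums = pool.map(process_part, parts)

    print(sum(partial_sums))  # same answer as sum(data)
```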
Grid computing
• Uses multiple computers in distributed networks.
• This type of architecture uses resources opportunistically, based on their availability.
• This architecture reduces costs for server space, but also limits bandwidth and capacity at peak times or when there are too many requests.
Computer clustering
• Links the available computing power into nodes that can connect with each other to handle multiple tasks at once.
MapReduce
• MapReduce is now the most widely used general-purpose computing model and run-time system for distributed data analytics (a word-count sketch follows below).
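A minimal sketch of the MapReduce model in plain Python: the classic word count. The map phase emits (word, 1) pairs, the pairs are grouped by key, and the reduce phase sums the counts; a real MapReduce run-time such as Hadoop would distribute these phases across many machines:

```python
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Group the pairs by key and sum the counts per word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big ideas", "data analytics"]
pairs = [p for d in docs for p in map_phase(d)]
print(reduce_phase(pairs))
# {'big': 2, 'data': 2, 'ideas': 1, 'analytics': 1}
```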
Tableau
• Users can easily create interactive graphs, maps, and live dashboards in minutes.
• No coding is required.
• Tableau’s Big Data capabilities make it important, and one can analyze and visualize data better than with any other data visualization software in the market.
A few examples of Tableau output (figures omitted).
Python
• Python is an object-oriented scripting language which is easy to read, write, and maintain, and it is a free, open-source tool.
• It was developed by Guido van Rossum in the late 1980s and supports both functional and structured programming methods.
• Python is easy to learn as it is very similar to JavaScript, Ruby, and PHP.
• Also, Python has very good machine learning libraries, viz. Scikit-learn, Theano, TensorFlow, and Keras (a small sketch follows below).
• Another important feature of Python is that it can be used on almost any platform and with data sources such as an SQL server, a MongoDB database, or JSON.
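A minimal sketch of machine learning in Python with Scikit-learn (mentioned above); the iris dataset and the logistic-regression model are standard library examples chosen for illustration, not specific to these notes:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load a built-in example dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a simple classifier and report accuracy on unseen data.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))
```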
SAS (Statistical Analysis System)
• SAS is a programming environment and language for data
manipulation and a leader in analytics.
• SAS was first developed in 1966 and was further developed in the 1980s and 1990s.
• SAS is easily accessible and manageable and can analyze data from any source.
• SAS introduced a large set of products in 2011 for customer intelligence, along with numerous SAS modules for web, social media, and marketing analytics that are widely used for profiling customers and prospects.
• It can also predict their behaviors, manage, and optimize
communications.
KNIME (Konstanz Information Miner)
• KNIME is a free and open-source data analytics, reporting and
integration platform.
• KNIME allows you to analyze and model data through visual programming.
Apache Spark
• The University of California, Berkeley’s AMPLab developed Apache Spark in 2009.
• Apache Spark is a data processing engine that executes applications in Hadoop clusters 100 times faster in memory and 10 times faster on disk.
• Spark makes concepts of data science effortless.
• Spark is also popular for building data pipelines and developing machine learning models.
• Spark also includes a library, MLlib, that provides a progressive set of machine learning algorithms for repetitive data science techniques like classification, regression, collaborative filtering, clustering, etc. (a small sketch follows below).
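A minimal PySpark sketch of the in-memory processing style described above (it assumes pyspark is installed and a local Spark is available; the data and the squaring task are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sketch").getOrCreate()

# Distribute some data and cache the transformed result in memory;
# keeping intermediate data in memory across repeated actions is
# where Spark's speed advantage over disk-based MapReduce comes from.
rdd = spark.sparkContext.parallelize(range(1, 101))
squares = rdd.map(lambda x: x * x).cache()

print(squares.sum())    # first action: computes and caches
print(squares.count())  # second action: served from memory

spark.stop()
```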
RapidMiner
• RapidMiner is a powerful integrated data science platform.
• It is developed to perform predictive analysis and other
advanced analytics like data mining, text analytics, machine
learning and visual analytics without any programming.
• RapidMiner can integrate with many data source types, including Access, Excel, Microsoft SQL, Teradata, Oracle, Sybase, IBM DB2, Ingres, MySQL, IBM SPSS, dBase, etc.
• The tool is very powerful and can generate analytics based on real-life data transformation settings, i.e., you can control the formats and data sets for predictive analysis.
QlikView
• QlikView has many unique features, like in-memory data processing, which delivers results to end users very fast and stores the data in the report itself.