Insights Into Big Data: An Industrial Perspective
Monika R
Data Analyst, Customer Success
[email protected]
What is Analytics?!
• Vast amounts of data: big data technologies manage huge volumes of data.
• Insights: unstructured and semi-structured data can yield better insights.
• Decision-making: proper risk analysis helps mitigate risk and make smart decisions.
Everywhere in Every Domain
• Web
• Retail
• E-Commerce
• Medical
• Financial
• Insurance
• Telecom
• Banking
• Travel & Hospitality
Types of Data across Industries
• Medical, Healthcare and Life Sciences
• Automobile and Manufacturing
• Travel and Hospitality
• Retail and Ecommerce
• Web, Social Media and Digital
• Media
• Telecommunication
• Banking, Finance and Insurance
• Energy
• Sports, Media and Entertainment
• Niche areas like autonomous driving, image, video, etc.
Big Data and its Characteristics
• Big Data is about the 4 Vs, commonly cited as Volume, Velocity, Variety, and Veracity.
Velocity is the game changer: it is not just how fast data is produced or changed,
but the speed at which it must be received, understood, and processed.
Big Data Challenges
• Part of how big data earned the label “big” is that it became too much
for traditional systems to handle.
• Need: an infrastructure on which to run all the other analytics tools, as well
as a place to store and query the data.
• Evolution of NoSQL
• A typical big data storage architecture:
• Direct attached storage pools (Scalable and redundant)
• Clustered network attached storage.
CAP Theorem
• Consistency: all nodes see the same data at the same time.
• Availability: every request gets a response indicating success or failure.
• Partition tolerance: the system continues to work despite message loss or
partial failure.
Data Storage Tools:
Data Pre-Processing
• The three steps carried out in a data pre-processing scenario are:
• Data Cleaning
• Data Transformation
• Data Reduction
Why Data Cleaning?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data
• e.g., occupation=“”
noisy: containing errors or outliers (spelling, phonetic and typing errors, word
transpositions, multiple values in a single free-form field)
• e.g., Salary=“-10”
inconsistent: containing discrepancies in codes or names (synonyms and nicknames,
prefix and suffix variations, abbreviations, truncation and initials)
• e.g., Age=“42” Birthday=“03/07/1997”
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records
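The three kinds of dirty data above can be flagged programmatically. The sketch below uses pandas on invented records; the column names and the reference year 2024 are illustrative assumptions, not from the slides.

```python
import pandas as pd

# Hypothetical customer records illustrating incomplete, noisy, and
# inconsistent values (columns and values are made up for illustration).
df = pd.DataFrame({
    "occupation": ["engineer", "", "teacher"],   # incomplete: empty string
    "salary":     [55000, -10, 48000],           # noisy: impossible negative value
    "age":        [42, 35, 29],
    "birth_year": [1997, 1989, 1995],            # row 0 disagrees with its age
})

# Flag each problem rather than silently fixing it.
incomplete   = df["occupation"].str.strip() == ""
noisy        = df["salary"] < 0
inconsistent = (2024 - df["birth_year"]) != df["age"]  # assumes year 2024

print(df[incomplete | noisy | inconsistent].index.tolist())  # [0, 1]
```

Flagging first and fixing later keeps the cleaning decisions auditable.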
Why is Data Dirty?
• Incomplete data comes from:
• non available data value when collected
• different criteria between the time when the data was collected and when it is analyzed.
• human/hardware/software problems
• Noisy data comes from:
• data collection: faulty instruments
• data entry: human or computer errors
• data transmission
• Inconsistent (and redundant) data comes from:
• different data sources, with non-uniform naming conventions and data codes
• functional dependency and/or referential integrity violations
Steps in Data Cleaning
• The following steps are followed in cleaning big data:
• Fill in missing values
• Unify the date format
• Convert nominal data to numeric
• Identify and remove outliers and noisy data
• Maintain metadata
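The cleaning steps above can be sketched in pandas. This is a minimal illustration on invented data; the column names, the modal-value imputation, and the 1.5×IQR outlier rule are assumptions chosen for the example, not the only options.

```python
import pandas as pd

# Hypothetical raw records; column names are illustrative only.
df = pd.DataFrame({
    "rating": ["A", "B", None, "A"],   # nominal, with a missing value
    "joined": ["2021-03-07", "2021/04/15", "2021/05/02", "2021-06-30"],
    "spend":  [120.0, 95.0, 110.0, 9000.0],  # 9000 looks like noise
})

# 1. Fill in missing values (here, with the most frequent rating).
df["rating"] = df["rating"].fillna(df["rating"].mode()[0])

# 2. Unify the date format: normalise separators, then parse.
df["joined"] = pd.to_datetime(df["joined"].str.replace("/", "-"))

# 3. Convert nominal data to numeric ("A, B, C" -> 3, 2, 1).
df["rating_num"] = df["rating"].map({"A": 3, "B": 2, "C": 1})

# 4. Identify and remove outliers with the 1.5*IQR rule.
q1, q3 = df["spend"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["spend"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(len(df))  # prints 3: the 9000.0 row is dropped
```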
Data Transformation
Data transformation routines convert the data into forms appropriate for
mining. The main routines are:
• Smoothing: This uses binning, regression, and clustering to remove noise from
the data
• Attribute construction: In this routine, new attributes are constructed and
added from the given set of attributes
• Aggregation: In this routine, summary or aggregation operations are performed
on the data
• Normalization: Here, the attribute data is scaled so as to fall within a smaller
range
• Discretization: In this routine, the raw values of a numeric attribute are replaced
by interval labels or concept labels
• Concept hierarchy generation for nominal data: Here, attributes can be
generalized to higher level concepts
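Two of the routines above, normalization and discretization, can be shown in a few lines. The values, bin edges, and labels below are invented for illustration.

```python
import pandas as pd

# Made-up numeric attribute (e.g. customer ages).
ages = pd.Series([18, 25, 40, 52, 67])

# Normalization: scale the attribute into the smaller range [0, 1].
norm = (ages - ages.min()) / (ages.max() - ages.min())

# Discretization: replace raw values with interval/concept labels.
labels = pd.cut(ages, bins=[0, 30, 55, 100],
                labels=["young", "middle", "senior"])

print(norm.round(2).tolist())  # [0.0, 0.14, 0.45, 0.69, 1.0]
print(labels.tolist())         # ['young', 'young', 'middle', 'middle', 'senior']
```

The min-max formula used here is one common normalization; z-score scaling is another frequent choice.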
Data Reduction
• Reducing massive volumes of data down to their meaningful parts.
• Data De-Duplication
• Sampling
• Feature Selection
• Dimensionality Reduction
• Data compression reduces the size of a file by removing redundant
information, so that less disk space is required.
• Archiving data also reduces data on storage systems, but the approach is
quite different.
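Two of the reduction techniques listed above, de-duplication and sampling, reduce to one call each in pandas. The data and the 50% sampling fraction are invented for illustration.

```python
import pandas as pd

# Hypothetical records with exact duplicates.
df = pd.DataFrame({
    "customer": ["ann", "bob", "ann", "cat", "bob", "dee"],
    "city":     ["NY",  "LA",  "NY",  "SF",  "LA",  "NY"],
})

# Data de-duplication: drop exact repeat records.
deduped = df.drop_duplicates()

# Sampling: keep a random 50% of the remaining rows
# (random_state fixed only to make the example reproducible).
sample = deduped.sample(frac=0.5, random_state=42)

print(len(df), len(deduped), len(sample))  # 6 4 2
```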
What is Data Mining?
• Data mining (knowledge discovery from data)
• Extraction of interesting (non-trivial, implicit, previously unknown and potentially
useful) patterns or knowledge from huge amounts of data
• Data mining: a misnomer?
• Alternative names
• Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging, information
harvesting, business intelligence, etc.
Data Mining Functionalities
-What kind of patterns can be mined?
• Concept/Class Description: Characterization and Discrimination
• Data can be associated with classes or concepts.
• E.g. classes of items – computers, printers, …
concepts of customers – bigSpenders, budgetSpenders, …
• How to describe these items or concepts?
• Descriptions can be derived via
• Data characterization – summarizing the general characteristics of a
target class of data.
• E.g. summarizing the characteristics of customers who spend more than $1,000 a year
at AllElectronics. Result can be a general profile of the customers, such as 40 – 50 years old,
employed, have excellent credit ratings.
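Data characterization of the kind in the AllElectronics example can be sketched as a filter plus a summary. The records and the $1,000 threshold below mirror the slide's example, but the data itself is invented.

```python
import pandas as pd

# Hypothetical customer data for a characterization query.
df = pd.DataFrame({
    "age":          [44, 23, 47, 31, 41],
    "employed":     [True, False, True, True, True],
    "yearly_spend": [1500, 300, 2200, 800, 1250],
})

# Target class: customers who spend more than $1,000 a year.
big_spenders = df[df["yearly_spend"] > 1000]

# A general profile of the target class: age range and employment share.
profile = {
    "age_min": int(big_spenders["age"].min()),
    "age_max": int(big_spenders["age"].max()),
    "pct_employed": float(big_spenders["employed"].mean()),
}
print(profile)  # {'age_min': 41, 'age_max': 47, 'pct_employed': 1.0}
```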
Data Mining Functionalities
-What kind of patterns can be mined?
• Data discrimination – comparing the target class with one or a set of
comparative classes
• E.g. compare the general features of software products whose sales increased by 10% in the last year with those
whose sales decreased by 30% during the same period
• Or both of the above
• Prediction
• Predict missing or unavailable numerical data values
Marketing Analytics
Customer Behavior Analytics
Data Mining Functionalities
• Outlier Analysis
• Data that do not comply with the general behavior or model of the data.
• Outliers are usually discarded as noise or exceptions.
• Useful for fraud detection.
• E.g. Detect purchases of extremely large amounts
• Evolution Analysis
• Describes and models regularities or trends for objects whose
behavior changes over time.
• E.g. Identify stock evolution regularities for overall stocks and for the stocks of
particular companies.
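The fraud-detection use of outlier analysis, flagging extremely large purchases, can be sketched with a robust score based on the median absolute deviation (MAD). The purchase amounts and the 3.5 cutoff are illustrative assumptions; the cutoff is a commonly used MAD-score threshold, not a rule from the slides.

```python
import numpy as np

# Hypothetical purchase amounts; the last one is an extreme outlier.
amounts = np.array([23.0, 41.0, 35.0, 28.0, 30.0, 39.0, 26.0, 9800.0])

median = np.median(amounts)
mad = np.median(np.abs(amounts - median))   # robust estimate of spread
score = np.abs(amounts - median) / mad      # robust z-like score per purchase

flagged = amounts[score > 3.5]              # commonly used MAD-score cutoff
print(flagged)  # [9800.]
```

MAD is preferred over the plain standard deviation here because a single huge value inflates the standard deviation and can mask the very outlier being hunted.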
Data Visualization:
• Visualization makes the data come to life.
• Visualizations are a vivid, accessible way to convey complex data insights, and
most tools require no coding.
• The goal is to communicate information clearly and efficiently to users.
Reporting : Story Telling
• Data scientists need to be able to influence.
• Data and insights that can shape the direction of a business should be
presented clearly and in an interesting manner.
Visualization Tools Used:
The Magical Words
• Data Analytics, Data Analysis, Data
Mining, Data Science
• Let’s think about the data available to a farmer; here’s a simplified
breakdown:
1. Historic weather patterns
2. Plant breeding data and productivity for each strain
3. Fertilizer specifications
4. Pesticide specifications
5. Soil productivity data
6. Pest cycle data
7. Machinery cost, reliability, fault and cost data
8. Water supply data
9. Historic supply and demand data
10. Market spot price and futures data
Specialized areas
● Financial analytics
● Retail analytics
● Market analytics
● Social media
● HR Analytics
● Customer analytics
● Pricing analytics