Data Analyst – IBM Certification
Week 1:
Modern Data Ecosystem:
A modern data ecosystem includes an entire network of entities:
Interconnected
Independent
Continuously evolving
1. Includes data that must be integrated from disparate sources
2. Different types of analysis and knowledge to generate understanding
3. Active stakeholders to collaborate and act on insights generated
4. Tools, applications and infrastructure to store, process and disseminate data as
needed.
Data sources:
They can be structured or unstructured
Can be found in:
Texts,
Videos
Images
Click sequences
User conversations
Social media platforms
Internet of Things (IoT)
Real-time events that transmit data
Legacy databases
Data obtained from data providers
Professional organizations
The first step in working with data sources is:
Copy the data from its original source into a repository; at this stage, you acquire
only the data you need.
Challenges:
Cloud computing
Big data
Machine learning
Key protagonists in the data ecosystem
Data is key to having a competitive advantage
Data Professionals:
Data Engineers
They develop and maintain data architecture and make data available for business
operations and analysis.
They work within the data ecosystem
Extract, integrate and organize data from different sources
Storytelling and data analysis
They turn raw data into usable data
Data Analysts:
Translate data and numbers into plain language; clean data and apply statistical methods
They use data to generate insights
Data Scientists
Use data analytics and data engineering to predict the future using the past
Business Analysts
They use insights and predictions to make business decisions
Business Intelligence Analysts
Defining data analysis
Data analysis is the process of collecting, cleaning, analyzing and mining data,
interpreting the results, and
reporting the findings.
We find patterns within the data and correlations between different data points,
and through these insights we act.
Types of analysis:
Descriptive analysis: What happened?
Diagnostic analysis: Why did it happen?
Predictive analysis: What will happen next?
Prescriptive analysis: What should we do about it?
The data analysis process
1. Understand the problem you want to solve: where are you now, and where would
you like to be?
2. Establish a clear goal: define what will be measured and how it will be measured
3. Data collection: identify the data you require and the tools to gather it
4. Clean the data: fix quality issues and standardize the data
5. Analyze and mine the data: manipulate the data to surface correlations and trends
6. Interpret the results: evaluate whether your analysis is defensible against objections or
has limitations
7. Present your findings: deliver an impactful presentation
Viewpoints: What is data analysis?
How professionals define it:
The process of collecting information and then analyzing it to confirm hypotheses
Storytelling with data
Using information to make decisions
Reading:
Data analysis vs. data analytics
The meanings in the dictionary are:
Analysis – detailed examination of the elements or structure of something
Analytics – The systematic computational analysis of data or statistics
Analysis can be done without numbers
Analytics needs data even if it is not used to perform numerical inference
The role of the data analyst
Depends on the type of organization and data management
Acquire data
Create queries to extract required data
Filter, clean, standardize and reorganize data for data analysis
Use statistical tools
Statistical techniques to identify patterns
Analyze patterns
Prepare reports
Create appropriate documentation
Skills
Techniques:
Experience using spreadsheets (Excel and Google Sheets)
Proficiency in statistical analysis and visualization tools and software such as IBM
Cognos, IBM SPSS, Oracle Visual Analyzer, Power BI, SAS and Tableau
Proficiency in programming: R, Python, C++, Java and MATLAB
Good knowledge of SQL and ability to work with data in relational and NoSQL
databases
Ability to access and extract data from data repositories: data marts, data
warehouses, data lakes and data pipelines.
Familiarity with Big data processing: Hadoop, Hive and Spark.
Functional:
Competence in statistics: Analyze, validate, identify fallacies and logical errors
Analytical skills: Research and interpret data, theorize, create forecasts
Problem solving skills: Coming up with possible solutions to a given problem
Probing Skills: Identify and define the state of the problem and the desired
resolution
Data visualization skills: Create clear presentations
Project management skill
Soft skills:
Data analysis is both a science and an art
Collaborative work
Effectively communicate presentations
Generate compelling stories
Above all, be curious about data analysis
Intuition is a must: recognizing patterns based on past experience
Week 2:
The data ecosystem and languages for data professionals
Data Analyst Ecosystem Overview
It includes infrastructure, software, tools, frameworks and processes for:
Collect, clean, mine and visualize data
Data:
Categorized in
Structured : rigid format that can be organized in columns and rows
Semi-structured: A mix of structured and unstructured data, example: emails
Unstructured : Qualitative information that cannot be reduced to columns and rows,
for example, photos, videos.
Sources to collect data from
Relational database
Non-relational database
APIs
Web Servers
Data Stream
Social platforms
Sensor devices
Data repositories
Databases
Data Warehouse
Data Marts
Data Lakes
Big data stores
The type, format and source of the data determine which repository is needed.
Languages
Query Languages
Example: SQL for querying and managing data
Programming languages
Example: Python for developing data applications
Shell and Scripting languages
For repetitive operations
Type of data
What is data?
Data is disorganized information that is processed to make it meaningful.
Data encompasses facts, observations, perceptions, numbers, characters, symbols, images that
can be interpreted to obtain meaning.
Categorize them by structure:
Structured: SQL databases, OLTP systems, spreadsheets, RFID, network and server logs
Semi-structured:
Emails
XML
Binaries
TCP/IP packets
Zipped files
Integrated data
Unstructured
Websites
Images
Social networks
Videos and audio
Documents
PowerPoint presentations
Surveys
Types of formats
Understand benefits and limitations to make good decisions.
Standard formats
CSV: comma-separated values; TSV: tab-separated values
XLSX: Microsoft Excel open XML spreadsheet
XML: extensible markup language
PDF: portable document format
JSON: JavaScript Object Notation
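A minimal sketch of loading these standard formats with pandas (the file names are hypothetical; assumes pandas 1.3+ plus openpyxl for .xlsx and lxml for XML):

```python
# Minimal sketch: reading the standard file formats listed above with
# pandas. File names are hypothetical; assumes pandas 1.3+ with the
# openpyxl (.xlsx) and lxml (XML) extras installed.
import pandas as pd

df_csv = pd.read_csv("sales.csv")             # comma-separated values
df_tsv = pd.read_csv("sales.tsv", sep="\t")   # tab-separated values
df_xlsx = pd.read_excel("sales.xlsx")         # Excel workbook
df_xml = pd.read_xml("sales.xml")             # XML document
df_json = pd.read_json("sales.json")          # JSON records

print(df_csv.head())  # quick look at the first few rows
```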
Data sources
Relational databases: SQL Server, Oracle, MySQL, IBM Db2
Flat files, spreadsheets and XML datasets; external demographic and economic
data; point-of-sale data
Google Sheets, Apple Numbers
APIs (application programming interfaces): e.g., Twitter and Facebook
Web scraping: extract web data such as product details from retailers and
e-commerce sites, public data, and data for machine learning models
Tools: Scrapy, Pandas, Selenium, BeautifulSoup
Data streams and feeds: IoT devices, GPS data, computer programs; records carry a
timestamp and geolocation
Tools: Apache Kafka, Apache Storm
RSS feeds: to collect forum data
Languages for professionals
Query languages:
Designed to access and manipulate data in databases, for example:
SQL: INSERT, UPDATE, SELECT; largely platform independent
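A minimal sketch of those SQL statements, run through Python's built-in sqlite3 module (the table and column names are hypothetical):

```python
# Minimal sketch: INSERT, UPDATE and SELECT with the standard-library
# sqlite3 module. The table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Ana", "Lima"))
cur.execute("UPDATE customers SET city = ? WHERE name = ?", ("Quito", "Ana"))

for row in cur.execute("SELECT id, name, city FROM customers"):
    print(row)  # -> (1, 'Ana', 'Quito')

conn.close()
```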
Programming languages: used to develop data applications; examples: Python, R, Java
Python libraries:
Pandas to clean and analyze data
NumPy and SciPy for statistical analysis
BeautifulSoup and Scrapy for web scraping and data collection
Matplotlib and Seaborn for visualizations such as bar charts, histograms and
pie charts
Shell scripting for repetitive operational tasks: Unix/Linux shell, PowerShell
OpenCV for image processing
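A minimal sketch of the Pandas/NumPy clean-and-analyze workflow named above (the data and column names are made up):

```python
# Minimal sketch: cleaning and analyzing a toy dataset with Pandas and
# NumPy. The data and column names are made up for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "south", "north", "south", None],
    "sales": [120.0, 95.5, np.nan, 88.0, 101.0],
})

df = df.dropna(subset=["region"])                     # drop rows missing a region
df["sales"] = df["sales"].fillna(df["sales"].mean())  # impute missing sales

print(df.groupby("region")["sales"].agg(["mean", "sum"]))  # summary per region
```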
Understanding data repositories and big data platforms
Overview of data repositories
A repository collects and organizes data to be used for business operations or for
mining, reporting and analysis.
Different types of repositories:
Databases
A collection of data for data entry, storage, search, retrieval and modification.
A database management system (DBMS) provides the query functions.
Different types of databases
Datatype
Structure
Querying mechanism
Latency
Transaction speed
Intended use
RDBMS (relational): rows and columns with a well-defined structure; queried with SQL
Non-relational (NoSQL): free-form, built for speed, flexibility and scale; suited to
cloud computing, IoT and big data processing
Data Warehouse:
A central repository of data consolidated from different sources via extract, transform
and load (ETL); used for analytics and BI. Related repositories: data marts and data lakes.
Big data stores
Distributed computing and storage, e.g., data lakes.
Enables efficient and reliable archiving of large volumes of data
RDBMS (Relational Database Management System)
A relational database is a collection of data organized into related tables.
Rows are the records
Columns are the attributes: ID, name, address, phone number
This allows you to relate tables; understanding the relationships in the data leads to
better decisions.
Ideal for optimized storage, retrieval and processing of large amounts of data;
relationships can be defined between tables.
Each table is unique; with SQL you can search for specific data with great
consistency and integrity.
Millions of records can be retrieved in a short time.
Runs on everything from laptops to the cloud.
Options include open source, open source with commercial support, and
commercial closed source.
1. Relational databases: IBM Db2, MySQL, Oracle, PostgreSQL
2. In the cloud: Amazon RDS, Google Cloud SQL, IBM Db2 on Cloud, Azure SQL, Oracle Cloud
Advantages:
Create meaningful information by putting tables together
Flexibility
Minimize redundancy
Backup and recovery support
You can export the data
ACID compliance (atomicity, consistency, isolation, durability) guarantees accuracy
1. OLTP: online transaction processing applications
2. Data warehouses: optimized for OLAP (online analytical processing)
3. IoT: provides the speed and ability to collect data
Limitations:
Does not work for semi-structured and unstructured data
Data migration between two RDBMSs is straightforward only when they have
similar schemas
NoSQL
NoSQL ("not only SQL"): non-relational databases that provide flexible schemas for
storing and retrieving data
They gained popularity in the era of big data and high-volume web and mobile
applications.
Built for specific data models, making modern applications easier to build and manage
4 types of NoSQL
Key-value store: a collection of key-value pairs; the value can be a simple value or a
JSON document; used for user preferences and real-time recommendations (see the
sketch after this list)
Examples: Redis, Memcached, DynamoDB
Document-based: stores and retrieves each record as a single document; preferred
for e-commerce platforms, medical records and CRM
Examples: MongoDB, DocumentDB, CouchDB, Cloudant
Column-based: stores data in cells grouped into column families; access is fast;
suited to IoT and weather data that must be answered quickly
Examples: Cassandra, HBase
Graph-based: uses a graph model to represent data
Visualize and analyze data to find connections
Built to work with connected and interconnected data
Great for social networks, recommendations and access management
Examples: Neo4j, Cosmos DB
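A minimal key-value sketch using the redis-py client, as referenced in the key-value item above (assumes a Redis server on localhost:6379 and the redis package installed; the key name is hypothetical):

```python
# Minimal sketch: key-value operations with the redis-py client.
# Assumes a Redis server at localhost:6379; the key is hypothetical.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

r.set("user:42:theme", "dark")   # store a user preference
r.expire("user:42:theme", 3600)  # optional one-hour TTL

print(r.get("user:42:theme"))    # -> "dark"
```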
Advantages:
Ability to handle large volumes of structured, semi-structured and unstructured data
Ability to run on distributed systems
Take advantage of the cloud
Simple design with fine-grained control; flexible and agile, allowing faster iteration
Differences:
RDBMS: rigid schemas, more expensive, ACID compliance
Non-relational databases: flexible, convenient and lower cost, no ACID compliance,
newer technology
Data Marts, Data Lake, ETL and Data Pipelines
Data warehouse: multi-purpose storage for analytics; the option when you have large amounts of data
Data marts: a sub-section of the data warehouse providing relevant data to a specific
business team, with isolated security
Specific information for analytics
Data lake: stores structured, semi-structured and unstructured data in its native
format; retains all data from its source systems; holds all types of data; used for
advanced and predictive analytics
ETL: Extract, transform and load
You identify and collect the data, clean it into a usable format for reporting; a
generic process for moving data
Extract (batch): tools such as Stitch and Blendo collect the data
Extract (stream processing): Samza, Kafka
Transform: standardize date formats, filter out data you don't need, segment it, and
remove duplicate data before it is transported
Load, initial: load the full dataset into the repository
Load, incremental: apply ongoing updates
Load verification: check for null values and load failures
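A minimal ETL sketch in Python/pandas covering the extract, transform, load and load-verification steps above (the file, table and column names are hypothetical):

```python
# Minimal ETL sketch with pandas + sqlite3: extract from a CSV,
# transform (standardize dates, deduplicate), load into a table, then
# verify the load. File, table and column names are hypothetical.
import sqlite3
import pandas as pd

# Extract
df = pd.read_csv("transactions.csv")

# Transform
df["date"] = pd.to_datetime(df["date"], errors="coerce")  # unify date formats
df = df.drop_duplicates()                                  # remove duplicate rows
df = df.dropna(subset=["date"])                            # drop unparseable dates

# Load, then a simple load verification
conn = sqlite3.connect("warehouse.db")
df.to_sql("transactions", conn, if_exists="replace", index=False)
loaded = pd.read_sql("SELECT COUNT(*) AS n FROM transactions", conn)
assert loaded["n"][0] == len(df), "row count mismatch after load"
conn.close()
```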
Data pipeline: moves data from one system to another, particularly data that needs
continuous updating; supports fast queries. Tools: Apache Kafka, DataFlow
Big data fundamentals
Foundations of big data
Everyone leaves a digital trace,
which records a lot of data.
Big data refers to the dynamic, large and disparate volumes of data being created
by people, tools and machines.
It needs to be collected to generate insights for business.
5 V of Big data:
Velocity: how fast data accumulates
Volume : Scale of data
Variety: diversity of data: structured, semi-structured and unstructured, coming from
different sources: machines, people, processes
Veracity: quality and origin of the data: completeness, integrity, consistency, ambiguity
Value: Turn data into value, it is not just profit, it can have more benefits for everyone.
Big data processing tools
They provide a way to work with structured, semi-structured and unstructured data
Open source:
Hadoop: stores and processes big data
Scales from single nodes to many nodes reliably and cost-effectively
Handles structured, semi-structured and unstructured data in real time
HDFS: big data storage
Fast recovery from hardware failures
Hive: a data warehouse for querying and analyzing big data
For reading, writing and managing large datasets
Read-based, so better for batch queries than transactional writes
Spark: a framework for complex, real-time data analytics
Handles volumes of data from a wide range of applications
Analytics
Machine learning
APIs in Java, Scala, Python, R and SQL
Complex analytics
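A minimal PySpark sketch of the Spark workflow above (assumes pyspark is installed; the file and column names are hypothetical):

```python
# Minimal sketch: reading and aggregating a CSV with PySpark.
# Assumes pyspark is installed; file and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("big-data-sketch").getOrCreate()

df = spark.read.csv("events.csv", header=True, inferSchema=True)

# Count events per category, computed across the cluster
df.groupBy("category").agg(F.count("*").alias("events")).show()

spark.stop()
```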
Week 3
Data collection:
Identifying data for analysis
Understand the problem and the desired outcome
Where are you and where do you want to be?
What will be measured
How will it be measured?
Identify the data you need
Step 1: Determine the data you want to collect
The specific information you need
Possible resources for this data
Step 2: Define a plan for collecting data
Define dependencies, risks and mitigations
The beginning and end of the data
Step 3 : Determine your data collection methods
Consider the data type, timeframe, volume and sources
Verify quality, security and privacy
Data needs to be error-free, accurate, accessible and measurable.
Data governance: security, regulation and compliance
Data privacy: confidentiality, license for use; checks, validations and an auditable
trail ensure compliance
Identifying the right data is very important
Data sources
They can be internal or external
Could be
Primary: directly from internal sources such as CRM, HR or workflow systems
Secondary: information from external sources such as databases, research articles
and financial records
Third-party: data purchased from aggregators who collect and sell data
Databases: Could be the three
Websites: publicly available data
Social media sites and interactive platforms: Facebook, Instagram, Google, YouTube;
both quantitative and qualitative data
Sensor data: Wearable devices, Smart buildings
Data Exchange: Voluntary sharing, organizations and governments
Surveys: Information from a select group of people
Census: Gathering household data
Interview: Opinion and experiences
Observation studies.
Today they are dynamic and diverse
How to collect and import data
Different methods for gathering data
SQL databases: extract information from relational databases
APIs: popular for extracting data from various sources
Also used for data validation
Web scraping: download specific data from web pages based on defined
parameters: texts, podcasts, videos, product details (see the sketch after this list)
RSS feeds: capture continuously updated data
Sensor data: IoT devices, GPS instruments
Data exchanges: facilitate the exchange of data between providers and consumers,
and provide data licenses
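A minimal web-scraping sketch with requests and BeautifulSoup, as referenced in the web-scraping item above (the URL, tag and CSS class are hypothetical; check a site's terms of service and robots.txt before scraping):

```python
# Minimal sketch: scraping product names from a page with requests +
# BeautifulSoup. The URL, tag and class are hypothetical.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/products", timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
for item in soup.find_all("span", class_="product-name"):
    print(item.get_text(strip=True))
```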
Raw data transformation
What is data wrangling?
Data wrangling (munging): an iterative process of transformation and validation that
produces credible and meaningful data
Data is collected from various sources
Wrangling covers the many tasks that prepare it for analysis
4 steps
Discover: Create plan to clean, structure and organize
Transformation: transform the data (see the sketch after these steps) by
structuring it: for relational data and APIs, change the schema; joins combine
columns and unions combine rows
normalizing or denormalizing it: normalizing cleans unused data and reduces
redundancy and inconsistency; denormalizing combines multiple tables into one
cleaning it: correct irregularities, incompleteness, biases, null values and outliers
enriching it: add data that makes your dataset more meaningful
Validation: check data quality; verify consistency, completeness and security
Publishing: Deliver project results
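A minimal pandas sketch of the transformation step above, structure (join), clean, and enrich (the tables and column names are made up):

```python
# Minimal data-wrangling sketch with pandas: structure (join), clean,
# and enrich. The tables and column names are made up.
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [50.0, 20.0, None]})
customers = pd.DataFrame({"customer_id": [1, 2], "country": ["PE", "MX"]})

# Structure: a join combines columns from two tables on a shared key
df = orders.merge(customers, on="customer_id", how="left")

# Clean: handle null values
df["amount"] = df["amount"].fillna(0)

# Enrich: add a derived column that makes the data more meaningful
df["high_value"] = df["amount"] > 30

print(df)
```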
Tools for data wrangling
Software and tools like
Spreadsheets (Excel Power Query, Google Sheets): help identify problems, clean and
transform data
OpenRefine: open-source tool to import and export data in TSV, CSV, XLS, XML and
JSON formats; cleans data, converts between formats, and can extend data with web
services; easy to use and learn
Google Dataprep: intelligent cloud service for structured and unstructured data; fully
managed and very easy to use; automatically detects schemas and anomalies
Watson Studio Refinery: available in IBM Watson Studio; lets you discover, clean and
transform data; detects data types and classifications automatically
Trifacta Wrangler: an interactive cloud tool to clean and transform data; takes messy
data, cleans it, and can export the results to Excel, Tableau and other tools
Python: Great libraries
Jupyter Notebooks: widely used open-source environment for data cleaning,
statistical analysis and visualization
NumPy: the most basic library; easy to use; multidimensional arrays, matrices and
mathematical functions
Pandas: Fast and easy analysis operations, helps prevent common errors
R: it also has libraries:
Dplyr: intuitive syntax for data manipulation
Data.table: helps aggregate large datasets
Jsonlite: JSON parser, useful for interacting with APIs
Data Cleaning
Data quality: low-quality data leads to weak analyses and unreliable results. Common issues:
Missing data
Inconsistent data
Incorrect data
Data cleaning workflow:
Inspection: Detect the different types of issues and errors
Validate your data against rules
Data profiling: profiling the data helps understand its structure, content and interrelationships
Cleaning: data type conversion, handling missing values, removing duplicate data,
fixing syntax errors; outliers should be examined to see whether they should be included
Verification: inspect again to check whether the cleaning was effective and accurate
in achieving the intended result.
All cleaning must be documented, including the reasons for changes.
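A minimal pandas sketch of the cleaning and verification steps above, deduplicate, flag outliers with the interquartile range for manual examination, then re-inspect (the data is made up):

```python
# Minimal cleaning-and-verification sketch with pandas: remove
# duplicates, flag IQR outliers for examination, then re-inspect.
import pandas as pd

df = pd.DataFrame({"price": [10.0, 10.0, 11.5, 9.8, 10.2, 9.9, 10.4, 500.0]})

df = df.drop_duplicates()  # remove duplicate rows

# Flag values outside 1.5x the interquartile range for manual review
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
df["outlier"] = (df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)

# Verification: inspect the result of the cleaning step
print(df)
print("outliers flagged:", int(df["outlier"].sum()))
```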
Data cleaning
How much work is involved in collecting, cleaning and preparing data?
A large proportion of an analyst's time.
Week 4: Data Analysis and Mining
Overview of Statistical Analysis
Statistics: Branch of mathematics, collecting, analyzing and interpreting data.
Make decisions based on data
Sample: Representative selection of the total population
Population: Group of people with similar characteristics
Statistical methods help ensure:
Data interpreted correctly
Create meaningful relationships
Two different types of statistics
Descriptive: Summarize information about the sample
Simple interpretation with graphics, make it easy to understand
Used to calculate:
Central tendency: mean, median and mode
Dispersion: Measure of variability, Variance, Standard deviation, range
Skewness: a measure of whether the distribution of values is symmetrical (see the
sketch after this section)
Inferential: generate inferences about the population from the sample
Hypothesis testing
Confidence intervals
Regression analysis
Software: SAS, SPSS, StatSoft
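A minimal pandas sketch of the descriptive measures above (the sample values are made up):

```python
# Minimal sketch: central tendency, dispersion and skewness with
# pandas. The sample values are made up.
import pandas as pd

sample = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

print("mean:", sample.mean())        # central tendency
print("median:", sample.median())
print("mode:", sample.mode().tolist())
print("variance:", sample.var())     # dispersion
print("std dev:", sample.std())
print("range:", sample.max() - sample.min())
print("skewness:", sample.skew())    # symmetry of the distribution
```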
What is Data Mining?
The process of extracting knowledge from data; it is at the heart of data analysis
Identify correlations, patterns, trends.
Patterns: Regularities or common things
Trends: Data set trend that changes over time
Data mining techniques:
Classification: assign items to categories by their attributes
Clustering: gather data points into groups (see the sketch after this list)
Anomaly detection: finding unusual patterns
Association Rule Mining: establish the relationship between two data events
Sequential patterns: trace a series of events that take place in sequence
Affinity grouping: discover co-occurrences in purchases
Decision trees: Classification model to understand the relationships of inputs and
outputs
Regression: Identify the nature of the relationship between two variables
Tools for Data Mining
Spreadsheets: For simple tasks, accessible and easy to read
R: regression, classification, text mining (tm and twitteR packages)
Integrated development environments available
Python: widely used for data mining
Pandas: data structures and analysis for any data format; mean, median, mode, range
NumPy: computational mathematics for Python
Jupyter Notebooks: the environment of choice for many data scientists
SPSS: popular in the social sciences for trends and validation; requires a license;
requires minimal code to use
Watson Studio: IBM's offering on the public cloud; create machine learning and
predictive models
SAS: Comprehensive framework to identify patterns in data with modeling techniques
Analyze large amounts of data, explore relationships and anomalies.
Communicate data analysis results
Overview of communication and sharing of Data Analysis results
Understand the problem that needs to be solved and what you want to achieve.
Story
Visualization
Data
Who is my audience?
What matters to them?
What would help them believe me?
You must consider which pieces are most important to the audience.
Share your resources, reports and hypotheses
Top-down
Bottom-up
Graphs, tables and diagrams can be used
Present the data with a narrative
Insights: Data Analytics Storytelling
Storytelling: It is very important, as humans we understand the world through stories.
Combine a simple story with complex information.
Introduction to data visualization
Communicate information with visual elements.
Provide a summary of relationships and patterns in the data.
What is the relationship I am trying to establish?
Do I need to show the audience the correlation of variables?
What is the question I am trying to answer?
Decide whether you need a static or an interactive visualization.
What questions might you have?
Type of graphics:
Bar charts: compare related values or parts of a whole
Column charts: compare values side by side, such as changes over time
Pie charts: show the proportions of the parts of a whole; each slice represents a category
Line charts: show how a value changes in relation to a continuous variable; reveal
patterns and trends (see the Matplotlib sketch after this list)
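A minimal Matplotlib sketch of the chart types above (the data is made up):

```python
# Minimal sketch: bar, column, pie and line charts with Matplotlib.
# The data is made up for illustration.
import matplotlib.pyplot as plt

categories = ["A", "B", "C"]
values = [30, 45, 25]
months = [1, 2, 3, 4]
revenue = [10, 12, 9, 14]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].barh(categories, values)        # bar chart: compare related values
axes[0, 1].bar(categories, values)         # column chart: side-by-side values
axes[1, 0].pie(values, labels=categories)  # pie chart: parts of a whole
axes[1, 1].plot(months, revenue)           # line chart: change over time
plt.tight_layout()
plt.show()
```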
Dashboards: organize and display visualizations of both operational and analytical data.
Easy to understand
They make collaboration easy
They allow you to generate reports
Introduction to Visualization Software or Dashboards
Software
Spreadsheets:
Excel: used to represent graphs, easy to learn,
Google Sheets: Help choose the best chart
Jupyter Notebooks (Python libraries):
Matplotlib: 2D and 3D plots
Bokeh: interactive graphics
Dash: to create interactive dashboards; does NOT require knowledge of HTML or
JavaScript
R:
Create basic and advanced visualizations
Shiny for web apps
IBM Cognos Analytics: import custom visualizations; forecasts and recommendations
from your data
Tableau: interactive dashboards and graphs in worksheets and stories; connects to
Excel files, text files, Google Analytics and relational databases
Power BI: a powerful and flexible analytics platform; connects to Excel, SQL and
cloud sources; lets you collaborate on dashboards, even from mobile phones
Know each tool's ease of use and the purpose it serves.
Week 5 Learning Opportunities and Paths
Career Opportunities in Data Analytics
Industry
Government
Academia
Finance
Insurance
Health
Retail
IT
Positions (by years of experience):
Associate or Junior
Data Analyst
Senior analyst
Lead Analyst
Principal Analyst
Gain specialization in an area
You could start by knowing just one query language and one programming language.
Viewpoints
Integrity, making sure information is correct.
Someone who knows how to communicate.
Programming Skills, Python, SQL, R
Ability to work with data.
Detail oriented
Go beyond achievements, high aspirations.
Think outside the box
Dynamic and adaptive
Pick up technical skills quickly.
The multiple paths in Data Analysis
Coursera, edX and Udacity
Statistics, spreadsheets, SQL, Python, problem solving, storytelling, impactful presentations
Career Options
Machine Learning
Data Scientist
1. List at least 5 (five) data points that are required for the analysis and detection
of credit card fraud. (3 points)
2. Identify 3 (three) errors/issues that could affect the accuracy of your findings,
based on a data table provided. (3 points)
3. Identify 2 (two) anomalies or unexpected behaviors that make you believe that
the transaction may be suspicious, based on a data table provided. (2 points)
4. Briefly explain your key takeaway from the data visualization provided. (1
point)
5. Identify the type of analysis you are performing when analyzing historical credit
card data to understand what a fraudulent transaction looks like [Hint: The four
types of analysis include: Descriptive, Diagnostic, Predictive, Prescriptive] (1
point)