Data Analyst
Data analysts translate data and numbers into plain language so organizations can make decisions. They inspect and clean data to derive insights; identify correlations, find patterns, and apply statistical methods to analyze and mine data; and visualize data to interpret and present the findings of their analysis.
New technologies like cloud computing, machine learning, and big data have a significant influence on the data ecosystem, providing access to limitless storage, powerful computing, and advanced tools for data analysis.
Data Analytics
Data analytics is the process of gathering, cleaning, analyzing and mining data, interpreting results, and reporting the findings.
Data analysis vs. data analytics:
Definition: Analysis is the detailed examination of the elements or structure of something; analytics is the systematic computational analysis of data or statistics.
Use of numbers/data: Analysis can be done without numbers or data (e.g., business analysis, psychoanalysis); analytics almost invariably implies the use of data for numerical manipulation and inference.
Historical data: Analysis is often based on inferences from historical data; analytics is not limited to historical data and can include predictive elements.
A data analyst draws on three groups of skills:
Technical skills: Proficiency in spreadsheets, statistical and visualization tools, programming, querying languages, and working with various data repositories and big data platforms.
Functional skills: Understanding of statistics, analytical techniques, problem-solving, data visualization, and project management.
Soft skills: Collaboration, effective communication, storytelling with data, stakeholder engagement, curiosity, and intuition.
Types of Data
Data is unorganized information that is processed to make it meaningful.
Structured data: Well-defined structure, tabular format, schemas. Example sources: SQL databases, spreadsheets, online forms, sensors, logs.
Semi-structured data: Some organizational properties, metadata-driven. Example sources: e-mails, XML, binary executables, data integration.
Unstructured data: Lacks a specific structure; no mainstream database fit. Example sources: web pages, social media feeds, images, audio/video, documents.
Data professionals work with a variety of data file types and formats, including delimited text files (CSVs and TSVs), Microsoft Excel XLSX, XML, PDF, and JSON.
These formats are used for storing, organizing, and sharing data in different ways, offering flexibility and compatibility with a wide range of applications and systems.
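For illustration, the snippet below loads the same hypothetical dataset from three of these formats into pandas DataFrames; the file names are placeholders, and reading XLSX assumes the openpyxl engine is installed.

```python
import pandas as pd

# Placeholder file names; substitute real data files.
df_csv = pd.read_csv("sales.csv")        # delimited text (use sep="\t" for a TSV)
df_xlsx = pd.read_excel("sales.xlsx")    # Microsoft Excel XLSX (needs openpyxl)
df_json = pd.read_json("sales.json")     # JSON records

# Whatever the source format, each result is a DataFrame ready for analysis.
print(df_csv.head())
```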
Data Sources
Relational Databases: Systems like SQL Server, Oracle, MySQL, and IBM DB2, used for structured data storage.
Flat Files & XML Datasets: Plain text formats with delimited values (CSV, TSV) or hierarchical structures (XML) for data organization.
APIs and Web Services: Interfaces for interacting with data providers or applications, returning data in various formats.
Web Scraping: Techniques for extracting specific data from web pages based on parameters, using tools like BeautifulSoup, Scrapy, and Selenium (a short sketch of an API call and a scrape follows this list).
Data Streams: Continuous flows of data from various sources (IoT devices, GPS data, web clicks, etc.), often timestamped and geo-tagged.
RSS Feeds: Sources for capturing updated data from forums and news sites, streamed to user devices via a feed reader.
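As a minimal sketch of the API and web-scraping sources above, the code below pulls JSON from a hypothetical REST endpoint with requests and extracts headings from a hypothetical page with BeautifulSoup; the URLs, query parameter, and CSS class are all placeholders.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical endpoints used only for illustration.
API_URL = "https://example.com/api/v1/products"
PAGE_URL = "https://example.com/catalog"

# 1. APIs and web services: many providers return JSON that maps cleanly to Python objects.
response = requests.get(API_URL, params={"category": "books"}, timeout=10)
response.raise_for_status()
products = response.json()  # list/dict parsed from the JSON payload

# 2. Web scraping: extract specific elements from an HTML page with BeautifulSoup.
page = requests.get(PAGE_URL, timeout=10)
soup = BeautifulSoup(page.text, "html.parser")
titles = [tag.get_text(strip=True) for tag in soup.find_all("h2", class_="title")]

print(len(products), "records from the API;", len(titles), "titles scraped from the page")
```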
Data Repositories
A Data Repository is a general term that refers to data that has been collected, organized, and isolated so that it can be used for reporting, analytics, and also for archival
purposes. The different types of Data Repositories include:
Databases, which can be relational or non-relational, each following organizational principles based on the kind of data they can store and the tools used to query, organize, and retrieve data (a small relational example follows this list).
Data Lakes, which serve as storage repositories for large amounts of structured, semi-structured, and unstructured data in their native format.
Big Data Stores, which provide distributed computational and storage infrastructure to store, scale, and process very large data sets.
Data Warehouses, which consolidate incoming data into one comprehensive storehouse.
Data Marts, which are essentially sub-sections of a data warehouse, built to isolate data for a particular business function or use case.
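The snippet below is a small, self-contained illustration of the relational idea using Python's built-in sqlite3 module; the table and rows are made up for the example.

```python
import sqlite3

# In-memory SQLite database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("EMEA", 120.0), ("APAC", 75.5), ("EMEA", 42.0)],
)

# A relational repository is organized so that questions become declarative queries.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
):
    print(region, total)

conn.close()
```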
The ETL (Extract, Transform, and Load) process is an automated process that converts raw data into analysis-ready data by extracting it from source systems, transforming it into a format suitable for analysis, and loading it into a target data repository.
Data Pipeline, sometimes used interchangeably with ETL, encompasses the entire journey of moving data from the source to a destination data lake or application, using
the ETL process.
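A toy sketch of the extract-transform-load pattern is shown below, assuming a hypothetical raw CSV file as the source and a SQLite database as the destination; the column names and cleaning rules are illustrative only.

```python
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from a hypothetical CSV source file.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: fix types, drop incomplete records, standardize a field.
    cleaned = []
    for row in rows:
        if not row.get("amount"):
            continue  # skip rows with a missing amount
        cleaned.append({"region": row["region"].strip().upper(),
                        "amount": float(row["amount"])})
    return cleaned

def load(rows, conn):
    # Load: write analysis-ready rows into a destination table.
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (:region, :amount)", rows)
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect("analytics.db")            # hypothetical destination
    load(transform(extract("raw_sales.csv")), conn)   # hypothetical source file
    conn.close()
```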
Data Sources can be internal or external to the organization, and they can be primary, secondary, or third-party, depending on whether you are obtaining the data directly from the original source, retrieving it from externally available data sources, or purchasing it from data aggregators.
Data that has been identified and gathered from the various data sources is combined using a variety of tools and methods to provide a single interface through which the data can be queried and manipulated.
The data you identify, the source of that data, and the practices you employ for gathering the data have implications for quality, security, and privacy, which need to be
considered at this stage.
Data Wrangling
Data Wrangling is an iterative process that involves data exploration, transformation, and validation. Typical transformation tasks include the following (a small pandas sketch follows the list):
* Structurally manipulate and combine the data using Joins and Unions.
* Denormalize data, that is, combine data from multiple tables into a single table so that it can be queried faster.
* Clean data, which involves profiling data to uncover quality issues, visualizing data to spot outliers, and fixing issues such as missing
values, duplicate data, irrelevant data, inconsistent formats, syntax errors, and outliers.
* Enrich data, which involves considering additional data points that could add value to the existing data set and lead to a more
meaningful analysis.
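The sketch below runs these wrangling steps with pandas on made-up tables: a union via concat, a join/denormalization via merge, de-duplication, missing-value handling, and a simple enrichment column.

```python
import pandas as pd

# Hypothetical tables used only to illustrate the operations above.
customers = pd.DataFrame({"cust_id": [1, 2, 3], "region": ["EMEA", "APAC", None]})
orders_q1 = pd.DataFrame({"cust_id": [1, 1, 2], "amount": [10.0, 12.5, 7.0]})
orders_q2 = pd.DataFrame({"cust_id": [2, 3, 3], "amount": [5.0, 9.9, 9.9]})

# Union: stack tables that share the same columns.
orders = pd.concat([orders_q1, orders_q2], ignore_index=True)

# Join / denormalize: combine orders and customers into one wide table.
flat = orders.merge(customers, on="cust_id", how="left")

# Clean: remove duplicate rows and fill a missing categorical value.
flat = flat.drop_duplicates()
flat["region"] = flat["region"].fillna("UNKNOWN")

# Enrich: derive a new column that may add analytical value.
flat["amount_usd"] = flat["amount"] * 1.0  # placeholder conversion rate

print(flat)
```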
Statistical Analysis
Statistical Analysis involves the use of statistical methods in order to develop an understanding of what the data represents.
Descriptive statistical analysis: provides a summary of what the data represents. Common measures include Central Tendency, Dispersion, and Skewness.
Inferential statistical analysis: involves making inferences, or generalizations, about data. Common measures include Hypothesis Testing, Confidence Intervals, and
Regression Analysis.
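As a rough illustration on synthetic data, the snippet below computes common descriptive measures and then runs a simple inferential step (an independent-samples t-test and a 95% confidence interval) using NumPy and SciPy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample_a = rng.normal(loc=100, scale=15, size=200)   # hypothetical measurements
sample_b = rng.normal(loc=104, scale=15, size=200)

# Descriptive statistics: central tendency, dispersion, skewness.
print("mean:", np.mean(sample_a), "median:", np.median(sample_a))
print("std dev:", np.std(sample_a, ddof=1), "skewness:", stats.skew(sample_a))

# Inferential statistics: hypothesis test and confidence interval.
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
ci = stats.t.interval(0.95, df=len(sample_a) - 1,
                      loc=np.mean(sample_a), scale=stats.sem(sample_a))
print("t-test p-value:", p_value, "95% CI for mean of A:", ci)
```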
Data Mining
Data Mining, simply put, is the process of extracting knowledge from data. It involves the use of pattern recognition technologies, statistical analysis, and mathematical
techniques, in order to identify correlations, patterns, variations, and trends in data.
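A small, hypothetical example of both ideas: computing a correlation matrix with pandas and discovering groupings with k-means clustering from scikit-learn (one of many pattern-recognition techniques that could be used here).

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical customer metrics for illustration.
df = pd.DataFrame({
    "visits":  [5, 3, 8, 1, 9, 2, 7, 4],
    "spend":   [50, 35, 90, 10, 95, 20, 80, 40],
    "returns": [0, 1, 0, 2, 0, 2, 1, 1],
})

# Correlations: which variables move together?
print(df.corr())

# Pattern discovery: group similar customers with k-means clustering.
df["segment"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(df)
print(df)
```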
Data Visualization
Data visualization is the discipline of communicating information through the use of visual elements such as graphs, charts, and maps. The goal of visualizing data is to
make information easy to comprehend, interpret, and retain.
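To make this concrete, the snippet below draws a simple line chart of made-up monthly figures with Matplotlib; the data and labels are placeholders.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly figures used only to illustrate the idea.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [12.1, 13.4, 11.8, 15.2, 16.0, 17.3]

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(months, revenue, marker="o")        # trend over time
ax.set_title("Monthly revenue (example data)")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue ($M)")
fig.tight_layout()
plt.show()                                  # or fig.savefig("revenue.png")
```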