MDM MST
Data analytics is a multidisciplinary field that involves the collection, transformation, and organization of data to
draw conclusions, make predictions, and drive informed decision-making.
Data analytics is the science of analysing raw data to make conclusions about that information.
Various approaches to data analytics include descriptive analytics, diagnostic analytics, predictive analytics, and
prescriptive analytics.
Data analytics helps a business optimize its performance, operate more efficiently, maximize profit, and make more strategically guided decisions.
Statistics is a branch of mathematics dealing with data collection and organization, analysis, interpretation and
presentation.
Descriptive statistics uses the data to provide descriptions of the population, either through numerical calculations or graphs or tables.
The process of detecting and correcting corrupt or inaccurate records in a database is known as Data Cleaning.
The process of changing data to make it more organized and easier to read is known as Data Manipulation.
Machine Learning allows a machine to learn from examples and experience without being explicitly programmed.
A company can also use data analytics to make better business decisions and help analyze customer trends and
satisfaction, which can lead to new and better products and services.
1. The data analysis process involves several steps. First, determine the data requirements or how the data is grouped. Data may be separated by age, demographic, income, or gender, and data values may be numerical or divided by category. Then collect the data, which can be done through a variety of sources such as computers, online sources, cameras, environmental sources, or through personnel.
2. Organize the data after it's collected so it can be analyzed. This may take place on a spreadsheet or other form of
software that can take statistical data.
3. Clean up the data before it is analyzed. This is done by scrubbing it and ensuring there's no duplication or error and
that it is not incomplete. This step helps correct any errors before the data goes on to a data analyst to be analyzed.
Regression Analysis: This entails analyzing the relationship between one or more independent variables and a
dependent variable. The independent variables are used to explain the dependent variable, showing how changes in
the independent variables influence the dependent variable.
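A minimal sketch of a simple regression in Python (hypothetical advertising-spend vs. sales numbers, using scikit-learn, which is introduced later in these notes):

# Fit a line relating one independent variable to a dependent variable.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[10], [20], [30], [40], [50]])   # independent variable, shape (n_samples, n_features)
y = np.array([25, 45, 62, 85, 105])            # dependent variable

model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0])         # change in y per unit change in X
print("intercept:", model.intercept_)   # predicted y when X is 0
print("prediction for X=60:", model.predict([[60]])[0])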
Factor Analysis: This entails taking a complex dataset with many variables and reducing the variables to a small
number. The goal of this manoeuvre is to attempt to discover hidden trends that would otherwise have been more
difficult to see.
Cohort Analysis: This is the process of breaking a data set into groups of similar data, often into a customer
demographic. This allows data analysts and other users of data analytics to further dive into the numbers relating to
a specific subset of data.
Monte Carlo Simulations: Models the probability of different outcomes happening. They're often used for risk
mitigation and loss prevention. These simulations incorporate multiple values and variables and often have greater
forecasting capabilities than other data analytics approaches.
Time Series Analysis: Tracks data over time and establishes the relationship between the value of a data point and when it occurs. This data analysis technique is usually used to spot cyclical trends or to project financial forecasts.
Here are some of the reasons why Data Analytics using Python has become popular:
Python is easy to learn and understand and has a simple syntax.
The programming language is scalable and flexible.
It has a vast collection of libraries for numerical computation and data manipulation. Python provides libraries for
graphics and data visualization to build plots.
It has broad community support to help solve many kinds of queries.
NumPy: NumPy supports n-dimensional arrays and provides numerical computing tools. It is useful for Linear algebra
and Fourier transform.
Pandas: Pandas provides functions to handle missing data, perform mathematical operations, and manipulate the
data
Matplotlib: Matplotlib library is commonly used for plotting data points and creating interactive visualizations of the
data.
SciPy: SciPy library is used for scientific computing. It contains modules for optimization, linear algebra, integration,
interpolation, special functions, signal and image processing.
Scikit-Learn: Scikit-Learn library has features that allow you to build regression, classification, and clustering models.
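A small sketch showing a few of these libraries working together on made-up monthly sales figures (SciPy and Scikit-Learn are used in the same way for scientific computing and model building):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# NumPy: numerical arrays and computations
values = np.array([12.0, 15.5, 14.2, 18.9, 20.1])

# Pandas: labeled data manipulation
df = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Apr", "May"], "sales": values})
print(df.describe())   # quick summary statistics

# Matplotlib: plotting the data
plt.plot(df["month"], df["sales"], marker="o")
plt.title("Monthly sales")
plt.show()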
Charts and Visualization-Any set of information may be graphically represented in a chart. A chart is a graphic
representation of data that employs symbols to represent the data, such as bars in a bar chart or lines in a line chart.
Conditional Formatting- Patterns and trends in your data may be highlighted with the help of conditional formatting.
Pivot Table- In order to create the required report, a pivot table is a statistics tool that condenses and reorganizes specific columns and rows of data in a spreadsheet or database table. The utility simply “pivots” or rotates the data to examine it from various angles rather than altering the spreadsheet or database itself.
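For example, Pandas provides pivot_table to build such a summary (hypothetical sales data):

import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 120, 90, 150],
})

# Rotate the data: regions become rows, quarters become columns.
pivot = pd.pivot_table(sales, values="revenue", index="region",
                       columns="quarter", aggfunc="sum")
print(pivot)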
Pandas generally provides two data structures for manipulating data:
Series- A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). A Series is essentially a single column in an Excel sheet, and it can be created from a list, a dictionary, a scalar value, etc.
DataFrame- A Pandas DataFrame is a two-dimensional labeled data structure whose columns can hold different data types, much like a table in a spreadsheet or SQL database.
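A minimal sketch of creating both structures:

import pandas as pd

# Series from a list, a dictionary, and a scalar value
s_list   = pd.Series([10, 20, 30])
s_dict   = pd.Series({"a": 1, "b": 2, "c": 3})
s_scalar = pd.Series(5, index=["x", "y", "z"])

# DataFrame: a table with labeled rows and columns
df = pd.DataFrame({"name": ["Asha", "Ravi"], "score": [82, 91]})
print(df)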
Exploratory Data Analysis (EDA) is a process of describing the data by means of statistical and visualization
techniques in order to bring important aspects of that data into focus for further analysis. It involves inspecting the
dataset from many angles, describing & summarizing it without making any assumptions about its contents.
1. Univariate Analysis: Univariate analysis examines individual variables to understand their distributions and
summary statistics. This includes calculating measures such as mean, median, mode, and standard deviation, and
visualizing the data using histograms, bar charts, box plots, and violin plots.
2. Bivariate Analysis: Bivariate analysis explores the relationship between two variables. It uncovers patterns through
techniques like scatter plots, pair plots, and heatmaps. This helps to identify potential associations or dependencies
between variables.
3. Multivariate Analysis: Multivariate analysis involves examining more than two variables simultaneously to understand their relationships and combined effects. Techniques such as contour plots and principal component analysis (PCA) are commonly used in multivariate EDA.
4. Visualization Techniques: EDA relies heavily on visualization methods to depict data distributions, trends, and
associations. Various charts and graphs, such as bar charts, line charts, scatter plots, and heatmaps, are used to
make data easier to understand and interpret
5. Outlier Detection: EDA involves identifying outliers within the data, anomalies that deviate significantly from the rest of the data. Tools such as box plots, z-score analysis, and scatter plots help in detecting and analyzing outliers.
6. Statistical Tests: EDA often includes performing statistical tests to validate hypotheses or discern significant differences between groups. Tests such as t-tests, chi-square tests, and ANOVA add depth to the analysis process by providing a statistical basis for the observed patterns.
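A brief EDA sketch touching several of the steps above (summary statistics, a histogram, a scatter plot, and a simple z-score outlier check), on a small made-up dataset:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.DataFrame({"age":    [23, 25, 31, 35, 40, 41, 52, 60, 95],
                   "income": [20, 24, 35, 40, 48, 50, 65, 70, 300]})

# Univariate: summary statistics and a histogram
print(df["income"].describe())
df["income"].hist()

# Bivariate: scatter plot of the two variables
df.plot.scatter(x="age", y="income")

# Outlier detection: simple z-score rule (|z| > 2 flags potential outliers)
z = (df["income"] - df["income"].mean()) / df["income"].std()
print(df[np.abs(z) > 2])
plt.show()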
DATA CLEANING
Dirty data is essentially any data that needs to be manipulated or worked on in some way before it can be analyzed.
Some types of dirty data include:
Incomplete data—for example, a spreadsheet with missing values that would be relevant for your analysis. If you’re
looking at the relationship between customer age and number of monthly purchases, you’ll need data for both of
these variables. If some customer ages are missing, you’re dealing with incomplete data.
Duplicate data—for example, records that appear twice (or multiple times) throughout the same dataset. This can
occur if you’re combining data from multiple sources or databases.
Inconsistent or inaccurate data—data that is outdated or contains structural errors such as typos, inconsistent
capitalization, and irregular naming conventions. Say you have a dataset containing student test scores, with some
categorized as “Pass” or “Fail” and others categorized as “P” or “F.” Both labels mean the same thing, but the
naming convention is inconsistent, leaving the data rather messy.
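A small Pandas sketch showing how these three kinds of dirty data might be handled (hypothetical student-results table):

import pandas as pd

df = pd.DataFrame({
    "student": ["A", "B", "B", "C", "D"],
    "age":     [18, 19, 19, None, 20],               # incomplete: missing age
    "result":  ["Pass", "Fail", "Fail", "P", "F"],   # inconsistent labels
})

df = df.drop_duplicates()                                        # remove duplicate records
df = df.dropna(subset=["age"])                                   # or fill: df["age"].fillna(df["age"].mean())
df["result"] = df["result"].replace({"P": "Pass", "F": "Fail"})  # standardize naming convention
print(df)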
Key steps
Discovery: This initial step involves understanding the data and identifying the questions you want to answer. It
includes locating the data sources and examining the data's current form to determine how it needs to be cleaned
and structured.
Transformation: This step involves several sub-processes: Structuring: Ensuring that datasets are in compatible
formats for analysis. Normalizing and Denormalizing: Organizing data into a coherent database and combining
multiple tables if necessary.
Cleaning: Removing errors, duplicates, and outliers to ensure data accuracy. Enriching: Adding additional data or
metadata to enhance the dataset.
Validation: This step checks the transformed data for consistency, quality, and security. It often involves automated
processes and may require programming skills.
Publishing: The final step involves saving the cleaned and structured data in a format suitable for sharing and further
analysis.
Importance of Data Wrangling
Data wrangling is crucial because it prepares data for the data mining process, which involves looking for patterns or relationships in the dataset. High-quality, well-prepared data leads to more accurate and valuable insights, enabling better decision-making.
STEPS IN WRANGLING
1. Collection The first step in data wrangling is collecting raw data from various sources. These sources can include
databases, files, external APIs, web scraping, and many other data streams. The data collected can be structured (e.g.,
SQL databases), semi-structured (e.g., JSON, XML files), or unstructured (e.g., text documents, images).
2. Cleaning Once data is collected, the cleaning process begins. This step removes errors, inconsistencies, and
duplicates that can skew analysis results. Cleaning might involve: Removing irrelevant data that doesn't contribute to
the analysis. Correcting errors in data, such as misspellings or incorrect values. Dealing with missing values by removing them, imputing them from other data points, or estimating them through statistical methods. Identifying and resolving inconsistencies, such as different formats for dates or currency.
3. Structuring After cleaning, data needs to be structured or restructured into a more analysis-friendly format. This
often means converting unstructured or semi-structured data into a structured form, like a table in a database or a
CSV file. This step may involve: Parsing data into structured fields. Normalizing data to ensure consistent formats and
units. Transforming data, such as converting text to lowercase, to prepare for analysis.
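A hedged sketch of the structuring step, assuming raw records with date strings and inconsistently cased city names:

import pandas as pd

raw = pd.DataFrame({
    "order_date": ["2024-01-05", "2024-02-05", "2024-03-10"],
    "city":       ["  Mumbai", "DELHI", "pune "],
})

# Parse date strings into a proper datetime type (structured field)
raw["order_date"] = pd.to_datetime(raw["order_date"])

# Normalize text: strip whitespace and convert to lowercase
raw["city"] = raw["city"].str.strip().str.lower()
print(raw)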
4. Enriching Data enrichment involves adding context or new information to the dataset to make it more valuable for
analysis. This can include: Merging data from multiple sources to develop a more comprehensive dataset. Creating
new variables or features that can provide additional insights when analyzed.
5. Validating Validation ensures the data's accuracy and quality after it has been cleaned, structured, and enriched.
This step may involve: Data integrity checks, such as ensuring foreign keys in a database match. Quality assurance
testing to ensure the data meets predefined standards and rules.
6. Storing The final wrangled data is then stored in a data repository, such as a database or a data warehouse, making
it accessible for analysis and reporting. This storage not only secures the data but also organizes it in a way that is
efficient for querying and analysis.
7. Documentation Documentation is critical throughout the data wrangling process. It records what was done to the
data, including the transformations and decisions. This documentation is invaluable for reproducibility, auditing, and
understanding the data analysis process.
Benefits of Data Wrangling- Improved Data Quality, Enhanced Analytical Efficiency, Facilitation of Advanced Analytics and Machine Learning, Data Integration from Multiple Sources, Compliance and Data Governance, Empowered Decision-Making, and Scalability.
Time series data is a sequential arrangement of data points organized in consecutive time order. Time-series analysis
consists of methods for analyzing time-series data to extract meaningful insights and other valuable characteristics of
the data.
Continuous time series data involves measurements or observations that are recorded at regular intervals, forming a
seamless and uninterrupted sequence.
Discrete time series data consists of measurements or observations that are limited to specific values or categories. Unlike continuous data, discrete data does not have a continuous range of possible values but instead comprises distinct and separate data points.
Time series can also be irregular, without a fixed unit of time or offset between observations.
Time series data may be marked or referred to in several ways:
• Timestamps, specific instants in time.
• Fixed periods, such as the month January 2007 or the full year 2010.
• Intervals of time, indicated by a start and end timestamp. Periods can be thought of as special cases of intervals.
• Experiment or elapsed time; each timestamp is a measure of time relative to a particular start time, for example the diameter of a cookie baking each second since being placed in the oven.
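In Pandas these notions map roughly onto Timestamp, Period, Interval, and a fixed-frequency DatetimeIndex (a minimal sketch):

import pandas as pd

ts = pd.Timestamp("2010-01-15 09:30")                 # a specific instant in time
period = pd.Period("2007-01", freq="M")               # a fixed period: January 2007
interval = pd.Interval(pd.Timestamp("2010-01-01"),
                       pd.Timestamp("2010-12-31"))    # start and end timestamps

# A regular (fixed-frequency) time series indexed by daily timestamps
series = pd.Series(range(5), index=pd.date_range("2024-01-01", periods=5, freq="D"))
print(series)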
Techniques that are used for predictive analytics are:
Linear Regression
Time Series Analysis and Forecasting
Data Mining
Predictive modeling
Decision Analysis and optimization
Transaction profiling
Common examples of Descriptive analytics are company reports that provide historic reviews like:
Data Queries
Reports
Descriptive Statistics
Data dashboard
For example, Prescriptive Analytics can benefit healthcare strategic planning by using analytics to leverage
operational and usage data combined with data of external factors such as economic data, population
demography, etc.
Data discovery
Data mining
Correlations
A marketing firm conducts a survey with multiple questions about customer preferences for a product. Instead of
analyzing each question separately, factor analysis groups related questions into broader factors like "Brand
Loyalty," "Product Quality," and "Pricing Sensitivity." This helps the company understand key influences on
customer decisions and improve their marketing strategies accordingly.
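A rough sketch of this idea using scikit-learn's FactorAnalysis; the survey responses here are randomly generated placeholders, not real data:

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Pretend these are survey responses: 100 customers x 6 questions
responses = rng.normal(size=(100, 6))

# Reduce the six questions to two underlying factors
fa = FactorAnalysis(n_components=2, random_state=0)
factors = fa.fit_transform(responses)
print(factors.shape)         # (100, 2): each customer scored on two hidden factors
print(fa.components_.shape)  # (2, 6): how each question loads on each factor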
An e-commerce company tracks customers who made their first purchase in January, February, and
March separately. By analyzing their purchasing behavior over the next six months, the company identifies
which cohort has the highest retention rate and adjusts marketing strategies accordingly.
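A rough Pandas sketch of a cohort/retention table, assuming a hypothetical orders table with customer IDs and order dates:

import pandas as pd

orders = pd.DataFrame({
    "customer": ["c1", "c1", "c2", "c3", "c2", "c1"],
    "order_date": pd.to_datetime(["2024-01-10", "2024-02-15", "2024-02-01",
                                  "2024-03-05", "2024-04-20", "2024-05-02"]),
})

# Cohort = month of the customer's first purchase
first_purchase = orders.groupby("customer")["order_date"].transform("min")
orders["cohort"] = first_purchase.dt.to_period("M")
orders["order_month"] = orders["order_date"].dt.to_period("M")

# Distinct customers from each cohort active in each month
retention = orders.groupby(["cohort", "order_month"])["customer"].nunique().unstack(fill_value=0)
print(retention)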
A financial analyst uses Monte Carlo simulations to predict stock portfolio performance over the next year. By
simulating thousands of possible market conditions (such as different interest rates, inflation rates, and stock price
movements), the analyst estimates the likelihood of different returns and advises investors on risk management.
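A toy Monte Carlo sketch along these lines; the assumed mean return and volatility are illustrative, not market data:

import numpy as np

rng = np.random.default_rng(42)
n_simulations = 10_000
mean_return, volatility = 0.07, 0.15   # assumed annual return and risk

# Simulate one year of portfolio returns many times
simulated_returns = rng.normal(mean_return, volatility, n_simulations)

print("expected return:", simulated_returns.mean())
print("probability of a loss:", (simulated_returns < 0).mean())
print("5th percentile (value at risk):", np.percentile(simulated_returns, 5))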
A retail company uses time series analysis to predict monthly sales trends based on past sales data. By analyzing
seasonal spikes (e.g., higher sales during holidays) and long-term trends, the company optimizes inventory and
marketing strategies.
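A small Pandas sketch of this, using three years of made-up monthly sales with a holiday spike each December:

import pandas as pd
import numpy as np

months = pd.date_range("2021-01-01", periods=36, freq="MS")
sales = pd.Series(100 + np.arange(36) * 2, index=months)   # upward trend
sales[sales.index.month == 12] += 50                       # holiday spike

# 12-month rolling average smooths out seasonality to reveal the long-term trend
trend = sales.rolling(window=12).mean()

# Average sales per calendar month highlights the seasonal pattern
seasonal = sales.groupby(sales.index.month).mean()
print(seasonal)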
CREATE 1D ARRAY
CREATE 2D ARRAY
CHECK SHAPE
BETWEEN 0 AND 1
REPEAT ARRAYS
SUM
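The operations listed above can be illustrated with a minimal NumPy sketch (an illustrative reconstruction, since the original code is not shown in these notes):

import numpy as np

# CREATE 1D ARRAY
a = np.array([1, 2, 3, 4])

# CREATE 2D ARRAY
b = np.array([[1, 2], [3, 4]])

# CHECK SHAPE
print(a.shape)   # (4,)
print(b.shape)   # (2, 2)

# BETWEEN 0 AND 1: random values drawn from [0, 1)
r = np.random.rand(3, 3)

# REPEAT ARRAYS
print(np.repeat(a, 2))   # [1 1 2 2 3 3 4 4]
print(np.tile(a, 2))     # [1 2 3 4 1 2 3 4]

# SUM
print(a.sum())           # 10
print(b.sum(axis=0))     # column sums: [4 6]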