AI and ML Notes
Artificial Intelligence, often abbreviated as AI, is a branch of computer science that aims to
create intelligent systems that can reason, learn, and act autonomously. It seeks to mimic human
intelligence processes, such as learning, problem-solving, perception, and language
understanding.
From self-driving cars to medical diagnosis, AI-powered systems are becoming increasingly
sophisticated and impactful. The core goal of AI is to develop machines that can think and act
like humans, enabling them to perform tasks that were once thought to be the exclusive domain
of human intelligence.
Machine Learning, a subset of AI, focuses on algorithms that allow computers to learn from data
without explicit programming. It involves training models on large datasets to recognize patterns
and make predictions or decisions. ML algorithms can be broadly categorized into three main
types: supervised learning, unsupervised learning, and reinforcement learning.
AI has the potential to revolutionize various aspects of our lives, but it also raises significant
ethical concerns. Here are the key points to consider:
Importance of Ethics in AI
Fairness and Bias: AI systems trained on biased data can perpetuate discrimination and
inequality. Ethical AI ensures that algorithms are fair and unbiased.
Transparency and Explainability: AI models often operate as black boxes, making it
difficult to understand how they arrive at decisions. Ethical AI requires transparency and
explainability to build trust.
Privacy and Security: AI systems often collect and process large amounts of personal
data. Ethical AI safeguards user privacy and ensures data security.
Accountability: Determining who is responsible for the actions and consequences of AI
systems is a complex ethical challenge. Ethical AI requires clear guidelines for
accountability.
Machine learning algorithms are broadly categorized into two main tasks based on the type of
prediction they make: regression and classification.
a. Regression: predicts a continuous numerical value (for example, a house price).
b. Classification: predicts a discrete class label (for example, spam vs. not spam).
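As a minimal sketch of the contrast, assuming scikit-learn and a made-up toy dataset (neither is prescribed by these notes):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Regression: predict a continuous value (e.g., a price).
y_continuous = np.array([1.9, 4.1, 6.2, 7.8, 10.1])
reg = LinearRegression().fit(X, y_continuous)
print(reg.predict([[6.0]]))   # a real-valued prediction

# Classification: predict a discrete label (e.g., 0 = "no", 1 = "yes").
y_labels = np.array([0, 0, 0, 1, 1])
clf = LogisticRegression().fit(X, y_labels)
print(clf.predict([[6.0]]))   # a class label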
Data Loading
Definition: Data loading refers to the process of importing data from various sources into a
system or software for analysis and processing.
Key Points:
Sources: Data can come from diverse sources like databases, spreadsheets, APIs, or real-time streams.
Formats: Data can be in various formats such as CSV, JSON, XML, or binary.
Tools and Libraries:
o Python: Pandas, NumPy
o R: readr, readxl
o SQL: Database connectors
Challenges:
o Data Quality: Inconsistent formats, missing values, outliers.
o Data Volume: Large datasets can be computationally expensive to load.
o Data Velocity: Real-time data streams require efficient loading mechanisms.
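To make this concrete, here is a minimal loading sketch in Python using Pandas; the file names and columns are hypothetical placeholders, not part of these notes:

import pandas as pd

# Load a CSV file (the source could equally be a URL, an API response, or a database query).
df = pd.read_csv("data.csv")

# JSON is another common format; commented out because the file is hypothetical.
# df = pd.read_json("data.json")

# A quick inspection surfaces the loading challenges listed above:
print(df.shape)         # data volume: rows x columns
print(df.dtypes)        # inconsistent formats show up as unexpected dtypes
print(df.isna().sum())  # missing values per column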
Data Summarization
Definition: Data summarization involves condensing large datasets into smaller, more
manageable forms while preserving essential information.
Key Points:
Techniques:
o Descriptive Statistics: Mean, median, mode, standard deviation, quartiles.
o Frequency Distributions: Counting occurrences of values.
o Aggregation: Grouping data and calculating summary statistics for each group.
o Dimensionality Reduction: Techniques like PCA to reduce the number of
variables.
Tools and Libraries:
o Python: Pandas, NumPy
o R: dplyr
Purpose:
o Understanding Data: Gain insights into data distribution, trends, and patterns.
o Data Cleaning: Identifying and handling outliers, missing values, and
inconsistencies.
o Feature Engineering: Creating new features from existing ones.
o Model Building: Selecting relevant features and training models.
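A small summarization sketch with Pandas illustrates the techniques above; the column names ("region", "sales") are invented for illustration:

import pandas as pd

df = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "sales": [120, 95, 130, 88, 142],
})

# Descriptive statistics: mean, std, quartiles, etc.
print(df["sales"].describe())

# Frequency distribution: counting occurrences of values.
print(df["region"].value_counts())

# Aggregation: summary statistics per group.
print(df.groupby("region")["sales"].agg(["mean", "median", "std"]))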
Big Data
Definition: Big data refers to extremely large data sets that may be analyzed computationally to
reveal patterns, trends, and associations, especially relating to human behavior and interactions.
Key Characteristics:
o Volume: The sheer scale of data being generated and stored.
o Velocity: The speed at which new data arrives, often in real time.
o Variety: The mix of structured and unstructured formats.
Challenges:
PREPROCESSING
DATA: Data is raw, unorganized facts that need to be processed to become meaningful.
Generally, data comprises facts, observations, numbers, characters, symbols, images, etc.
Virtually any type of data analysis, data science or AI development requires some type of data
preprocessing to provide reliable, precise and robust results for enterprise applications.
Machine learning and deep learning algorithms work best when data is presented in a format that
highlights the relevant aspects required to solve a problem.
Feature engineering practices that involve data transformation, data reduction, feature selection
and feature scaling help restructure raw data into a form suited for particular types of algorithms.
This can significantly reduce the processing power and time required to train a new machine
learning or AI algorithm or run an inference against it.
One caution should be observed when preprocessing data: the potential for re-encoding bias
into the data set. Identifying and correcting bias is critical for applications that help make
decisions affecting people, such as loan approvals.
1. Data profiling. Data profiling is the process of examining, analyzing and reviewing data
to collect statistics about its quality.
2. Data cleansing. The aim here is to find the easiest way to rectify quality issues, such as
eliminating bad data, filling in missing data or otherwise ensuring the raw data is suitable
for feature engineering.
3. Data reduction. Raw data sets often include redundant data that arise from characterizing
phenomena in different ways or data that is not relevant to a particular ML, AI or analytics task.
Data reduction uses techniques to transform the raw data into a simpler form suitable for
particular use cases.
4. Data transformation. Here, data scientists think about how different aspects of the data need
to be organized to make the most sense for the goal.
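One possible end-to-end sketch of these four steps in Python (Pandas and scikit-learn); the file name, columns and thresholds are assumptions made for illustration, not a prescribed pipeline:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = pd.read_csv("raw_data.csv")   # hypothetical raw data set

# 1. Data profiling: collect statistics about data quality.
print(df.describe())
print(df.isna().mean())            # share of missing values per column

# 2. Data cleansing: fix the easiest quality issues.
df = df.drop_duplicates()
df = df.fillna(df.median(numeric_only=True))

# 3. Data reduction: drop uninformative columns, then compress the rest.
numeric = df.select_dtypes(include="number")
numeric = numeric.loc[:, numeric.std() > 0]   # remove constant (redundant) columns

# 4. Data transformation: rescale features so they suit the target algorithm,
#    then apply PCA (assumes at least two numeric columns remain).
scaled = StandardScaler().fit_transform(numeric)
reduced = PCA(n_components=2).fit_transform(scaled)
print(reduced.shape)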
DATA VISUALIZATION:
Data visualization is the practice of translating information into a visual context, such as a map
or graph, to make data easier for the human brain to understand and pull insights from.
The main goal of data visualization is to make it easier to identify patterns, trends and outliers in
large data sets. The term is often used interchangeably with information graphics, information
visualization and statistical graphics.
Data visualization is one of the steps of the data science process, which states that after data has
been collected, processed and modeled, it must be visualized for conclusions to be made.
Visualizations are generally easier to interpret than the raw numerical output of complex algorithms.
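As a brief illustration, a small Matplotlib sketch that plots a synthetic trend and its distribution; the data is invented purely to show the idea:

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(12)                                    # e.g., months
y = 50 + 3 * x + np.random.normal(0, 4, size=12)     # a noisy upward trend

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# A line chart makes the trend obvious at a glance.
ax1.plot(x, y, marker="o")
ax1.set_title("Trend over time")

# A histogram shows the distribution and potential outliers.
ax2.hist(y, bins=6)
ax2.set_title("Distribution of values")

plt.tight_layout()
plt.show()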
Types of Visualizations: