
STEP TOWARDS SUCCESS
AKASH'S
Guru Gobind Singh Indraprastha University Series
SOLVED PAPERS
[Previous Years' Solved Question Papers]
THIRD SEMESTER [B.Tech]
(AIML/AIDS/IOT - 203, 211, 213)
Foundation of Data Science
Universal Human Values-II
Critical Reasoning & Systems Thinking
Rs. 81.00/-
AKASH BOOKS, NEW DELHI
Data modeling involves the following:
1. Predictive Modeling: In predictive data modeling, the goal is to build a model that can make predictions about future or unseen data. This often involves supervised learning techniques, where the model is trained on historical data with known outcomes to make predictions on new data.
2. Descriptive Modeling: Descriptive data modeling aims to create models that describe the relationships and patterns in the data. This can include clustering, dimensionality reduction, or other unsupervised techniques.
3. Prescriptive Modeling: In prescriptive data modeling, the focus is on recommending actions or solutions based on the data and the model's understanding of the system. This is often used in optimization problems and recommendation systems.

UNIT-I

Q.2. (a) What makes Python a popular language for data science and machine learning? Describe. (6)
Ans. Python is an immensely popular language for data science and machine learning, particularly in the context of AI and ML. This popularity can be attributed to several key factors:
1. Ease of Learning and Readability: Python is known for its simple and readable syntax. This makes it accessible to newcomers and experienced programmers alike. As a result, data scientists and machine learning engineers can quickly grasp the fundamentals and focus on solving complex problems instead of wrestling with the language itself.
2. Abundant Libraries and Frameworks: Python boasts a rich ecosystem of libraries and frameworks that cater specifically to data science and machine learning. Some of the most notable ones include:
NumPy: Provides support for arrays and matrices, essential for numerical computations.
Pandas: Offers data manipulation and analysis tools through data structures like DataFrames.
Matplotlib and Seaborn: Used for data visualization and plotting.
Scikit-learn: Provides a wide range of machine learning algorithms with a consistent API.
TensorFlow and PyTorch: Leading deep learning frameworks for building and training neural networks.
These libraries greatly expedite the development process by providing pre-implemented functions and tools for common tasks.
3. Community and Documentation: Python has a massive and active community. This translates into comprehensive documentation, tutorials, forums, and a plethora of online resources. Data scientists can readily find help and solutions to their problems, reducing development time.
4. Versatility: Python is a general-purpose programming language, which means it can be used for a wide range of tasks beyond data science and machine learning. This versatility allows data scientists to integrate their machine learning models with other parts of a software project or create end-to-end data pipelines with ease.
5. Interoperability: Python plays well with other languages. This is crucial for data scientists who often need to integrate their work with existing systems or utilize specialized libraries written in other languages. Python's robust support for C/C++ and Java integration, for example, facilitates this.
6. Active Development: Python is actively developed, with frequent updates and improvements. This ensures that the language remains modern and relevant, incorporating the latest advancements in AI and ML.
7. Industry Adoption: Python's popularity in academia has translated into industry adoption. Many companies use Python for data science and machine learning tasks, making it a valuable skill for job seekers in these fields.
8. Scalability: While Python itself is not the fastest language for certain computationally intensive tasks, it allows for easy integration with faster languages like C++ or Java. This means that you can write performance-critical components in those languages while still using Python for higher-level logic and prototyping.
In the context of AI and ML as a part of the Foundations of Data Science subject, Python's simplicity and the availability of specialized libraries and frameworks make it an ideal choice for teaching and learning. It allows students to grasp fundamental concepts quickly and start applying them to real-world problems without getting bogged down by language intricacies. Additionally, the extensive community and resources make it easier for students to seek help and deepen their understanding of AI and ML concepts.
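As a quick illustration of how the libraries listed in point 2 work together, the following minimal sketch builds a tiny synthetic dataset with NumPy, wraps it in a Pandas DataFrame, and fits a scikit-learn classifier. The toy data and the choice of LogisticRegression are illustrative assumptions, not part of the printed answer.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# NumPy: generate two synthetic numeric feature columns
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Pandas: a tabular view of the same data
df = pd.DataFrame(X, columns=["feature_1", "feature_2"])
df["label"] = y
print(df.head())

# Scikit-learn: train and evaluate a simple classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))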
Q.2. (b) How can you create a synthetic dataset in Microsoft Excel? (5)
Ans. Creating a synthetic dataset in Microsoft Excel for educational purposes, such as teaching AI and ML concepts in the context of a Foundations of Data Science subject, involves generating sample data that mimics real-world scenarios.
Here are the steps to create a simple synthetic dataset using Excel:
1. Define the Purpose: Determine the specific objectives of your synthetic dataset. What concepts or techniques in AI and ML are you trying to illustrate or teach? Knowing the purpose will guide you in generating appropriate synthetic data.
2. Open Excel: Launch Microsoft Excel on your computer.
3. Create Headers: In the first row (usually row 1), create headers for each attribute or feature in your dataset. These headers should describe the characteristics of your data. For example, if you're teaching a classification problem, you might have attributes like "Age," "Income," and "Marital Status."
4. Generate Data: Fill in rows below the headers with synthetic data. You can use Excel's built-in functions and formulas to generate data that fits your educational objectives. Here are some ways to generate synthetic data:
Random Numbers: Use Excel's RAND() function to generate random numbers between 0 and 1. You can scale and manipulate these random values as needed. For example, to generate random ages between 18 and 65, you can use the formula =INT(RAND()*(65-18+1))+18.
Categorical Data: If your dataset includes categorical variables, you can manually enter values or use Excel's Data Validation feature to create drop-down lists.
Formulas: Create formulas to generate data based on specific patterns or relationships. For instance, you can use a formula like =IF(B2>50000,"High","Low") to categorize income as "High" or "Low" based on a certain threshold.
Duplicate Data: If you need to replicate values for specific patterns, you can use Excel's copy and paste functions.
5. Data Validation: Ensure that the generated data adheres to realistic constraints. For instance, ages should be within a reasonable range, and income values should be non-negative.
6. Data Formatting: Format the data appropriately. For example, you can format numerical columns as numbers, dates as dates, and text columns as text.
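The same kind of synthetic dataset can also be generated programmatically. The sketch below mirrors the Excel steps above in Python with Pandas and NumPy, using the example attributes "Age," "Income," and "Marital Status" from step 3; the row count, distributions, and file name are assumptions for illustration.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows = 100  # step 7 of the answer: decide how many rows you need

df = pd.DataFrame({
    # random ages between 18 and 65, like =INT(RAND()*(65-18+1))+18
    "Age": rng.integers(18, 66, size=n_rows),
    # non-negative incomes (the validation constraint from step 5)
    "Income": rng.normal(50000, 15000, size=n_rows).clip(min=0).round(2),
    # categorical attribute, like an Excel drop-down list
    "Marital Status": rng.choice(["Single", "Married", "Divorced"], size=n_rows),
})
# formula-style derived column, like =IF(B2>50000,"High","Low")
df["Income Band"] = np.where(df["Income"] > 50000, "High", "Low")

# save the workbook (requires openpyxl; df.to_csv(...) works without it)
df.to_excel("synthetic_dataset.xlsx", index=False)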
7. Data Size: Determine how many rows of data you need for your educational purposes. Create as many rows as necessary to illustrate the concepts effectively.
8. Save the Workbook: Save your Excel workbook with an appropriate name and file extension (e.g., .xlsx).
9. Documentation: It's essential to document the synthetic dataset, including the purpose, data generation methods, and any assumptions made during the process. This documentation will help students understand the dataset's context.
10. Share or Use in Your Teaching: You can share the Excel workbook with your students or use it as part of your teaching materials to demonstrate various AI and ML concepts, such as data preprocessing, feature engineering, and model training.
Creating synthetic datasets in Excel is a useful educational tool to help students gain hands-on experience and a better understanding of data manipulation and analysis within the context of AI and ML.
Q.2. (c) What are the most important traits of a successful data scientist? (6)
Ans. In the context of the AI & ML branch as part of the Foundations of Data Science subject, a successful data scientist should possess a combination of technical skills, domain knowledge, and personal attributes.
Here are some of the most important traits and qualities of a successful data scientist:
1. Strong Analytical Skills: Data scientists need to be adept at analyzing complex data, identifying patterns, and drawing meaningful insights. This involves the ability to think critically and solve problems using data-driven approaches.
2. Statistical Knowledge: Proficiency in statistics is crucial for understanding data distributions, hypothesis testing, and making informed decisions based on data. It's foundational for many AI and ML techniques.
3. Programming Skills: Data scientists should be proficient in programming languages like Python or R. They need to write code to clean, preprocess, and analyze data, as well as develop machine learning models.
4. Data Wrangling: Data often comes in messy, unstructured formats. Successful data scientists have the skills to clean and preprocess data, which can be a time-consuming but crucial part of the job.
5. Machine Learning and AI Knowledge: An understanding of machine learning algorithms, deep learning, and AI techniques is essential. This includes knowing when and how to apply different algorithms and models to solve specific problems.
6. Domain Expertise: In many cases, data scientists work within specific industries, such as healthcare, finance, or marketing. Domain expertise allows them to understand the context of the data and make more meaningful recommendations.
7. Data Visualization: Data scientists should be able to effectively communicate their findings to both technical and non-technical audiences. Data visualization skills using tools like Matplotlib, Seaborn, or Tableau are valuable for this purpose.
8. Communication Skills: Data scientists need to collaborate with cross-functional teams, including engineers, product managers, and business stakeholders. The ability to communicate complex findings and solutions in a clear and concise manner is crucial.
9. Continuous Learning: The field of data science is rapidly evolving. Successful data scientists are committed to continuous learning and staying up-to-date with the latest tools, techniques, and research.
10. Creativity: Problem solving in data science often raises creative challenges. Data scientists should be able to explore new approaches and adapt to unique problems.
11. Ethical Awareness: Data scientists handle sensitive and personal data. Ethical considerations, such as privacy and bias, should always be part of their decision-making process.
12. Project Management: Data scientists often work on multiple projects simultaneously. Strong project management skills help in setting priorities, meeting deadlines, and delivering results.
13. Resilience and Patience: Data science projects can be challenging and may not always yield immediate results. Being persistent and patient when faced with setbacks is important.
14. Team Player: Data science projects are typically collaborative efforts. Being a team player, open to feedback, and willing to share knowledge is valuable.
15. Business Acumen: Understanding the business objectives and how their data science work can contribute to achieving them is crucial. Data scientists should align with the organization's goals.
In the context of the Foundations of Data Science subject in the AI & ML branch, teaching these traits and skills is essential to prepare students for success in the field. Providing practical exercises, real-world projects, and opportunities for students to work on interdisciplinary teams can help them develop these qualities and become effective data scientists.
Q.3. (a) What are some common techniques for processing unstructured data in data science? (5)
Ans. Data processing is a critical step in the data science workflow, involving the cleaning, transformation, and preparation of raw data for analysis. Here are some common techniques and processes used in data processing in data science:
3. Programming Skillss Data scientists should be proficient in programming 1. Data Cleaning:
data using techniques like mean,
languages like Python or R. They need to write code to clean,preprocess, and analyze -Handling Missing Values: Impute missing
imputation methods.
data, as well as develop machine learning models. median, mode, or more advanced
4. Data Wrangling: Data often comes in messy, unstructured formats. Successful -Outlier Detection and Treatment: Identify and handle outliers through techniques
(Interquartile Range), or machine learning models.
data scientists have the skills to clean and preprocess data, which can be a time like z-scores, 1QR consistency and integrity by validating values
consuming but crucial part of the job. Data Validation: Ensure data
5. Machine Learning andAI Knowledge: An understandingofmachine learning against predefined rvles or constraints.
algorithms, deep learning, and Al techniques is essential. This includes knowing when -Deduplication: Remove duplicate records from the dataset.
and formats to ensure
and how toapply different algorithms and models to solve specifc problems. -Standardization: Standardize units of measurement
6. Domain Expertise: In many cases, data scientists work within specific
industries, such as consistency.
healthcare, finance, or marketing. Domain expertise allows them to 2. Data Transformation:
understand the context of the data and make more meaningful recommendations. 0 to 1)to avod
7. Data Visualization: Data scientists should be able to effectively
-Normalization: Scale numericul features to a standard range (eg.,
communicate dominance of certain variables.
their indings to both technical and non-technical audiences. Data visualization skills variables into numerical format
using tools like Matplotlib, Seaborn, or Tableau are valuable for this purpose. -Encoding Categorical Data: Convert categorical binary encoding.
8, Communication Skills: Data scientists need to collaborate with using techniques like one-hot encoding, label encoding, or
cross -Feature Engineering: Create new features from existing ones to improve model
functional teams, including engineers, product managers, and business stakeholders. aggregations, or domain
The ability to communicate complex findings and solutions in a clear and concise performance. This muy involve mathematical transformations,
is erucial. manner specific knowledge.
tasks like removing
9. Continuous Learning: The field of data science is rapidly evolving. -Text Preprocessing: Tokenize and clean text data, including
data scientists are committed to continuous learning and staying up-to-dateSuccessful
with the stop words, stemming, and lemmatization.
time columns,
latest tools, techniques, and research. -Dute and Time Parsing: Extract relovant information from date and
such as yeur, month, day, or day of the week.
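A small Pandas sketch of the cleaning and transformation steps listed above (the tiny DataFrame and its column names are invented purely for illustration):

import pandas as pd

df = pd.DataFrame({
    "age":    [25, None, 47, 47, 120],            # a missing value and an implausible outlier
    "income": [30000, 52000, None, None, 61000],
    "city":   ["Delhi", "delhi ", "Mumbai", "Mumbai", "Pune"],
})

# 1. Data cleaning
df["age"] = df["age"].fillna(df["age"].median())       # impute missing values
df["income"] = df["income"].fillna(df["income"].mean())
df["city"] = df["city"].str.strip().str.title()        # standardize formats
df = df.drop_duplicates()                              # deduplication
df = df[df["age"].between(0, 100)]                     # simple rule-based validation / outlier check

# 2. Data transformation
df["income_norm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())
df = pd.get_dummies(df, columns=["city"])              # one-hot encode a categorical column
print(df)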
2. Data Reduction: Clustering can reduce the complexity of large datasets by grouping similar data points into clusters. This simplifies data analysis and makes it more manageable.
3. Anomaly Detection: Clustering can also be used to identify outliers or anomalies. Data points that do not belong to any cluster can be considered as potential anomalies.
4. Recommendation Systems: In recommendation systems, clustering is often used to group users or items with similar preferences. This can be useful for making recommendations based on user behavior or content similarity.
Common clustering algorithms in data science include:
1. K-Means: K-Means is one of the most popular and widely used clustering algorithms. It partitions data points into 'k' clusters based on their proximity to the centroids of those clusters. It's efficient and works well for spherical clusters.
2. Hierarchical Clustering: Hierarchical clustering builds a tree-like structure of clusters, starting with individual data points and merging them into larger clusters. It can be agglomerative (bottom-up) or divisive (top-down).
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise): DBSCAN groups together data points that are close to each other in terms of density. It can find clusters of arbitrary shapes and is robust to noise.
4. Gaussian Mixture Model (GMM): GMM is a probabilistic model that assumes that the data is generated from a mixture of several Gaussian distributions. It can identify clusters with different shapes and sizes.
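A minimal scikit-learn sketch of the K-Means algorithm described above, run on two invented, well-separated groups of points (the data and parameter choices are illustrative assumptions):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two well-separated blobs of 2-D points
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster centroids:\n", kmeans.cluster_centers_)
print("First ten labels:", kmeans.labels_[:10])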
Q.1. (c) What is data exploration and why is it important in data science? What is the purpose of data normalization in data science? (3)
Ans. Data exploration, also known as exploratory data analysis (EDA), is a crucial and often the initial step in the data science process. It involves the preliminary investigation of data to understand its structure, patterns, and characteristics.
Data exploration is important in data science for several reasons:
1. Understanding the Data: Data exploration helps data scientists gain an initial understanding of the dataset they are working with. This includes understanding the data's size, format, and the meaning of each variable or feature.
2. Identifying Data Quality Issues: During data exploration, data scientists can identify and address data quality issues such as missing values, outliers, and inconsistencies. This is essential for ensuring the data's reliability and accuracy.
3. Detecting Patterns and Relationships: Exploring the data allows you to identify patterns and relationships within the dataset. This can include correlations between variables, trends over time, and other meaningful insights.
4. Feature Selection: Data exploration can guide the process of feature selection by identifying which variables are relevant and informative for the analysis or modeling task, and which can be omitted to simplify the problem.
5. Data Visualization: Data exploration often involves data visualization techniques, which can make it easier to convey insights and trends to stakeholders. Visualizations can help in understanding data at a glance and communicating findings effectively.
Q.1. (d) What is data cleaning and why is it necessary in the data science process? What is data visualization and what are its benefits in data science? (3)
Ans. Data Cleaning: Data cleaning, also known as data preprocessing or data wrangling, is the process of identifying and correcting errors, inconsistencies, and inaccuracies in a dataset to improve its quality and reliability. Data cleaning is necessary in the data science process for the following reasons:
(i) Data Quality Improvement: Raw data often contains missing values, outliers, and noisy data. Data cleaning helps rectify these issues, ensuring that the data is of high quality and can be used for meaningful analysis.
(ii) Accuracy: Accurate data is crucial for building reliable models and drawing accurate conclusions. Data cleaning aims to correct errors that could lead to incorrect or biased results.
(iii) Consistency: Inconsistencies in data formatting, units of measurement, or coding can hinder analysis. Data cleaning standardizes and harmonizes the data for consistency.
(iv) Completeness: Missing data can lead to incomplete analyses. Data cleaning involves strategies to deal with missing values, such as imputation, to ensure that the dataset is as complete as possible.
Data Visualization: Data visualization is the graphical representation of data to convey information effectively. It involves creating visual representations such as charts, graphs, maps, and dashboards to help people understand the patterns, trends, and insights within data. Data visualization is essential in data science for the following reasons:
(i) Insight Discovery: Data visualization enables data scientists and analysts to discover patterns, trends, and relationships within the data that might not be evident from raw numbers. Visualizations can highlight key insights.
(ii) Communication: Visualizations are a powerful way to communicate complex data findings to both technical and non-technical stakeholders. They provide a clear and intuitive means of conveying information.
(iii) Exploratory Data Analysis (EDA): Data visualization is an integral part of EDA. Visualizations can help in the initial understanding of data and the generation of hypotheses for further analysis.
(iv) Decision-Making: Visual representations of data aid in decision-making processes by providing a visual summary of relevant information. This is valuable in business, research, and policy-making.
Q.1. (e) What is data aggregation and how is it used in data science? What is data modeling and how is it different from analysis? (3)
Ans. Data Aggregation: Data aggregation is the process of combining and summarizing data from multiple sources, records, or variables into a single, more compact form. It is used in data science to simplify the data and make it more manageable, often with the goal of extracting meaningful insights or preparing data for analysis.
Here are some common use cases for data aggregation:
(i) Summarizing Data: Aggregation can be used to calculate summary statistics such as means, medians, sums, counts, or other measures for specific groups or categories within the data.
(ii) Temporal Aggregation: Time-series data can be aggregated to different time intervals (e.g., daily, monthly) to create a higher-level view of the data for trend analysis and pattern recognition.
(iii) Spatial Aggregation: Geospatial data can be aggregated to different levels of granularity (e.g., at the country, state, or city level) for regional analysis.
Data Modeling: Data modeling in the context of data science refers to the process of creating and using mathematical or computational models to make predictions, draw inferences, or gain a deeper understanding of a dataset. It is different from data analysis, which focuses on exploring and describing the data's current state and patterns.
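A short Pandas sketch of aggregation use cases (i) and (ii) above, using an invented sales table (the column names and values are assumptions):

import pandas as pd

sales = pd.DataFrame({
    "date":   pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-03", "2023-02-15"]),
    "region": ["North", "South", "North", "South"],
    "amount": [120.0, 80.0, 150.0, 95.0],
})

# (i) summarizing data: summary statistics per group
print(sales.groupby("region")["amount"].agg(["mean", "sum", "count"]))

# (ii) temporal aggregation: monthly totals from the daily records
print(sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum())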
3. Data Integration:
- Merging and Joining: Combine data from multiple sources or tables using operations like joins (e.g., SQL joins) or merging dataframes.
- Concatenation: Stack or concatenate data vertically or horizontally, often used with time series data.
- Data Aggregation: Group data by specific attributes and calculate summary statistics or aggregations.
4. Data Reduction:
- Dimensionality Reduction: Use techniques like Principal Component Analysis (PCA) or t-SNE to reduce the number of features while preserving important information.
- Sampling: Reduce the size of large datasets by selecting a representative sample.
5. Data Reshaping:
- Pivoting and Melting: Reshape data between wide and long formats to suit different analysis needs.
- Stacking and Unstacking: Rearrange multi-level index dataframes for better readability.
6. Data Scaling:
- Min-Max Scaling: Scale numerical features to a specified range.
- Standardization (Z-Score Scaling): Transform data to have a mean of 0 and a standard deviation of 1.
7. Data Splitting:
- Train-Test Split: Divide data into training and testing sets to evaluate machine learning models.
- Cross-Validation: Split data into multiple subsets (folds) for model validation and hyperparameter tuning.
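The following sketch illustrates three of the techniques above - merging (3), pivoting and melting (5), and a train-test split (7) - with Pandas and scikit-learn; the two small tables and the simple label are invented for illustration:

import pandas as pd
from sklearn.model_selection import train_test_split

# data integration: merge two tables on a shared key
customers = pd.DataFrame({"cust_id": [1, 2, 3], "segment": ["A", "B", "A"]})
orders = pd.DataFrame({"cust_id": [1, 1, 2, 3],
                       "month": ["Jan", "Feb", "Jan", "Feb"],
                       "amount": [10, 20, 15, 30]})
merged = orders.merge(customers, on="cust_id", how="left")

# data reshaping: long -> wide with pivot_table, wide -> long again with melt
wide = merged.pivot_table(index="cust_id", columns="month", values="amount", aggfunc="sum")
long_again = wide.reset_index().melt(id_vars="cust_id", var_name="month", value_name="amount")

# data splitting: hold out a test set for model evaluation
X = merged[["amount"]]
y = (merged["segment"] == "A").astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
print(wide, X_train.shape, X_test.shape, sep="\n")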
8. Time Series Preprocessing:
- Lag Features: Create lagged versions of time series data for time-dependent modeling.
- Rolling Statistics: Calculate rolling means, medians, or other statistics to capture trends and patterns.
9. Handling Imbalanced Data:
- Resampling: Address class imbalance by oversampling the minority class or undersampling the majority class.
- Synthetic Data Generation: Generate synthetic data points for minority classes using techniques like SMOTE (Synthetic Minority Over-sampling Technique).
10. Data Validation and Quality Control:
- Data Profiling: Examine data distributions, summary statistics, and data quality issues.
- Data Quality Checks: Implement automated checks for data quality, such as ensuring data consistency and adherence to constraints.
These data processing techniques are fundamental to the data science process and play a crucial role in preparing data for modeling, analysis, and visualization. Depending on the specific project and data, data scientists may employ a combination of these techniques to ensure the data is suitable for the task at hand.
Q.3. (b) What are the challenges in processing unstructured data in data science? (5)
Ans. Processing unstructured data in data science presents several challenges due to the lack of a predefined structure, making it more complex and requiring specialized techniques and tools. Here are some of the key challenges:
1. Data Collection and Extraction:
- Data Sources: Unstructured data can come from various sources, including text, images, audio, and video. Collecting and extracting data from these sources can be challenging, especially when dealing with web scraping, social media content, or multimedia data.
- Data Volume: Unstructured data is often large in volume, making it computationally intensive to collect and store.
2. Data Preprocessing:
- Text Parsing: In text data, extracting meaningful information from unstructured text involves tokenization, stop word removal, stemming, and handling different languages and character encodings.
- Image and Video Processing: Analyzing images and videos requires techniques like image segmentation, object detection, and video frame extraction.
- Speech and Audio Processing: Analyzing audio data involves speech recognition and audio feature extraction.
3. Data Cleaning and Quality:
- Noise and Irrelevant Information: Unstructured data may contain noise, irrelevant content, or inconsistencies, requiring data cleaning and validation.
- Data Anomalies: Identifying anomalies or outliers in unstructured data can be challenging due to the lack of a clear structure.
4. Feature Extraction:
- Dimensionality: Unstructured data often has high dimensionality, making it challenging to extract relevant features without creating a large number of features.
- Semantic Understanding: Extracting meaningful features from unstructured data often requires a deep understanding of the domain and context.
5. Data Integration:
- Combining Data Types: Integrating different types of unstructured data, such as text and images, for comprehensive analysis can be complex.
- Multimodal Data: Combining information from multiple modalities (e.g., text and images) can be challenging but necessary for some applications.
6. Computational Resources:
- Resource Intensive: Processing unstructured data can be computationally intensive, requiring powerful hardware and distributed computing resources.
7. Natural Language Understanding:
- Semantic Ambiguity: Text data often contains semantic ambiguity, humor, sarcasm, and context-dependent meanings that are challenging to interpret accurately.
8. Scalability and Performance:
- Scalability: Handling large volumes of unstructured data while maintaining processing efficiency is a challenge.
- Real-Time Processing: Some applications require real-time or near-real-time analysis of unstructured data, demanding high performance.
9. Privacy and Security:
- Data Privacy: Unstructured data may contain sensitive information, requiring robust privacy protection measures.
- Security: Ensuring the security of unstructured data during processing and storage is critical.
10. Machine Learning and Models:
- Model Selection: Choosing the right machine learning or deep learning model for unstructured data can be challenging, as it depends on the data type and task.
MID TERM EXAMINATION [JAN. 2023]
THIRD SEMESTER [B.TECH]
FOUNDATIONS OF DATA SCIENCE [AIDS/AIML/IOT-203]
Time: 1.5 Hrs. Max. Marks: 30
Note: Question No. 1 is compulsory. Attempt any two questions from the remaining questions.
Q.1. (a) Why do we need Machine Learning in data science? (2.5)
Q.1. (b) Explain supervised and unsupervised machine learning. (2.5)
Q.1. (c) What is the role of decision-making in data modeling? (2.5)
Q.1. (d) Write any four functions of the SciPy library. (2.5)
Q.2. (a) What advantages does a NumPy array offer over a nested Python list? (6)
Q.2. (b) Write a Python program for random number generation. (6)
Q.3. (a) Write code to design any recommender system. (6)
Q.3. (b) Discuss in detail Pandas in Python with suitable examples. (6)
Q.4. (a) Explain the steps for the predictive model using Python. (6)
Q.4. (b) Describe plotting and visualization concepts in Python. (6)

END TERM EXAMINATION
THIRD SEMESTER [B.TECH]
FOUNDATIONS OF DATA SCIENCE [AIDS/AIML/IOT-203]
Time: 3 Hrs. Max. Marks: 75
Note: Attempt five questions in all including Q. No. 1 which is compulsory. Select one question from each unit.
Q.1. Attempt all questions.
Q.1. (a) What is data science? What is Seaborn and why is it a popular library for data visualization in Python? (3)
Ans. Data Science: Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract insights and knowledge from structured and unstructured data. It combines various domains such as statistics, computer science, machine learning, domain knowledge, and data engineering to analyze and interpret complex data. The primary goal of data science is to turn data into actionable insights, predictions, and recommendations that can inform decision-making and solve real-world problems.
Seaborn: Seaborn is a popular Python data visualization library built on top of Matplotlib, another widely used data visualization library. Seaborn is specifically designed for creating informative and attractive statistical graphics. It provides a high-level interface for creating aesthetically pleasing and informative statistical visualizations with minimal code.
Here are some reasons why Seaborn is a popular library for data visualization in Python:
1. Simplified Syntax: Seaborn simplifies the process of creating complex statistical plots and graphs, making it easier for data scientists and analysts to generate informative visualizations with less effort and code.
2. Stylish Defaults: Seaborn comes with appealing default color palettes and themes, which make it easier to create aesthetically pleasing plots without the need for extensive customization.
3. Integration with Pandas: Seaborn seamlessly integrates with Pandas DataFrames, which is the primary data structure for data manipulation in Python. This integration allows you to work with data more efficiently.
4. Statistical Visualizations: Seaborn offers built-in support for various statistical plots like bar plots, violin plots, box plots, and regression plots. These plots are often used in data analysis to explore relationships and distributions in data.
5. Facet Grids: Seaborn provides the FacetGrid feature that allows you to create multiple plots side by side, each corresponding to a subset of the data based on one or more categorical variables. This is useful for exploring data from different angles.
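A short Seaborn sketch tying these points together (it uses Seaborn's bundled 'tips' example dataset, which is downloaded on first use; any Pandas DataFrame works the same way):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")  # example dataset; requires internet on first use

sns.set_theme()  # Seaborn's styled defaults
sns.boxplot(data=tips, x="day", y="total_bill", hue="sex")
plt.title("Total bill by day")
plt.show()

# FacetGrid-style small multiples via the figure-level displot interface
sns.displot(data=tips, x="total_bill", col="time", kde=True)
plt.show()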
Q.1. (b) What is the purpose of clustering in data science and what are some common algorithms used in clustering? (3)
Ans. Clustering is a fundamental task in data science and machine learning. Its primary purpose is to group similar data points together based on some similarity or distance metric. The main goals of clustering are:
1. Pattern Discovery: Clustering helps discover hidden patterns or structures within data. By grouping similar data points, it can reveal natural clusters or segments in the data, which can provide valuable insights.
- Training Data: Annotating and labeling unstructured data for supervised learning can be time-consuming and resource-intensive.
11. Domain Knowledge:
- Interdisciplinary Skills: Processing unstructured data often requires domain-specific expertise to interpret and analyze the data effectively.
12. Data Interpretability:
- Interpreting Results: Understanding and interpreting the results of analyses on unstructured data can be challenging, particularly when using complex models like deep neural networks.
Despite these challenges, processing unstructured data in data science is essential because it holds valuable insights and is prevalent in many real-world applications, such as natural language processing, image and video analysis, sentiment analysis, and more. Overcoming these challenges often involves a combination of domain knowledge, specialized tools, and creative problem solving.
Q.3. (c) What are the primary responsibilities of a data scientist? What skills does a data scientist need? (6)
Ans. A data scientist plays a crucial role in leveraging data to derive insights, solve complex problems, and inform decision-making in various domains. The primary responsibilities of a data scientist encompass a wide range of tasks, and they need a diverse skill set to excel in their role. Here are the primary responsibilities and the essential skills required for a data scientist:
Primary Responsibilities:
1. Data Collection: Gather data from various sources such as databases, APIs, web scraping, or data streams. Ensure data quality and consistency.
2. Data Cleaning and Preprocessing: Prepare data for analysis by cleaning, transforming, and structuring it. Handle missing values, outliers, and inconsistencies.
3. Exploratory Data Analysis (EDA): Conduct EDA to understand the data's characteristics, distributions, patterns, and potential correlations. Visualization tools are often used for this task.
4. Feature Engineering: Create new features from existing data to improve model performance. This involves domain knowledge and creativity.
5. Model Development: Develop and implement machine learning models and algorithms to address specific business or research problems. This includes selecting appropriate models, feature selection, and hyperparameter tuning.
6. Model Evaluation and Validation: Assess the performance of machine learning models using metrics like accuracy, precision, recall, F1-score, and cross-validation techniques. Detect and address issues like overfitting or underfitting.
7. Data Interpretation: Interpret model results and provide actionable insights to stakeholders. Explain complex technical findings in a clear, understandable manner.
8. Experimentation and A/B Testing: Design and conduct experiments to evaluate the impact of changes or interventions. A/B testing is common in web and product optimization.
9. Machine Learning Pipelines: Develop and deploy end-to-end machine learning pipelines that automate data processing, model training, and deployment.
10. Data Visualization: Create visualizations and reports to communicate findings effectively. Tools like Matplotlib, Seaborn, Tableau, or Power BI are often used.
11. Big Data Technologies: Work with big data frameworks and tools like Hadoop, Spark, and distributed databases when dealing with large datasets.
12. Programming and Scripting: Proficiency in programming languages like Python or R is essential for data manipulation, analysis, and model development.
13. Database Skills: SQL skills are crucial for data retrieval, manipulation, and working with relational databases effectively.
14. Domain Knowledge: Understand the domain or industry in which you're working. Domain knowledge is crucial for framing relevant questions, selecting features, and interpreting results effectively.
15. Communication and Collaboration: Collaborate with cross-functional teams, including non-technical stakeholders, and effectively communicate data-driven insights and recommendations.
Skills Required:
1. Statistical Knowledge: A solid understanding of statistics is vital for hypothesis testing, model selection, and interpreting data.
2. Machine Learning: Proficiency in machine learning algorithms and techniques, including supervised and unsupervised learning, regression, classification, clustering, and deep learning.
3. Data Manipulation: Skills in data manipulation libraries like NumPy and Pandas for data cleansing, transformation, and analysis.
4. Data Visualization: Ability to create informative and visually appealing plots and graphs using libraries like Matplotlib and Seaborn.
5. Programming: Strong programming skills in languages like Python or R, with the ability to write clean, efficient, and maintainable code.
6. Database and SQL: Knowledge of relational databases and SQL for data retrieval and manipulation.
7. Big Data Technologies: Familiarity with big data technologies like Hadoop and Spark for processing large-scale datasets.
8. Tools and Frameworks: Proficiency in relevant data science tools and libraries, including scikit-learn, TensorFlow, PyTorch, and Jupyter notebooks.
9. Communication: Strong written and verbal communication skills to convey findings and insights effectively to both technical and non-technical audiences.
10. Problem-Solving: The ability to approach complex problems analytically and develop innovative solutions.
11. Business Acumen: Understanding of business objectives and the ability to align data science efforts with organizational goals.
12. Ethical Considerations: Awareness of ethical and privacy issues related to data collection, handling, and analysis.
13. Continuous Learning: A commitment to staying updated with the latest developments in data science and machine learning.
Data scientists often work in multidisciplinary teams and are expected to adapt to evolving technologies and data challenges. The combination of technical expertise, domain knowledge, and effective communication skills is key to success in this role.

UNIT-II

Q.4. (a) How can you load and explore a dataset in Python using Pandas? (5)
Ans. Loading and exploring a dataset in Python using Pandas involves reading the data into a Pandas DataFrame and then performing preliminary exploratory data analysis (EDA) to understand its structure and content.
Here's a step-by-step guide on how to do this:
7. Deep Learning-Based Imputation:
- Utilize deep learning models, like autoencoders or generative adversarial networks (GANs), to generate synthetic data for missing values.
8. Time-Series Handling:
- In time-series data, use methods like forward filling, backward filling, or interpolation based on the temporal characteristics of the data.
9. Custom Strategies:
- Design custom strategies for imputation based on the specific characteristics of your data and the nature of the missingness.
10. Combine Approaches:
- Depending on the dataset and the nature of missing data, it's often beneficial to combine multiple imputation techniques to enhance imputation accuracy.
The choice of which technique to use depends on the nature of the data, the reasons for missingness, and the potential impact on downstream analysis or modeling. It's important to carefully consider the implications of each technique and assess the potential bias or distortion it may introduce to your data. Additionally, documentation of the missing data handling process is essential to ensure transparency and reproducibility in data analysis.
Q.4. (c) How do you handle exceptions in Python? (6)
Ans. Handling exceptions in Python allows you to gracefully deal with errors and unexpected situations in your code, preventing it from crashing. Python provides a straightforward mechanism for exception handling using the 'try', 'except', 'else', and 'finally' blocks. Here's how you can handle exceptions in Python:
try:
    # Code that may raise an exception
    result = 10 / 0  # Example: Division by zero
except ExceptionType as e:  # placeholder: replace with a real exception class such as ZeroDivisionError
    # Code to handle the exception
    print(f"An exception occurred: {e}")
else:
    # Code to execute if no exception occurs
    print("No exception occurred.")
finally:
    # Code that always runs, whether an exception occurred or not
    print("Finally block executed.")
In the code above:
- The 'try' block contains the code that may raise an exception. In this case, we're attempting to divide by zero, which will raise a 'ZeroDivisionError'.
- The 'except' block follows the 'try' block and specifies the type of exception to catch ('ExceptionType'). If an exception of that type occurs, the code inside the 'except' block will run. You can catch specific exceptions (e.g., ValueError, FileNotFoundError) or use a more general 'Exception' to catch any exception.
- Inside the 'except' block, you can handle the exception in various ways, such as printing an error message, logging, or taking corrective action.
- The 'else' block, if present, contains code that runs if no exception occurs within the 'try' block. It is optional.
- The 'finally' block, if present, contains code that always runs, regardless of whether an exception occurred or not. This block is often used for cleanup operations like closing files or releasing resources.
Here are some additional techniques and considerations for handling exceptions in Python:
1. Multiple 'except' Blocks: You can have multiple 'except' blocks to handle different types of exceptions.
try:
    pass  # Code that may raise exceptions
except ValueError:
    pass  # Handle ValueError
except ZeroDivisionError:
    pass  # Handle ZeroDivisionError
2. Raising Exceptions: You can raise exceptions using the 'raise' statement to indicate errors explicitly.
if some_condition:
    raise ValueError("This is a custom exception message")
3. Exception Handling Hierarchy: Python has a hierarchy of exceptions, with 'BaseException' at the top. You can catch exceptions at different levels of specificity, catching more specific exception classes before broader ones.
4. Custom Exceptions: You can define custom exceptions by creating a new exception class that inherits from 'Exception' or a more specific built-in exception class.
class CustomError(Exception):
    pass

try:
    if some_condition:
        raise CustomError("This is a custom exception")
except CustomError as ce:
    print(f"Custom exception caught: {ce}")
Exception handling is essential for robust and fault-tolerant code. It allows you to gracefully handle errors, provide informative error messages, and take appropriate actions to recover from unexpected situations.
Q.5. (a) What is feature scaling and why is it important? How can we visualize the distribution of a numerical feature in a dataset in Python? (5)
Ans. Feature scaling is a preprocessing technique used in machine learning to standardize or normalize the range of independent variables or features of data. It's important because many machine learning algorithms are sensitive to the scale of input features. Feature scaling ensures that all features contribute equally to the learning process and prevents certain features from dominating due to their larger scale. Here are two common methods for feature scaling:
1. Standardization (Z-Score Scaling):
- Standardization scales the features to have a mean of 0 and a standard deviation of 1. It transforms the data so that it has a normal distribution.
- Formula for standardization: z = (x - mean) / standard_deviation
2. Min-Max Scaling:
- Min-Max scaling scales the features to a specified range, typically [0, 1]. It preserves the original distribution of the data but squashes it into a smaller range.
- Formula for Min-Max scaling: x_scaled = (x - min) / (max - min)
The choice between standardization and Min-Max scaling depends on the specific dataset and the requirements of the machine learning algorithm you plan to use.
Visualizing the Distribution of a Numerical Feature in Python:
To visualize the distribution of a numerical feature in a dataset in Python, you can use various plotting libraries such as Matplotlib or Seaborn.
Here's how to do it using Matplotlib:
import matplotlib.pyplot as plt
# Assuming 'data' is your DataFrame and 'column_name' is the name of the numerical feature
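The printed snippet above breaks off at the page boundary. A completed version of the same histogram, plus the Seaborn alternative the answer mentions, might look like the following; the small DataFrame stands in for the 'data' and 'column_name' placeholders and is an assumption:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# placeholder data standing in for 'data' / 'column_name' from the answer
data = pd.DataFrame({"column_name": [23, 45, 31, 50, 29, 41, 37, 44, 52, 36]})

# Matplotlib histogram of the numerical feature
plt.hist(data["column_name"], bins=5)
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram of column_name")
plt.show()

# Seaborn alternative mentioned in the answer
sns.histplot(data=data, x="column_name", bins=5, kde=True)
plt.show()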
Loading the Dataset:
Assuming you have a dataset in CSV format (but Pandas supports various other formats like Excel, SQL, etc.), let's load it using Pandas:
import pandas as pd
# Load the dataset (assuming it's a CSV file)
# Replace 'your_dataset.csv' with the actual filename and path
file_path = 'your_dataset.csv'
df = pd.read_csv(file_path)
# Display the first few rows to get a preview of the dataset
print("First 5 rows of the dataset:")
print(df.head())
Exploring the Dataset:
Once the dataset is loaded into a Pandas DataFrame, you can explore it in various ways:
1. Basic Information:
- Display basic information about the DataFrame, including the number of rows and columns, data types, and memory usage.
# Display basic information about the DataFrame
print("\nDataset Information:")
print(df.info())
2. Summary Statistics:
- Calculate summary statistics like mean, median, min, max, and quartiles for numerical columns.
# Display summary statistics for numerical columns
print("\nSummary Statistics:")
print(df.describe())
3. Column Names:
- Display the column names (features) of the DataFrame.
# Display column names
print("\nColumn Names:")
print(df.columns)
4. Data Types of Columns:
- Display the data types of each column in the DataFrame.
# Display data types of columns
print("\nData Types of Columns:")
print(df.dtypes)
5. Number of Unique Values:
- Count the number of unique values in each column.
# Display the number of unique values in each column
print("\nNumber of Unique Values in Each Column:")
print(df.nunique())
6. Missing Values:
- Check for missing values in the dataset.
# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())
7. Exploratory Data Analysis (EDA):
- Perform additional exploratory analysis based on the specific dataset and objectives. This can include plotting histograms, scatter plots, correlation matrices, etc.
# Example: Plotting a histogram for a numerical column
import matplotlib.pyplot as plt
plt.hist(df['numerical_column'], bins=20)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram of Numerical Column')
plt.show()
By utilizing these methods, you can effectively explore and understand the structure and characteristics of the dataset using Pandas in Python. Adjust the exploration based on the nature of your dataset and the insights you aim to derive.
Q.4. (b) What are some common techniques for handling missing data in a dataset?
Ans. Handling missing data is a crucial step in data preprocessing to ensure the quality and integrity of your dataset. There are several common techniques for handling missing data:
1. Deletion:
- Listwise Deletion (Row Deletion): Remove entire rows with missing values. This is a straightforward approach but can result in a loss of valuable data if many rows have missing values.
- Pairwise Deletion: Analyze and perform operations on pairs of columns, ignoring missing values for each specific pair. This can be useful when working with specific calculations or models that can handle missing values within pairs of columns.
2. Imputation:
- Mean, Median, or Mode Imputation: Fill missing values with the mean (average), median (middle value), or mode (most frequent value) of the respective column. This method is simple but may not capture the true distribution of the data.
- Forward Fill (ffill) and Backward Fill (bfill): Propagate the last known non-missing value forward or the next known value backward in a time series or ordered dataset.
- Linear Interpolation: Estimate missing values by interpolating between adjacent data points. This is often used in time series data.
- K-Nearest Neighbors (KNN) Imputation: Impute missing values by averaging values from the K-nearest data points based on similarity metrics.
- Regression Imputation: Use regression models to predict missing values based on other variables in the dataset.
- Multiple Imputation: Generate multiple imputed datasets and analyze them separately, combining results to account for uncertainty.
3. Missing Value Indicators:
- Create binary indicator variables that represent whether a value is missing or not for each column. This can help models capture patterns related to missing data.
4. Domain-Specific Imputation:
- Utilize domain knowledge to determine appropriate imputation techniques. For instance, in a time series dataset, missing values might be imputed differently than in a customer demographic dataset.
5. Data Transformation:
- Convert categorical variables with missing values into a new category or label, such as "Unknown" or "Missing," to preserve the information that data is missing intentionally.
6. Model-Based Imputation:
- Use machine learning models, such as decision trees or random forests, to predict missing values based on other features in the dataset.
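A brief sketch of several of the imputation techniques above using Pandas and scikit-learn (the toy DataFrame and parameter choices are assumptions for illustration):

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 47, 51, np.nan],
                   "income": [30.0, 52.0, np.nan, 61.0, 45.0]})

# mean/median imputation with Pandas
df["age_median"] = df["age"].fillna(df["age"].median())

# forward fill / interpolation for ordered data
df["income_ffill"] = df["income"].ffill()
df["income_interp"] = df["income"].interpolate()

# scikit-learn imputers: column means, or K-nearest-neighbour based
mean_imputed = SimpleImputer(strategy="mean").fit_transform(df[["age", "income"]])
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(df[["age", "income"]])
print(df, mean_imputed, knn_imputed, sep="\n\n")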
# Example of filling missing values with the mean of the data
mean = df['column_name'].mean()
df['column_name'].fillna(mean, inplace=True)
4. Handling Duplicates: Check for and remove duplicate rows using Pandas. You can use the duplicated() and drop_duplicates() methods.
# Check for duplicate rows
duplicates = df.duplicated()
# Remove duplicates
df = df.drop_duplicates()
5. Data Transformation: You might need to perform various data transformations, such as converting data types, scaling, or encoding categorical variables. Pandas provides methods for these operations:
- Change data types: df['column_name'] = df['column_name'].astype(new_type)
- Scaling: Use NumPy to apply functions like standardization or min-max scaling.
- Encoding categorical variables: Use Pandas' get_dummies() or other encoding techniques.
# Example of one-hot encoding a categorical variable:
df = pd.get_dummies(df, columns=['categorical_column'])
6. Handling Outliers: Identify and potentially handle outliers in your dataset using NumPy or Pandas. You can use statistical methods like Z-scores or IQR to detect and handle outliers.
# Example of filtering out rows with outliers using the Z-score:
z_scores = np.abs((df - df.mean()) / df.std())
df = df[(z_scores < 3).all(axis=1)]  # Keep only rows within 3 standard deviations
7. Feature Engineering: Create new features or perform feature engineering using Pandas and NumPy. You can create new columns based on existing ones, apply mathematical operations, or extract information from text or datetime columns.
8. Data Scaling and Normalization: If needed, scale or normalize your data using NumPy functions to ensure that different features have similar scales, which can be important for machine learning algorithms.
# Example of min-max scaling using NumPy:
df['column_name'] = (df['column_name'] - df['column_name'].min()) / (df['column_name'].max() - df['column_name'].min())
9. Data Imputation: If you have missing data, you can impute it using methods like mean, median, or more advanced techniques like regression imputation.
10. Save Cleaned Data: Finally, save your cleaned and preprocessed data to a new file for future analysis or modeling.
df.to_csv('cleaned_data.csv', index=False)
Q.7. (a) How can you perform statistical analysis in Python using Scipy and Statsmodels? (7.5)
Ans. Statistical analysis of data refers to the extraction of some useful knowledge from vague or complex data. Python is widely used for statistical data analysis by using data frame objects such as pandas. Statistical analysis of data includes importing, cleaning, transformation, etc. of data in preparation for analysis. The dataset from a CSV file is considered to be analyzed by python libraries, which process every data item from preprocessing to end result. Some libraries in python, such as pandas, statsmodels, and seaborn, are used effectively to handle the analysis of such data. Python supports column comparison, data visualization, data plotting, data sorting, data representation, data indexing, alignment, handling of missing data, etc. Such operations are useful in the analysis of data and are handled by various libraries of python. Python allows the analysis of data to mix statistics with image analysis or text mining.
How to Perform Statistical Analysis?
There are different modules of statistical analysis of data processing by python methods.
1. Data Collection / Representation
The data can be anything related to business, polity, education, etc. that can be seen as a 2D table, or matrix, with columns giving the different attributes of the data and rows the observations. A dataset is a mixture of numerical and categorical values. Python can interact with data in CSV format by using the pandas library. This library is built on numpy, another library to handle the array data structure. Every column of a dataset is fetched into an array for further processing/analysis. The data can be an image that is further converted into a 2D matrix and stored into an array for further processing.
2. Descriptive Statistics
Descriptive statistics are used to identify hidden patterns in the data. It just describes the data through statistics; it doesn't make any predictions about the data. Several methods are used to analyze descriptive statistics of data, such as mean, median, mode, variance, and standard deviation. These mathematical statistics are applied to data in python using a library called statistics. This library contains all such mathematical methods for the descriptive analysis of data. This kind of statistical analysis helps the user to obtain basic statistical knowledge from complex data. As discussed above, the analysis is the extraction of some useful knowledge from complex data. The mean, median, and mode lie in central tendency statistics, in which the user is intended to extract the central or middle knowledge of complex data. Standard deviation statistics are used to measure the spread or variation in data from its actual mean. Variance is the analysis of how far individual data in a group are spread out; it is the square of the standard deviation.
3. Inferential Statistics
This type of statistical analysis is intended to extract inferences or hypotheses from a sample of large data. Prediction about the population is carried out from random samples of data. The prediction of the dependent variable based on the independent variable is carried out in inferential statistics. For gathering predictions about sample data, the model is trained with training samples and learns the correlation between dependent and independent variables. Based on its learning and type of model, the machine can make a prediction. Some technical terms used to make a prediction about sample data are listed below:
Z Score: The Z score is a way to compute the probability of data occurring within the normal distribution. It shows the relationship of different values in data with the mean of the data. To compute the Z score, we subtract the mean from each data value and divide the whole by the standard deviation. The Z score is computed for a column in the dataset. It tells whether a data value is typical for a specific dataset. The Z score helps us to decide whether to keep or reject the null hypothesis. The null hypothesis refers to the claim that there is no spatial pattern among the data values associated with the features. The Z score can be imported from the scipy library of python.
Z test: A Z test is used to analyze whether the means of two different samples of data are similar or different while knowing their variances and standard deviations. It is a hypothetical test that follows a normal distribution. It is used for large-size data samples. It tells if the two datasets are similar or not. In this case, the null hypothesis
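To make the Z score and Z test concrete, here is a minimal sketch using scipy.stats.zscore and the ztest function from statsmodels (the synthetic samples are assumptions, not from the printed answer):

import numpy as np
from scipy import stats
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(0)
column = rng.normal(loc=50, scale=10, size=200)   # stand-in for one numeric column

# Z scores: how far each value lies from the column mean, in standard deviations
z_scores = stats.zscore(column)
print("First five Z scores:", np.round(z_scores[:5], 2))

# Z test: are the means of two large samples different?
sample_a = rng.normal(loc=50, scale=10, size=200)
sample_b = rng.normal(loc=53, scale=10, size=200)
z_stat, p_value = ztest(sample_a, sample_b)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")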
Numeric Types:
int: Integer data type (e.g., 42, -10, 0).
float: Floating-point data type for decimal numbers (e.g., 3.14, -0.001).
Sequence Types:
str: String data type for representing text (e.g., 'hello', "Python").
list: List data type for holding a collection of items (e.g., [1, 2, 3]).
tuple: Tuple data type for holding an ordered, immutable collection of items (e.g., (1, 2, 3)).
Mapping Type:
dict: Dictionary data type for holding key-value pairs (e.g., {'key': 'value'}).
Set Types:
set: Set data type for holding a collection of unique elements (e.g., {1, 2, 3}).
Boolean Type:
bool: Boolean data type representing True or False.
None Type:
NoneType: A special data type representing the absence of a value or a null value (e.g., None).
The rules for declaring a variable in Python are as follows:
- Variable names must start with a letter (a-z, A-Z) or an underscore (_).
- They can be followed by letters, numbers (0-9), or underscores.
- Variable names are case-sensitive (e.g., 'myVar' and 'myvar' are different).
- Avoid using Python's reserved keywords as variable names.
- Avoid single-letter variable names except for short and well-defined scopes.
- Choose variable names that are not ambiguous and convey their purpose.
These rules and conventions ensure clean, readable, and maintainable Python code.

UNIT-III

Q.6. (a) What are some popular libraries in Python for data science and machine learning and what are they used for? How can you install and manage packages in Python for data science? (7.5)
Ans. Python has a rich ecosystem of libraries and packages for data science and machine learning. Here are some of the popular libraries and their typical uses:
1. NumPy: NumPy is fundamental for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
2. Pandas: Pandas is used for data manipulation and analysis. It offers data structures like DataFrames and Series, making it easy to handle structured data, including reading and writing data from various file formats.
3. Matplotlib: Matplotlib is a widely-used library for creating static, animated, and interactive visualizations in Python. It's highly customizable and suitable for a wide range of visualization needs.
7. Keras: Keras is an interface or high-level API that sits on top of TensorFlow (and other backends like Theano and Microsoft Cognitive Toolkit). It simplifies the development of deep learning models.
8. PyTorch: PyTorch is another popular deep learning library that is known for its flexibility and dynamic computation graph. It's widely used in research and development of neural networks.
9. XGBoost: XGBoost is an efficient and scalable gradient boosting library. It's used for supervised learning tasks, such as classification and regression, and is known for its predictive accuracy.
10. NLTK (Natural Language Toolkit): NLTK is a library for working with human language data. It's widely used for natural language processing (NLP) tasks, including text classification and language understanding.
Installing and Managing Python Packages for Data Science:
To install and manage Python packages for data science, you can use package management tools like pip and conda. Here's how to do it:
1. Using pip:
- To install a package, use the following command: pip install package-name.
- For example, to install NumPy, you can run: pip install numpy.
- You can specify the version by adding == and the version number: pip install numpy==1.21.3.
- To upgrade a package, use: pip install --upgrade package-name.
- To uninstall a package, use: pip uninstall package-name.
2. Using conda (if you're using the Anaconda distribution):
- To install a package, use the following command: conda install package-name.
- For example, to install Pandas, you can run: conda install pandas.
- You can specify the version in the same way as with pip.
- To upgrade a package, use: conda update package-name.
- To uninstall a package, use: conda remove package-name.
Q.6. (b) How can you perform data cleaning and pre-processing in Python using Pandas and NumPy? (7.5)
Ans. Performing data cleaning and preprocessing in Python using Pandas and NumPy is a common and essential task in data science and analysis. Here are some common data cleaning and preprocessing steps using these libraries:
1. Import the Libraries: First, import the Pandas and NumPy libraries to work with data and arrays. You can do this as follows:
import pandas as pd
import numpy as np
2. Load the Data: Load your dataset into a Pandas DataFrame. You can read data from various file formats (e.g., CSV, Excel, SQL databases) using Pandas' read_csv(), read_excel(), or other methods.
4. Seaborn: Seaborn is built on top of Matplotlib and is designed for creating #Example for loading aCSV 6le
attractive and informative statistical graphics. It's commonly used for data visualization df = pd.read_csv'your_data.csv)
in a statistical context. part of
6. Scikit-Learn: Scikit-Learn is a machine learning library that provides a 3. Handling Missing Values: Dealing with missing values is a crucial
missing data.
data cleaning. You can use Pandas and NumPy to identify and handle
wide range of tools for tasks like classifcation, regression, clustering, dimensionality Here are some common operations:
reduction, and model evaluation.It's a go-to choice for building and deploying machine
learning models. "Identify missing values: df.isnull() or df.isna()
6. TensorFlow: TensorFlow is an open-source deep learning library developed by "Count missing values: df.isnull().sum()
Google. It's particularly popular for deep neural network development and is used fora "Remove rows with missing values: df.dropna()
variety of machine learning tasks, including image and speech recognition. "Fill missing values: df.illna(value)
(AIMIJADS) Akaeh Bss
Foundations of lData Science IP Univerity (8 Teeh They are ofen
Third Semester, pninta ronnerted by lines
24-2022
are signiicantly similar Asignificancollevel (aay %) Plote láne plots repreent data runtinuous varishles
value of data in more 2 Line over ime sr other
considers thatboth datasets is only accepted ifthe p
are similar and arethas, dtevisualize data trends are uned to represent rategonral data with rertar
gaiar
null hypothesissignifies boththe datasetimplemented
beset sotthatthe z-test that 3. Bar(Charts Bar charta
of different eateguriee
level Aguod other. The z-lest methodcan be
the signiticantdifferent using the bars.They are effective for tomparing the values
diatribtion ofa eingle variahle by
di ridng
signifirantly from each Histograma display the
in python. listograme ponta in earh bin
library called "statsmodels" datasets are nimite:
determine whether the two method is applicabl
4
the frequeney of data shonng
used to t inte bins and ahawing dintribution of data using quartile,
T-test Ttest is also
but the difference is that this
implenented u 6. Box Plots: Rox plots display the
different. It is the same as z-test
must be less than 30.The T-test can be iderntifying potential outliers
the median, and Heatmaps display a matrix of data using colurs to
repreent value
a smaller sample size which
lhbraries like numpy, pandas, and
seipy 6. Heatmaps visualizing carrelations or datamatrices
F-distribution It is uscd to determine if the two sampl They are often used for Matplotlib:
F lest. F test utilizes hypothesis is rejected if the Features of over every aspect of a
their variances. The null There is some significane Salient provides fine grained euntrol
data are equal based on comparing of data is cqual to one. 1. Customization:
Matpiotlib
element of the vizualizaion
ratioof the variances of two samplesof difference between the two samples which is not allowing you to customize almost every
ued data visualization
libraris in
level also to tolerate some amount using "scipy" library of python.
plot,
Used: It is one of the most widely
considered sign1ficant It is implemented 2. Widely community and extensive dorumentation
visualitatons, from
Python, with a large user
4. Correlation Matrix
that sh¡ws Matplotlib allows you to create a wide range of
drawa pattern ina dataset. It is a tablerelationship
The correlation matrix is used tovariables 3. Flexibility:
customized figures
correlation coefficients between the of a dataset. It depicts the basic plots to complex, Control: You can create and modify plots programmatically.
how the occurrenee of any data is 4. Programmatic
between different data and helps us to understand be utilized in linear regression or
publication-qual1ty figures
associated with the occurrence of other data. It can which is essential for ereating visualization lbraries and
of covariance. The correlation can be integrated with other
multiple regression models. Correlation is the function 5. Integration: Matplotlib
taking the ratio ofthecovariance of these interactive plots.
coeffcient of anytwo variables is caleulated by
deviation. It is usod to find the dependency toolkits for advanced and Seaborn:
variables and the produet of their standard Salient Features of data visualization It
Seaborn is designed for statisticalmakes it easy to create
between the two varnables.
1. Statistical Plotting: range of statistical plots and
Importance of Statistical Analysis of Data includes built-in support for a wide code.
Statistical analysis of data is important because it saves time and optimizes the
complex visualizations with minimal provides attractive and informative default colar
libraries are used to take every
problem It is carried out efficiently in python. Python small Seaborn
2. Attractive Defaults: appealing plots without
analysis of data. Python libraries can smartly handle issues like the scaling of
palettes and themes. This makes it easier to ereate visually
mathematical
data while analyring statistical properties. Python replaces a complex extensive customization. DataFrames,
expression with the functions that are present in its libraries. It is fast and provides seamlessly integrates with Pandas
accurate knowledge about data which can be used to process further for predictions or 3. Pandas Integration: Seaborn with structured data
simplifying the process of working comples
classifcations hke problems Statistical analys1s is important to good decisions on data offers a high-level interface for ereating
Satistical analysis of data helps us to aceess effective data only with good efficiency. It 4. High-Level Interface: Seaborn intricate plots like pair plots, facet grids, and
helps us to dende an optimal path for data accessing and processing visualizations. It simpliies the ereation of
Satistical analysis of data is the acquisition of knowledge about data in order to violin plots. a strong
is actively maintaincd and has changing
smplify the complex dala which can be further used for processing. The job is effectively 6. Active Development: Seaborn continues to evolve with the
done by diflerent librarices of python which effectively use for the analysis of data in less community, ensuring that it stays up-to-date and
tume The goal of data analysis is to optimize the complex data structure. It helps us to needs of data scientists and analysts strengths, and the choice between them
take aptimal decisions on dat. Both Matplotlib and Seaborn have their is highly lexible and customizable.
Matplotlib
Q.7. (b) What are some common data visualization techniques in Python depends on your specific requirements
informative statistical visualizations with less eflort.
using Matplotlib and Seaborn? What are salient features of Matplotlib and while Soaborn excels in producing
Seaborn? (7.6) UNIT- IV
you perform supervised
Ans. Data visualization techniques in Python are often implemented using Q.8. (a) What is machine learning and how can (7.5)
Matplotlib and Seaborn. Here are some common data visualization techniques along learn?
with the salient features of Matplot!lib and Seaborn: and unsupervised in python using scikit intelligence that foeuses on
Common Data Visualization Techniques: Ans. Machine Learning is asubfield of artificialcomputers to learn and make
the development of ulgorithms and models that enable
Machine learning is
1. Scatter Plots: Scatter plots are used to display individual data points as dota predictions or decisions without being explicitly programmed supervised learning and
an atwo-dimensional plane. Thoy nro useful for visualizing the relationships betwocn divided into several categories, with the two main types being
davo continuous variables.
unsupervised learning
(AIMIJAID8,
Foundations of Data Sience LP Unirraity f8Tehi Akash Bks
18-2022 Third Semester,
black,
data[column_name].plot(kind='hist', bins=20, edgecolor='black', figsize=(8, 6))
plt.title(f'Distribution of {column_name}')
plt.xlabel(column_name)
plt.ylabel('Frequency')
plt.show()
In the code above:
- data[column_name] extracts the specific numerical column you want to visualize.
- plot(kind='hist') creates a histogram of the data, where you can specify the number of bins (intervals) using the 'bins' parameter.
- edgecolor='black' adds black borders to the bars for better visibility.
- figsize sets the size of the plot.
- plt.title(), plt.xlabel(), and plt.ylabel() are used to add a title and labels to the plot.
- Finally, plt.show() displays the histogram.
You can adjust the number of bins and other parameters to suit the specific characteristics of your data and the level of detail you want in your visualization. Histograms are useful for understanding the distribution of numerical data, including its central tendency, spread, and shape. They can help identify patterns, anomalies, and potential outliers in the data.
Q.5. (b) Can you explain the process of data analysis in Python using libraries such as pandas and numpy? (7.5)
Ans. Certainly! Data analysis in Python is a multi-step process that typically involves using libraries like Pandas and NumPy to manipulate, explore, clean, and analyze data. Here's an overview of the key steps in the data analysis process using these libraries:
1. Importing Libraries:
First, you need to import the necessary libraries:
import pandas as pd
import numpy as np
2. Loading Data:
Load your dataset into a Pandas DataFrame. Pandas provides various functions for reading data from different file formats, such as CSV, Excel, SQL databases, etc. For example, to load a CSV file:
df = pd.read_csv('your_dataset.csv')
3. Data Exploration:
Explore the dataset to understand its structure and content:
- Basic Information: Get an overview of the DataFrame, including the number of rows, columns, and data types. print(df.info())
- Summary Statistics: Calculate summary statistics for numerical columns. print(df.describe())
- Column Names: Display the column names (features) of the DataFrame. print(df.columns)
- Data Types: Check the data types of each column. print(df.dtypes)
- Sample Data: View a sample of the data to get a sense of its contents. print(df.head())
4. Data Cleaning and Preprocessing:
Prepare the data for analysis by handling missing values, outliers, and other data quality issues. Pandas and NumPy provide functions for these tasks:
- Handling Missing Values:
# Remove rows with missing values
df.dropna(inplace=True)
# Fill missing values with a specific value
df['column_name'].fillna(value, inplace=True)
- Outlier Detection and Treatment:
# Detect outliers using z-scores
z_scores = np.abs((df['column_name'] - df['column_name'].mean()) / df['column_name'].std())
df_no_outliers = df[z_scores < threshold]
- Feature Engineering: Create new features or transform existing ones to improve analysis. df['new_feature'] = df['feature1'] + df['feature2']
- Data Scaling: Scale numerical features if needed (e.g., standardization or Min-Max scaling).
5. Data Analysis:
Perform data analysis tasks based on your project's objectives. This may include:
- Visualization: Use libraries like Matplotlib or Seaborn to create plots and charts for data exploration and visualization.
- Hypothesis Testing: Conduct statistical tests to validate hypotheses about the data.
- Machine Learning: If applicable, train machine learning models using libraries like scikit-learn. This can include model selection, training, evaluation, and hyperparameter tuning.
6. Data Visualization:
Create visualizations to convey insights and findings. Visualization libraries like Matplotlib, Seaborn, and Plotly can help you create various types of plots, such as histograms, scatter plots, bar charts, and heatmaps.
7. Reporting and Communication:
Present your findings and insights in a clear and understandable manner using reports, presentations, or dashboards. Libraries like Jupyter Notebooks are excellent for creating interactive reports.
8. Documentation and Sharing:
Document your analysis, code, and findings for future reference and sharing with team members or stakeholders.
9. Iterative Process:
Data analysis is often an iterative process. You may need to revisit previous steps, refine your analysis, or try different techniques as you gain more insights into the data.
Throughout the data analysis process, Pandas and NumPy provide powerful tools for data manipulation, cleaning, and analysis, while libraries like Matplotlib and Seaborn enhance data visualization capabilities. Effective data analysis in Python involves a combination of domain knowledge, data science techniques, and these libraries to derive meaningful insights from your dataset.
Q.5. (c) What are the different data types in Python and how do you declare a variable? (5)
Ans. In Python, variables do not require explicit declaration of data types like in some other programming languages. The data type of a variable is inferred based on the value assigned to it. However, Python does have several built-in data types. Here are some of the commonly used data types in Python:
Supervised Learning is a machine learning approach where the model is trained on a labeled dataset, which means that the input data is paired with the correct output or target. The goal is to learn a mapping from input to output, allowing the algorithm to make predictions on new, unseen data. Common supervised learning tasks include classification and regression.
Unsupervised Learning, on the other hand, involves training on unlabeled data. The algorithm's objective is to discover patterns, structures, or relationships within the data without specific guidance in the form of labeled outputs. Common unsupervised learning tasks include clustering and dimensionality reduction.
To perform supervised and unsupervised learning in Python using scikit-learn (sklearn), follow these general steps:
Supervised Learning with scikit-learn:
1. Import scikit-learn: First, import the scikit-learn library in your Python script or Jupyter Notebook.
import sklearn
2. Choose a Supervised Learning Algorithm: Select an appropriate algorithm based on your problem type (classification or regression). For classification tasks, you might use algorithms like Decision Trees, Random Forests, or Support Vector Machines. For regression tasks, you could use Linear Regression, Random Forest Regressor, or Gradient Boosting Regressor, among others.
3. Data Preparation: Prepare your labeled dataset. This includes splitting it into features (independent variables) and the target (dependent variable).
4. Create and Train the Model: Create an instance of the chosen algorithm, fit it to your training data using the .fit() method, and tune hyperparameters if necessary.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
5. Model Evaluation: Assess the model's performance using appropriate evaluation metrics like accuracy, precision, recall, F1-score, or mean squared error (for regression). Use the model to make predictions on the test data and compare them to the true values.
y_pred = model.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
Unsupervised Learning with scikit-learn:
1. Import scikit-learn: Import scikit-learn as described earlier.
2. Choose an Unsupervised Learning Algorithm: Select an algorithm based on your task. For clustering, you can use K-Means, DBSCAN, or hierarchical clustering. For dimensionality reduction, consider Principal Component Analysis (PCA) or t-Distributed Stochastic Neighbor Embedding (t-SNE).
3. Data Preparation: Prepare your unlabeled dataset. Make sure your data is properly preprocessed if necessary (e.g., scaling or encoding categorical variables).
4. Create and Train the Model: Create an instance of the chosen algorithm and fit it to your data using the .fit() method.
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(data)
5. Exploring the Results: For clustering, you can use the model to group data points and examine the cluster assignments. For dimensionality reduction, you can visualize the reduced data and identify patterns.
Scikit-learn provides a consistent and user-friendly API for various machine learning algorithms, with a unified interface for fitting, predicting, scoring accuracy, etc. It also offers tools for model selection, hyperparameter tuning, and evaluation. Additionally, scikit-learn's documentation is a valuable resource for understanding the usage of specific algorithms and the parameters they accept.
Q.8. (b) How can you evaluate the performance of a machine learning model in Python using metrics such as accuracy, precision, recall and F1-Score? (7.5)
Ans. You can evaluate the performance of a machine learning model in Python using metrics like accuracy, precision, recall, and F1-Score. Here are the formulas and the steps to calculate these metrics:
1. Accuracy:
Accuracy is a measure of the proportion of correct predictions out of all predictions made by the model.
Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
- TP (True Positives) represents the number of correctly predicted positive instances.
- TN (True Negatives) represents the number of correctly predicted negative instances.
- FP (False Positives) represents the number of negative instances incorrectly predicted as positive.
- FN (False Negatives) represents the number of positive instances incorrectly predicted as negative.
2. Precision:
Precision measures the ability of the model to correctly identify positive instances out of all instances predicted as positive.
Formula: Precision = TP / (TP + FP)
- TP (True Positives) represents the number of correctly predicted positive instances.
- FP (False Positives) represents the number of negative instances incorrectly predicted as positive.
3. Recall (Sensitivity or True Positive Rate):
Recall measures the ability of the model to correctly identify positive instances out of all actual positive instances.
Formula: Recall = TP / (TP + FN)
- TP (True Positives) represents the number of correctly predicted positive instances.
- FN (False Negatives) represents the number of positive instances incorrectly predicted as negative.
4. F1-Score:
The F1-Score is the harmonic mean of precision and recall, providing a balance between these two metrics.
Formula: F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
- Precision is the ability to correctly identify positive instances.
- Recall is the ability to correctly identify actual positive instances.
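To make these formulas concrete, consider a small worked example with hypothetical counts: suppose a classifier produces TP = 40, TN = 45, FP = 5 and FN = 10 on 100 test samples. Then Accuracy = (40 + 45) / 100 = 0.85, Precision = 40 / (40 + 5) ≈ 0.889, Recall = 40 / (40 + 10) = 0.80, and F1-Score = 2 * (0.889 * 0.80) / (0.889 + 0.80) ≈ 0.842.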


It's worth noting that K-NN has limitations, such as scalability issues with large datasets and the sparsity problem (when there are few overlapping interactions between users or items).
Bayesian inference: Bayesian inference can be used in building recommender systems, particularly in the context of probabilistic graphical models. Bayesian inference allows for the incorporation of prior knowledge and uncertainty into the recommendation process. One approach that utilizes Bayesian inference is Bayesian Personalized Ranking (BPR). BPR is a popular method for collaborative filtering in recommender systems. It is based on pairwise ranking and uses Bayesian principles to model the preferences of users. In BPR, a probabilistic model is constructed to estimate the likelihood of a user's preference for one item over another. The model incorporates prior knowledge about user-item interactions and learns the parameters that maximize the likelihood of observed rankings. The Bayesian framework enables the system to update its beliefs about user preferences as more data is observed.
Dimensionality reduction: Dimensionality reduction techniques are commonly used in building recommender systems to address the "curse of dimensionality" and improve the efficiency and performance of the recommendation process. These techniques aim to reduce the number of dimensions or features in the data while preserving important information or structure. There are two main ways dimensionality reduction is used in recommender systems.
Matrix factorization: Matrix factorization is a dimensionality reduction technique widely used in collaborative filtering-based recommender systems. It aims to decompose the user-item interaction matrix into lower-dimensional representations or latent factors. The idea is to represent users and items in a shared latent space where their preferences and characteristics are captured. This reduces the dimensionality of the original data and allows for efficient computation of recommendations.
Feature extraction: In content-based filtering approaches, dimensionality reduction techniques can be used to extract meaningful features from item attributes or content. For example, in text-based recommender systems, techniques like Latent Semantic Analysis (LSA) or Latent Dirichlet Allocation (LDA) can be used to reduce the dimensionality of the text data and extract latent topics or features.
Neural Networks: A neural network is a type of machine learning algorithm whose structure is inspired by the brain. It is composed of interconnected neurons that can learn to recognize patterns. Neural networks are often used for prediction tasks, like recommender systems.
SYLLABUS
Universal Human Values II
(AIML/AIDS/IOT-211)
Applicable from the Academic Session 2021-22 Onwards
UNIT I
Introduction to Value Education - Need, Basic Guidelines, Content and Process for Value Education, Self-Exploration, Natural Acceptance and Experiential Validation as the mechanism for Self-Exploration, Continuous Happiness and Prosperity - the Basic Human Aspirations, Right Understanding, Relationship and Physical Facilities - the basic requirements for fulfillment of aspirations of every human being with their correct priority, Understanding Happiness and Prosperity, Method to fulfill the above human aspirations: Understanding and living in harmony at various levels. [8 Hours]
UNIT II
Understanding Harmony in the Human Being: Co-existence of the human being as a sentient 'I' and the material 'Body', Understanding the needs of Self ('I') and 'Body' - happiness and physical facility, Understanding the Body as an instrument of 'I' ('I' being the doer, seer and enjoyer), Understanding the characteristics and activities of 'I' and harmony in 'I', Understanding the harmony of 'I' with the Body: Sanyam and Health, correct appraisal of Physical needs, meaning of Prosperity, Programs to ensure Sanyam and Health. [12 Hours]
UNIT III
Harmony in Human-Human Relationship: Understanding values in human-human relationship, meaning of Justice (Nine universal values in relationships) and program for its fulfillment to ensure Mutual Happiness, Trust and Respect as the foundational values of relationship, Understanding the meaning of Trust, Difference between Intention and Competence, Understanding the meaning of Respect, Difference between Respect and Differentiation, the other salient values in relationship, Understanding the harmony in the society (society being an extension of family), Resolution, Prosperity, Fearlessness (trust) and Co-existence as comprehensive Human Goals, Visualizing a universal harmonious order in society: Undivided Society, Universal order from family to world family. [12 Hours]
UNIT IV
Understanding Harmony in Nature. Interconnectedness: Self-regulation and Mutual Fulfillment among the Four Orders of Nature, Recyclability and Self-regulation in Nature, Realizing Existence as Co-existence at All Levels, The Holistic Perception of Harmony in Existence, Natural Acceptance of Human Values, Definitiveness of (Ethical) Human Conduct, A Basis for Humanistic Education, Humanistic Constitution and Universal Humanistic Order. [8 Hours]

To calculate these metrics using Python, you'll typically have the model's predictions (y_pred) and the true labels (y_true). You can use these formulas in your code as follows, assuming y_pred and y_true are your predicted and true labels, respectively:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
These metrics provide insights into how well your machine learning model is performing, particularly in classification tasks. Depending on the problem and your goals, you may prioritize different metrics. Accuracy is suitable for balanced datasets, while precision and recall are essential when dealing with imbalanced datasets. F1-Score provides a balance between precision and recall.
Q.9. (a) What is trend analysis and how can it be performed in Python using pandas and matplotlib? (7.5)
Ans. Trend analysis is a statistical and data analysis technique used to identify and visualize patterns or trends in data over time. It is commonly used in various fields, such as finance, economics, and data science, to uncover insights and make predictions based on historical data. In trend analysis, you examine data points collected at regular intervals, typically over time, to understand the underlying patterns, relationships, and tendencies within the dataset.
Performing trend analysis in Python using pandas and matplotlib involves the following steps:
1. Import necessary libraries: You'll need to import the pandas library for data manipulation and matplotlib for data visualization. Make sure you have these libraries installed.
import pandas as pd
import matplotlib.pyplot as plt
2. Load your data: Import your time series data into a pandas DataFrame. Your data should have at least two columns: one for the time (date) and another for the data values you want to analyze.
# Example data loading
data = pd.read_csv('your_data.csv')
3. Prepare the data: Ensure that the time column is in a datetime format. You can use the pd.to_datetime function for this:
data['date'] = pd.to_datetime(data['date'])
4. Set the time column as the DataFrame index: This step is optional but makes it easier to work with time series data.
data.set_index('date', inplace=True)
5. Visualize the data: Use matplotlib to create line plots or other visualizations to observe the trends in your data.
plt.figure(figsize=(12, 6))
plt.plot(data.index, data['value'], label='Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Trend Analysis')
plt.legend()
plt.grid()
plt.show()
6. Calculate and visualize trends: Depending on your analysis goals, you may need to apply various techniques to calculate and visualize trends. Here are some common approaches:
(a) Moving Averages: Calculate and plot moving averages to smooth out noise in the data and identify long-term trends.
# Calculate a simple moving average (SMA)
window = 30  # Adjust the window size as needed
data['SMA'] = data['value'].rolling(window=window).mean()
plt.figure(figsize=(12, 6))
plt.plot(data.index, data['value'], label='Data')
plt.plot(data.index, data['SMA'], label=f'{window}-Day SMA', linestyle='--', color='red')
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Trend Analysis with SMA')
plt.legend()
plt.grid()
plt.show()
(b) Linear Regression: Perform linear regression to identify linear trends in the data.
from sklearn.linear_model import LinearRegression
X = (data.index - data.index[0]).days.values.reshape(-1, 1)
y = data['value']
model = LinearRegression()
model.fit(X, y)
trend_line = model.predict(X)
plt.figure(figsize=(12, 6))
plt.plot(data.index, data['value'], label='Data')
plt.plot(data.index, trend_line, label='Trend Line', linestyle='--', color='red')
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Trend Analysis with Linear Regression')
plt.legend()
plt.grid()
plt.show()
These are just a few examples of how to perform trend analysis in Python using pandas and matplotlib. Depending on your specific data and analysis goals, you may need to explore other techniques, such as exponential smoothing, time series decomposition, or more advanced regression models, to capture and visualize trends effectively.
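As a brief illustration of the exponential smoothing mentioned above (a sketch only, assuming the same data DataFrame with a 'value' column):
data['EMA'] = data['value'].ewm(span=30, adjust=False).mean()  # exponentially weighted moving average
plt.plot(data.index, data['value'], label='Data')
plt.plot(data.index, data['EMA'], label='30-period EMA', linestyle='--')
plt.legend()
plt.show()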
Q.9. (b) What is predictive mining and how can it be performed on a dataset using scikit-learn and other machine learning libraries? (7.5)
Ans. Predictive Mining: The main goal of this type of mining is to say something about future results, not current behaviour. It uses supervised learning to predict a target value, and the methods that come under this category are called classification, time-series analysis and regression. Modelling of data is the necessity of predictive analysis, and it works by utilizing a few variables of the present to predict the future, unknown values of other variables.
Examples of predictive data mining include regression analysis, decision trees, and neural networks. Regression analysis involves predicting a continuous target variable based on one or more predictor variables. Decision trees involve building a tree-like model to make predictions based on a set of rules. Neural networks involve building a model inspired by the structure of the human brain to make predictions.
The latest version of Scikit-learn is 1.1 and it requires Python 3.8 or newer.
Scikit-learn requires:
- NumPy
- SciPy as its dependencies.
Before installing scikit-learn, ensure that you have NumPy and SciPy installed. Once you have a working installation of NumPy and SciPy, the easiest way to install scikit-learn is using pip:
pip install -U scikit-learn
Let us get started with the modeling process now.
Step 1: Load a dataset
A dataset is nothing but a collection of data. A dataset generally has two main components:
- Features: (also known as predictors, inputs, or attributes) they are simply the variables of our data. They can be more than one and hence are represented by a feature matrix (X is a common notation to represent the feature matrix). A list of all the feature names is termed feature names.
- Response: (also known as the target, label, or output) This is the output variable depending on the feature variables. We generally have a single response column and it is represented by a response vector (y is a common notation to represent the response vector). All the possible values taken by a response vector are termed target names.
Loading exemplar dataset: scikit-learn comes loaded with a few example datasets like the iris and digits datasets for classification and the boston house prices dataset for regression.
Loading external dataset: Now, consider the case when we want to load an external dataset. For this purpose, we can use the pandas library for easily loading and manipulating datasets.
To install pandas, use the following pip command:
pip install pandas
In pandas, the important data types are:
Series: Series is a one-dimensional labeled array capable of holding any data type.
DataFrame: It is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.
Step 2: Splitting the dataset
One important aspect of all machine learning models is to determine their accuracy. One way to determine accuracy is to train the model on the given dataset and then predict the response values for the same dataset using that model and hence find the accuracy of the model. But this method has several flaws in it, like:
- The goal is to estimate the likely performance of a model on out-of-sample data.
- Maximizing training accuracy rewards overly complex models that won't necessarily generalize.
- Unnecessarily complex models may over-fit the training data.
A better option is to split our data into two parts: the first one for training our machine learning model, and the second one for testing our model.
Step 3: Training the model
Now, it's time to train some prediction models using our dataset. Scikit-learn provides a wide range of machine learning algorithms that have a unified, consistent interface for fitting, predicting, scoring accuracy, etc.
Q.9. (c) What are the different types of recommendation algorithms used in recommender systems? (5)
Ans. A recommender system, or a recommendation system, is a subclass of information filtering system that seeks to predict the "rating" or "preference" a user would give to an item. Recommender systems are utilized in a variety of areas, with commonly recognized examples taking the form of playlist generators for video and music services, product recommenders for online stores, etc. These systems can operate using a single input, like music, or multiple inputs within and across platforms like news, books, and search queries. Recommender systems are utilized in order to make better product suggestions to customers, or personalized recommendations to friends.
Recommender systems leverage machine learning algorithms in order to make better predictions about a user's preferences. There are a number of different machine learning algorithms that can be used in a recommender system. Each algorithm has its own strengths and weaknesses, and the best algorithm for a particular application will depend on the nature of the data. The most common is the linear regression algorithm, which is used to find the best linear approximation to a data set; in a recommender system, this algorithm is used to predict how a user will rate an item based on their past ratings. Other machine learning algorithms that can be used in recommender systems include the following:
K-Nearest Neighbor (K-NN): The key idea behind using K-NN in recommender systems is to find the nearest neighbors (similar users or items) based on historical data and use their preferences or interactions to make recommendations. In user-based collaborative filtering, K-NN can be used to find the k most similar users to a target user based on their historical interactions with items. The similarity can be measured using various distance metrics, such as cosine similarity or the Pearson correlation coefficient. Once the similar users are identified, the system can recommend items that these users have liked or interacted with. In item-based collaborative filtering, K-NN can be used to find the k most similar items to a target item based on user preferences. The similarity can be calculated using techniques like cosine similarity or adjusted cosine similarity. Once similar items are identified, the system can recommend these items to users who have shown interest in the target item.
