
Dayananda Sagar Academy of Technology & Management

(Autonomous Institute under VTU)

Semester : 3rd Semester
Course Title : Data Science for Engineers
Course Code : 23CSDS33
Course Type (Theory/Practical/Project/Integrated) : Integrated
Category : IPCC
Stream : CSE-DS
Teaching hours/week (L:T:P:S) : 3-0-2-0
Total Hours : 40 hours Theory + 20 hours Practical
Credits : 4
CIE : 50
SEE : 50
SEE Duration : 3 hours

Course Learning Objectives: Students will be able to:


Sl. No.  Course Objectives
1  To provide a foundation in data science terminologies, fundamentals, processes, and the tools available for data science and data analytics; to define big data and its key characteristics (volume, variety, velocity) and understand the challenges of processing big data with traditional methods; and to gain proficiency in Apache Spark for distributed data processing.
2  To describe the data for the data science process; to become familiar with the data science process and its steps; to study the usage of various data sources; to develop ETL pipelines for data preparation using Spark on Databricks; and to apply statistical concepts to summarize and analyze data, understand hypothesis testing and perform statistical inference.
3  To describe the relationships within data; to demonstrate data visualization tools; to learn data extraction from various data sources; to create informative data visualizations using Python libraries; and to identify relationships and patterns within datasets through EDA techniques.
4  To analyze the applicability of data science in real-time applications; to work with various data analytics charts; and to grasp the fundamental principles of supervised and unsupervised machine learning algorithms.
5  To utilize Python libraries for data wrangling; to understand the associated calculations and best practices; and to present and interpret data using the visualization libraries in Python used for data science.

Scheme of Teaching and Examinations for BE Programme -2024-25


Outcome Based Education and Choice Based Credit System (CBCS)
(Effective from the Academic Year 2024-25)

DSATM

COURSE CURRICULUM
Module 1 (8 Hours)
INTRODUCTION: Introduction to Data Science: Evolution of Data Science, Data Science Roles, Life cycle
of Data Science, Representation of Data Science as a Venn Diagram, Technologies revolving around Data
Science, What is Data Science? Big Data and Data Science hype - and getting past the hype, Why now? -
Datafication, Current landscape of perspectives, Skill sets needed. Statistical Inference: Populations and
samples, statistical modelling, probability distributions, fitting a model. Types of Data: Structured and
unstructured data, supervised and unsupervised data, qualitative and quantitative data, four levels of
data (Nominal, Ordinal, Interval, Ratio). Data Science: Benefits and uses - facets of data - Data
Science Process: Overview - defining research goals - retrieving data - data preparation - exploratory
data analysis - building the model - presenting findings and building applications - Data Mining - Data
Warehousing - basic statistical descriptions of data. DESCRIBING DATA: Types of Data - Types of Variables -
Describing Data with Tables and Graphs - Describing Data with Averages - Describing Variability - Normal
Distributions and Standard (z) Scores. THE DATA SCIENCE PROCESS: Overview of the data science process -
defining research goals and creating a project charter, retrieving data, cleansing, integrating and transforming data,
exploratory data analysis, building the models, presenting findings and building applications on top of them. Data
Science Workflow: Build on the foundational knowledge from the first section by diving into the specific steps
involved in a data science project, including problem formulation, data collection, and preprocessing. Data Lakes and
Data Meshes, avoiding data leaks. The Data Ecosystem: the different types of data structures, file formats, sources
of data, and the languages data professionals use in their day-to-day tasks; various types of data repositories such as
Databases, Data Warehouses, Data Marts, Data Lakes, and Data Pipelines; the Extract, Transform, and Load (ETL)
process, which is used to extract, transform, and load data into data repositories; a basic understanding of Big Data
and Big Data processing tools such as Hadoop, Hadoop Distributed File System (HDFS), Hive, and Spark.

Big Data Fundamentals: Definition and characteristics of big data (volume, variety, velocity), impact of big data on
different industries, challenges of processing big data with traditional methods. (This helps students grasp the scope
and challenges associated with handling large and diverse datasets.) Apache Spark - A Distributed Processing
Engine: Introduction to Spark and its distributed nature, components of the Spark ecosystem (Spark Core, Spark
SQL, Spark Streaming), benefits of using Spark for data processing. Building ETL Pipelines with Spark on
Databricks: Introduction to the ETL process (Extract, Transform, Load), setting up a Databricks workspace (free tier
available), connecting to data sources and data ingestion techniques in Spark, data transformation and manipulation
using Spark DataFrames/Datasets. (Spark is introduced as a tool to handle big data challenges; students learn about
distributed data processing, which is crucial for managing large datasets efficiently.)
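For illustration, a minimal PySpark ETL sketch of the kind this module targets; the file paths and column names are hypothetical placeholders, and on Databricks a SparkSession named spark is already available.

# Minimal ETL sketch with PySpark (illustrative paths and column names)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-demo").getOrCreate()  # on Databricks, `spark` already exists

# Extract: read a CSV of transactions (hypothetical path)
raw = spark.read.csv("/data/transactions.csv", header=True, inferSchema=True)

# Transform: drop rows with a missing amount, derive a column, aggregate per customer
clean = (raw
         .dropna(subset=["amount"])
         .withColumn("amount_rounded", F.round(F.col("amount"), 2))
         .groupBy("customer_id")
         .agg(F.sum("amount_rounded").alias("total_spent")))

# Load: write the result back to storage as Parquet
clean.write.mode("overwrite").parquet("/data/out/customer_totals")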
Pedagogy
 Group activity.
 Interactive Lectures: Combine traditional lectures with online resources for a
comprehensive understanding, using prompt engineering, as it enhances
problem-solving abilities, boosts efficiency in interacting with AI systems,
improves communication with advanced technologies, fosters innovation, and
ensures relevance in a rapidly evolving job market.
 Hands-on Exercises: Practical exercises applying THE DATA SCIENCE PROCESS
and understanding the Data Science Workflow using Apache Spark, and building
ETL pipelines with Spark on Databricks.
 Support the students in subscribing to the free student membership on
IEEE DataPort to access data and work on data.
 Through this pedagogy, have students summarize testable predictions for
real-time data; data can be accessed from Kaggle datasets, open-source
GitHub repositories, IEEE DataPort and other data repositories, so that
students understand exactly how data is processed.

Module 2 (8 Hours)
PREPARING AND GATHERING DATA AND KNOWLEDGE: Philosophies of data science - Data science in a
big data world - Benefits and uses of data science and big data - Facets of data: Structured data,
Unstructured data, Natural language, Machine-generated data, Audio, Image and video streaming data -
The Big Data ecosystem: Distributed file systems, Distributed programming frameworks, Data integration
frameworks, Machine learning frameworks, NoSQL databases, Scheduling tools, Benchmarking tools,
System deployment, Service programming and Security. Statistical Analysis: Introduce statistical
concepts to analyze and interpret data, including summarizing data, performing hypothesis testing, and
making statistical inferences, which are essential for drawing meaningful conclusions from data.

Deep Dive into ETL with Spark: Data Ingestion and Cleaning: Techniques for handling various data
formats (text files, CSV, JSON); addressing common data quality issues (missing values,
inconsistencies). Data Transformation with Spark Functions: Working with Spark
DataFrames/Datasets and applying transformation functions; filtering, aggregating, and manipulating data for
analysis; joining datasets for comprehensive analysis. Data Quality Checks and Missing Value
Handling: Implementing data quality checks to identify errors and inconsistencies; techniques for
handling missing values (imputation, deletion); ensuring data integrity for reliable analysis. Introduction
to Apache Spark SQL: Declarative data querying with Spark SQL using SQL-like syntax; integrating
Spark SQL with Spark DataFrames/Datasets; performing complex queries on large datasets efficiently.
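A brief PySpark sketch of the topics above (transformation functions, joins, data-quality checks, missing-value handling and Spark SQL); the source files and column names are assumed purely for illustration.

# Sketch: Spark transformations, data-quality checks and Spark SQL (illustrative columns)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-deep-dive").getOrCreate()
events = spark.read.json("/data/events.json")                      # hypothetical JSON source
users = spark.read.csv("/data/users.csv", header=True, inferSchema=True)

# Data-quality check: count nulls in every column
events.select([F.sum(F.col(c).isNull().cast("int")).alias(c) for c in events.columns]).show()

# Missing-value handling: impute a numeric column, drop rows missing the join key
events = events.fillna({"duration": 0.0}).dropna(subset=["user_id"])

# Filtering, joining and aggregating with DataFrame functions
summary = (events.filter(F.col("duration") > 0)
                 .join(users, on="user_id", how="inner")
                 .groupBy("country")
                 .agg(F.avg("duration").alias("avg_duration")))
summary.show()

# The same query expressed declaratively with Spark SQL
events.createOrReplaceTempView("events")
users.createOrReplaceTempView("users")
spark.sql("""
    SELECT u.country, AVG(e.duration) AS avg_duration
    FROM events e JOIN users u ON e.user_id = u.user_id
    WHERE e.duration > 0
    GROUP BY u.country
""").show()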
Blended Learning and Practical Demo: Using text, CSV and JSON files, demonstrate Data Transformation
with Spark functions and Data Quality Checks and Missing Value Handling using Apache Spark SQL;
Hadoop (Hadoop Distributed File System, MapReduce, YARN); Cassandra; High-Performance Computing
Cluster (HPCC); Apache Storm; Scalable Advanced Massive Online Analysis (SAMOA); Write Once Run
Anywhere (WORA); Atlas.ti; Stats iQ; CouchDB tools; and other data tools (Debisian tools).
Pedagogy
 Live Demonstrations: Real-time examples in a controlled environment.
 Simulations: Use of software tools to simulate data-processing workflows using Apache Spark SQL.
 Blended Learning: Data collection from Kaggle and other repositories, and performing case studies.
 Promote the students to take certification courses on the NASSCOM FutureSkills portal and other
free certification portals, and propose department- or institution-level MoUs with upGrad, Udemy,
Unacademy and Scaler for advanced-level courses that help students enhance their skills.

Module 3 (8 Hours)
Feature Generation and Feature Selection - Extracting Meaning from Data: Motivating application: user (customer)
retention. Feature Generation (brainstorming, role of domain expertise, and place for imagination), Feature Selection
algorithms: Filters, Wrappers, Decision Trees, Random Forests. Recommendation Systems: Building a user-facing
data product, algorithmic ingredients of a recommendation engine, Dimensionality Reduction, Singular Value
Decomposition, Principal Component Analysis, Exercise: build your own recommendation system. Data Science
Workflow - CRISP-DM. Descriptive Statistics: Introduction to statistics - summarizing data using
central tendency (mean, median, mode) and dispersion (variance, standard deviation) measures, exploring
data distribution and skewness. Probability Concepts and Random Variables: Understanding basic
probability concepts and random variables, different probability distributions (normal, binomial, Poisson)
and their applications in data analysis. Hypothesis Testing: Formulating null and alternative hypotheses,
calculating p-values and interpreting statistical significance, making data-driven inferences based on
hypothesis testing. Correlation and Regression Analysis: Measuring the strength and direction of
relationships between variables, understanding linear regression analysis and its applications in data
modeling. Visualizing Data: matplotlib, Bar Charts, Line Charts, Scatterplots. Linear Algebra: Vectors, Matrices.
Statistics: Describing a Single Set of Data, Correlation, Simpson's Paradox, Some Other Correlational Caveats,
Correlation and Causation. Probability: Dependence and Independence, Conditional Probability, Bayes's Theorem,
Random Variables, Continuous Distributions, The Normal Distribution, The Central Limit Theorem.
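As a small worked example of the correlation and regression topics above, the following sketch uses synthetic data; the numbers are illustrative only.

# Sketch: correlation and simple linear regression on synthetic data
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.5 * x + rng.normal(scale=0.8, size=200)        # a linear relationship plus noise

r, p = stats.pearsonr(x, y)                           # strength and direction of the relationship
print(f"Pearson r = {r:.3f}, p-value = {p:.3g}")

fit = stats.linregress(x, y)                          # least-squares line y = slope*x + intercept
print(f"slope = {fit.slope:.3f}, intercept = {fit.intercept:.3f}, R^2 = {fit.rvalue**2:.3f}")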

Exploratory Data Analysis and the Data Science Process: Basic tools (plots, graphs and summary statistics) of
EDA, philosophy of EDA, the Data Science Process, Case Study: RealDirect (online real estate firm). Three Basic
Machine Learning Algorithms: Linear Regression, k-Nearest Neighbours (k-NN), k-means. Introduction to
Exploratory Data Analysis (EDA): Importance of EDA in data science projects, different stages of the
EDA process, techniques for gaining insights from data. Univariate and Multivariate Data Analysis:
Analyzing data distribution (histograms, boxplots) for individual variables; visualizing relationships
between pairs of variables (scatter plots); exploring relationships among multiple variables using
techniques like correlation matrices.
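A short sketch of univariate and multivariate EDA; it uses seaborn's bundled iris sample purely as a convenient stand-in dataset.

# Sketch: univariate and multivariate EDA on a sample dataset
import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")                       # small built-in sample dataset

iris["sepal_length"].hist(bins=20)                    # univariate: distribution of one variable
plt.title("Sepal length distribution")
plt.show()

sns.boxplot(data=iris, x="species", y="petal_length") # distribution per group
plt.show()

corr = iris.drop(columns="species").corr()            # multivariate: pairwise correlations
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()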

Poster Presentation: Feature Generation and Feature Selection, and the usage of PYTHON
LIBRARIES FOR DATA WRANGLING; introduce the concept of Lambda and Gaba. Poster
presentation allows students to represent the concepts visually in order to understand the topics easily,
from the start of the data to where it is used. Adopt a block diagram giving the flagship road map from
start to end, showing how the concepts are used in the real world.
Pedagogy
 Visual Learning: Students create posters illustrating pipeline stages and
performance factors, using Canva, Power BI and Tableau tools for data
visualization.
 Peer Review: Encourages collaboration and critical evaluation of concepts.

Module 4 (8 Hours)
PYTHON LIBRARIES FOR DATA WRANGLING: Basics of NumPy arrays - aggregations - computations on arrays -
comparisons, masks, Boolean logic - fancy indexing - structured arrays - Data manipulation with Pandas - data
indexing and selection - operating on data - missing data - hierarchical indexing - combining datasets -
aggregation and grouping - pivot tables. Applications of machine learning in data science - tools used in machine
learning - Modeling process - training a model - validating a model - predicting new observations - types of machine
learning algorithms: supervised learning algorithms, unsupervised learning algorithms. Data Visualization and Data
Exploration - Introduction: Data Visualization, Importance of Data Visualization, Data Wrangling, Tools and Libraries
for Visualization. Comparison Plots: Line Chart, Bar Chart and Radar Chart; Relation Plots: Scatter Plot, Bubble Plot,
Correlogram and Heatmap; Composition Plots: Pie Chart, Stacked Bar Chart, Stacked Area Chart, Venn Diagram;
Distribution Plots: Histogram, Density Plot, Box Plot, Violin Plot; Geo Plots: Dot Map, Choropleth Map, Connection
Map; What Makes a Good Visualization? Visualization: Introduction to data visualization - data visualization options -
Filters - MapReduce - Dashboard development tools. Decision Trees: What Is a Decision Tree?, Entropy, The
Entropy of a Partition, Creating a Decision Tree, Putting It All Together, Random Forests. Neural Networks:
Perceptrons, Feed-Forward Neural Networks, Backpropagation, Example: Fizz Buzz. Deep Learning: The Tensor,
The Layer Abstraction, The Linear Layer, Neural Networks as a Sequence of Layers, Loss and Optimization,
Example: XOR Revisited, Other Activation Functions, Example: Fizz Buzz Revisited, Softmaxes and Cross-Entropy,
Dropout, Example: MNIST, Saving and Loading Models. Clustering: The Idea, The Model, Example: Meetups,
Choosing k, Example: Clustering Colors, Bottom-Up Hierarchical Clustering.
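A compact sketch of the NumPy and Pandas wrangling operations listed above, on toy data:

# Sketch: NumPy masks/fancy indexing and Pandas grouping/pivot tables (toy data)
import numpy as np
import pandas as pd

a = np.arange(12).reshape(3, 4)
print(a[a % 2 == 0])                                  # Boolean mask
print(a[[0, 2], :])                                   # fancy indexing: rows 0 and 2
print(a.sum(axis=0))                                  # aggregation along columns

df = pd.DataFrame({
    "city":  ["A", "A", "B", "B", "B"],
    "year":  [2023, 2024, 2023, 2024, 2024],
    "sales": [10, 12, 7, None, 9],
})
df["sales"] = df["sales"].fillna(df["sales"].mean())               # missing data
print(df.groupby("city")["sales"].agg(["mean", "sum"]))            # aggregation and grouping
print(df.pivot_table(values="sales", index="city", columns="year", aggfunc="sum"))  # pivot table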

Data Visualization with Matplotlib and Seaborn : Creating various data visualizations (bar charts, line
charts, heatmaps) using Matplotlib. Leveraging Seaborn, a high-level library built on top of Matplotlib, for
advanced and aesthetically pleasing visualizations. Customizing visualizations to effectively communicate
insights. Feature Engineering for Machine Learning: Identifying and selecting relevant features for
machine learning models. Techniques for creating new features from existing data (feature scaling,
encoding categorical variables). Understanding the impact of feature engineering on model performance.
Machine Learning Algorithms for data science: Supervised Learning Algorithms: Introduction to
supervised learning and its goal of predicting a target variable. Common supervised learning algorithms:
Linear Regression: Predicting continuous target variables using a linear relationship, K-Nearest Neighbors
(KNN): Predicting a target variable based on the k nearest neighbors, Decision Trees: Making
classification decisions based on a tree-like structure, Understanding the concept of model evaluation
metrics (accuracy, precision, recall). Implementing basic supervised learning algorithms using Python
libraries (Scikit-learn). Unsupervised Learning Algorithms: Introduction to unsupervised learning and
its goal of uncovering hidden patterns, Common unsupervised learning algorithms: K-Means Clustering:
Grouping data points into clusters based on their similarity, Principal Component Analysis (PCA):
Reducing data dimensionality while preserving key information, Understanding the applications of
unsupervised learning in data exploration and dimensionality reduction, Implementing basic unsupervised
learning algorithms using Python libraries (Scikit-learn). Model Development: Simple and Multiple Regression
– Model Evaluation using Visualization – Residual Plot – Distribution Plot – Polynomial Regression and Pipelines –
Measures for In-sample Evaluation – Prediction and Decision Making.
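The feature-engineering and model-evaluation ideas above can be sketched with scikit-learn as follows; the tiny customer table and column names are made up for illustration.

# Sketch: feature scaling, categorical encoding and basic model evaluation (toy data)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

df = pd.DataFrame({
    "age":    [22, 35, 58, 41, 30, 50, 27, 44],
    "city":   ["A", "B", "A", "B", "A", "B", "A", "B"],
    "bought": [0, 1, 1, 1, 0, 1, 0, 1],
})
X, y = df[["age", "city"]], df["bought"]

pre = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),                          # feature scaling
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),   # encoding a categorical variable
])
model = Pipeline([("pre", pre), ("clf", DecisionTreeClassifier(random_state=0))])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)
print(accuracy_score(y_test, pred),
      precision_score(y_test, pred, zero_division=0),
      recall_score(y_test, pred, zero_division=0))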

Group Discussion and Demonstration: Exhibits the implementation process; introduce project work
during the 3rd semester instead of at the end of the semester, using Apache Iceberg and other
open-source tools, Tableau, Weka and ETL tools, for Data Visualization with Matplotlib and Seaborn
and Data Exploration using PySpark.
Pedagogy
 Collaborative Learning: Group discussions on Data Exploration using PySpark.
 Debates: Structured debates to foster deeper understanding and analytical
thinking on Exploratory Data Analysis and the Data Science Process.

Module 5 (8 Hours)
Modeling: What Is Machine Learning?, Overfitting and Underfitting, Correctness, The Bias-Variance
Tradeoff, Feature Extraction and Selection. k-Nearest Neighbors: The Model, Example: The Iris Dataset,
The Curse of Dimensionality. Naive Bayes: A Really Dumb Spam Filter, A More Sophisticated Spam
Filter, Implementation, Testing Our Model, Using Our Model. Simple Linear Regression: The Model,
Using Gradient Descent, Maximum Likelihood Estimation. Multiple Regression: The Model, Further
Assumptions of the Least Squares Model, Fitting the Model, Interpreting the Model, Goodness of Fit,
Digression: The Bootstrap, Standard Errors of Regression Coefficients, Regularization. Logistic
Regression: The Problem, The Logistic Function, Applying the Model, Goodness of Fit. Support Vector
Machines.
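As a worked illustration of one topic above - fitting simple linear regression with gradient descent - here is a from-scratch sketch on synthetic data:

# Sketch: simple linear regression fitted by batch gradient descent (synthetic data)
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 5.0 + rng.normal(scale=1.0, size=100)   # true slope 3, intercept 5

w, b, lr = 0.0, 0.0, 0.01                              # parameters and learning rate
for _ in range(2000):
    error = (w * x + b) - y
    w -= lr * (2 / len(x)) * np.dot(error, x)          # gradient of MSE w.r.t. w
    b -= lr * (2 / len(x)) * error.sum()               # gradient of MSE w.r.t. b

print(f"fitted slope {w:.2f}, intercept {b:.2f}")      # should be close to 3 and 5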

A Deep Dive into Matplotlib and Case studies: Introduction, Overview of Plots in Matplotlib, Pyplot Basics:
Creating Figures, Closing Figures, Format Strings, Plotting, Plotting Using pandas DataFrames, Displaying Figures,
Saving Figures; Basic Text and Legend Functions: Labels, Titles, Text, Annotations, Legends; Basic Plots: Bar Chart,
Pie Chart, Stacked Bar Chart, Stacked Area Chart, Histogram, Box Plot, Scatter Plot, Bubble Plot; Layouts: Subplots,
Tight Layout, Radar Charts, GridSpec; Images: Basic Image Operations, Writing Mathematical Expressions.
Importing Matplotlib – Line plots – Scatter plots – visualizing errors – density and contour plots – Histograms –
legends – colors – subplots – text and annotation – customization – three-dimensional plotting - Geographic Data with
Basemap - Visualization with Seaborn.
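A brief Matplotlib sketch touching several of the items above (subplots, annotations, a mathematical expression in a title, and saving a figure), using synthetic data:

# Sketch: subplots, annotation, math text and saving with Matplotlib
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 200)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, np.sin(x), label="sin(x)")
ax1.set_title("Line plot")
ax1.legend()
ax1.annotate("peak", xy=(np.pi / 2, 1.0), xytext=(2.0, 0.5),
             arrowprops=dict(arrowstyle="->"))

ax2.hist(np.random.default_rng(0).normal(size=500), bins=30)
ax2.set_title(r"Histogram of $\mathcal{N}(0,1)$ samples")   # writing a mathematical expression

fig.tight_layout()
fig.savefig("demo.png", dpi=150)                            # saving the figure
plt.show()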

Pedagogy
Case Study: Map different data domains in real-time applications, integrating AI and data using the Data
Ecosystem.
 Research Projects: Students conduct research on different data sets and present their findings in
groups, using Hugging Face models (thousands of algorithms).
 Comparative Analysis: Analysis of different data-set models designed using various algorithms to
understand their strengths and applications; explore the ethics of AI and data usage, including
hyperscalers and the concepts of data security.
List of Programs:

Sl. No. Experiments/Programs COs


1 A. Big Data Processing with Apache Spark: Objective: Understand the fundamentals of big data
processing using Apache Spark. (CO3,4,5)

Tasks: i. Set up a Databricks workspace: Create a free Databricks account. Set up a new workspace and
cluster.

ii. Data Ingestion: Load a large dataset (e.g., a CSV file containing transaction data) into
Databricks.
• Basic Data Exploration: Use Spark DataFrames to explore the dataset, and perform
basic operations like filtering, grouping, and aggregating data.
• ETL Pipeline: Build an ETL pipeline to clean and transform the data. Save the
transformed data back to a storage system (e.g., DBFS).

iii. Demonstrate the installation of Python/R/Go and the Visual Studio Code editor, along with
usage of Kaggle datasets.
iv. Write programs in Python/R and Execute them in either Visual Studio Code or PyCharm Community
Edition or any other suitable environment.
v. A study was conducted to understand the effect of the number of hours students spent studying on
their performance in the final exams. Write code to plot a line chart with the number of hours spent
studying on the x-axis and the score in the final exam on the y-axis. Use a red '*' as the point
character, label the axes and give the plot a title.
Number of hrs spent studying (x):    10  9   2   15  10  16  11  16
Score in the final exam, 0-100 (y):  95  80  10  50  45  98  38  93

vi. For the given dataset mtcars.csv (www.kaggle.com/ruiromanini/mtcars), plot a histogram to check the
frequency distribution of the variable 'mpg' (miles per gallon).
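One possible solution sketch for tasks v and vi; it assumes mtcars.csv has been downloaded locally from the Kaggle link.

# Sketch for tasks v and vi: study-hours line chart and mpg histogram
import pandas as pd
import matplotlib.pyplot as plt

hours = [10, 9, 2, 15, 10, 16, 11, 16]
score = [95, 80, 10, 50, 45, 98, 38, 93]

plt.plot(hours, score, "r*-")                   # red '*' point character on a line
plt.xlabel("Number of hours spent studying")
plt.ylabel("Score in the final exam (0-100)")
plt.title("Study hours vs final exam score")
plt.show()

mtcars = pd.read_csv("mtcars.csv")              # assumes the Kaggle file is in the working directory
mtcars["mpg"].hist(bins=10)
plt.xlabel("Miles per gallon (mpg)")
plt.title("Frequency distribution of mpg")
plt.show()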

2 a. Consider the books dataset BL-Flickr-Images-Book.csv from Kaggle


(https://www.kaggle.com/adeyoyintemidayo/publication-of-books) which contains information about
books. Write a program to demonstrate the following.
 Import the data into a DataFrame
 Find and drop the columns which are irrelevant for the book information.
 Change the Index of the DataFrame
 Tidy up fields in the data such as date of publication with the help of simple regular expression.
 Combine str methods with NumPy to clean columns.

b. Statistical Analysis with Python: Objective: Apply statistical concepts to summarize and analyze data.

Tasks:

1. Descriptive Statistics:
o Calculate measures of central tendency (mean, median, mode) and dispersion
(variance, standard deviation) for a dataset.
2. Probability Distributions:
o Analyze a dataset to identify its underlying probability distribution (e.g.,
normal, binomial).
o Visualize the distribution using histograms and probability plots.
3. Hypothesis Testing:
o Formulate null and alternative hypotheses for a given problem.
o Perform hypothesis testing (e.g., t-test, chi-square test) and interpret the
results.
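A minimal sketch of the three tasks above, using a synthetic normal sample as a stand-in dataset:

# Sketch: descriptive statistics, distribution check and a one-sample t-test
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=52, scale=8, size=120)        # stand-in dataset

# 1. Descriptive statistics (the mode is more meaningful for discrete data and is omitted here)
print("mean", np.mean(sample), "median", np.median(sample))
print("variance", np.var(sample, ddof=1), "std", np.std(sample, ddof=1))

# 2. Check a candidate distribution: Shapiro-Wilk test of normality (plus histograms if desired)
print("Shapiro-Wilk p-value:", stats.shapiro(sample).pvalue)

# 3. Hypothesis test: H0: population mean = 50 vs H1: mean != 50
t_stat, p_val = stats.ttest_1samp(sample, popmean=50)
print(f"t = {t_stat:.3f}, p = {p_val:.4f}",
      "-> reject H0" if p_val < 0.05 else "-> fail to reject H0")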

3 a. Train a regularized logistic regression classifier on the iris dataset
(https://archive.ics.uci.edu/ml/machine-learning-databases/iris/ or the inbuilt iris dataset) using
sklearn. Train the model with the hyperparameter C = 1e4 and report the best classification accuracy.
b. Train an SVM classifier on the iris dataset using sklearn. Try different kernels and the associated
hyperparameters. Train the model with the following set of hyperparameters: RBF kernel,
gamma = 0.5, one-vs-rest classifier, no feature normalization. Also try C = 0.01, 1, 10. For the above
set of hyperparameters, find the best classification accuracy along with the total number of support
vectors on the test data.
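A possible solution sketch for both parts, using scikit-learn's built-in iris data; the 70/30 train/test split and the random seed are assumptions, since the task does not fix them.

# Sketch for program 3: regularized logistic regression and an RBF-kernel SVM on iris
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# (a) logistic regression with C = 1e4
logreg = LogisticRegression(C=1e4, max_iter=1000).fit(X_train, y_train)
print("logistic regression accuracy:", accuracy_score(y_test, logreg.predict(X_test)))

# (b) RBF-kernel SVM, gamma = 0.5, one-vs-rest, no feature normalization; vary C
for C in (0.01, 1, 10):
    svm = OneVsRestClassifier(SVC(kernel="rbf", gamma=0.5, C=C)).fit(X_train, y_train)
    acc = accuracy_score(y_test, svm.predict(X_test))
    n_sv = sum(est.n_support_.sum() for est in svm.estimators_)
    print(f"C={C}: accuracy={acc:.3f}, total support vectors={n_sv}")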

4 A. Consider the following dataset. Write a program to demonstrate the working of the decision tree
based ID3 algorithm.
Price  Maintenance  Capacity  Airbag  Profitable
Low    Low          2         No      Yes
Low    Med          4         Yes     Yes
Low    Low          4         No      Yes
Low    Med          4         No      No
Low    High         4         No      No
Med    Med          4         No      No
Med    Med          4         Yes     Yes
Med    High         2         Yes     No
Med    High         5         No      Yes
High   Med          4         Yes     Yes
High   Med          2         Yes     Yes
High   High         2         Yes     No
High   High         5         Yes     Yes
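scikit-learn implements CART rather than ID3, so a decision tree with the entropy criterion is a common stand-in; the sketch below encodes the categorical attributes ordinally and prints the learned tree.

# Sketch for program 4A: an entropy-criterion decision tree on the toy car dataset
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Price":       ["Low","Low","Low","Low","Low","Med","Med","Med","Med","High","High","High","High"],
    "Maintenance": ["Low","Med","Low","Med","High","Med","Med","High","High","Med","Med","High","High"],
    "Capacity":    [2, 4, 4, 4, 4, 4, 4, 2, 5, 4, 2, 2, 5],
    "Airbag":      ["No","Yes","No","No","No","No","Yes","Yes","No","Yes","Yes","Yes","Yes"],
    "Profitable":  ["Yes","Yes","Yes","No","No","No","Yes","No","Yes","Yes","Yes","No","Yes"],
})

X = OrdinalEncoder().fit_transform(data.drop(columns="Profitable"))
y = data["Profitable"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)  # information-gain splits
print(export_text(tree, feature_names=list(data.columns[:-1])))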

B. Consider the dataset spiral.txt (https://bit.ly/2Lm75Ly). The first two columns in the dataset
correspond to the coordinates of each data point. The third column corresponds to the actual
cluster label. Compute the Rand index for the following methods:
o K-means clustering
o Single-link hierarchical clustering
o Complete-link hierarchical clustering
o Also visualize the dataset and determine which algorithm is able to recover the true
clusters.
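A sketch of one way to approach this with scikit-learn; it assumes spiral.txt has been downloaded locally (adjust the delimiter if the file is comma-separated) and reports the adjusted Rand index, a widely used variant (the plain Rand index is available as sklearn.metrics.rand_score in newer scikit-learn releases).

# Sketch for program 4B: clustering the spiral data and scoring against the true labels
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

data = np.loadtxt("spiral.txt")                        # columns: x, y, true cluster label
X, labels_true = data[:, :2], data[:, 2]
k = len(np.unique(labels_true))

models = {
    "k-means":       KMeans(n_clusters=k, n_init=10, random_state=0),
    "single-link":   AgglomerativeClustering(n_clusters=k, linkage="single"),
    "complete-link": AgglomerativeClustering(n_clusters=k, linkage="complete"),
}
for name, model in models.items():
    labels_pred = model.fit_predict(X)
    print(name, "adjusted Rand index:", adjusted_rand_score(labels_true, labels_pred))
    plt.scatter(X[:, 0], X[:, 1], c=labels_pred, s=10)  # visualize the recovered clusters
    plt.title(name)
    plt.show()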

5 A. Import any CSV file to Pandas Data Frame and perform the following:
(a) Visualize the first and last 10 records
(b) Get the shape, index and column details
(c) Select/Delete the records (rows)/columns based on conditions.
(d) Perform ranking and sorting operations.
(e) Do required statistical operations on the given columns.
(f) Find the count and uniqueness of the given categorical values.
(g) Rename single/multiple columns
2. Import any CSV file to a Pandas DataFrame and perform the following:
(a) Handle missing data by detecting and dropping/filling missing values.
(b) Transform data using the apply() and map() methods.
(c) Detect and filter outliers.
(d) Perform vectorized string operations on Pandas Series.
(e) Visualize data using line plots, bar plots, histograms, density plots and scatter plots.
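A condensed sketch covering several of the operations above; the file name data.csv and the assumption that the first column is numeric are illustrative only.

# Sketch: common Pandas DataFrame operations on an arbitrary CSV
import pandas as pd

df = pd.read_csv("data.csv")                           # hypothetical file name

print(df.head(10)); print(df.tail(10))                 # first and last 10 records
print(df.shape, df.index, df.columns)                  # shape, index and column details
print(df.describe())                                   # statistical operations on numeric columns

first = df.columns[0]
df = df.rename(columns={first: "first_col"})           # rename a column
df = df.sort_values("first_col")                       # sorting
df["rank"] = df["first_col"].rank()                    # ranking (assumes a numeric column)

df = df.dropna(how="all").fillna(0)                    # detect and handle missing values
df["mapped"] = df["first_col"].apply(lambda v: v * 2)  # transform with apply()
df["as_text"] = df["first_col"].astype(str).str.upper()  # vectorized string operation
print(df["as_text"].value_counts())                    # count and uniqueness of categorical values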

B. Exploratory Data Analysis (EDA) with Python: Objective: Conduct exploratory data analysis using
Python libraries.

Tasks:

a. Univariate Analysis: Analyze the distribution of individual variables using histograms and boxplots.
b. Multivariate Analysis: Explore relationships between pairs of variables using scatter plots and
correlation matrices.
c. Data Visualization: Create various visualizations (bar charts, line charts, heatmaps) using Matplotlib
and Seaborn; customize the visualizations to effectively communicate insights.
d. Feature Engineering: Perform feature scaling and encoding of categorical variables; create new
features from existing data to enhance model performance.

6 A. Reading data from text files, Excel and the web and exploring various commands for doing
descriptive analytics on the Iris data set.
B. Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the
following:
1. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation,
Skewness and Kurtosis.
2. Bivariate analysis: Linear and logistic regression modeling
3. Multiple Regression analysis
4. Also compare the results of the above analysis for the two data sets.
C. Apply and explore various plotting functions on UCI data sets.
1. Normal curves
2. Density and contour plots
3. Correlation and scatter plots
4. Histograms
5. Three-dimensional plotting
D. Visualizing Geographic Data with Basemap
7 A. Supervised Learning with Scikit-learn: Objective: Implement and evaluate supervised learning
algorithms.

Tasks:

1. Data Preparation:
o Split a dataset into training and testing sets.
2. Linear Regression:
o Implement a linear regression model to predict a continuous target variable.
o Evaluate the model's performance using metrics like mean squared error
(MSE).
3. Decision Trees:
o Build a decision tree classifier to predict a categorical target variable.
o Assess the model's accuracy, precision, and recall.
4. K-Nearest Neighbors (KNN):
o Implement a KNN model for classification.
o Tune the hyperparameters (e.g., the number of neighbors) to optimize
performance.

b. Demonstrate Decision tree classification model and evaluate the performance of classifier on Iris
dataset.

c. Unsupervised Learning with Scikit-learn: Objective: Implement and explore unsupervised learning
algorithms.

Tasks:

i. K-Means Clustering:
a. Apply K-means clustering to a dataset to group similar data points.
b. Visualize the clusters and interpret the results.
ii. Principal Component Analysis (PCA):
a. Perform PCA on a high-dimensional dataset to reduce its dimensionality.
b. Analyze the principal components and their contribution to variance.
iii. Data Exploration:
a. Use unsupervised learning techniques to uncover hidden patterns and
insights within the data.
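A possible sketch for the unsupervised-learning tasks, using the built-in iris data so it runs without external files:

# Sketch for program 7c: k-means clustering and PCA on the iris data
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((kmeans.labels_ == i).sum()) for i in range(3)])

pca = PCA(n_components=2).fit(X)                        # reduce 4 dimensions to 2
X2 = pca.transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)

plt.scatter(X2[:, 0], X2[:, 1], c=kmeans.labels_, s=15) # clusters plotted in the PCA plane
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.show()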

Suggested Learning Resources For Lab:


● Virtual Labs (CSE): http://cse01-iiith.vlabs.ac.in/

1. Using Python: https://www.python.org
2. R Programming: https://www.r-project.org/
3. Python for Natural Language Processing: https://www.nltk.org/book/
4. Data set: https://bit.ly/2Lm75Ly
5. Data set: https://archive.ics.uci.edu/ml/datasets.html
6. Data set: www.kaggle.com/ruiromanini/mtcars
7. PyCharm: https://www.jetbrains.com/pycharm/
8. https://nptel.ac.in/courses/106/106/106106179/
9. https://nptel.ac.in/courses/106/106/106106212/
10. http://nlp-iiith.vlabs.ac.in/List%20of%20experiments.html
11. Spark Documentation: https://docs.databricks.com/en/index.html
12. Scikit-learn Documentation: https://scikit-learn.org/0.19/documentation.html
13. https://www.databricks.com/
14. https://www.simplilearn.com/tutorials/data-science-tutorial/what-is-data-science
15. https://www.youtube.com/watch?v=N6BghzuFLIg
16. https://www.coursera.org/lecture/what-is-datascience/fundamentals-of-data-science-tPgFU
17. https://www.youtube.com/watch?v=ua-CiDNNj30
18. https://nptel.ac.in/courses/106/105/106105077/
19. https://www.oreilly.com/library/view/doing-data-science/9781449363871/toc01.html
20. http://book.visualisingdata.com/
21. https://matplotlib.org/
22. https://docs.python.org/3/tutorial/
23. https://www.tableau.com/
24. https://skillsforall.com/course/introduction-data-science?courseLang=en-US&utm_campaign=writ&utm_content=intro-to-data-science-get-started-button&utm_source=cisco.com&utm_medium=referral
25. https://www.simplilearn.com/data-science-free-course-for-beginners-skillup
26. https://www.coursera.org/learn/what-is-datascience
27. https://www.coursera.org/learn/datasciencemathskills
28. https://www.coursera.org/specializations/introduction-data-science
29. https://www.coursera.org/learn/foundations-of-data-science
30. https://www.coursera.org/learn/data-science-ethics
31. https://www.coursera.org/learn/foundations-of-data-science
32. http://www.data8.org/
33. https://www.microsoft.com/en-us/research/publication/foundations-of-data-science-2/
34. https://www.codecademy.com/learn/paths/data-science-foundations
35. https://github.com/glouppe/dats0001-foundations-of-data-science
36. https://www.cambridge.org/core/books/abs/data-science-in-context/data-science/D767E4FD5E42834A1D92799541199663

Open ended Programs


Requirements Data Sets
For Open ended IRIS Data Set
Programs It is required that the student be conversant with R Programming Language or Python Programming language and
use them in implementing Data Science and Algorithms.

Iris is a particularly famous toy dataset (i.e. a dataset with a small number of rows and columns, mostly used for initial
small-scale tests and proofs of concept). This specific dataset contains information about the Iris, a genus that
includes 260-300 species of plants. The Iris dataset contains measurements for 150 Iris flowers, each belonging to
one of three species: Virginica, Versicolor and Setosa (50 flowers for each of the three species). Each of the 150
flowers contained in the Iris dataset is represented by 5 values:
□ Sepal length, in cm
□ Sepal width, in cm
□ Petal length, in cm
□ Petal width, in cm
□ Iris species, one of: Iris-setosa, Iris-versicolor, Iris-virginica.
Each row of the dataset represents a distinct flower (as such, the dataset will have 150 rows). Each row then contains
5 values (4 measurements and a species label). The dataset is described in more detail on the UCI Machine Learning
Repository website. The dataset can either be downloaded directly from there (iris.data file), or from a terminal using
the wget tool. The following command downloads the dataset from the original URL and stores it in a file named iris.csv.
$ wget "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data" -O iris.csv
MNIST Data Set
The MNIST dataset is another particularly famous dataset, available as a CSV file. It contains several thousand
hand-written digits (0 to 9). Each hand-written digit is contained in a 28 × 28 8-bit grayscale image. This means that
each digit has 784 (28 × 28) pixels, and each pixel has a value that ranges from 0 (black) to 255 (white). The dataset
can be downloaded from the following URL:
https://raw.githubusercontent.com/dbdmg/data-science-lab/master/datasets/mnist_test.csv
Each row of the MNIST dataset represents a digit. For the sake of simplicity, this dataset contains only a small
fraction (10,000 digits out of 70,000) of the real MNIST dataset, and is known as the MNIST test set.
For each digit, 785 values are available.
1 Load the Iris dataset as a list of lists (each of the 150 lists should have 5 elements). Compute and print (CO3,4,5)
the mean and the standard deviation for each of the 4 measurement columns (i.e. sepal length and width,
petal length and width). Compute and print the mean and the standard deviation for each of the 4
measurement columns, separately for each of the three Iris species (Versicolor, Virginica and Setosa).
Which measurement would you consider "best", if you were to guess the Iris species based only on those
four values?
2 Load the MNIST dataset. Create a function that, given a position 1 ≤ k ≤ 10,000, prints the k-th digit (CO3,4,5)
of the dataset (i.e. the k-th row of the CSV file) as a grid of 28 × 28 characters. More specifically, you
should map each range of pixel values to the following characters:
[0, 64) → " "
[64, 128) → "."
[128, 192) → "*"
[192, 256) → "#"
Compute the Euclidean distance between each pair of the 784-dimensional vectors of the digits at the
following positions: 26th, 30th, 32nd, 35th. Based on the distances computed in the previous step and
knowing that the digits listed are 7, 0, 1, 1, can you assign the correct label to each of the digits?
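One possible sketch, assuming the downloaded CSV has no header row and that the last 784 values of each row are the pixels (with the label at the start of the row):

# Sketch for open-ended program 2: print a digit as a 28x28 character grid and compare digits
import csv
import math

with open("mnist_test.csv") as f:                       # the locally downloaded CSV
    rows = [list(map(int, row)) for row in csv.reader(f)]

def to_char(p):
    return " " if p < 64 else "." if p < 128 else "*" if p < 192 else "#"

def print_digit(k):                                     # k is 1-based, as in the task
    pixels = rows[k - 1][-784:]
    for r in range(28):
        print("".join(to_char(p) for p in pixels[r * 28:(r + 1) * 28]))

print_digit(26)

def euclidean(k1, k2):
    a, b = rows[k1 - 1][-784:], rows[k2 - 1][-784:]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

for i, j in [(26, 30), (26, 32), (26, 35), (30, 32), (30, 35), (32, 35)]:
    print(i, j, round(euclidean(i, j), 1))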
3 Split the Iris dataset into two datasets - IrisTest_TrainData.csv and IrisTest_TestData.csv. Read them (CO3,4,5)
as two separate data frames named Train_Data and Test_Data respectively.
Answer the following questions:
 How many missing values are there in Train_Data?
 What is the proportion of Setosa types in the Test_Data?
 What is the accuracy score of the K-Nearest Neighbor model (model_1) with 2/3 neighbors using
Train_Data and Test_Data?
 Identify the list of indices of misclassified samples from 'model_1'.
Build a logistic regression model (model_2) keeping the modelling steps constant. Find the accuracy of
model_2.
4 Demonstrate any of the Clustering model and evaluate the performance on Iris dataset. CO3,4,5

Text Books

Sl. No. Title of the Book/Name of the author/Name of the publisher/Edition and Year

1
Introducing Data Science, Davy Cielen, Arno D. B. Meysman and Mohamed Ali, Manning Publications, 2016
2 Robert S. Witte and John S. Witte, “Statistics”, Eleventh Edition, Wiley Publications, 2017. (Units II and III)

Reference Books

1 Joel Grus, “Data Science from Scratch”, 2ndEdition, O’Reilly Publications/Shroff Publishers and Distributors Pvt. Ltd.,
2019. ISBN-13: 978-9352138326

2 Data Visualization workshop, Tim Grobmann and Mario Dobler, Packt Publishing, ISBN 9781800568112

3
4
Data Science for Business by Foster Provost and Tom Fawcett
(https://www.amazon.com/Data-Science-Business-Data-Analytic-Thinking-ebook/dp/B00E6EQ3X4)

5
Python for Data Analysis by Wes McKinney
(https://www.oreilly.com/library/view/python-for-data/9781491957653/)

6 Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow by Aurélien Géron
7 Jake VanderPlas, “Python Data Science Handbook”, O’Reilly, 2016. (Units IV and V)
8 Cathy O’Neil and Rachel Schutt, “ Doing Data Science, Straight Talk From The Frontline”,
O’Reilly, 2014.

9 Jiawei Han, Micheline Kamber and Jian Pei, “ Data Mining: Concepts and Techniques”, Third
Edition. ISBN 0123814790, 2011.

10 Mohammed J. Zaki and Wagner Meira Jr., "Data Mining and Analysis: Fundamental Concepts and
Algorithms", Cambridge University Press, 2014.
11 Jojo Moolayil, “Smarter Decisions : The Intersection of IoT and Data Science”, PACKT, 2016

12 Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman, Cambridge
University Press, 2nd edition, 2014

13 Think Like a Data Scientist, Brian Godsey, Manning Publications, 2017.

14 A Handbook for Data Driven Design by Andy Kirk


15 Foundations of Data Science, Avrim Blum, John Hopcroft, and Ravindran Kannan, 4th January 2018.
https://www.cs.cornell.edu/jeh/book.pdf

Course Outcomes: At the end of the course, the student will be able to:

CO1 - Describe data science terminologies and understand the basics of data science. (RBT Level: R, U; Level 1)
CO2 - Apply the data science process to real-time scenarios and explain how data is collected, managed and stored for data science. (RBT Level: R, U; Level 2)
CO3 - Analyze data visualization tools, and build and prepare data for use with a variety of statistical methods and models. (RBT Level: Ap; Level 3)
CO4 - Apply data storage and processing frameworks and analyze data using various visualization techniques. (RBT Level: Ap, An; Level 4)
CO5 - Apply visualization libraries in Python to interpret and explore data, use the Python libraries for data wrangling, and choose contemporary models, such as machine learning and AI techniques, to solve practical problems. (RBT Level: Ap, An; Level 4)

Program Outcome of this Course


Sl. No. Description POs

1 Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and PO1
computer science and business systems to the solution of complex engineering and societal problems.
2 Problem analysis: Identify, formulate, review research literature, and analyze complex engineering and PO2
business problems reaching substantiated conclusions using first principles of mathematics, natural
sciences, and engineering sciences.

3 Design/development of solutions: Design solutions for complex engineering problems and design system PO3
components or processes that meet the specified needs with appropriate consideration for the public health
and safety, and the cultural, societal, and environmental considerations.

4 Conduct investigations of complex problems: Use research-based knowledge and research methods PO4
including design of experiments, analysis and interpretation of data, and synthesis of the information to
provide valid conclusions.

5 Modern tool usage: Create, select, and apply appropriate techniques, resources, and PO5
modern engineering and IT tools including prediction and modeling to complex engineering activities with
an understanding of the limitations
6 The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal, PO6
health, safety, legal and cultural issues and the consequent responsibilities
relevant to the professional engineering and business practices.
7 Environment and sustainability: Understand the impact of the professional engineering solutions in business PO7
societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable
development.

8 Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the PO8
engineering and business practices.

9 Individual and team work: Function effectively as an individual, and as a member or leader in diverse teams, PO9
and in multidisciplinary settings.

10 Communication: Communicate effectively on complex engineering activities with the engineering community PO10
and with society at large, such as, being able to comprehend and write effective reports and design
documentation, make effective presentations, and give and receive clear instructions.

11 Project management and finance: Demonstrate knowledge and understanding of the engineering, business PO11
and management principles and apply these to one's own work, as a member and leader in a team, to
manage projects and in multidisciplinary environments.

12 Life-long learning: Recognize the need for, and have the preparation and ability to engage in independent PO12
and life-long learning in the broadest context of technological change.

Mapping of Course Outcomes to Program Outcomes:

CO/PO  PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3
CO1    x x x x
CO2    x
CO3    x x x
CO4    x
CO5

CIE- Continuous Internal Evaluation (50 Marks)

Theory: Continuous Assessment Tests (IAT) - IAT-1 (50 Marks), IAT-2 (50 Marks)
Practical: Continuous Comprehensive Assessment (CCA) Practical Test - CCA-1 (50 Marks), CCA-2 (50 Marks)
Bloom's Category:
Remember

Understand
Apply

Analyse

Evaluate

Create

CIE Course Assessment Plan

Marks Distribution
Test-1 Test-2 Total Weightage
CO’s Module-1 Module-2 Module 2 to 2.5 Module-2.5 to 3 Module-4 Module-5 Marks
CO1
CO2
CO3
CO4
CO5
Total

SEE- Semester End Examination (50 Marks)

Bloom’s Category SEE Marks


(90% Theory+10% Practical Questions)
Remember
Understand
Apply
Analyse
Evaluate
Create

SEE Course Plan

Marks Distribution
Total Weightage
CO’s Marks
Module-1 Module-2 Module 2 to 2.5 Module-2.5 to 3 Module-4 Module-5
CO1
CO2
CO3
CO4
CO5
Total
