CS3352-QB Fds
Data Science: Benefits and uses – facets of data - Data Science Process: Overview – Defining research
goals – Retrieving data – Data preparation - Exploratory Data analysis – build the model– presenting
findings and building applications - Data Mining - Data Warehousing – Basic Statistical descriptions of
Data
PART A
1 What is Data Science?
Data Science is the area of study which involves extracting insights from vast amounts of data
using various scientific methods, algorithms, and processes.
It helps you to discover hidden patterns from the raw data.
Data Science is an interdisciplinary field that allows you to extract knowledge from structured or
unstructured data. Data science enables you to translate a business problem into a research project
and then translate it back into a practical solution.
2 Why is Data Science needed?
It helps you to recommend the right product to the right customer to enhance your business
It allows you to build intelligence into machines
It enables you to take better and faster decisions
Data Science can help you to detect fraud using advanced machine learning algorithms
It helps you to prevent any significant financial losses
3 What are the components of data science?
Domain expertise
Data engineering
Statistics
Visualization
Advanced computing
4 List out the data science jobs.
Most prominent Data Scientist job titles are:
Data Scientist
Data Engineer
Data Analyst
Statistician
Data Architect
Data Admin
Business Analyst
Data/Analytics Manager
5 List out the tools for Data Science.
Data Analysis – Python, R, Spark and SAS
Data Warehousing – Hadoop, SQL
Data Visualization - R, Tableau
Machine Learning – Spark, Azure ML studio
6 List out some applications of Data Science.
Internet Search Results (Google)
Recommendation Engine (Spotify)
Intelligent Digital Assistants (Google Assistant)
Autonomous Driving Vehicle (Waymo, Tesla)
Spam Filter (Gmail)
Abusive Content and Hate Speech Filter (Facebook)
Robotics (Boston Dynamics)
Automatic Piracy Detection (YouTube)
7 What are the skills required to become a data scientist?
2 What are the types of data?
2 What are the categories of basic array manipulation?
Attributes of arrays
Determining the size, shape, memory consumption, and data types of arrays.
Indexing of arrays
Getting and setting the value of individual array elements.
Slicing of arrays
Getting and setting smaller sub arrays within a larger array
Reshaping of arrays
Changing the shape of a given array
Joining and splitting of arrays
Combining multiple arrays into one, and splitting one array into many
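The five categories above can be sketched in a few lines. This is only an illustration; the array x and its values are made up for the example.

```python
import numpy as np

# Reshaping: turn a 1-D range of 12 values into a 3x4 grid
x = np.arange(12).reshape(3, 4)

# Attributes: dimensions, shape, element count, data type, memory in bytes
print(x.ndim, x.shape, x.size, x.dtype, x.nbytes)

# Indexing: get and set a single element
first = x[0, 0]
x[0, 0] = 99

# Slicing: first two rows, every other column
sub = x[:2, ::2]

# Joining and splitting
joined = np.concatenate([x, x], axis=0)   # stack two copies vertically
left, right = np.split(x, [2], axis=1)    # split the columns at index 2
```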
3 What is the syntax for Numpy slicing?
The Numpy slicing syntax follows that of the standard Python list, to access a slice of an array x:
x[start:stop:step]
If any of these are unspecified, they default to the values start=0, stop=size of dimension, step=1. We can
access sub-arrays in one dimension and in multiple dimensions.
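A short sketch of the slicing syntax on one-dimensional and two-dimensional arrays; the arrays x and grid are illustrative only.

```python
import numpy as np

x = np.arange(10)        # values 0 through 9

a = x[:5]                # first five elements
b = x[4:7]               # elements at positions 4, 5, 6
c = x[::2]               # every other element
d = x[::-1]              # all elements, reversed

grid = np.arange(12).reshape(3, 4)
e = grid[:2, :3]         # first two rows, first three columns
```

Note that when step is negative, the defaults for start and stop are swapped, which is why x[::-1] reverses the array.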
4 What will be the output for the below code:
import numpy as np
x2 = np.array([[12, 5, 2, 4], [7, 6, 8, 8], [1, 6, 7, 7]])
print(x2[0, :])  # first row of x2
Output:
[12  5  2  4]
5 What do you mean by ufuncs?
Ufuncs are the universal functions. The Vectorized operations in Numpy are implemented via ufuncs whose
main purpose is to quickly execute repeated operations on values in Numpy arrays. NumPy's universal
functions can be used to vectorize operations and thereby remove slow Python loops.
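As a rough comparison, the sketch below computes reciprocals once with an explicit Python loop and once with a ufunc. The input values are arbitrary.

```python
import numpy as np

values = np.array([1.0, 2.0, 4.0, 5.0])

# Loop version: one Python-level operation per element
looped = np.empty(len(values))
for i in range(len(values)):
    looped[i] = 1.0 / values[i]

# ufunc version: the division is applied to every element at once
vectorized = 1.0 / values    # equivalent to np.divide(1.0, values)
```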
6 What is the purpose of the axis keyword?
The axis keyword specifies the dimension of the array that will be collapsed, rather than the
dimension that will be returned.
So specifying axis=0 means that the first axis will be collapsed. For two-dimensional arrays, this
means that values within each column will be aggregated.
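A minimal sketch of the axis keyword on a small two-dimensional array (the array M is illustrative):

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])

col_sums = M.sum(axis=0)   # collapse the rows: one sum per column
row_sums = M.sum(axis=1)   # collapse the columns: one sum per row
```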
7 What are the rules for broadcasting?
Broadcasting in Numpy follows a strict set of rules to determine the interaction between the two arrays:
● Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with
fewer dimensions is padded with ones on its leading (left) side.
● Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape
equal to 1 in that dimension is stretched to match the other shape.
● Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
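The three rules can be seen on small arrays. The sketch below is only illustrative; the shapes are chosen so that each rule is exercised once.

```python
import numpy as np

a = np.arange(3)                  # shape (3,)
b = np.arange(3).reshape(3, 1)    # shape (3, 1)
M = np.ones((2, 3))               # shape (2, 3)

# Rules 1 and 2: a is padded to (1, 3), then stretched to (2, 3)
r1 = M + a

# Rules 1 and 2 on both operands: (1, 3) and (3, 1) both stretch to (3, 3)
r2 = a + b

# Rule 3: (3, 2) against (1, 3) disagrees in the last dimension (2 vs 3)
N = np.ones((3, 2))
try:
    N + a
    raised = False
except ValueError:
    raised = True
```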
8 What is fancy indexing?
Fancy indexing is a style of array indexing that works like simple indexing, except that we pass
arrays of indices in place of single scalars.
This allows us to very quickly access and modify complicated subsets of an array's values.
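A small sketch of fancy indexing for both access and modification; the values in x are arbitrary.

```python
import numpy as np

x = np.array([51, 92, 14, 71, 60, 20, 82, 86, 74, 74])

ind = [3, 7, 4]
picked = x[ind]      # elements at positions 3, 7 and 4, in that order

x[[0, 1]] = 0        # assign to several positions at once
```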
9 What is the difference between np.sort and np.argsort?
np.sort is used to return a sorted version of the array without modifying the input.
np.argsort is used to return the indices of the sorted elements.
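The difference can be checked on a small example array:

```python
import numpy as np

x = np.array([2, 1, 4, 3, 5])

sorted_x = np.sort(x)     # sorted copy; x itself is unchanged
order = np.argsort(x)     # positions that would sort x

# Indexing x with the argsort result reproduces the sorted array
rebuilt = x[order]
```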
10 What is the output of the given code?
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),'formats':('U10', 'i4', 'f8')})
print(data.dtype)
Output:
[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]
11 What is the difference between Numpy array and pandas series?
While the Numpy Array has an implicitly defined integer index used to access the values, the Pandas
Series has an explicitly defined index associated with the values.
This explicit index definition gives the Series object additional capabilities. For example, the index
need not be an integer but can consist of values of any desired type, such as strings.
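A brief sketch contrasting the implicit integer index of a Numpy array with an explicit string index on a Series. The values and labels are made up.

```python
import numpy as np
import pandas as pd

arr = np.array([0.25, 0.5, 0.75, 1.0])
second_from_array = arr[1]            # position-based access only

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
second_from_series = data['b']        # access by explicit label
```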
12 How can a Series object be modified?
Series objects can be modified with a dictionary-like syntax. Just as we can extend a dictionary by
assigning to a new key, we can extend a Series by assigning to a new index value.
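A minimal sketch of the dictionary-like syntax, using an illustrative Series:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

data['e'] = 1.25    # new index value: the Series is extended
data['a'] = 0.0     # existing index value: the entry is overwritten
```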
13 What is the Python None object?
The first sentinel value used by Pandas is None, a Python singleton object that is often used for missing
data in Python code. Because it is a Python object, None cannot be used in any arbitrary Numpy/Pandas
array, but only in arrays with data type 'object' i.e. arrays of Python objects.
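The sketch below shows the effect on the data type. The second half goes slightly beyond the answer above: when a numeric Series is built from a list containing None, Pandas converts the None to NaN instead of keeping an object array.

```python
import numpy as np
import pandas as pd

vals = np.array([1, None, 3, 4])
print(vals.dtype)              # the array falls back to Python objects

s = pd.Series([1, None, 3])    # None is converted to NaN here
missing = s.isnull().sum()     # number of missing entries
```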
14 What is the use of multi-indexing?
Multi-indexing is used to represent two-dimensional data within a one-dimensional Series.
We can also use it to represent data of three or more dimensions in a Series or Data Frame. Each
extra level in a multi-index represents an extra dimension of data.
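As an illustration, the sketch below stores two-dimensional data (group by year) in a one-dimensional Series. The labels and numbers are hypothetical.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['A', 'B'], [2000, 2010]],
                                   names=['group', 'year'])
s = pd.Series([10, 12, 20, 25], index=index)

one = s.loc['A', 2010]    # select with one label per index level
table = s.unstack()       # rows: group, columns: year
```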
15 What is the pd.merge() function?
The pd.merge() function implements a number of types of joins: the one-to-one, many-to-one and
many-to-many joins. All three types of joins are accessed via an identical call to the pd.merge() interface. The
type of join performed depends on the form of the input data.
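The three join types can be sketched with small tables. All table names and contents below are hypothetical; the call is the same each time and only the key structure differs.

```python
import pandas as pd

employees = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa'],
                          'group': ['Accounting', 'Engineering', 'Engineering']})
hire = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake'],
                     'hire_date': [2004, 2008, 2012]})
supervisors = pd.DataFrame({'group': ['Accounting', 'Engineering'],
                            'supervisor': ['Carly', 'Guido']})
skills = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering'],
                       'skill': ['math', 'spreadsheets', 'coding']})

# One-to-one: 'employee' is unique in both tables
one_to_one = pd.merge(employees, hire)

# Many-to-one: 'group' repeats on the left, is unique on the right
many_to_one = pd.merge(employees, supervisors)

# Many-to-many: 'group' repeats in both tables
many_to_many = pd.merge(employees, skills)
```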
16 What is the describe() method?
The method describe() computes several common aggregates for each column and returns the result. It
does not drop missing values itself; it is often called after dropna(), which drops rows with missing values.
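A small sketch of describe() on a hypothetical data frame. Dropping rows with missing values is done by dropna(); describe() then summarizes what remains.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, np.nan, 4.0],
                   'y': [10.0, 20.0, 30.0, 40.0]})

# Drop the row containing NaN, then compute count, mean, std, min,
# quartiles and max for each remaining column
summary = df.dropna().describe()
print(summary)
```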
17 What is split, apply and combine?
The split step involves breaking up and grouping a data frame depending on the value of the specified
key.
The apply step involves computing some function, usually an aggregate, transformation, or filtering,
within the individual groups.
The combine step merges the results of these operations into an output array.
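In Pandas, the three steps are usually performed together by groupby followed by an aggregate. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# split by 'key', apply sum within each group, combine into one Series
totals = df.groupby('key')['data'].sum()
```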
18 What is the use of the get() and slice() operations?
The get() and slice() operations enable vectorized element access from each string in a Series.
For example, we can get a slice of the first three characters of each string using str.slice(0, 3).
The get() and slice() methods also let us access elements of the arrays returned by split().
For example, to extract the last name of each entry, we can combine split() and get().
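Both examples can be sketched as follows; the names in the Series are placeholders.

```python
import pandas as pd

names = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam'])

prefix = names.str.slice(0, 3)          # first three characters of each entry
last = names.str.split().str.get(-1)    # last word of each entry
```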
19 What do you mean by datetime and dateutil?
The datetime type is used to manually build a date. Using the dateutil module, we can parse dates from a
variety of string formats. With datetime object, we can print the day of the week.
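A short sketch of the three operations mentioned above. The date is arbitrary, and the printed day name assumes the default English locale.

```python
from datetime import datetime
from dateutil import parser

built = datetime(year=2015, month=7, day=4)    # build a date manually
parsed = parser.parse("4th of July, 2015")     # parse a date from a string

day_name = built.strftime('%A')                # day of the week
print(day_name)
```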
20 What is the advantage of using numexpr library?
● The Numexpr library gives the ability to compute compound expressions element by element
without the need to allocate full intermediate arrays.
● Numexpr evaluates the expression in a way that does not use full-sized temporary arrays and can be
much more efficient than Numpy, especially for large arrays.
● The Pandas eval() and query() tools are conceptually similar and depend on the Numexpr package.
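A minimal sketch of eval() and query() on a hypothetical data frame. It assumes only that Pandas is installed; if Numexpr is not available, Pandas falls back to evaluating the expression in plain Python, so the results are the same but the memory benefit is lost.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])

# Compound expression written as a string
result_eval = df.eval('(A + B) / (C - 1)')

# Same expression in plain Pandas, which allocates intermediate arrays
result_plain = (df['A'] + df['B']) / (df['C'] - 1)

# query() filters rows with the same string syntax
subset = df.query('A < 0.5 and B < 0.5')
```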
PART B
1 Explain all the array manipulation functions with examples in Numpy.
2 Write short notes on Computation on Arrays.
3 Explain Aggregation Functions and Fancy Indexing with examples in Numpy.
4 Explain selection sort and other sorting methods used in Numpy with Examples
5 What are the Data Manipulation Techniques in Pandas?
6 Explain in detail the steps involved in constructing a pandas data frame
7 What are the steps involved in handling missing data in pandas?
8 Explain in detail about the aggregate, filter, transform and apply operations of the GroupBy object
9 Write short notes on dates and times in pandas with examples.
10 Explain in detail about the Pivot table.
UNIT V - DATA VISUALIZATION
Importing Matplotlib – Line plots – Scatter plots – visualizing errors – density and contour plots – Histograms –
legends – colors – subplots – text and annotation – customization – three dimensional plotting - Geographic Data
with Basemap - Visualization with Seaborn
1 What is Matplotlib?
Matplotlib is a Python library used to create 2D graphs and plots by using Python scripts.
It has a module named pyplot which makes plotting easy by providing features to control line
styles, font properties, axis formatting, etc.
It supports a very wide variety of graphs and plots, such as histograms, bar charts, power spectra, and error
charts.
2 What is the line plot?
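A line plot can be defined as a graph that displays data as points or check marks above a number line,
showing the frequency of each value. In Matplotlib, a line plot (plt.plot) instead joins successive data
points with straight line segments, typically to show how y varies with x.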
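In the Matplotlib sense, a line plot is drawn with plot(), which joins the given (x, y) points with line segments. The sketch below is illustrative; the Agg backend is selected only so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend (assumption: no display)
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

fig, ax = plt.subplots()
lines = ax.plot(x, np.sin(x), linestyle='--', color='blue', label='sin(x)')
ax.set_xlabel('x')
ax.set_ylabel('sin(x)')
```

In an interactive session, plt.show() would display the figure.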
3 Define Scatter plots.
Scatter plots are the graphs that present the relationship between two variables in a data-set.
It represents data points on a two-dimensional plane or on a Cartesian system.
The independent variable or attribute is plotted on the X-axis, while the dependent variable is plotted
on the Y-axis.
These plots are often called scatter graphs or scatter diagrams.
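A rough sketch of a scatter plot in Matplotlib. The data is randomly generated for illustration, with y depending on x plus noise.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend (assumption: no display)
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(100)               # independent variable
y = 2 * x + rng.randn(100)       # dependent variable with noise

fig, ax = plt.subplots()
points = ax.scatter(x, y, alpha=0.5)
ax.set_xlabel('x')
ax.set_ylabel('y')
```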
4 Define Error bars.
Error bars are a graphical enhancement that visualizes the variability of the plotted data on a
Cartesian graph.
Error bars can be applied to graphs to provide an additional layer of detail on the presented data.
5 How do you visualize error bars?
Error bars are used to display either the standard deviation, standard error, confidence intervals or the
minimum and maximum values in a ranged dataset.
To visualise this information, error bars are drawn as cap-tipped lines that extend from the centre
of the plotted data point.
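In Matplotlib, error bars of this kind can be drawn with errorbar(). The sketch below uses synthetic data and a constant, assumed error of dy on every point.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend (assumption: no display)
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(1)
x = np.linspace(0, 10, 20)
dy = 0.8                                   # assumed uncertainty of each point
y = np.sin(x) + dy * rng.randn(20)

fig, ax = plt.subplots()
# fmt='o' draws the data points; capsize adds caps to the ends of the bars
container = ax.errorbar(x, y, yerr=dy, fmt='o', capsize=3)
```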
6 What is density plot?
Density Plot is a type of data visualization tool.
It is a variation of the histogram that uses ‘kernel smoothing’ while plotting the values. It is a
continuous and smooth version of a histogram estimated from the data.
7 What are Contour plots?
Contour plots (sometimes called Level Plots) are a way to show a three-dimensional surface on a two-
dimensional plane.
It graphs two predictor variables, X and Y, on the x-axis and y-axis, and a response variable Z as contours. These
contours are sometimes called the z-slices or the iso-response values.
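A minimal contour plot sketch in Matplotlib. The function used for Z is an arbitrary choice for illustration.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend (assumption: no display)
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)         # grids of x and y coordinates
Z = np.sin(X) * np.cos(Y)        # response value at each grid point

fig, ax = plt.subplots()
cs = ax.contour(X, Y, Z, levels=10)   # lines of equal Z
ax.set_xlabel('X')
ax.set_ylabel('Y')
```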
8 Define histogram.
A histogram is the graphical representation of data where data is grouped into continuous number
ranges and each range corresponds to a vertical bar.
The horizontal axis displays the number range.
The vertical axis (frequency) represents the amount of data that is present in each range.
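A short histogram sketch in Matplotlib using randomly generated data. The number of bins is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend (assumption: no display)
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(0)
data = rng.randn(1000)

fig, ax = plt.subplots()
# counts: bar heights; bin_edges: boundaries of the number ranges
counts, bin_edges, patches = ax.hist(data, bins=30)
ax.set_xlabel('value')
ax.set_ylabel('frequency')
```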
9 What are legends in data visualization?
A legend is used to identify data in visualizations by its color, size, or other distinguishing features.
Legends identify the meaning of various elements in a data visualization and can be used as an
alternative to labeling data directly.
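In Matplotlib, a legend is typically built from the label given to each plotted element. A minimal sketch with illustrative curves:

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend (assumption: no display)
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), '-b', label='Sine')      # solid blue line
ax.plot(x, np.cos(x), '--r', label='Cosine')   # dashed red line
legend = ax.legend(loc='upper left')           # one entry per labeled line
```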
10 Why is color important in data visualization?
Color is important in data visualization because it allows you to highlight certain pieces of information
and promote information recall.
Using different colors can separate and define different data points within a visualization so that viewers
can easily distinguish significant differences or similarities in values.
PART B