Module 3 - Data Science
Data Science
Learning Competencies
3.1. Understand the data, problems, and tools that data analysts use.
3.2. Become familiar with the programming languages used in Data Science.
3.3. Identify what each tool is used for, what programming languages they can execute, their features, and their limitations.
3.4. Introduce relational database concepts and learn and apply foundational knowledge of the Python, R, and SQL languages.
I. Volume
As the name suggests, volume refers to the scale and the amount of data. Over the past decade, as the amount of data has grown, technology has also improved. The decrease in computational
and storage costs has made collecting and storing vast amounts of data far easier. The volume
of the data defines whether it qualifies as big data or not.
When data ranges from about 1 GB to around 10 GB, traditional data science tools tend to work well. The following tools are suited to this range.
a. Microsoft Excel. Excel prevails as the most accessible and most popular tool for handling small amounts of data. The maximum number of rows it supports is just a shade over 1 million, and one sheet can hold only up to 16,384 columns at a time. These numbers are simply not enough when the amount of data is considerable.
b. Microsoft Access. It is a popular tool by Microsoft that is used for data storage. Users of Microsoft Access can either design their own database or create one from a readily available template as per their requirements. Smaller databases of up to 2 GB can be handled smoothly with this tool, but beyond that it starts to break down.
c. SQL. SQL is used to access data from relational database management systems. It is used to define the data in the database and manipulate it when needed. It is also used to create views, stored procedures, and functions in a database, and it allows users to set permissions on tables, procedures, and views.
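To make this concrete, here is a small illustrative sketch that uses Python's built-in sqlite3 module; the employees table and its columns are made up for the example, not taken from the module.

    import sqlite3

    # Open an in-memory SQLite database (nothing is written to disk).
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Define the data: create a small employees table.
    cur.execute("""
        CREATE TABLE employees (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            department TEXT,
            salary REAL
        )
    """)

    # Manipulate the data: insert a few rows.
    cur.executemany(
        "INSERT INTO employees (name, department, salary) VALUES (?, ?, ?)",
        [("Ana", "Finance", 30000), ("Ben", "IT", 35000), ("Cara", "IT", 40000)],
    )

    # Create a view, as described above.
    cur.execute("CREATE VIEW it_staff AS SELECT name FROM employees WHERE department = 'IT'")

    # Query the data: average salary per department.
    cur.execute("SELECT department, AVG(salary) FROM employees GROUP BY department")
    print(cur.fetchall())

    conn.close()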
If the data ranges from more than 10 GB all the way up to more than 1 TB, then the following tools need to be implemented:
a. Hadoop. It is an open-source distributed framework that manages data processing and storage for big data. Hadoop is a framework written in Java, with some code in C and shell script, that works over a collection of simple commodity hardware to deal with large datasets using a very basic programming model.
b. Hive. It is a data warehouse built on top of Hadoop. Hive
provides a SQL-like interface to query the data stored in
various databases and file systems that integrate with
Hadoop.
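The sketch below shows, in rough outline, how a Hive table might be queried from Python using the third-party PyHive package; the host, port, username, and sales table are assumptions, and a running HiveServer2 instance would be required.

    from pyhive import hive  # third-party package: pip install pyhive

    # Connect to an assumed HiveServer2 instance (host and port are placeholders).
    conn = hive.Connection(host="localhost", port=10000, username="student")
    cur = conn.cursor()

    # HiveQL reads very much like ordinary SQL.
    cur.execute("SELECT product, SUM(amount) FROM sales GROUP BY product")
    for row in cur.fetchall():
        print(row)

    conn.close()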
II. Variety
Variety refers to the different types of data that are out there. The data may be one of two types: structured or unstructured.
Structured data is data that has been predefined and formatted to a set structure before being placed in data storage, which is often referred to as schema-on-write (e.g., tabular data such as an employee table, a payout table, or a loan application table). It is the basis for inventory control
III. Velocity
Velocity is the speed at which the data is captured. This includes both real-time and non-real-
time data. Examples of real-time data being collected are CCTV, stock trading, fraud
detection for credit card transactions, and network data.
The following are the most commonly used data science tools for real-time data:
a. Apache Kafka. Kafka is an open-source tool by Apache. It is used for building real-time data pipelines. Some of the advantages of Kafka are that it is fault-tolerant, very fast, and used in production by a large number of organizations. The original use case
Tools for this spectrum enable an organization to identify trends and patterns to make crucial strategic decisions. The types of analysis range from MIS and data analytics to dashboarding.
a. Excel. It gives various options, including pivot tables and charts, that let you do your analysis in double-quick time. It is, in short, the Swiss Army Knife of data science/analytics tools.
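Pivot-table-style summaries are not limited to Excel. For comparison, here is a minimal pandas sketch of the same idea in Python; the sales records are invented for the example.

    import pandas as pd

    # Made-up sales records, the kind of data one might summarize in Excel.
    sales = pd.DataFrame({
        "region":  ["North", "North", "South", "South"],
        "product": ["A", "B", "A", "B"],
        "amount":  [100, 150, 200, 50],
    })

    # A pivot table: total amount per region and product.
    summary = sales.pivot_table(index="region", columns="product",
                                values="amount", aggfunc="sum")
    print(summary)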
This is the domain where the bread and butter of most data scientists comes from. Some of the problems a data scientist will solve involve statistical modeling, forecasting, neural networks, and deep learning.
The following are the commonly used tools for this domain:
Automated machine learning (AutoML) is the process of applying machine learning (ML)
models to real-world problems using automation. AutoML was proposed as an artificial
intelligence-based solution to the ever-growing challenge of applying machine learning. The
high degree of automation in AutoML allows non-experts to make use of machine learning
models and techniques without requiring them to become experts in machine learning.
Automating the process of applying machine learning end-to-end additionally offers the
advantages of producing simpler solutions, faster creation of those solutions, and models that
often outperform hand-designed models. AutoML has been used to compare the relative
importance of each factor in a prediction model.
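Dedicated AutoML platforms automate far more than this, but the following scikit-learn sketch captures the basic idea of automatically trying several models and keeping the one with the best cross-validation score; the dataset and the list of candidate models are purely illustrative.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Candidate models to evaluate automatically.
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(random_state=0),
        "k_nearest_neighbors": KNeighborsClassifier(),
    }

    # Score each candidate with 5-fold cross-validation and keep the best one.
    scores = {name: cross_val_score(model, X, y, cv=5).mean()
              for name, model in candidates.items()}
    best = max(scores, key=scores.get)
    print(scores)
    print("Best model:", best)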
Python is one of the most popular languages used by data scientists and software developers
alike for data science tasks. It can predict outcomes, automate tasks, streamline processes,
and offer business intelligence insights.
Below is a line-up of the most important Python libraries for data science tasks, covering
areas such as data processing, modelling, and visualization.
I. Data Mining
Data mining is a process used by companies to turn raw data into useful information. By
using software to look for patterns in large batches of data, businesses can learn more about
their customers to develop more effective marketing strategies, increase sales and decrease
costs. Data mining depends on effective data collection, warehousing, and computer
processing.
Scrapy. Developers use it for gathering data from APIs. This full-fledged framework follows the Don't Repeat Yourself principle in the design of its interface. As a result, the tool encourages users to write universal code that can be reused for building and scaling large crawlers.
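As a rough illustration of that reusable-crawler idea, here is a minimal Scrapy spider; the target site (Scrapy's public practice site) and the CSS selectors are only an example.

    import scrapy

    class QuotesSpider(scrapy.Spider):
        # Every spider needs a unique name and at least one starting URL.
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # Yield one dictionary ("item") per quote block found on the page.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }

Saved as quotes_spider.py, the spider could be run from the command line with, for example: scrapy runspider quotes_spider.py -o quotes.json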
Data processing is the conversion of data into a usable and desired form. This conversion or "processing" is carried out using a predefined sequence of operations, either manually or automatically. Most of the processing is done using computers and other data processing devices, and is thus done automatically.
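As a small example of such a predefined sequence of operations, the pandas sketch below converts raw, inconsistent records into a cleaner, usable form; the records themselves are invented.

    import pandas as pd

    # Raw, inconsistent records as they might arrive from a form or an export.
    raw = pd.DataFrame({
        "name": ["  Ana ", "Ben", None],
        "joined": ["2021-01-05", "2021-02-05", "2021-03-10"],
        "score": ["10", "8", "not given"],
    })

    # A predefined sequence of processing steps:
    clean = raw.dropna(subset=["name"]).copy()          # 1. drop rows with no name
    clean["name"] = clean["name"].str.strip()           # 2. trim stray whitespace
    clean["joined"] = pd.to_datetime(clean["joined"])   # 3. parse dates into datetimes
    clean["score"] = pd.to_numeric(clean["score"], errors="coerce")  # 4. text scores -> numbers
    print(clean)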
Data modeling is the process of creating a visual representation of either a whole information
system or parts of it to communicate connections between data points and structures. The
goal is to illustrate the types of data used and stored within the system, the relationships
among them, how the data can be grouped and organized, and its formats and attributes.
NumPy. The library offers many handy features for performing operations on n-dimensional arrays and matrices in Python. It helps to process arrays that store values of the same data type and makes performing math operations on arrays (and their vectorization) easier. In fact, the vectorization of mathematical operations on the NumPy array type increases performance and accelerates execution time.
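A tiny sketch of that vectorization idea (the numbers are arbitrary):

    import numpy as np

    # Two arrays holding values of the same data type.
    prices = np.array([100.0, 250.0, 80.0, 40.0])
    quantities = np.array([3, 1, 5, 10])

    # Vectorized math: the multiplication is applied element by element,
    # with no explicit Python loop.
    totals = prices * quantities
    print(totals)        # [300. 250. 400. 400.]
    print(totals.sum())  # 1350.0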
Scikit-learn. Data scientists use it for handling standard machine learning and data mining tasks such as clustering, regression, model selection, dimensionality reduction, and classification. Another advantage? It comes with quality documentation and offers high performance.
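For instance, one of the tasks listed above, clustering, takes only a few lines with scikit-learn; the toy 2-D points below are made up.

    import numpy as np
    from sklearn.cluster import KMeans

    # A handful of made-up 2-D points forming two rough groups.
    points = np.array([[1, 2], [1, 4], [2, 3],
                       [8, 8], [9, 10], [8, 9]])

    # Fit a k-means model with two clusters and inspect the assignments.
    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(model.labels_)           # cluster index assigned to each point
    print(model.cluster_centers_)  # coordinates of the two cluster centers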
III. Data Visualization
Data visualization is the graphical representation of information and data. By using visual
elements like charts, graphs, and maps, data visualization tools provide an accessible way to
see and understand trends, outliers, and patterns in data.
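As a quick illustration of the idea, here is a minimal sketch using Matplotlib, a widely used Python plotting library; the monthly figures are invented.

    import matplotlib.pyplot as plt

    # Invented monthly sales figures.
    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
    sales = [120, 135, 150, 145, 170, 190]

    # A simple line chart makes the upward trend immediately visible.
    plt.plot(months, sales, marker="o")
    plt.title("Monthly sales (sample data)")
    plt.xlabel("Month")
    plt.ylabel("Units sold")
    plt.show()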
e. Pydot. This library helps to generate oriented and non-oriented graphs. Written in pure Python, it serves as an interface to Graphviz. You can easily show the structure of graphs with the help of this library, which comes in handy when you are developing algorithms based on neural networks and decision trees.
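A rough sketch of Pydot in use; it assumes the Graphviz software is installed on the system, and the tiny decision-tree-like graph is purely illustrative.

    import pydot  # the Graphviz binaries must also be installed

    # Build a small directed graph, e.g. a tiny decision-tree-like structure.
    graph = pydot.Dot("toy_tree", graph_type="digraph")
    graph.add_node(pydot.Node("age < 30?"))
    graph.add_node(pydot.Node("approve"))
    graph.add_node(pydot.Node("review"))
    graph.add_edge(pydot.Edge("age < 30?", "approve", label="yes"))
    graph.add_edge(pydot.Edge("age < 30?", "review", label="no"))

    # Render the graph to an image file.
    graph.write_png("toy_tree.png")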
API is an acronym for Application Programming Interface, which software uses to access data, server software, or other applications, and it has been around for quite some time. In layman's terms, it is a software intermediary that allows two applications to talk to each other.
APIs are versatile and can be used on web-based systems, operating systems, database
systems, and computer hardware. Developers use APIs to make their jobs more efficient by
reusing code from before and only changing the part relevant to the process they want to
improve. A good API makes it easier to create a program because the building blocks are in
place. APIs use defined protocols to enable developers to build, connect and integrate
applications quickly and at scale.
APIs communicate through a set of rules that define how computers, applications, or
machines can talk to each other. The API acts as a middleman between any two devices that
want to connect for a specified task.
A simplified example: when you sign in to Facebook from your phone, you tell the Facebook application that you would like to access your account. The mobile application makes a call to an API to retrieve your Facebook account and credentials. Facebook then accesses this information from one of its servers and returns the data to the mobile application.
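In code, an API call of this kind is usually just an HTTP request. The sketch below uses the Python requests library against a hypothetical endpoint; the URL, the parameters, and the shape of the response are all made up for illustration.

    import requests  # third-party package: pip install requests

    # Hypothetical endpoint and parameters; a real API documents its own.
    url = "https://api.example.com/v1/profile"
    params = {"user_id": "12345"}
    headers = {"Authorization": "Bearer <access-token>"}  # placeholder credential

    response = requests.get(url, params=params, headers=headers, timeout=10)

    if response.status_code == 200:
        profile = response.json()  # the server returns data, typically as JSON
        print(profile)
    else:
        print("Request failed with status", response.status_code)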
Categories of API
A. Web-based API. Some popular examples of web-based APIs are the Twitter REST API, Facebook Graph API, Amazon S3 REST API, etc.
B. Operating System. There are multiple OS-based APIs that offer the functionality of various OS features, which can be incorporated in creating Windows or Mac applications. Some examples of OS-based APIs are Cocoa, Carbon, WinAPI, etc.
C. Database System. Interaction with most databases is done using API calls to the database. These APIs are defined in a manner that passes the requested data in a predefined format understandable by the requesting client.
This makes the process of interacting with databases generalized, thereby enhancing the compatibility of applications with various databases. They are very robust and provide a structured interface to the database.
Some popular examples are the Drupal 7 Database API, Drupal 8 Database API, and Django API.
D. Hardware System. These APIs allow access to the various hardware components of a system. They are extremely crucial for establishing communication with the hardware, which makes possible a range of functions, from collecting sensor data to displaying output on your screen.
For example, the Google PowerMeter API allows device manufacturers to build home energy monitoring devices that work with Google PowerMeter.
Types of APIs
A. REST API. This stands for representational state transfer and delivers data using the lightweight JSON format. Most public APIs use it because of its fast performance, dependability, and ability to scale by reusing modular components without affecting the system as a whole. This API gives access to data by using a uniform and predefined set of operations. REST APIs are built around URLs and the HTTP protocol and follow these 6 architectural constraints:
1. Client-Server Based – the client handles the front-end process while the server handles the back end, and both can be replaced independently of each other.
2. Uniform Interface – defines the interface between client and server and simplifies the architecture to enable each part to develop separately.
3. Stateless – each request from client to server must be independent and contain all of the necessary information so that the server can understand and process it accordingly.
4. Cacheable – maintains cached responses between client and server, avoiding any additional processing.
5. Layered System – the architecture may be composed of multiple layers (such as proxies or load balancers), and no layer can see beyond the layer it directly interacts with.
6. Code on Demand – an optional constraint that allows the server to extend client functionality by sending executable code (for example, scripts) to the client.
B. SOAP. Simple Object Access Protocol is a little more complex than REST because it requires more upfront information about how it sends its messages. This API has been around since the late 1990s and uses XML to transfer data. It enforces strict rules and advanced security, which require more bandwidth.
This protocol does not have the ability to cache, has strict communication rules, and needs every piece of information about an interaction before any call can be processed.
D. JSON-RPC. It is very similar to XML-RPC in that they work the same way, except that this protocol uses the JSON format instead of XML. The client is typically software that calls on a single method of a remote system.
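Below is a hedged sketch of what a JSON-RPC 2.0 call can look like from Python with the requests library; the endpoint URL and the remote method name are hypothetical.

    import requests

    # A JSON-RPC request is a single POST carrying a small JSON envelope.
    payload = {
        "jsonrpc": "2.0",
        "method": "get_balance",        # hypothetical remote method
        "params": {"account_id": 42},
        "id": 1,
    }

    response = requests.post("https://rpc.example.com/api", json=payload, timeout=10)
    print(response.json())  # e.g. {"jsonrpc": "2.0", "result": ..., "id": 1}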
IDENTIFICATION.
Directions: Identify what is described in each statement. Write your answer on the space
before the number.
MULTIPLE CHOICE.
A. Directions: Analyze the questions carefully. Choose the letter of the correct answer. Write
your answer on the space before the number.
____1. Which of the following tools tends to work well with data that has a volume of less
than 10 GB?
A. Hive
B. Hadoop
C. Microsoft Excel
D. Scrapy
____2. Which of the following tools is used for automated machine learning?
A. DataRobot
B. Jupyter Notebook
C. QlikView
D. Python
____5. Pandas is a library created to help developers work with ‘labeled’ and ‘relational’ data
intuitively. Which of the following describes the task Pandas is used for?
A. Data Mining
B. Data Processing
C. Data Visualization
D. Data Programming
B. Directions: Choose the word that best completes the analogy. Write the letter of the answer on the space before the number.
CONCEPTUALIZATION.
Directions: In your Virtual Expo, create 4 concept maps with the Data Science languages and tools (Python, SQL, R, and Jupyter Notebook) in the middle of the diagrams. You may copy the concept maps below or create a different style of concept map. Then, complete the concept maps by supplying the functions or features of the items written in the center of the diagrams.
Rubric:

Content
Exemplary (10): The content is complete, rich, concise, and straightforward. The content is relevant to the discussed topics and thoroughly answers the questions.
Proficient (8): Content has one or two discrepancies, but includes relevant details.
Partially Proficient (5): There are 4-5 missing details. Some extraneous information and minor gaps are included.
Incomplete (2): There is insufficient detail, or detail is irrelevant and extraneous.

Creativity/Visual
Exemplary (10): The concept maps are visually effective. The use of graphics/images/photographs seamlessly relates to the content.
Proficient (8): The concept maps are visually sensible. Graphics/images/photographs are included and appropriate.
Partially Proficient (5): The main theme is still discernible, but the graphics/images/photographs are used randomly.
Incomplete (2): Lacks visual clarity. The graphics/images/photographs distract from the content of the expo.

Team Collaboration
Exemplary (10): The group establishes and documents clear and formal roles for each member and distributes the workload equally.
Proficient (8): The group establishes clear and formal roles for each member and distributes the workload equally.
Partially Proficient (5): The group establishes informal roles for each member. The workload could be distributed more equally.
Incomplete (2): The group does not establish roles for each member, and the workload is unequally distributed.
References

Amazon Lex – AWS Chatbot AI. (n.d.). Amazon Web Services, Inc. https://aws.amazon.com/lex/
An Introduction to APIs (Application Programming Interfaces) & 5 APIs a Data Scientist must know! (2020, July 5). Analytics Vidhya. https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-apis-application-programming-interfaces-5-apis-a-data-scientist-must-know/
Custer, C. (n.d.). 15 Python Libraries for Data Science You Should Know. Dataquest. https://www.dataquest.io/blog/15-python-libraries-for-data-science/
Data Mining: How Companies Use Data to Find Useful Patterns and Trends. (n.d.). Investopedia. https://www.investopedia.com/terms/d/datamining.asp
Davis, T. (2019, December 31). What is an API and How Does It Work? Towards Data Science (Medium). https://towardsdatascience.com/what-is-an-api-and-how-does-it-work-1dccd7a8219e
Dewani, R. (n.d.). 22 Widely Used Data Science and Machine Learning Tools in 2020. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2020/06/22-tools-data-science-machine-learning/
IBM Watson and its Key Features. (n.d.). NewGenApps. https://www.newgenapps.com/blogs/ibm-watson-and-its-key-features/
Jordan, M. (n.d.). What is SPSS and How Does it Benefit Survey Data Analysis? Alchemer. https://www.alchemer.com/resources/blog/what-is-spss/
Overview - Spark 3.1.2 Documentation. (n.d.). spark.apache.org. https://spark.apache.org/docs/latest/
R: What is R? (n.d.). r-project.org. https://www.r-project.org/about.html
What can I do with SAS? (n.d.). support.sas.com. Retrieved August 17, 2021, from https://support.sas.com/software/products/sas-studio/faq/SAS_whatis.htm
What Is Data Visualization? Definition, Examples, and Learning Resources. (n.d.). Tableau. https://www.tableau.com/learn/articles/data-visualization
What is Matlab? (n.d.). cimss.ssec.wisc.edu. https://cimss.ssec.wisc.edu/wxwise/class/aos340/spr00/whatismatlab.htm
What is Python? Executive Summary. (n.d.). python.org. https://www.python.org/doc/essays/blurb/
Wiki Archive. (n.d.). DataRobot. https://www.datarobot.com/wiki/
Wikipedia contributors. (n.d.). Automated machine learning. Wikipedia. https://en.wikipedia.org/wiki/Automated_machine_learning