DSF 1-2
FUNDAMENTALS
DSC293
Lecture 1-2
Dr. Hufsa Mohsin
OVERVIEW
Data science combines multiple disciplines, including
statistics,
data analysis, and
machine learning.
There are new opportunities for finding and characterizing patterns using techniques described as
data mining, machine learning, data visualization, and so on.
Such techniques require computer processing. Among the tasks that need performing are data
cleaning, combining data from multiple sources, and reshaping data into a form suitable as input
to data-summarization operations for visualization and modeling.
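The three tasks named above (cleaning, combining sources, and reshaping) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the field names (`store`, `month`, `revenue`, `region`), and the helper functions are all hypothetical and do not come from the lecture.

```python
from collections import defaultdict

# Hypothetical raw records from a first source (made-up values).
sales = [
    {"store": " A ", "month": "Jan", "revenue": "100"},
    {"store": "a", "month": "Feb", "revenue": "120"},
    {"store": "B", "month": "Jan", "revenue": ""},  # missing value
    {"store": "B", "month": "Feb", "revenue": "90"},
]
# Hypothetical second source: store -> region.
stores = {"A": "North", "B": "South"}


def clean(rows):
    """Data cleaning: normalize store labels and drop rows with missing revenue."""
    cleaned = []
    for row in rows:
        if not row["revenue"].strip():
            continue
        cleaned.append({
            "store": row["store"].strip().upper(),
            "month": row["month"],
            "revenue": float(row["revenue"]),
        })
    return cleaned


def combine(rows, lookup):
    """Combining sources: attach the region from the second source to each row."""
    return [{**row, "region": lookup[row["store"]]} for row in rows]


def reshape(rows):
    """Reshaping: pivot long rows into one wide record per store."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row["store"]]["region"] = row["region"]
        wide[row["store"]][row["month"]] = row["revenue"]
    return dict(wide)


table = reshape(combine(clean(sales), stores))
print(table)
```

The wide table (one record per store) is the kind of shape that summarization, plotting, or modeling code usually expects as input.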
WHY DATA SCIENCE…
Data science spans a wide range of capacities that have been described as “data acumen.” Key
components of data acumen include
mathematical, computational, and statistical foundations
data management and curation
data description and visualization
data modeling and assessment
workflow and reproducibility
communication and teamwork
domain-specific considerations
ethical problem solving
DATA WRANGLING
A process of preparing data for visualization and other modern techniques of statistical
interpretation and using those data to answer statistical questions via modeling and
visualization.
The ability to reason statistically and utilize computational and algorithmic capacities.
Tools: R and the packages dplyr and ggplot2, with a
focus on a small subset of functions that accomplish data wrangling tasks in a concise and expressive way.
WHAT IS DATA SCIENCE?
The science of extracting meaningful information from data.
Data science is a fine-grained blend of intellectual traditions from statistics and computer
science.
Computer science
The creation of appropriate abstractions to express computational structures and the development
of algorithms that operate on those abstractions.
Statistics
The interplay of general notions of sampling, models, distributions, and decision-making.
Data science is based on the idea that these styles of thinking support each other (Pierson
2016).
Data science is best applied in the context of expert knowledge about the domain from which
the data originate.
The distinction between data and information is the core of data science.
Data scientists are people who are interested in converting the data that is now abundant into
actionable information that always seems to be scarce.
STATISTICS AND DATA SCIENCE
Are the goals of data scientists and statisticians the same?
Much of statistical technique was originally developed in an environment where data were
scarce and difficult or expensive to collect, so statisticians focused on creating methods that
would maximize the strength of inference one is able to make, given the least amount of data.
These techniques were often ingenious, involved sophisticated mathematics, and have proven
invaluable to the empirical sciences.
While several of the most influential early statisticians saw computing as an integral part of
statistics, it is also true that much of the development of statistical theory was to find
mathematical approximations for things that we couldn’t yet compute.
Today, the manner in which we extract meaning from data is different in two ways:
First, we are able to compute many more things than we could before.
Some of the techniques that were ubiquitous in statistics education in the 20th century (e.g., the t-test, ANOVA)
are being replaced by computational techniques that are conceptually simpler but were infeasible until
the microcomputer revolution (e.g., the bootstrap, permutation tests).
Second, we have a lot more data than we had before.
Many of the data we now collect are observational: they don’t come from a designed experiment, and they
aren’t really sampled at random.
Designed experiments such as clinical trials and A/B tests
VS
a predictive model, an interactive visualization of the data, or a web application that allows the user to engage
with the data to explore questions and extract meaning.
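To make the bootstrap and the permutation test mentioned above concrete, here is a minimal sketch using only the Python standard library. The measurements in `group_a` and `group_b` are made up, and the function names and defaults (`n_resamples`, `n_permutations`, `seed`) are illustrative choices rather than anything prescribed by the lecture. The confidence interval uses the simple percentile method, which is one of several bootstrap variants.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)
    # Resample with replacement, recompute the statistic, and sort the results.
    estimates = sorted(
        stat(rng.choices(sample, k=len(sample))) for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the group labels and recompute the difference in means.
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical measurements for two groups (made-up values).
group_a = [12.1, 11.8, 12.5, 13.0, 12.2, 11.9]
group_b = [10.2, 10.9, 10.5, 11.1, 10.4, 10.8]

ci_low, ci_high = bootstrap_ci(group_a)
p_value = permutation_test(group_a, group_b)
print(f"95% bootstrap CI for mean of group_a: ({ci_low:.2f}, {ci_high:.2f})")
print(f"permutation p-value: {p_value:.4f}")
```

Both procedures replace a distributional formula with repeated resampling, which is why they only became practical once computing was cheap.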
EVOLUTION OF DATA SCIENCE
DATA SCIENCE LIFE CYCLE
DATA SCIENCE COMPONENTS
DATA PLANNING AND STRATEGY
Developing a plan or a data strategy means determining what data you are going to gather
and why.
It is not a strategy for deciding which mathematical techniques or technologies we are going
to use.
The focus is on the data we need to address the business problem or opportunity, and why.
Hence, deciding on a strategy requires making a connection between the data and the business
goals.
Gathering and formatting the data, and getting rid of the ‘garbage data’ that doesn’t serve the
business goal, is how we arrive at the mission-critical data for those goals.
DATA MINING
Data mining means analyzing large batches of data for patterns using one or more software
tools. It has applications in multiple fields, such as science and research.
In business, data mining helps companies learn more about their customers, develop more
effective strategies for their business functions, and use resources in an optimal and
insightful manner.
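As one small illustration of pattern finding in customer data, the sketch below counts which pairs of items appear together in purchase baskets and keeps the pairs that occur in a minimum share of them. The baskets, the `frequent_pairs` helper, and the `min_support` threshold are hypothetical and chosen only for this example. Real data mining tools use more scalable algorithms than this brute-force count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets (made-up data for illustration).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk", "jam"},
]


def frequent_pairs(transactions, min_support=0.4):
    """Return item pairs whose share of transactions is at least min_support."""
    counts = Counter()
    for items in transactions:
        # Sort so each pair has one canonical ordering, e.g. ("bread", "milk").
        counts.update(combinations(sorted(items), 2))
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


pairs = frequent_pairs(baskets)
for pair, support in sorted(pairs.items(), key=lambda kv: -kv[1]):
    print(pair, support)
```

A business could use such co-occurrence patterns as a starting point for questions about product placement or bundled offers.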
DATA ENGINEERING
Data engineering primarily involves creating software solutions for data problems, which
means establishing a data system with data pipelines and endpoints within that system.
It requires an in-depth understanding of a wide range of data technologies and
frameworks, along with the ability to create data solutions that enable business processes.
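A data pipeline with an endpoint can be pictured as a chain of small stages, each consuming the output of the previous one. The sketch below is a toy version under stated assumptions: the CSV text, the stage names (`extract`, `transform`, `load`), and the in-memory `sink` are all hypothetical stand-ins for real sources, processing jobs, and storage systems.

```python
import csv
import io
import json

# Hypothetical raw input; the second row is deliberately invalid.
RAW_CSV = "user_id,amount\n1,9.50\n2,not-a-number\n1,3.25\n"


def extract(text):
    """Source stage: yield one dict per CSV row."""
    yield from csv.DictReader(io.StringIO(text))


def transform(rows):
    """Processing stage: parse fields and skip rows that fail validation."""
    for row in rows:
        try:
            yield {"user_id": int(row["user_id"]), "amount": float(row["amount"])}
        except ValueError:
            continue


def load(rows, sink):
    """Endpoint stage: write each record to the sink as one JSON line."""
    count = 0
    for row in rows:
        sink.write(json.dumps(row) + "\n")
        count += 1
    return count


sink = io.StringIO()  # stands in for a file, queue, or database endpoint
written = load(transform(extract(RAW_CSV)), sink)
print(written, "records written")
print(sink.getvalue(), end="")
```

Because each stage is a generator, records flow through one at a time, so the same structure works when the input is too large to hold in memory.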
DATA ANALYSIS & MODELS
Data analysis and mathematical models are considered the heart of data science. We can think
of them in terms of how to use data to extract insights or make business predictions, and how
to create a tool that replaces or supplements what a human does.
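A very small example of using a model to make a business prediction is a straight line fitted by ordinary least squares. The numbers below (`ad_spend`, `sales`) are made up, and `fit_line` and `predict` are hypothetical helper names. This is a sketch of the idea, not a recommended modeling workflow.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend and resulting sales (made-up values).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(ad_spend, sales)


def predict(x):
    """Predict sales for a given spend using the fitted line."""
    return intercept + slope * x


forecast = predict(6.0)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, forecast at 6.0: {forecast:.2f}")
```

The fitted line serves both purposes named above: the slope is an insight (how sales move with spend), and `predict` is a simple tool that supplements a human estimate.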
DATA VISUALIZATION & OPERATIONALIZATION
Data visualization is not just presenting the analyzed data correctly; it involves understanding
the raw data and what needs to be visualized based on the needs and goals of users and
operations.
Data operationalization involves a real-time decision or action by a person, a long-term
response, or a recommendation on a specific task.
1. DATA ANALYST (SQL, R, SAS, and Python)
Data analysts are responsible for a variety of tasks including visualisation, munging, and
processing of massive amounts of data. They also have to perform queries on the databases
from time to time. One of the most important skills of a data analyst is optimization,
because they have to create and modify algorithms that can be used to cull information from
some of the biggest databases without corrupting the data.
Roles and Responsibilities:
Extracting data from primary and secondary sources using automated tools
Developing and maintaining databases
Performing data analysis and making reports with recommendations
Analyzing data and forecasting trends that impact the organization/project
Working with other team members to improve data collection and quality processes
2. DATA ENGINEER (Hive, …)
Data engineers build and test scalable Big Data ecosystems for businesses so that data
scientists can run their algorithms on data systems that are stable and highly optimized.
Data engineers also update existing systems with newer or upgraded versions of the
current technologies to improve the efficiency of the databases.
Roles and Responsibilities:
Design and maintain data management systems
Data collection/acquisition and management
Conducting primary and secondary research
Finding hidden patterns and forecasting trends using data
Collaborating with other teams to perceive organizational goals
Make reports and update stakeholders based on analytics
3. DATABASE ADMINISTRATOR
The job profile of a database administrator is largely self-explanatory: they are
responsible for the proper functioning of all the databases of an enterprise, and they grant or
revoke database services to employees of the company depending on their requirements. They are
also responsible for database backups and recoveries.
DATA SCIENTIST
Data scientists have to understand the challenges of the business and offer the best solutions using
data analysis and data processing. For instance, they are expected to perform predictive
analysis and go through unstructured or disorganized data with a fine-toothed comb to offer
actionable insights. They can also do this by identifying trends and patterns that help
companies make better decisions.
Roles and Responsibilities:
Identifying data collection sources for business needs
Processing, cleansing, and integrating data
Automating data collection and management processes
Using Data Science techniques/tools to improve processes
Analyzing large amounts of data to forecast trends and provide reports with recommendations
Collaborating with business, engineering, and product teams
data warehousing,
STATISTICIAN
A statistician, as the name suggests, has a sound understanding of statistical theories and data
organization. Not only do they extract and offer valuable insights from the data clusters, but
they also help create new methodologies for the engineers to apply.
Roles and Responsibilities:
Collecting, analyzing, and interpreting data
Analyzing data, assessing results, and predicting trends/relationships using statistical
methodologies/tools
Designing data collection processes
Communicating findings to stakeholders
Advising/consulting on organizational and business strategy based on data
Coordinating with cross-functional teams
8. BUSINESS ANALYST (Power BI, Tableau)
The role of business analysts is slightly different from that of other data science jobs. While they do
have a good understanding of how data-oriented technologies work and how to handle large
volumes of data, they also separate the high-value data from the low-value data. In other
words, they identify how Big Data can be linked to actionable business insights for
business growth.
Roles and Responsibilities:
Understanding the business of the organization
Conducting detailed business analysis – outlining problems, opportunities, and solutions
Working on improving existing business processes
Analysing, designing, and implementing new technology and systems
Budgeting and forecasting
Pricing analysis
9. DATA AND ANALYTICS MANAGER (Python, SAS, R, Java)
A data and analytics manager oversees the data science operations and assigns the duties to their team
according to skills and expertise. Their strengths should include technologies like SAS, R, SQL, etc.
and of course management.
Roles and Responsibilities:
Developing data analysis strategies
Researching and implementing analytics solutions
Leading and managing a team of data analysts
Overseeing all data analytics operations to ensure quality
Building systems and processes to transform raw data into actionable business insights
Staying up to date on industry news and trends