Data Science

The document discusses defining data science and what data scientists do. It examines topics such as data formats, data science skills, algorithms, and a day in the life of a data scientist. It also explores careers in data science and emerging trends in the field.


Defining Data Science and What Data Scientists Do

Defining Data Science

 Defining Data Science

 Video: What is Data Science?

 Fundamentals of Data Science

 The Many Paths to Data Science

 Data Science: The Sexiest Job in the 21st Century

 Defining Data Science

 Advice for New Data Scientists

What Do Data Scientists Do?

 A Day in the Life of a Data Scientist

 Data Science Skills & Big Data

 Working on Different File Formats

 Data Science Topics and Algorithms

 Discussion Prompt: Introduce Yourself

 Reading: What Makes Someone a Data Scientist?

Data Science Topics


Big Data and Data Mining

 How Big Data is Driving Digital Transformation

 Introduction to Cloud

 Cloud for Data Science

 Foundations of Big Data

 Data Scientists at New York University

 What is Hadoop?

 Big Data Processing Tools: Hadoop, HDFS, Hive, and Spark

 Reading: Data Mining

Deep Learning and Machine Learning

 Artificial Intelligence and Data Science

 Generative AI and Data Science


 Neural Networks and Deep Learning

 Applications of Machine Learning

 Reading: Regression

 Lab: Exploring Data using IBM Cloud Gallery

Applications and Careers in Data Science


Data Science Application Domains

 How Should Companies Get Started in Data Science?

 Old Problems with New Data Science Solutions

 Applications of Data Science

 How Data Science is Saving Lives

 Reading: The Final Deliverable

Careers and Recruiting in Data Science

 How Can Someone Become a Data Scientist?

 Recruiting for Data Science

 Careers in Data Science

 Importance of Mathematics and Statistics for Data Science

 The Report Structure

 Reading: Infographic on Roadmap

Data Literacy for Data Science (Optional)


Understanding Data

 Understanding Data

 Data Sources

 Working on Varied Data Sources and Types

 Reading: Metadata

Data Literacy

 Data Collection and Organization

 Relational Database Management System

 NoSQL

 Data Marts, Data Lakes, ETL, and Data Pipelines


 Considerations for Choice of Data Repository

 Data Integration Platforms

Chapter 1: Defining Data Science and What Data Scientists Do

Defining Data Science
What is data science?
Data science is all about using data to understand things better. It's like telling a story with data to uncover hidden
insights and trends. Imagine you have a puzzle (data) and you're trying to figure out what picture it makes. It involves
studying data in different forms, like numbers or words, to find answers to questions you're curious about. Today,
data science is super important because we have tons of data available, and the tools to work with it are cheap and
easy to use.

Fundamentals of Data Science


Data science is about analyzing vast amounts of data from various sources to gain insights and solve problems. It
involves clarifying the problem, collecting and analyzing relevant data, using different models to uncover patterns,
and communicating findings to stakeholders. Ultimately, data science is transforming how organizations operate and
understand their environments.
Data Science: The Sexiest Job in the 21st Century
In today's world, data scientists are like treasure hunters. Companies desperately need them to dig through
mountains of digital information and find valuable insights. But there aren't enough of these skilled treasure hunters
to go around, so they're in high demand. Even though the big bosses might not fully understand the power of data,
companies are willing to pay top dollar to hire these experts. So if you're good at analyzing data, you could land a job
with a six-figure salary.

Defining Data Science


Data science involves studying data to understand the world around us and uncovering insights hidden within it. With
improved access to data and advanced computing power, data analysis can reveal new knowledge and insights, much
like detectives uncovering secrets. Data scientists play a crucial role in translating data into stories that inform
strategic decision-making for companies and institutions. Similar to biological or physical sciences, data science deals
with structured and unstructured data. The process of gleaning insights from data includes clarifying the problem,
data collection, analysis, pattern recognition, storytelling, and visualization.

Key skills for data scientists include curiosity, argumentation, and judgment. They should be comfortable with math,
curious, and skilled storytellers. Data scientists come from diverse backgrounds, and they need to master data
analysis tools relevant to their industry.

The future for skilled data scientists looks promising, with evolving job roles and a growing demand for certification
to ensure qualifications. Data scientists will continue to rely on logical thinking, algorithms, and careful data analysis
to drive successful business outcomes.

What Do Data Scientists Do?


Working on Different File Formats
As a data professional, understanding the various data file types and formats is crucial: it helps you make informed decisions about which formats best suit your data and performance needs. Common formats include delimited text files, XLSX, XML, PDF, and JSON. Delimited text files store data as plain text, with values separated by a delimiter, typically a comma or a tab. XLSX is an XML-based format used for spreadsheets, compatible with many applications and known for its security. XML is a self-descriptive markup language for encoding and sharing data. PDF is a format for presenting documents consistently across devices, often used for legal and financial documents. JSON is a language-independent format for transmitting structured data over the web, widely used in APIs and web services. Together, these formats serve diverse data-sharing needs.
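To make two of these formats concrete, here is a minimal sketch (Python standard library only; the taxi-ride records are made-up sample data) that writes the same small table as delimited text (CSV) and as JSON, then reads both back:

```python
import csv
import io
import json

# A small table: each row is one taxi ride (hypothetical sample data).
rows = [
    {"ride": "1", "distance_km": "3.2", "fare": "7.30"},
    {"ride": "2", "distance_km": "1.1", "fare": "4.15"},
]

# Delimited text (CSV): values separated by commas, one record per line.
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=["ride", "distance_km", "fare"])
writer.writeheader()
writer.writerows(rows)
csv_text = csv_buf.getvalue()

# JSON: the same data as nested, language-independent structures.
json_text = json.dumps(rows, indent=2)

# Both formats round-trip back to the original records.
back_from_csv = list(csv.DictReader(io.StringIO(csv_text)))
back_from_json = json.loads(json_text)
print(back_from_csv == rows and back_from_json == rows)  # → True
```

Note that CSV carries everything as text (the fares come back as strings), while JSON can preserve types such as numbers and nested objects, which is one reason it dominates web APIs.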

Data Science Topics and Algorithms


When it comes to data, I distinguish between structured and unstructured data. Structured data, akin to
Excel spreadsheets, is organized into rows and columns. Unstructured data, prevalent on the web, lacks this
organization and requires sophisticated algorithms for extraction and analysis.
Regression, though often explained with complex statistical jargon, can be simplified. Imagine taking a cab
ride: there's a base fare, and as the distance or time increases, so does the fare. Regression essentially
uncovers this relationship, determining the base fare and the impact of distance and time on the fare. It's
akin to discovering that initial $2.50 fee in a cab ride, along with understanding how distance and time affect
the total fare.

Reading: What Makes Someone a Data Scientist?


Defining who can be called a data scientist is tricky because people have different opinions. Some say it's
about solving problems with data and telling stories about it, no matter how big the data is or what tools you
use. Others say it's more about using specific tools like machine learning or dealing with really big data sets.
But limiting it like this might leave out a lot of talented people. Instead, it's better to have a broader
definition that focuses on solving problems with data and being good at explaining your findings. Being
curious and open-minded is also really important. Overall, there's no one-size-fits-all definition of a data
scientist, but being open to different backgrounds and ideas can help drive innovation in the field.

What Do Data Scientists Do?


Data scientists are like modern-day explorers, using data to solve real-world problems in innovative ways.
They analyze information to find patterns and make predictions. For example, Dr. Murtaza Haider linked bad
weather to public transit complaints in Toronto. They also work on environmental issues, like predicting
water quality problems caused by algae blooms.

Education is important for becoming a data scientist: you need to learn programming and statistics. Tools like
regression analysis and machine learning help data scientists make sense of large amounts of data, often called "big
data."

Data comes in many forms, like text, videos, or spreadsheets. Exceptional data scientists are curious and
skilled at using different techniques to find insights in this data.

In the end, data science is a journey of discovery, where talented individuals use their skills to unlock the
secrets hidden in data.

Glossary of Terms

Comma-separated values (CSV) / Tab-separated values (TSV): Commonly used formats for storing tabular data as plain text, where either the comma or the tab separates each value.
Data file types: A computer file configuration designed to store data in a specific way.
Data format: How data is encoded so it can be stored within a data file type.
Data visualization: A visual way of representing data, such as a graph, in a readily understandable form that makes it easier to see trends in the data.
Delimited text file: A plain text file where a specific character separates the data values.
Extensible Markup Language (XML): A language designed to structure, store, and enable data exchange between various technologies.
Hadoop: An open-source framework designed to store and process large datasets across clusters of computers.
JavaScript Object Notation (JSON): A data format compatible with various programming languages, used by applications to exchange structured data.
Jupyter notebooks: A computational environment that allows users to create and share documents containing code, equations, visualizations, and explanatory text. See Python notebooks.
Nearest neighbor: A machine learning algorithm that predicts a target variable based on its similarity to other values in the dataset.
Neural networks: A computational model used in deep learning that mimics the structure and functioning of the human brain's neural pathways. It takes an input, processes it using previous learning, and produces an output.
Pandas: An open-source Python library that provides tools for working with structured data; often used for data manipulation and analysis.
Python notebooks: Also known as "Jupyter" notebooks; a computational environment that allows users to create and share documents containing code, equations, visualizations, and explanatory text.
R: An open-source programming language used for statistical computing, data analysis, and data visualization.
Recommendation engine: A computer program that analyzes user input, such as behaviors or preferences, and makes personalized recommendations based on that analysis.
Regression: A statistical model that shows a relationship between one or more predictor variables and a response variable.
Tabular data: Data that is organized into rows and columns.
XLSX: The Microsoft Excel spreadsheet file format.

Chapter 2: How Big Data is Driving Digital Transformation

Big Data and Data Mining

How Big Data is Driving Digital Transformation
Big data has its roots at Google, which had to figure out how to compute its PageRank algorithm over the entire web.

Digital transformation is about updating how businesses operate by integrating digital technology into every
aspect of their operations, ultimately delivering better value to customers. This change is driven by data
science and Big Data, allowing organizations to analyze large amounts of data for competitive advantage.
Examples like Netflix, the Houston Rockets NBA team, and Lufthansa show how data analysis can lead to
significant improvements in services and operations. It's not just about digitizing existing processes, but
fundamentally changing how businesses work by incorporating data science into their workflows. Success in
digital transformation requires support from top executives like the CEO and CIO, as well as a shift in mindset
across the entire organization to adapt to new ways of working.

Introduction to Cloud
Cloud computing, also known as the cloud, provides various computing resources like networks, servers, and
storage over the Internet on a pay-for-use basis. It allows users to access applications and data online instead
of on their local computers, such as using web apps or storing files on platforms like Google Drive. One major
benefit is cost-effectiveness, as users can use online applications without purchasing them outright and
access the latest versions automatically. Additionally, cloud-based applications enable collaborative work and
real-time file editing among users. Cloud computing has five essential characteristics: on-demand self-
service, broad network access, resource pooling, rapid elasticity, and measured service. There are three
deployment models: public, private, and hybrid, indicating where the infrastructure resides and who
manages it. Furthermore, there are three service models: Infrastructure as a Service (IaaS), Platform as a
Service (PaaS), and Software as a Service (SaaS), which provide different layers of computing resources.
Cloud for Data Science
Cloud computing is like a magic toolbox for data scientists. It lets you store your data and use powerful
computing tools without being limited by your own computer's capabilities. You can access advanced
algorithms and high-performance computing resources from anywhere, anytime. Plus, the Cloud allows
teams from different parts of the world to work together on the same data simultaneously. You can instantly
access tools like Apache Spark without worrying about installation or maintenance. Whether you're using a
laptop, tablet, or phone, the Cloud is always accessible, making collaboration easier than ever. Big tech
companies like IBM, Amazon, and Google offer Cloud platforms where you can practice and develop your
data science skills using tools like Jupyter Notebooks and Spark clusters. With the Cloud, data scientists can
boost their productivity and work more efficiently.

Foundations of Big Data


In today's digital age, Big Data plays a crucial role in shaping our world. Big Data refers to the massive and diverse
volumes of data generated by people, tools, and machines. It encompasses various elements such as velocity,
volume, variety, veracity, and value. Velocity represents the speed at which data accumulates, while volume refers to
the scale of data stored, which is constantly increasing due to various sources and technologies. Variety signifies the
diversity of data types, including structured and unstructured data from different sources. Veracity relates to the
quality and accuracy of data, which is essential for deriving meaningful insights. Finally, value represents our ability to
turn data into valuable insights and benefits. Big Data enables organizations to gain real-time insights that can
enhance business performance and customer satisfaction. To manage and analyze Big Data effectively, advanced
tools like Apache Spark and Hadoop are utilized, leveraging distributed computing power. These tools help extract,
load, analyze, and process large datasets across distributed resources, unlocking new possibilities for organizations to
connect with their customers and improve their services. So, the next time you interact with your digital devices,
remember that your data is part of a journey through Big Data analysis, shaping the world around you.

What is Hadoop?
According to Dr. White, most of the components of data science, such as probability, statistics, linear algebra, and
programming, have been around for many decades, but only now do we have the computational capabilities to apply
and combine them and come up with new techniques and learning algorithms.
1. Data Slicing and Distributed Computing: Larry Page and Sergey Brin broke data into small pieces and
spread them across many computers. Each computer did a part of the work, making it faster to handle big
amounts of data.
2. Hadoop and Big Data Growth: Hadoop, created by Doug Cutting at Yahoo, made it easier for companies to
manage big data. This started a big trend in using technology to deal with large amounts of information.
3. Scalability: Big data systems can easily grow by adding more computers. This means they can handle more
data without slowing down, which was a huge help for social media companies and others dealing with lots
of information.
4. Data Science Mixes Old and New: Data science combines old techniques like math and statistics with new
tools like computers and programming. This lets us find patterns in really big sets of data, helping us make
better decisions.
5. Decision Sciences Blend Different Fields: Decision sciences bring together different subjects like math,
computers, and business to solve problems using data. Schools like NYU's Stern School of Business are
leaders in this area.
6. Data Science's Rise: The term "data science" became popular recently as more people realized the value of
using data to make decisions. It's a new field that's growing fast.
7. Constant Change: Business analytics and data science keep changing as new technologies come out. Things
like deep learning and neural networks are just some of the latest tools being used to understand data better.
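The data-slicing idea in point 1 can be sketched in plain Python: split the input into chunks, let each "worker" count words in its own chunk (the map step), then merge the partial counts (the reduce step). This is an illustrative single-machine toy with hypothetical text, not how Hadoop itself is implemented, but the pattern is the same one that lets clusters scale by adding machines:

```python
from collections import Counter

def map_count(chunk):
    """One 'worker': count the words in its own slice of the data."""
    return Counter(chunk.split())

def reduce_counts(partials):
    """Merge the partial counts produced by all workers."""
    total = Counter()
    for partial in partials:
        total += partial
    return total

# Hypothetical corpus, sliced into pieces as if spread across machines.
chunks = [
    "big data big insights",
    "data science uses big data",
]

partials = [map_count(c) for c in chunks]  # map: runs independently per chunk
totals = reduce_counts(partials)           # reduce: combine the results

print(totals["big"])   # → 3
print(totals["data"])  # → 3
```

Because each map step touches only its own chunk, adding more machines (more chunks processed in parallel) increases capacity without changing the program, which is the scalability property described above.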
Big Data Processing Tools: Hadoop, HDFS, Hive, and Spark
In this video, we explore three important open-source technologies for big data analytics: Apache Hadoop, Apache
Hive, and Apache Spark.

1. Apache Hadoop: Hadoop is a collection of tools for distributed storage and processing of big data. It scales across
clusters of computers, offering reliable and cost-effective solutions for storing and processing various data formats,
including structured, semi-structured, and unstructured data.

2. Apache Hive: Hive is a data warehouse built on top of Hadoop, designed for data query and analysis. It allows easy
access to data stored in HDFS or other systems like Apache HBase. While queries may have high latency, Hive is
suitable for data warehousing tasks such as ETL, reporting, and data analysis, offering SQL-based access to data.

3. Apache Spark: Spark is a versatile data processing engine for a wide range of applications, including interactive
analytics, stream processing, machine learning, data integration, and ETL. It leverages in-memory processing for
faster computations and supports various programming languages like Java, Scala, Python, R, and SQL. Spark can run
standalone or on top of other infrastructures like Hadoop, accessing data from multiple sources including HDFS and
Hive.

Overall, Spark's ability to process streaming data quickly and perform complex analytics in real-time makes it a crucial
component in the big data ecosystem.

Data Mining
Data mining begins with setting clear goals and understanding the cost-benefit trade-offs involved in achieving
desired levels of accuracy. It requires selecting high-quality data sources and preprocessing them to remove errors
and handle missing data effectively. Transforming data into suitable formats and storing it securely prepares it for
analysis. Mining the data involves applying various techniques to extract insights, which are then evaluated for their
effectiveness and shared with stakeholders for feedback. This iterative process ensures continuous improvement in
data mining outcomes.
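The preprocessing and transformation steps described above can be sketched with a tiny, self-contained example (pure Python; the sensor records, and the rule that negative readings count as errors, are hypothetical):

```python
# Raw records: some have missing values (None) or obvious errors (negative).
raw = [
    {"sensor": "a", "reading": 10.0},
    {"sensor": "b", "reading": None},   # missing value
    {"sensor": "c", "reading": -5.0},   # error: readings cannot be negative
    {"sensor": "d", "reading": 14.0},
]

# Preprocess: drop erroneous records and fill missing values with the mean
# of the valid readings (one simple, common imputation strategy).
valid = [r["reading"] for r in raw
         if r["reading"] is not None and r["reading"] >= 0]
mean = sum(valid) / len(valid)

clean = []
for r in raw:
    if r["reading"] is not None and r["reading"] < 0:
        continue  # remove errors
    clean.append({**r, "reading": mean if r["reading"] is None else r["reading"]})

# 'Mine' the cleaned data: a summary statistic stands in for a real technique.
average = sum(r["reading"] for r in clean) / len(clean)
print(average)  # → 12.0
```

In a real project the "mining" step would apply clustering, classification, or association rules rather than a single average, and the results would loop back into the goal-setting and evaluation stages described above.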
