Data Science
Introduction to Cloud
What is Hadoop?
Reading: Regression
Importance of Mathematics and Statistics for Data Science
Understanding Data
Data Sources
Reading: Metadata
Data Literacy
NoSQL
Key skills for data scientists include curiosity, argumentation, and judgment. They should be comfortable with math
and be skilled storytellers. Data scientists come from diverse backgrounds, and they need to master the data
analysis tools relevant to their industry.
The future for skilled data scientists looks promising, with evolving job roles and a growing demand for certification
to ensure qualifications. Data scientists will continue to rely on logical thinking, algorithms, and careful data analysis
to drive successful business outcomes.
Education is important for becoming a data scientist: you need to learn programming and statistics. Tools like
regression analysis and machine learning help data scientists make sense of large amounts of data, often called "big
data."
Data comes in many forms, like text, videos, or spreadsheets. Exceptional data scientists are curious and
skilled at using different techniques to find insights in this data.
In the end, data science is a journey of discovery, where talented individuals use their skills to unlock the
secrets hidden in data.
Comma-separated values (CSV) / Tab-separated values (TSV): Commonly used formats for storing tabular data as plain text, where either a comma or a tab separates each value.
Data file types: A computer file configuration designed to store data in a specific way.
Data format: How data is encoded so it can be stored within a data file type.
Data visualization: A visual representation of data, such as a graph, that presents data in a readily understandable way and makes trends easier to see.
Delimited text file: A plain text file in which a specific character separates the data values.
Extensible Markup Language (XML): A language designed to structure, store, and enable data exchange between various technologies.
Hadoop: An open-source framework designed to store and process large datasets across clusters of computers.
JavaScript Object Notation (JSON): A data format compatible with various programming languages, used by applications to exchange structured data.
Jupyter notebooks: A computational environment that allows users to create and share documents containing code, equations, visualizations, and explanatory text. See Python notebooks.
Nearest neighbor: A machine learning algorithm that predicts a target variable based on its similarity to other values in the dataset.
Neural networks: A computational model used in deep learning that mimics the structure and functioning of the human brain's neural pathways. It takes an input, processes it using previous learning, and produces an output.
Pandas: An open-source Python library that provides tools for working with structured data; it is often used for data manipulation and analysis.
Python notebooks: Also known as "Jupyter" notebooks; a computational environment that allows users to create and share documents containing code, equations, visualizations, and explanatory text.
R: An open-source programming language used for statistical computing, data analysis, and data visualization.
Recommendation engine: A computer program that analyzes user input, such as behaviors or preferences, and makes personalized recommendations based on that analysis.
Regression: A statistical model that shows the relationship between one or more predictor variables and a response variable.
Tabular data: Data that is organized into rows and columns.
XLSX: The Microsoft Excel spreadsheet file format.
Digital transformation is about updating how businesses operate by integrating digital technology into every
aspect of their operations, ultimately delivering better value to customers. This change is driven by data
science and Big Data, allowing organizations to analyze large amounts of data for competitive advantage.
Examples like Netflix, the Houston Rockets NBA team, and Lufthansa show how data analysis can lead to
significant improvements in services and operations. It's not just about digitizing existing processes, but
fundamentally changing how businesses work by incorporating data science into their workflows. Success in
digital transformation requires support from top executives like the CEO and CIO, as well as a shift in mindset
across the entire organization to adapt to new ways of working.
Introduction to Cloud
Cloud computing, also known as the cloud, provides various computing resources like networks, servers, and
storage over the Internet on a pay-for-use basis. It allows users to access applications and data online instead
of on their local computers, such as using web apps or storing files on platforms like Google Drive. One major
benefit is cost-effectiveness, as users can use online applications without purchasing them outright and
access the latest versions automatically. Additionally, cloud-based applications enable collaborative work and
real-time file editing among users. Cloud computing has five essential characteristics: on-demand self-
service, broad network access, resource pooling, rapid elasticity, and measured service. There are three
deployment models: public, private, and hybrid, indicating where the infrastructure resides and who
manages it. Furthermore, there are three service models: Infrastructure as a Service (IaaS), Platform as a
Service (PaaS), and Software as a Service (SaaS), which provide different layers of computing resources.
Cloud for Data Science
Cloud computing is like a magic toolbox for data scientists. It lets you store your data and use powerful
computing tools without being limited by your own computer's capabilities. You can access advanced
algorithms and high-performance computing resources from anywhere, anytime. Plus, the Cloud allows
teams from different parts of the world to work together on the same data simultaneously. You can instantly
access tools like Apache Spark without worrying about installation or maintenance. Whether you're using a
laptop, tablet, or phone, the Cloud is always accessible, making collaboration easier than ever. Big tech
companies like IBM, Amazon, and Google offer Cloud platforms where you can practice and develop your
data science skills using tools like Jupyter Notebooks and Spark clusters. With the Cloud, data scientists can
boost their productivity and work more efficiently.
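As a small illustration of that accessibility, a notebook running anywhere can pull a dataset straight from cloud storage. The URL below is hypothetical; any publicly readable CSV endpoint behaves the same way.

```python
# Reading a dataset directly from cloud storage with pandas.
# The URL is hypothetical; any publicly readable CSV works the same way.
import pandas as pd

url = "https://example-bucket.cloud-storage.example.com/sales.csv"
df = pd.read_csv(url)

print(df.head())      # inspect the first rows
print(df.describe())  # summary statistics, computed locally
```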
What is Hadoop?
According to Dr. White, most of the components of data science, such as probability, statistics, linear algebra, and
programming, have been around for many decades, but we now have the computational capabilities to combine
them and come up with new techniques and learning algorithms.
1. Data Slicing and Distributed Computing: Larry Page and Sergey Brin broke data into small pieces and
spread them across many computers. Each computer did a part of the work, making it faster to handle big
amounts of data (a pattern sketched in code after this list).
2. Hadoop and Big Data Growth: Hadoop, created by Doug Cutting at Yahoo, made it easier for companies to
manage big data. This started a big trend in using technology to deal with large amounts of information.
3. Scalability: Big data systems can easily grow by adding more computers. This means they can handle more
data without slowing down, which was a huge help for social media companies and others dealing with lots
of information.
4. Data Science Mixes Old and New: Data science combines old techniques like math and statistics with new
tools like computers and programming. This lets us find patterns in really big sets of data, helping us make
better decisions.
5. Decision Sciences Blend Different Fields: Decision sciences bring together different subjects like math,
computers, and business to solve problems using data. Schools like NYU's Stern School of Business are
leaders in this area.
6. Data Science's Rise: The term "data science" became popular recently as more people realized the value of
using data to make decisions. It's a new field that's growing fast.
7. Constant Change: Business analytics and data science keep changing as new technologies come out. Things
like deep learning and neural networks are just some of the latest tools being used to understand data better.
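As promised in point 1, here is a toy, single-process sketch of the map/shuffle/reduce pattern. It is illustrative only; real frameworks such as Hadoop MapReduce run these same stages across many machines.

```python
# Toy, single-process sketch of the map/shuffle/reduce pattern described
# in point 1. Real frameworks (Hadoop MapReduce, Spark) distribute these
# stages across clusters; the notes name the pattern, not this code.
from collections import defaultdict

documents = [
    "big data needs big tools",
    "data science uses data",
]

# Map: each chunk of input independently emits (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group pairs by key, as the framework would between map and reduce.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine each group's values into a final count.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 2, 'data': 3, ...}
```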
Big Data Processing Tools: Hadoop, HDFS, Hive, and Spark
In this video, we explore three important open-source technologies for big data analytics: Apache Hadoop, Apache
Hive, and Apache Spark.
1. Apache Hadoop: Hadoop is a collection of tools for distributed storage and processing of big data. It scales across
clusters of computers, offering reliable and cost-effective solutions for storing and processing various data formats,
including structured, semi-structured, and unstructured data.
2. Apache Hive: Hive is a data warehouse built on top of Hadoop, designed for data query and analysis. It allows easy
access to data stored in HDFS or other systems like Apache HBase. While queries may have high latency, Hive is
suitable for data warehousing tasks such as ETL, reporting, and data analysis, offering SQL-based access to data.
3. Apache Spark: Spark is a versatile data processing engine for a wide range of applications, including interactive
analytics, stream processing, machine learning, data integration, and ETL. It leverages in-memory processing for
faster computations and supports various programming languages like Java, Scala, Python, R, and SQL. Spark can run
standalone or on top of other infrastructures like Hadoop, accessing data from multiple sources including HDFS and
Hive.
Overall, Spark's ability to process streaming data quickly and perform complex analytics in real-time makes it a crucial
component in the big data ecosystem.
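As a hedged illustration of the Spark description above, the following PySpark sketch loads a file and queries it with SQL, similar in spirit to Hive's SQL-based access. The file path and column names are hypothetical, and it assumes the pyspark package is installed.

```python
# Hypothetical PySpark sketch: file path and column names are invented
# for illustration; assumes pyspark is installed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# Spark can also read from HDFS paths or Hive tables.
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# Register the DataFrame so it can be queried with SQL, Hive-style.
df.createOrReplaceTempView("events")
spark.sql(
    "SELECT user_id, COUNT(*) AS n FROM events "
    "GROUP BY user_id ORDER BY n DESC LIMIT 5"
).show()

spark.stop()
```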
Data Mining
Data mining begins with setting clear goals and understanding the cost-benefit trade-offs involved in achieving
desired levels of accuracy. It requires selecting high-quality data sources and preprocessing them to remove errors
and handle missing data effectively. Transforming data into suitable formats and storing it securely prepares it for
analysis. Mining the data involves applying various techniques to extract insights, which are then evaluated for their
effectiveness and shared with stakeholders for feedback. This iterative process ensures continuous improvement in
data mining outcomes.
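A minimal pandas sketch of those preprocessing steps follows. The input file and its columns (age, signup_date) are hypothetical stand-ins for a real data source.

```python
# Hedged sketch of the cleaning, transforming, and storing steps above.
# The input file and column names are hypothetical.
import pandas as pd

df = pd.read_csv("raw_data.csv")                    # select a data source

df = df.drop_duplicates()                           # remove duplicate records
df["age"] = df["age"].fillna(df["age"].median())    # handle missing values
df["signup_date"] = pd.to_datetime(                 # transform into a usable format
    df["signup_date"], errors="coerce"
)

df.to_csv("clean_data.csv", index=False)            # store analysis-ready data
```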