
Data Manipulation at Scale

This document provides an overview of a course on data manipulation at scale. The course aims to cover important trends and technologies in data science at both a high level and with technical depth on selected topics. It will explore relevant systems and algorithms, the principles they are based on, their tradeoffs, and how to evaluate their utility for different requirements. The course also examines the history of data science and how to structure a data science project. It is organized into a guided tour of trends, a deep dive into key algorithms and techniques, and hands-on assignments to develop practical skills.


Data Manipulation at Scale: Systems and Algorithms

Overview

Welcome to Data Manipulation at Scale: Systems and Algorithms!

We have been working hard to prepare a curriculum for you that captures the breadth of topics
important for a practicing data scientist without sacrificing technical depth on specific topics. Whether
you are new to the area, a manager looking to build knowledge, or a practitioner looking to round out
your technical skills working with massive data sets, applying practical machine learning techniques, or
creating compelling visualizations, we think you will agree that this specialization is a 'must take' for
anyone in the data science arena.

In this course, you will learn the landscape of relevant systems and techniques, the principles on which
they rely, their tradeoffs, and how to evaluate their utility against your requirements. You will learn how
practical systems were derived from the frontier of research in computer science and what systems are
coming on the horizon.

You will also learn the history and context of data science, the skills, challenges, and methodologies the
term implies, and how to structure a data science project.

When you finish the course, you will have a strong foundation for more advanced study in particular
topics across computer science and statistics, as well as a broad understanding of the overall area.

How this course is organized

- a guided tour of important trends and technologies

- a deep dive into selected must-know algorithms, techniques and technology

- a set of hands-on assignments to deliver specific skills and experiences

The challenge here was to design a course broad enough to cover the topics we want, and inclusive enough that we didn't have to dial it in for a very specific cohort. The trade-off is that it's going to be very difficult for some people, while others may find some aspects of it routine.

It seems more like a general overview, for someone who already knows some of this, so I don't think it will be very interesting for me. And I don't know what I'll come away with, maybe a vague idea. Plus, I need to know some Python.
Week 1

Characterizing Data Science

Three sexy skills of data geeks:

- statistics (traditional analysis)

- data munging (parsing, scraping, formatting data)

- visualization (graphs, tools, etc.)
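As a toy sketch of the data munging skill (all the data here is invented for illustration, not from the course), here is a parse-and-clean pass that turns messy semicolon-separated text into uniform records:

```python
import csv
import io
import re

# Invented messy input: inconsistent whitespace, casing, date
# separators, and number formatting.
raw = """\
name; signup ; score
Alice ;2023-01-05; 91%
 bob;2023/02/10 ;87 %
"""

def clean_row(row):
    """Normalize one raw row into a uniform record."""
    name, signup, score = (field.strip() for field in row)
    return {
        "name": name.title(),                        # "bob" -> "Bob"
        "signup": signup.replace("/", "-"),          # one date style
        "score": int(re.sub(r"[^0-9]", "", score)),  # "91%" -> 91
    }

reader = csv.reader(io.StringIO(raw), delimiter=";")
next(reader)                                         # skip the header
records = [clean_row(row) for row in reader if row]
```

Real munging would add validation and error handling, but the shape is the same: parse, normalize, restructure.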

Three types of tasks:

1. Preparing to run a model: gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping

2. Running the model

3. Communicating the results
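The three task types above can be sketched in a few lines (the data and the "model" here are invented stand-ins, not course material):

```python
# 1. Preparing: gather, clean, transform, filter (invented raw data).
raw = [" 3.0", "4.5", "bad", "", "7.5 "]
values = []
for item in raw:
    item = item.strip()              # clean
    try:
        values.append(float(item))   # transform
    except ValueError:
        pass                         # filter out unparseable records

# 2. Running the model (a simple mean stands in for a real model).
mean = sum(values) / len(values)

# 3. Communicating the results.
report = f"Kept {len(values)} of {len(raw)} records; mean = {mean:.1f}"
```

In practice step 1 dominates the effort, which is a recurring theme in the course.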

Distinguishing Data Science from Related Topics


So let's talk a little bit about what distinguishes the term Data Science from other related fields.

Business Intelligence. Business Intelligence systems are associated with two components: a data warehouse, and the dashboards and reports that consume data from that warehouse to answer particular questions. Both components require a lot of upfront effort to design and build, and are therefore not very adaptable when requirements change. So a software stack designed for business intelligence may or may not be appropriate for a data science problem, where changing requirements are the norm. Part of what warrants a new term is that business intelligence became associated with a particular approach to a particular set of problems, while data science is in some sense broader. The other point I like to make about business intelligence is that BI engineers are not typically expected to consume their own data products, perform their own analysis, and make business decisions themselves; usually they're building tools for others to make decisions with. As a data scientist, you'll be doing both.

Statistics. Statistical methods are at the heart of what a data scientist does day to day, but a statistician will typically be comfortable assuming that any data set they encounter fits in main memory on a single machine.
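One way to picture that contrast (a sketch with made-up data, not anything from the course) is a mean computed over an in-memory data set versus a single-pass streaming mean that never holds the full data set:

```python
def mean_in_memory(data):
    """The statistician's default: the whole data set at once."""
    data = list(data)                # materializes everything in RAM
    return sum(data) / len(data)

def mean_streaming(stream):
    """One pass, constant memory: works when the data won't fit."""
    count, total = 0, 0.0
    for x in stream:
        count += 1
        total += x
    return total / count

# The stream could just as well come from a huge file or the network.
numbers = range(1, 1_000_001)
big_mean = mean_streaming(numbers)
```

The streaming version trades nothing here, but for statistics beyond sums (medians, quantiles) the single-pass constraint is exactly what makes scale hard.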

Database management. Database experts, programmers, and administrators bring a lot of skills to the table that make them well suited to data science tasks. But there's a focus on a particular data model, usually the relational data model: rows and columns. Data now also comes from sources such as video, audio, text, and to some extent even graphs, nodes and edges, which we'll talk about. For these, a relational database may or may not be the right tool, and even the concepts that transcend any particular database system may or may not be appropriate. We'll explore when and where they aren't as we get into the course.
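To make the data-model point concrete, here is a sketch (an invented toy graph) of the same data held relationally, as one row per edge, and as an adjacency structure where traversal is natural:

```python
from collections import defaultdict

# Relational view: a flat "edges" table, one row per edge.
edges = [("a", "b"), ("b", "c"), ("a", "c")]

# Graph view: adjacency lists built from the same rows.
adj = defaultdict(list)
for src, dst in edges:
    adj[src].append(dst)

def neighbors(node):
    """Direct lookup. Following multi-step paths would mean
    repeated self-joins on the flat table, but is a simple
    recursion over this structure."""
    return adj[node]
```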

Visualization. Visualization experts also bring a lot of skills to the table, but, like statisticians, they are historically less concerned with massive-scale data that spans many hundreds of machines.

Machine learning. Machine learning is perhaps the closest field to data science, and we'll make more of a point about this later. But as a proportion of the time you'll spend on a data science problem, actually choosing the right model or algorithm, applying it, and running it is a fairly small fraction. You'll spend much more time on preparing, manipulating, and cleaning the data (wrangling it, as some have been saying), and for that, machine learning techniques are not particularly relevant.

Big Data and the 3 Vs


I want to spend a little time on the term big data. I'm not too concerned with a technical definition of the term, because one probably doesn't exist, but I want to arm you with some of the language people use when they describe big data, so that you can speak intelligently about it when asked. The main thing to recognize is the notion of the three V's of big data: volume, velocity, and variety.

Volume: the size of the data

Velocity: the latency of data processing relative to the growing demand for interactivity (how fast is it
coming based on how fast it needs to be consumed)

Variety: the diversity of sources, formats, quality, and structures


Big Data Definitions
The notion Mike Franklin at UC Berkeley uses, which I like, is that Big Data is relative: it's any data that is expensive to manage and hard to extract value from. So it's not about a particular cut-off. What makes it big? Is a petabyte big, a terabyte small, and a gigabyte very small because it fits in memory on your machine? Not necessarily; it depends on what you're trying to do with it, and on what resources and infrastructure you can bring to bear on the problem. In some sense, difficult data is perhaps what Big Data really means: it's not so much about being big as about being challenging. This is really important to remember: big is relative.
