Data Manipulation at Scale
Algorithms
Overview
We have been working hard to prepare a curriculum for you that captures the breadth of topics
important for a practicing data scientist without sacrificing technical depth on specific topics. Whether
you are new to the area, a manager looking to build knowledge, or a practitioner looking to round out
your technical skills working with massive data sets, applying practical machine learning techniques, or
creating compelling visualizations, we think you will agree that this specialization is a 'must take' for
anyone in the data science arena.
In this course, you will learn the landscape of relevant systems and techniques, the principles on which
they rely, their tradeoffs, and how to evaluate their utility against your requirements. You will learn how
practical systems were derived from the frontier of research in computer science and what systems are
coming on the horizon.
You will also learn the history and context of data science, the skills, challenges, and methodologies the
term implies, and how to structure a data science project.
When you finish the course, you will have a strong foundation for more advanced study in particular
topics across computer science and statistics, as well as a broad understanding of the overall area.
The challenge here was to design a course that would be broad enough to cover the topics that we want
and also inclusive enough that we didn't have to dial it in for a very specific cohort. But the
challenge, then, is that it's going to be very difficult for some people, while others may find some
aspects of it routine.
My note: it seems to be more of a general overview for someone who already knows something about the
subject, so I don't think it will be very interesting for me. And I don't know what I'll get out of it,
maybe a vague idea. Plus, I need to know some Python.
Week 1
Business Intelligence. Business Intelligence systems are associated with a couple of concepts. One is a
data warehouse, and the other is the dashboards and reports that consume data from the data
warehouse and are used to answer particular questions. Both of these components require a lot of
upfront effort to design and build, and are therefore not very adaptable when requirements change.
So a software stack designed for business intelligence may or may not be appropriate for
data science problems, where changing requirements are considered the norm. This is part of why a new
term is warranted: business intelligence became associated with a particular approach
to a particular set of problems, and data science is in some sense broader. The other point I like to
make about business intelligence is that BI engineers are not typically expected to consume their
own data products, perform their own analysis, and make business decisions themselves. Usually
they're building tools for others to make decisions with. As a data scientist, you'll be doing both.
Statistics. Statistical methods are at the heart of what a data scientist does day to day, but a
statistician will typically be comfortable assuming that any data set they encounter will fit in main
memory on a single machine.
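As a small illustration of what changes when the "fits in main memory" assumption is dropped, here is a
minimal Python sketch that computes a mean one chunk at a time. The helper names (read_in_chunks,
streaming_mean) and the sample data are hypothetical and not from the course; a real data set would be
read from disk or a network source rather than from a range.

```python
from typing import Iterable, Iterator


def read_in_chunks(values: Iterable[float], chunk_size: int) -> Iterator[list]:
    """Yield successive chunks, standing in for reading a large file piece by piece."""
    chunk = []
    for value in values:
        chunk.append(value)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # emit the final, possibly shorter, chunk
        yield chunk


def streaming_mean(chunks: Iterable[list]) -> float:
    """Compute the mean while holding only one chunk in memory at a time."""
    total = 0.0
    count = 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("cannot compute the mean of an empty data set")
    return total / count


# Hypothetical stand-in for a data set too large to load at once.
data = range(1, 11)
mean = streaming_mean(read_in_chunks(data, chunk_size=3))
print(mean)
```

Only running totals are kept between chunks, so memory use depends on the chunk size rather than on the
size of the whole data set.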
Database management. Database experts, database programmers and administrators, bring a lot of
skills to the table that make them well suited to data science tasks. But there's a focus on a particular
data model, which is usually the relational data model: rows and columns. When we have data
coming from sources such as video, audio, or even text, or to some extent even graphs (nodes and
edges, which we'll talk about), a relational database may or may not be the right tool. And even the
concepts that transcend any particular database system may or may not be appropriate. We'll
explore when and where they aren't appropriate as we get into the course.
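To make the rows-and-columns versus nodes-and-edges contrast concrete, here is a minimal Python sketch.
It assumes a graph stored as a two-column edge table, which is one common way to put a graph into
relational form; the edge data and the function names are hypothetical. Finding everything reachable
from a node means following edges repeatedly, which in a relational system corresponds to repeated
self-joins or a recursive query on the edge table.

```python
from collections import defaultdict

# Hypothetical edge table: each row is (source, target), like a two-column relation.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]


def build_adjacency(edge_rows):
    """Turn edge rows into a node -> set-of-neighbors mapping."""
    adjacency = defaultdict(set)
    for source, target in edge_rows:
        adjacency[source].add(target)
        adjacency[target]  # touch the key so nodes without outgoing edges still appear
    return adjacency


def reachable(adjacency, start):
    """Return every node reachable from start by following directed edges."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


adjacency = build_adjacency(edges)
print(sorted(reachable(adjacency, "a")))
```

The traversal is a few lines over an adjacency structure, while the same question asked of the raw edge
table needs one join per hop, which is one reason the relational model may or may not fit graph data.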
Visualization. Visualization experts also bring a lot of skills to the table but, like statisticians, are
historically less concerned with massive-scale data that spans many hundreds of machines.
Machine learning. Machine learning is perhaps the closest to data science, but there is a caveat, and
we'll try to make more of a point about this later. As a proportion of the time you'll spend on a data
science problem, actually choosing the right model, algorithm, or machine learning technique, and
applying and running it, is a fairly small fraction. What you'll be spending much more time on is the
preparation of the data, the manipulation of the data, the cleaning of the data, or, as some have been
calling it, the wrangling of the data. And for this, machine learning techniques are not particularly relevant.
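The kind of preparation work described above might look like the following minimal Python sketch. The
records, field names, and cleaning rules are all hypothetical and chosen only to illustrate typical
steps (normalizing text, handling unparseable values, dropping empty and duplicate records); they are
not taken from the course.

```python
# Hypothetical raw records with the usual problems: stray whitespace,
# inconsistent case, unparseable numbers, empty fields, and duplicates.
raw_rows = [
    {"name": " Alice ", "age": "34"},
    {"name": "BOB", "age": "n/a"},
    {"name": "alice", "age": "34"},
    {"name": "", "age": "27"},
]


def clean(rows):
    """Normalize names, parse ages, and drop empty or duplicate records."""
    cleaned = []
    seen = set()
    for row in rows:
        name = row["name"].strip().lower()
        if not name:
            continue  # drop records with no usable name
        try:
            age = int(row["age"])
        except ValueError:
            age = None  # keep the record but mark the age as missing
        key = (name, age)
        if key in seen:
            continue  # drop exact duplicates after normalization
        seen.add(key)
        cleaned.append({"name": name, "age": age})
    return cleaned


cleaned_rows = clean(raw_rows)
for record in cleaned_rows:
    print(record)
```

None of these steps involves a model or a learning algorithm, which is the point being made: this work
comes before any machine learning technique can be applied.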
Velocity: the latency of data processing relative to the growing demand for interactivity (how fast the
data is arriving relative to how fast it needs to be consumed).