
DSDM Unit2

Dsdm

Uploaded by

manekandan8214
Copyright
© All Rights Reserved

UNIT II: HARNESSING SOCIAL DATA (9 Hrs)

APIs in a nutshell - Different types of API - Advantages and Limitations of social
media APIs - Connecting principles of APIs - Introduction to authentication
techniques - Parsing API outputs - Twitter - Facebook - GitHub - YouTube. Basic
cleaning techniques - MongoDB to store and access social data - MongoDB using
Python. Google Tools.

2 MARKS

Q) WHAT IS AN API?

API stands for Application Programming Interface. It is the medium that allows the
exchange of data points between a service and the programmer or user. The interface
can be thought of as a contract of service between two applications. This contract defines
how the two communicate with each other using requests and responses.

Q) What are the types of APIs?

The two basic types of API are as follows:

• RESTful API
• Stream API

Q) What is a RESTful API?

The RESTful API is the most commonly used type of API. The information from a REST API is
static and comes from historical data. REST stands for Representational State Transfer, and it
relies on the HTTP protocol for data transfer between machines. Since the architecture of REST
uses the HTTP protocol, it would be fair to assume that the WWW itself is based on RESTful design.
Two of the most important uses of RESTful services are:
• GET: Procedure to receive data from a distant machine
• POST: Procedure to write data to a distant machine

Q) What is a Stream API?

You need a Stream API when the requirement is to collect data in real time, instead of
backdated data from the platform. The Stream API of Twitter is widely used to collect
real-time data from Twitter. The output is quite similar to that of a REST API apart from the
real-time aspect.

Q) What are the advantages of social media APIs?

Social media APIs have many advantages. The main advantages are:

• Extraction of Social data


• Ease of App development
• Provides various features
• Automated Marketing

Q) What are the limitations of social media APIs?

Some limitations of social media APIs are:

• Rate limits
• API changes
• Legal restrictions

Q) What are the connecting principles of APIs?
The general connecting principles of APIs are:

• APP registration: Almost every social media platform needs you to register your application
on their website. It involves entering personal information and the objectives in using their
API services. This step results in the generation of certain keys, which are called
authentication and consumer keys.

• Authentication: Use the consumer keys (also called authentication keys) generated from the
previous step to authenticate your application.
• API endpoint hunting: The API endpoints will be different for each provider, so it is
necessary to read the provided documentation to identify which end points best correspond to
your needs.

Q) What is OAuth?

OAuth is an authorization protocol that allows users to share data with an application
without sharing the password. It is a way to obtain a secure authorization scheme based on a
token-based authorization mechanism. There are two API authentication models using
OAuth: user authentication and application authentication. The protocol itself exists in two
versions, OAuth1 and OAuth2.

Q) What is User authentication?

This is the most common form of resource authentication implementation. The signed request
identifies both the application and the end user on whose behalf the API calls are made,
together with the permissions that user has granted. The end user is represented by the
user's access token.

Q) What is Application authentication?

Application authentication is a form of authentication where the application makes API
requests on its own behalf, without a user context. API calls are often rate limited per API
method, but the pool each method draws from belongs to your entire application at large,
rather than to a per-user limit.

Q) What are the steps required to connect a client with OAuth?

The steps required to put in place a client with OAuth authorization are:

1. Creating a user/developer account
2. Creating an application
3. Obtaining access tokens
4. Authorizing HTTP requests (optional)
5. Setting up permission scopes (optional)
6. Connecting to the API using the obtained access tokens

Q) Why do we need to use OAuth?

Social media network APIs aim to provide full interaction with third-party applications, allowing
all kinds of access within rate limits. Thus, applications can perform actions on behalf of their
users and access their data. The main advantages of this protocol are its security and the fact
that the connection protocol is standardized. Therefore, there are standard ways of writing code
and using request libraries. Moreover, an OAuth connection is the most proper and reliable
technique that adheres to the developer policy defined by social network companies. The main
advantage for the user is that it gives the highest available quota and very often more API
endpoints to collect the data.

Q) What are the different versions of OAuth?

The different versions of OAuth are:
• OAuth1
• OAuth2
OAuth2 is a fully rewritten, improved version of OAuth1. It defines four roles: client,
authorization server, resource server, and resource owner.
OAuth1 uses different concepts to describe the roles. There are also multiple technical
differences related, for example, to cryptography.

Q) What are the Twitter APIs?

Twitter offers three main APIs:
• REST API
• Streaming API
• Ads API.

Q) What are the Facebook APIs?

Facebook provides three APIs for different purposes:
• Atlas API: API for partners and advertisers
• Graph API: The primary way for apps to read and write to the Facebook social graph
• Marketing API: To build solutions for marketing automation with Facebook's advertising platform

Q) What is the social graph composed of?

The social graph is composed of:

• Nodes: All the main elements such as user, photo, page, and comment
• Edges: Relationships between nodes such as user photos and comments in posts
• Fields: Attributes that these nodes or edges can have such as location, name, birthday,
time, and so on

Q)What is GitHub?
GitHub is one of the most important platforms for computer programmers and hobbyists. Its
main goal is to host source code repositories and empower open source communities to work
together on new technologies. The platform contains lots of valuable information about what is
happening in the community of technology enthusiasts, what the trends are, what programming
languages have started to emerge, and much more. We will use the data from GitHub to predict
the trending technologies of the future.

Q)What is YouTube?
YouTube is certainly the most popular video sharing social network and helps users to share and
monetize their media content. It has a very rich content ranging from amateur users to
professionals recording quality videos. On top of the media content it contains different kinds of
data such as comments, statistics, or captions automatically extracted from video sound. The
main advantage of YouTube is the number of users and the volume of new videos uploaded
every day. These numbers are huge and increase every day, making a data goldmine of this
social media platform.

Q)What is Pinterest?

Pinterest has become one of the most important photo sharing platforms over the last few years.
It allows users to share photos found on the internet with other users by creating pins. In our
further analysis we will analyze the content and relationships between users. In order to gather
content, we have to establish a connection to the Pinterest API.

Q) What is Encoding?
Comments and conversations are textual data that we retrieve as strings.
In brief, a string is a sequence of characters represented by code points. Every string in Python
is a sequence of Unicode code points, which range from 0 through 0x10FFFF (1,114,111 decimal).
This sequence then has to be represented as a sequence of bytes (values from 0 to 255) in memory.
The rules for translating a Unicode string into a sequence of bytes are called an encoding.

Q)What is Preprocessing?
Preprocessing is one of the most important parts of the analysis process. It reformats the
unstructured data into uniform, standardized form. The characters, words, and sentences
identified at this stage are the fundamental units passed to all further processing stages.

Q) What is Stemming and lemmatization?

The main aim of stemming and lemmatization is to reduce inflectional forms and sometimes
derivationally related forms of a word to a common base form.
Stemming reduces word forms to so-called stems by stripping affixes, whereas lemmatization
reduces word forms to linguistically valid lemmas. For example, cars -> car can be obtained by
stemming, while men -> man and went -> go require lemmatization. Such text processing can
give added value in some domains, and may improve the accuracy of practical information
extraction tasks.
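The difference between the two techniques can be illustrated with a toy sketch. The suffix list and the lemma lookup table below are hand-made assumptions for illustration only; real projects would normally rely on an established NLP library rather than these hypothetical helpers.

```python
def toy_stem(word):
    """Crude suffix stripping; real stemmers use many more rules."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hand-made lookup table (assumption); a real lemmatizer uses a dictionary
# of the language together with part-of-speech information.
LEMMAS = {"cars": "car", "men": "man", "went": "go"}


def toy_lemmatize(word):
    """Return the dictionary form of a word if it is known."""
    return LEMMAS.get(word, word)


for w in ("cars", "men", "went", "studies"):
    print(w, toy_stem(w), toy_lemmatize(w))
```

Note that the stem of "studies" produced here is not a valid word, and that irregular forms such as "men" are left untouched by suffix stripping, which is why lemmatization needs linguistic knowledge.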

Q) What is Tokenization?
Tokenization is the process of breaking a text corpus up into words (most commonly), phrases,
or other meaningful elements, which are then called tokens. The tokens become the basic units
for further text processing.

Q) What is the process of cleaning?

Common cleaning practices are:
• Normalizing the textual content
• Removing special characters (example: punctuation)
• Removing stop words
• Splitting attached words
• Removing URLs and hyperlinks
• Slang lookups

Q) What is MongoDB?

MongoDB (from humongous) is a cross-platform document-oriented database. Classified as a
NoSQL database, MongoDB eschews the traditional table-based relational database structure in
favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making
the integration of data in certain types of applications easier and faster. MongoDB was
originally released under a combination of the GNU Affero General Public License and the
Apache License; since October 2018 the server has been published under the Server Side
Public License (SSPL).

Q)What are the advantages of MongoDB?


MongoDB is recognized for the following advantages:
• Schema-less design
• High performance
• High availability
• Automatic scaling

Q) How to start MongoDB?

We need to go to the folder where mongod.exe is stored and run the following command:
cmd bin\mongod.exe
Once the MongoDB server is running in the background, we can switch to our Python
environment to connect and start working.

Q) How to install MongoDB support for Python?

PyMongo, the Python driver for MongoDB, can be installed using the following command:
pip install pymongo

10, 5 MARKS

Q) APIs in a nutshell:
An API is the medium that allows the exchange of data points between a service and the
programmer or user. API concepts have been widely used in the software industry whenever
different pieces of software needed to exchange data with one another. Mobile and internet
applications have been using web services and APIs to enrich information from external
sources. Social media platforms also started creating APIs to share their data with third-party
application developers. The popularity of data science has also made APIs emerge as a
source for mining and knowledge creation. The nature of every social media platform is
different, and so are their APIs. The steps involved in making a connection may not differ
greatly, but the data points we capture do.

Different types of API:


Currently, two types of API are available. They are as follows:
• RESTful API
• Stream API

RESTful API:
REST stands for Representational State Transfer and it relies on the HTTP protocol for data
transfer between machines. It was created to simplify the transfer of data between
machines, unlike previous web services such as CORBA, RPC, and SOAP. Since the
architecture of REST uses the HTTP protocol, it would be fair to assume that the WWW itself
is based on RESTful design. Two of the most important uses of RESTful services are:
• GET: Procedure to receive data from a distant machine
• POST: Procedure to write data to a distant machine (PUT is used in a similar way to create
or replace a resource)
Almost all the functionalities of a REST API can be used through the preceding two
methods.
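As a minimal sketch of the two procedures, the following builds (but does not send) a GET and a POST request with Python's standard library. The endpoint URL and the parameter names are hypothetical and only serve to show where the data travels in each case.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.example.com/posts"  # hypothetical endpoint


def build_get(params):
    """GET: the parameters travel in the URL query string."""
    url = BASE_URL + "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(url, method="GET")


def build_post(payload):
    """POST: the data travels in the request body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


get_req = build_get({"q": "social data", "count": 10})
post_req = build_post({"message": "hello"})
print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

Passing either object to urllib.request.urlopen would actually send it; that step is left out here because the endpoint does not exist.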
Stream API:
You need a Stream API when the requirement is to collect data in real time, instead of
backdated from the platform. The Stream API of Twitter is widely used to collect real-time
data from Twitter. The output is quite similar to that of a REST API apart from the real-time
aspect. We'll see examples of the Twitter Stream API and its outputs.
Advantages of social media APIs are:
• Social data: APIs allow you to extract valuable data around Social Media users and
content that is used for behavioral analysis and user insights.
• App development: Thousands of software and applications have been built using
Social Media APIs that provide additional services on top of Social Media platforms.
• Marketing: Social media APIs are useful in automating marketing activities such as
social media marketing by posting on platforms. It also helps in enriching marketing
data through Social Data acquired about customers.
Limitations of social media APIs:
• Rate limits: Social media companies need to take into account the amount of data
that enters or leaves their systems. These are rules based on their infrastructural
limitations and business objectives. We must not think of acquiring unlimited amounts
of data at our own speeds. The amount of data and the speed of receiving are clearly
stated by most social media platforms. We have to read them carefully and include
them in our extraction strategy.
• API changes: This is one of the biggest challenges to deal with when developing
applications or analysis using social data. Social media platforms are free to change
or stop their API services at will. Such changes or stoppages could severely
impact development or analytics strategies. The only advice in such situations is to
be prepared for them and to have flexible systems that can adapt to the changes.
• Legal: This challenge is mainly in the use cases around social media APIs. The rules
and regulations for social media platforms are strict about the type of usage of its
data and services. We have to be conscious of the legal framework before thinking of
our usage and applications. Any use of data from APIs that doesn't conform to the
stipulated regulations risks legal implications.
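One way to include rate limits in an extraction strategy is to space out the calls on the client side. The class below is a sketch under stated assumptions: the class name and the quota values are hypothetical, and each platform documents its own limits.

```python
import time


class Throttle:
    """Space out calls so that at most max_calls happen per period seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = period / max_calls
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until the next call is allowed; return the delay applied."""
        now = self.clock()
        delay = 0.0
        if self.last_call is not None:
            delay = max(0.0, self.min_interval - (now - self.last_call))
            if delay:
                self.sleep(delay)
        self.last_call = now + delay
        return delay


# Hypothetical quota of 100 calls per second, kept small so the demo is fast.
throttle = Throttle(max_calls=100, period=1.0)
for page in range(3):
    throttle.wait()
    # The API call for this page would go here.
    print("requesting page", page)
```

The clock and sleep functions are injectable so that the behaviour can be checked without really waiting.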

Q) Connecting principles of APIs:


Connecting to social media platforms and using their API data services requires a few
steps to be configured before usage. There are nuanced differences between different
platforms, but the following are the general steps that are applicable to almost all:
• APP registration: Almost every social media platform needs you to register your application
on their website. It involves entering personal information and the objectives in using their
API services. This step results in the generation of certain keys, which are called
authentication and consumer keys.
• Authentication: Use the consumer keys (also called authentication keys) generated from the
previous step to authenticate your application.
• API endpoint hunting: The API endpoints will be different for each provider, so it is
necessary to read the provided documentation to identify which endpoints best correspond to
your needs.

Q) Introduction to authentication techniques:


Historically, there were multiple ways of accessing API resources, but nowadays
there is one common protocol used by all the main social media networks. When you get
into the developer documentation you will most probably encounter the problem of
authentication referred to by an enigmatic term, OAuth.

What is OAuth?
OAuth is simply an authorization protocol that allows users to share data with an application
without sharing the password. It is a way to obtain a secure authorization scheme based on
a token-based authorization mechanism.
There are two API authentication models using OAuth:
• User authentication
• Application authentication.
User authentication: This is the most common form of resource authentication
implementation. The signed request identifies both the application and the end user on
whose behalf the API calls are made, together with the permissions that user has granted.
The end user is represented by the user's access token.
Application authentication: Application authentication is a form of authentication where the
application makes API requests on its own behalf, without a user context. API calls are often
rate limited per API method, but the pool each method draws from belongs to your entire
application at large, rather than to a per-user limit.
For the purposes of social media analysis, we will in most cases use application
authentication by creating an application on each social media platform that will query the
related API.
There are several steps that are required to put in place a client with OAuth
authorization:
1. Creating a user/developer account: First of all, you have to register a user/developer
account and provide personal information such as a valid email address, name, surname,
country, and in many cases a valid telephone number (the verification process is done by
sending you a text message with a code).
2. Creating an application: Once you create your account, you will have access to a
dashboard, which is very often called a developer console. It provides all the functionalities
to manage your developer account, create and delete applications, or monitor your quota. In
order to obtain access credentials you will have to create your first application via this
interface.
3. Obtaining access tokens: Then, you generate access tokens for your application and
save them in a safe place. They will be used in your code to create an OAuth connection to
the API.
4. Authorizing HTTP requests (optional): Some APIs require HTTP request
authorization, which means that a request has to contain an additional authorization header
that provides the server with information about the identity of the application and permission
scope.
5. Setting up permission scopes (optional): Some APIs have the notion of multilevel
permissions. In that case when you generate your API key you need to specify the scope for
the key. Scope here refers to a set of allowed actions. Therefore, in cases where an
application attempts an action that is out of its scope, it will be refused. This is designed as
an additional security layer. Ideally one should use multiple API keys, each with restricted
scopes, so that in the scenario where your API key is hijacked, due to the restrictions in its
scope the level of potential harm is restricted.
6. Connecting to the API using obtained access tokens: When all the preceding steps
are configured, you can make requests using your access tokens. Now, the only limitation is
the request quota, which depends on each platform.
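Steps 4 and 5 can be sketched as follows, assuming an OAuth2-style bearer token. The URL, the token value, and the scope names are placeholders, and the scope check is only a local illustration of a decision that the server makes in practice.

```python
import urllib.request


def authorized_request(url, access_token):
    """Build (without sending) a request carrying an Authorization header."""
    headers = {"Authorization": "Bearer " + access_token}
    return urllib.request.Request(url, headers=headers)


def is_allowed(action, granted_scopes):
    """Illustrate scope checking: an action outside the scope is refused."""
    return action in granted_scopes


# Placeholder values (assumptions, not real credentials or endpoints).
req = authorized_request("https://api.example.com/me", "ACCESS_TOKEN")
print(req.get_header("Authorization"))

read_only_scopes = {"read_posts"}
print(is_allowed("read_posts", read_only_scopes))
print(is_allowed("delete_posts", read_only_scopes))
```

Keeping a separate key for each restricted scope, as suggested in step 5, means that a leaked read-only key cannot be used for write actions.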

OAuth1 and OAuth2:


You might find different versions of OAuth on social media platforms: OAuth1 and OAuth2.
OAuth2 is a fully rewritten, improved version of OAuth1. It defines four roles (client,
authorization server, resource server, and resource owner), while OAuth1 uses different
concepts to describe the roles. There are also multiple technical differences related, for
example, to cryptography, but a complete analysis is beyond the scope of this chapter. We
can conclude that OAuth2 is slightly less complicated and easier to use.

Practical usage of OAuth:


In this part of the chapter, we will see how to connect to the main social media using OAuth
and how to get and parse the data. There are many libraries in Python 3 implementing the
OAuth protocol. For the purposes of this book, we will show how to use a library called
requests.
The requests library implements the whole range of authentication protocols and allows you
to execute HTTP requests such as GET or POST.

The typical workflow is as follows:
1. Import the requests library in your code.
2. If you are using the OAuth protocol, import the related OAuth library.
3. Create your authenticated connection using the access tokens and application keys that
you will find in the developer console.
4. Make GET requests, passing any query parameters along with the request.
5. Make POST requests, or any of the additional request types that the library supports.

In order to parse the outputs, you can use the following attributes and methods of the
response object r:
• r.text: This gets a string with request outputs
• r.json(): This gets JSON with request outputs
• r.encoding: This gives the encoding of the output
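A hedged sketch of this workflow is shown below. It assumes that the third-party packages requests and requests_oauthlib are installed; the function names and all key values are placeholders introduced for illustration, and the imports are placed inside the functions so that the parsing helper can be read and tried on its own.

```python
def make_oauth1_auth(app_key, app_secret, token, token_secret):
    """Create an OAuth1 signer from the four credentials (placeholders)."""
    from requests_oauthlib import OAuth1  # third-party package
    return OAuth1(app_key, app_secret, token, token_secret)


def fetch(url, params=None, auth=None):
    """Send an authenticated GET request and return the response object."""
    import requests  # third-party package
    response = requests.get(url, params=params, auth=auth, timeout=10)
    response.raise_for_status()
    return response


def describe_response(r):
    """Collect the parsed outputs of a response object."""
    return {
        "encoding": r.encoding,    # attribute, not a method
        "preview": r.text[:80],    # attribute holding the decoded body
        "data": r.json(),          # method that parses the body as JSON
    }
```

A call would then look like describe_response(fetch(url, params={"q": "audi"}, auth=make_oauth1_auth(...))), where the URL, the parameters, and the credentials depend on the platform being queried.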

Q)Explain the API process of Twitter


Twitter proposes three main APIs: the REST API, Streaming API, and the Ads API. We will
be focused on the first two APIs, which provide respectively on-demand or stream data.
Creating application:
As explained in the section about OAuth, you have to obtain credentials to be able to collect
data from Twitter. There are some simple steps to perform this action:
1. Create a Twitter account or use your existing one.
2. Go to https://apps.twitter.com/ and log in with your account.
3. Click on Create your app and submit your phone number. A valid phone number is
required for the verification process. You can use your mobile phone number for one
account only.
4. Fill the form, agree to the terms and conditions, and create your Twitter application.
5. Go to the Keys and Access Tokens tab, save your API key, and API secret and then
click on Create my access token to obtain the Access token and Access token
secret. These four elements will be required to establish a connection with the API.
Selecting the endpoint:
An endpoint indicates where a particular resource can be accessed. It is represented by a
URL that contains the name of the action. Even though there are multiple endpoints for each
API, we will focus on those used in the next chapters of the book. All other endpoints and
actions can be found in the official API documentation.
The Twitter REST API allows clients to retrieve a sample of tweets based on search criteria.
The search request is made up of a Boolean query with some additional optional parameters
(to, from, list, url, and filter). We will store the endpoint URL for this resource in a url variable:

Similarly, we will use an endpoint URL for the Streaming API that returns a random sample
stream of statuses:

We will use both variables to retrieve and parse the data.

Using requests to connect:


First, we include all the necessary libraries. We add the json library to parse the outputs
of the Twitter API easily, and urllib.parse, which encodes a query string into a
proper request URL.
Next, we define the parameters that will be used to establish connections with the
Twitter API and we create an OAuth client connection.

Then we encode our query. We have chosen to search for three car brands: BMW,
Mercedes, and Audi.

After that, we execute a search request using our query and OAuth client.

The request returns a list of tweets with all the meta information. We convert it to JSON
and print the content of each tweet, which we find under the text field.
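The encoding and parsing steps can be sketched without any network access. The endpoint below is assumed to be the legacy v1.1 search endpoint and should be checked against the current Twitter documentation; the response body is a hand-written sample that only mimics the shape of a search result (a statuses list whose items carry a text field).

```python
import json
import urllib.parse

# Assumption: legacy v1.1 search endpoint; verify against the current docs.
url_rest = "https://api.twitter.com/1.1/search/tweets.json"

# Encode the Boolean query so that it is safe to place in a URL.
query = urllib.parse.quote_plus("BMW OR Mercedes OR Audi")
request_url = url_rest + "?q=" + query
print(request_url)

# Hand-written sample body (not real data) standing in for the API output.
sample_body = json.dumps({
    "statuses": [
        {"id": 1, "text": "New BMW spotted"},
        {"id": 2, "text": "Audi vs Mercedes"},
    ]
})


def tweet_texts(body):
    """Return the text field of every tweet in a search response body."""
    return [status["text"] for status in json.loads(body).get("statuses", [])]


for text in tweet_texts(sample_body):
    print(text)
```

In a real run, request_url would be fetched through the authenticated OAuth client and the returned body passed to tweet_texts.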

Similarly, we make a request to the Streaming API to get a random sample of recent tweets.

We keep iterating through all the lines that are being returned.

If a line is not empty, we decode it from UTF-8 to avoid encoding issues and
then we print the text field from the JSON.
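The line-by-line handling of a stream can be sketched as follows. The list of byte strings is a hand-made stand-in for the lines that a streaming HTTP response would yield (for example through the iter_lines method of a requests response); the sample messages are invented for illustration.

```python
import json


def iter_tweet_texts(lines):
    """Yield the text of each tweet found in a stream of raw byte lines."""
    for line in lines:
        if not line:
            continue  # blank keep-alive lines carry no data
        message = json.loads(line.decode("utf-8"))
        if "text" in message:  # skip messages that are not tweets
            yield message["text"]


# Hand-made sample stream (assumption): a tweet, a blank line, a notice.
sample_stream = [
    json.dumps({"text": "caf\u00e9 ouvert"}, ensure_ascii=False).encode("utf-8"),
    b"",
    json.dumps({"delete": {"status": {"id": 1}}}).encode("utf-8"),
]

texts = list(iter_tweet_texts(sample_stream))
print(len(texts))
```

Because the function is a generator, it can consume an endless stream one line at a time instead of loading everything into memory.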

Q) Explain the API processes of Facebook.

Facebook provides three APIs for different purposes:
• Atlas API: API for partners and advertisers
• Graph API: The primary way for apps to read and write to the Facebook social graph.
• Marketing API: To build solutions for marketing automation with Facebook's advertising
platform
The Graph API is the primary way to collect data from the Facebook platform using requests to
query data. It also enables the automation of all the actions available on Facebook such
as data uploads (photos or videos), likes, shares, and account management, among
others.
The name Graph API is related to the structure of the platform, which in fact represents a social
graph composed of:
• Nodes: All the main elements such as user, photo, page, and comment
• Edges: Relationships between nodes such as user photos and comments in posts
• Fields: Attributes that these nodes or edges can have such as location, name, birthday
date, time, and so on

There are small differences between versions, mostly in available endpoints, resources,
and parameters. We use the basic functionalities of this API so switching between
versions should not cause any problems in terms of endpoints and resources, but we
have to check the documentation to pass the right arguments.
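The node, edge, and field structure maps directly onto the shape of a Graph API request URL. The helper below is a sketch: the function name is hypothetical, and the version string and access token are placeholders that must be replaced with values valid for your application.

```python
import urllib.parse

GRAPH_HOST = "https://graph.facebook.com"


def graph_url(node_id, version, access_token, edge=None, fields=()):
    """Build a Graph API URL of the form /version/node[/edge]?fields=..."""
    path = [GRAPH_HOST, version, str(node_id)]
    if edge:
        path.append(edge)  # relationship to follow from the node
    query = {"access_token": access_token}
    if fields:
        query["fields"] = ",".join(fields)  # attributes to return
    return "/".join(path) + "?" + urllib.parse.urlencode(query)


# Placeholder version and token (assumptions, for illustration only).
example = graph_url(
    "me",
    version="v2.8",
    access_token="ACCESS_TOKEN",
    edge="photos",
    fields=["name", "created_time"],
)
print(example)
```

Keeping the version as an explicit argument makes switching between API versions a one-line change, although the available fields still have to be checked in the documentation.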
Q) Explain the API processes of GitHub.
GitHub is one of the most important platforms for computer programmers and hobbyists. Its main
goal is to host source code repositories and empower open source communities to work together
on new technologies. The platform contains lots of valuable information about what is happening
in the community of technology enthusiasts, what the trends are, what programming languages
have started to emerge, and much more. We will use the data from GitHub to predict the trending
technologies of the future.
Selecting the endpoint
The queries in our further project will be mostly based on searches within different repositories.
In order to obtain results based on our criteria we will use the following endpoint:

The argument list is divided into three parts:


• q: Query
• sort: Field to sort on
• order: Ascending or descending
Within the query part we can add multiple additional arguments: we will use language
(programming language the code is written in), created (the date it was created on) and pushed
(the date of the last update in the repository).
Finally, the endpoint will contain all the arguments used for query:
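As a sketch, the full request URL can be assembled from these arguments with the standard library. The repository search endpoint and the language, created, and pushed qualifiers are part of GitHub's search syntax; the helper name and the concrete values (language, dates, sort field) are assumptions chosen for illustration.

```python
import urllib.parse

SEARCH_URL = "https://api.github.com/search/repositories"


def repo_search_url(language, created, pushed, sort="stars", order="desc"):
    """Combine the query qualifiers with the sort and order arguments."""
    q = "language:{} created:{} pushed:{}".format(language, created, pushed)
    params = {"q": q, "sort": sort, "order": order}
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)


# Illustrative values (assumptions): Python repositories created after
# 2017-01-01 and updated after 2017-06-01, most starred first.
url = repo_search_url("python", ">2017-01-01", ">2017-06-01")
print(url)
```

urlencode takes care of escaping the colons, spaces, and comparison signs inside the query part.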
Q) Explain the API processes of YouTube.
YouTube is certainly the most popular video sharing social network and helps users to share and
monetize their media content. It has a very rich content ranging from amateur users to
professionals recording quality videos. On top of the media content it contains different kinds of
data such as comments, statistics, or captions automatically extracted from video sound. The
main advantage of YouTube is the number of users and the volume of new videos uploaded
every day. These numbers are huge and increase every day, making a data goldmine of this
social media platform.
Q) Explain the API processes of Pinterest.
Pinterest has become one of the most important photo sharing platforms over the last few years.
It allows users to share photos found on the internet with other users by creating pins. In our
further analysis we will analyze the content and relationships between users. In order to gather
content, we have to establish a connection to the Pinterest API.
Creating an application
• Create a Pinterest account or use an existing one.
• Go to
• Go to Apps.
• Create a new app (you have to agree to the terms and conditions first).
• You will be redirected to the app management interface where you can find your access
token.
• Save it for further use in your code.
Selecting the endpoint

There are multiple endpoints that will be useful for network analysis. There are three main
objects that we can get with the Pinterest API:
• User
• Board
• Pins
Q) What are the basic cleaning processes?
Social media contains different types of data: information about user profiles, statistics
(number of likes or number of followers), verbatims, and media. Quantitative data is very
convenient for an analysis using statistical and numerical methods, but unstructured data
such as user comments is much more challenging. To get meaningful information, one
has to perform the whole process of information retrieval. It starts with the definition of
the data type and data structure. On social media, unstructured data is related to text,
images, videos, and sound and we will mostly deal with textual data. Then, the data has
to be cleaned and normalized. Only after all these steps can we delve into the analysis.
Data type and encoding
Comments and conversations are textual data that we retrieve as strings. In brief, a string
is a sequence of characters represented by code points. Every string in Python is a
sequence of Unicode code points, which range from 0 through 0x10FFFF (1,114,111 decimal).
This sequence then has to be represented as a sequence of bytes (values from 0 to 255) in
memory. The rules for translating a Unicode string into a sequence of bytes are called
an encoding.
Encoding plays a very important role in natural language processing, because people
use more and more characters such as emojis or emoticons, which replace whole words
and express emotions. Moreover, in many languages there are accents that go beyond
the regular English alphabet. In order to deal with all the processing problems that might
be caused by these, we have to use the right encoding, because comparing two strings
with different encodings is actually like comparing apples and oranges. The most
common one is UTF-8, used by default in Python 3, which can handle any type of
character. As a rule of thumb, always normalize your data to Unicode UTF-8.
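The relationship between code points, bytes, and normalization can be checked with a short sketch. The sample string is an arbitrary choice made for illustration.

```python
import unicodedata

text = "caf\u00e9 \U0001F600"  # "cafe" with an accented e, a space, an emoji

# Each character is one code point, whatever its size in bytes.
code_points = [hex(ord(ch)) for ch in text]
utf8_bytes = text.encode("utf-8")  # str -> bytes using the UTF-8 rules
print(len(text), len(utf8_bytes))
print(code_points)

# Decoding with the wrong encoding does not fail here, it just gives garbage.
wrong = utf8_bytes.decode("latin-1")
print(wrong == text)

# The same visible text can be stored as different code point sequences.
composed = "caf\u00e9"      # single code point for the accented letter
decomposed = "cafe\u0301"   # plain e followed by a combining accent
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
```

This is why text collected from several sources is usually decoded as UTF-8 and passed through one normalization form before any comparison is made.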

Structure of data
A good solution is to store the data in a tabular format in a pandas DataFrame, which has
multiple advantages for further processing. First of all, rows are indexed, so search
operations become much faster. There are also many optimized methods for different
kinds of processing and, above all, it allows you to optimize your own processing by using
functional programming. Moreover, a row can contain multiple fields with metadata about
verbatims, which are very often used in our analysis. It is worth remembering that a
dataset in pandas must fit into RAM. For bigger datasets we suggest the use of
SFrames.

Pre-processing and text normalization


Preprocessing is one of the most important parts of the analysis process. It reformats the
unstructured data into uniform, standardized form. The characters, words, and sentences
identified at this stage are the fundamental units passed to all further processing stages.
The quality of the preprocessing has a big impact on the final result of the whole process.
There are several stages of the process: from simple text cleaning by removing white
spaces, punctuation, HTML tags, and special characters up to more sophisticated
normalization techniques such as tokenization, stemming, or lemmatization. In general,
the main aim is to keep all the characters and words that are important for the analysis
and, at the same time, get rid of all others, while the text corpus is maintained in
one uniform format.
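The simple cleaning stages can be sketched with the standard library alone. The regular expressions and the tiny stop word list are illustrative assumptions, not a complete cleaning pipeline.

```python
import re
import string

# Tiny illustrative stop word list (assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "is", "to", "and", "of", "in"}


def clean_text(text):
    """Lowercase the text and strip URLs, HTML tags, and punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # URLs and hyperlinks
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tags
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()   # collapse white space


def tokenize(text):
    """Split the cleaned text into word tokens and drop stop words."""
    return [tok for tok in clean_text(text).split() if tok not in STOP_WORDS]


raw = "<p>The NEW car is great!!! See https://example.com/x </p>"
print(clean_text(raw))
print(tokenize(raw))
```

URLs are removed before punctuation on purpose: stripping punctuation first would break the links into fragments that the URL pattern no longer matches.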
Q) What is the process of duplicate removal?
Q) Explain MongoDB and MongoDB with Python in detail.
MongoDB (from humongous) is a cross-platform document-oriented database. Classified
as a NoSQL database, MongoDB eschews the traditional table-based relational database
structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the
format BSON), making the integration of data in certain types of applications easier and
faster. MongoDB was originally released under a combination of the GNU Affero General
Public License and the Apache License; since October 2018 the server has been published
under the Server Side Public License (SSPL).
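Storing and reading social data from Python can be sketched as follows, assuming that the third-party pymongo package is installed and that a MongoDB server is listening on the default local port. The database name, collection name, and document fields are hypothetical choices for illustration.

```python
def tweet_to_document(tweet):
    """Keep only the fields we want to store for one tweet."""
    return {
        "tweet_id": tweet["id"],
        "text": tweet["text"],
        "user": tweet.get("user", {}).get("screen_name"),
    }


def store_and_find(documents, user, uri="mongodb://localhost:27017/",
                   db_name="social", coll_name="tweets"):
    """Insert the documents and return those written by the given user."""
    from pymongo import MongoClient  # third-party driver
    client = MongoClient(uri)
    collection = client[db_name][coll_name]  # created lazily on first insert
    collection.insert_many(documents)
    return list(collection.find({"user": user}))


sample = {"id": 7, "text": "hi", "user": {"screen_name": "alice"}, "lang": "en"}
print(tweet_to_document(sample))
```

Because documents are schema-less, tweets with missing or extra fields can be stored in the same collection without any migration step; store_and_find is not called here since it needs a running server.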
