DS Retest
5 MARK
1. As a first option, you can drop observations that have missing values, but doing this loses
information, so be mindful of this before you remove them.
2. As a second option, you can impute missing values based on other observations; again, there is a
risk of losing the integrity of the data because you may be operating from assumptions and
not actual observations.
3. As a third option, you might alter the way the data is used so that it handles null values
effectively; a short pandas sketch of all three options follows this list.
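A minimal pandas sketch of the three options (the DataFrame and column names here are hypothetical):

import pandas as pd

df = pd.DataFrame({"age": [25, None, 40], "city": ["Pune", "Delhi", None]})

# Option 1: drop rows with missing values (information is lost)
dropped = df.dropna()

# Option 2: impute from other observations (an assumption, not an observation)
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

# Option 3: keep the nulls and use null-aware operations downstream
average_age = df["age"].mean()  # pandas skips NaN by default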
Step 5: Validate and QA
At the end of the data cleaning process, you should be able to answer these questions as a part
of basic validation:
• Does the data follow the appropriate rules for its field?
• Does it prove or disprove your working theory, or bring any insight to light?
• Can you find trends in the data to help you form your next theory?
False conclusions because of incorrect or “dirty” data can inform poor business strategy and
decision-making. False conclusions can lead to an embarrassing moment in a reporting meeting
when you realize your data doesn’t stand up to scrutiny. Before you get there, it is important to
create a culture of quality data in your organization. To do this, you should document the tools
you might use to create this culture and what data quality means to you.
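As an illustration of the first validation question, here is a small hypothetical pandas check that each field follows its rules:

import pandas as pd

df = pd.DataFrame({"age": [25, -3, 40],
                   "email": ["a@x.com", "b@y.com", "not-an-email"]})

# Field rules: age must be a plausible human age, email must look like an address
age_ok = df["age"].between(0, 120)
email_ok = df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", regex=True)

# Rows failing basic validation need investigation before reporting
print(df[~(age_ok & email_ok)])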
Starting MongoDB
First, create the default data directory:
md \data\db
Then we need to go to the folder where mongod.exe is stored and run the following command:
bin\mongod.exe
Once the MongoDB server is running in the background, we can switch to our Python environment to
connect and start working.
We will not go into the detailed usage of MongoDB, which is beyond the scope of this book. We will see
the most common functionalities required for analysis projects. We highly recommend reading the official
MongoDB documentation.
from pymongo import MongoClient

client = MongoClient('localhost:27017')
The database structure of MongoDB is similar to that of SQL systems: in SQL you have databases, and inside
databases you have tables. In MongoDB you have databases, and inside them you have collections.
Collections are where you store the data, and databases store multiple collections. As MongoDB is a
NoSQL database, your collections do not need to have a predefined structure; you can add documents of any
composition as long as they are valid JSON objects. By convention, though, it is best practice to have a common
general structure for documents in the same collection.
To access a collection named articles in the database scrapper we do this:
db_scrapper = client.scrapper
collection_articles = db_scrapper.articles
Once you have the client object initialized, you can access all the databases and collections very easily.
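For example, you can list what exists on the server (a small sketch assuming the client and db_scrapper objects created above):

print(client.list_database_names())         # all databases on the server
print(db_scrapper.list_collection_names())  # all collections inside 'scrapper'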
• Insert: To insert documents into a collection, we build a list of new documents to insert into
the database:
docs = []
docs.append({
    "author": "...",
    "content": "...",
})
Inserting all the docs at once:
db.collection.insert_many(docs)
Or you can insert them one by one:
db.collection.insert_one(doc)
To fetch all documents in batches of a fixed size, you can combine skip and limit:
batch_size = 100
iteration = 0
while True:
    docs = [doc for doc in db.collection.find().skip(batch_size * iteration).limit(batch_size)]
    if not docs:
        break
    # process the current batch of docs here
    iteration += 1
To fetch documents using search queries, where the author is Jean Francois:
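A minimal sketch of such a fetch (using the same generic db.collection handle as the other examples):

docs = [doc for doc in db.collection.find({'author': 'Jean Francois'})]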
• Update: To update a document where the author is Jean Francois and set the attribute
published as True:
query_search = {'author': 'Jean Francois'}
query_update = {'$set': {'published': True}}
db.collection.update_many(query_search, query_update)
Or you can update just the first matching document:
db.collection.update_one(query_search, query_update)
• Delete: To delete all documents matching the search query, or just the first match:
db.collection.delete_many(query_search)
db.collection.delete_one(query_search)
To drop the entire database (in PyMongo this is done from the client):
client.drop_database('scrapper')
We saw how to store and access data from MongoDB. MongoDB has gained a lot of popularity and is
the preferred database choice for many, especially when it comes to working with social media data.
Python3 example:
# import MongoClient
from pymongo import MongoClient

# Creating a client and a database named 'GFG'
client = MongoClient('localhost', 27017)
db = client['GFG']
print("Database is created!!")
Output:
Database is created!!
In the above example, it is clearly shown how a database is created. When creating a
client, the localhost address along with its port number, which is 27017 here, is passed to
MongoClient. Then, by using the client, a new database named ‘GFG’ is created. We can
check whether the database is present in the list of databases using the following code:
list_of_db = client.list_database_names()
if "GFG" in list_of_db:
    print("Exists !!")
Output:
Exists!!
Thus the Facebook number you will most commonly see is the total for all interactions including
likes, shares and comments. However, are total Facebook interactions a good indication of
content resonance?
Many people have argued that Facebook shares are more powerful than ‘likes’ and you should
really just look at Facebook shares to see what content is resonating with people. From our
research things are not quite so simple and there are benefits in looking at total Facebook
interactions.
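As a purely illustrative sketch (the post data here is made up), total interactions are simply the sum of the three counts, and a share ratio shows how share-heavy a post's engagement is:

posts = [
    {"title": "Post A", "likes": 120, "shares": 15, "comments": 30},
    {"title": "Post B", "likes": 40, "shares": 35, "comments": 10},
]
for post in posts:
    # Total Facebook interactions = likes + shares + comments
    post["total_interactions"] = post["likes"] + post["shares"] + post["comments"]
    post["share_ratio"] = post["shares"] / post["total_interactions"]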
The FB developer pages go on to say “A single click on the Like button will ‘like’ pieces of
content on the web and share them on Facebook. You can also display a Share button next to
the Like button to let people add a personal message and customize who they share with.”
See https://developers.facebook.com/docs/plugins/like-button
Facebook also says to users that “a ‘Like’ is a way to give positive feedback or to connect with
things you care about on Facebook. You can like content that your friends post to give them
feedback or like a Page that you want to connect with on Facebook.”
As an example of a ‘like’ being a quick way to share content, my partner liked a post by Andy
Murray tonight and it immediately appeared at the top of my News Feed along with the image,
as you can see below.
Do ‘likes’ tell us anything about users and content resonance?
Some people argue that the ‘like’ is the “lazy” option and means very little. It is quick and easy
to ‘like’ something, unlike commenting or sharing, where users need to spend some time writing
about the post they are sharing. This is definitely true: shares are harder to earn. However, for
many busy users this is the advantage of the ‘like’ button and it means they can share more
content than they would otherwise.
A Marketo and Brian Carter study found that people were eight times as likely to ‘like’ as share
or comment.
The degree to which people ‘like’ content was revealed in a Pew Research Center survey. The
survey asked thousands of Americans about their social media sharing and found that 44% of
Facebook users “liked” content posted by their friends at least once a day, with 29% doing so
several times per day.
A further separate study published by the National Academy of Sciences found that by
analyzing the “likes” of 86,000 volunteers they could predict the characteristics of the person
with incredible accuracy. They could predict whether someone was:
They found they could also predict a person’s political leaning, Democrat or Republican, with
85% accuracy. This would appear to indicate quite strongly that people ‘like’ content that
resonates with them.
Thus maybe, and I accept this goes against much conventional wisdom, the frequency and
nature of ‘likes’ is a more important indicator of content resonance than is often thought.
However, let’s look at the argument for shares and why these may be a better indicator.
The argument for valuing Facebook Shares over 'Likes'
Marketers generally place a higher value on Facebook sharing than on ‘likes’ according to
this AdWeek article. The argument for the greater value of shares runs along the lines that
sharing involves greater commitment and is more likely to mean content is shown in News
Feeds.
The first argument about commitment is certainly true, as whilst a ‘like’ is frictionless, to share
can require more effort, although there is now a ‘one click’ option in Facebook to simply share
with Friends without commenting, see image below.
However, to share wider you get a dialogue box which allows you to comment along with your
share and to control the image that gets displayed with the share. You can also control where
you share and who sees your share.
The second argument for shares is the potential for shared content to be seen by far more
people. A share will for example show more clearly on your own profile page than a ‘like’ which
comes under recent activity. However, it is unlikely many people will go to your personal profile
page. What most people will see is their News Feed. It is claimed that the Edgerank algorithm,
which determines what shows up in the News Feed, gives far more weight to ‘shares’ than ‘likes’
leading to more visibility for content shared than ‘liked’. However, whilst this may be the case,
the appearance of articles in News Feed is a complex matter.
Facebook say:
“The goal of News Feed is to show you the stories that matter most to you. To do this, we use
ranking to order stories based on how interesting we believe they are to you: specifically, whom
you tend to interact with, and what kinds of content you tend to like and comment on.”
Thus the Edgerank algorithm looks at many factors. Facebook is also constantly updating the
algorithm and the control a user can exercise over what appears in a News Feed. For example,
the following are two updates:
“We’ve discovered that if people spend significantly more time on a particular story in News
Feed than the majority of other stories they look at, this is a good sign that content was relevant
to them.”
“To help prioritize stories, and make sure you don’t miss posts from particular friends and Pages,
you can now select which friends and Pages you would like to see at the top of your News Feed.”
Facebook’s aim is to ensure what appears in a News Feed, other than sponsored paid content, is
relevant to the user. This clearly takes into account many factors as we can see from above,
including an individual’s privacy settings. Shares and likes are just one part of a more complex
picture about what appears and where in News Feeds. As organic reach declines there is a view
that to make sure your content is seen you really have to pay to play.
Shares may also be more valuable as they can be driven by different reasons to ‘likes’.
The Marketo and Brian Carter study found that people share content to share tips and advice,
to warn people, to pass on deals, to show they are part of a community and to entertain their
friends with amusing or inspirational posts. ‘Likes’ may be used more, as Facebook indicate, to
give positive feedback to people and things you care about.
I think there is another potential argument in support of shares over ‘likes’ namely that shares
are a little less easy to automate than ‘likes’. Whilst Facebook is constantly rooting out false
accounts, ‘likes’ may be slightly more at risk of exaggeration than shares through automation
and non-human activity.
Do shares correlate higher with Google rankings?
Searchmetrics produce a respected report each year looking at the factors that distinguish well-
placed sites from those with lower positions in the Google search results. We took a look at
the 2014 report and its correlations for Facebook activities.
The report found, not surprisingly, that relevant, quality content ranks better on average, and is
“identifiable by properties such as a higher word-count and semantically comprehensive
wording.” However, the report also looked in detail at the correlations of various factors with
rankings.
The top ranking correlations from their latest survey are shown below. Please remember, as the
report’s authors are at pains to point out, that correlation does not mean causation.
I have highlighted the Facebook correlations which are of interest to us here. We can see that
Facebook shares have a slightly higher correlation than comments or ‘likes’ to Google
rankings. However, we can also see that the correlation of total Facebook interactions (namely
shares, likes and comments) is the same as that for Facebook shares. This does not mean that
higher Facebook activity causes higher Google rankings but it does show that total Facebook
interaction activity has the same correlation as Facebook shares.
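To make the rank-correlation idea concrete, here is a hedged sketch with made-up numbers; Spearman rank correlation is the method typically used in such ranking studies, though the report's exact methodology is not reproduced here:

import pandas as pd

# Hypothetical data: one row per URL, position 1 = top Google result
df = pd.DataFrame({
    "position": [1, 2, 3, 4, 5, 6],
    "fb_shares": [900, 750, 400, 420, 150, 100],
})

# A negative coefficient means more shares go with better (lower) positions
print(df["position"].corr(df["fb_shares"], method="spearman"))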
10 MARK
Different Tools of Google Analytics
Below is a list of Google Analytics tools that help you improve the quality of your data and
increase traffic to your website:
1. Google Tag Assistant: Google Tag Assistant is a debugging tool. It is a Chrome extension
that is used to check whether Google Analytics, AdWords conversion tracking, Google Tag
Manager, and others are working properly or not. With the Google Tag Assistant tool, you
can troubleshoot any Google Analytics issues quickly and fix them immediately. The Tag
Assistant is easy to use: it shows all the tags present on the web page you are visiting and
gives automatic suggestions to solve problems. Google Tag Assistant helps you to
find invalid events, missing tags, filters, etc.
Google Analytics URL Builder: Google Analytics URL Builder lets you create a campaign URL
based on the current URL and then use automatic reporting tools that track the
URL’s progress. Campaign URLs can be used to track which promotions drive traffic to the site, as
sketched below.
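What the URL Builder generates is just a URL with standard UTM query parameters appended; a minimal Python sketch of the same idea (the URL and parameter values are hypothetical):

from urllib.parse import urlencode

base_url = "https://www.example.com/landing-page"
utm_params = {
    "utm_source": "facebook",      # where the traffic comes from
    "utm_medium": "social",        # the marketing medium
    "utm_campaign": "spring_sale", # the promotion being tracked
}
campaign_url = base_url + "?" + urlencode(utm_params)
print(campaign_url)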
5. Google Analytics Table Booster: Google Analytics Table Booster is a Chrome extension used
to enhance Google Analytics’ data grid. To enhance the data grid, it provides three types of
visualization; every row can use a different type of visualization, or the three types can be
combined. It is the best way to evaluate the performance of the data.
6. GA Debugger: GA Debugger is a Google Analytics debugger. It is a Chrome extension that is
used for debugging the Google Analytics tracking code. It allows users to debug their website
and to see how other websites have implemented Google Analytics tracking. It is easy
to use: after adding GA Debugger to Chrome, you have to turn it on, then open the
console by pressing Ctrl + Shift + I, and it automatically starts debugging.
7. Google Tag Manager Injector: Google Tag Manager Injector is an open-source Chrome
extension that is used to inject Google Tag Manager container tags into web pages. The
advantage of using this tool is that it doesn’t require any JavaScript code to be added in order
to preview GTM (Google Tag Manager) containers.
8. WASP: WASP stands for Web Analytics Solution Profiler. It is used for debugging
Google Analytics tracking issues, and it also allows users to debug other tracking issues. With the
help of the WASP crawler, you can get detailed information about the tags present on a website.
WASP can audit any content or tag on the website.
9. Mobile traffic behaviour: Mobile Internet search nearly doubled between 2012 and 2013, so
maximize your possibilities by developing a mobile website and measuring its traffic behaviour.
This metric shows how the mobile market affects your organisation’s website traffic and can be
an indicator of your site’s customer experience.
10. Audience Location: Audience Location lets you know the audience’s physical location,
which improves marketing and the business. It helps you determine users’ areas of interest
and generate higher revenue. Audience Location helps you analyse whether you are reaching the
right audience or not, understand the website’s traffic, set the best marketing strategy, etc.
11. Events: This is the most important tool; it records user activity on the webpage in real
time. It records the user’s information, such as how the user scrolls the web page, navigates to
other web pages of the site, how much time they spend, and much more. These activities help
you analyse the website’s speed, behaviour and response time, and improve the website by
solving problems.
12. Supermetrics Add-on: Supermetrics is widely used alongside Google Analytics tools. It provides
different products, but Supermetrics is mostly used with Google Sheets. It allows you to gather
data from different tools into a Google Sheet; after that, you can create your own
dashboard or connect the Google Sheet with other applications. It has free and paid
versions.
2) Analyse Facebook API interaction in DS
What is the Facebook Graph API?
Facebook Graph API is the primary way to interact with Facebook programmatically, as already
stated earlier. With the Graph API, apps can read & write data to Facebook. All the Facebook
SDKs use the Graph API too. I will talk about the SDKs further down the article & how it is
different from the Graph API. For now, let’s focus on the API.
You must have come across the Login via Facebook feature on third-party websites or apps. That is
made possible by the Graph API. The API uses an OAuth token to let users log in to apps via
their Facebook credentials, though it doesn’t share the credentials with the third-party portals.
The Facebook API verifies your credentials, generates a unique authentication token and passes
it to the third-party app, completing the login process.
Graph API is an HTTP-based API via which apps can post on users’ walls, upload photos, share
events & stuff.
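Since it is an HTTP API, you can call it directly without any SDK; here is a minimal sketch using the requests library (the token value is a placeholder you must supply, and the exact fields are illustrative):

import requests

ACCESS_TOKEN = "..."  # placeholder: a valid user access token

# Read the /me node over plain HTTPS, asking for specific fields
resp = requests.get(
    "https://graph.facebook.com/me",
    params={"fields": "first_name,email", "access_token": ACCESS_TOKEN},
)
print(resp.json())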
As we can figure out from the name, the API is a social graph. Just like any other graph, it has
nodes & edges. There is a third element known as the field. What are they? I’ll explain.
Nodes – Nodes represent individual users. Technically they are individual objects with a unique
id which are linked with each other via edges. Other nodes in the graph are our friends’ nodes.
Edges – An edge represents a link between one node & another.
Field – A field represents data about an object, like a user’s name or the pages they like & stuff.
#!/usr/bin/env python
import json
import facebook

token = "..."  # placeholder: your Facebook user access token goes here

def main():
    graph = facebook.GraphAPI(token)
    profile = graph.get_object('me', fields='first_name,location,link,email')
    print(json.dumps(profile, indent=4))

if __name__ == '__main__':
    main()
Here inside the main() method, we are trying to get the information about our own Facebook account. Let's try to understand
the code line by line:
• We have extracted the desired fields into a variable profile. Here, notice that 'me' in the get_object() method indicates that
we are doing it for our own account.
• location: The person's current location as entered by them on their profile. This field is not related to check-ins.
• email: The person's primary email address listed on their profile. This field will not be returned if no valid email address
is available.
Besides these, there are numerous other fields. For the full list and description of fields, refer to the official Facebook Graph
API documentation.
Output: a JSON dump of the requested profile fields.
What is the Life Cycle of an Analytics Project? (develop with stories 😂)
I have a strong feeling that running an analytics project is pretty similar to building a house. First, the
architect meets his/her client, understands their needs and comes up with an actionable blueprint.
Then it requires collecting building blocks such as cement, steel, bricks, etc. You have to learn the
features of your building materials and choose the right materials for construction. Otherwise, you may
end up having a house that can collapse easily. This is like the data collection process, where you have
to do some EDA or feature engineering to understand the data and find the right data to solve your analytics
problems, or else you may not manage to get solid or concrete results from your analysis!
With the building materials and blueprint handy, you can start building the house (Run Analysis). After
construction is finished, a home inspection and quality checks are required to ensure safety. Similarly, we
need to document our analytics project regarding the methodologies, conclusions and limitations.
If I’m asked the most critical phase of the whole cycle, I would say Understanding and Planning without
any hesitation, because the main purpose of data science and analysis is not to create a project with fancy
technology, but to solve real problems. Therefore, the success of an analytics project is highly dependent
on how well you understand the situation, define the problem and translate the business questions into an
analytics question. From that standpoint, it’s always worth spending time thinking about the broader picture.
Analytics Plans
Before diving into the analysis, let’s come up with an analytics plan and set up another follow-up meeting.
It will provide a high-level overview of the plan, giving a clear picture of the next steps and drawing the link
between technical actions and the bigger picture from the business side. Here are some key elements in
my Analytics Plans: