Version Control System
A Version Control System (VCS) is software that helps software developers work
together and maintain a complete history of their work.
Listed below are the functions of a VCS −
Allows developers to work simultaneously.
Does not allow overwriting each other’s changes.
Maintains a history of every version.
Following are the types of VCS −
Centralized version control system (CVCS).
Distributed/Decentralized version control system (DVCS).
A centralized version control system (CVCS) uses a central server to store all files and to
enable team collaboration. Its major drawback is a single point of failure: the central server.
If the central server goes down for an hour, no one can collaborate during that hour. In the
worst case, if the disk of the central server is corrupted and no proper backup has been taken,
the entire history of the project is lost. This is where a distributed version control system
(DVCS) comes into the picture.
DVCS clients do not just check out the latest snapshot of the directory; they fully mirror the
repository. If the server goes down, the repository from any client can be copied back to the
server to restore it. Every checkout is a full backup of the repository. Git, a DVCS, does not
rely on the central server, which is why you can commit changes, create branches, view logs,
and perform many other operations while you are offline. You need a network connection only to
publish your changes and to fetch the latest changes.
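As a minimal sketch of the offline operations mentioned above, the commands below create a throwaway repository in a temporary directory and then commit, branch, and view the log without any network access. The directory, file name, branch name, and user identity are illustrative assumptions.

```shell
set -eu
# Throwaway repository; the path and all names are illustrative only.
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git config user.name "Example User"
git config user.email "user@example.com"

# None of the commands below needs a network connection.
echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

git branch experiment          # create a branch locally
git log --oneline              # view the history locally

commit_count=$(git rev-list --count HEAD)
branch_count=$(git branch --list | wc -l)
```

Only publishing these commits to a server, or fetching other people's commits, would require a connection.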
Advantages of Git
DVCS Terminologies
Local Repository
Every VCS tool provides a private workspace in the form of a working copy. Developers make
changes in their private workspace and, after a commit, these changes become part of the
repository. Git takes this one step further by giving each developer a private copy of the
whole repository. Users can perform many operations on this repository, such as adding,
removing, renaming, and moving files, committing changes, and many more.
Working Directory and Staging Area or Index
The working directory is the place where files are checked out. In a CVCS, developers
generally make modifications and commit their changes directly to the repository. Git uses a
different strategy: it does not automatically include every modified file in a commit.
Whenever you perform a commit operation, Git looks at the files present in the staging area.
Only the files in the staging area are considered for the commit, not all the modified files.
Let us see the basic workflow of Git.
Step 1 − You modify a file in the working directory.
Step 2 − You add the modified files to the staging area.
Step 3 − You perform a commit operation, which records the staged files in your local
repository. A subsequent push operation publishes the changes to the remote Git repository.
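The three steps can be sketched as follows, assuming a throwaway repository and a placeholder file named main.c (both hypothetical). The push is left out because this sketch has no remote repository.

```shell
set -eu
# Throwaway repository; the path, file name, and identity are illustrative only.
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git config user.name "Example User"
git config user.email "user@example.com"

# Step 1: change a file in the working directory.
echo 'int main(void) { return 0; }' > main.c
git status --short             # main.c is listed as untracked, not yet staged

# Step 2: add the file to the staging area.
git add main.c
staged=$(git diff --cached --name-only)

# Step 3: commit what is staged to the local repository.
git commit -q -m "Add main.c"
committed=$(git ls-tree -r --name-only HEAD)
```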
Suppose you modified two files, namely “sort.c” and “search.c”, and you want a separate
commit for each change. You can add one file to the staging area and commit it. After the
first commit, repeat the same procedure for the other file.
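# First commit: stage and commit sort.c only
[bash]$ git add sort.c
[bash]$ git commit -m "Added sort operation"
# Second commit: stage and commit search.c
[bash]$ git add search.c
[bash]$ git commit -m "Added search operation"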
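To check that each commit really contains only the file that was staged, the same sequence can be run as a self-contained sketch. The temporary repository, file contents, and commit messages here are assumptions made for illustration.

```shell
set -eu
# Throwaway repository; file contents and messages are placeholders.
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git config user.name "Example User"
git config user.email "user@example.com"

# Two changed files in the working directory.
echo "/* sort */" > sort.c
echo "/* search */" > search.c

# First commit: only sort.c is staged, so only sort.c is committed.
git add sort.c
git commit -q -m "Added sort operation"
first=$(git diff-tree --no-commit-id --name-only -r --root HEAD)

# Second commit: now stage and commit search.c.
git add search.c
git commit -q -m "Added search operation"
second=$(git diff-tree --no-commit-id --name-only -r HEAD)

git log --oneline              # two separate commits
```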
Version Control
Version control is a tool that allows you to keep track of changes to a number of files. Using
version control for package development means that you can easily revert to previous
package versions, collaborate with multiple developers, and record reasons for the changes
that are made.
A web interface called GitHub allows users to visually see their tracked changes and has
additional features, such as issues, milestones, review requests, and commenting. With
GitHub, changes to code can be associated with bugs and feature requests. GitHub also
enables open science practices by sharing what goes on “behind-the-scenes” in the code. In
addition, GitHub is a great tool for collaborative work because issues, comments, and peer
reviews can be associated with a specific GitHub user account. Each user can edit the code at
the same time and handle conflicts appropriately.
In this course, we will be using Git and GitHub in conjunction with RStudio to complete
version control workflows. Below is a generic depiction of what a version control workflow
might be.
There is a main version of the code that people are collaboratively developing. Each
contributor has their own version of this code online and locally. Changes are made locally,
sent to their online version, and then combined with the collaborative version of the code.
Contributors are able to get the changes from other users by syncing their local version with
the collaborative version of the code. That is the main concept of version control.
To use Git for version control, you will need to download it, configure RStudio, and set up
SSH keys. SSH keys allow you to connect to GitHub without specifying your username and
password each time. Follow these steps before continuing with this lesson. This RStudio blog
post is another good resource for installing Git and using SSH.
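As a rough sketch of what setting up an SSH key involves, the commands below generate a key pair with OpenSSH's ssh-keygen. For demonstration the key is written to a temporary directory with an empty passphrase and a placeholder email address; these are assumptions of this sketch, and a real key normally lives in ~/.ssh and is protected by a passphrase.

```shell
set -eu
# Demonstration only: write the key pair to a temporary directory.
keydir=$(mktemp -d)
ssh-keygen -q -t ed25519 -N "" -C "user@example.com" -f "$keydir/id_ed25519"

# The public half is what gets added to your GitHub account's SSH keys.
cat "$keydir/id_ed25519.pub"

# Once a key is registered with GitHub, the connection can be checked with:
#   ssh -T git@github.com
```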
Once you have downloaded and installed Git, you will need to tell RStudio where it can be
found. In RStudio, navigate to Tools >> Global Options, and then click the Git/SVN tab. At
the top, there is a place for the filepath of the Git executable. Click Browse and select your
Git executable (on Windows, a file with the .exe extension).
Git/GitHub Definitions
Here are some terms to be familiar with as we go through our recommended version control
workflow.
Term Definition
There are many ways to use Git, GitHub, and RStudio in your version control workflow. We
will discuss the method USGS-R has predominantly used. It is most similar to the
“fork-and-branch” workflow (see the additional resources section below). There are three
locations of the repository: 1) the canonical repository on GitHub (“upstream”), 2) the forked
repository on GitHub (“origin”), and 3) the user’s local repository.
The initial setup requires a canonical repository on GitHub. To create a new repo on GitHub,
follow these instructions. Once there is a canonical repository, the user looking to contribute
to this code base would Fork the repository to their own account.
Next, the user would create the local version of the forked repo in an RStudio project. When
creating a new RStudio project, select Version Control,
To find the SSH address, click “Clone or download” on the GitHub repo.
Then you can select “Create Project” and it will open a new RStudio project. You should see
a new tab in the environment pane that you have not seen before called “Git”.
Next, you need to set up your local repository to recognize the main repository as the
“upstream” version. To do this, click the “More” drop-down in your RStudio Git tab, then
select “Shell…”.
In the command prompt, type git remote -v and hit enter. This will show you which remote
repositories (available online) are connected to your local repository. You should initially
see only your forked repository, labeled “origin”. To add the main repo as an “upstream”
repository, type git remote add upstream <SSH address> with the correct SSH address and hit
enter.
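The three-repository setup can be tried without GitHub. In the sketch below, two local bare repositories stand in for the canonical repository and the fork; their paths are hypothetical and replace the SSH addresses you would use in practice.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Local bare repositories stand in for the two GitHub repositories.
git init -q --bare canonical.git     # plays the role of "upstream"
git init -q --bare fork.git          # plays the role of your fork, "origin"

# Cloning the fork registers it as "origin" automatically.
# (Git may warn that the cloned repository is empty; that is expected here.)
git clone -q fork.git local
cd local

# Add the canonical repository as a second remote named "upstream".
git remote add upstream "$work/canonical.git"

git remote -v                        # lists both origin and upstream
remotes=$(git remote | sort | tr '\n' ' ')
```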
Now that you have the three repositories set up, you can start making changes to the code and
commit them. First, you would make a change to a file or files in RStudio. When you save
the file(s), you should see them appear in the Git tab. A blue “M” icon next to them means
they were existing files that you modified, a green “A” means they are new files you added,
and a red “D” means they were files that you deleted.
Getting upstream changes
To get changes available on the remote canonical repository into your local repository, you
will need to “pull” those changes down. To do this, go to the Git shell through RStudio (Git
tab >> More >> Shell) and use the command git pull with the name of the remote followed by
the name of the branch to pull, e.g. git pull upstream master. It is generally a good idea to
do this before you start making changes to avoid conflicts.
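A self-contained sketch of pulling from upstream follows. A local bare repository stands in for the canonical GitHub repository and a second working repository plays a colleague who publishes a change; all paths, names, and file contents are assumptions for illustration.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q --bare canonical.git           # stand-in for the canonical repo

# A colleague commits a change and publishes it to the canonical repo.
git init -q colleague
cd colleague
git symbolic-ref HEAD refs/heads/master    # name the first branch "master"
git config user.name "Colleague"
git config user.email "colleague@example.com"
echo "version 1" > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"
git push -q "$work/canonical.git" master

# Your local repository, with the canonical repo registered as "upstream".
cd "$work"
git init -q local
cd local
git symbolic-ref HEAD refs/heads/master
git remote add upstream "$work/canonical.git"

# Pull the "master" branch from the remote named "upstream".
git pull -q upstream master
pulled=$(cat analysis.R)
```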
Committing changes
Click the check box next to the file(s) you would like to commit. To view the changes, select
“Diff”.
You can select the different files and it will show what was added (highlighted green) and
what was deleted (highlighted red). Then, type your message about the commit and click
“Commit”.
It’s best to keep commits as concise and specific as possible. So, commit often and with
useful messages. When you are ready to add these changes to the main repository, you need
to create a pull request. First, push your changes to your remote fork (aka origin). Either use
the “push” button in RStudio (this only works when you are on your master branch) OR type
the git command into the shell.
To get to the shell, go to the “Git” tab, then click “More”, and then “Shell…”. Now type your
git command, specifying the remote the changes are going to and the branch being pushed: git
push origin master will push commits from your local “master” branch to your remote repo on
GitHub (“origin”).
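The commit-then-push sequence can be sketched as below. A local bare repository stands in for your fork on GitHub, and the file name and commit message are placeholders.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q --bare fork.git                # stand-in for your fork on GitHub

git init -q local
cd local
git symbolic-ref HEAD refs/heads/master    # name the first branch "master"
git config user.name "Example User"
git config user.email "user@example.com"
git remote add origin "$work/fork.git"

# Stage and commit a change locally.
echo "x <- 1" > script.R
git add script.R
git commit -q -m "Add script that defines x"

# Push the local "master" branch to the remote named "origin".
git push -q origin master

local_head=$(git rev-parse master)
remote_head=$(git --git-dir="$work/fork.git" rev-parse master)
```

After the push, the fork's master branch points at the same commit as the local one.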
To submit a pull request, you need to be on your remote fork’s GitHub page. The URL would
say github.com/YOUR_USERNAME/REPO_NAME,
e.g. github.com/lindsaycarr/dataRetrieval. It also shows where your repo was forked from:
From this page, click “New pull request”. Now, you should have a screen that is comparing
your changes. Double check that the left repo name (1 in the figure) is the canonical
repository that you intend to merge your changes into. Then double check that the fork you
are planning to merge is your remote fork (3 in the figure). For now, branches should both be
“master” (2 and 4 in the figure). See the section on branching to learn more.
Once you have verified that you are merging the correct forks and branches, you can select
“Create Pull Request”. Be sure to describe your changes sufficiently (see this wiki for more
tips):
Now, you wait while someone else reviews and merges your PR. To learn how to merge a
pull request, see the section on reviewing code changes. You should avoid merging your own
pull requests, and instead should always have a peer review of your code.
Commit workflow overview
Even though Git and GitHub make simultaneous code development easier, they are not entirely
foolproof. If a line of code you are working on was edited by someone else since the last time
you synced with the upstream branch, you might run into “merge conflicts”. When you
encounter conflicts during a pull from the upstream repo, you will see all files changed since
the previous sync in your Git tab. Files with checkmarks are just fine. Any file with a
filled-in checkbox has only part of its changes being committed; this is where you have
merge conflicts.
When you open the file(s) with merge conflicts, look for the section that looks like this:
<<<<<<< HEAD
some code
some code
some code
=======
your code
your code
your code
>>>>>>> upstream/master
The chunk of code wrapped in <<<<<<< HEAD and ======= (the first chunk) is the code
that exists in the local repository. The chunk of code wrapped in ======= and
then >>>>>>> upstream/master (the second chunk) is the code from upstream that you are
trying to merge. To reconcile these differences, you need to pick which code you are keeping
and which you aren’t. Once you correctly edit the code, make sure to delete the conflict
markers (<<<<<<< HEAD, =======, and >>>>>>> upstream/master). Then, save the file.
Now that you’ve addressed the merge conflict in the file, it’s time to commit those changes.
All the non-conflicted files should still have a checkmark next to them in the Git tab. Check
the box next to your reconciled file and select commit. Add a message about these changes,
such as “merged conflicts” or something similar. Then commit. Now, you should be back on
track to continue your edits.
Here’s an example. When you try to merge upstream with your local code, the shell will say
something similar to CONFLICT ... Merge conflict in [filename].
The actual code will indicate the conflicting lines. Something similar to:
The Git tab will also indicate what files have conflicts by a colored-in check box.
Once you edit and save the file, just check the box and commit along with everything else
that is in the Git tab. Usually, the commit message can just say “Merging conflicts”.
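The whole cycle of producing a conflict, resolving it, and committing can be reproduced in a throwaway repository. In this sketch a local branch named upstream-copy stands in for upstream/master, and the file name and contents are invented for the example.

```shell
set -eu
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git symbolic-ref HEAD refs/heads/master    # name the first branch "master"
git config user.name "Example User"
git config user.email "user@example.com"

echo "threshold <- 1" > settings.R
git add settings.R
git commit -q -m "Add settings"

# A second branch stands in for upstream/master and edits the same line.
git checkout -q -b upstream-copy
echo "threshold <- 2" > settings.R
git commit -q -a -m "Upstream sets threshold to 2"

git checkout -q master
echo "threshold <- 3" > settings.R
git commit -q -a -m "Local sets threshold to 3"

# The merge stops with a conflict, so a non-zero exit status is expected.
git merge upstream-copy >/dev/null 2>&1 || true
markers=$(grep -c '^<<<<<<<' settings.R)

# Resolve: keep one version, delete the conflict markers, then commit.
echo "threshold <- 3" > settings.R
git add settings.R
git commit -q -m "Merging conflicts"
remaining=$(git status --porcelain)
```

Here the local version is kept; in a real conflict you would edit the file by hand to combine or choose between the two chunks.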
Branching
Branches are an optional feature of Git version control. They allow you to have a non-linear
commit history in which multiple features or bug fixes can be developed and merged
independently. You have been working on the “master” branch in the previous sections. We
could add a branch off of master called “bug-fix”, and another called “new-feature”. You
could change the code on either of those branches independently of the other, and merge one
when it is done without needing the other to be completed at the same time. Be careful,
though: a new branch starts from whichever branch you are currently on, so you can create
branches from a non-master branch, which is often not the behavior you want. Just be aware of
your current branch when you are creating a new one.
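A small sketch of the two-branch scenario: both branches are created with master named explicitly as the starting point, so the current branch does not matter. The repository and file names are placeholders.

```shell
set -eu
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git symbolic-ref HEAD refs/heads/master    # name the first branch "master"
git config user.name "Example User"
git config user.email "user@example.com"

echo "base" > base.txt
git add base.txt
git commit -q -m "Initial commit"

# Create both branches from master, naming the start point explicitly.
git branch bug-fix master
git branch new-feature master

git checkout -q bug-fix
echo "fix" > fix.txt
git add fix.txt
git commit -q -m "Fix bug"

git checkout -q new-feature
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "Start new feature"

# Merge the finished bug fix; new-feature can continue independently.
git checkout -q master
git merge -q bug-fix
```

After the merge, master contains fix.txt but not feature.txt, which still lives only on the new-feature branch.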
When the time comes to merge the branch, you can either merge it with the master branch
locally or create a pull request to the main repository specifying changes from your new
branch. Follow this blog to learn how to do the former method.
If you’d prefer the latter method, follow the blog until the “Merging branches back together”
section. Then, when you’re ready to merge your branch with the main repository through a
PR, follow these instructions.
1. Open the Git shell window and push your local branch to your remote fork via the
command git push origin new-branch-name.
2. On GitHub, go to your remote fork page and click “New pull request”.
3. As noted in the section on submitting a pull request, double check that your
repositories and branches are correct on the “Comparing changes” page. The only
difference is that you want to change the farthest right drop-down to your branch.
4. Now follow the rest of the steps for completing your PR submission as described in
the how-to-submit-a-PR section.
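Pushing a branch to the fork can be sketched with a local bare repository standing in for the fork on GitHub. Note that the remote name and the branch name are separate arguments to git push. The file names are placeholders.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q --bare fork.git                # stand-in for your fork on GitHub

git init -q local
cd local
git symbolic-ref HEAD refs/heads/master    # name the first branch "master"
git config user.name "Example User"
git config user.email "user@example.com"
git remote add origin "$work/fork.git"
echo "base" > base.txt
git add base.txt
git commit -q -m "Initial commit"
git push -q origin master

# Work on a separate branch.
git checkout -q -b new-branch-name
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "Add feature"

# Push the branch: remote name first, then branch name.
git push -q origin new-branch-name

remote_branches=$(git --git-dir="$work/fork.git" for-each-ref \
  --format='%(refname:short)' refs/heads | sort | tr '\n' ' ')
```

Once the branch exists on the fork, it becomes selectable in the branch drop-down when opening a pull request.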
gitignore file
It is sometimes useful to have a text file named .gitignore. This file lets Git know which
files it should not track. For RStudio projects, it’s a good idea to have the .Rproj
and .Rhistory files specified in the .gitignore. You can use * before a file extension to say
that any file with that extension should be ignored (including those in sub-folders). Here’s an
example of what the .gitignore content might look like:
.Rproj.user
.Rhistory
.RData
*.Rproj
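To see which paths these patterns actually match, git check-ignore can be used. The sketch below writes the same four patterns into a throwaway repository and tests three placeholder files (their names are made up for the example).

```shell
set -eu
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"

# The same patterns as in the example .gitignore.
printf '%s\n' '.Rproj.user' '.Rhistory' '.RData' '*.Rproj' > .gitignore

# Placeholder files: two match the patterns, one does not.
mkdir sub
touch .Rhistory sub/project.Rproj analysis.R

# check-ignore prints only the paths that Git would ignore.
ignored=$(git check-ignore .Rhistory sub/project.Rproj analysis.R | tr '\n' ' ')

git status --short                         # ignored files are not listed
```

The file in the sub-folder is matched by *.Rproj, while analysis.R remains visible to Git.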
If you have uncommitted changes on your local repository and try to pull down updates from
the upstream repository, you’ll notice that you get an error message:
If you’re ready, you can go ahead and commit those changes. Then try pulling from upstream
again. If you’re not ready to commit these changes, you can “stash” them, pull from
upstream, and then bring them back as uncommitted changes.
To stash all uncommitted changes, run git stash in your Git shell (Git tab >> More >> Shell).
To see what you stashed, run git stash list. The list may open in a pager that behaves much
like the VIM text editor, so type “q” to leave it before you try to do anything else. To get
your stashed changes back, run git stash apply.
That is the basic use of stashing, but there are more complicated ways to stash uncommitted
changes.
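The stash, pull, and apply sequence can be sketched as follows. A plain local repository stands in for the canonical repository, and the file is edited on different lines upstream and locally so that the stashed change reapplies cleanly. All names and contents are assumptions for illustration.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# "canonical" is a local repository standing in for the upstream repo.
git init -q canonical
cd canonical
git symbolic-ref HEAD refs/heads/master    # name the first branch "master"
git config user.name "Maintainer"
git config user.email "maintainer@example.com"
printf '%s\n' one two three four five > data.txt
git add data.txt
git commit -q -m "Add data"

# Your local clone, with the canonical repo registered as "upstream".
cd "$work"
git clone -q -o upstream canonical local
git -C local config user.name "Example User"
git -C local config user.email "user@example.com"

# Upstream moves ahead by editing the first line.
cd "$work/canonical"
printf '%s\n' ONE two three four five > data.txt
git commit -q -a -m "Capitalize first line"

# Meanwhile, an uncommitted local edit to the last line of the same file.
cd "$work/local"
printf '%s\n' one two three four FIVE > data.txt

git stash -q                     # set the uncommitted change aside
git stash list                   # shows one entry, stash@{0}
git pull -q upstream master      # the pull can now proceed
git stash apply -q               # bring the change back, still uncommitted

result=$(tr '\n' ' ' < data.txt)
```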
If you are the reviewer for an open pull request, you will likely need to pull down the
suggested changes and test them out locally before approving the PR. It’s pretty simple to do
this because you can copy and paste git commands for making a new branch of the PR. Next
to the “Merge pull request” button, select “command line instructions”.
Copy the two git commands from Step 1, “From your project repository, check out a new
branch and test the changes.” Paste these lines into your Git shell (Git tab >> More >> Shell).
You might not be able to right click and paste, or use the CTRL + V method. Instead, right
click the top bar of the shell window, hover over “Edit”, then click “Paste”. Once the code is
in the shell, hit enter.
RStudio should now have a different branch name in the top right. Before you can test the
changes available in this branch, you need to build and reload the package.
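The two commands GitHub suggests typically create a branch for the PR and pull the contributor's work into it. The sketch below imitates that with local repositories; the branch name otherusername-master, the paths, and the file contents are stand-ins, and the exact commands shown on GitHub for a given PR may differ.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# The reviewer's repository (a stand-in for your local project).
git init -q project
cd project
git symbolic-ref HEAD refs/heads/master    # name the first branch "master"
git config user.name "Reviewer"
git config user.email "reviewer@example.com"
echo "original" > code.R
git add code.R
git commit -q -m "Initial commit"

# The contributor's copy with the proposed change (stand-in for their fork).
cd "$work"
git clone -q project otherusername-fork
cd otherusername-fork
git config user.name "Contributor"
git config user.email "contributor@example.com"
echo "proposed change" >> code.R
git commit -q -a -m "Propose change"

# Reviewer: create a branch for the PR and pull the contributor's work into it.
cd "$work/project"
git checkout -q -b otherusername-master master
git pull -q "$work/otherusername-fork" master

branch=$(git rev-parse --abbrev-ref HEAD)
last_line=$(tail -n 1 code.R)
```

The reviewer's master branch is untouched, so the proposed change can be tested in isolation on the new branch.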
Once you have thoroughly tested this branch and approve of it, go back to the PR on GitHub.
Write a few comments about what you tested and why you are accepting these changes, then
click “Merge pull request” and then “Confirm”. You can now delete the branch you were
using to test these changes. Continuing the example for checking out a branch
called otherusername-master above,
Don’t forget to pull down these new changes to your local repository master branch!
Common Git commands
Command                          Description
git remote -v                    view the remote repos linked to this local repository
git remote add upstream <url>    add a remote repo at the specified url named 'upstream'
Difference Between Git and GitHub
Git: Git is a distributed version control system for tracking changes in source code during
software development. It is designed for coordinating work among programmers, but it can
be used to track changes in any set of files. Its goals include speed, data integrity, and support
for distributed, non-linear workflows.
GitHub: GitHub is a web-based Git repository hosting service, which offers all of the
distributed revision control and source code management (SCM) functionality of Git as well
as adding its own features.
S.No.  Git                                            GitHub
8.     Git has no user management feature.            GitHub has a built-in user management feature.
       ...                                            ... for-use tier.
10.    Git has minimal external tool configuration.   GitHub has an active marketplace for tool integration.
12.    Git competes with CVS, Azure DevOps Server,    GitHub competes with GitLab, Bitbucket,
       Subversion, Mercurial, etc.                    AWS CodeCommit, etc.
Difference Between GitLab and GitHub
GitLab: GitLab is a repository hosting and management tool developed by GitLab Inc. and used
in the software development process. It provides a variety of management features with which
we can streamline our collaborative workflow across the software development lifecycle. It
also allows us to import repositories from Google Code, Bitbucket, etc.
Following are some features of GitLab:
Open-source Community Edition repository management platform.
Easy maintenance of a repository on a server.
Offers tools such as Group Milestones, Time Tracking, and Issue Tracker for effective
development.
More intuitive user interface and authentication features.
Enhanced user permissions and branch protection.
GitHub: GitHub is a repository hosting service that features collaboration and access
control. It is a platform for programmers to fix bugs together and host open-source projects.
GitHub is designed to help developers track their changes to a project through the
repository.
Following are some features of GitHub:
Supports milestones and labels for projects.
Comparison view between branches is allowed.
GitHub Pages allows us to publish and host websites within GitHub.
Syntax highlight feature.
It allows third-party API integrations for bug tracking and cloud hosting.
Parameters          GitLab                                   GitHub
Public Repository   It allows users to create public         It allows users to have unlimited free
                    repositories.                            repositories.
Project Analysis    GitLab lets users view project           GitHub does not have this feature yet, but
                    development charts.                      users can check the commit history.
Attachments         GitLab supports adding other types       GitHub does not allow adding other types
                    of attachments.                          of attachments.
Difference Between GIT and SVN
GIT                                              SVN
Git is an open source distributed version        Apache Subversion is an open source software
control system developed by Linus Torvalds in    versioning and revision control system under
2005. It emphasizes speed and data integrity.    the Apache license.

Most Git operations do not require a network     Most SVN operations require a network
connection.                                      connection.

Git is more difficult to learn. It has more      SVN is much easier to learn compared to
concepts and commands.                           Git.

It does not have as good a UI as SVN.            SVN has a simpler and better user interface.
Merge tracking.
Open source. File locking.