Unipro UGENE User Manual
Version 1.11
Contents
1 About Unipro
1.1 Contacts
2 About UGENE
2.1 Key Features
2.2 User Interface
2.3 High Performance Computing
2.4 Cooperation
3 Installation
3.1 Installing UGENE on Windows
3.2 Installing UGENE on Linux
3.3 Installing UGENE on Mac OS X
4 Basic Functions
4.1 UGENE Terminology
4.2 UGENE Window Components
4.2.1 Project View
4.2.2 Task View
4.2.3 Log View
4.2.4 Notifications
4.4 Creating New Project
4.5 Opening Document
4.5.1 Opening for the First Time
4.5.2 Opening Document Present in Project
4.6 Creating Document
4.7 Locked Documents
4.8 Using Objects and Object Views
4.9 Using Bookmarks
4.10 Working with Projects
4.11 Options Panel
4.12 Adding and Removing Plugins
4.13 Fetching Data from Remote Database
4.14 UGENE Application Settings
4.14.1 General
5 Sequence View
5.1 Sequence View Components
5.2 Global Actions
5.3 Sequence Toolbar
5.4 Sequence Overview
5.5 Sequence Zoom View
5.5.1 Managing Zoom View Rows
5.6 Sequence Details View
5.7 Information about Sequence
5.8 Manipulating Sequence
5.8.1 Going To Position
5.8.2 Toggling Views
5.8.3 Capturing Screenshot
5.8.4 Zooming Sequence
5.8.5 Creating New Ruler
5.8.6 Selecting Amino Translation
5.8.7 Showing and Hiding Translations
5.8.8 Selecting Sequence
5.8.9 Copying Sequence
5.8.10 Search in Sequence
5.8.11 Editing Sequence
5.8.12 Locking and Synchronize Ranges of Several Sequences
5.9 Annotations Editor
5.9.1 Automatic Annotations Highlighting
5.9.2 The "db_xref" Qualifier
5.10 Manipulating Annotations
5.10.1 Creating Annotation
5.10.2 Editing Annotation
5.10.3 Highlighting Annotations
5.10.4 Creating and Editing Qualifier
5.10.5 Adding Column for Qualifier
5.10.6 Copying Qualifier Text
5.10.7 Deleting Annotations and Qualifiers
5.10.8 Importing Annotations from CSV
6.1 Circular Viewer
6.2 3D Structure Viewer
6.2.1 Opening 3D Structure Viewer
6.2.2 Changing 3D Structure Appearance
6.2.3 Moving, Zooming and Spinning 3D Structure
6.2.4 Selecting Sequence Region
6.2.5 Selecting Models to Display
6.2.6 Exporting 3D Structure Image
6.2.7 Working with Several 3D Structures Views
6.3 Chromatogram Viewer
6.3.1 Exporting Chromatogram Data
6.3.2 Viewing Two Chromatograms Simultaneously
Description of Graphs
Graph Settings
6.5 Dotplot
6.5.1 Creating Dotplot
6.5.2 Navigating in Dotplot
6.5.3 Zooming to Selected Region
6.5.4 Selecting Repeat
6.5.5 Interpreting Dotplot: Identifying Matches, Mutations, Inversions, etc.
6.5.6 Editing Parameters
6.5.7 Saving Dotplot as Image
6.5.8 Saving and Loading Dotplot
6.5.9 Building Dotplot for Currently Opened Sequence
6.5.10 Comparing Several Dotplots
Advanced Functions
Grid Profile
Exporting Image
Building HMM Profile
7.4 Building Phylogenetic Tree
7.4.1 PHYLIP Neighbour-Joining
7.4.2 MrBayes
Getting Information About Read
8.4 Short Reads Visualization
8.4.1 Reads Highlighting
8.4.2 Reads Shadowing
8.5 Associating Reference Sequence
8.6 Consensus Sequence
8.7 Exporting
8.7.1 Exporting Read
8.7.2 Exporting Visible Reads
8.7.3 Exporting Consensus
8.7.4 Exporting Image
Navigation
Assembly Browser Settings
Assembly Statistics
Assembly Overview Hotkeys
Reads Area Hotkeys
9.5 Zooming Tree
9.6 Working with Clade
9.6.1 Selecting Clade
9.6.2 Collapsing/Expanding Branches
9.6.3 Swapping Siblings
9.6.4 Zooming Clade
9.6.5 Adjusting Clade Settings
Running Workflows on Cloud
Introduction
Cloud Computing
Cloud Remote Machine
Launching Workflow
Useful Tips and Recommendations
Running HMMER3 Search Task on Remote Machine
Running Smith-Waterman Search Task on Remote Machine
Running MUSCLE Align Task on Remote Machine
11 Plugins
11.1 Workflow Designer
11.2 DNA Annotator
11.3 DNA Flexibility
11.3.1 Configuring Dialog Settings
11.3.2 Result Annotations
11.4.1 Exporting Sequences as Alignments
11.4.2 Export Sequences to FASTA, Genbank and FASTQ Formats
11.4.3 Exporting Alignment to Sequence Format
11.4.4 Exporting Selected Sequence Region
11.4.5 Exporting Sequence of Selected Annotations
11.4.6 Exporting Annotations to CSV Format
11.5 DNA Statistics
11.6 ORF Marker
11.7 Remote BLAST
11.8 Repeat Finder
11.8.1 Finding Repeats
11.8.2 Finding Tandem Repeats
11.9.1 Selecting Restriction Enzymes
11.9.2 Using Custom File with Enzymes
11.9.3 Filtering by Number of Hits
11.9.4 Excluding Region
11.9.5 Circular Molecule
11.9.6 Results
11.10 Molecular Cloning in silico
11.10.1 Digesting into Fragments
11.10.2 Creating Fragment
11.10.3 Constructing Molecule
11.11 Secondary Structure Prediction
11.12 SITECON
11.12.1 SITECON Searching Transcription Factors Binding Sites
11.12.2 Types of SITECON Models
11.13 Smith-Waterman Search
11.14 HMM2
11.14.1 Building HMM Model (HMM Build)
11.14.2 Calibrating HMM Model (HMM Calibrate)
11.14.3 Searching Sequence Using HMM Profile (HMM Search)
Building HMM Model (HMM3 Build)
Searching Sequence Using HMM Profile (HMM3 Search)
Searching Sequence Against Sequence Database (Phmmer Search)
Aligning with MUSCLE
Aligning Profile to Profile with MUSCLE
Aligning Sequences to Profile with MUSCLE
Aligning Short Reads with Bowtie
Building Index for Bowtie
Aligning Short Reads with BWA
Building Index for BWA
Aligning Short Reads with UGENE Genome Aligner
Building Index for UGENE Genome Aligner
11.20 CAP3
11.21 Weight Matrix
11.21.1 Searching JASPAR Database
11.21.2 Building New Matrix
13 APPENDICES
13.1 Appendix A. Supported File Formats
13.1.1 Specific File Formats
13.1.2 UGENE Native File Formats
13.1.3 Other File Formats
1 About Unipro
Established in 1992, Unipro has its headquarters in Novosibirsk Akademgorodok (the home of the Siberian Branch of the Russian Academy of Sciences). The company's primary activity is IT outsourcing solutions. To learn more about the company, please visit the company website.
1.1 Contacts
Company website: http://unipro.ru
Address: UniPro, 6/1 Lavrentiev Avenue, 630090, Novosibirsk, Russia
Marketing department:
Tel: +7 (383) 3326061
Fax: +7 (383) 3302960
Email: [email protected]
UGENE website: http://ugene.unipro.ru
UGENE technical support:
Email: [email protected]
2 About UGENE
Unipro UGENE is a free cross-platform genome analysis suite distributed under the terms of the GNU General Public License. It works on Windows, Mac OS X or Linux and requires only a few clicks to install. To learn more about UGENE, visit the UGENE website.
2.4 Cooperation
Can be used for education purposes in schools and universities.
Features to be included into the next release are initiated by users.
UGENE team is ready for collaboration in related projects, both free and commercial.
3 Installation
Get the appropriate package from the UGENE download page: http://ugene.unipro.ru/download.html. Follow the installation instructions on the same page to install UGENE on your system. Quick guides on how to install UGENE on Windows, Linux and Mac OS X are given below.
Alternatively, to use UGENE without installing:
1. Download the UGENE zip package.
2. Unpack it.
3. Launch the ugeneui.exe file.
The downloaded file has the *.tar.gz extension.
2. Unpack the archive. You can use this command:
    tar -xf [name of the downloaded *.tar.gz file]
3. Change the working directory to the unpacked UGENE directory:
    cd [name of the unpacked directory]
4. Launch the UGENE GUI version using the command:
    ./ugene -ui
or the command line version using the command:
    ./ugene
Note: Several native packages for specific Linux distributions are also available. Find out details on the download page.
Note: UGENE is a part of Ubuntu and Fedora Linux distributions.
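The unpack and launch steps above can also be scripted. The following is a minimal Python sketch, not part of UGENE itself: the function names are hypothetical, and it assumes the archive contains a single top-level directory holding the ugene launcher mentioned in step 4.

```python
import subprocess
import tarfile
from pathlib import Path


def unpack_archive(archive: Path, dest: Path) -> Path:
    """Unpack a *.tar.gz archive into dest and return its top-level directory."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        # Collect the first path component of every member (skipping "." entries).
        top_level = {Path(n).parts[0] for n in tar.getnames() if Path(n).parts}
        # Newer Python versions accept a safety filter for extraction.
        extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest, **extra)
    if len(top_level) != 1:
        # Assumption of this sketch: the package unpacks into one directory.
        raise ValueError("expected exactly one top-level directory in the archive")
    return dest / top_level.pop()


def launch_command(gui: bool = True) -> list[str]:
    """Build the command from step 4: './ugene -ui' for the GUI, './ugene' otherwise."""
    return ["./ugene", "-ui"] if gui else ["./ugene"]


def launch(ugene_dir: Path, gui: bool = True) -> subprocess.Popen:
    """Start UGENE with the unpacked directory as the working directory."""
    return subprocess.Popen(launch_command(gui), cwd=ugene_dir)
```

A call such as launch(unpack_archive(Path("ugene.tar.gz"), Path("opt"))) would then mirror steps 2 to 4; both paths here are placeholders.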
2. Launch the *.dmg file and accept the GNU license agreement. The following window will appear:
3. To start UGENE click on the ugeneui icon. You can also copy UGENE to the Applications folder by dragging it.
4 Basic Functions
4.1 UGENE Terminology
Project: Storage for a set of data files and visualization options.
Document: A single file (can be stored on a local hard drive or be a remote web page). Each document contains a set of objects.
Object: A minimal and complete model of biological data. For example: a single sequence, a set of annotations, a multiple sequence alignment.
Task: A process, usually asynchronous, that works in the background. For example: some computations, loading and writing files.
Plugin: A dynamically loaded module that adds new functionality to UGENE.
Object View: A graphical view for a single object or a set of objects.
Project View: A visual component used to manage the active project.
Task View: A visual component used to manage active tasks.
Log View: A visual component used to show logs.
Notifications: A visual component used to show notifications. Generally it is used to open task reports.
Plugin Viewer: A visual component used to manage plugins.
Sequence View: An Object View aimed to visualize DNA, RNA or protein sequences along with their properties like annotations, chromatograms, 3D models, statistical data, etc.
Annotation: Additional information about a sequence, identified by its name and the sequence region.
Alignment Editor: An Object View used to visualize and edit DNA, RNA or protein multiple sequence alignments.
Options Panel: A panel with different information tabs and tabs with settings for the Sequence View and the Assembly Browser.
In the image below you can see a typical UGENE window with a Project View and a single Object View window opened:
You can also use the Alt+1 hotkey to show/hide the Project View. To create a new project, refer to Creating New Project. Note that if no project exists when you open a file with a sequence, an alignment or any other biological data, a new anonymous project is created automatically.
The hotkey for showing/hiding the Task View is Alt+2. The Task name column of the Task View shows the task names. The Task state description column shows the status of the active tasks: Started, Running, Finished and so on. The Task progress column shows the percentage of each task's progress. To cancel a task, click the red cross button in the Actions column for the task.
The hotkey for this action is Alt+3. It is possible to configure the Log View settings: the level of the log to show (ERROR, INFO, DETAILS, TRACE), the category (Algorithms, Tasks, etc.), and the format of the log messages (format of the dates, etc.). These settings can be configured in the UGENE Application Settings.
4.2.4 Notifications
The Notifications component shows notifications for task reports.
If a task has finished without errors, the notification is blue. If an error has occurred during the task execution, the notification is red. To open a task report, click on the corresponding notification. See an example of a task report below:
To remove a notification from the Notifications popup window, click the notification cross button. Note that you can click on the clip button of the Notifications popup window to keep the window always on top.
Actions
Settings
Tools
Window
Help
The menus can be dynamically populated with new actions added by plugins. Check the Plugins documentation to learn how each plugin affects global and context menus.
Here you need to specify the visual name for the project and the directory and file to store it. After you click the Create button the Project View window is opened.
or drag the file to the UGENE window.
Warning: Documents not created by UGENE are locked. To edit such a document, save a copy of it and continue working with the copy.
Advanced Dialog Options
Open the Add existing document dialog:
The following parameters are available:
Document location: location of the document. It can be a local file, a shared network file or a web reference, for example:
    C:\store\mydocument.gb
    \\192.168.0.3\store\mydocument.gb
    http://someaddress.com/store/mydocument.gb
Document format: specifies how to interpret the data stored in the file. As specified above, the format is detected automatically, but you can select it manually.
Force read-only mode: locks the document against editing.
Save file to disk before opening: this option becomes available if a web reference has been specified in the Document location. It saves the remote file to the local disk before opening it.
Custom settings: the button is available for Genbank, EMBL, FASTA and FASTQ document formats. The button opens the following dialog:
If there are several sequences in the document, selecting the Separate sequences option opens the sequences separately in a Sequence View window. Conversely, selecting the Merge sequences option merges the sequences into one sequence. The Gap length parameter specifies the length of the gaps inserted between the merged sequences. Your choice is saved as the default if you check the Save as default settings check box. Note that if you choose to merge the sequences, their annotations, if any, are relocated automatically.
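To make the merge behaviour more concrete, here is a small Python sketch of concatenating sequences with a fixed gap and shifting their annotations accordingly. It is an illustration only, not UGENE's implementation: the Annotation class, the 0-based half-open coordinates and the "N" gap symbol are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """Hypothetical annotation with 0-based, end-exclusive coordinates."""
    name: str
    start: int
    end: int


def merge_sequences(sequences, annotations, gap_length, gap_symbol="N"):
    """Concatenate sequences with gap_length gap symbols between them.

    annotations[i] lists the annotations of sequences[i]; each one is
    shifted by the offset of its sequence in the merged result.
    The gap symbol "N" is an assumption of this sketch.
    """
    merged_parts = []
    merged_annotations = []
    offset = 0
    for index, (seq, anns) in enumerate(zip(sequences, annotations)):
        if index > 0:
            # Insert the gap before every sequence except the first one.
            merged_parts.append(gap_symbol * gap_length)
            offset += gap_length
        merged_parts.append(seq)
        for ann in anns:
            merged_annotations.append(
                Annotation(ann.name, ann.start + offset, ann.end + offset)
            )
        offset += len(seq)
    return "".join(merged_parts), merged_annotations
```

With a gap length of 3, merging "ACGT" and "GG" in this sketch yields "ACGTNNNGG", and an annotation covering positions 0 to 2 of the second sequence moves to positions 7 to 9.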
You can enter the sequence into the Paste data here field:
The following Custom settings are available:
Alphabet: here you can select the alphabet:
Skip unknown symbols / Replace unknown symbols with: you can select either to skip unknown input symbols or to replace them with the specified symbol.
Document location: location of the created document.
Document format: format of the created document. Currently available formats are FASTA and Genbank.
Sequence name: name of the sequence in the created document.
Save file immediately: check this option if you want to save the document immediately after the Create button is pressed.
The created document will be added to the current project and opened in the Sequence View.
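The two ways of handling unknown symbols described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the DNA_ALPHABET constant, the function name, the upper-casing of input and the dropping of whitespace are choices made for this example and are not taken from UGENE.

```python
DNA_ALPHABET = "ACGTN"  # hypothetical alphabet used only for this example


def clean_sequence(text, alphabet=DNA_ALPHABET, replace_with=None):
    """Skip unknown symbols, or replace them when replace_with is given.

    Whitespace (for example line breaks in pasted text) is always dropped;
    that behaviour is an assumption of this sketch.
    """
    allowed = set(alphabet)
    cleaned = []
    for symbol in text.upper():
        if symbol.isspace():
            continue
        if symbol in allowed:
            cleaned.append(symbol)
        elif replace_with is not None:
            cleaned.append(replace_with)
        # Otherwise the unknown symbol is skipped.
    return "".join(cleaned)
```

Passing replace_with=None corresponds to the skip mode; passing a symbol such as "N" corresponds to the replace mode.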
For other file formats (for example, PDB) the Save a copy context menu item is not available:
You can use the built-in export utilities to create a copy of the document in this case. The list of the file formats supported by UGENE can be found in Appendix A. Supported File Formats. The list also specifies whether it is possible to read / write a file of a certain format using UGENE.
Below is the list of object types supported by the current version of UGENE.
Object types:
Symbol   Description
[3d]     A 3D model.
[a]      Annotations for DNA sequence regions.
[as]     An assembly.
[c]      Chromatogram data.
[i]      A file with index information for a set of other, usually large files.
[m]      A multiple sequence alignment.
[s]      A nucleic, protein or raw sequence.
[t]      A plain text.
[tr]     A phylogenetic tree.
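As a rough illustration of how a document groups typed objects, the table above can be modelled with a small Python data structure. The class names, the label format and the helper method are hypothetical and do not reflect UGENE's internal model; only the symbols and descriptions come from the table.

```python
from dataclasses import dataclass, field

# Symbols and descriptions copied from the object types table.
OBJECT_TYPES = {
    "3d": "A 3D model",
    "a": "Annotations for DNA sequence regions",
    "as": "An assembly",
    "c": "Chromatogram data",
    "i": "A file with index information for a set of other, usually large files",
    "m": "A multiple sequence alignment",
    "s": "A nucleic, protein or raw sequence",
    "t": "A plain text",
    "tr": "A phylogenetic tree",
}


@dataclass
class DocObject:
    """Hypothetical object: a name plus one of the type symbols above."""
    name: str
    symbol: str

    def __post_init__(self):
        if self.symbol not in OBJECT_TYPES:
            raise ValueError(f"unknown object type symbol: {self.symbol}")

    def label(self) -> str:
        # Label format chosen for this sketch only.
        return f"[{self.symbol}] {self.name}"


@dataclass
class Document:
    """Hypothetical document: a location and the objects it contains."""
    location: str
    objects: list = field(default_factory=list)

    def objects_of_type(self, symbol: str) -> list:
        return [obj for obj in self.objects if obj.symbol == symbol]
```

For example, a Genbank file would typically be represented here as one Document holding an "s" object and an "a" object.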
You can edit the names of particular objects, such as sequence objects, by selecting them in the Project View and then pressing F2. To be able to do so, the document containing the target object must be unlocked. To see the list of all available views for a given object, select the object, open the context menu inside the Project View window, and select the Open view submenu:
The picture above illustrates an option to visualize the selected DNA sequence object using the Sequence View, a complex and extensible Object View that focuses on visualization of sequence objects in combination with different kinds of related data: sequence annotations, graphs, chromatograms, sequence analysis algorithms. Note that the Sequence View is described in more detail in a separate documentation section.
For every persistent view UGENE automatically saves the state of the view in the Auto saved bookmark when the view is closed.
Now, by activating bookmarks you can restore the original view state. For example, a Sequence View bookmark can store the visual position and zoom scale for the sequence region.
Use the F2 keyboard shortcut to rename a bookmark. To remove a bookmark, press the Delete key. UGENE has a limited set of built-in Object Views. Extension modules (plugins) can be used to adjust the existing views or to add new views to the tool.
To load a saved project later, select File → Open and specify the path to the project file.
Note that the Ctrl key can be used to open several tabs at the same time. In this case the tabs are shown on the Options Panel one after another. More detailed information about the different Options Panel tabs can be found in the following chapters:
Options Panel in Sequence View
    Information about Sequence
    Search in Sequence
    Highlighting Annotations
Options Panel in Assembly Browser
    Navigation in Assembly Browser
    Assembly Browser Settings
    Assembly Statistics
To add or remove plugins use the Add plugin and the Remove plugin items available in the Plugin Viewer context menu:
When you select the Remove plugin item for a plugin, the plugin's status changes to the "to remove after restart" value. The Remove plugin item is no longer available in the context menu of the plugin. Instead, the Enable plugin item appears in the context menu:
If you select this item the plugin will be enabled again, i.e. it will not be removed after restart. Otherwise, the plugin will not be available after UGENE restart.
Here you need to enter the unique ID of the biological object and choose a database. Unique identifiers differ between databases. For example, for NCBI GenBank the unique ID can be an Accession Number or an NCBI GI number. Optionally, you can browse for a directory to save the fetched file to. After you click the OK button, UGENE downloads the biological object (DNA sequence, protein sequence, 3D model, etc.) and adds it to the current project. If something goes wrong, check the Log View: it will help you diagnose the problem.
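For readers who want to fetch the same kind of record outside the GUI, the sketch below builds a request to the public NCBI E-utilities efetch service for a GenBank accession. It is an assumption-laden illustration: this manual does not state how UGENE itself retrieves remote data, and the function names and the default "nuccore" database are chosen only for this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Public NCBI E-utilities endpoint; not necessarily what UGENE uses internally.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_fetch_url(accession, database="nuccore"):
    """Build an efetch URL that returns the record as GenBank flat-file text."""
    query = urlencode(
        {"db": database, "id": accession, "rettype": "gb", "retmode": "text"}
    )
    return f"{EFETCH_URL}?{query}"


def fetch_to_file(accession, target_path, database="nuccore", timeout=30):
    """Download the record and save it to target_path (requires network access)."""
    with urlopen(genbank_fetch_url(accession, database), timeout=timeout) as response:
        data = response.read()
    with open(target_path, "wb") as handle:
        handle.write(data)
    return target_path
```

The saved file can then be opened in UGENE like any other local document.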
4.14.1 General
Language of User Interface (applied after restart): here you can select the UGENE localization. Currently available localizations are EN, RU and CS. The default value (Autodetection) specifies that UGENE should use the operating system regional options to select the localization. This setting is applied only after UGENE is reopened.
Appearance: defines the appearance of the application. For example, here is a part of the same dialog when the Cleanlooks appearance has been applied:
Preferred Web browser: you can either use the System default browser or specify some other browser.
Open last project at startup: if the option is checked, the last project is opened when UGENE is started.
Path to downloaded data: specifies the path where files downloaded from the remote databases are stored.
Enable statistical reports collecting: collects information about UGENE usage and sends it to the UGENE team to help improve the application.
Note: The collected information includes:
1. System info: UGENE version, OS name, Qt version, etc.
2. Counters info: number of launches of certain tasks (e.g. HMM search, MUSCLE align).
The collected information DOES NOT include any personal data.
4.14.2 Resources
On the Resources tab you can set resources that can be used by the application: Optimize for CPU count, Tasks memory limit and Threads limit.
4.14.3 Network
On the Network settings tab of the dialog you can specify Proxy server parameters, select SSL settings and configure the Remote request timeout.
4.14.4 Logging
On the Logging tab you can select the type of log information (ERROR, INFO, DETAILS, TRACE) that will be output to the Log View for each Category. You can select the format of each log message by checking the Show date, Show log level and Show log category options.
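The interplay between per-category log levels and the three format options can be illustrated with a short Python sketch. The LogSettings class, the bracketed message layout and the date format are assumptions made for this example and may differ from what the Log View actually displays.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class LogSettings:
    """Hypothetical mirror of the Logging tab options."""
    # Maps a category name to the set of log levels enabled for it.
    levels_by_category: dict = field(default_factory=dict)
    show_date: bool = True
    show_level: bool = True
    show_category: bool = True


def format_message(settings, when, level, category, text):
    """Return the formatted line, or None if the message is filtered out."""
    if level not in settings.levels_by_category.get(category, set()):
        return None
    parts = []
    if settings.show_date:
        # Date format chosen for this sketch only.
        parts.append(when.strftime("%Y-%m-%d %H:%M:%S"))
    if settings.show_level:
        parts.append(f"[{level}]")
    if settings.show_category:
        parts.append(f"[{category}]")
    parts.append(text)
    return " ".join(parts)
```

In this sketch a message is shown only if its level is enabled for its category, and each unchecked option simply removes the corresponding prefix.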
4.14.6 OpenCL
If you have a video card that supports OpenCL, you can use it to speed up some calculations in UGENE. To do so, install the latest video card driver and check the corresponding check box:
Now you can, for example, use OpenCL optimization for the Smith-Waterman algorithm.
4.14.7 CUDA
If you have an NVIDIA video card that supports Compute Unified Device Architecture (CUDA), you can use it to speed up some calculations in UGENE. To do so, install the latest video driver and check the corresponding check box:
Now you can, for example, use CUDA optimization for the Smith-Waterman algorithm.
The Index location value will appear automatically after you choose the document location, but you can choose some other location. Optionally you can check the Compress option to archive the resulting index file with the gzip utility (it is checked by default). Open the created index file. For example, the screenshot below demonstrates an index file created from a FASTA file with several sequences:
As you can see there is an index for each sequence. You can export the indexed sequence by selecting the corresponding index and using the Export → Save selection to a new file context menu item.
You can also filter the indexes using the green arrow pop-up menu:
5 Sequence View
5.1 Sequence View Components
The Sequence View is one of the major Object Views in UGENE. It is used to visualize and edit DNA, RNA or protein sequences along with their properties like annotations, chromatograms, 3D models, statistical data, etc. For each file UGENE analyzes the file content and automatically opens the most appropriate view. To activate the Sequence View open any file with at least one sequence. For example, you can use the $UGENE/data/samples/EMBL/AF177870.emb file provided with UGENE. After opening the file in UGENE the Sequence View window appears:
After the view is opened you can see a set of new buttons in the toolbar area. The actions provided by these buttons are available for all sequences opened in the view. In the picture below these buttons are pointed to by the "Global actions" arrow. Below the toolbar there is an area for one or several sequences. For each sequence a smaller toolbar with actions for the sequence and the following areas are available:
Sequence overview: shows the sequence as a whole and provides handy navigation in the Sequence zoom view and the Sequence details view.
Sequence zoom view: provides flexible tools for navigation in large annotated sequence regions.
Sequence details view: a supplementary component of the Sequence overview. It is used to show sequence content without zooming.
Annotations editor: contains tools to manipulate annotations for a sequence.
You can change the focus by clicking on the corresponding sequence area. All sequences that are not in focus have the sequence name and icon disabled. The bottom area of the Sequence View is the Annotations editor. It contains a tree-like structure of all annotations available for all sequences shown in the Sequence View and can be used to perform various actions on annotations: create a new annotation, modify the existing one, group, sort, etc.
The global action toolbar makes it possible to go to the specified position (in all sequences at the same time). It also allows you to lock or adjust ranges of sequences in the same Sequence View. See this paragraph for details.
See also: Toggling Views, Capturing Screenshot, Zooming Sequence, Showing and Hiding Translations, Selecting Sequence
When the sigma button (in the right part of the Sequence overview ) is pressed, density of annotations in the sequence is shown. For example in the picture below there are annotations in the parts of the sequence that are marked with dark grey color:
Below the annotation rows there is a ruler to show coordinates in the sequence.
When the Show All Rows item is checked all available annotations are always shown. You can also add rows by selecting the +5 Rows and +1 Row items and remove rows by selecting the -5 Rows and -1 Row items. To restore the default number of rows select the Reset Rows Number item.
See also: Navigating Sequence zoom view using Sequence overview, Zooming Sequence, Creating New Ruler, Manipulating Annotations
See also: Navigating the Sequence details view using the Sequence overview, Selecting Amino Translation, Showing and Hiding Translations
Sequence length
Characters occurrence
Dinucleotides occurrence (for sequences with the standard DNA and RNA alphabets)
To copy the statistical information about a sequence select it on the Options Panel and choose the copy item in the context menu, or use the Ctrl+C shortcut.
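The three statistics listed above can be reproduced outside UGENE with a few lines of Python. This is a minimal sketch and not UGENE's own code: the function name `sequence_statistics` is hypothetical, and counting dinucleotides as overlapping pairs is an assumption.

```python
from collections import Counter


def sequence_statistics(seq):
    """Return the length, character counts and dinucleotide counts of a sequence.

    Assumption: dinucleotides are counted as overlapping pairs,
    i.e. positions (0,1), (1,2), (2,3) and so on.
    """
    seq = seq.upper()
    characters = Counter(seq)
    dinucleotides = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    return {
        "length": len(seq),
        "characters": characters,
        "dinucleotides": dinucleotides,
    }


stats = sequence_statistics("ACGTAC")
print(stats["length"])               # sequence length
print(dict(stats["characters"]))     # occurrences of each character
print(dict(stats["dinucleotides"]))  # occurrences of each dinucleotide
```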
Or use the Go to position context menu or the Actions main menu item.
The sequence can be removed from the view using the same menu. Once you remove the last sequence in the view, the view is automatically closed.
There are standard Zoom In and Zoom Out buttons. Additionally you can zoom to a selected region using the Zoom to Selection button. To restore the default view of the Sequence zoom view (when the sequence is not zoomed) use the Zoom to Whole Sequence button.
The new ruler will be shown right above the default one:
Selecting the Sequence region context menu item opens the Select range dialog:
Here you can specify the sequence range you would like to select. You can open the same dialog using the Select sequence region button on a sequence toolbar or using the Ctrl-A key sequence. To use the Sequence between selected annotations item, select two annotations in the Annotations editor (holding the Ctrl key at the same time):
Then select the Select → Sequence between selected annotations item in the context menu. The Sequence around selected annotations item selects the selected annotations and the sequences between these annotations.
Another way to select a sequence around annotations is to hold the Shift and Ctrl keys while clicking on the annotations either in the Sequence details view or in the Sequence zoom view.
2. Using the following shortcuts:
Ctrl-C copies the direct sequence strand
Ctrl-T copies the direct amino translation
Ctrl-Shift-C copies the reverse-complement sequence
Ctrl-Shift-T copies the reverse-complement amino translation
3. Using the Copy submenu of the context menu:
By default, misc_feature annotations are created for regions that exactly match the pattern. To change these and/or other settings, click on the Show more options link. Find below the description of the available settings.
Search algorithm
This group specifies the algorithm that should be used to search for a pattern. The algorithm can be one of the following:
InsDel: there could be insertions and/or deletions, i.e. a pattern and the searched region can vary in their length. You can specify the percentage of the pattern and a searched region match in the field nearby. Note that this value also depends on the pattern length and is disabled when the pattern hasn't been specified.
Substitute: a pattern may contain characters different from the characters in the searched region. When this algorithm has been selected you can also specify the match percentage and additionally it is possible to take into account ambiguous bases.
Regular expression: a regular expression may be specified instead of a pattern. For example, the character . matches any character, and .* matches zero or more of any characters. There is also the Limit result length option that specifies the maximum length of a result.
Search in
In this group you can specify where to search for a pattern: in what region and in which strand (for nucleotide sequences). Also for nucleotide sequences it is possible to search for a pattern in the sequence translations.
Strand: for nucleotide sequences only. Specifies on which strand to search for a pattern: Direct, Reverse-complementary or Both strands.
Search in: for nucleotide sequences you can select the Translation value for this option. In this case the input pattern will be searched in the amino acid translations.
Region: specifies the sequence range where to search for a pattern. You can search in the whole sequence or specify a custom region.
Other settings
This group contains additional common settings:
Remove overlapped results: annotates only one of the overlapped results.
Limit results number to: limits the number of the search results to the specified value.
Annotations settings
In the Save annotation(s) to group you can set up a file to store annotations. It can be either an existing annotation table object or a new document (file). In the Annotation parameters group you can specify the annotation name and a group in the Annotations Editor.
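To make the Substitute algorithm and the Strand option more concrete, here is a minimal Python sketch of a search that tolerates mismatches up to a given match percentage. It is only an illustration, not UGENE's implementation: the function names are hypothetical, ambiguous bases are not handled, and reporting reverse-complementary hits in direct-strand coordinates is an assumption.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (only A, C, G, T and N are handled)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def find_with_substitutions(seq, pattern, min_match_percent=100):
    """Return 0-based start positions where the pattern matches well enough.

    A region matches when at least min_match_percent of its positions are
    identical to the pattern (substitutions only, no insertions or deletions).
    """
    hits = []
    m = len(pattern)
    if m == 0 or m > len(seq):
        return hits
    for start in range(len(seq) - m + 1):
        region = seq[start:start + m]
        matches = sum(1 for a, b in zip(region, pattern) if a == b)
        if matches * 100 >= min_match_percent * m:
            hits.append(start)
    return hits


def search_both_strands(seq, pattern, min_match_percent=100):
    """Search the direct strand and, via the reverse-complemented pattern,
    the reverse-complementary strand. Positions refer to the direct strand."""
    return {
        "direct": find_with_substitutions(seq, pattern, min_match_percent),
        "reverse-complementary": find_with_substitutions(
            seq, reverse_complement(pattern), min_match_percent
        ),
    }


print(find_with_substitutions("ACGTACGA", "ACGT", 100))
print(find_with_substitutions("ACGTACGA", "ACGT", 75))
print(search_both_strands("ACGTACGA", "TCGT"))
```

Lowering the match percentage from 100 to 75 lets a four-character pattern match regions with one substituted character.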
The Edit sequence submenu is available in the Actions main menu and in the Sequence View context menu. When you select the Insert subsequence item the following dialog is opened:
Description of the dialog parameters:
Paste data here: you must input the subsequence to insert. This parameter is mandatory.
Annotated regions resolving: defines whether to Resize, Remove or Split an annotation (into two annotations) when the subsequence is inserted at a sequence position where annotations are present.
Start position: the sequence position where the subsequence is inserted.
Save resulted document to a new file: the resulting sequence can be saved to a new file instead of modifying the current file. You must select the Document location. The FASTA and Genbank file formats are available when you do not include annotations in the result file. If you check the Merge annotations to this file item, the annotations will also be saved to the result file (only the Genbank file format is available in this case).
If a subsequence has been selected, the first item in the Edit sequence submenu is called Replace subsequence instead of Insert subsequence. The dialog opened in this case is similar to the dialog described above, except that it already contains the sequence to be edited and doesn't allow you to input the start position.
It is also possible to remove a selected subsequence from a sequence. When you select the corresponding item (in the context menu or in the Actions menu), the Remove subsequence dialog appears:
Description of the parameters:
Region to remove: specifies the region of the sequence that will be removed in the form. This parameter is mandatory.
Annotated regions resolving: specifies what to do with annotations that overlap with the region that is removed. You can select either to Resize such annotations (i.e. make them smaller) or to Remove them.
Save resulted document to a new file: similar to the same parameter in the Insert subsequence dialog (described above).
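The effect of the Annotated regions resolving option on insertion can be pictured with a small Python sketch. The manual does not spell out the exact coordinate rules, so the behavior below is an assumption made for illustration: annotations are 0-based half-open (start, end) pairs, annotations located after the insertion point are shifted, and only annotations that span the insertion point are resized, removed or split. The function name `insert_subsequence` is hypothetical.

```python
def insert_subsequence(seq, annotations, pos, insert, mode="resize"):
    """Insert `insert` into `seq` at `pos` and adjust the annotations.

    Assumed rules (not taken from UGENE's source code):
      resize: an annotation spanning `pos` grows by the inserted length
      remove: an annotation spanning `pos` is dropped
      split:  an annotation spanning `pos` becomes two annotations
    """
    if mode not in ("resize", "remove", "split"):
        raise ValueError("unknown mode: " + mode)
    shift = len(insert)
    new_seq = seq[:pos] + insert + seq[pos:]
    adjusted = []
    for start, end in annotations:
        if end <= pos:
            # Entirely before the insertion point: unchanged.
            adjusted.append((start, end))
        elif start >= pos:
            # Entirely after the insertion point: shifted right.
            adjusted.append((start + shift, end + shift))
        elif mode == "resize":
            adjusted.append((start, end + shift))
        elif mode == "split":
            adjusted.append((start, pos))
            adjusted.append((pos + shift, end + shift))
        # mode == "remove": the spanning annotation is dropped.
    return new_seq, adjusted


annotations = [(0, 2), (2, 6), (6, 8)]
for mode in ("resize", "remove", "split"):
    print(mode, insert_subsequence("AAAACCCC", annotations, 4, "GG", mode))
```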
To unlock the scales click the same button again. You may use the Adjust scales button to synchronize scales without locking them. Note that if you have a selected sequence region or a selected annotation the scales will be synchronized by the start position of the region or the annotation. If there is no active selection the regions are synchronized by the first visible sequence position on the screen.
There are usually several objects with annotations in the Annotations editor. A special Auto-annotations object is always present for each opened sequence. It contains annotations automatically calculated for the sequence (see below for details). An object contains groups of annotations used by UGENE for logical organization of the annotations. An annotation must always belong to some group.
For documents not created by UGENE annotations are grouped by their names. For annotations created in UGENE it is possible to use arbitrary group names. Groups can contain both annotations and other groups. The numbers in the brackets after a group name in the Annotations editor are the counts of subgroups and annotations in the current group. A single annotation can be present in several groups simultaneously. An annotation is physically removed from the document when it does not belong to any group.
To disable/enable the automatic annotations calculations use the Automatic Annotations Highlighting menu button on the Sequence View toolbar:
When you click on the value, the web page specified in the reference is opened or the file specified in the reference is loaded. The loaded file is added to the current project.
The dialog asks where to save the annotation. It can be either an existing annotation table object or a new document (file). You can also specify the name of the group and the name of the annotation. If the group name is set to <auto> UGENE will use the annotation name as the name for the group. You can use the / character in this field as a group name separator to create subgroups. The Location field contains the annotation coordinates. The coordinates must be provided in the Genbank or EMBL file format. If you want to annotate the complement sequence strand, surround the coordinates with the complement() word or press the last button in the Location row to do it automatically.
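As an illustration of the Location field syntax, the following Python sketch parses a simple Genbank-style location such as 10..50 or complement(10..50) and wraps a location in complement(). It is a deliberately small sketch: the helper names are hypothetical, and compound locations (for example those built with join) are not handled.

```python
import re

# A simple location: "start..end", optionally wrapped in "complement(...)".
_LOCATION = re.compile(r"^(complement\()?(\d+)\.\.(\d+)(\))?$")


def parse_location(text):
    """Return (start, end, is_complement) for a simple location string."""
    match = _LOCATION.match(text.replace(" ", ""))
    # The opening "complement(" and the closing ")" must appear together.
    if match is None or bool(match.group(1)) != bool(match.group(4)):
        raise ValueError("unsupported location: " + text)
    start, end = int(match.group(2)), int(match.group(3))
    return start, end, match.group(1) is not None


def to_complement(text):
    """Wrap a location in complement() unless it is already wrapped."""
    text = text.strip()
    if text.startswith("complement("):
        return text
    return "complement(" + text + ")"


print(parse_location("10..50"))
print(parse_location("complement(10..50)"))
print(to_complement("10..50"))
```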
Note that by default the Location field contains the coordinates of the selected sequence region. Once the Create button is pressed the annotation is created and highlighted both in the Sequence overview and the Sequence details view areas:
If you want to see all annotation types, click the Show all annotation types link. Find below information about the annotation type properties that you can configure.
Annotations Color
To change the color of all annotations of a certain type click on the corresponding color box in the annotation types table and select the required color in the Select Color dialog that appears.
Annotations Visibility
You can show/hide annotations of a certain type by selecting the type in the annotation types table and checking/unchecking the Show annotations of this type check box.
Show on Translation
This option is available for nucleotide sequences only. It specifies that the annotation should be shown on the corresponding amino acid sequence instead of the original nucleotide sequence in the Sequence details view, for example:
You can enable/disable this option by checking/unchecking the Show on translation checkbox.
Captions on Annotations
It is possible to show the value of a qualifier of an annotation instead of the annotation type name in the Sequence zoom view. To enable this option for an annotation type check the Show value of qualifier check box and input the names of the required qualifiers in the text field near this check box. See the image below.
If you input several qualifier names (separated by commas), then the first qualifier found is taken into account and shown on the annotation.
Here you can specify the name and the value of the qualifier. You can use the F2 key to rename a qualifier:
To edit a qualifier, select the qualifier and press the F4 key or use the Edit qualifier context menu item:
Basically you need to specify the file to read the annotations table from (required):
And the format of and the path to the file to write the annotations table into (required):
Check Add result file to project to link the annotations to the currently opened sequence.
To use a separator to split the table, check the Column separator item and specify the separator symbols. Also you can press Guess to try to detect the separator from the input le.
Alternatively, you can press Edit and edit the script which will specify the separator for each parsed line. It is possible to use line number in the script.
Using the arrows, you can exclude the necessary number of lines at the beginning of the document from parsing. You can also skip all lines that start with the specified text.
By pressing Preview you can bring up the view of the current annotations table (which is produced from the input file with the specified parameter values). The input file contents will also be shown in the bottom part of the dialog.
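The separator and line-skipping options described above amount to a small text-parsing routine. The Python sketch below shows one way such parsing could work. It is not UGENE's parser: the function names are hypothetical, and the separator guessing rule (a candidate that occurs the same non-zero number of times in every line) is a simple heuristic chosen for illustration.

```python
def guess_separator(lines, candidates=(",", "\t", ";", " ")):
    """Return the first candidate that occurs equally often (and at least
    once) in every non-empty line, or None when no candidate fits."""
    rows = [line for line in lines if line.strip()]
    if not rows:
        return None
    for candidate in candidates:
        counts = {row.count(candidate) for row in rows}
        if len(counts) == 1 and counts.pop() > 0:
            return candidate
    return None


def read_table(lines, separator, skip_first=0, skip_prefix=None):
    """Split lines into columns, skipping the first `skip_first` lines and
    every line that starts with `skip_prefix`."""
    table = []
    for number, line in enumerate(lines):
        line = line.rstrip("\n")
        if number < skip_first or not line:
            continue
        if skip_prefix and line.startswith(skip_prefix):
            continue
        table.append(line.split(separator))
    return table


lines = ["# exported table", "name;start;end", "geneA;10;50", "geneB;60;90"]
separator = guess_separator(lines[1:])
print(repr(separator))
print(read_table(lines, separator, skip_prefix="#"))
```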
The preview table headline indicates the types of the information contained in the corresponding columns. By default the values are [ignored]. To specify a column role, click on the corresponding headline element:
The annotation start and end positions must be specified. It is possible to add an offset to every read start position by checking the Add offset checkbox, and to shorten annotations by one from the end by unchecking the Inclusive checkbox. When all the roles are specified, press Run. With the Add to project checkbox checked and a Sequence View opened, on success you will see the Sequence View with the annotations linked:
The 3D Structure Viewer adds 3D visualization for PDB and MMDB files:
The Chromatogram Viewer adds support for chromatograms visualization and editing:
The Dotplot provides a tool to build dotplots for DNA or RNA sequences.
A number of other instruments add graphical interface for popular sequence analysis methods:
Pressing the button will show the circular view of the sequence:
Note: The Circular Viewer is opened automatically when the Sequence View is opened for a plasmid. The inner circle represents the sequence clockwise and the scale marks show the corresponding sequence positions. The sequence annotations are represented as curved colored regions at the outer side of the circle. The Circular Viewer helps to navigate within the sequence. You can select an annotation on the circular view and the annotation will also be focused and highlighted in all Sequence View areas: Sequence overview, Sequence zoom view, Sequence details view and Annotations editor.
This will also affect the Sequence View. Note that the circular view is zoomed automatically when the Circular Viewer area is resized:
So you can adjust it to an appropriate size. It is possible to rotate the circular view using the mouse wheel. Use the Export → Save circular view as image context menu item or the Actions main menu item to save the image of the circular view.
Different file formats are available, including *.png, *.bmp, *.jpg, *.svg and *.pdf. Note that if a sequence file contains several sequences it is possible to view the circular views of the sequences in the same Circular Viewer area.
You can work with these circular views at the same time.
Notice the Links button on the toolbar. When you click the button, a menu appears with quick links to online resources with detailed information about the opened molecule:
PDB Wiki
RCSB PDB
PDBsum
NCBI MMDB
Note that if you're online, you can access the Protein Data Bank directly from UGENE and load a required file by its PDB ID (see Fetching Data from Remote Database for details). Hint: Don't forget to select the correct database (PDB) while fetching.
Selecting Coloring Scheme
You can select one of the following coloring schemes:
Chemical Elements
Molecular Chains
Secondary Structure
To change the coloring scheme open the Coloring Scheme menu (available in the context menu and in the Display menu on the toolbar).
Calculating Molecular Surface
To calculate the molecular surface of a molecule select the Molecular Surface item in the 3D Structure Viewer context menu or in the Display menu on the toolbar and check one of the following items:
SAS (solvent-accessible surface)
SES (solvent-excluded surface)
vdWS (van der Waals surface)
To remove the molecular surface that has already been calculated select the Off item. You can also select the Molecular Surface Render Style to modify the appearance of the calculated molecular surface:
Convex Map
Dots
Selecting Background Color
To change the background color open the Settings dialog (choose the Settings item in the 3D Structure Viewer context menu or in the Display menu on the toolbar), press the Set background color button and select a color in the dialog that appears.
Selecting Detail Level
To select the detail level of a 3D Structure representation open the Settings dialog of the 3D Structure Viewer and drag the Detail level slider.
Enabling Anaglyph View
UGENE allows you to view a molecule in the anaglyph mode. To enable the anaglyph view open the Settings dialog of the 3D Structure Viewer and check the Anaglyph view check box. You can modify the color settings: select one of the available Glasses colors, set custom colors or swap the colors. The offset of the color layers can be adjusted by dragging the Eyes shift slider.
You can also overview the whole structure by spinning it automatically. Select the Spin item either in the 3D Structure Viewer context menu or in the Display menu on the toolbar to do it. To stop the spinning uncheck the Spin item.
To adjust the shading drag the Unselected regions shading slider in the Settings dialog.
To show all the models check the Select All item. To show only one model check the Exclusive item and then check the model you want to display. To show several models uncheck both the Select All and the Exclusive items and check the models you would like to display.
Here you can browse for the file name, select the width and height of the image as well as its format: svg, png, ps, jpg or tiff.
2. Press the Add button on the toolbar. The Select Item dialog will appear. Select [3d] objects to add. Hint: Use the Ctrl keyboard button to select several objects.
Below you can see the 3D Structure Viewer with two views:
To select an active view click on the view area or select an appropriate value in the Active view combo box on the toolbar. To synchronize the views press the Synchronize 3D Structure Views sticky button on the toolbar (see the image above). When the button has been pressed the 3D structures are moved, zoomed and spun synchronously. Press the button again to stop the views synchronization. The views that are no longer required can be closed by selecting the Close button in the 3D Structure Viewer context menu. You can also hide/show views for a while. Use the menu of the green arrow button on the toolbar to do it:
Notice that the 3D Structure Viewer can be closed from this menu.
To edit sequence data, right-click on the chromatogram view and select the Edit new sequence item in the context menu that appears. The original DNA sequence is not allowed to be changed; however, you can add and modify a new sequence stored in a separate file. The sequence being edited is displayed right above the original one. Symbols can be changed by clicking on the value of interest; modifications are shown in bold.
After clicking on the item, the Export chromatogram file dialog will appear:
Check the Reversed and Complemented options if you want to create a reverse and complement chromatogram. Press the Export button. The exported file will be opened in the Sequence View.
You can also use the Lock scales and Adjust scales global actions for the chromatograms. For example, if you lock the scales you are able to scroll the sequences simultaneously. Also, when you select a sequence region in one sequence, the same region is selected in the second sequence.
To see a graph select the corresponding graph item in the popup menu. A new area with the graph appears right above the Sequence zoom view :
Each point on a graph is calculated for a window of a specied size. The window is moved along the sequence by a step. See Graph Settings for instructions on how to modify these parameters. All graphs are always aligned to the range shown in the Sequence zoom view . It means that if you change the visible range in the overview (either by zooming or scrolling) the graph will also be updated. The minimum and maximum values of the visible range are shown at the right lower and upper corners of the graph.
For more detailed information see the DNA Flexibility paragraph.
GC Content (%) shows the percentage of nitrogenous bases (either guanine or cytosine) on a DNA molecule. It is calculated by the following formula:
(G+C)/(A+G+C+T)*100
AG Content (%) shows the percentage of nitrogenous bases (either adenine or guanine) on a DNA molecule. It is calculated by the following formula:
(A+G)/(A+G+C+T)*100
GC Frame Plot: this graph is similar to the GC content graph but shows the GC content of the first, second and third position independently. It is most effective in organisms with GC rich genomic sequence but it also works on all microbial sequences.
GC Deviation (G-C)/(G+C) shows the difference between the "G" content of the forward strand and the reverse strand. GC Deviation is calculated by the following formula:
(G-C)/(G+C)
AT Deviation (A-T)/(A+T) shows the difference between the "A" content of the forward strand and the reverse strand. AT Deviation is calculated by the following formula:
(A-T)/(A+T)
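The four formulas above can be evaluated per window with a short Python sketch. This is an illustration only, not UGENE's code: the function names are hypothetical, the step is taken as Window / Steps per window (the relation given in the graph settings), and returning 0 when a denominator is zero is an assumption.

```python
def windows(seq, window, steps_per_window):
    """Yield (start, subsequence) for a window sliding along the sequence.

    The step is Window / Steps per window, rounded down and at least 1.
    """
    step = max(1, window // steps_per_window)
    for start in range(0, len(seq) - window + 1, step):
        yield start, seq[start:start + window]


def _counts(window_seq):
    window_seq = window_seq.upper()
    return tuple(window_seq.count(base) for base in "ACGT")


def gc_content(window_seq):
    a, c, g, t = _counts(window_seq)
    total = a + c + g + t
    return 100.0 * (g + c) / total if total else 0.0


def ag_content(window_seq):
    a, c, g, t = _counts(window_seq)
    total = a + c + g + t
    return 100.0 * (a + g) / total if total else 0.0


def gc_deviation(window_seq):
    a, c, g, t = _counts(window_seq)
    return (g - c) / (g + c) if (g + c) else 0.0


def at_deviation(window_seq):
    a, c, g, t = _counts(window_seq)
    return (a - t) / (a + t) if (a + t) else 0.0


for start, sub in windows("GGCCAATT", window=4, steps_per_window=2):
    print(start, sub, gc_content(sub), gc_deviation(sub), at_deviation(sub))
```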
Karlin Signature Difference: dinucleotide absolute relative abundance difference between the whole sequence and a sliding window. Let:
f(XY) = frequency of the dinucleotide XY
f(X) = frequency of the nucleotide X
p(XY) = f(XY) / (f(X) * f(Y))
p_seq(XY) = p(XY) for the whole sequence
p_win(XY) = p(XY) for a window
The Karlin Signature Difference for a window is calculated by the following formula:
sum(|p_seq(XY) - p_win(XY)|) / 16
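A minimal Python sketch of the Karlin Signature Difference follows. It is not UGENE's implementation and rests on several assumptions: the differences are summed as absolute values (as the word "absolute" in the description suggests), p(XY) is read as f(XY) divided by the product f(X) * f(Y), dinucleotides are counted as overlapping pairs on one strand only, and p(XY) is taken as 0 when a nucleotide is absent. The function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"


def relative_abundance(seq):
    """Return p(XY) = f(XY) / (f(X) * f(Y)) for all 16 dinucleotides."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        return {x + y: 0.0 for x, y in product(BASES, repeat=2)}
    mono = {base: seq.count(base) / n for base in BASES}
    pairs = [seq[i:i + 2] for i in range(n - 1)]
    result = {}
    for x, y in product(BASES, repeat=2):
        f_xy = pairs.count(x + y) / len(pairs)
        denominator = mono[x] * mono[y]
        result[x + y] = f_xy / denominator if denominator else 0.0
    return result


def karlin_difference(whole_seq, window_seq):
    """Average absolute difference of p(XY) between sequence and window."""
    p_seq = relative_abundance(whole_seq)
    p_win = relative_abundance(window_seq)
    return sum(abs(p_seq[key] - p_win[key]) for key in p_seq) / 16


sequence = "ACGTACGTTTGACCA"
print(karlin_difference(sequence, sequence[:8]))
```

A window that is identical to the whole sequence gives a difference of 0.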
Informational Entropy is calculated from a table of overlapping DNA triplet frequencies. The use of overlapping triplets smooths the frame effect. Informational Entropy is calculated by the following formula:
-(triplet frequency)*log10(triplet frequency)/log10(2)
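The entropy formula above can be sketched in Python as follows. The manual gives the term for a single triplet; summing that term over all triplets found in the window is an assumption made here, as is the hypothetical function name `informational_entropy`.

```python
import math
from collections import Counter


def informational_entropy(window_seq):
    """Sum -(f) * log10(f) / log10(2) over overlapping triplet frequencies."""
    window_seq = window_seq.upper()
    triplets = [window_seq[i:i + 3] for i in range(len(window_seq) - 2)]
    if not triplets:
        return 0.0
    total = len(triplets)
    entropy = 0.0
    for count in Counter(triplets).values():
        frequency = count / total
        entropy -= frequency * math.log10(frequency) / math.log10(2)
    return entropy


print(informational_entropy("ACGTA"))   # three distinct triplets
print(informational_entropy("AAAAAA"))  # a single repeated triplet
```

A window made of one repeated triplet carries no information, while a window in which every triplet is different gives the highest value for its length.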
The following parameters are available:
Window: the number of bases in a window.
Steps per window: the number of steps in a window. The Step is calculated as Window / Steps per window.
Default color: the default color of the graph line (or of the graph lines for the GC Frame Plot).
Checking the Cutoff for minimum and maximum values checkbox enables the following settings:
Minimum: the minimum value for the cutoff.
Maximum: the maximum value for the cutoff.
Select appropriate minimum and maximum values and click the OK button to show the graph of cutoffs. The graph is divided into 2 parts. The upper part shows values greater than the specified Maximum value. The lower part of the graph shows values lower than the specified Minimum value. For example:
6.5 Dotplot
The Dotplot plugin provides a tool to build dotplots for DNA or RNA sequences. This allows you to compare these sequences graphically. Using a dotplot graphic, you can easily identify such differences between sequences as mutations, inversions, insertions, deletions and low-complexity regions. The plugin also provides advanced features: comparing multiple dotplots, navigation in a dotplot, dotplots synchronization, saving and loading a dotplot, etc. An example of a dotplot view:
Note: The Dotplot plugin uses the Repeat Finder plugin to build a dotplot, so make sure you have the Repeat Finder plugin installed. The Dotplot features are described in more detail below.
Here you should specify the File with first sequence. You should also either check the Compare sequence against itself option or select the File with second sequence. Optionally you can select to Join all sequences found in the file (for the first and/or for the second file). If you select to join the sequences you can also select the Gap size. The gap of the specified size will be inserted between the joined sequences. After you press the Next button, the dialog to configure the dotplot parameters will appear:
The following parameters are available:
X axis sequence: the sequence for the X dotplot axis.
Y axis sequence: the sequence for the Y dotplot axis.
If there are several sequences in the specified (the first or the second) file and you haven't selected to join the sequences in the previous dialog, then you can select a sequence in these fields. If you have selected to Join all sequences found in the file, then you can't select a separate sequence from the file; the joined Sequence can be selected instead.
Search direct repeats: check this option to search for direct repeats in the specified sequences. You can also select the color with which the repeats will be displayed in the picture. The default button sets the default color.
Search inverted repeats: check this option to search for inverted repeats in the specified sequences. You can also select the color with which the repeats will be displayed in the picture. The default button sets the default color.
Custom algorithm: optionally you can select an algorithm to calculate the repeats:
Auto
Suffix index
Diagonals
Note: The specified algorithm is provided to the Repeat Finder plugin as an input parameter. In most cases the Auto value is appropriate.
Minimum repeat length: allows drawing only those matches between the sequences that are continuous and long enough. For example, if it equals 3bp, then only repeats that contain 3 or more base symbols will be found. Press the 1k button to automatically adjust the Minimum repeat length value so that about 1000 repeats are found.
Repeats identity: specifies the percentage of the repeats identity. Press the 100 button to set 100% identity.
After the parameters are set, press the OK button. The dotplot will appear in the Sequence View:
Each dot on the plot corresponds to a matched base symbol at the "x" position of the horizontal sequence and the "y" position of the vertical sequence. Visible diagonal lines indicate matches between the sequences in the given region.
See also: Interpreting Dotplot: Identifying Matches, Mutations, Inversions, etc.; Building Dotplot for Currently Opened Sequence
Hold the middle mouse button and move the mouse cursor over the zoomed region of the dotplot.
Click on the desired region of the minimap in the right bottom corner.
Activate the Scroll tool, hold the left mouse button and move the mouse cursor over the zoomed region:
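The idea behind a dotplot with a Minimum repeat length can be shown with a naive Python sketch that walks every diagonal and reports exact matching runs. It is not the Repeat Finder algorithm used by UGENE: only direct repeats with 100% identity are found, the running time is quadratic, and the function name `dotplot_matches` is hypothetical.

```python
def dotplot_matches(x_seq, y_seq, min_length=1):
    """Return (x, y, length) for each maximal exact match along a diagonal
    whose length is at least `min_length`. Positions are 0-based."""
    hits = []
    # A diagonal is identified by the constant difference x - y.
    for diagonal in range(-(len(y_seq) - 1), len(x_seq)):
        x, y = max(diagonal, 0), max(-diagonal, 0)
        run = 0
        while x < len(x_seq) and y < len(y_seq):
            if x_seq[x] == y_seq[y]:
                run += 1
            else:
                if run >= min_length:
                    hits.append((x - run, y - run, run))
                run = 0
            x += 1
            y += 1
        if run >= min_length:
            hits.append((x - run, y - run, run))
    return hits


print(dotplot_matches("ACGT", "ACGT", min_length=4))
print(dotplot_matches("AAC", "AC", min_length=2))
```

Raising `min_length` removes short accidental matches, which is why only the longer diagonal lines remain visible on the plot.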
hold down the left mouse button and drag the mouse cursor over the dotplot. When you select a region on a dotplot the corresponding region is also selected in other Sequence View areas (Sequence details view , Sequence zoom view , etc.). The opposite is true as well: if you select a region in a Sequence View area, the corresponding region is also selected in the dotplot view. To zoom to the region selected click the Zoom in on the left.
To deselect the repeat either click on another repeat or hold Ctrl and click somewhere on the dotplot.
2. Frame shifts
a. Mutations
Mutations are distinctions between sequences. On the graphic they are represented by gaps in diagonal lines. They interrupt matches.
b. Insertions
Insertions are parts of one sequence that are missing in the other, while the surrounding parts match. In other words, an insertion is a subsequence that was inserted into a sequence. Graphically, insertions are represented by gaps which lie only on one axis. A little shift towards the other axis indicates a mutation involved.
c. Deletions
A deletion is a subsequence that was deleted from a sequence. A deletion from sequence A found in sequence B can be considered as an insertion into sequence B and contained in sequence A.
3. Inverted repeats
The Dotplot plugin also allows you to search for inverted repeats. Inverted repeats are shown contrary to the direct repeats. Use the Search direct repeats and Search inverted repeats options of the Dotplot parameters dialog to select which repeats to draw (the dialog is described here).
4. Low-complexity regions
A low-complexity region is a region produced by redundancy in a particular part of the sequence. It is represented on a plot as a rectangular area filled with the matches.
The parameters dialog will be re-opened. See description of the available parameters here.
The Save Dotplot dialog will appear. A dotplot is saved in a file with the *.dpt extension. Later the dotplot can be loaded using the Dotplot → Save/Load → Load context menu item.
7 Alignment Editor
7.1 Overview
This chapter gives an overview of the Alignment Editor components and explains basic concepts of browsing an alignment.
The Alignment Editor components:
Sequence area: this is the main component of the editor. It displays aligned sequences. The upper part of the Sequence area is the ruler, which shows the coordinates of the currently visible row sequences.
Consensus area: this component is situated above the Sequence area. It shows the consensus sequence for the current alignment calculated using the currently selected algorithm.
Sequence list: this component is located in the left part of the Sequence area. It shows the names of the corresponding sequences in the alignment.
Editor toolbar: the toolbar contains shortcuts for important editor actions, such as Undo/Redo, Zooming and others.
Sequence offsets: these are the offsets for the first and the last visible base for each alignment row. Note that the offset value doesn't include gaps. For example, let's assume that the coordinate of the first visible base of the row is N, but the row contains K gaps before the position N. The starting offset value will be N-K. The same rule is true for the ending offset. You can turn off the Sequence offsets by unchecking the Actions → View → Show offsets main menu item or the View → Show offsets context menu item.
Global coordinates: this component displays the coordinates of the upper left corner of the current selection. If no region is selected it shows the starting alignment point.
Alignment lock status: As in the Sequence View, this component shows whether the alignment is locked. Locked documents cannot be modified.
7.1.3 Navigation
The Sequence area provides several flexible ways to navigate through an alignment. The simplest way is to use the mouse and the scrollbars. Alternatively, you can use the arrow keys on the keyboard. The hot keys for quick navigation:

PageUp: move one screen left.
PageDown: move one screen right.
Home: move to the starting columns of the alignment.
End: move to the trailing columns of the alignment.

Hint: if you use the Shift key with the hot keys above, you will navigate through the rows. For example, Shift-PageDown will move one screen down.
Finally, you can use the Go to position dialog from the Actions menu, the context menu or the editor toolbar.
Enter the column number (base coordinate) and the view will be centered on the corresponding base.
By default, the base characters are visible when zooming. But for rather long sequences there is another zoom mode available. In this mode the bases are not shown. This allows viewing very large sequence regions (up to 500 bp).
You can zoom to the selected region by clicking the Zoom to selection button. This is very convenient when the alignment is rather large. For example, you can zoom out to some percentage, select an interesting region and then zoom to the selection. You can change the font by clicking the Change font button. To reset the zoom and the font, click the Reset zoom button.
Press the right arrow to search in the direction "From left to right, from top to bottom". Press the left arrow to search in the direction "From right to left, from bottom to top". If the pattern is found, the result will be focused and highlighted in the Sequence area. You can continue the search in any direction from this position.
7.1.7 Consensus
Each base of a consensus sequence is calculated as a function of the corresponding column bases. There are different methods to calculate the consensus. Each method reveals unique biological properties of the aligned sequences. The Alignment Editor allows switching between different consensus modes. To switch the consensus mode, activate the context menu (using the right mouse button) or the Actions menu and select the Consensus mode item.
There are several modes:

JalView (Default): based on the JalView algorithm. Returns + if there are 2 characters with high frequency. Returns a symbol in lower case if the symbol content in a row is lower than the specified threshold.
ClustalW: emulates the ClustalW program and file format behavior.
Levitsky: this algorithm was proposed by Victor Levitsky to calculate the consensus of DNA alignments. First, it collects global alignment frequencies for every symbol using the extended (15 symbols) DNA alphabet. Then, for every column, it selects the rarest symbol in the whole alignment whose percentage in the column is greater than or equal to the threshold value.
Strict: the algorithm returns the gap character (-) if the symbol frequency in a column is lower than the specified threshold.
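As an illustration of the Strict mode, here is a minimal Python sketch of a threshold-based column consensus. It is not the UGENE implementation. It assumes that the threshold is a percentage, that the most frequent symbol of a column is the candidate, and that gaps are counted like any other symbol; strict_consensus is a hypothetical name.

```python
from collections import Counter


def strict_consensus(rows, threshold=100.0, gap="-"):
    """Build a consensus: the most frequent symbol of each column,
    or the gap character when its frequency is below the threshold (%)."""
    consensus = []
    for column in zip(*rows):  # iterate over alignment columns
        symbol, count = Counter(column).most_common(1)[0]
        percent = 100.0 * count / len(column)
        consensus.append(symbol if percent >= threshold else gap)
    return "".join(consensus)


rows = ["ACGT", "ACGA", "ACTA"]
print(strict_consensus(rows, threshold=100))
print(strict_consensus(rows, threshold=60))
```

With a threshold of 100 only fully conserved columns keep their symbol; lowering the threshold lets majority symbols through.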
Extracting Selected as MSA
It is possible to extract a subalignment and save it as a new multiple sequence alignment (MSA). Select a subalignment and choose the Edit → Extract selected as MSA item in the Actions main menu or in the context menu. The following dialog appears:
Specify the name of the new MSA file in the File name field. The currently selected region is extracted by default when you press the Extract button. You can change the columns to be extracted using the From and to fields, and change the rows to be extracted by checking or unchecking the required sequences in the Selected sequences list. Use the buttons:

Invert selection: inverts the selection of the sequences.
Select all: selects all sequences.
Clear selection: clears the selection of all sequences.

The Add to project check box specifies whether to add the MSA file created from the subalignment to the active project.

Removing All Gaps
Use the Edit → Remove all gaps item in the Actions main menu or in the context menu to remove all gaps from the alignment.

Removing Selection
To remove a subalignment, select it and choose the Edit → Remove selection item in the context menu or press the Delete key.
Removing Columns of Gaps
To remove columns containing a certain number of gaps, select the Edit → Remove columns of gaps item in the context menu. The dialog appears:
There are the following options:

Remove columns with number of gaps: removes columns with a number of gaps greater than or equal to the specified value.
Remove columns with percentage of gaps: removes columns with a percentage of gaps greater than or equal to the specified value.
Remove all columns of gaps: this option is selected by default. It specifies to remove columns from the alignment if they consist entirely of gaps.

Select the required option and press the Remove button.

Filling Selection with Gaps
Select a region in the alignment and choose the Edit → Fill selection with gaps item in the context menu or press the Spacebar. The region is filled with gaps, shifting the subalignment from the region to the right.
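The three options of the Remove columns of gaps dialog can be sketched in Python as follows. This is only an illustration of the rules stated above, not UGENE code; the function remove_gap_columns and its parameters min_gaps and min_percent are hypothetical names.

```python
def remove_gap_columns(rows, min_gaps=None, min_percent=None, gap="-"):
    """Drop alignment columns according to one of three rules:

    min_gaps    - drop columns with at least this many gaps;
    min_percent - drop columns with at least this percentage of gaps;
    neither     - drop only columns that consist entirely of gaps (default).
    """
    n_rows = len(rows)
    kept = []
    for index, column in enumerate(zip(*rows)):
        gaps = column.count(gap)
        if min_gaps is not None:
            drop = gaps >= min_gaps
        elif min_percent is not None:
            drop = 100.0 * gaps / n_rows >= min_percent
        else:
            drop = gaps == n_rows
        if not drop:
            kept.append(index)
    # Rebuild every row from the columns that were kept.
    return ["".join(row[i] for i in kept) for row in rows]


rows = ["A-C-", "A-G-", "AT--"]
print(remove_gap_columns(rows))                  # default: all-gap columns only
print(remove_gap_columns(rows, min_gaps=2))      # columns with 2 or more gaps
print(remove_gap_columns(rows, min_percent=30))  # columns with 30% gaps or more
```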
You will see the Project View tree filtered to show only appropriate sequences. Select the items to add and press the Ok button.

Copying Sequences
To copy the current selection, click the Copy → Copy selection item in the Actions main menu or the context menu. The hotkey for this action is Ctrl-C. To copy one or several sequences, do the following:

Select the sequences in the Sequence list area;
Select the Copy → Copy selection context menu item in the Sequence area or use the hot key combination.

Note that if you activate the context menu in the Sequence list area, you will lose your current selection.
To copy the consensus sequence, use the Copy → Copy consensus item.

Sorting Sequences
To sort sequences by name in alphabetical order, choose the View → Sort sequences by name item from the Actions main menu or the context menu.
The file save dialog will appear where you should set the name, location and format of the picture. UGENE supports export to the PNG, TIFF and JPEG image formats.
Two methods for building phylogenetic trees are supported: 1. The PHYLIP Neighbour-Joining method. The PHYLIP package implementation of the method is used under the hood. 2. The MrBayes external tool. Check MrBayes Web Site for more details.
The following parameters are available:

Distance matrix model: the model used to compute a distance matrix. The following values are available for a nucleotide multiple sequence alignment:
  F84
  Kimura
  Jukes-Cantor
  LogDet
The following models are available for a protein alignment:
  Jones-Taylor-Thornton
  Henikoff/Tillier PMB
  Dayhoff PAM
  Kimura

Gamma distributed rates across sites: specifies to take into account unequal rates of change at different sites. It is assumed that the distribution of the rates follows the Gamma distribution.
Coefficient of variation of substitution rate among sites: becomes available if the Gamma distributed rates across sites parameter is checked. Specifies the coefficient of the distribution of the rates.
Transition/transversion ratio: expected ratio of transitions to transversions.

To enable bootstrapping, check the Bootstrapping and Consensus Trees group check box. The following parameters are available:

Number of replicates: number of replicate data sets.
Seed: random number seed. By default, it is generated automatically. You can change this value manually in order to make the results of different tree building runs reproducible. The seed must be an integer greater than zero and less than 32767, of the form 4n+1, that is, it leaves a remainder of 1 when divided by 4. Any odd number can also be used, but it may result in a random number sequence that repeats itself after less than the full one billion numbers. Usually this is not a problem.
Consensus type: specifies the method to build the consensus tree. Select one of the following:
  Strict: specifies that a set of species must appear in all input trees to be included in the strict consensus tree.
  Majority Rule (extended): specifies that any set of species that appears in more than 50% of the trees is included. The program then considers the other sets of species in order of the frequency with which they have appeared, adding to the consensus tree any which are compatible with it until the tree is fully resolved. This is the default setting.
  M1: includes in the consensus tree any sets of species that occur among the input trees more than a specified fraction of the time (see the Fraction parameter below). The Strict consensus and the Majority Rule consensus are extreme cases of the M1 consensus, being for fractions of 1 and 0.5 respectively.
  Majority Rule: specifies that a set of species is included in the consensus tree if it is present in more than half of the input trees.
Fraction: becomes available when the Consensus type parameter is set to M1. Specifies the fraction.
Save tree to file: saves the built tree to a file.

Press the Build button to build a tree with the selected parameters.
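The constraints on the Seed parameter can be checked with a few lines of Python. The sketch below only restates the rule given above (an integer between 0 and 32767, preferably of the form 4n+1, otherwise odd). The function name classify_seed and the returned labels are assumptions made for the example, and the text does not say how even values are treated, so they are merely flagged.

```python
def classify_seed(seed: int) -> str:
    """Classify a bootstrap seed according to the rule described above."""
    if not 0 < seed < 32767:
        return "out of range"   # must be greater than 0 and less than 32767
    if seed % 4 == 1:
        return "preferred"      # of the form 4n+1
    if seed % 2 == 1:
        return "odd"            # usable, but the random sequence may repeat sooner
    return "even"               # not covered by the description above


for value in (5, 7, 8, 40000):
    print(value, classify_seed(value))
```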
7.4.2 MrBayes
The Building Phylogenetic Tree dialog for the MrBayes method has the following view:
There are two steps to a phylogenetic analysis using MrBayes:
1. Set the evolutionary model.
2. Run the Markov chain Monte Carlo (MCMC) analysis.

The evolutionary model is defined by the following parameters:

Substitution model: specifies the general structure of a DNA substitution model. This parameter is available for nucleotide sequences. It corresponds to the Nst setting of MrBayes. You may select one of the following:
  JC69 (Nst=1)
  HKY85 (Nst=2)
  GTR (Nst=6)
Rate matrix (fixed): specifies the fixed-rate amino-acid model. This parameter is available for amino-acid sequences. The following models are available:
  poisson
  jones
  dayhoff
  mtrev
  mtmam
  wag
  rtrev
  cprev
  vt
  blosum
  equaline

The following parameters are common for nucleotide and amino-acid sequences:

Rate: sets the model for among-site rate variation. Select one of the following:
  equal: no rate variation across sites.
  gamma: gamma-distributed rates across sites. The rate at a site is drawn from a gamma distribution. The gamma distribution has a single parameter that describes how much the rates vary.
  propinv: a proportion of the sites are invariable.
  invgamma: a proportion of the sites are invariable while the rates for the remaining sites are drawn from a gamma distribution.
Gamma: sets the number of rate categories for the gamma distribution.

You can select the following parameters for the MCMC analysis:

Chain length: sets the number of cycles for the MCMC algorithm. This should be a big number, as you want the chain to first reach stationarity and then remain there long enough to take lots of samples.
Subsampling frequency: specifies how often the Markov chain is sampled. You can sample the chain every cycle, but this results in very large output files.
Burn-in length: determines the number of samples that will be discarded when convergence diagnostics are calculated.
Heated chains: the number of chains that will be used in Metropolis coupling. Set 1 to use the usual MCMC analysis.
Heated chain temp: the temperature parameter for heating the chains. The higher the temperature, the more likely the heated chains are to move between isolated peaks in the posterior distribution.
Random seed: a seed for the random number generator.
Save tree to file: saves the built tree to a file.

Press the Build button to run the analysis with the selected parameters and build a consensus tree.
8 Assembly Browser
The UGENE Assembly Browser project, started in 2010, was inspired by the Illumina iDEA Challenge 2011 and multiple requests from UGENE users. The main goal of the Assembly Browser is to let a user visualize and efficiently browse large next generation sequence assemblies. Currently supported formats are SAM (Sequence Alignment/Map) and BAM, which is a binary version of the SAM format. Both formats are produced by SAMtools and described in the following specification: SAMtools. Support of other formats is also planned, so please send us a request if you're interested in a certain format.
To browse assembly data in UGENE, a BAM or SAM file should be imported to a UGENE database file. After that you can convert the UGENE database file into a SAM file. The import to a UGENE database file has both advantages and disadvantages. The disadvantages are that the import may take time for a large file and there should be enough disk space to store the database file. On the other hand, this allows one to overview the whole assembly and navigate in it rather rapidly. In addition, during the import you can select the contigs to be imported from the BAM/SAM file, so there is no need to import the whole file if you're going to work only with some contigs. Note that in the future there are plans to support the other approach as well, namely, opening a BAM/SAM file directly.
The Assembly Browser has been tested on different BAM/SAM files from the 1000 Genomes Project and other sources. Read the documentation below to learn more about the Assembly Browser features.
The Source URL field in the dialog specifies the file to import. The Info button nearby can be used to obtain additional information about the file.
There is a list of contigs below the Source URL. Check the contigs that you want to import to the database. You can use the Select All, Deselect All and Invert Selection buttons to manage the selection. The Destination URL field specifies the output database file. If you check the Import unmapped reads check box, then all unmapped reads in the assembly (i.e. reads with the unmapped flag or without CIGAR) are imported. Note, however, that they are not visualized in the current UGENE version. To start the import, click the Import button in the dialog. You can see the progress of the import in the Task View. To export a UGENE database file into the SAM format, select the Actions → Export assembly to SAM format item in the main menu.
Each [as] object corresponds to an imported contig. When you double-click on an [as] object, a new Assembly Browser window with the assembly data is opened. A window for the first assembly object in the list is opened automatically after the import.
Note that for large assemblies it may take some time to calculate the overview and the well-covered regions. To see the reads, either select a region from the list or zoom in, for example, by clicking the link above the well-covered regions or by rotating the mouse wheel. You can also use the hotkeys. Tips about hotkeys are shown under the list of well-covered regions. To learn about available hotkeys refer to Assembly Browser Hotkeys.
If you hold Shift and select a region on the overview, the overview is zoomed to the selection. Note that when the Assembly Overview is in focus and you use either the zoom buttons on the toolbar, the zoom items in the Actions main menu, or a mouse wheel, the Reads Area is resized appropriately. The Assembly Overview can also be resized. To zoom in the overview, select either the Zoom in or the Zoom in 100x item in the Assembly Overview context menu. You can scroll the resized overview by dragging the mouse while pressing down the mouse wheel. To zoom out the overview, select the Zoom out item in the context menu. The Restore global overview item in the context menu restores the default overview size when the whole contig overview is shown. Notice that the Assembly Overview shows the coordinates of the assembly areas visible in the Reads Area and in the Assembly Overview:
To scroll the resized overview, drag the mouse while pressing down the mouse wheel. To learn about available hotkeys refer to Assembly Browser Hotkeys.
To show/hide the coordinates on the ruler you can click the following button on the toolbar:
To show/hide the coverage on the ruler you can click the following button on the toolbar:
Alternatively, you can use the Show coordinates and Show coverage under cursor check boxes located on the Assembly Browser Settings tab of the Options Panel .
Input the location and click the Go! button. A similar Go! field is also available on the Navigation tab of the Options Panel.
Or uncheck the Show pop-up hint check box on the Assembly Browser Settings tab of the Options Panel. The hint shows the following information about the read:

Read name
Location
Length
Cigar
Strand
Read sequence

The operations in the Cigar parameter are described as follows:

M: Alignment match (can be a sequence match or mismatch).
I: Insertion to the reference. Skipped when the read is aligned to the reference, i.e. it is not shown in the Reads Area, but is present in the read sequence.
D: Deletion from the reference. Gaps are inserted into the read when the read is aligned to the reference.
N: Skipped region from the reference. Behaves as D, but has a different biological meaning: for mRNA-to-genome alignment it represents an intron.
S: Soft clipping (clipped sequences are present in the read sequence, i.e. behaves as I).
H: Hard clipping (clipped sequences are not present in the read sequence).
P: Padding (silent deletion from padded reference).
=: Exact match to the reference.
X: Reference sequence mismatch.

To copy the information about the read to the clipboard, select the Copy read information to clipboard item in the Reads Area context menu. Now you can paste it in any text editor. To copy the current position of the read, select the Copy current position to clipboard item in the Reads Area context menu.
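The effect of the Cigar operations on how a read is displayed can be sketched in Python. The code below is an illustration of the rules listed above, not the Assembly Browser implementation. It assumes a well-formed CIGAR string and "-" as the gap character, and treats P like H (nothing is drawn and no read bases are consumed); render_read is a hypothetical name.

```python
import re

# One CIGAR element: a length followed by an operation code.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def render_read(cigar: str, read: str, gap: str = "-") -> str:
    """Return the read as it would line up against the reference."""
    shown = []
    pos = 0  # current position in the read sequence
    for length, op in CIGAR_ELEMENT.findall(cigar):
        n = int(length)
        if op in "M=X":    # match or mismatch: bases are shown
            shown.append(read[pos:pos + n])
            pos += n
        elif op in "IS":   # insertion or soft clipping: present in the read, not shown
            pos += n
        elif op in "DN":   # deletion or skipped region: gaps are inserted
            shown.append(gap * n)
        # H and P: nothing is shown and no read bases are consumed
    return "".join(shown)


# 2 soft-clipped bases, 3 matches, 1 insertion, 2 matches, 2 deleted bases, 2 matches
print(render_read("2S3M1I2M2D2M", "TTACGTAACC"))
```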
Strand direction highlights reads located on the direct strand in blue and reads on the complement strand in green.
Paired reads highlights all paired reads in green. Note that the information about the pair is shown in the hint.
To remove the association, select the Unassociate item in the Reference Area context menu.
To choose a consensus algorithm, select the Consensus algorithm item either in the context menu of the Consensus Area, in the context menu of the Reads Area or on the Assembly Browser Settings tab of the Options Panel. The following algorithms are currently available:

Default: shows the most common nucleotide at each position. When there are equal numbers of different nucleotides in a position, the resulting consensus nucleotide is selected randomly from these nucleotides.
SAMtools: uses an algorithm from the SAMtools Text Alignment Viewer to build the consensus sequence. The algorithm takes into account quality values of reads and nucleotides and works with the extended nucleotide alphabet.

To leave only the differences between the reference and the consensus sequences highlighted on the consensus sequence, select the Show difference from reference item in the context menu of the Consensus Area or the Difference from reference item on the Assembly Browser Settings tab of the Options Panel:
To export a Consensus Sequence, right-click on it in the Consensus Area and select the Export → Export consensus item in the context menu. For more information about consensus exporting, see Exporting Consensus.
8.7 Exporting
8.7.1 Exporting Read
To export a read, right-click on it in the Reads Area and select the Export → Current read item in the context menu. The Export Reads dialog appears:
Select a file to export the read to and the file format. The read can be exported either to a FASTA or a FASTQ file. When the parameters are set, click the Export button. The read is exported to the file and, if the Add to project check box has been checked, it is added to the current project, from where you can open it.
Select a file and the file format. The consensus can be exported to a FASTA, FASTQ, GFF or GenBank file. Modify, if required, the exported sequence name and choose the consensus algorithm. The consensus is exported with gaps if the Keep gaps check box has been checked. You can also select the region to export. It can be either a Whole sequence, a Visible region, or a Custom region. When all the parameters are set, click the Export button. The consensus sequence is exported to the file and, if the Add to project check box has been checked, it is added to the current project and opened.
In the dialog you can select the image file name and its format (bmp, jpeg, png, etc.). For some file formats the Quality parameter also becomes available. When the parameters are set, click the OK button.
To learn more about well-covered regions refer to the Assembly Browser Window chapter. To learn more about searching required position refer to the Go to Position in Assembly chapter.
To learn more about Reads Area settings refer to the Reads Area Settings chapter. To learn more about Consensus see the Consensus Sequence chapter. To learn more about Ruler see the Browsing and Zooming Assembly chapter.
Hotkey                 Action
wheel                  Zoom the Reads Area
double-click           Zoom in the Reads Area
+/-                    Zoom in / zoom out the Reads Area
click + move mouse     Move the Reads Area
arrow                  Move one base in the corresponding direction in the Reads Area
Ctrl + arrow           Move one page in the corresponding direction in the Reads Area
Page Up / Page Down    Move one page up / down in the Reads Area
Home / End             Move to the beginning / end of the assembly in the Reads Area
Ctrl+G                 Focus to the Go to position field on the toolbar
To load a tree from a file, follow the instructions described in the Opening Document paragraph. For example, you may open the $UGENE\data\samples\Newick\COI.nwk sample file provided within the UGENE package. To build a tree from a multiple sequence alignment, see the Building Phylogenetic Tree paragraph. To learn what you can do with a tree using the UGENE Phylogenetic Tree Viewer, read the documentation below.
In the dialog you can tune the width of the tree. If the tree layout is set to rectangular you can tune the height of the tree also. And you can select the tree view: Phylogram Cladogram
Here you can select the color and the line width of the tree branches. Note that when a clade has been selected the branch settings are applied to the clade only.
Here you can select color, font, size and attributes (bold, italic, etc.) of the labels. Note that when a clade has been selected the labels formatting settings are applied to the clade only.
You can see that the corresponding branches are highlighted. To select several clades at the same time hold the Shift key and click on the root nodes of the clades.
To show the collapsed clade, select the Expand item in the node's context menu.
10 Distributed Computing
Distributed computing can notably increase the performance of computational tasks by distributing the task data among computational units. However, distributed computing requires complex solutions: specialized versions of algorithms, network communication, etc. The Unipro UGENE project provides advanced distributed computing capabilities. Despite the complexity of the internal structure, for users running a computational task on a remote machine is as easy as running it on a local machine.
Starting with version 1.7.2 UGENE supports cloud computing. For example, computational workflows can be launched on the Amazon EC2 cloud. For details, see the following documentation section: Running Workflows on Cloud.
There are also several distributed algorithms that can be executed on a remote machine:

HMMER3 search
Smith-Waterman search
Muscle3 align
The Remote machines monitor allows you to add, remove or modify remote machines.
To add a new remote machine, click the Add button. In the dialog that appears, select the protocol and fill in the other required fields:
Modifying a remote machine is as simple as adding a new one. Just select the machine and click the Modify button. To remove a remote machine from the monitor, select the machine in the table and click the Remove button. You can Ping a remote machine to check if it's still alive and UGENE is still running there. Some network protocols (for example, the direct socket protocol) can scan the local network. To search for running UGENE instances through such protocols, click the Scan button. You can also use one of the public UGENE machines to run your tasks. To add public machines to the monitor, click the Get public machines button.
Check the cloud machine status in the Remote machines monitor. If the session has been initialized successfully, the green tick is highlighted in the Ping column. If there is no green tick, check the Log View for details about the problem.
Run the schema, e.g. by clicking the corresponding Workflow Designer toolbar button. The Remote machines monitor dialog will appear. Select the remote machine that represents the EC2 service and click the Run button.
In the HMM3 search dialog that appears, fill in the required parameters and click the Remote run... button.
The Remote machines monitor dialog will appear. You can also add, remove or modify remote machines here. Select a machine to run the task on and click the Run button. Note that only 1 machine can be selected in the current version of UGENE. That's all. After the task is finished you will see the task report in the Task View.
You will see the Smith-Waterman search dialog. Fill in the required fields and click the Remote run button.
The Remote machines monitor dialog will appear. You can also add, remove or modify remote machines here. Select a machine to run the task on and click the Run button. Note that only 1 machine can be selected in the current version of UGENE. That's all. After the task is finished you will see the task report in the Task View.
You will see the Align with MUSCLE dialog. Fill in the required fields and click the Remote run button.
The Remote machines monitor dialog will appear. You can also add, remove or modify remote machines here. Select a machine to run the task on and click the Run button. Note that only 1 machine can be selected in the current version of UGENE. That's all. After the task is finished you will see the task report in the Task View.
11 Plugins
To learn more about the Workflow Designer, read the Workflow Designer Manual (follow the link on the UGENE documentation page).
Using this dialog you can search for DNA sequence regions that contain every annotation from the list on the left side. The found regions are displayed on the right side of the dialog. Use the Save regions as annotations... button to store the regions as new annotations to the sequence.
The calculation is made for overlapping windows along a given sequence. If there are two or more consecutive windows with an average flexibility threshold (in each window) greater than the specified Threshold parameter, such an area is marked by an annotation. The average threshold in a window is calculated by the following formula:
(average window threshold) = (sum of flexibility angles in the window) / (the window size - 1)
The following flexibility angles are used during the calculation:

Dinucleotide   Angle      Dinucleotide   Angle
AA             7.6        CA             14.6
AC             10.9       CC             7.2
AG             8.8        CG             11.1
AT             12.5       CT             8.8
GA             8.2        TA             25
GC             8.9        TC             8.2
GG             7.2        TG             14.6
GT             10.9       TT             7.6

A minimum value is used when an N character is present in a dinucleotide:

CN, NC, GN, NG, NN: 7.2
AN, NA, TN, NT: 7.6
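The window formula and the angle table above can be combined in a short Python sketch. It is an illustration only, not the plugin code: the names dinucleotide_angle and window_averages and the step parameter are assumptions, and only the symbols A, C, G, T and N are handled.

```python
# Flexibility angles for the 16 dinucleotides, taken from the table above.
ANGLES = {
    "AA": 7.6, "AC": 10.9, "AG": 8.8, "AT": 12.5,
    "CA": 14.6, "CC": 7.2, "CG": 11.1, "CT": 8.8,
    "GA": 8.2, "GC": 8.9, "GG": 7.2, "GT": 10.9,
    "TA": 25.0, "TC": 8.2, "TG": 14.6, "TT": 7.6,
}


def dinucleotide_angle(pair: str) -> float:
    """Return the angle of a dinucleotide, using the minimum values for N."""
    if "N" in pair:
        # AN, NA, TN, NT -> 7.6; CN, NC, GN, NG, NN -> 7.2
        return 7.6 if ("A" in pair or "T" in pair) else 7.2
    return ANGLES[pair]  # raises KeyError for symbols other than A, C, G, T


def window_averages(sequence: str, window: int, step: int = 1) -> list:
    """Average flexibility of each window:
    (sum of the angles in the window) / (window size - 1)."""
    if window < 2:
        raise ValueError("the window must contain at least two bases")
    sequence = sequence.upper()
    averages = []
    for start in range(0, len(sequence) - window + 1, step):
        part = sequence[start:start + window]
        total = sum(dinucleotide_angle(part[i:i + 2]) for i in range(window - 1))
        averages.append(total / (window - 1))
    return averages


# "TAT" contains TA (25) and AT (12.5): (25 + 12.5) / (3 - 1) = 18.75
print(window_averages("TATA", 3))
```

A window of size w contains w - 1 overlapping dinucleotides, which is why the sum is divided by the window size minus one.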
Once the Search button has been pressed, the annotations for the regions of high DNA flexibility are created.
Note: Using the DNA Graphs Package you can see the flexibility graph of a DNA sequence.
The Export sequences as alignment dialog will appear where you can specify the location of the resulting alignment file, select a multiple alignment file format and optionally add the created document to the current project:
Here you can select the location of the result file and a sequence file format (FASTA, Genbank or FASTQ). It is also possible to specify whether to merge the exported sequences into a single sequence or to store them as separate sequences. You can choose to add the newly created document to the current project. If you merge the sequences, you're allowed to select the Gap length. This is the length of the region inserted between the sequences; it is filled with N symbols for nucleic sequences or X symbols for protein sequences.
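The effect of the Gap length setting when sequences are merged can be sketched as follows. This is a minimal illustration, not UGENE code; merge_sequences and its parameters are hypothetical names.

```python
def merge_sequences(sequences, gap_length, protein=False):
    """Join sequences into one, separated by gap_length filler symbols:
    N for nucleic sequences, X for protein sequences."""
    filler = ("X" if protein else "N") * gap_length
    return filler.join(sequences)


print(merge_sequences(["ACGT", "TTAA"], 3))             # nucleic sequences
print(merge_sequences(["MK", "LV"], 2, protein=True))   # protein sequences
```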
Here it is possible to specify the result file location, to select a sequence file format, to define whether to keep or remove gaps ("-" chars) in the aligned sequences and optionally to add the created document to the current project.
The Export selected sequence region dialog will appear which is similar to the Export sequences dialog described above.
The Export sequence of selected annotations dialog will appear, which is similar to the Export sequences dialog described above.
Here you can set the path to the CSV file and optionally save the sequence along with the annotations.
Here is a brief description of the options that can be set in the dialog:

Profile mode (Counts/Percents): select Percents to have the scores shown as percents in the report.
Show scores for gaps: check this item if you want statistics for the gap characters (-) to be shown in the report.
Show scores for symbols not used in alignment: if a symbol is not used in the alignment at all, it won't be shown in the report. Check this item to have all symbols of the alignment alphabet reported.
Skip gaps in consensus position increments: configures the consensus ruler. If checked, the gaps in the consensus will not lead to ruler increments.
Save profile to file: allows saving the profile to a file in the HTML or CSV format. The CSV format is convenient for further processing in worksheet editors like Excel.

The resulting profile in the HTML mode:
The following search settings are available:

Min length: ORFs with a length lower than the Min length value will not be found.
Must terminate within region: this option ignores boundary ORFs located beyond the search region.
Must start with init codon: this item switches the ORF Marker algorithm to the mode in which any non-stop amino acid code is interpreted as a region start position.
Allow overlaps: allows alternative (downstream) initiators, when another start codon is located within a longer ORF, i.e. all possible ORFs will be found, not only the longest ones.
Allow alternative init codon: this option includes ORFs starting with alternative initiation codons, according to the current translation table.
Include stop codon: includes stop codons in the resulting annotations.

The other available parameters are:

DNA-to-Amino translation table: defines the way start, alternative start and stop codons are encoded.
Strand: where to search for the ORFs: in the direct strand, in the complement strand or in both strands.
Preview: allows previewing the regions, strands and lengths of the found ORFs.
Clear results: becomes available when some results have been found; clears these results.

Results: When the search parameters have been selected and the OK button has been pressed in the dialog, auto-annotating becomes enabled. In the Annotations editor the ORF annotations can be found in the Auto-annotations\orf group. After the search has finished you can browse the results, sort them by length, strand or start position and save them as annotations to the original sequence in the Genbank format.
The following dialog will appear where you can choose the search options:
General options are:

Select search: in the remote databases, the blastn search is used for nucleotide sequences, and the blastp and cdd searches are used for amino sequences. UGENE also provides a way to use the blastp and cdd searches for nucleotide sequences. This is achieved by translating the nucleotide sequence into amino sequences. When a sequence is translated, the translation table from the active Sequence View is used. Finally, all 6 translations are used to query the remote database with the selected blastp or cdd search.
Expectation value: this option specifies the statistical significance threshold for reporting matches against database sequences. Lower expect thresholds are more stringent, leading to fewer chance matches being reported.
Max hits: the maximum number of hits that will be shown (not equal to the number of annotations).
Database: the target database.
Search for short, nearly exact matches: automatically adjusts the word size and other parameters to improve results for short queries.
Megablast: select this option to compare the query with closely related sequences. It works best if the target percent identity is 95% or more, but it is very fast.
Search timeout: sometimes a database doesn't respond, so the request has to be repeated. This option sets the time that will be spent on repeated requests to the database.

Note that for long sequences the time for request preparation increases and the search takes several minutes.
There is also the Advanced options tab:
160
The view of the Advanced options tab depends on the selected search. For the blastn search it looks as shown in the picture above.
Word size: the size of the subsequence parameter for the initiated search.
Gap costs: the costs to create and extend a gap in an alignment. Increasing the Gap costs results in alignments with fewer gaps.
Match/Mismatch scores: the reward and penalty for matching and mismatching bases.
Filters: filters for regions of low compositional complexity and for repeat elements of the human genome.
Masks for lookup table only: this option masks only for purposes of constructing the lookup table used by BLAST, so that no hits are found based upon low-complexity sequence or repeats (if the repeat filter is checked).
Mask lower case letters: with this option selected you can cut and paste a FASTA sequence in upper case characters and denote the areas you would like filtered with lower case.
When the blastp search is selected in the general options, the view of the Advanced options tab is the following:
As you can see, there is no Match scores option, but there are Matrix and Service options.
Matrix: a key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a score for aligning any possible pair of residues.
Service: the blastp service to be performed: plain, psi or phi.
The Advanced options tab is not available when the cdd search is selected.
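The six-frame translation used above for blastp and cdd searches of nucleotide sequences can be sketched as follows. This is only an illustration, not UGENE's implementation: it assumes the standard genetic code (UGENE takes the table from the active Sequence View), and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
# Assumption: UGENE would use the table selected in the active Sequence View.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return the translations of frames +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Each of the six resulting amino acid sequences would then be submitted as a separate query to the selected protein search.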
A dialog appears that allows you to specify the repeat parameters and the annotations table document to save the results into:
The dialog's status line displays the approximate number of repeats that will be found with the current settings. The Advanced tab provides additional repeat finding options:
The found repeats are saved and displayed as annotations to the DNA sequence:
The dialog parameters:
Tandem preset: specify the tandem repeat parameters with predefined values by selecting one of the available presets:
Min period, Max period: the minimum and maximum acceptable repeat length measured in base symbols.
Region to process: specify the region to search in: the whole sequence, a custom region or the region of the current selection (if any).
Save annotation(s) to: specify the existing or new annotations table file to save the resulting annotations into.
Annotation parameters: you can change the default group name and annotation name values of the resulting annotations.
Additional search options can be found in the Advanced tab:
Algorithm: allows you to select the search algorithm. The default, fast one is the optimized suffix array algorithm.
Minimum tandem size: sets the limit on the minimum acceptable length of the tandem, i.e. the minimum total length of all repeats of the searched tandem.
Minimum repeat count: the minimum number of repeats in a searched tandem.
Show overlapped tandems: check this option if the plugin should search for overlapped tandems, otherwise keep it unchecked.
Tandem Repeats Search Result
An example of the search results for the micro-satellite preset:
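To make the relationship between the period, the repeat count and the tandem size concrete, here is a small sketch of a tandem repeat scan. It is a naive quadratic scan written for illustration, not the suffix array algorithm the plugin uses, and the function name and default values are assumptions.

```python
def find_tandems(seq, min_period, max_period,
                 min_repeat_count=3, min_tandem_size=9):
    """Return (start, length, period, repeats) for each tandem found.

    A tandem with period p is a run in which every base equals the base
    p positions later. Its total length must reach min_tandem_size and
    it must contain at least min_repeat_count full copies of the unit.
    """
    hits = []
    n = len(seq)
    for period in range(min_period, max_period + 1):
        i = 0
        while i + period <= n:
            j = i
            # Extend the run while the sequence repeats with this period.
            while j + period < n and seq[j] == seq[j + period]:
                j += 1
            length = j - i + period      # total length of the tandem
            repeats = length // period   # number of complete repeat units
            if repeats >= min_repeat_count and length >= min_tandem_size:
                hits.append((i, length, period, repeats))
                i = j + period           # continue after this tandem
            else:
                i += 1
    return hits
```

For example, scanning "GGACTACTACTACTTT" with a period of 3 reports one tandem of four ACT units, 12 bases long, starting at offset 2. Tandems found with different periods may overlap in this sketch.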
Alternatively, select either the Actions ‣ Analyze ‣ Find restriction sites item in the main menu or the Analyze ‣ Find restriction sites item in the context menu. The Find restriction sites dialog appears:
You can see the list of restriction enzymes that can be used to search for restriction sites. The information about the enzymes was obtained from the REBASE database. For each enzyme in the list a brief description is available (the accession ID in the database, the recognition sequence, etc.). If you are online, you can get more detailed information about a selected enzyme by clicking the REBASE Info button.
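The idea behind the search can be sketched as a scan for an enzyme's recognition sequence on both strands. This is a simplified illustration with hypothetical function names: it handles exact recognition sequences only (no ambiguity codes, no cut site offsets) and does not reflect how the plugin reads REBASE data.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_sites(seq, recognition):
    """Return (position, strand) for each occurrence of the recognition site.

    '+' means the site reads 5'->3' on the direct strand, '-' means it
    reads 5'->3' on the complement strand. A palindromic site occupies
    both strands at the same position and is reported once as '+'.
    """
    seq = seq.upper()
    site = recognition.upper()
    site_rc = reverse_complement(site)
    hits = []
    for i in range(len(seq) - len(site) + 1):
        window = seq[i:i + len(site)]
        if window == site:
            hits.append((i, "+"))
        elif window == site_rc:
            hits.append((i, "-"))
    return hits
```

For instance, the palindromic EcoRI site GAATTC in "TTGAATTCAA" is reported once at offset 2, while an arbitrary non-palindromic site such as GGTCTC is found on the complement strand of "AAGAGACCTT".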
11.9.6 Results
When at least one enzyme has been selected and the OK button has been pressed in the dialog, auto-annotating becomes enabled. In the Annotations editor the Restriction Sites annotations can be found in the Auto-annotations\enzyme group. The direct and complement cut site positions are visualized as triangles on an annotation in the Sequence details view:
On the Restriction Sites tab of the dialog you can see the name of the molecule, the list of restriction enzymes found during the restriction analysis that can cut the molecule and the list of enzymes selected to perform the digestion. To digest the sequence into fragments you should select at least one enzyme. To move an enzyme to the Selected enzymes list click on it in the Available enzymes list and press the Add button. Note that you can select several items in a list by holding the Ctrl key while clicking on the items. To select all available enzymes press the Add All button. To remove enzymes from the Selected enzymes list select them in the list and press the Remove button. To remove all items from the Selected enzymes list press the Clear Selection button.
On the Conserved Annotations tab of the dialog you can select the annotations that must not be disrupted during cloning. On the Output tab of the dialog you can select the file to save the new molecule to. As soon as the required parameters are selected press the OK button. The fragments will be saved as annotations. All the generated fragments are also available in the task report:
If a region has been selected you can choose to create the fragment from this region. Otherwise you can either choose to create the fragment from the whole sequence or choose the Custom item and input a custom region. To add a 5' overhang to the direct strand check the Include Left Overhang check box and input the required nucleotides. To add a 5' overhang to the reverse strand, in addition to the described steps select the Reverse-complement item in the same group box. Similarly, to add a 3' overhang check the Include Right Overhang check box, input the required overhang and select either the direct or the reverse-complement strand. On the Output tab of the dialog you can optionally modify the annotations output settings. Finally, press the OK button to create the fragment. The fragment will be saved as an annotation.
Available Fragments
All the fragments available in the current project are shown in the Available fragments list. You can automatically create a fragment from a DNA molecule in the current UGENE project. To do so, click the From Project button. The Select Item dialog appears with the available sequence objects. Select a sequence and press the OK button. Then create a fragment in the Create DNA Fragment dialog that appears, as described in the Creating Fragment paragraph. The fragment created from the sequence appears in the list of available fragments.
Fragments of the New Molecule
The next step is to add the required fragments to the new molecule contents. To add fragments select them in the list of available fragments and click the Add button. To add all the fragments click the Add All button.
Changing Fragments Order in the New Molecule
To change the order of fragments in the new molecule select a fragment in the new molecule contents list and click either the Up or the Down button to move the fragment in the corresponding direction.
Removing Fragment from the New Molecule
To remove a fragment from the new molecule select it in the new molecule contents list and click the Remove button. To remove all the fragments click the Clear All button.
Editing Fragment Overhangs
To edit a fragment's overhangs select the fragment in the new molecule contents list and click the Edit button. The Edit Molecule Fragment dialog appears:
Here you can select the type of each DNA end and even input a custom overhang. The changes you have made are shown in the Preview area of the dialog. To confirm the changes and close the dialog click the OK button.
Reverse Complement a Fragment
To reverse complement a fragment check the Inverted check box for the fragment in the new molecule contents list.
Other Construction Options
To save the fragments of the new molecule as annotations check the Annotate fragments in new molecule check box.
To make all DNA ends blunt check the Force "blunt" and omit all overhangs check box. All overhangs will be cut in this case. Check the Make circular check box to make the new molecule circular.
Output
On the Output tab of the dialog you can select the file to save the new molecule to. The molecule is opened by default as soon as it is created. To modify this behavior uncheck the Open view for new molecule check box on the same tab. To save the molecule file to the hard disk immediately after it is created check the Save immediately check box. Otherwise it will be stored in memory until you save or remove it.
It supports the following options:
Algorithm: you can choose the preferred algorithm. Currently, the GORIV and PsiPred algorithms are available.
Range start / Range end: select the sequence range for prediction.
Results: visual representation of the prediction results, for example:
Save as annotation: select this button to save the results as annotations of the current protein sequence.
11.12 SITECON
SITECON is a program package for recognition of potential transcription factor binding sites based on data about conservative conformational and physicochemical properties revealed by analysis of sets of binding sites. To cite SITECON use the following article: "Oshchepkov D.Y., Vityaev E.E., Grigorovich D.A., Ignatieva E.V., Khlebodarova T.M. SITECON: a tool for detecting conservative conformational and physicochemical properties in transcription factor binding site alignments and for site recognition. //Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W208-12." The UGENE version of SITECON provides a tool for recognition of potential binding sites for over 90 types of transcription factors. It also provides a tool for recognition of potential binding sites based on a site alignment supplied by the user. For the detailed method description see the original SITECON site. Data about the context-dependent conformational and physicochemical properties used are available in the PROPERTY Database.
The regions found by the SITECON algorithm can be saved as annotations to the DNA sequence in the Genbank format. Every SITECON profile supplied with UGENE contains complete information about the calibration settings provided to the UGENE team by the author of SITECON. The original TFBS alignments used to calculate the profiles can be requested directly from the author of SITECON.
Table 11.1 continued from previous page

Name          Description
Nrf2          Nuclear factor (erythroid-derived 2)-like 2
Oct-1         Octamer transcription factor 1
Oct_all       Octamer transcription factors
p53           Protein 53
PPRF          Paramedian pontine reticular formation
Pu1           Is a protein that in humans is encoded by the SPI1 gene
setCREB       cAMP response element-binding
setCREBzag    cAMP response element-binding
SRE_san       Serum response element
SRF           Serum response factor
STAT1         Signal Transducer and Activator of Transcription 1
STAT          Signal Transducer and Activator of Transcription
TTF1          Thyroid transcription factor 1
USF           Upstream stimulatory factors
yy1           Is a protein that in humans is encoded by the YY1 gene
Prokaryotic

Name          Description
AgaR          N-acetylgalactosamine repressor, AgaR, negatively controls the expression of the aga gene cluster
AgaC          AgaC is the Enzyme IIC domain of a predicted N-acetylgalactosamine-transporting PEP-dependent phosphotransferase system
ArcA          ArcA transcriptional dual regulator
ArgR          ArgR complexed with L-arginine represses the transcription of several genes involved in biosynthesis and transport of arginine, transport of histidine, and its own synthesis and activates genes for arginine catabolism.
              DNA-binding response regulator in two-component regulatory system with CpxA
              cAMP receptor protein
              Cysteine B
              Cytidine Regulator
              Deoxyribose Regulator
              DnaA is the linchpin element in the initiation of DNA replication in E. coli.
              Fatty acid degradation Regulon
              Factor for inversion stimulation
              Operon that encodes two transcriptional regulators
              FNR is the primary transcriptional regulator that mediates the transition from aerobic to anaerobic growth through the regulation of hundreds of genes.
Table 11.2 continued from previous page

Name          Description
Frur          Fructose repressor
FUR           Ferric Uptake Regulation
GALR          Galactose repressor
GALS          Galactose isorepressor
GLPR          sn-Glycerol-3-phosphate repressor
GNTP          Is a member of the GntP family transporters
HNS           Histone-like nucleoid structuring protein
ICLR          Isocitrate lyase Regulator
IHF           Integration host factor
ISCR1         Iron-sulfur cluster Regulator 1
ISCR3         Iron-sulfur cluster Regulator 3
LEXA          LexA represses the transcription of several genes involved in the cellular response to DNA damage or inhibition of DNA replication
Lrp           Leucine-responsive regulatory protein
MALT          Maltose regulator
MARA          Multiple antibiotic resistance
MELR          Melibiose regulator
MEtJ          MetJ represses the expression of genes involved in biosynthesis and transport of methionine
MetR1         MetR participates in controlling several genes involved in methionine biosynthesis [Weissbach91] and a gene involved in protection against nitric oxide
MLC           DgsA, better known as Mlc, "makes large colonies," is a transcriptional dual regulator that controls the expression of a number of genes encoding enzymes of the Escherichia coli phosphotransferase (PTS) and phosphoenolpyruvate (PEP) systems
MODE          Molybdate-responsive transcription factor
NAC           Nitrogen assimilation control
NAGC_new2     N-acetylglucosamine
NANR          N-acetyl-neuraminic acid regulator
NARL2         Nitrate/nitrite response regulator NarL
NARL          Nitrate/nitrite response regulator NarL
NARP          Nitrate/nitrite response regulator NarP
NIRC          NirC is a nitrite transporter which is a member of the FNT family of formate and nitrite transporters
OmpC          OmpC is a member of the GMP family
OxyR          Oxidative stress regulator
PHOB          PhoB is a dual transcription regulator that activates expression of the Pho regulon in response to environmental Pi
PHOP          Member of the two-component regulatory system phoQ/phoP involved in adaptation to low Mg2+ environments and the control of acid resistance genes
Table 11.2 continued from previous page

Name          Description
PurR          PurR dimer controls several genes involved in purine nucleotide biosynthesis and its own synthesis
RcsB_1        Regulator capsule synthesis B
RcsB_2        Regulator capsule synthesis B
Rob2          Right origin-binding protein
ROB           Right origin-binding protein
soxS          SoxS is a dual transcriptional activator and participates in the removal of superoxide and nitric oxide and protection from organic solvents and antibiotics
TORR          TorR response regulator
TRPR          Tryptophan (trp) transcriptional repressor
TyrR          Tyrosine repressor
First of all you need to specify the pattern to search for. The remaining parameters are optional:
Search in: select whether to search in the sequence or in its translation.
Strand: select the strand to search in: direct, complementary or both strands.
Range: specifies the region of the sequence that will be searched for the pattern. By default, if a subsequence was selected when the dialog was opened, the selected subsequence is searched for the pattern. Otherwise, the whole sequence is used. You can also input a custom range.
Algorithm version: the version of the algorithm implementation. Non-classic versions produce the same results as the classic one, but much faster. To use these optimizations your system must support the corresponding capabilities. The available versions are: Classic 2, SSE2, CUDA, OPENCL.
Scoring matrix: can be chosen from the set of matrices supplied with UGENE. To view the selected matrix click the View button.
Gap open: the penalty for opening a gap.
Gap extension: the penalty for extending a gap.
Report results: a simple heuristic that allows you to filter intersecting hits. If it is set to none, the algorithm may report a large set of almost identical results in the same region.
Minimal score: another simple heuristic, which measures sequence similarity as a percentage. It is more convenient than using abstract scores. If set to 100%, the algorithm will search for an exact substring match.
The results of the search are saved as annotations. To set the annotation parameters (Annotation name, Group name, a file to save the annotations to) go to the Input and output tab of the dialog.
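The Minimal score percentage can be understood as the best local alignment score relative to the score the pattern would get against a perfect copy of itself. The sketch below illustrates that reading with a basic Smith-Waterman scorer. It is a simplification and an assumption about the exact definition: it uses a fixed match/mismatch score instead of a scoring matrix and a single linear gap penalty instead of separate gap open and gap extension penalties, and the function names are hypothetical.

```python
def sw_score(pattern, text, match=1, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score of pattern against text."""
    prev = [0] * (len(text) + 1)
    best = 0
    for p in pattern:
        cur = [0]
        for j, t in enumerate(text, start=1):
            diagonal = prev[j - 1] + (match if p == t else mismatch)
            # Local alignment: a cell never drops below zero.
            cur.append(max(0, diagonal, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def percent_of_max(pattern, text, match=1, mismatch=-1, gap=-2):
    """Score as a percentage of the pattern's score against itself."""
    max_score = match * len(pattern)
    return 100.0 * sw_score(pattern, text, match, mismatch, gap) / max_score
```

With these illustrative scores, the pattern ACGT scores 100% against "TTACGTTT", where it occurs exactly, and 50% against "TTACCTTT", where one base differs.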
11.14 HMM2
The HMM2 plugin is a toolkit based on Sean Eddy's HMMER2 package. While working on this plugin we were guided by the following principles:
Make the HMMER2 tools accessible to a wider user audience by providing a graphical interface for all supported utilities on most platforms.
Be compatible with the original HMMER2 package.
Create a high-performance solution utilizing modern multi-core processors and SIMD instructions.
The current version of UGENE provides a user interface for three HMM2 tools: HMM build, HMM calibrate and HMM search. In the original program the corresponding commands are: hmmbuild, hmmcalibrate and hmmsearch. To access these tools select the Tools ‣ HMMER2 tools submenu of the program main menu:
We highly recommend reading the original HMMER2 documentation to learn how to use the utilities provided by the plugin. Note: The SSE2 algorithm was implemented by Leonid Konyaev, Novosibirsk State University. Use of the SSE2-optimized version of the HMM search algorithm on a quad-core CPU gives a >30x performance boost compared with the original single-threaded algorithm (single sequence mode).
Note: The HMM build tool does not automatically calibrate a profile. Use the HMM calibrate tool to calibrate the profile.
The search results are stored as sequence annotations in the Genbank file format. Warning: All HMM2 UGENE tools work only with files that contain a single HMM model.
11.15 HMM3
The HMM3 plugin is a toolkit based on Sean Eddy's HMMER3 package. While working on this plugin we were guided by the following principles:
Make the HMMER3 tools accessible to a wider user audience by providing a graphical interface for all supported utilities on most platforms.
Be compatible with the original HMMER3 package.
Create a high-performance solution utilizing modern multi-core processors.
The current version of UGENE provides a user interface for three HMM3 tools: HMM3 build, HMM3 search and Phmmer search. In the original program the corresponding commands are: hmmbuild, hmmsearch and phmmer. To access these tools select the Tools ‣ HMMER3 tools submenu of the program main menu:
We highly recommend reading the original HMMER3 documentation to learn how to use utilities provided by the plugin.
The HMM3 configuration dialog provides an easy way to set appropriate search parameters. Here you can see the effective weighting strategies options:
For example, the reporting thresholds options can be configured using the dialog:
The search results are stored as sequence annotations in the Genbank file format.
Warning: The HMM3 search works only with files that contain a single HMM model.
You can set options of the Phmmer search by choosing the needed dialog tab. Here you can see the e-value calibration options:
11.16 uMUSCLE
UGENE contains a graphical port of Robert C. Edgar's MUSCLE tool for multiple alignment. Note: MUSCLE4 is not supported since UGENE version 1.7.2. The package is integrated completely, so there is no need for extra files to use it. It is possible to run several multiple alignment tasks in parallel, check the progress and cancel the running tasks safely. Note: The k-mer clustering part of the MUSCLE algorithm was optimized for multicore systems by Timur Tleukenov, Novosibirsk State Technical University.
The dialog contains the list of MUSCLE modes: MUSCLE default, Large alignment, Refine only.
Warning: By default UGENE does not rearrange sequence order in an alignment, but the original MUSCLE package does. To enable sequence rearrangement uncheck the Do not re-arrange sequences (-stable) option in the dialog.
One of the improvements to the original MUSCLE package is the ability to align only a part of the model. When the Column range item is selected, only the region of the specified columns is passed to the MUSCLE alignment engine. The resulting alignment is inserted into the original one with gaps added or removed on the region boundaries. Note: To visually select the column range to align, make a selection in the alignment editor first. Then invoke the MUSCLE plugin. Its column range boundary values will automatically match the given selection.
There are two gap columns inserted into the source profile, and two gap columns inserted into the added one. Therefore the columns of the profiles are kept intact and the alignments have not been changed. Note: By aligning a profile to the active alignment you modify the original alignment file, since it will contain 2 profiles after the operation is completed.
The original alignment is not modified; only columns with the gap (-) character can be inserted. The second profile is treated as a set of sequences and therefore is modified. Note that if a file with another alignment is used as a source of unaligned sequences, the gap characters are removed and each input sequence is processed independently. This method is quite fast: for example, an alignment of 3000 sequences (1000 bases each) to an existing profile takes about 5 minutes on an ordinary Core2Duo computer.
11.17 Bowtie
Bowtie is a popular short read aligner. Click this link to open the Bowtie homepage. Bowtie is embedded as an external tool into UGENE. Open the Tools ‣ DNA Assembly submenu of the main menu.
There are the following parameters:
Reference sequence: DNA sequence to align short reads to. This parameter is required.
Result file name: file in SAM format to write the result of the alignment into. This parameter is required.
Prebuilt index: check this box to use an index file instead of a source reference sequence. The index is a set of 6 files with suffixes .1.ebwt, .2.ebwt, .3.ebwt, .4.ebwt, .rev.1.ebwt, and .rev.2.ebwt. The index is created during the alignment. You can also build it manually.
SAM output: always save the output file in the SAM format (the option is disabled for Bowtie).
Short reads: each added short read is a small DNA sequence file. At least one read should be added.
Note: The length of short reads for Bowtie cannot exceed 1024.
You can also configure other parameters. They are the same as in the original Bowtie (a detailed description of the parameters is available on the Bowtie manual page). Select one of the following alignment modes:
The -n alignment mode: when the -n mode is selected, Bowtie determines which alignments are valid according to the following policy. Alignments may have no more than N mismatches (where N is a number from 0 to 3) in the first L bases (where L is a number 5 or greater, set with Seed length) on the high-quality (left) end of the read. The sum of the Phred quality values at all mismatched positions (not just in the seed) may not exceed E (set with Maq error). Where qualities are unavailable (e.g. if the reads are from a FASTA file), the Phred quality defaults to 40.
The -v alignment mode: in the -v mode, alignments may have no more than V mismatches, where V may be a number from 0 through 3. Quality values are ignored. The -v mode is mutually exclusive with the -n mode.
The following parameters are available:
Maq error (--maqerr): the maximum permitted total of quality values at all mismatched read positions throughout the entire alignment, not just in the "seed". The default is 70. By default, Bowtie rounds quality values to the nearest 10 and saturates at 30. Note that the rounding can be disabled with No Maq rounding.
Seed length (--seedlen): the number of bases on the high-quality end of the read to which the -n mismatch limit applies. The lowest permitted setting is 5 and the default is 28.
Maximum of backtracks (--maxbts): the maximum number of backtracks (default: 125 without Best, 800 with Best). A "backtrack" is the introduction of a speculative substitution into the alignment.
Descriptors memory usage (--chunkmbs): the number of megabytes of memory a given thread is given to store path descriptors in the Best mode. Default: 64. This parameter is available if the Best flag is checked.
Seed (--seed): the seed for the pseudo-random number generator.
Threads: launch the specified number of parallel search threads. Threads will run on separate processors/cores and synchronize when parsing reads and outputting alignments.
The following flags are available:
Colorspace (--color): the input is read in colorspace; colors are encoded as characters A/C/G/T (A=blue, C=green, G=orange, T=red).
No Maq rounding (--nomaqround): Maq (Mapping and Assembly with Quality) accepts quality values in the Phred quality scale, but internally rounds values to the nearest 10, with a maximum of 30. By default, Bowtie also rounds this way. No Maq rounding prevents this rounding in Bowtie.
No forward orientation (--nofw): do not attempt to align against the forward reference strand.
No reverse-complement orientation (--norc): do not attempt to align against the reverse-complement reference strand.
Try as hard (--tryhard): try as hard as possible to find valid alignments when they exist, including paired-end alignments.
Best alignments (--best): make Bowtie guarantee that reported singleton alignments are "best" in terms of stratum (i.e. the number of mismatches, or mismatches in the seed in the case of the -n mode) and in terms of the quality values at the mismatched position(s).
All alignments (--all): report all valid alignments per read or pair. Validity of alignments is determined by the alignment policy (the combined effects of the -n mode, the -v mode, Seed length, and Maq error).
Select the required parameters and press the Start button.
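The validity rules of the -n and -v modes described above can be expressed as a short check. The sketch below is an illustration of the policy only, not Bowtie's code: the function names are hypothetical, mismatch positions are 0-based offsets from the high-quality (left) end of the read, and rounding halves upwards in round_maq is an assumption.

```python
def round_maq(quality):
    """Round a Phred quality to the nearest 10 and saturate at 30.

    Assumption: values exactly halfway (e.g. 25) round upwards.
    """
    return min(30, ((quality + 5) // 10) * 10)


def valid_n_mode(mismatch_positions, qualities, n,
                 seed_len=28, maq_err=70, maq_rounding=True):
    """Check an alignment against the -n mode policy.

    At most n mismatches may fall within the first seed_len bases, and
    the summed quality at all mismatched positions may not exceed maq_err.
    """
    seed_mismatches = sum(1 for pos in mismatch_positions if pos < seed_len)
    if seed_mismatches > n:
        return False
    total = sum(
        round_maq(qualities[pos]) if maq_rounding else qualities[pos]
        for pos in mismatch_positions
    )
    return total <= maq_err


def valid_v_mode(mismatch_positions, v):
    """Check an alignment against the -v mode policy (qualities ignored)."""
    return len(mismatch_positions) <= v
```

For a 36-base read with all qualities equal to 40 and mismatches at offsets 3 and 30, the rounded qualities sum to 60, so the alignment passes with the default Maq error of 70. With No Maq rounding the sum is 80 and the same alignment is rejected.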
There are the following parameters:
Reference sequence: DNA sequence to which short reads will be aligned. This parameter is required.
Index file name: a file to save the created index to. This parameter is required.
Colorspace (--color): the input is read in colorspace; colors are encoded as characters A/C/G/T (A=blue, C=green, G=orange, T=red).
11.18 BWA
BWA is a fast lightweight tool that aligns relatively short reads to a reference sequence. Click this link to open the BWA homepage. BWA is embedded as an external tool into UGENE. Open the Tools ‣ DNA assembly submenu of the main menu.
There are the following parameters:
Reference sequence: DNA sequence to align short reads to. This parameter is required.
Result file name: file in SAM format to write the result of the alignment into. This parameter is required.
Prebuilt index: check this box to use an index file instead of a source reference sequence. You can also build it manually.
SAM output: always save the output file in the SAM format (the option is disabled for BWA).
Short reads: each added short read is a small DNA sequence file. At least one read should be added.
You can also configure other parameters. They are the same as in the original BWA (a detailed description of the parameters is available on the BWA manual page). Select one of the following parameters, which correspond to the -n option in the original BWA:
Max #diff (-n): maximum edit distance. An integer value should be input.
Missing prob (-n): the fraction of missing alignments given a 2% uniform base error rate. A float value is used.
The other parameters are:
Max gap opens (-o): maximum number of gap opens.
Index algorithm (-a): algorithm for constructing the BWT index. Three different algorithms are implemented:
1. is: designed for short reads up to ~200bp with a low error rate (<3%). It does gapped global alignment w.r.t. reads, supports paired-end reads, and is one of the fastest short read alignment algorithms to date while also visiting suboptimal hits.
2. bwtsw: designed for long reads with more errors. It performs heuristic Smith-Waterman-like alignment to find high-scoring local hits. The algorithm is implemented in BWT-SW. On low-error short queries, BWA-SW is slower and less accurate than the is algorithm, but on long reads it is better.
3. div: does not work for long genomes.
Enable long gaps: checking this box allows you to set the Max gap extensions parameter.
Max gap extensions (-e): maximum number of gap extensions.
Indel offset (-i): disallow insertions and deletions within the specified number of base pairs towards the ends.
Max long deletion extensions (-d): disallow a long deletion within the specified number of base pairs towards the 3'-end.
Seed length (-l): take the subsequence of the specified length as the seed. If the specified length is larger than the query sequence, seeding will be disabled. For long reads, this option typically ranges from 25 to 35.
Max seed differences (-k): maximum edit distance in the seed.
Max queue entries (-m): maximum queue entries.
Threads (-t): number of threads.
Mismatch penalty (-M): BWA will not search for suboptimal hits with a score lower than the specified value.
Gap open penalty (-O): gap open penalty.
Gap extension penalty (-E): gap extension penalty.
Best hits (-R): proceed with suboptimal alignments if there are no more than the specified number of equally best hits. This option only affects paired-end mapping. Increasing this threshold helps to improve the pairing accuracy at the cost of speed, especially for short reads (~32bp).
Quality threshold (-q): parameter for read trimming.
Barcode length (-B): length of the barcode starting from the 5'-end. When the specified length is positive, the barcode of each read will be trimmed before mapping and will be written at the BC SAM tag. For paired-end reads, the barcodes from both ends are concatenated.
Colorspace (color): the input is read in colorspace; colors are encoded as characters A/C/G/T (A=blue, C=green, G=orange, T=red).
Long-scaled gap penalty for long deletion (-L): log-scaled gap penalty for long deletions.
Non-iterative mode (-N): disable iterative search. All hits with no more than Max #diff differences will be found. This mode is much slower than the default.
Select the required parameters and press the Start button.
There are the following parameters:
Reference sequence: DNA sequence to which short reads will be aligned. This parameter is required.
Index file name: file to save the index to. This parameter is required.
Index algorithm (-a): algorithm for constructing the BWT index. Three different algorithms are implemented:
1. is: designed for short reads up to ~200bp with a low error rate (<3%). It does gapped global alignment w.r.t. reads, supports paired-end reads, and is one of the fastest short read alignment algorithms to date while also visiting suboptimal hits.
2. bwtsw: designed for long reads with more errors. It performs heuristic Smith-Waterman-like alignment to find high-scoring local hits. The algorithm is implemented in BWT-SW. On low-error short queries, BWA-SW is slower and less accurate than the is algorithm, but on long reads it is better.
3. div: does not work for long genomes.
Colorspace (color): the input is read in colorspace; colors are encoded as characters A/C/G/T (A=blue, C=green, G=orange, T=red).
The following parameters are available:
Reference sequence: DNA sequence to align short reads to. This parameter is required.
Result file name: file in the UGENE database format, or in the SAM format if the SAM output box is checked, to write the result of the alignment into. This parameter is required.
Prebuilt index: check this box to use an index file instead of a reference sequence. You can also build it manually.
SAM output checking this box allows one to save output les in the SAM format. The default format of output les is the UGENE database format (ugenedb). Short reads each added short read is a small DNA sequence le. At least one read should be added. Note: The Aligning Short Reads with UGENE Genome Aligner has no limitation on short reads length. Common parameters: Mismatches allowed check this box to allow mismatches between the reference sequence and a short read. Select one of the following: Mismatches number to set the number of mismatched nucleotides allowed. This parameter can take values: 1, 2 and 3. Percentage of mismatches to set the number of mismatches in percents. Note, that in this case the absolute number of mismatches can vary for dierent reads. This parameter can take values: 1 - 10 %. Align options: Use GPU-optimization use an openCL-enabled GPU during the alignment (the corresponding hardware should be available on your computer). Align reverse complement reads use both: a read and its reverse complement during the alignment. Use "best"-mode during the alignment report only about best alignments (in terms of mismatches). Omit reads with qualities lower than omit all reads with qualities lower than the specied value. Reads that have no qualities are not omited. Advanced parameters: Maximum memory for short reads maximum memory usage for short reads. This parameter allows one to decrease the load on the computer on one side and to increase the computer speed of the task on the other side. Total memory usage shows the total memory usage. System memory size shows the total system memory size. Index parameters: Reference fragmentation this parameter inuences the number of parts the reference will be divided. It is better to make it bigger, but it inuences the amount of memory used during the alignment. Index memory usage size shows the index memory usage. Directory for index les temporary directory for saving index les. 
You can choose a temporary directory for saving the index files for the reference that will be built during the alignment. If you need to run this algorithm one more time with the same reference and the same reference fragmentation parameter, you can use this prebuilt index, which will be located in the temporary directory.
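The Percentage of mismatches parameter described above gives a different absolute limit for reads of different lengths. The following minimal sketch illustrates this. It assumes that the percentage is taken relative to the read length and rounded down; the manual does not state how UGENE rounds the value, and the function name is hypothetical.

```python
def allowed_mismatches(read_length: int, percent: int) -> int:
    """Absolute mismatch limit for one read.

    Assumption: the limit is the floor of `percent` percent of the read length.
    """
    if not 1 <= percent <= 10:
        raise ValueError("percent must be in the range 1-10")
    return read_length * percent // 100


# The same percentage yields different absolute limits for different reads.
for length in (36, 75, 150):
    print(length, allowed_mismatches(length, 10))
```

Under this assumption, a short read tolerates fewer mismatched nucleotides than a long one at the same percentage setting.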
The parameters are the following:
Reference sequence: DNA sequence to which short reads will be aligned. This parameter is required.
Index file name: file to save the index to. This parameter is required.
Reference fragmentation: this parameter influences the number of parts the reference will be divided into. A larger value is preferable, but it also affects the amount of memory used during the alignment.
Total memory usage: shows the total memory usage.
System memory size: shows the total system memory size.
11.20 CAP3
CAP3 (Contig Assembly Program, version 3) is a sequence assembly program for small-scale assembly with or without quality values. Click this link to open the CAP3 homepage. CAP3 is embedded into UGENE as an external tool. Open the Tools → DNA assembly submenu of the main menu.
Select the Contig assembly with CAP3 item to use CAP3. The Contig Assembly With CAP3 dialog appears.
You can add or remove input files using the Add and Remove buttons. To remove all files, click the Remove all button. Input files are files with long DNA reads in the FASTA or FASTQ format. At least one input file should be added. Enter a Result contig name and press the Run button. CAP3 produces assembly results in the ACE file format (".ace"). The file contains one or several contigs assembled from the input reads.
Clipping for poor regions parameters:
Clipping of a poor end region of a read is controlled by the parameters Base quality cutoff for clipping (-c) (the specified value should be more than 5) and Clipping range (-y) (the specified value should be more than 5).
Quality difference score of an overlap parameters:
Base quality cutoff for differences (-b): if an overlap contains a difference at bases of quality values q1 and q2, then the score at the difference is max(0, min(q1, q2) - b), where b is the specified value. The specified value should be more than 15. The difference score of an overlap is the sum of the scores at each difference.
Max qscore sum at differences (-d): remove an overlap if its difference score is greater than the specified value. The specified value should be more than 20.
Similarity score of an overlap parameters:
The following parameters are used to calculate the similarity score of an overlapping alignment:
Match score factor (-m): a match at bases of quality values q1 and q2 is given a score of m * min(q1, q2), where m is the specified value. The specified value should be more than 0.
Mismatch score factor (-n): a mismatch at bases of quality values q1 and q2 is given a score of n * min(q1, q2), where n is the specified value. The specified value should be less than 0.
Gap penalty factor (-g): a base of quality value q1 in a gap is given a score of -g * min(q1, q2), where g is the specified value and q2 is the quality value of the base in the other sequence right before the gap. The specified value should be more than 0.
The similarity score is calculated as the sum of the scores of each match, each mismatch and each gap. Based on this value and the following value, some overlaps are removed:
Overlap similarity score cutoff (-s): remove overlaps with similarity scores less than the specified value. The specified value should be more than 250.
Length and percent identity of an overlap parameters:
Overlap length cutoff (-o): minimum length of an overlap (in base pairs). The specified value should be more than 15 base pairs.
Overlap percent identity cutoff (-p): minimum percent identity of an overlap. The specified value should be more than 65%.
Other parameters:
Maximum number of word matches (-t): an upper limit on word matches between a read and other reads. Increasing the value results in more accuracy, but can slow down the program. The specified value should be more than 0.
Band expansion size (-a): the number of bases by which to expand a band of diagonals for an overlapping alignment between two sequence reads. The specified value should be more than 10.
Max gap length in any overlap (-f): reject overlaps with a gap longer than the specified value. A small value may cause the program to remove true overlaps and to produce incorrect results. This option may be used to split reads from alternative splicing forms into separate contigs. The specified value should be more than 1.
Assembly reverse reads (-r): consider reads in reverse orientation for assembly. The default value is "checked".
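The difference score and the similarity score defined above can be written out directly. The sketch below is only an illustration of the formulas given in this section, not CAP3 code: the function names, the representation of an overlap as lists of (q1, q2) quality pairs, and the numeric values in the example are assumptions chosen for illustration.

```python
from typing import Iterable, Tuple

QualityPair = Tuple[int, int]  # (q1, q2) quality values at one overlap position


def difference_score(differences: Iterable[QualityPair], b: int) -> int:
    """Sum of max(0, min(q1, q2) - b) over all differences in the overlap."""
    return sum(max(0, min(q1, q2) - b) for q1, q2 in differences)


def similarity_score(
    matches: Iterable[QualityPair],
    mismatches: Iterable[QualityPair],
    gaps: Iterable[QualityPair],
    m: int,
    n: int,
    g: int,
) -> int:
    """Sum of the match, mismatch and gap scores of an overlap."""
    score = sum(m * min(q1, q2) for q1, q2 in matches)
    score += sum(n * min(q1, q2) for q1, q2 in mismatches)  # n is negative
    score += sum(-g * min(q1, q2) for q1, q2 in gaps)
    return score


# Illustrative overlap: ten matches, one mismatch, one gap position.
matches = [(30, 30)] * 10
mismatches = [(30, 40)]
gaps = [(20, 25)]

print(difference_score([(30, 40), (10, 50)], b=20))
print(similarity_score(matches, mismatches, gaps, m=2, n=-5, g=6))
```

An overlap would then be removed if its difference score exceeds the -d value or if its similarity score is below the -s value.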
In the search dialog you must specify a file with a PWM or PFM. To do so, press the browse button [1] and select the file. You can also use the special interface to choose a JASPAR matrix by pressing the Search JASPAR database button [2]. An alternative way to specify the position weight/frequency matrix is to create one from an alignment or from a file with several sequences using the Build a new matrix tool. After the profile (the matrix) is loaded, you can adjust the threshold value [3]. The threshold sets the minimal identity score for a result to pass. The higher the score of a result, the more closely the result is related to the aligned region. By changing the threshold you can filter out low-scoring results. If the loaded matrix is a position frequency matrix, you must also specify the algorithm used to build the corresponding position weight matrix, which will represent the transcription factor. There are four algorithms available.
You can also add a selected matrix with the specified Minimal score and Algorithm to the list of matrices. To do so, select the matrix and the other options and press the Add to queue button. The plugin will search with all matrices in the list. You can use the Save list... button to export the list of matrices to a *.csv file. Later the list can be loaded from the file using the Load list... button. The remaining options are standard sequence search options: the strand and the sequence region in which to search for matches. After specifying the necessary options, press the Search button. The results found will appear in the dialog table. The corresponding identity scores are shown in the Score column.
The regions found by the weight matrix algorithm can be saved as annotations to the DNA sequence in the Genbank format by pressing the Save as annotations button. After saving, the file with the resulting annotations will be automatically added to the current project, and the annotations will be added to the original sequence. Note that if a JASPAR or UNIPROBE matrix was selected, the resulting annotations will contain the properties of that matrix.
Here the matrices are divided into categories, and you can read detailed information about each matrix, which is represented by its properties. This can help you to choose an appropriate matrix. Note: The matrices provided with UGENE are located in the $UGENE/data/position_weight_matrix folder.
The following parameters are available:
Input file: an alignment or a file with several sequences to build the matrix from. The parameter is mandatory.
Output file: the resulting matrix will be saved in this file. The parameter is mandatory.
Statistic type: defines the way in which the statistics will be collected. The Mononucleic option is generally suitable for small alignments, while the Dinucleic option should give more appropriate results for big alignments.
Matrix type: defines the type of the resulting matrix.
If the Frequency matrix option is selected, then the frequency matrix will be created and saved into the resulting file. If the Weight matrix option is selected, then an intermediate frequency matrix will be created and then transformed into a weight matrix on the basis of the selected Weight algorithm. The weight matrix will then be saved into the resulting file. For some input files the colored Alignment Logo appears at the bottom of the dialog. It gives a representation of the selected alignment.
Note: The Alignment logo appears when:
The input file format is *.pfm or *.aln, or it is a file with several sequences;
The size of the input file is small enough.
To start the operation, press the Start button. The matrix will be created and saved. If the Build weight or frequency matrix dialog was invoked from the Weight matrix search dialog, then the matrix will also be chosen as the current profile.
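The two stages described above (an intermediate frequency matrix, then a transformation into a weight matrix) can be sketched as follows. This is a generic log-odds illustration with mononucleotide statistics, a uniform background and a pseudocount, all of which are assumptions; it is not one of the weight algorithms offered by UGENE, and the function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def frequency_matrix(alignment):
    """Count each nucleotide in every column of an ungapped alignment."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")
    columns = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        columns.append({base: counts.get(base, 0) for base in ALPHABET})
    return columns


def weight_matrix(pfm, pseudocount=1.0, background=0.25):
    """Turn counts into log2 odds against a uniform background (assumption)."""
    pwm = []
    for column in pfm:
        total = sum(column.values()) + pseudocount * len(ALPHABET)
        pwm.append(
            {
                base: math.log2((column[base] + pseudocount) / total / background)
                for base in ALPHABET
            }
        )
    return pwm


alignment = ["ACGT", "ACGA", "ACCT"]
pfm = frequency_matrix(alignment)
pwm = weight_matrix(pfm)
print(pfm[0])
print(round(pwm[0]["A"], 3))
```

A nucleotide that dominates a column receives a positive weight, and a nucleotide that is rare in the column receives a negative one, which is what makes the weight matrix usable for scoring candidate sites.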
11.22 Primer3
The Primer3 plugin is a port of the Primer3 tool. It is intended to pick primers from a DNA sequence. To use Primer3, open a DNA sequence and select the Analyze → Primer3... context menu item. The dialog will appear:
BLAST/BLAST+
The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as to help identify members of gene families. BLAST+ is a new version of the BLAST package from the NCBI.
From UGENE you can use the following tools of the old BLAST package:
blastall: the old program developed and distributed by the NCBI for running BLAST searches.
formatdb: formats protein or nucleotide source databases before these databases can be searched by blastall.
And the following tools of the new BLAST+ package:
blastn: searches a nucleotide database using a nucleotide query.
blastp: searches a protein database using a protein query.
blastx: searches a protein database using a translated nucleotide query.
tblastn: compares a protein query against a translated nucleotide database (all six reading frames).
tblastx: translates the query nucleotide sequence in all six possible frames and compares it against the six-frame translations of a nucleotide sequence database.
makeblastdb: formats protein or nucleotide source databases before these databases can be searched by other BLAST+ tools.
BLAST home page: http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome
To make the BLAST (or BLAST+) tools available from UGENE:
1. Install the required version of BLAST (or BLAST+) on your system.
2. Set the paths to the executables you are going to use on the External tools tab of the UGENE Application Settings dialog.
After you've finished this configuration, you can access the tools from the Tools → BLAST submenu of the main menu.
Creating Database
To format a BLAST database do the following:
If you're using BLAST, open Tools → BLAST → FormatDB.
If you're using BLAST+, open Tools → BLAST → BLAST+ make DB.
The Format database dialog appears:
Here you must select the input files. If all the files you want to use are located in one directory, you can simply select the directory with the files. By default, only files with the *.fa and *.fasta extensions are taken into account. You can change this by specifying either the Include files filter or the Exclude files filter. You can choose either the protein or the nucleotide type of the files. Then you must select the path to save the database file and specify a Base name for BLAST files and a Title for database file.
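The effect of the include and exclude filters on a directory listing can be pictured with the following minimal sketch. It assumes that a file is taken into account when it matches at least one include pattern and no exclude pattern; how UGENE actually combines the two filters is not stated in this manual, and the function name is hypothetical.

```python
from fnmatch import fnmatchcase


def select_input_files(names, include=("*.fa", "*.fasta"), exclude=()):
    """Keep names matching any include pattern and no exclude pattern."""
    selected = []
    for name in names:
        if not any(fnmatchcase(name, pattern) for pattern in include):
            continue
        if any(fnmatchcase(name, pattern) for pattern in exclude):
            continue
        selected.append(name)
    return selected


directory = ["a.fa", "b.fasta", "c.gb", "draft.fa"]
print(select_input_files(directory))
print(select_input_files(directory, exclude=("draft*",)))
```

With the default patterns only the *.fa and *.fasta files are kept, which mirrors the default behaviour described above.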
The dialog is very similar to the dialog described in the Remote BLAST chapter, except for the following parameters:
Select input file: this parameter is only present if the dialog has been opened from the Tools main menu. Here you must specify the query sequence file that will be used to search the BLAST database. If the dialog has been opened in another way, e.g. using the Sequence View context menu, then the currently active sequence is used as the query sequence.
Search type: here you should select the tool you would like to use. If the query sequence is a nucleotide sequence, then the blastn, blastx and tblastx items are available. For a protein sequence the items are blastp and tblastn.
Select database path: path to the database files.
Base name for BLAST DB files: base name for the BLAST database files.
Number of CPUs being used: number of processors to use.
To learn about the other parameters, please refer to the Remote BLAST chapter.
ClustalW
Clustal is a widely used multiple sequence alignment program. It is used for both nucleotide and protein sequences. ClustalW is a command-line version of the program.
Clustal home page: http://www.clustal.org
If you are using Windows, no additional configuration steps are required, as the ClustalW executable file is included in the UGENE distribution package. Otherwise:
1. Install the Clustal program on your system.
2. Set the path to the ClustalW executable on the External tools tab of the UGENE Application Settings dialog.
Now you are able to use Clustal from UGENE. Open a multiple sequence alignment file and select the Align with ClustalW item in the context menu or in the Actions main menu. The Align with ClustalW dialog appears (see below), where you can adjust the following parameters:
Gap opening penalty: cost of opening a new gap in the alignment. Increasing this value will make gaps less frequent.
Gap extension penalty: cost of every item in a gap. Increasing this value will make gaps shorter.
Weight matrix: specifies a single weight matrix for nucleotide sequences or a series of matrices for protein sequences.
For nucleotide sequences the selected weight matrix defines the scores assigned to matches and mismatches (including IUB ambiguity codes). It can take the following values:
IUB: default scoring matrix used by BESTFIT for the comparison of nucleic acid sequences. Xs and Ns are treated as matches to any IUB ambiguity symbol. All matches score 1.9; all mismatches for IUB symbols score 0.
CLUSTALW: previous system used by ClustalW, in which matches score 1.0 and mismatches score 0. All matches for IUB symbols also score 0.
For protein sequences the weight matrix describes the similarity of each amino acid to each other. The following values are available:
BLOSUM: BLOcks of Amino Acid SUbstitution Matrices, first introduced in a paper by Henikoff and Henikoff. These matrices appear to be the best available for carrying out database similarity (homology) searches.
PAM: Point Accepted Mutation matrices, introduced by Margaret Dayhoff. These have been extremely widely used since the late 70s.
GONNET: these matrices were derived using almost the same procedure as the Dayhoff one (above), but are much more up to date and are based on a far larger data set. They appear to be more sensitive than the Dayhoff series.
ID: identity matrix, which gives a score of 1.0 to two identical amino acids and a score of zero otherwise.
Iteration type: specifies the iteration type to use. During the iteration step each sequence is removed in turn and realigned. The realignment is kept if the resulting alignment is better than the one made before. This process is repeated until the score converges or until the maximum number of iterations is reached. Available values are:
NONE: specifies not to use iterations.
TREE: specifies to iterate at each step of the progressive alignment.
ALIGNMENT: specifies to iterate on the final alignment.
Max iterations: maximum number of iterations.
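The interplay of the gap opening and gap extension penalties described above can be illustrated with a simple cost model. The sketch assumes that a gap costs the opening penalty once plus the extension penalty for every item in the gap, and it ignores any position-specific adjustments that ClustalW may apply; the function names and the penalty values are illustrative only.

```python
import math


def gap_cost(length, gap_open, gap_extension):
    """Cost of one gap: opening penalty plus extension penalty per gap item."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extension * length


def total_gap_cost(gap_lengths, gap_open, gap_extension):
    """Sum of the costs of all gaps in an alignment."""
    return sum(gap_cost(n, gap_open, gap_extension) for n in gap_lengths)


# Four gap items in total: one long gap versus two short gaps.
one_long = total_gap_cost([4], gap_open=10.0, gap_extension=0.2)
two_short = total_gap_cost([2, 2], gap_open=10.0, gap_extension=0.2)
print(one_long, two_short, math.isclose(gap_cost(3, 10.0, 0.2), 10.6))
```

In this model, raising the opening penalty makes every additional gap more expensive (fewer gaps), while raising the extension penalty makes each additional gap item more expensive (shorter gaps).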
The following parameters are only available for protein sequences:
Gap separation distance: tries to decrease the chances of gaps being too close to each other. Gaps that are less than this distance apart are penalized more than other gaps. This does not prevent close gaps; it makes them less frequent, promoting a block-like appearance of the alignment.
Hydrophilic gaps off: increases the chances of a gap within a run of hydrophilic amino acids.
No end gap separation penalty: treats end gaps just like internal gaps to avoid gaps that are too close.
Residue-specific gaps off: amino acid specific gap penalties that reduce or increase the gap opening penalties at each position in the alignment or sequence. For example, positions that are rich in glycine are more likely to have an adjacent gap than positions that are rich in valine.
MAFFT
MAFFT was originally a multiple sequence alignment program for Unix-like operating systems. It is now available for Mac OS X, Linux and Windows. It is used for both nucleotide and protein sequences.
MAFFT home page: http://mafft.cbrc.jp/alignment/software
To make MAFFT available from UGENE:
1. Install the MAFFT program on your system.
2. Set the path to the MAFFT executable on the External tools tab of the UGENE Application Settings dialog. For example, on Windows you need to specify the path to the mafft.bat file.
To use MAFFT, open a multiple sequence alignment file and select the Align with MAFFT item in the context menu or in the Actions main menu. The following dialog appears:
The following parameters are available:
Gap opening penalty: gap opening penalty for group-to-group alignment.
Offset (works like gap extension penalty): offset value, which works like a gap extension penalty, for group-to-group alignment.
Maximum number of iterative refinement: specifies the number of cycles of iterative refinement to perform.
T-Coffee
T-Coffee is a multiple sequence alignment package.
T-Coffee home page: T-Coffee
To make T-Coffee available from UGENE, see External Tools.
To use T-Coffee, open a multiple sequence alignment file and select the Align with T-Coffee item in the context menu or in the Actions main menu. The following dialog appears:
The following parameters are available:
Gap opening penalty: indicates the penalty applied for opening a gap. The penalty must be negative.
Gap extension penalty: indicates the penalty applied for extending a gap.
Number of iterations: specifies the number of iterations.
Alternatively, you can create or edit a schema using a text editor. When the schema has been created and all its parameters have been set, you can run it for a nucleotide sequence. The results are saved as a set of annotations to the specified file in the Genbank format. To learn more about the Query Designer, read the Query Designer Manual (follow the link on the UGENE documentation page).
Here:
task_name: the task to execute. It can be one of the predefined tasks or a task you have created.
task_parameter: a parameter of the specified task. Some parameters of a task are required, like the in and out parameters of some tasks.
option: one of the CLI options.
See the example below:
ugene align --in=COI.aln -out result.aln -log-level-details
--task=<task_name> [<task_parameter>=value ...]
Specifies the task to run. A user-defined UGENE workflow schema can be used as a task name. For example:
ugene --task=align --in=COI.aln -out result.aln
ugene --task=C:\myschema.uwl --in=COI.aln --out=res.aln
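When such commands are launched from a script, it can help to assemble the argument list programmatically. The sketch below only builds the list and does not run UGENE. It uses the --task=<task_name> and --<task_parameter>=value forms shown in the examples above; the helper name is hypothetical, and it assumes that a ugene executable is on the PATH if the list is later passed to a process launcher.

```python
def build_task_command(task, parameters):
    """Return an argument list of the form: ugene --task=<task> --<name>=<value> ..."""
    args = ["ugene", f"--task={task}"]
    # Dictionaries keep insertion order, so parameters appear as they were given.
    args += [f"--{name}={value}" for name, value in parameters.items()]
    return args


command = build_task_command("align", {"in": "COI.aln", "out": "result.aln"})
print(command)
```

The resulting list can be handed to subprocess.run(command, check=True); passing a list rather than a single string avoids shell quoting problems with paths that contain spaces or backslashes.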
--log-no-task-progress
Task progress is shown by default while a task is running. This option specifies not to show the progress.
--log-level="[<category1>=]<level1> [, ...]"
Sets the log level per category. If a category is not specified, the log level is applied to all categories.
The following categories are available: "Algorithms", "Console", "Core Services", "Input/Output", "Performance", "Remote Service", "Scripts", "Tasks".
The following log levels are available: TRACE, DETAILS, INFO, ERROR or NONE. By default, loglevel=ERROR. For example:
ugene --log-level=NONE
ugene --log-level="Tasks=DETAILS, Console=DETAILS"
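The value of the --log-level option follows the pattern "[<category>=]<level>, ..." described above. A small helper can assemble and validate it from the listed categories and levels. This is an illustrative sketch: the function name is hypothetical, and using a None key to mean "all categories" is a convention of this example only.

```python
CATEGORIES = {
    "Algorithms", "Console", "Core Services", "Input/Output",
    "Performance", "Remote Service", "Scripts", "Tasks",
}
LEVELS = {"TRACE", "DETAILS", "INFO", "ERROR", "NONE"}


def log_level_option(levels):
    """Build a --log-level argument from a {category or None: level} mapping."""
    entries = []
    for category, level in levels.items():
        if level not in LEVELS:
            raise ValueError(f"unknown log level: {level}")
        if category is None:
            entries.append(level)  # applies to all categories
        elif category in CATEGORIES:
            entries.append(f"{category}={level}")
        else:
            raise ValueError(f"unknown log category: {category}")
    return "--log-level=" + ", ".join(entries)


print(log_level_option({None: "NONE"}))
print(log_level_option({"Tasks": "DETAILS", "Console": "DETAILS"}))
```

When the returned string is passed as one element of an argument list, the surrounding double quotes from the shell example are not needed, because no shell splits the value at the space.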
--log-format="<format_string>"
Specifies the format of a log line. Use the following notations: L - level, C - category, YYYY or YY - year, MM - month, dd - day, hh - hour, mm - minutes, ss - seconds, zzz - milliseconds. By default, logformat="[L][hh:mm]".
--license
Shows license information.
--lang=language_code
Specifies the language to use (e.g. for the log output). The following values are available: CS (Czech), EN (English), RU (Russian).
--log-color-output
If log output is enabled, this option makes it colored: ERROR messages are displayed in red, DETAILS messages are displayed in green, TRACE messages are displayed in blue.
Example:
ugene find-orfs --in=human_T1.fa --out=result.gb --require-init-codon=false
pam250 etc. The matrices available are stored in the $UGENE\data\weight_matrix directory.
filter: results filtering strategy. [String, Optional, Default: "filter-intersections"] The following values are available: filter-intersections, none
Example:
ugene find-sw --in=human_T1.fa --out=sw.gb --ptrn=TGCT --filter=none
p: type of the BLAST search. [String, Optional, Default: "blastn"] The following values are available: blastn, blastp, blastx, tblastn, tblastx
e: expectation value threshold. [Number, Optional, Default: 10]
Example:
ugene local-blast --in=input.fa --dbpath=. --dbname=mydb --out=output.gb
Example:
ugene query --in=input.fa --out=result.gb --schema=RepeatsWithORF.uql
seed: seed for the pseudo-random number generator. [Number, Optional]
seedlen: number of bases on the high-quality end of the read to which the n ceiling applies. The lowest permitted setting is 5 and the default is 28. Bowtie is faster for larger values of seedlen. [Number, Optional]
tryhard: finds valid alignments when they exist, including paired-end alignments. [Boolean, Optional]
chunkmbs: number of megabytes a given thread is allowed to use to store path descriptors in best mode. [Number, Optional]
best: guarantees that reported singleton alignments are "best" in terms of stratum (i.e. number of mismatches, or mismatches in the seed in the case of n mode) and in terms of the quality values at the mismatched position(s).
Example:
ugene bowtie --reads=r1.fa;r2.fa;r3.fa --ebwt=refindex --out=result.aln
score: score-based filtering, which is an alternative to e-value filtering, to exclude low-probability hits from the result. [Number, Optional, Default: -1000000000]
Example:
ugene hmm2-search --seq=CBS_seq.fa --hmm=CBS.hmm --out=CBS_hmm.gb
Parameters:
toolpath: path to the MAFFT executable. By default, the path specified in the Application Settings is applied. [String, Optional, Default: "default"]
tmpdir: directory for temporary files. [String, Optional]
in: semicolon-separated list of input files. [String, Required]
out: output file. [String, Required]
format: format of the output file. [String, Required]
op: penalty for opening a gap. [Number, Optional]
ep: penalty for extending a gap. [Number, Optional]
maxiterate: maximum number of cycles of iterative refinement. [Number, Optional]
Example:
ugene align-mafft --in=COI.aln --out=COI_aligned.aln
Example:
ugene pfm-build --in=COI.aln --out=result.pfm
Example:
ugene pwm-search --seq=input.fa --matrix=Aro80.pwm;Aft1.pwm --out=res.gb
max-err2: setting for filtering results, the maximum value of the Error type II. [Number, Optional, Default: 0.001]
strand: strands to search in. [Number, Optional, Default: 0] The following values are available: 0 (both strands), 1 (direct strand), -1 (complement strand)
Example:
ugene sitecon-search --in=input.fa --inmodel=profile.sitecon --out=res.gb
13 APPENDICES
13.1 Appendix A. Supported File Formats
Note: UGENE is able to read and write files compressed with the Unix/Linux gzip utility. You don't have to unpack the files.
EBWT
EMBL
FASTA
FASTQ
Genbank
GFF (*.gff): The Gene Finding Format (GFF) is used to store features and annotations. See also: Sequence View
HMM (*.hmm): A file format to store HMM profiles. See also: HMM2, HMM3
MMDB (*.prt): ASN.1 format used by the Molecular Modeling Database (MMDB). See also: 3D Structure Viewer
MSF: A multiple sequence alignment file format. See also: Alignment Editor
Newick: A tree file format. See also: Building Phylogenetic Tree, Phylogenetic Tree Viewer
Nexus (*.nex, *.nxs): A multiple alignment and phylogenetic trees file format. See also: Alignment Editor, Building Phylogenetic Tree, Phylogenetic Tree Viewer
PDB (*.pdb): The Protein Data Bank (PDB) format allows viewing the 3D structure of a sequence. See also: 3D Structure Viewer
pDRAW32 (*.pdw): A sequence file format used by the pDRAW32 software. See also: Sequence View
A file format for a position frequency matrix. See also: Weight Matrix
A file format for a position weight matrix. See also: Weight Matrix
A raw sequence format. See also: Sequence View
SAM: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences. See also: Alignment Editor, Bowtie, UGENE Genome Aligner
SCF: Standard Chromatogram Format. See also: Chromatogram Viewer
A file format to store a TFBS profile. See also: SITECON
A multiple sequence alignment file format. See also: Alignment Editor
Swiss-Prot (*.txt, *.sw): An annotated protein sequence in the format of the UniProtKB/Swiss-Prot database. See also: Sequence View
UGENE Workflow Designer schema
Workflow element for command line tool (*.etc)