Ten Quick Tips for Harnessing the Power of ChatGPT/GPT-4 in Computational Biology

Tiago Lubiana1,*, Rafael Lopes2, Pedro Medeiros3, Juan Carlo Silva1, Andre Nicolau Aquime Goncalves4, Vinicius Maracaja-Coutinho5,6,7,8,9, Helder I Nakaya1,10,*

1 School of Pharmaceutical Sciences, University of São Paulo, São Paulo, Brazil
2 Department of Epidemiology of Microbial Diseases and Public Health Modeling Unit, Yale School of Public Health, New Haven, CT, USA
3 TauGC Bioinformatics, São Paulo, Brazil
4 Oxford Vaccine Group, University of Oxford, Oxford, United Kingdom
5 Advanced Center for Chronic Diseases, Universidad de Chile, Santiago, Chile
6 Centro de Modelamiento Molecular, Biofísica y Bioinformática - CM2B2, Facultad de Ciencias Químicas y Farmacéuticas, Universidad de Chile, Santiago, Chile
7 ANID Anillo ACT210004 SYSTEMIX, Rancagua, Chile
8 Anillo Inflammation in HIV/AIDS - InflammAIDS, Santiago, Chile
9 Beagle Bioinformatics, São Paulo, Brazil & Santiago, Chile
10 Hospital Israelita Albert Einstein, São Paulo, Brazil
* Corresponding authors

Email:

[email protected]

[email protected]

Introduction

The rise of advanced chatbots, such as ChatGPT, has stirred excitement and curiosity in the scientific community. Powered by the large language models (LLMs) GPT-3.5 and GPT-4, ChatGPT is a General Purpose Technology with the potential to impact the job market and research endeavors in numerous fields [1]. Although similar models have been fine-tuned for biology-specific projects, including text-based analysis and biological sequence decoding [2,3], ChatGPT provides a natural interface for bioinformaticians to begin using LLMs in their activities. This tool is already accelerating various activities undertaken by computational biologists, ranging from data cleaning to interpreting results and publishing. However, with great power comes great responsibility. As scientists, we must harness the full potential of ChatGPT while adhering to ethical guidelines and avoiding pitfalls associated with the technology.

Figure 1: Ten Quick Tips for ChatGPT in Computational Biology. The tips are categorized into five mindset and study suggestions and five practical tips, each with simplified but effective prompt suggestions.

Here, we provide ten insightful tips designed to help computational biologists optimize their workflows with ChatGPT, ranging from basic prompts to more advanced techniques. Although our primary focus is on the current ChatGPT/GPT-4 model, we believe that these tips will remain relevant for future iterations of the technology, as well as other LLMs and chatbots (such as Meta’s LLaMA and Google’s Bard) [4,5]. We invite you to explore our ten tips (summarized in Fig 1) aimed at effectively utilizing ChatGPT to advance computational biology research while maintaining a strong commitment to research integrity.

Tip 1. Embrace the Technology and Be Ready for Novelty

ChatGPT, a powerful tool for coding and academic writing tasks, is rapidly gaining traction in the scientific community. While it is important to exercise critical judgment and not blindly accept everything it produces, incorporating ChatGPT into your workflow can improve efficiency. We echo van Dis and colleagues' recommendation that every research group should immediately explore and discuss the potential uses of chatbots for their work [6].

Chatbot technology is evolving very fast. Although our tips will be valuable in the near future, new tools and applications are emerging every day. As we finalize this manuscript, ChatGPT has introduced support for plugins and a new partnership with Wolfram Alpha, significantly extending its mathematical and computational capabilities [7]. Thus, one of the most valuable tips we can offer is to be prepared for novelty and remain open to testing new AI advances.

The speed and quality improvements introduced by these novelties are rapidly changing the way we work [1]. By embracing the technology, you can increase your chances in the job market and in competitive academic settings. In other words, while ChatGPT will not replace computational biologists, researchers who do not use it (and similar tools) are likely to lag behind in competitiveness.

Tip 2. Improve Code Readability and Documentation

Programming is a central skill of computational biologists. However, code outputs in academia, such as software, packages, web applications, and analysis scripts, are written by time-constrained students and postdocs. Often, this code does not follow industry-level best practices [8] and requires some cleaning and better documentation [9]. Nevertheless, it usually works fine in practice; we just generally wish it were more readable.

Thus, a good starting point to begin harnessing the power of ChatGPT is to make your favorite scripts more readable. Simple prompts such as “Add explanatory comments to this code:” or “Rename the variables for clarity:” can already do wonders for future readers of the code. ChatGPT can also help document functions by generating full roxygen2 syntax in R and docstrings in Python, inferring meaning from variable names and code logic. A sample prompt to start documenting can be “Render roxygen2 documentation for the function:”.
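
As an illustration of the kind of result a documentation prompt might return, here is a minimal Python sketch. The function `gc_content` and its docstring are hypothetical, written for this example only; any generated docstring should still be checked against what the code actually does.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a nucleotide sequence.

    Hypothetical example function, used only to illustrate a docstring.

    Args:
        sequence: DNA or RNA sequence; letter case is ignored.

    Returns:
        A value between 0.0 and 1.0. An empty sequence returns 0.0.
    """
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)
```

The docstring states the edge case (empty input) explicitly, which is the sort of detail worth verifying by hand when a chatbot infers behavior from variable names.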

Tip 3. Write Code Efficiently

In addition to improving the appearance of the code, ChatGPT can be of great help in constructing the logic of scripts. Bioinformatics settings are diverse, and computational biologists often act as jacks of all trades, handling multiple analyses across collaborations. ChatGPT accelerates the learning of new tools, as it provides an interactive environment capable of commenting on different parts of pipelines. It can provide reasonable code chunks on demand and help fix errors when you simply copy and paste error messages into the dialog [10,11]. Of course, expert humans should review the newly produced code to catch any semantic errors (see Tip 7).

Furthermore, ChatGPT can perform several functional refactorings. Prompts such as “Extract functions for increased clarity:” or “Re-write and optimize this for loop:” can improve code modularity and even save computational resources. When refactoring, it is important to set up good tests to prevent introducing bugs [12]. While ChatGPT can also help you with setting up testing infrastructure (with prompts like “Write a unit test for the following function and help me implement it:”), it is crucial to double-check what it generates to ensure it covers what it should.
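
To make the testing advice concrete, here is a minimal sketch of a small function together with unit tests of the kind one might request. The function `reverse_complement` and the test cases are hypothetical examples chosen for illustration, not code from the original text.

```python
import unittest


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    # Unexpected characters raise KeyError instead of passing through silently.
    return "".join(complement[base] for base in reversed(sequence.upper()))


class TestReverseComplement(unittest.TestCase):
    def test_known_sequence(self):
        self.assertEqual(reverse_complement("ATGC"), "GCAT")

    def test_is_its_own_inverse(self):
        # Applying the function twice should give back the input.
        self.assertEqual(reverse_complement(reverse_complement("GATTACA")), "GATTACA")

    def test_rejects_invalid_base(self):
        with self.assertRaises(KeyError):
            reverse_complement("ATXG")
```

Saved in a file, these tests can be run with `python -m unittest`. Checking a known value, a property, and an invalid input is one way to confirm that generated tests cover more than the happy path.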

A middle ground between using ChatGPT and implementing full-scale LLM applications is to add ChatGPT to integrated development environments (IDEs) via plugins. For example, it is currently possible to use GPT-3.5 and GPT-4 in Visual Studio Code (VSCode), and open-source plugins are available (https://github.com/gencay/vscode-chatgpt). For bioinformaticians using R and RStudio, there are options such as gptstudio (https://github.com/MichelNivard/gptstudio).

Tip 4. Use ChatGPT to Enhance Data Cleanup

In addition to writing scripts, computational biology research involves cleaning and reconciling data, ensuring it is consistent and free of errors before running the analysis. Data and metadata come in various formats, and while ChatGPT will not identify outliers or fix missing data, it can suggest tools for the most common tasks and provide code snippets. It can also partner up with Excel, offering guidance and writing macros [13].

As expected, ChatGPT proves most useful when processing datasets with natural language entries. If you manage a database or re-analyze public datasets, you likely have to deal with inconsistent input entered by submitters. While the current tool cannot consistently match data to unique identifiers (such as those provided by databases or ontologies [14]), it can add more consistency and facilitate manual or automatic biocuration steps [15]. A clear application is to write regular expressions given a few examples, with prompts such as “Write me regex for R/python/Excel with a pattern that will extract {} from {}”.
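
As a sketch of what the regex prompt above could produce when filled in, the following Python example extracts GEO sample accessions ("GSM" followed by digits) from free-text entries. The entries and the helper `extract_accession` are hypothetical and serve only to illustrate the idea.

```python
import re

# Hypothetical free-text entries, as a submitter might have typed them.
entries = [
    "sample GSM1234567 (liver, male)",
    "GSM7654321_rep2 - kidney",
    "no accession provided",
]

# "GSM" followed by one or more digits.
GSM_PATTERN = re.compile(r"GSM\d+")


def extract_accession(text):
    """Return the first GSM accession found in text, or None if there is none."""
    match = GSM_PATTERN.search(text)
    return match.group(0) if match else None


accessions = [extract_accession(entry) for entry in entries]
```

Entries without a match yield `None`, which makes it easy to list the records that still need manual curation.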

ChatGPT can greatly help in normalizing labels directly and executing human-like complex natural language cleanups, like those needed for open-field forms. For small datasets, you can clean up data directly in the ChatGPT interface, with prompts such as “Act as a table. Add a new column with consistent labels to this dataset:”. For larger applications, one can use add-ons, such as GPT for Google Sheets (https://gptforwork.com/), or even write code that uses the API directly (see Tip 9).

Tip 5. Use ChatGPT to Improve Your Data Visualization

Data visualization is an essential component of computational biology research, and ChatGPT can be a valuable tool to assist in creating effective and informative figures. One remarkable capacity of this tool is its proficiency in popular visualization libraries, such as ggplot2 and matplotlib (e.g. “Create a ggplot2 violin plot with a log10 Y axis”). This expertise enables it to assist users in overcoming syntax challenges, suggesting new visualization techniques, and enhancing existing figures.

Image-parsing by GPT-4 has been announced, but as of the time of writing it is not yet available to common users [16]. Thus, while we may soon be able to get direct feedback on images, we can still leverage GPT-4's ability to parse code for plotting and receive valuable guidance on areas for improvement. For example, ChatGPT can help you choose appropriate colors for your figures, make the figures more accessible for color-blind individuals, and suggest ways to improve the layout of your visualizations. A practical example of a prompt that can lead to meaningful improvements in your visualizations is asking ChatGPT to "Change my code to make the plot color-blind friendly".
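
One change such a prompt might suggest is swapping default colors for a palette designed with color-blind readers in mind. The sketch below uses the Okabe-Ito palette as an assumed choice (the original text does not name a palette), and the helper `assign_colors` and the group labels are hypothetical.

```python
from itertools import cycle

# Okabe-Ito palette, commonly recommended for color-blind-friendly figures.
OKABE_ITO = [
    "#E69F00",  # orange
    "#56B4E9",  # sky blue
    "#009E73",  # bluish green
    "#F0E442",  # yellow
    "#0072B2",  # blue
    "#D55E00",  # vermillion
    "#CC79A7",  # reddish purple
    "#000000",  # black
]


def assign_colors(groups):
    """Map each distinct group label to a palette color, in first-seen order."""
    colors = {}
    palette = cycle(OKABE_ITO)  # colors repeat if there are more than 8 groups
    for group in groups:
        if group not in colors:
            colors[group] = next(palette)
    return colors


group_colors = assign_colors(["control", "treated", "control", "knockout"])
```

The resulting dictionary of hex codes can then be handed to the plotting library of your choice; with more than eight groups, a different encoding (shapes, facets) is probably a better option than repeated colors.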

It's important to note that ChatGPT's suggestions should be used as a starting point for further exploration and refinement, as good figure design involves careful consideration of data, layout, and style. To make the most of ChatGPT's capabilities, it is essential to familiarize oneself with the principles of good figure design, which can be found in resources such as the PLOS Computational Biology article "Ten Simple Rules for Better Figures" [17]. Overall, by harnessing ChatGPT's potential in generating and refining visualizations, computational biologists can enhance their research output, create more accessible figures, and communicate their findings more effectively.

Tip 6. Use ChatGPT to Improve Your Writing

While AI-assisted writing in science has been steadily growing [18], ChatGPT has made this technology accessible to a much wider range of scientists and researchers. One of the most valuable features for authors, especially non-native English speakers, is its aid in expressing ideas more clearly. Clear and effective communication is especially important in computational biology, where experts must be capable of conveying complex ideas to colleagues with varying scientific backgrounds, using language that is understandable by mathematicians, biologists, and computer scientists alike. ChatGPT can improve the clarity of text by providing new ways of ordering thoughts, with prompts like "Provide me some different versions of the following sentence:".

ChatGPT can also help with reformatting text and summarizing thoughts, with prompts such as “Summarize this text in a 200-word conference abstract:”. Although it will rarely produce an output that you will fully like, it can break the initial barrier, helping to overcome writer's block. It can also do so by helping outline documents, from papers to teaching plans, both by creating bulleted lists from natural language and by converting bulleted lists into a final format.

Besides scientific writing, ChatGPT can be utilized for several other writing tasks, such as creating emails, grant reports, tutorials, and documentation (see Tip 2), and selecting appropriate keywords for publications. Furthermore, it can modify the text to cater to various readerships, including composing media releases, simplifying research for non-specialists, or adapting language from a biologist-based audience to a computer-science-based one.

Regardless of where you use ChatGPT to improve your writing, be sure to disclose the use of ChatGPT (or other language models) as a writing tool to prevent any misunderstandings [19]. Guidelines for the responsible and ethical use of chatbots as writing aids are emerging, particularly in the context of publishing manuscripts [20,21]. We advise researchers to familiarize themselves with the discussions and check publisher guidelines whenever using ChatGPT for publishable research.

Tip 7. Ensure You Understand - or Know How to Test - What it Generates

While ChatGPT can be a powerful tool for writing code and text in computational biology pipelines, it's important to be careful when applying it to complex analyses. In some cases, ChatGPT may hallucinate or add bugs that can produce silent errors and lead to false conclusions.

For beginners in programming, the suggestion of functions or libraries that do not exist can be a significant hurdle and reinforces the need for human intervention. Therefore, it's important to study tutorials provided by developers and publications related to the topic of interest. When using ChatGPT to help with syntax, it's crucial to ask for help only with syntax that you have already studied, so that you can understand, or at least test, the results.
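
A quick first check for hallucinated functions or libraries is to confirm that a suggested name actually exists before building on it. The following is a minimal sketch; the helper `function_exists` is a hypothetical name introduced here, and it verifies existence only, not that the function does what the chatbot claims.

```python
import importlib


def function_exists(module_name: str, function_name: str) -> bool:
    """Return True if module_name imports and exposes a callable function_name."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False  # the module is not installed or does not exist
    return callable(getattr(module, function_name, None))


# A real standard-library function versus two made-up names.
checks = {
    "statistics.median": function_exists("statistics", "median"),
    "statistics.super_median": function_exists("statistics", "super_median"),
    "not_a_real_module_xyz.run": function_exists("not_a_real_module_xyz", "run"),
}
```

Once a function is confirmed to exist, its official documentation and a small test on known input remain the next steps.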

A similar caution should be applied when using ChatGPT for writing articles or interpreting results. Double-check that you have read, understand, and agree with everything the chatbot has generated. In the end, you will be responsible for the text, not OpenAI or ChatGPT.

Tip 8. Learn the Basics of Prompt Engineering/Design

Because this is an emerging field, its terminology is still being discussed, but knowing how to interact with a non-deterministic system to reach an objective result is vital. Prompt engineering/design involves crafting prompts that effectively communicate examples, personas, and goals to generate response templates that fit your objectives [22,23]. It is also important to set evaluation metrics to steer the model toward more accurate results within the limits of available tokens.

A good example of a prompt is: "ChatGPT, I'd like to learn about the use of GATK tools in bioinformatics. Could you provide a brief overview of GATK, its main applications, and some popular tools within the GATK suite that are commonly used in the field of bioinformatics? Please include any advantages and limitations associated with these tools." This prompt is effective because it clearly states the context (bioinformatics), specifies the topic (GATK tools), outlines the desired information (overview, applications, popular tools, advantages, and limitations), and provides a concise and focused question for the AI to address.

In contrast, a bad example would be "Tell me about GATK." This prompt is ineffective because it lacks context (no mention of bioinformatics), is vague about the topic (just mentioning GATK, not specifically GATK tools), doesn't specify desired information (no details about what aspects of GATK to discuss), and provides an overly broad and open-ended question, which may result in less relevant or less focused responses.

By providing more context, details, and specific goals, the good example is more likely to generate a relevant and informative response from ChatGPT, while the bad example may lead to a less satisfying outcome. Refining the request with new parameters after the first outputs is possible, yet caution must be exercised, as the risk of losing context increases as dialogues become longer, subtler, and more complex. As such, it is imperative to prioritize specificity, objectivity, and completeness in initial interactions to mitigate the potential for hallucinations and deviations.
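
One way to keep prompts specific and complete from the first interaction is to assemble them from explicit parts. The sketch below is only an illustration: the function `build_prompt` and its field names are hypothetical, and the example values are taken from the GATK prompt discussed above.

```python
def build_prompt(context, topic, requested_items, constraints=()):
    """Assemble a specific prompt from explicit context, topic, and goals."""
    lines = [
        f"Context: {context}.",
        f"Topic: {topic}.",
        "Please cover the following points:",
    ]
    lines += [f"- {item}" for item in requested_items]
    if constraints:
        lines.append("Constraints:")
        lines += [f"- {item}" for item in constraints]
    return "\n".join(lines)


prompt = build_prompt(
    context="bioinformatics",
    topic="GATK tools",
    requested_items=[
        "brief overview",
        "main applications",
        "popular tools",
        "advantages and limitations",
    ],
)
```

A template like this makes it obvious when a prompt is missing its context or its desired information, which are the gaps that make the "Tell me about GATK." example weak.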

Tip 9. Consider the GPT API to Extend Your Applications

In addition to using the graphical interface, OpenAI’s API allows adjusting GPT's behavior to better fit your work. You can use the API to improve interfaces for user-friendly applications, allowing the user to interact with your software using human language and have GPT convert it into executable code. The API can also be part of pipelines in your own workflow. For instance, in a text mining and tokenization pipeline, it can be used to extract entities from the text database or to summarize text based on desired stopwords.

This adjustment (parameter tuning, as distinct from fine-tuning a model on new training data) involves four parameters that modulate the creativity of the system: temperature, top_p, frequency_penalty, and presence_penalty. The temperature and top_p parameters control the degree of boldness and non-determinism exhibited in the output, and high values reduce the repetitiveness of responses in terms of content and meaning. The frequency_penalty and presence_penalty parameters regulate the likelihood of token (word) repetition in the output, and higher values of these parameters minimize repeated tokens. Note that reproducibility is not guaranteed even when fixing parameters, as GPTs are non-deterministic. Nevertheless, tuning these parameters can potentially result in cleaner, less repetitive, and more concise outputs.

The API can also help when input contains text larger than allowed in web prompts (around 4,000 characters). Large documents can be parsed with GPT by employing tools such as LangChain (https://github.com/hwchase17/langchain), which are capable of processing extensive documents from diverse sources for access by the model and facilitating responses in a more organized manner.
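
The basic idea behind handling long inputs can be sketched in plain Python: split the text into overlapping chunks that each fit the size limit. This is not LangChain's API, only a simplified stand-in; the function `split_into_chunks` and the default sizes are assumptions for illustration.

```python
def split_into_chunks(text, max_chars=4000, overlap=200):
    """Split text into overlapping chunks of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be non-negative and smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last chunk already reaches the end of the text
        # Step back by `overlap` so neighboring chunks share some context.
        start += max_chars - overlap
    return chunks


example_chunks = split_into_chunks("abcdefghij", max_chars=4, overlap=1)
```

Each chunk can then be sent in a separate request and the partial answers combined; splitting on characters is crude, and splitting on sentences or sections usually preserves meaning better.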

However, this field is evolving rapidly, and developers are working swiftly to integrate the model with tools that address its limitations. New features should become available soon, in keeping with the accelerated pace of advancement.

Tip 10. Don’t Become Too Dependent on ChatGPT

While ChatGPT is a game changer, it is important to remember that it is still in the early stages of development. Although it may seem like a magic bullet for many researchers, there are still some issues that need to be considered. It's essential not to become too dependent on ChatGPT and to have backup plans in place, remembering how to do things "by hand" when necessary.

One of the key challenges of ChatGPT is that it is being tested to the limit, and the platform has experienced shutdowns and outages recently. This can be especially problematic for researchers who rely heavily on ChatGPT for their work. Moreover, there are currently no commercial alternatives to ChatGPT, and no open-source or non-profit endpoints available. Over-reliance on any single entity may disrupt your scientific workflow and can be particularly difficult for those in the Global South, where price surges can be prohibitive.

If you're a mentor or team leader, it's essential to ensure that your team is not overly dependent on ChatGPT and that they have the support they need to succeed. While ChatGPT is a powerful tool, it should not replace mental health professionals or the social interactions that come from collaborating with coworkers. If ChatGPT is providing help that was previously coming from colleagues, it is important to find alternative ways to foster social interaction, such as coding dojos, pair programming, or social and sports events. Always strive for a balanced approach when using any AI tools, making sure your team continues to develop essential skills and knowledge independently.

Conclusion

ChatGPT and other LLM chatbots are powerful tools that are increasingly becoming essential to scientists and programmers, as well as the various other professionals in between. They offer the potential to improve productivity and simplify complex workflows, especially in cases involving repetitive or minor tasks. It pays to invest time in understanding the tool's applicability and limitations and to avoid over-reliance.

Keep in mind that they are general-purpose tools [23]. To keep track of new, creative uses for these tools in bioinformatics, we have set up a GitHub repository to crowd-curate content arising on the matter: https://github.com/csbl-br/awesome-compbio-chatgpt. We believe that these technologies will help computational biologists to perform their activities more efficiently, ultimately improving the pace of scientific discovery. We hope that these tips will help you use ChatGPT to complement (and not substitute) your workflows while remaining aware of the various applications and implications of this technology.

LLM assistance statement

GPT-4 and ChatGPT were used for writing, coding, and formatting assistance in this project.

Funding statement

T.L. is funded by FAPESP Grant #19/26284-1, and J.C.S. is funded by FAPESP Grant #19/27139-5. V.M.C. is funded by FONDECYT-ANID (1211731), FONDAP-ANID (15120011), STIC/AmSud-ANID (STIC2020008) and Anillo-ANID (ACT210004 and ATE220016).

References

1. Owens B. How Nature readers are using ChatGPT. Nature. 2023;615: 20.

2. Ferruz N, Schmidt S, Höcker B. ProtGPT2 is a deep unsupervised language model for protein design. Nat Commun. 2022;13: 4348.

3. Thorp HH. ChatGPT is fun, but not an author. Science. 2023;379: 313.

4. Touvron H, Lavril T, Izacard G, Martinet X, Lachaux M-A, Lacroix T, et al. LLaMA: Open and Efficient Foundation Language Models. 2023. Available: http://arxiv.org/abs/2302.13971

5. Bard. [cited 23 Mar 2023]. Available: https://bard.google.com/

6. van Dis EAM, Bollen J, Zuidema W, van Rooij R, Bockting CL. ChatGPT: five priorities for research. Nature. 2023;614. doi:10.1038/d41586-023-00288-7

7. ChatGPT Gets Its “Wolfram Superpowers”!—Stephen Wolfram Writings. [cited 24 Mar 2023]. Available: https://writings.stephenwolfram.com/2023/03/chatgpt-gets-its-wolfram-superpowers/

8. Trisovic A, Lau MK, Pasquier T, Crosas M. A large-scale study on research code quality and execution. Sci Data. 2022;9: 60.

9. Filazzola A, Lortie CJ. A call for clean code to effectively communicate science. Methods Ecol Evol. 2022. doi:10.1111/2041-210x.13961

10. Shue E, Liu L, Li B, Feng Z, Li X, Hu G. Empowering Beginners in Bioinformatics with ChatGPT. bioRxiv. 2023. p. 2023.03.07.531414. doi:10.1101/2023.03.07.531414

11. Sobania D, Briesch M, Hanna C, Petke J. An analysis of the automatic bug fixing performance of ChatGPT. 2023. doi:10.48550/ARXIV.2301.08653

12. Hunter-Zinck H, de Siqueira AF, Vásquez VN, Barnes R, Martinez CC. Ten simple rules on writing clean and reliable open-source scientific software. PLoS Comput Biol. 2021;17: e1009481.

13. Williams KL. Using ChatGPT with Excel. In: Journal of Accountancy [Internet]. 30 Jan 2023 [cited 26 Mar 2023]. Available: https://www.journalofaccountancy.com/news/2023/jan/using-chatgpt-with-excel.html

14. McMurry JA, Juty N, Blomberg N, Burdett T, Conlin T, Conte N, et al. Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data. PLoS Biol. 2017;15: e2001414.

15. Amy Tang Y, Pichler K, Füllgrabe A, Lomax J, Malone J, Munoz-Torres MC, et al. Ten quick tips for biocuration. PLoS Comput Biol. 2019;15: e1006906.

16. OpenAI. GPT-4 Technical Report. 2023. doi:10.48550/ARXIV.2303.08774

17. Rougier NP, Droettboom M, Bourne PE. Ten simple rules for better figures. PLoS Comput Biol. 2014;10: e1003833.

18. Hutson M. Could AI help you to write your next paper? Nature. 2022;611: 192–193.

19. Stokel-Walker C. ChatGPT listed as author on research papers: many scientists disapprove. In: Nature Publishing Group UK [Internet]. 18 Jan 2023 [cited 26 Mar 2023]. doi:10.1038/d41586-023-00107-z

20. Stokel-Walker C, Van Noorden R. What ChatGPT and generative AI mean for science. Nature. 2023;614: 214–216.

21. Tools such as ChatGPT threaten transparent science; here are our ground rules for their use. Nature. 2023. p. 612.

22. White J, Fu Q, Hays S, Sandborn M, Olea C, Gilbert H, et al. A prompt pattern catalog to enhance prompt engineering with ChatGPT. 2023. doi:10.48550/ARXIV.2302.11382

23. Beurer-Kellner L, Fischer M, Vechev M. Prompting is programming: A query language for large language models. 2022. doi:10.48550/ARXIV.2212.06094
